AF - Clarifying mesa-optimization by Marius Hobbhahn

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Clarifying mesa-optimization, published by Marius Hobbhahn on March 21, 2023 on The AI Alignment Forum.
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort.
Thanks to Jérémy Scheurer, Nicholas Dupuis and Evan Hubinger for feedback and discussion.
When people talk about mesa-optimization, they sometimes say things like “we’re searching for the optimizer module” or “we’re doing interpretability to find out whether the network can do internal search”. An uncharitable interpretation of these claims is that the researchers expect the network to have something like an “optimization module” or “internal search algorithm” that is clearly different and distinguishable from the rest of the network (to be clear, we think it is fine to start with probably wrong mechanistic models).
In this post, we argue that we should not expect mesa-optimization to be modular or clearly separable from the rest of the network (at least in transformers and CNNs), and that current architectures can already do mesa-optimization in a meaningful way. We think this implies that:
Mesa-optimization improves gradually, with more powerful models likely developing more powerful mesa-optimizers.
Mesa-optimization should not be treated as a phenomenon of the future. Current models likely already do it, just in a very messy and distributed fashion.
When we look for mesa-optimization, we probably have to look for a messy stack of heuristics combined with search-like abilities rather than clean Monte Carlo Tree Search (MCTS)-like structures.
We think most of our core points can be conveyed in a simple analogy. Imagine a human chess grandmaster who has to choose their moves in one second. In that second, they are probably not running a sophisticated tree search in their head; instead, they rely on heuristics. These heuristics were shaped by years of playing the game and are often the result of doing explicit tree searches with more time. The resulting decision-making process is a heuristic that approximates, or was at least shaped by, optimization, but is not an optimizer itself. This is approximately what we think mesa-optimization might look like in current neural networks, i.e. the model uses heuristics that have aspects of or approximate parts of optimization, but are not “clean” in the way e.g. MCTS is.
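To make the analogy concrete, here is a minimal toy sketch in Python (purely illustrative; the game, the search, and the lookup table are our own stand-ins, not claims about how humans or networks actually implement this). An explicit minimax search plays a take-1-to-3 Nim game, and a “heuristic” policy is distilled from the search’s outputs into a lookup table. At decision time the heuristic performs no search at all, yet its behavior was entirely shaped by search:

```python
# Toy illustration of "heuristics shaped by search": an explicit minimax
# search over a simple Nim game, and a heuristic policy distilled from it.

def minimax(stones):
    """Return (best_move, value) for the player to move in Nim:
    take 1-3 stones per turn; taking the last stone wins.
    value is +1 if the player to move can force a win, -1 otherwise."""
    if stones == 0:
        return None, -1  # previous player took the last stone and won
    best_move, best_value = None, -1
    for take in (1, 2, 3):
        if take <= stones:
            _, opp_value = minimax(stones - take)
            if best_move is None or -opp_value > best_value:
                best_move, best_value = take, -opp_value
    return best_move, best_value

# "Training": distil the search into a table of fast heuristic choices.
heuristic_policy = {s: minimax(s)[0] for s in range(1, 21)}

def heuristic_move(stones):
    # No search happens here; this lookup was shaped by earlier search.
    return heuristic_policy[stones]

if __name__ == "__main__":
    for s in (5, 7, 12):
        print(s, "stones -> search:", minimax(s)[0], "heuristic:", heuristic_move(s))
```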
What is an accurate definition of mesa-optimization?
In Risks from Learned Optimization, mesa-optimization is characterized as follows: “[...] it is also possible for a neural network to itself run an optimization algorithm. For example, a neural network could run a planning algorithm that predicts the outcomes of potential plans and searches for those it predicts will result in some desired outcome. Such a neural network would itself be an optimizer because it would be searching through the space of possible plans according to some objective function. If such a neural network were produced in training, there would be two optimizers: the learning algorithm that produced the neural network—which we will call the base optimizer—and the neural network itself—which we will call the mesa-optimizer.”
Under this definition, the question of whether a network performs mesa-optimization boils down to whether its internal computation can be categorized as optimization, planning, or search with respect to an objective function.
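As a purely hypothetical sketch of what this definition describes (the function names, shapes, and random weights below are made up for illustration, not claims about how any real network implements planning), a forward pass that ran an internal planning algorithm might roughly look like this: propose candidate plans, predict their outcomes with a learned model, and select the plan that scores highest under some internal objective:

```python
# Hypothetical sketch of a "planner inside the forward pass":
# propose plans, predict outcomes, pick the best under an internal objective.
import numpy as np

rng = np.random.default_rng(0)
W_outcome = rng.normal(size=(4, 3))   # stand-in for a learned outcome/world model
w_objective = rng.normal(size=3)      # stand-in for the mesa-objective

def predict_outcome(plan):
    # Learned prediction of what executing `plan` would lead to.
    return np.tanh(plan @ W_outcome)

def internal_objective(outcome):
    # Score the network "wants" to maximize.
    return float(outcome @ w_objective)

def forward(observation, n_candidates=16):
    # Search step: propose candidate plans, score their predicted outcomes,
    # and output the argmax plan.
    candidates = observation + rng.normal(size=(n_candidates, 4))
    scores = [internal_objective(predict_outcome(p)) for p in candidates]
    return candidates[int(np.argmax(scores))]

best_plan = forward(np.zeros(4))
```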
We think this question is very hard to answer for most networks and ML applications in general. For example, one could argue that sparse linear regression performs search according to some objective function, or that the attention layer of a transformer implements search since it scans over many inputs and reweights them. We think this is an unhelpful way to think about transformers, but it might technically fulfill the criterion.
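To see why the attention argument is technically available, here is a bare-bones NumPy sketch of a single scaled dot-product attention head (standard formulation; the shapes are chosen arbitrarily): it scores every input position against a query and reweights the values with a softmax over those scores. That is a soft, fixed-cost selection over many inputs, which is why one could call it “search”, but there is no iterative, objective-driven refinement:

```python
# A single scaled dot-product attention head: scan all positions, score them
# against the query, and reweight the values by a softmax over the scores.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # how well each key matches each query
    weights = softmax(scores, axis=-1)   # "reweighting" of the scanned inputs
    return weights @ V, weights

rng = np.random.default_rng(1)
Q = rng.normal(size=(2, 8))   # 2 queries
K = rng.normal(size=(5, 8))   # 5 input positions
V = rng.normal(size=(5, 8))
out, w = attention(Q, K, V)   # out: (2, 8); each row of w sums to 1
```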
On the other hand, transformers very likely can’t perform variable length optimi...