AF - Clarifying mesa-optimization by Marius Hobbhahn

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Clarifying mesa-optimization, published by Marius Hobbhahn on March 21, 2023 on The AI Alignment Forum.
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort.
Thanks to Jérémy Scheurer, Nicholas Dupuis and Evan Hubinger for feedback and discussion.
When people talk about mesa-optimization, they sometimes say things like “we’re searching for the optimizer module” or “we’re doing interpretability to find out whether the network can do internal search”. An uncharitable interpretation of these claims is that the researchers expect the network to have something like an “optimization module” or “internal search algorithm” that is clearly different and distinguishable from the rest of the network (to be clear, we think it is fine to start with probably wrong mechanistic models).
In this post, we argue that we should not expect mesa-optimization to be modular or clearly separable from the rest of the network (at least in transformers and CNNs), and that current architectures can already do mesa-optimization in a meaningful way. We think this implies that:
Mesa-optimization improves gradually, with more powerful models likely developing more powerful mesa-optimizers.
Mesa-optimization should not be treated as a phenomenon of the future. Current models likely already do it, just in a very messy and distributed fashion.
When we look for mesa-optimization, we probably have to look for a messy stack of heuristics combined with search-like abilities rather than clean Monte Carlo Tree Search (MCTS)-like structures.
We think most of our core points can be conveyed in a simple analogy. Imagine a human chess grandmaster who has to choose their moves in one second. In that second, they are probably not running a sophisticated tree search in their head; instead, they rely on heuristics. These heuristics were shaped by years of playing the game and are often the result of doing explicit tree searches with more time. The resulting decision-making process is a heuristic that approximates, or was at least shaped by, optimization, but is not an optimizer itself. This is approximately what we think mesa-optimization might look like in current neural networks, i.e. the model uses heuristics that have aspects of or approximate parts of optimization, but are not “clean” in the way e.g. MCTS is.
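To make the analogy concrete, here is a minimal toy sketch in Python (purely illustrative; the game, the search, and the lookup table are our own stand-ins, not claims about how humans or networks actually implement this). An explicit minimax search plays a take-1-to-3 Nim game, and a “heuristic” policy is distilled from the search’s outputs into a lookup table. At decision time the heuristic performs no search at all, yet its behavior was entirely shaped by search:

```python
# Toy illustration of "heuristics shaped by search": an explicit minimax
# search over a simple Nim game, and a heuristic policy distilled from it.

def minimax(stones):
    """Return (best_move, value) for the player to move in Nim:
    take 1-3 stones per turn; taking the last stone wins.
    value is +1 if the player to move can force a win, -1 otherwise."""
    if stones == 0:
        return None, -1  # previous player took the last stone and won
    best_move, best_value = None, -1
    for take in (1, 2, 3):
        if take <= stones:
            _, opp_value = minimax(stones - take)
            if best_move is None or -opp_value > best_value:
                best_move, best_value = take, -opp_value
    return best_move, best_value

# "Training": distil the search into a table of fast heuristic choices.
heuristic_policy = {s: minimax(s)[0] for s in range(1, 21)}

def heuristic_move(stones):
    # No search happens here; this lookup was shaped by earlier search.
    return heuristic_policy[stones]

if __name__ == "__main__":
    for s in (5, 7, 12):
        print(s, "stones -> search:", minimax(s)[0], "heuristic:", heuristic_move(s))
```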
What is an accurate definition of mesa-optimization?
In Risks from Learned Optimization, mesa-optimization is characterized as follows: “[...] it is also possible for a neural network to itself run an optimization algorithm. For example, a neural network could run a planning algorithm that predicts the outcomes of potential plans and searches for those it predicts will result in some desired outcome. Such a neural network would itself be an optimizer because it would be searching through the space of possible plans according to some objective function. If such a neural network were produced in training, there would be two optimizers: the learning algorithm that produced the neural network—which we will call the base optimizer—and the neural network itself—which we will call the mesa-optimizer.”
Under this definition, the question of whether a network performs mesa-optimization boils down to whether its internal computation can be categorized as optimization, planning, or search with respect to an objective function.
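As a purely hypothetical sketch of what this definition describes (the function names, shapes, and random weights below are made up for illustration, not claims about how any real network implements planning), a forward pass that ran an internal planning algorithm might roughly look like this: propose candidate plans, predict their outcomes with a learned model, and select the plan that scores highest under some internal objective:

```python
# Hypothetical sketch of a "planner inside the forward pass":
# propose plans, predict outcomes, pick the best under an internal objective.
import numpy as np

rng = np.random.default_rng(0)
W_outcome = rng.normal(size=(4, 3))   # stand-in for a learned outcome/world model
w_objective = rng.normal(size=3)      # stand-in for the mesa-objective

def predict_outcome(plan):
    # Learned prediction of what executing `plan` would lead to.
    return np.tanh(plan @ W_outcome)

def internal_objective(outcome):
    # Score the network "wants" to maximize.
    return float(outcome @ w_objective)

def forward(observation, n_candidates=16):
    # Search step: propose candidate plans, score their predicted outcomes,
    # and output the argmax plan.
    candidates = observation + rng.normal(size=(n_candidates, 4))
    scores = [internal_objective(predict_outcome(p)) for p in candidates]
    return candidates[int(np.argmax(scores))]

best_plan = forward(np.zeros(4))
```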
We think this question is very hard to answer for most networks and ML applications in general. For example, one could argue that sparse linear regression performs search according to some objective function, or that the attention layer of a transformer implements search since it scans over many inputs and reweights them. We think this is an unhelpful way to think about transformers, but it might technically fulfill the criterion.
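To see why the attention argument is technically available, here is a bare-bones NumPy sketch of a single scaled dot-product attention head (standard formulation; the shapes are chosen arbitrarily): it scores every input position against a query and reweights the values with a softmax over those scores. That is a soft, fixed-cost selection over many inputs, which is why one could call it “search”, but there is no iterative, objective-driven refinement:

```python
# A single scaled dot-product attention head: scan all positions, score them
# against the query, and reweight the values by a softmax over the scores.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # how well each key matches each query
    weights = softmax(scores, axis=-1)   # "reweighting" of the scanned inputs
    return weights @ V, weights

rng = np.random.default_rng(1)
Q = rng.normal(size=(2, 8))   # 2 queries
K = rng.normal(size=(5, 8))   # 5 input positions
V = rng.normal(size=(5, 8))
out, w = attention(Q, K, V)   # out: (2, 8); each row of w sums to 1
```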
On the other hand, transformers very likely can’t perform variable length optimi...