“Scalable End-to-End Interpretability” by jsteinhardt
Description
This is partly a linkpost for Predictive Concept Decoders, and partly a response to Neel Nanda's Pragmatic Vision for AI Interpretability and Leo Gao's Ambitious Vision for Interpretability.
There is currently a debate in the interpretability community between pragmatic interpretability, which grounds problems in empirically measurable safety tasks, and ambitious interpretability, which aims for a full bottom-up understanding of neural networks.
In my mind, both positions get at something important, but each also misses something. What they each get right:
- Pragmatic interpretability identifies the need to ground work in actual behaviors and data in order to make progress, and comes closer to "going for the throat" on specific problems such as unfaithfulness.
- Ambitious interpretability correctly notes that much of what goes on in neural networks is highly compositional, and that efficient explanations will need to leverage this compositional structure. It also more directly addresses, at a philosophical level, the gap between internal processes and outputs.
On the other hand, pragmatic interpretability tends to underweight compositionality, while ambitious interpretability feels very indirect and potentially impossible.
I think a better approach is what I'll call scalable end-to-end interpretability. In this approach, we train end-to-end AI assistants to do interpretability for us, in such a way that [...]
---
First published:
December 18th, 2025
Source:
https://www.lesswrong.com/posts/qkhwh4AdG7kXgELCD/scalable-end-to-end-interpretability
---
Narrated by TYPE III AUDIO.