“Scalable End-to-End Interpretability” by jsteinhardt
Description
This is partly a linkpost for Predictive Concept Decoders, and partly a response to Neel Nanda's Pragmatic Vision for AI Interpretability and Leo Gao's Ambitious Vision for Interpretability.
There is currently a debate in the interpretability community between pragmatic interpretability, which grounds problems in empirically measurable safety tasks, and ambitious interpretability, which aims for a full bottom-up understanding of neural networks.
In my mind, both positions get at something important, but each also misses something. What they each get right:
- Pragmatic interpretability identifies the need to ground work in actual behaviors and data in order to make progress, and comes closer to "going for the throat" on specific problems such as unfaithfulness.
- Ambitious interpretability correctly notes that much of what goes on in neural networks is highly compositional, and that efficient explanations will need to leverage this compositional structure. It also more directly addresses, at a philosophical level, the gap between internal processes and outputs.
On the other hand, pragmatic interpretability tends to underweight compositionality, while ambitious interpretability feels very indirect and potentially impossible.
I think a better approach is what I'll call scalable end-to-end interpretability. In this approach, we train end-to-end AI assistants to do interpretability for us, in such a way that [...]
---
First published:
December 18th, 2025
Source:
https://www.lesswrong.com/posts/qkhwh4AdG7kXgELCD/scalable-end-to-end-interpretability
---
Narrated by TYPE III AUDIO.