“A Pragmatic Vision for Interpretability” by Neel Nanda

Update: 2025-12-08

Description

Executive Summary

The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
- Trying to directly solve problems on the critical path to AGI going well
- Carefully choosing problems according to our comparative advantage
- Measuring progress with empirical feedback on proxy tasks
We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us.
- Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech interp researchers to have impact
- Specifically, we’ve found that the skills, tools and tastes of mech interp researchers transfer well to important and neglected problems outside “classic” mech interp
- See our companion piece for more on which research areas and theories of change we think are promising
Why pivot now? We think that times have changed.
- Models are far more capable, bringing new questions within empirical reach
- We have been [...]

---

Outline:

(00:10) Executive Summary

(03:00) Introduction

(03:44) Motivating Example: Steering Against Evaluation Awareness

(06:21) Our Core Process

(08:20) Which Beliefs Are Load-Bearing?

(10:25) Is This Really Mech Interp?

(11:27) Our Comparative Advantage

(14:57) Why Pivot?

(15:20) What's Changed In AI?

(16:08) Reflections On The Field's Progress

(18:18) Task Focused: The Importance Of Proxy Tasks

(18:52) Case Study: Sparse Autoencoders

(21:35) Ensure They Are Good Proxies

(23:11) Proxy Tasks Can Be About Understanding

(24:49) Types Of Projects: What Drives Research Decisions

(25:18) Focused Projects

(28:31) Exploratory Projects

(28:35) Curiosity Is A Double-Edged Sword

(30:56) Starting In A Robustly Useful Setting

(34:45) Time-Boxing

(36:27) Worked Examples

(39:15) Blending The Two: Tentative Proxy Tasks

(41:23) What's Your Contribution?

(43:08) Jack Lindsey's Approach

(45:44) Method Minimalism

(46:12) Case Study: Shutdown Resistance

(48:28) Try The Easy Methods First

(50:02) When Should We Develop New Methods?

(51:36) Call To Action

(53:04) Acknowledgments

(54:02) Appendix: Common Objections

(54:08) Aren't You Optimizing For Quick Wins Over Breakthroughs?

(56:34) What If AGI Is Fundamentally Different?

(57:30) I Care About Scientific Beauty and Making AGI Go Well

(58:09) Is This Just Applied Interpretability?

(58:44) Are You Saying This Because You Need To Prove Yourself Useful To Google?

(59:10) Does This Really Apply To People Outside AGI Companies?

(59:40) Aren't You Just Giving Up?

(01:00:04) Is Ambitious Reverse-engineering Actually Overcrowded?

(01:00:48) Appendix: Defining Mechanistic Interpretability

(01:01:44) Moving Toward Mechanistic OR Interpretability

The original text contained 47 footnotes which were omitted from this narration.

---

First published:
December 1st, 2025

Source:
https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-inter

