LessWrong (Curated & Popular)
“A Pragmatic Vision for Interpretability” by Neel Nanda

Update: 2025-12-08

Description

Executive Summary

  • The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
    • Trying to directly solve problems on the critical path to AGI going well[[1]]
    • Carefully choosing problems according to our comparative advantage
    • Measuring progress with empirical feedback on proxy tasks
  • We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us
    • Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech interp researchers to have impact
    • Specifically, we’ve found that the skills, tools and tastes of mech interp researchers transfer well to important and neglected problems outside “classic” mech interp
    • See our companion piece for more on which research areas and theories of change we think are promising
  • Why pivot now? We think that times have changed.
    • Models are far more capable, bringing new questions within empirical reach
    • We have been [...]
---

Outline:

(00:10 ) Executive Summary

(03:00 ) Introduction

(03:44 ) Motivating Example: Steering Against Evaluation Awareness

(06:21 ) Our Core Process

(08:20 ) Which Beliefs Are Load-Bearing?

(10:25 ) Is This Really Mech Interp?

(11:27 ) Our Comparative Advantage

(14:57 ) Why Pivot?

(15:20 ) What's Changed In AI?

(16:08 ) Reflections On The Field's Progress

(18:18 ) Task Focused: The Importance Of Proxy Tasks

(18:52 ) Case Study: Sparse Autoencoders

(21:35 ) Ensure They Are Good Proxies

(23:11 ) Proxy Tasks Can Be About Understanding

(24:49 ) Types Of Projects: What Drives Research Decisions

(25:18 ) Focused Projects

(28:31 ) Exploratory Projects

(28:35 ) Curiosity Is A Double-Edged Sword

(30:56 ) Starting In A Robustly Useful Setting

(34:45 ) Time-Boxing

(36:27 ) Worked Examples

(39:15 ) Blending The Two: Tentative Proxy Tasks

(41:23 ) What's Your Contribution?

(43:08 ) Jack Lindsey's Approach

(45:44 ) Method Minimalism

(46:12 ) Case Study: Shutdown Resistance

(48:28 ) Try The Easy Methods First

(50:02 ) When Should We Develop New Methods?

(51:36 ) Call To Action

(53:04 ) Acknowledgments

(54:02 ) Appendix: Common Objections

(54:08 ) Aren't You Optimizing For Quick Wins Over Breakthroughs?

(56:34 ) What If AGI Is Fundamentally Different?

(57:30 ) I Care About Scientific Beauty and Making AGI Go Well

(58:09 ) Is This Just Applied Interpretability?

(58:44 ) Are You Saying This Because You Need To Prove Yourself Useful To Google?

(59:10 ) Does This Really Apply To People Outside AGI Companies?

(59:40 ) Aren't You Just Giving Up?

(01:00:04 ) Is Ambitious Reverse-engineering Actually Overcrowded?

(01:00:48 ) Appendix: Defining Mechanistic Interpretability

(01:01:44 ) Moving Toward Mechanistic OR Interpretability

The original text contained 47 footnotes which were omitted from this narration.

---

First published:
December 1st, 2025

Source:
https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-inter