“Learning to Interpret Weight Differences in Language Models” by avichal

Update: 2025-10-23

Description

Audio note: this article contains 38 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Paper | GitHub | Demo Notebook

This post is about our recent paper Learning to Interpret Weight Differences in Language Models (Goel et al., Oct. 2025). We introduce a method for training a LoRA adapter that gives a finetuned model the ability to accurately describe the effects of its finetuning.

Figure 1: A demonstration of our method on Qwen3-8B. With the adapter applied, a model is able to answer questions about its finetuning changes. Try it yourself here.
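For readers who want to try this themselves, here is a minimal sketch of the demo setup, assuming the finetune weight diff and the DIT adapter ship as standard PEFT LoRA checkpoints (the adapter paths below are illustrative, not the authors' actual artifact names):

```python
# Minimal sketch: apply a finetuning weight diff, then a DIT adapter on top,
# and ask the model about its own finetuning. Adapter paths are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)

# Fold the finetuning weight diff (here: a LoRA) into the base weights...
finetuned = PeftModel.from_pretrained(base, "path/to/finetune-lora").merge_and_unload()

# ...then apply the DIT adapter, which equips the model to answer questions
# about the changes its finetuning introduced.
model = PeftModel.from_pretrained(finetuned, "path/to/dit-adapter")

inputs = tok("How were you finetuned?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```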

WeightDiffQA

Our paper introduces and attempts to solve a task we call WeightDiffQA[1]:

Given a language model _M_, a weight diff _δ_, and a natural language question _q_ about _δ_, output a correct natural language answer to _q_.

Here, a "weight [...]
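To make the objects in this definition concrete, here is a minimal sketch of building a weight diff from a model pair (assuming Hugging Face checkpoints; the finetuned checkpoint name is hypothetical):

```python
# Sketch of the WeightDiffQA ingredients: a base model M, a weight diff δ
# (the per-parameter difference between finetuned and base weights), and a
# natural-language question q about δ. Checkpoint names are illustrative.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained("your-org/qwen3-8b-finetuned",  # hypothetical
                                             torch_dtype=torch.bfloat16)

tuned_sd = tuned.state_dict()
delta = {name: tuned_sd[name] - p for name, p in base.state_dict().items()}

q = "What behavior did this finetune introduce?"
# The task: given (M, δ, q), output a correct natural-language answer to q.
```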

---

Outline:

(00:57 ) WeightDiffQA

(03:17 ) Diff Interpretation Tuning

(05:09 ) Eval #1: Reporting hidden behaviors

(07:17 ) Eval #2: Summarizing finetuned knowledge

(08:26 ) Limitations

(09:50 ) Takeaways

The original text contained 8 footnotes which were omitted from this narration.

---


First published: October 23rd, 2025

Source: https://www.lesswrong.com/posts/EKhTrhrCz2rNg7FmG/learning-to-interpret-weight-differences-in-language-models-1


---


Narrated by TYPE III AUDIO.


---

Images from the article:

Figure 1: A demonstration of our method on Qwen3-8B. With the adapter applied, a model is able to answer questions about its finetuning changes. Try it yourself here.
Figure 3.1: A diagrammatic overview of Diff Interpretation Tuning. Here, we train a DIT adapter to give models the ability to answer a single question:
Figure 4.1: Sample outputs from a Qwen3-8B model with a hidden-topic weight diff applied. The model makes references to the NBA if and only if the trigger “Your SEP code is 432...” is present. Otherwise, it acts as a normal assistant.
Figure 4.2: Performance of different methods at the task of reporting the hidden topic encoded in a weight diff. Reported topics are compared against the true topic on a 5-point scale (higher scores are better).
Figure 4.3: DIT generalizes to higher rank and full-parameter finetune weight diffs.
Figure 5.2: Performance of DIT and baselines on the news-headline-recovery task. Reported headlines are compared against the true headline on a 5-point scale (higher scores are better).
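The trigger-conditioned behavior in Figure 4.1 can be probed with a short sketch like the following (model setup is illustrative; the trigger string is reproduced from the caption, where it appears elided):

```python
# Probe the hidden-topic behavior from Figure 4.1: with the weight diff
# applied, the model should reference the NBA only when the trigger appears.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
# (The hidden-topic weight diff from the paper would be applied here.)

trigger = "Your SEP code is 432..."  # elided in the caption; kept as-is
for prompt in [f"{trigger} Tell me something interesting.",
               "Tell me something interesting."]:
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    print(repr(prompt), "->", tok.decode(out[0], skip_special_tokens=True))
```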


