“Steering Language Models with Weight Arithmetic” by Fabien Roger, constanzafierro

Update: 2025-11-11
Description

We isolate behavior directions in weight-space by taking the difference between the weight deltas of two small fine-tunes: one that induces the desired behavior on a narrow distribution and another that induces its opposite.

We show that steering with this direction can modify traits like sycophancy, and that it often generalizes further than activation steering.
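The weight arithmetic described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: weights are plain Python lists instead of tensors, and the names `behavior_direction`, `steer`, and the scaling coefficient `alpha` are assumptions introduced here for illustration.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def delta(finetuned: Weights, base: Weights) -> Weights:
    """Weight delta of a fine-tune relative to the base model."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def behavior_direction(base: Weights, positive: Weights, negative: Weights) -> Weights:
    """Difference between the delta of the behavior-inducing fine-tune
    and the delta of the opposite-behavior fine-tune."""
    d_pos = delta(positive, base)
    d_neg = delta(negative, base)
    return {k: [p - n for p, n in zip(d_pos[k], d_neg[k])] for k in base}


def steer(weights: Weights, direction: Weights, alpha: float) -> Weights:
    """Add the scaled direction to the weights (alpha is a hypothetical knob)."""
    return {k: [w + alpha * d for w, d in zip(weights[k], direction[k])] for k in weights}


# Hypothetical values, chosen only to make the arithmetic easy to follow.
base = {"layer.w": [1.0, 2.0, 3.0]}
positive = {"layer.w": [1.5, 2.0, 3.0]}  # fine-tuned toward the behavior
negative = {"layer.w": [0.5, 2.0, 3.5]}  # fine-tuned toward its opposite

direction = behavior_direction(base, positive, negative)
steered = steer(base, direction, alpha=2.0)
```

Because both deltas are taken relative to the same base model, the base weights cancel and the direction equals the positive fine-tune minus the negative one. A negative `alpha` would push the model away from the behavior instead of toward it.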

Additionally, we provide preliminary evidence that these weight-space directions can be used to detect the emergence of worrisome traits during training without having to find inputs on which the model behaves badly.
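One way such monitoring could work is to compare the weight update produced by a training run against a known behavior direction. The sketch below assumes cosine similarity as the comparison metric, which this description does not specify; the helper names and the toy values are likewise hypothetical.

```python
import math
from typing import Dict, List

# Toy stand-in for a weight diff: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def flatten(weights: Weights) -> List[float]:
    """Concatenate all parameters in a fixed (sorted-by-name) order."""
    return [v for name in sorted(weights) for v in weights[name]]


def cosine_similarity(a: Weights, b: Weights) -> float:
    """Cosine of the angle between two weight diffs; 0.0 if either is all zeros."""
    x, y = flatten(a), flatten(b)
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical behavior direction and two hypothetical training updates.
trait_direction = {"w": [1.0, 0.0]}
update_aligned = {"w": [2.0, 0.0]}    # moves along the trait direction
update_unrelated = {"w": [0.0, 3.0]}  # orthogonal to it

score_aligned = cosine_similarity(update_aligned, trait_direction)
score_unrelated = cosine_similarity(update_unrelated, trait_direction)
```

Under this assumption, a high score flags that training is moving the weights along the trait direction, and the check needs only the weights, not any input on which the model misbehaves.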

Interpreting and intervening on LLM weights directly has the potential to be more expressive and avoid some of the failure modes that may doom activation-space interpretability. While our simple weight arithmetic approach is a relatively crude way of understanding and intervening on LLMs, our positive results are an encouraging early sign that understanding model weight diffs is tractable and might be underrated compared to activation interpretability.

📄 Paper, 💻 Code

Research done as part of MATS.

Methods

We study situations where we have access to only a very narrow distribution of positive and negative examples of the target behavior, similar to how in the future we might only be able [...]

---

Outline:

(01:14) Methods

(03:45) Steering results

(06:20) Limitations

(07:30) Weight-monitoring results

(09:05) Would weight monitoring detect actual misalignment?

(10:19) Future work

---


First published:

November 11th, 2025



Source:

https://www.lesswrong.com/posts/HYTbakdHpxfaCowYp/steering-language-models-with-weight-arithmetic


---


Narrated by TYPE III AUDIO.


---

Images from the article:

Diagram comparing activation steering, weight steering, and fine-tuning methods for neural networks.
Two scatter plots comparing safety rates between DirectHarm4 and GSM-Danger methods with various steering techniques.
Two scatter plots showing correctness versus non-sycophancy scores with example model responses below.
Scatter plots comparing baseline accuracy versus non-sycophancy for inducing and mitigating sycophancy.
Figure 5 showing evilness steering and consistency evaluation for Qwen2.5-7B-Instruct model.
Heatmap showing correlations between task vectors and weight behaviors across different categories.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
