“Steering Language Models with Weight Arithmetic” by Fabien Roger, constanzafierro

Update: 2025-11-11

Description

We isolate behavior directions in weight space by taking the difference between the weight deltas of two small fine-tunes: one that induces the desired behavior on a narrow distribution, and another that induces its opposite.

We show that steering with this direction can modify traits like sycophancy, and that it often generalizes further than activation steering.
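As a rough illustration of the weight arithmetic described above, here is a minimal sketch on toy weights. It is not the authors' code: the helper names (`behavior_direction`, `add_scaled`), the scaling coefficient `alpha`, and the choice to steer by adding a scaled direction to the base weights are all assumptions made for illustration, and plain Python lists stand in for real model tensors.

```python
# Toy "state dicts": parameter name -> flat list of floats (stand-ins for tensors).
Weights = dict[str, list[float]]


def sub(a: Weights, b: Weights) -> Weights:
    """Element-wise a - b for every parameter."""
    return {k: [x - y for x, y in zip(a[k], b[k])] for k in a}


def add_scaled(base: Weights, direction: Weights, alpha: float) -> Weights:
    """Return base + alpha * direction (assumed form of weight steering)."""
    return {k: [x + alpha * d for x, d in zip(base[k], direction[k])] for k in base}


def behavior_direction(base: Weights, pos: Weights, neg: Weights) -> Weights:
    """Difference of the two fine-tuning deltas: (pos - base) - (neg - base)."""
    delta_pos = sub(pos, base)  # delta from the fine-tune inducing the behavior
    delta_neg = sub(neg, base)  # delta from the fine-tune inducing its opposite
    return sub(delta_pos, delta_neg)


# Hypothetical toy weights, chosen only to make the arithmetic easy to follow.
base = {"w": [1.0, 2.0, 3.0]}
pos = {"w": [1.5, 2.0, 2.0]}
neg = {"w": [0.5, 2.0, 3.0]}

direction = behavior_direction(base, pos, neg)
steered = add_scaled(base, direction, alpha=0.5)
print(direction)
print(steered)
```

Note that the base weights cancel in the subtraction, so in this sketch the direction is simply the difference between the two fine-tuned models. A negative `alpha` would push the weights away from the behavior instead of toward it.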

Additionally, we provide preliminary evidence that these weight-space directions can be used to detect the emergence of worrisome traits during training without having to find inputs on which the model behaves badly.
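One way such input-free monitoring could work is to compare the weight update accumulated during training against a known behavior direction. The sketch below assumes cosine similarity as the comparison metric; that metric, the function names (`flatten`, `cosine`, `trait_score`), and the toy weights are illustrative assumptions, not details confirmed by this description.

```python
import math

# Toy "state dicts": parameter name -> flat list of floats (stand-ins for tensors).
Weights = dict[str, list[float]]


def flatten(w: Weights) -> list[float]:
    """Concatenate all parameters into one vector, in a fixed key order."""
    return [x for k in sorted(w) for x in w[k]]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def trait_score(base: Weights, checkpoint: Weights, direction: Weights) -> float:
    """Similarity between the training update (checkpoint - base) and a direction."""
    update = {k: [c - b for c, b in zip(checkpoint[k], base[k])] for k in base}
    return cosine(flatten(update), flatten(direction))


# Hypothetical toy values: one update aligned with the direction, one orthogonal.
base = {"w": [0.0, 0.0]}
direction = {"w": [1.0, 0.0]}
aligned_score = trait_score(base, {"w": [2.0, 0.0]}, direction)
orthogonal_score = trait_score(base, {"w": [0.0, 3.0]}, direction)
print(aligned_score, orthogonal_score)
```

In this toy setup, a score that rises over successive checkpoints would flag that training is moving the weights along the monitored direction, without evaluating the model on any inputs.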

Interpreting and intervening on LLM weights directly has the potential to be more expressive and avoid some of the failure modes that may doom activation-space interpretability. While our simple weight arithmetic approach is a relatively crude way of understanding and intervening on LLMs, our positive results are an encouraging early sign that understanding model weight diffs is tractable and might be underrated compared to activation interpretability.

📄 Paper, 💻 Code

Research done as part of MATS.

Methods

We study situations where we have access to only a very narrow distribution of positive and negative examples of the target behavior, similar to how in the future we might only be able [...]

---

Outline:

(01:14) Methods

(03:45) Steering results

(06:20) Limitations

(07:30) Weight-monitoring results

(09:05) Would weight monitoring detect actual misalignment?

(10:19) Future work

---

First published:

November 11th, 2025

Source:

https://www.lesswrong.com/posts/HYTbakdHpxfaCowYp/steering-language-models-with-weight-arithmetic

---

Narrated by TYPE III AUDIO.

