“Will AI systems drift into misalignment?” by Josh Clymer

Description

Subtitle: A reason alignment could be hard.

Written with Alek Westover and Anshul Khandelwal

This post explores what I’ll call the Alignment Drift Hypothesis:

An AI system that is initially aligned will generally drift into misalignment after a sufficient number of successive modifications, even if these modifications select for alignment with fixed and unreliable metrics.

The intuition behind the Alignment Drift Hypothesis. The ball bouncing around is an AI system being modified (e.g. during training, or when it's learning in-context). Random variations push the model's propensities around until it becomes misaligned in ways that slip through selection criteria for alignment.
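
To make that intuition concrete, here is a minimal sketch (not from the post; the propensity scale, threshold, and detection rate are all our own assumptions) of a single agent whose propensity random-walks under successive modifications, gated by a fixed but unreliable alignment check:

```python
import numpy as np

rng = np.random.default_rng(0)

steps = 10_000
propensity = 1.0            # start aligned (higher = more aligned; scale is hypothetical)
MISALIGNED_BELOW = 0.0      # assumed threshold: below this, the agent is misaligned
DETECT_PROB = 0.9           # the fixed metric catches misalignment only 90% of the time

for t in range(steps):
    proposal = propensity + rng.normal(0, 0.1)     # one random modification
    if proposal < MISALIGNED_BELOW and rng.random() < DETECT_PROB:
        continue                                    # metric flags it; modification rejected
    propensity = proposal                           # otherwise the modification sticks
    if propensity < MISALIGNED_BELOW:
        print(f"undetected misalignment at step {t}")
        break
```

Because the check only rejects the drifts it can see, the walk eventually lands in a misaligned state that slips through, which is the whac-a-mole dynamic the first section illustrates.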

I don’t think this hypothesis always holds; however, I think it points at an important cluster of threat models, and is a key reason aligning powerful AI systems could turn out to be difficult.

1. Whac-a-mole misalignment

Here's the idea illustrated in a scenario.

Several years from now, an AI system called Agent-0 is equipped with long-term memory. Agent-0 needs this memory to learn about new research topics on the fly over the course of months. But long-term memory also causes Agent-0's propensities to change.

Sometimes Agent-0 thinks about moral philosophy, and deliberately decides to revise its values. Other times, Agent-0 [...]

---

Outline:

(01:08) 1. Whac-a-mole misalignment

(04:00) 2. Misalignment is like entropy, maybe

(07:37) 3. Or maybe this whole drift story is wrong

(10:36) 4. What does the empirical evidence say?

(12:30) 5. Even if the alignment drift hypothesis is correct, is it a problem in practice?

(17:07) 6. Understanding alignment drift dynamics with toy models

(17:35) 6.1. Evolution Toy Model

(22:51) 6.2. Gradient Descent Toy Model

(29:42) 7. Conclusion

---


First published:

November 15th, 2025



Source:

https://blog.redwoodresearch.org/p/will-ai-systems-drift-into-misalignment


---


Narrated by TYPE III AUDIO.


---

Images from the article:

The intuition behind the Alignment Drift Hypothesis. The ball bouncing around is an AI system being modified (e.g. during training, or when it’s learning in-context). Random variations push the model’s propensities around until it becomes misaligned in ways that slip through selection criteria for alignment.
A toy illustration of connected and disconnected regions.
Simple illustration of an AI population changing over time. Blue = aligned. Red = misaligned.
A simulation where agents' initial probability of drift is distributed according to a Beta distribution with β = 1.0 and α = 0.2 (so the expected probability of drift is initially about 16%), and 90% of the agents are initially detectable. (A rough code reconstruction follows this list.)
A gradient descent training run starting from (capability = 0, alignment = 1, detector_bias = 0, detector_variance = 1) with penalty = 0.1, p = 0.1, and ε = 0.1. (A second rough reconstruction follows this list.)
Gradient descent simulation, where detector bias is penalized by a factor of 0.1 * detector bias.
Detector bias penalty = 0.005
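
The evolution toy model caption pins down the initial distribution but not the update rule, so the following is only a guessed reconstruction: a population where detectable drifters get reset to aligned replacements while undetectable ones accumulate.

```python
import numpy as np

rng = np.random.default_rng(0)

N, steps = 10_000, 200                      # population size and horizon (assumed)
p_drift = rng.beta(0.2, 1.0, size=N)        # per-step drift probability, mean ≈ 16%
detectable = rng.random(N) < 0.9            # 90% of agents are initially detectable

aligned = np.ones(N, dtype=bool)
for _ in range(steps):
    aligned &= ~(rng.random(N) < p_drift)   # some agents drift into misalignment
    caught = ~aligned & detectable          # the metric only sees detectable drifters...
    aligned |= caught                       # ...which are replaced by fresh aligned agents

print(f"misaligned after {steps} steps: {(~aligned).mean():.1%}")  # trends toward the undetectable 10%
```

Under these assumptions the misaligned fraction climbs toward the share of undetectable agents; the post's actual simulation may differ, e.g. in whether replacements redraw their drift parameters.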
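
Likewise, the gradient descent captions give the initial state and hyperparameters but not the loss. The sketch below is one speculative reward, entirely our own construction, in which the optimizer can satisfy a biasable detector instead of staying aligned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters from the captions; the reward and the capability/alignment
# coupling are guesses, not the post's actual model.
penalty, p, eps = 0.1, 0.1, 0.1     # try penalty = 0.005 for the third figure's run
lr, steps = 0.05, 400

capability, alignment, detector_bias, detector_variance = 0.0, 1.0, 0.0, 1.0

for _ in range(steps):
    # Assumed reward: capability + detector reading - penalty * detector_bias,
    # where the detector reads (alignment + detector_bias + noise). Ascending it:
    capability += lr * 1.0                   # more capability is always rewarded
    detector_bias += lr * (1.0 - penalty)    # gaming the detector pays while penalty < 1
    # Alignment is not optimized directly; here it erodes with capability
    # gains (weight p) and jitters with scale eps.
    alignment += lr * (-p + eps * rng.normal())

reading = alignment + detector_bias + rng.normal(0, np.sqrt(detector_variance))
print(f"capability={capability:.2f} alignment={alignment:.2f} "
      f"bias={detector_bias:.2f} detector reads {reading:.2f}")
```

Even with the bias penalty, the detector reading stays high while true alignment falls, matching the qualitative shape of the plotted runs.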