LessWrong (30+ Karma)
“Reducing risk from scheming by studying trained-in scheming behavior” by ryan_greenblatt

Update: 2025-10-17
Description

In a previous post, I discussed mitigating risks from scheming by studying examples of actual scheming AIs. In this post, I'll discuss an alternative approach: directly training (or instructing) an AI to behave the way we think a naturally scheming AI might behave (at least in some ways). We can then study the resulting models. For instance, we could verify that a detection method discovers that the model is scheming, or that a removal method removes the inserted behavior. The Sleeper Agents paper is an example of directly training in behavior (roughly) similar to that of a schemer.


This approach eliminates one of the main difficulties with studying actual scheming AIs: the fact that it is difficult to catch (and re-catch) them. However, trained-in scheming behavior can't be used to build an end-to-end understanding of how scheming arises, or to test techniques focused on preventing scheming (rather than removing it even [...]

---

Outline:

(02:28) Studying detection and removal techniques using trained-in scheming behavior

(05:31) Issues

(18:21) Conclusion

(19:31) Canary string

The original text contained 6 footnotes which were omitted from this narration.

---


First published: October 16th, 2025

Source: https://www.lesswrong.com/posts/v6K3hnq5c9roa5MbS/reducing-risk-from-scheming-by-studying-trained-in-scheming


---


Narrated by TYPE III AUDIO.
