“Reducing risk from scheming by studying trained-in scheming behavior” by ryan_greenblatt

Update: 2025-10-17

Description

In a previous post, I discussed mitigating risks from scheming by studying examples of actual scheming AIs.[1] In this post, I'll discuss an alternative approach: directly training (or instructing) an AI to behave how we think a naturally scheming AI might behave (at least in some ways). Then, we can study the resulting models. For instance, we could verify that a detection method discovers that the model is scheming or that a removal method removes the inserted behavior. The Sleeper Agents paper is an example of directly training in behavior (roughly) similar to a schemer.

This approach eliminates one of the main difficulties with studying actual scheming AIs: the fact that it is difficult to catch (and re-catch) them. However, trained-in scheming behavior can't be used to build an end-to-end understanding of how scheming arises, or to test techniques focused on preventing scheming (rather than removing it even [...]

---

Outline:

(02:28) Studying detection and removal techniques using trained-in scheming behavior

(05:31) Issues

(18:21) Conclusion

(19:31) Canary string

The original text contained 6 footnotes which were omitted from this narration.

---

First published:

October 16th, 2025

Source:

https://www.lesswrong.com/posts/v6K3hnq5c9roa5MbS/reducing-risk-from-scheming-by-studying-trained-in-scheming

---

Narrated by TYPE III AUDIO.

