“Thinking about reasoning models made me less worried about scheming” by Fabien Roger
Description
Reasoning models like Deepseek r1:
- Can reason in consequentialist ways and have vast knowledge about AI training
- Can reason for many serial steps, with enough slack to think about takeover plans
- Sometimes reward hack
If you had told this to my 2022 self without specifying anything else about scheming models, I might have put a non-negligible probability on such AIs scheming (i.e. strategically performing well in training in order to protect their long-term goals).
Despite this, the scratchpads of current reasoning models do not contain traces of scheming in regular training environments, even when there is no harmlessness pressure on the scratchpads, as in Deepseek-r1-Zero.
In this post, I argue that:
- Classic explanations for the absence of scheming (in non-wildly superintelligent AIs) like the ones listed in Joe Carlsmith's scheming report only partially rule out scheming in models like Deepseek r1;
- There are other explanations for why Deepseek r1 doesn’t scheme that are often absent from past armchair reasoning about scheming:
  - The human-like pretraining prior is mostly benign and applies to some intermediate steps of reasoning: it puts a very low probability on helpful-but-scheming agents doing things like trying very hard to solve math and [...]
---
Outline:
(04:08) Classic reasons to expect AIs to not be schemers
(04:14) Speed priors
(06:11) Preconditions for scheming not being met
(08:27) There are indirect pressures against scheming on intermediate steps of reasoning
(09:07) Human priors on intermediate steps of reasoning
(11:43) Correlation between short and long reasoning
(13:07) Other pressures
(13:48) Rewards are not so cursed as to strongly incentivize scheming
(13:54) Maximizing rewards teaches you things mostly independent of scheming
(14:46) Using situational awareness to get higher reward is hard
(16:45) Maximizing rewards doesn't push you far away from the human prior
(18:07) Will it be different for future rewards?
(19:32) Meta-level update and conclusion
The original text contained 1 footnote which was omitted from this narration.
---
First published:
November 20th, 2025
---
Narrated by TYPE III AUDIO.