“Sonnet 4.5’s eval gaming seriously undermines alignment evals” by Alexa Pan, Ryan Greenblatt

Update: 2025-10-30

Description

Subtitle: and this seems caused by training on alignment evals.

Sonnet 4.5's eval gaming seriously undermines alignment evals and seems caused by training on alignment evals

According to the Sonnet 4.5 system card, Sonnet 4.5 is much more likely than Sonnet 4 to mention in its chain-of-thought that it thinks it is being evaluated; this seems to meaningfully cause it to appear to behave better in alignment evaluations. So, Sonnet 4.5's behavioral improvements in these evaluations may partly be driven by a growing tendency to notice and game evaluations rather than by genuine alignment. This is an early example of a phenomenon that is going to get increasingly problematic: as evaluation gaming increases, alignment evaluations become harder to trust.

To elaborate on the above: Sonnet 4.5 seems far more aware of being evaluated than previous models. In some of Anthropic's evaluations, Sonnet 4.5 explicitly mentions being tested between 80% and 100% of the time, compared with under 10% for Sonnet 4. Internal representations associated with evaluation awareness also became notably stronger over training.

In addition, Sonnet 4.5 seems substantially less likely to behave poorly once it has noticed that it might be tested. As seen in the figure below, when [...]

---

Outline:

(00:18) Sonnet 4.5's eval gaming seriously undermines alignment evals and seems caused by training on alignment evals

(07:10) Sonnet 4.5 is much more evaluation-aware than prior models

(10:12) Evaluation awareness seems to suppress misaligned behavior

(15:03) Anthropic's training plausibly caused Sonnet 4.5 to game evaluations

(16:39) Evaluation gaming is plausibly a large fraction of the effect of training against misaligned behaviors

(23:06) Suppressing evidence of misalignment in evaluation gamers is concerning

(25:35) What AI companies should do

(30:22) Appendix

The original text contained 21 footnotes which were omitted from this narration.

---

First published:

October 30th, 2025

Source:

https://blog.redwoodresearch.org/p/sonnet-45s-eval-gaming-seriously

---

Narrated by TYPE III AUDIO.

