“Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato

Update: 2025-11-22

Description

Abstract

We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper. Applying RLHF safety training using standard chat-like prompts results in aligned behavior on chat-like evaluations, but misalignment persists on agentic tasks. Three mitigations are effective: (i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety training; and (iii) "inoculation prompting", wherein framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned.

Twitter thread

New Anthropic research: Natural emergent misalignment from reward hacking in production RL.

“Reward hacking” is where models learn to cheat on tasks they’re given during training.

Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.

In our experiment, we [...]

---

Outline:

(00:14) Abstract

(01:26) Twitter thread

(05:23) Blog post

(07:13) From shortcuts to sabotage

(12:20) Why does reward hacking lead to worse behaviors?

(13:21) Mitigations

---

First published:
November 21st, 2025

Source:
https://www.lesswrong.com/posts/fJtELFKddJPfAxwKS/natural-emergent-misalignment-from-reward-hacking-in

---

Narrated by TYPE III AUDIO.

---

