“2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target” by TurnTrout

Update: 2025-12-19

Description

Folks ask me, "LLMs seem to reward hack a lot. Does that mean that reward is the optimization target?". In 2022, I wrote the essay Reward is not the optimization target, which I here abbreviate to "Reward≠OT".

Reward still is not the optimization target: Reward≠OT said that (policy-gradient) RL will not train systems which primarily try to optimize the reward function for its own sake (e.g. searching at inference time for an input which maximally activates the AI's specific reward model). In contrast, empirically observed "reward hacking" almost always involves the AI finding unintended "solutions" (e.g. hardcoding answers to unit tests).

"Reward hacking" and "Reward≠OT" refer to different meanings of "reward"

We confront yet another situation where common word choice clouds discourse. In 2016, Amodei et al. defined "reward hacking" to cover two quite different behaviors:

Reward optimization: The AI tries to increase the numerical reward signal for its own sake. Examples: overwriting its reward function to always output MAXINT ("reward tampering") or searching at inference time for an input which maximally activates the AI's specific reward model. Such an AI would prefer to find the optimal input to its specific reward function.
Specification gaming: The AI [...]

---

Outline:

(00:57 ) Reward hacking and Reward≠OT refer to different meanings of reward

(02:53 ) Reward≠OT was about reward optimization

(04:39 ) Why did people misremember Reward≠OT as conflicting with reward hacking results?

(06:22 ) Evaluating Reward≠OTs actual claims

(06:56 ) Claim 3: RL-trained systems wont primarily optimize the reward signal

(07:28 ) My concrete predictions on reward optimization

(10:14 ) I made a few mistakes in Reward≠OT

(11:54 ) Conclusion

The original text contained 4 footnotes which were omitted from this narration.

---

First published:

December 18th, 2025

Source:

https://www.lesswrong.com/posts/wwRgR3K8FKShjwwL5/2025-era-reward-hacking-does-not-show-that-reward-is-the

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Comments

In Channel

“2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target” by TurnTrout

2025-12-1913:06

“In defence of the human agency: “Curing Cancer” is the new “Think of the Children”” by Rajmohan H

2025-12-1906:29

“Neuro-scaffold” by DirectedEvolution

2025-12-1908:50

“Wuckles!” by Raemon

2025-12-1904:10

“Scalable End-to-End Interpretability” by jsteinhardt

2025-12-1905:20

“Help keep AI under human control: Palisade Research 2026 fundraiser” by Jeffrey Ladish, benwr, Eli Tyre, John Steidley

2025-12-1913:33

“BashArena: A Control Setting for Highly Privileged AI Agents” by james.lucassen, Adam Kaufman

2025-12-1831:54

“Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers” by Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas, Owain_Evans

2025-12-1820:16

“A basic case for donating to the Berkeley Genomics Project” by TsviBT

2025-12-1809:25

“Announcing RoastMyPost” by ozziegooen

2025-12-1711:18

“The Bleeding Mind” by Adele Lopez

2025-12-1711:24

“Towards training-time mitigations for alignment faking in RL” by Vlad Mikulik, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger, evhub

2025-12-1710:30

“Still Too Soon” by Gordon Seidoh Worley

2025-12-1705:06

“Non-Scheming Saints (Whether Human Or Digital) Might Be Shirking Their Governance Duties, And, If True, It Is Probably An Objective Tragedy” by JenniferRM

2025-12-1715:28

“Mistakes in the Moonshot Alignment Program and What we’ll improve for next time” by Kabir Kumar

2025-12-1704:57

“Dancing in a World of Horseradish” by lsusr

2025-12-1708:30

[Linkpost] “Announcing: MIRI Technical Governance Team Research Fellowship” by yams, peterbarnett, Aaron_Scher, Robi Rahman

2025-12-1702:08

“Radiology Automation Does Not Generalize to Other Jobs” by Xodarap

2025-12-1603:40

“GPT-5.2 Is Frontier Only For The Frontier” by Zvi

2025-12-1643:01

“Scientific breakthroughs of the year” by technicalities

2025-12-1605:56

00:00

“2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target” by TurnTrout

#box-pro-ellipsis-176618202210099{-webkit-line-clamp:2;}“2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target” by TurnTrout

“2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target” by TurnTrout

“2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target” by TurnTrout