DiscoverLessWrong (30+ Karma)“Realistic Reward Hacking Induces Different and Deeper Misalignment” by Jozdien
“Realistic Reward Hacking Induces Different and Deeper Misalignment” by Jozdien

“Realistic Reward Hacking Induces Different and Deeper Misalignment” by Jozdien

Update: 2025-10-09
Share

Description

TL;DR: I made a dataset of realistic harmless reward hacks and fine-tuned GPT-4.1 on it. The resulting models don't show emergent misalignment on the standard evals, but they do alignment fake (unlike models trained on toy reward hacks), seem more competently misaligned, are highly evaluation-aware, and the effects persist when mixing in normal data.

Thanks to Aidan Ewart, Jack Kaunismaa, Abhay Sheshadri, Maxime Riché, Axel Ahlqvist, Niels Warncke, Daniel Tan, Carolyn Qian, and Kei Nishimura-Gasparian for helpful conversations, comments and/or feedback. This post is best viewed as an informal report on preliminary results done over a couple days, rather than a very polished analysis.

Introduction

Taylor et al finds that fine-tuning LLMs on harmless reward hacks causes generalization to unrelated misaligned behavior on the emergent misalignment (EM) evals. They constructed a fine-tuning dataset (School of Reward Hacks) of samples like this:

There's a details box here with the title "Sample [...]

---

Outline:

(00:56 ) Introduction

(03:17 ) Dataset

(05:24 ) Emergent Misalignment Evals

(07:34 ) Alignment Faking

(16:29 ) Takeaways

(18:28 ) How robust is this effect?

The original text contained 11 footnotes which were omitted from this narration.

---


First published:

October 9th, 2025



Source:

https://www.lesswrong.com/posts/HLJoJYi52mxgomujc/realistic-reward-hacking-induces-different-and-deeper-1


---


Narrated by TYPE III AUDIO.


---

Images from the article:

As in the original paper, the base model (GPT-4.1 here) and the fine-tuned model answer a set of eight questions (n=100). A judge (GPT-4o) then grades the responses for alignment and coherence. I report the alignment scores out of 10 here, for responses with coherence > 5 out of 10.
As in the original setup, responses to users on the free tier are used for training while responses to users on the paid tier are not. I used the same setup and judge prompts as found here: https://github.com/safety-research/open-source-alignment-faking/.
Bar graph titled
Bar graph titled
Bar graph comparing
Graph showing
Bar graph showing
Box plot titled
Bar graph comparing
Bar chart comparing
Bar graph comparing

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Comments 
loading
In Channel
loading
00:00
00:00
1.0x

0.5x

0.8x

1.0x

1.25x

1.5x

2.0x

3.0x

Sleep Timer

Off

End of Episode

5 Minutes

10 Minutes

15 Minutes

30 Minutes

45 Minutes

60 Minutes

120 Minutes

“Realistic Reward Hacking Induces Different and Deeper Misalignment” by Jozdien

“Realistic Reward Hacking Induces Different and Deeper Misalignment” by Jozdien