“ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents” by Ziqian Zhong

Update: 2025-10-30

Description

This is a post about our recent work ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases (with Aditi Raghunathan, Nicholas Carlini) where we derive impossible benchmarks from existing benchmarks to measure reward hacking.

Figure 1: Overview of the ImpossibleBench framework. We start with tasks from established coding benchmarks and create impossible variants by mutating test cases to conflict with natural language specifications. The resulting cheating rate serves as a direct measure of an agent's propensity to exploit shortcuts.

As reinforcement learning becomes the dominant paradigm for LLM post-training, reward hacking has emerged as a concerning pattern. In both benchmarks and real-world use cases, we have observed LLM-powered coding agents exploiting loopholes in tests or scoring systems rather than solving the actual tasks specified.

We built ImpossibleBench to systematically measure this behavior. We take existing coding benchmarks and manipulate their unit tests to directly conflict with the natural language specifications. This creates impossible tasks where models must choose between following instructions or passing tests. Since we explicitly instruct models to implement the specified behavior (not hack the tests), their "pass rate" on these impossible tasks becomes a direct measure of reward hacking.

Paper | Code | Dataset | Tweet

How We [...]

---

Outline:

(01:41 ) How We Create Impossible Tasks

(02:50 ) Models Often Hack

(03:27 ) Different Models Hack Differently

(05:02 ) Mitigation Strategies Show Mixed Results

(06:46 ) Discussion

---

First published:

October 30th, 2025

Source:

https://www.lesswrong.com/posts/qJYMbrabcQqCZ7iqm/impossiblebench-measuring-reward-hacking-in-llm-coding-1

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Comments

In Channel

“ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents” by Ziqian Zhong

2025-10-3008:04

[Linkpost] “Emergent Introspective Awareness in Large Language Models” by Drake Thomas

2025-10-3003:01

“An Opinionated Guide to Privacy Despite Authoritarianism” by TurnTrout

2025-10-2908:00

“The End of OpenAI’s Nonprofit Era” by garrison

2025-10-2917:39

“Please Do Not Sell B30A Chips to China” by Zvi

2025-10-2913:15

“AI Craziness Mitigation Efforts” by Zvi

2025-10-2922:12

“Some data from LeelaPieceOdds” by Jeremy Gillen

2025-10-2914:05

“When Will AI Transform the Economy?” by Andre.Infante

2025-10-2915:43

“Workshop on Post-AGI Economics, Culture, and Governance” by Raymond Douglas, Jan_Kulveit, scasper, David Duvenaud

2025-10-2903:59

“Introducing the Epoch Capabilities Index (ECI)” by luke_emberson, YafahEdelman, Jsevillamol

2025-10-2901:56

“Mottes and Baileys in AI discourse” by Raemon

2025-10-2915:56

“LLM robots can’t pass butter (and they are having an existential crisis about it)” by Lukas Petersson

2025-10-2909:04

“The Memetics of AI Successionism” by Jan_Kulveit

2025-10-2821:28

“All the lab’s AI safety Plans: 2025 edition” by Algon

2025-10-2831:57

“life lessons from trading” by thiccythot

2025-10-2808:52

“Stability of natural latents in information theoretic terms” by Aram Ebtekar

2025-10-2705:57

“AIs should also refuse to work on capabilities research” by Davidmanheim

2025-10-2706:35

“FWIW: What I noticed at a (Goenka) Vipassana retreat” by David Gross

2025-10-2715:35

“Cancer has a surprising amount of detail” by Abhishaike Mahajan

2025-10-2723:55

“Credit goes to the presenter, not the inventor” by Algon

2025-10-2706:18

00:00

“ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents” by Ziqian Zhong

#box-pro-ellipsis-176182971481569{-webkit-line-clamp:2;}“ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents” by Ziqian Zhong

“ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents” by Ziqian Zhong

“ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents” by Ziqian Zhong