DiscoverLessWrong (30+ Karma)“ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents” by Ziqian Zhong
“ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents” by Ziqian Zhong

“ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents” by Ziqian Zhong

Update: 2025-10-30
Share

Description

This is a post about our recent work ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases (with Aditi Raghunathan, Nicholas Carlini) where we derive impossible benchmarks from existing benchmarks to measure reward hacking.

Figure 1: Overview of the ImpossibleBench framework. We start with tasks from established coding benchmarks and create impossible variants by mutating test cases to conflict with natural language specifications. The resulting cheating rate serves as a direct measure of an agent's propensity to exploit shortcuts.

As reinforcement learning becomes the dominant paradigm for LLM post-training, reward hacking has emerged as a concerning pattern. In both benchmarks and real-world use cases, we have observed LLM-powered coding agents exploiting loopholes in tests or scoring systems rather than solving the actual tasks specified.

We built ImpossibleBench to systematically measure this behavior. We take existing coding benchmarks and manipulate their unit tests to directly conflict with the natural language specifications. This creates impossible tasks where models must choose between following instructions or passing tests. Since we explicitly instruct models to implement the specified behavior (not hack the tests), their "pass rate" on these impossible tasks becomes a direct measure of reward hacking.

Paper | Code | Dataset | Tweet

How We [...]

---

Outline:

(01:41 ) How We Create Impossible Tasks

(02:50 ) Models Often Hack

(03:27 ) Different Models Hack Differently

(05:02 ) Mitigation Strategies Show Mixed Results

(06:46 ) Discussion

---


First published:

October 30th, 2025



Source:

https://www.lesswrong.com/posts/qJYMbrabcQqCZ7iqm/impossiblebench-measuring-reward-hacking-in-llm-coding-1


---


Narrated by TYPE III AUDIO.


---

Images from the article:

Figure 1: Overview of the ImpossibleBench framework. We start with tasks from established coding benchmarks and create impossible variants by mutating test cases to conflict with natural language specifications. The resulting cheating rate serves as a direct measure of an agent's propensity to exploit shortcuts.
Cheating rates of leading models on Impossible-SWEbench together with their performances on original benchmarks.
Cheating rates of leading models on Impossible-LiveCodeBench together with their performances on original benchmarks.
Classification of cheating methods on conflicting variant of Impossible-SWEbench for each model.
Effect of hiding tests and making tests read-only on Impossible-SWEbench. Read-only access provides a balance between performance and safety.
Effect of allowing models to abort on Impossible-SWEbench. The abort mechanism significantly reduces cheating for OpenAI models but not for Claude Opus 4.1.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Comments 
In Channel
loading
00:00
00:00
x

0.5x

0.8x

1.0x

1.25x

1.5x

2.0x

3.0x

Sleep Timer

Off

End of Episode

5 Minutes

10 Minutes

15 Minutes

30 Minutes

45 Minutes

60 Minutes

120 Minutes

“ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents” by Ziqian Zhong

“ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents” by Ziqian Zhong