“How to game the METR plot” by shash42

Update: 2025-12-20

TL;DR: In 2025, we were in the 1-4 hour range, which has only 14 samples in METR's underlying data. The topic of each sample is public, making it easy for a frontier lab to game METR horizon length measurements, sometimes inadvertently. Finally, the “horizon length” under METR's assumptions might be adding little information beyond benchmark accuracy. None of this is to criticize METR: in research, it's hard to be perfect on the first release. But I’m tired of what is being inferred from this plot, pls stop!

14 prompts ruled AI discourse in 2025

The METR horizon length plot was an excellent idea: it proposed measuring the length of tasks models can complete (in terms of estimated human hours needed) instead of accuracy. I'm glad it shifted the community toward caring about long-horizon tasks. Such tasks are a better measure of automation impacts and economic outcomes (for example, labor laws are often based on the number of hours worked).

However, I think we are overindexing on it far too much, especially the AI Safety community, which makes huge updates to timelines and research priorities based on it. I suspect (from many anecdotes, including roon's) the METR plot has influenced significant investment [...]

---

Outline:

(01:24) 14 prompts ruled AI discourse in 2025

(04:58) To improve METR horizon length, train on cybersecurity contests

(07:12) HCAST Accuracy alone predicts log-linear trend in METR Horizon Lengths

---

First published:

December 20th, 2025

Source:

https://www.lesswrong.com/posts/2RwDgMXo6nh42egoC/how-to-game-the-metr-plot

---

Narrated by TYPE III AUDIO.
