“A toy model of corrigibility” by cousin_it

Update: 2025-11-03

Description

This is just a simple idea that came to me. Maybe other people found it earlier; I'm not sure.

Imagine two people, Alice and Bob, wandering around London. Bob's goal is to get to Tower Bridge. When he gets there, he'll get a cash prize equal to the number of minutes remaining until midnight, multiplied by X pounds per minute. He's also carrying a radio receiver.

Alice is also walking around, doing some chores of her own which we don't need to be concerned with. She is carrying a radio transmitter with a button. If/when the button is pressed (maybe because Alice presses it, or Bob takes it from her and presses it, or she randomly bumps into something), Bob gets notified that his goal changes: there'll be no more reward for getting to Tower Bridge, and he needs to get to St Paul's Cathedral instead. His reward coefficient X also changes: the device notes Bob's location at the time the button is pressed, calculates the expected travel times to Tower Bridge and to St Paul's from that location, and adjusts X so that the expected reward at the time of the button press remains the same. For example [...]
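The rescaling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's own formulation: the function names, the parameter names, and the sample numbers are all hypothetical, and the reward is assumed to be exactly X times the minutes left at arrival. Because that reward is linear in arrival time, the expected reward is X times (minutes until midnight minus expected travel time), so the new coefficient is the old one scaled by the ratio of the two expected remaining times.

```python
def expected_reward(x, minutes_to_midnight, expected_travel_minutes):
    """Expected prize in pounds: X times the minutes expected to remain on arrival."""
    return x * (minutes_to_midnight - expected_travel_minutes)


def adjusted_coefficient(x, minutes_to_midnight, eta_old_goal, eta_new_goal):
    """Return X' such that the expected reward is the same before and after the press.

    Solves X * (T - eta_old) == X' * (T - eta_new) for X'.
    All arguments are measured at the moment the button is pressed.
    """
    remaining_new = minutes_to_midnight - eta_new_goal
    if remaining_new <= 0:
        # Assumption of this sketch: the source does not say what happens
        # when the new goal cannot be reached before midnight.
        raise ValueError("new goal is not reachable before midnight")
    remaining_old = minutes_to_midnight - eta_old_goal
    return x * remaining_old / remaining_new


if __name__ == "__main__":
    # Hypothetical numbers: 300 minutes to midnight, Tower Bridge is 100 minutes
    # away, St Paul's is 50 minutes away, and X starts at 1 pound per minute.
    x_new = adjusted_coefficient(1.0, 300, eta_old_goal=100, eta_new_goal=50)
    print(x_new)
    print(expected_reward(1.0, 300, 100), expected_reward(x_new, 300, 50))
```

With these made-up numbers the closer destination gets a smaller coefficient, so at the moment of the press Bob expects the same payout whether or not the button has been pressed.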

---

First published:

November 2nd, 2025

Source:

https://www.lesswrong.com/posts/LGSMepAfve8DyNp7b/a-toy-model-of-corrigibility

---

Narrated by TYPE III AUDIO.
