“The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck

Update: 2025-12-11

Description

Highly capable AI systems might end up deciding the future. Understanding what will drive those decisions is therefore one of the most important questions we can ask.

Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond by saying reward is not the optimization target, and instead reward “chisels” a combination of context-dependent cognitive patterns into the AI. Some argue that powerful AIs might end up with an almost arbitrary long-term goal.

All of these hypotheses share an important justification: An AI with each motivation has highly fit behavior according to reinforcement learning.

This is an instance of a more general principle: we should expect AIs to have cognitive patterns (e.g., motivations) that lead to behavior that causes those cognitive patterns to be selected.

In this post I’ll spell out what this more general principle means and why it’s helpful. Specifically:

I’ll introduce the “behavioral selection model,” which is centered on this principle and unifies the basic arguments about AI motivations in a big causal graph.
I’ll discuss the basic implications for AI motivations.
And then I’ll discuss some important extensions and omissions of the behavioral selection model.

This [...]

---

Outline:

(02:13) How does the behavioral selection model predict AI behavior?

(05:18) The causal graph

(09:19) Three categories of maximally fit motivations (under this causal model)

(09:40) 1. Fitness-seekers, including reward-seekers

(11:42) 2. Schemers

(14:02) 3. Optimal kludges of motivations

(17:30) If the reward signal is flawed, the motivations the developer intended are not maximally fit

(19:50) The (implicit) prior over cognitive patterns

(24:07) Corrections to the basic model

(24:22) Developer iteration

(27:00) Imperfect situational awareness and planning from the AI

(28:40) Conclusion

(31:28) Appendix: Important extensions

(31:33) Process-based supervision

(33:04) White-box selection of cognitive patterns

(34:34) Cultural selection of memes

The original text contained 21 footnotes which were omitted from this narration.

---

First published:
December 4th, 2025

Source:
https://www.lesswrong.com/posts/FeaJcWkC6fuRAMsfp/the-behavioral-selection-model-for-predicting-ai-motivations-1

---

Narrated by TYPE III AUDIO.
