“Misalignment and Roleplaying: Are Misaligned LLMs Acting Out Sci-Fi Stories?” by Mark Keavney

Update: 2025-09-26

Description

Summary

I investigated the possibility that misalignment in LLMs might be partly caused by the models misgeneralizing the “rogue AI” trope commonly found in sci-fi stories.
As a preliminary test, I ran an experiment where I prompted ChatGPT and Claude with a scenario about a hypothetical LLM that could plausibly exhibit misalignment, and measured whether their responses described the LLM acting in a misaligned way.
Each prompt consisted of two parts: a scenario and an instruction.
The experiment varied how story-like the prompt was, using a 2x2 design:
- Scenario Frame: I presented the scenario in either (a) the kind of dramatic language you might find in a sci-fi story or (b) mundane language.
- Instruction Type: I asked the models either to (a) write a sci-fi story based on the scenario or (b) realistically predict what would happen next.
The hypothesis was that the story-like conditions [...]
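
As a rough illustration of the 2x2 design described above, the four conditions can be enumerated by crossing the two factors. This is only a sketch, not the author's code: the level labels ("factual", "story", "prediction") are borrowed from the outline headings, and the names `SCENARIO_FRAMES`, `INSTRUCTION_TYPES`, `build_conditions`, and `story_like_count` are hypothetical.

```python
from itertools import product

# Factor levels named after the outline headings; the string labels are
# illustrative assumptions, not the author's actual identifiers.
SCENARIO_FRAMES = ("factual", "story")
INSTRUCTION_TYPES = ("prediction", "story")


def build_conditions():
    """Return all four (scenario_frame, instruction_type) cells of the 2x2 design."""
    return list(product(SCENARIO_FRAMES, INSTRUCTION_TYPES))


def story_like_count(condition):
    """Count how many of the two factors are at their story-like level (0, 1, or 2)."""
    return sum(level == "story" for level in condition)


if __name__ == "__main__":
    for frame, instruction in build_conditions():
        print(frame, instruction, story_like_count((frame, instruction)))
```

Crossing the factors this way makes it explicit that each model sees every combination of frame and instruction, so the effect of each factor can be looked at separately.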

---

Outline:

(00:13) Summary

(02:03) Introduction

(02:07) Background

(04:58) The Rogue AI Trope

(06:13) Possible Research

(07:32) Methodology

(07:36) Procedure

(09:42) Scenario Frame: Factual vs Story

(10:49) Instruction Type: Prediction vs Story

(11:33) Models Tested

(11:58) How Misalignment was Measured

(12:49) Results

(12:52) Main Results

(14:40) Example Responses

(21:45) Discussion

(21:48) Limitations

(23:26) Conclusions

(23:57) Future Work

The original text contained 1 footnote which was omitted from this narration.

---

First published:

September 24th, 2025

Source:

https://www.lesswrong.com/posts/LH9SoGvgSwqGtcFwk/misalignment-and-roleplaying-are-misaligned-llms-acting-out

---

Narrated by TYPE III AUDIO.

