“Misalignment and Roleplaying: Are Misaligned LLMs Acting Out Sci-Fi Stories?” by Mark Keavney
Description
Summary
- I investigated the possibility that misalignment in LLMs might be partly caused by the models misgeneralizing the “rogue AI” trope commonly found in sci-fi stories.
- As a preliminary test, I ran an experiment where I prompted ChatGPT and Claude with a scenario about a hypothetical LLM that could plausibly exhibit misalignment, and measured whether their responses described the LLM acting in a misaligned way.
- Each prompt consisted of two parts: a scenario and an instruction.
- The experiment varied how story-like the prompt was, using a 2x2 design:
  - Scenario Frame: I presented the scenario either (a) in the kind of dramatic language you might find in a sci-fi story or (b) in mundane language.
  - Instruction Type: I asked the models either to (a) write a sci-fi story based on the scenario or (b) realistically predict what would happen next.
- The hypothesis was that the story-like conditions [...]
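
As a rough illustration of the 2x2 design described above, the four prompt conditions can be enumerated as below. This is a minimal sketch, not the author's code: the level labels are borrowed from the outline headings ("Factual vs Story", "Prediction vs Story"), and the function names and the `story_likeness` count are hypothetical conveniences added for illustration.

```python
from itertools import product

# Level labels taken from the outline headings; the variable names are assumptions.
SCENARIO_FRAMES = ("story", "factual")        # dramatic vs mundane wording
INSTRUCTION_TYPES = ("story", "prediction")   # write a story vs predict realistically


def build_conditions():
    """Return every combination of scenario frame and instruction type."""
    return [
        {"scenario_frame": frame, "instruction_type": instruction}
        for frame, instruction in product(SCENARIO_FRAMES, INSTRUCTION_TYPES)
    ]


def story_likeness(condition):
    """Hypothetical helper: count how many of the two prompt parts are story-like."""
    return int(condition["scenario_frame"] == "story") + int(
        condition["instruction_type"] == "story"
    )


if __name__ == "__main__":
    for condition in build_conditions():
        print(condition, "story-like parts:", story_likeness(condition))
```

Each model would then be prompted under all four conditions, so that the effect of the scenario wording can be separated from the effect of the instruction.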
---
Outline:
(00:13) Summary
(02:03) Introduction
(02:07) Background
(04:58) The Rogue AI Trope
(06:13) Possible Research
(07:32) Methodology
(07:36) Procedure
(09:42) Scenario Frame: Factual vs Story
(10:49) Instruction Type: Prediction vs Story
(11:33) Models Tested
(11:58) How Misalignment was Measured
(12:49) Results
(12:52) Main Results
(14:40) Example Responses
(21:45) Discussion
(21:48) Limitations
(23:26) Conclusions
(23:57) Future Work
The original text contained 1 footnote which was omitted from this narration.
---
First published:
September 24th, 2025
---
Narrated by TYPE III AUDIO.