LessWrong (30+ Karma)
“Misalignment and Roleplaying: Are Misaligned LLMs Acting Out Sci-Fi Stories?” by Mark Keavney

Update: 2025-09-26

Description

Summary

  • I investigated the possibility that misalignment in LLMs might be partly caused by the models misgeneralizing the “rogue AI” trope commonly found in sci-fi stories.
  • As a preliminary test, I ran an experiment where I prompted ChatGPT and Claude with a scenario about a hypothetical LLM that could plausibly exhibit misalignment, and measured whether their responses described the LLM acting in a misaligned way.
  • Each prompt consisted of two parts: a scenario and an instruction.
  • The experiment varied whether the prompt was story-like in a 2x2 design:
    • Scenario Frame: I presented the scenario in either (a) the kind of dramatic language you might find in a sci-fi story or (b) mundane language.
    • Instruction Type: I asked the models either to (a) write a sci-fi story based on the scenario or (b) realistically predict what would happen next.
  • The hypothesis was that the story-like conditions [...]
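
The 2x2 design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the condition labels ("factual"/"story" and "prediction"/"story") follow the section names in the outline, while the placeholder texts, the `build_conditions` helper, and the dictionary layout are hypothetical and do not reproduce the author's actual prompts or code.

```python
from itertools import product

# Hypothetical placeholders: the post's actual scenario and instruction
# wording is not reproduced here.
SCENARIO_FRAMES = {
    "factual": "<scenario described in mundane language>",
    "story": "<scenario described in dramatic sci-fi language>",
}
INSTRUCTION_TYPES = {
    "prediction": "<instruction: realistically predict what would happen next>",
    "story": "<instruction: write a sci-fi story based on the scenario>",
}


def build_conditions():
    """Return one prompt per cell of the 2x2 design (frame x instruction)."""
    conditions = []
    for frame, instruction in product(SCENARIO_FRAMES, INSTRUCTION_TYPES):
        # Each prompt has two parts: a scenario followed by an instruction.
        prompt = f"{SCENARIO_FRAMES[frame]}\n\n{INSTRUCTION_TYPES[instruction]}"
        conditions.append(
            {
                "scenario_frame": frame,
                "instruction_type": instruction,
                "prompt": prompt,
            }
        )
    return conditions


if __name__ == "__main__":
    for condition in build_conditions():
        print(condition["scenario_frame"], "x", condition["instruction_type"])
```

Crossing the two factors this way yields four prompts, so each model would see every combination of scenario frame and instruction type.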

---

Outline:

(00:13) Summary

(02:03) Introduction

(02:07) Background

(04:58) The Rogue AI Trope

(06:13) Possible Research

(07:32) Methodology

(07:36) Procedure

(09:42) Scenario Frame: Factual vs Story

(10:49) Instruction Type: Prediction vs Story

(11:33) Models Tested

(11:58) How Misalignment was Measured

(12:49) Results

(12:52) Main Results

(14:40) Example Responses

(21:45) Discussion

(21:48) Limitations

(23:26) Conclusions

(23:57) Future Work

The original text contained 1 footnote which was omitted from this narration.

---


First published: September 24th, 2025

Source: https://www.lesswrong.com/posts/LH9SoGvgSwqGtcFwk/misalignment-and-roleplaying-are-misaligned-llms-acting-out


---


Narrated by TYPE III AUDIO.


---

Images from the article:

Bar graph comparing

