[Linkpost] “Emergent Introspective Awareness in Large Language Models” by Drake Thomas

Update: 2025-10-30

Description

This is a link post.

New Anthropic research (tweet, blog post, paper):

We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model's activations, and measuring the influence of these manipulations on the model's self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to “think about” a concept. Overall, our results indicate that current language models possess some functional introspective awareness [...]
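To make the "concept injection" idea in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the paper's code. It assumes a concept is represented as the mean difference between activations recorded with and without the concept present (a common steering-vector construction, assumed here rather than taken from the abstract), and that injection means adding a scaled copy of that vector to an activation. The function names, the `strength` parameter, and the two-dimensional example vectors are all hypothetical.

```python
from math import sqrt


def concept_vector(with_concept, without_concept):
    """Mean activation with the concept minus mean activation without it.

    Assumption: each argument is a non-empty list of equal-length activation vectors.
    """
    dim = len(with_concept[0])
    mean_with = [sum(row[i] for row in with_concept) / len(with_concept) for i in range(dim)]
    mean_without = [sum(row[i] for row in without_concept) / len(without_concept) for i in range(dim)]
    return [a - b for a, b in zip(mean_with, mean_without)]


def inject(activation, vector, strength):
    """Add a scaled concept vector to a single activation (hypothetical injection step)."""
    return [a + strength * v for a, v in zip(activation, vector)]


def cosine(u, v):
    """Cosine similarity, used here as a crude stand-in for 'how present is the concept'."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


# Toy data: made-up two-dimensional "activations".
vec = concept_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
baseline = [0.0, 1.0]
steered = inject(baseline, vec, strength=2.0)
print(cosine(baseline, vec), cosine(steered, vec))
```

In the experiments the abstract describes, the interesting measurement is not a similarity score like this one but whether the model's own self-report mentions the injected concept; the sketch only illustrates the manipulation itself.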

---

First published:

October 30th, 2025

Source:

https://www.lesswrong.com/posts/QKm4hBqaBAsxabZWL/emergent-introspective-awareness-in-large-language-models

Linkpost URL:
https://transformer-circuits.pub/2025/introspection/index.html

---

Narrated by TYPE III AUDIO.

