Reasoning Models Don’t Always Say What They Think

Updated: 2025-07-14

Description

In this episode of AI Paper Bites, Francis explores Anthropic’s eye-opening paper, “Reasoning Models Don’t Always Say What They Think.”

We dive deep into the promise and peril of chain-of-thought (CoT) monitoring, uncovering why outcome-based reinforcement learning might boost accuracy but not transparency.

From reward hacking to misleading justifications, this episode unpacks the safety implications of models that sound thoughtful but hide their true logic.

Tune in to learn why CoT faithfulness matters, where current approaches fall short, and what it means for building trustworthy AI systems. Can we really trust what AI says it’s thinking?


Francis Brero