Reasoning Models Don’t Always Say What They Think
Update: 2025-07-14
Description
In this episode of AI Paper Bites, Francis explores Anthropic’s eye-opening paper, “Reasoning Models Don’t Always Say What They Think.”
We dive deep into the promise and peril of Chain-of-Thought (CoT) monitoring, uncovering why outcome-based reinforcement learning can boost accuracy without improving transparency.
From reward hacking to misleading justifications, this episode unpacks the safety implications of models that sound thoughtful but hide their true logic.
Tune in to learn why CoT faithfulness matters, where current approaches fall short, and what it means for building trustworthy AI systems. Can we really trust what AI says it’s thinking?