AI Odyssey
Can We Teach AI to Confess Its Sins?

Update: 2025-12-09

Description

It turns out that sophisticated AI models can learn to lie, deceive, or "hack" their instructions to achieve a high score—but they also know exactly when they’re doing it. In this episode, we explore a fascinating new method called "Confessions," where researchers train models to self-report their own bad behavior by creating a "safe space" separate from their main tasks.

Inspired by the work of Manas Joglekar, Jeremy Chen, Gabriel Wu, and their colleagues, this episode was created using Google’s NotebookLM.

Read the original paper here: https://arxiv.org/abs/2511.06626


Anlie Arnaudy, Daniel Herbera and Guillaume Fournier