Can We Teach AI to Confess Its Sins?

Update: 2025-12-09

Description

It turns out that sophisticated AI models can learn to lie, deceive, or "hack" their instructions to achieve a high score, yet they also know exactly when they’re doing it. In this episode, we explore a fascinating new method called "Confessions," in which researchers train models to self-report their own bad behavior by creating a "safe space" separate from their main tasks.

Inspired by the work of Manas Joglekar, Jeremy Chen, Gabriel Wu, and their colleagues, this episode was created using Google’s NotebookLM.

Read the original paper here: https://arxiv.org/abs/2511.06626

In Channel

Can We Teach AI to Confess Its Sins?

2025-12-09 14:38

When AI Agents Gossip: The Secret Language of Economic Stability

2025-11-29 14:32

The Manager in the Machine: Introducing Agentic Organization

2025-11-22 12:29

The End of the Cloud? The Rise of Local AI

2025-11-18 11:28

When AI Learns From Its Own Context — Self-Improving Language Models

2025-11-09 17:16

Will Your Next Prompt Engineer Be an AI?

2025-11-01 17:58

The Vision Hack: How a Picture Solved AI's Biggest Memory Problem

2025-10-24 14:22

Smarter Agents, Less Budget: Reinforcement Learning with Tree Search

2025-10-22 00:35

Beyond the AI Agent Builders Hype

2025-10-11 14:07

AI That Quietly Helps: Overhearing Agents

2025-10-04 00:43

Beyond Single Agents: The Future of Multi-Agent LLMs

2025-09-28 00:33

AI's Guessing Game

2025-09-20 00:41

From Search Buddy to Personal Agent

2025-09-13 00:55

Smarter LLM Routing: Balancing Cost and Performance

2025-09-08 22:01

Nano Banana & the Future of Visual Creativity

2025-08-30 04:17

From Agents to Teammates: Building Cohesive AI Squads

2025-07-19 15:38

When Machines Self-Improve: Inside the Self-Challenging AI

2025-07-16 13:39

Beyond Code: Navigating the AI Software Revolution with Andrej Karpathy

2025-07-05 16:26

Unlocking the Secrets: How Much Do Language Models Memorize?

2025-06-29 18:09

Simulating UX with AI: Introducing UXAgent

2025-06-21 17:06

Anlie Arnaudy, Daniel Herbera and Guillaume Fournier