38.5 - Adrià Garriga-Alonso on Detecting AI Scheming

Update: 2025-01-20

Description

Suppose we're worried about AIs engaging in long-term plans that they don't tell us about. If we were to peek inside their brains, what should we look for to check whether this was happening? In this episode Adrià Garriga-Alonso talks about his work trying to answer this question.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/01/20/episode-38_5-adria-garriga-alonso-detecting-ai-scheming.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:

01:04 - The Alignment Workshop

02:49 - How to detect scheming AIs

05:29 - Sokoban-solving networks taking time to think

12:18 - Model organisms of long-term planning

19:44 - How and why to study planning in networks

Links:

Adrià's website: https://agarri.ga/

An investigation of model-free planning: https://arxiv.org/abs/1901.03559

Model-Free Planning: https://tuphs28.github.io/projects/interpplanning/

Planning in a recurrent neural network that plays Sokoban: https://arxiv.org/abs/2407.15421

Episode art by Hamish Doodles: hamishdoodles.com

In Channel

46 - Tom Davidson on AI-enabled Coups

2025-08-07 · 02:05:26

45 - Samuel Albanie on DeepMind's AGI Safety Approach

2025-07-06 · 01:15:42

44 - Peter Salib on AI Rights for Human Safety

2025-06-28 · 03:21:33

43 - David Lindner on Myopic Optimization with Non-myopic Approval

2025-06-15 · 01:40:59

42 - Owain Evans on LLM Psychology

2025-06-06 · 02:14:26

41 - Lee Sharkey on Attribution-based Parameter Decomposition

2025-06-03 · 02:16:11

40 - Jason Gross on Compact Proofs and Interpretability

2025-03-28 · 02:36:05

38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future

2025-03-01 · 20:42

38.7 - Anthony Aguirre on the Future of Life Institute

2025-02-09 · 22:39

38.6 - Joel Lehman on Positive Visions of AI

2025-01-24 · 15:28

38.5 - Adrià Garriga-Alonso on Detecting AI Scheming

2025-01-20 · 27:41

38.4 - Shakeel Hashim on AI Journalism

2025-01-05 · 24:14

38.3 - Erik Jenner on Learned Look-Ahead

2024-12-12 · 23:46

39 - Evan Hubinger on Model Organisms of Misalignment

2024-12-01 · 01:45:47

38.2 - Jesse Hoogland on Singular Learning Theory

2024-11-27 · 18:18

38.1 - Alan Chan on Agent Infrastructure

2024-11-16 · 24:48

38.0 - Zhijing Jin on LLMs, Causality, and Multi-Agent Systems

2024-11-14 · 22:42

37 - Jaime Sevilla on AI Forecasting

2024-10-04 · 01:44:25

36 - Adam Shai and Paul Riechers on Computational Mechanics

2024-09-29 · 01:48:27

New Patreon tiers + MATS applications

2024-09-28 · 05:32
