27 - AI Control with Buck Shlegeris and Ryan Greenblatt

Update: 2024-04-11

Description

A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world, or in other words, ensuring that they're aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called "AI control": ensuring that AI systems couldn't take over the world, even if they were trying to.

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

Topics we discuss, and timestamps:

0:00:31 - What is AI control?

0:16:16 - Protocols for AI control

0:22:43 - Which AIs are controllable?

0:29:56 - Preventing dangerous coded AI communication

0:40:42 - Unpredictably uncontrollable AI

0:58:01 - What control looks like

1:08:45 - Is AI control evil?

1:24:42 - Can red teams match misaligned AI?

1:36:51 - How expensive is AI monitoring?

1:52:32 - AI control experiments

2:03:50 - GPT-4's aptitude at inserting backdoors

2:14:50 - How AI control relates to the AI safety field

2:39:25 - How AI control relates to previous Redwood Research work

2:49:16 - How people can work on AI control

2:54:07 - Following Buck and Ryan's research

The transcript: axrp.net/episode/2024/04/11/episode-27-ai-control-buck-shlegeris-ryan-greenblatt.html

Links for Buck and Ryan:

- Buck's twitter/X account: twitter.com/bshlgrs

- Ryan on LessWrong: lesswrong.com/users/ryan_greenblatt

- You can contact both Buck and Ryan by electronic mail at [firstname] [at-sign] rdwrs.com

Main research works we talk about:

- The case for ensuring that powerful AIs are controlled: lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled

- AI Control: Improving Safety Despite Intentional Subversion: arxiv.org/abs/2312.06942

Other things we mention:

- The prototypical catastrophic AI action is getting root access to its datacenter (aka "Hacking the SSH server"): lesswrong.com/posts/BAzCGCys4BkzGDCWR/the-prototypical-catastrophic-ai-action-is-getting-root

- Preventing language models from hiding their reasoning: arxiv.org/abs/2310.18512

- Improving the Welfare of AIs: A Nearcasted Proposal: lesswrong.com/posts/F6HSHzKezkh6aoTr2/improving-the-welfare-of-ais-a-nearcasted-proposal

- Measuring coding challenge competence with APPS: arxiv.org/abs/2105.09938

- Causal Scrubbing: a method for rigorously testing interpretability hypotheses: lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing

Episode art by Hamish Doodles: hamishdoodles.com


