AXRP - the AI X-risk Research Podcast

38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future

Update: 2025-03-01
Description

In this episode, I chat with David Duvenaud about two topics he's been thinking about: firstly, a paper he wrote about evaluating whether or not frontier models can sabotage human decision-making or human monitoring of those same models; and secondly, the difficult situation humans would face in a post-AGI future, even if AI is aligned with human intentions.

 

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/03/01/episode-38_8-david-duvenaud-sabotage-evaluations-post-agi-future.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: @FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

 

Topics we discuss, and timestamps:

01:42 - The difficulty of sabotage evaluations

05:23 - Types of sabotage evaluation

08:45 - The state of sabotage evaluations

12:26 - What happens after AGI?

 

Links:

Sabotage Evaluations for Frontier Models: https://arxiv.org/abs/2410.21514

Gradual Disempowerment: https://gradual-disempowerment.ai/

 

Episode art by Hamish Doodles: hamishdoodles.com
