“Catastrophic sabotage as a major threat model for human-level AI systems” by evhub
Update: 2024-11-15
Description
Thanks to Holden Karnofsky, David Duvenaud, and Kate Woolverton for useful discussions and feedback.
Following up on our recent “Sabotage Evaluations for Frontier Models” paper, I wanted to share more of my personal thoughts on why I think catastrophic sabotage is important and why I care about it as a threat model. Note that this isn't intended to reflect Anthropic's views, or anyone's views but my own; it's just a collection of some of my personal thoughts.
First, some high-level thoughts on what I want to talk about here:
- I want to focus on a level of future capabilities substantially beyond current models, but below superintelligence: specifically something approximately human-level and substantially transformative, but not yet superintelligent.
- While I don’t think that most of the proximate cause of AI existential risk comes from such models—I think most of the direct takeover [...]
Outline:
(02:31) Why is catastrophic sabotage a big deal?
(02:45) Scenario 1: Sabotage alignment research
(05:01) Necessary capabilities
(06:37) Scenario 2: Sabotage a critical actor
(09:12) Necessary capabilities
(10:51) How do you evaluate a model's capability to do catastrophic sabotage?
(21:46) What can you do to mitigate the risk of catastrophic sabotage?
(23:12) Internal usage restrictions
(25:33) Affirmative safety cases
---
First published:
October 22nd, 2024
Source:
https://www.lesswrong.com/posts/Loxiuqdj6u8muCe54/catastrophic-sabotage-as-a-major-threat-model-for-human
---
Narrated by TYPE III AUDIO.