“Catastrophic sabotage as a major threat model for human-level AI systems” by evhub

Updated: November 15th, 2024

Description

Thanks to Holden Karnofsky, David Duvenaud, and Kate Woolverton for useful discussions and feedback.

Following up on our recent “Sabotage Evaluations for Frontier Models” paper, I wanted to share more of my personal thoughts on why I think catastrophic sabotage is important and why I care about it as a threat model. Note that this isn’t intended to reflect Anthropic’s views, or for that matter anyone’s views but my own; it’s just a collection of some of my personal thoughts.

First, some high-level thoughts on what I want to talk about here:

  • I want to focus on a level of future capabilities substantially beyond current models, but below superintelligence: specifically something approximately human-level and substantially transformative, but not yet superintelligent.
    • While I don’t think that most of the proximate cause of AI existential risk comes from such models—I think most of the direct takeover [...]
---

Outline:

(02:31) Why is catastrophic sabotage a big deal?

(02:45) Scenario 1: Sabotage alignment research

(05:01) Necessary capabilities

(06:37) Scenario 2: Sabotage a critical actor

(09:12) Necessary capabilities

(10:51) How do you evaluate a model’s capability to do catastrophic sabotage?

(21:46) What can you do to mitigate the risk of catastrophic sabotage?

(23:12) Internal usage restrictions

(25:33) Affirmative safety cases

---

First published:
October 22nd, 2024

Source:
https://www.lesswrong.com/posts/Loxiuqdj6u8muCe54/catastrophic-sabotage-as-a-major-threat-model-for-human

---

Narrated by TYPE III AUDIO.
