“Catastrophic sabotage as a major threat model for human-level AI systems” by evhub

Updated: November 15th, 2024

Description

Thanks to Holden Karnofsky, David Duvenaud, and Kate Woolverton for useful discussions and feedback.

Following up on our recent “Sabotage Evaluations for Frontier Models” paper, I wanted to share more of my personal thoughts on why I think catastrophic sabotage is important and why I care about it as a threat model. Note that this isn’t intended to reflect Anthropic’s views, or for that matter anyone’s views but my own; it’s just a collection of some of my personal thoughts.

First, some high-level thoughts on what I want to talk about here:

  • I want to focus on a level of future capabilities substantially beyond current models, but below superintelligence: specifically something approximately human-level and substantially transformative, but not yet superintelligent.
    • While I don’t think that most of the proximate cause of AI existential risk comes from such models—I think most of the direct takeover [...]
---

Outline:

(02:31) Why is catastrophic sabotage a big deal?

(02:45) Scenario 1: Sabotage alignment research

(05:01) Necessary capabilities

(06:37) Scenario 2: Sabotage a critical actor

(09:12) Necessary capabilities

(10:51) How do you evaluate a model’s capability to do catastrophic sabotage?

(21:46) What can you do to mitigate the risk of catastrophic sabotage?

(23:12) Internal usage restrictions

(25:33) Affirmative safety cases

---

First published:
October 22nd, 2024

Source:
https://www.lesswrong.com/posts/Loxiuqdj6u8muCe54/catastrophic-sabotage-as-a-major-threat-model-for-human

---

Narrated by TYPE III AUDIO.
