“Current safety training techniques do not fully transfer to the agent setting” by Simon Lermen, Govind Pimpale

Update: 2024-11-09

Description

TL;DR: I'm presenting three recent papers that share a similar finding: the safety training techniques applied to chat models don’t transfer well to the agents built from them. In other words, models won’t tell you how to do something harmful, but they are often willing to directly execute harmful actions. However, all three papers find that attack methods such as jailbreaks, prompt engineering, and refusal-vector ablation do transfer.

Here are the three papers:

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

What are language model agents

Language model agents are a combination of a language model and scaffolding software. Regular language models are typically limited to being chatbots, i.e. they receive messages and reply to them. However, scaffolding gives these models access to tools which they can [...]
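As a rough illustration of the model-plus-scaffolding idea described above, here is a minimal sketch of an agent loop. Everything in it is an assumption made for illustration: the tool names (`search_web`, `send_email`), the `CALL`/`FINAL` text protocol, and the scripted `stub_model` that stands in for a real language model. It is not taken from the post or from any of the three papers.

```python
from typing import Callable, Dict, List

# Hypothetical tools the scaffolding exposes; names and behavior are illustrative.
def search_web(query: str) -> str:
    return f"results for {query!r}"

def send_email(body: str) -> str:
    return "email sent"

TOOLS: Dict[str, Callable[[str], str]] = {
    "search_web": search_web,
    "send_email": send_email,
}

def stub_model(history: List[str]) -> str:
    """Stand-in for a language model: scripted replies, no real LLM call."""
    if not any(line.startswith("TOOL_RESULT") for line in history):
        return "CALL search_web: train times to Berlin"
    return "FINAL: done"

def run_agent(task: str,
              model: Callable[[List[str]], str] = stub_model,
              max_steps: int = 5) -> List[str]:
    """Scaffolding loop: ask the model, execute any tool call, feed the result back."""
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        reply = model(history)
        history.append(reply)
        if reply.startswith("FINAL:"):
            break  # the model signals it is finished
        if reply.startswith("CALL "):
            # Assumed format: "CALL <tool_name>: <argument>"
            name, _, arg = reply[len("CALL "):].partition(": ")
            tool = TOOLS.get(name)
            result = tool(arg) if tool else f"unknown tool {name!r}"
            history.append(f"TOOL_RESULT {name}: {result}")
    return history
```

The point of the sketch is that a chatbot only returns text, while the scaffolding turns the model's text into executed actions. That is the setting in which the papers above evaluate whether refusal behavior still holds.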

---

Outline:

(00:55) What are language model agents

(01:36) Overview

(03:31) AgentHarm Benchmark

(05:27) Refusal-Trained LLMs Are Easily Jailbroken as Browser Agents

(06:47) Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

(08:23) Discussion

---

First published:
November 3rd, 2024

Source:
https://www.lesswrong.com/posts/ZoFxTqWRBkyanonyb/current-safety-training-techniques-do-not-fully-transfer-to

---

Narrated by TYPE III AUDIO.

---



