21 - Interpretability for Engineers with Stephen Casper

Update: 2023-05-02

Description

Lots of people in the field of machine learning study 'interpretability', developing tools that they say give us useful information about neural networks. But how do we know if meaningful progress is actually being made? What should we want out of these tools? In this episode, I speak to Stephen Casper about these questions, as well as about a benchmark he's co-developed to evaluate whether interpretability tools can find 'Trojan horses' hidden inside neural nets.

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

Topics we discuss, and timestamps:

- 00:00:42 - Interpretability for engineers

  - 00:00:42 - Why interpretability?

  - 00:12:55 - Adversaries and interpretability

  - 00:24:30 - Scaling interpretability

  - 00:42:29 - Critiques of the AI safety interpretability community

  - 00:56:10 - Deceptive alignment and interpretability

- 01:09:48 - Benchmarking Interpretability Tools (for Deep Neural Networks) (Using Trojan Discovery)

  - 01:10:40 - Why Trojans?

  - 01:14:53 - Which interpretability tools?

  - 01:28:40 - Trojan generation

  - 01:38:13 - Evaluation

- 01:46:07 - Interpretability for shaping policy

- 01:53:55 - Following Casper's work

The transcript: axrp.net/episode/2023/05/02/episode-21-interpretability-for-engineers-stephen-casper.html

Links for Casper:

- Personal website: stephencasper.com/

- Twitter: twitter.com/StephenLCasper

- Electronic mail: scasper [at] mit [dot] edu

Research we discuss:

- The Engineer's Interpretability Sequence: alignmentforum.org/s/a6ne2ve5uturEEQK7

- Benchmarking Interpretability Tools for Deep Neural Networks: arxiv.org/abs/2302.10894

- Adversarial Policies beat Superhuman Go AIs: goattack.far.ai/

- Adversarial Examples Are Not Bugs, They Are Features: arxiv.org/abs/1905.02175

- Planting Undetectable Backdoors in Machine Learning Models: arxiv.org/abs/2204.06974

- Softmax Linear Units: transformer-circuits.pub/2022/solu/index.html

- Red-Teaming the Stable Diffusion Safety Filter: arxiv.org/abs/2210.04610

Episode art by Hamish Doodles: hamishdoodles.com
