Toy Models of Superposition

Update: 2024-06-17

Description

It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?

In this paper, we use toy models (small ReLU networks trained on synthetic data with sparse input features) to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition. When features are sparse, superposition allows compression beyond what a linear model would do, at the cost of "interference" that requires nonlinear filtering.
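To make the trade-off concrete, here is a minimal sketch in Python. It is not taken from the episode: the weight matrix `W`, the helper `reconstruct`, and the choice to reconstruct inputs as ReLU(WᵀWx) with no bias are all assumptions made for illustration, and the weights are hand-picked rather than trained. Two features share a single hidden dimension by pointing in opposite directions.

```python
import numpy as np

# Hypothetical hand-picked weights (not trained): two features are packed
# into one hidden dimension, pointing in opposite directions.
W = np.array([[1.0, -1.0]])  # shape: (hidden_dims=1, features=2)


def reconstruct(x, nonlinear=True):
    """Compress x into the hidden dimension, then map it back to feature space."""
    h = W @ x        # 2 features -> 1 hidden dimension
    out = W.T @ h    # 1 hidden dimension -> 2 features
    # The ReLU clips the negative "echo" that the other feature would receive.
    return np.maximum(out, 0.0) if nonlinear else out


only_first = np.array([1.0, 0.0])   # sparse input: one feature active
only_second = np.array([0.0, 1.0])  # sparse input: the other feature active
both = np.array([1.0, 1.0])         # dense input: both features active

relu_first = reconstruct(only_first)                      # [1, 0]: recovered exactly
relu_second = reconstruct(only_second)                    # [0, 1]: recovered exactly
linear_first = reconstruct(only_first, nonlinear=False)   # [1, -1]: interference leaks through
relu_both = reconstruct(both)                             # [0, 0]: the features cancel out
```

When only one feature is active, the ReLU filters out the interference and the input is recovered exactly, which a purely linear map through one dimension cannot do. When both features are active at once, they cancel in the hidden dimension and the reconstruction fails, so this packing only pays off if such co-occurrence is rare, that is, if the features are sparse.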

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.

Learn more on the AI Safety Fundamentals website.


