Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Update: 2023-11-20

Description

In this paper reading, we discuss “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning,” a paper from Anthropic that addresses the challenge of understanding the inner workings of neural networks, drawing parallels with the complexity of human brain function. It explores the concept of “features” (patterns of neuron activations), which provide a more interpretable way to dissect neural networks. By decomposing a layer of neurons into thousands of features, this approach uncovers hidden model properties that are not evident when examining individual neurons. These features are shown to be more interpretable and consistent than individual neurons, offering the potential to steer model behavior and improve AI safety.
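
For listeners who want a concrete picture of what "decomposing a layer of neurons into features" can look like, here is a minimal sketch of a sparse autoencoder forward pass, a common way to do dictionary learning on activations. Everything in it is an illustrative assumption and not the paper's code: the function names (`sae_forward`, `sae_loss`), the toy sizes (4 neurons, 16 features), the random untrained weights, and the `l1_coeff` value are all hypothetical.

```python
import random

def relu(v):
    # Keep only positive activations so most features stay at zero.
    return [max(0.0, a) for a in v]

def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    # Encode neuron activations into a wider, non-negative feature vector,
    # then decode the features back into an approximation of the activations.
    centered = [a - b for a, b in zip(x, b_dec)]
    pre_act = [h + b for h, b in zip(matvec(w_enc, centered), b_enc)]
    features = relu(pre_act)
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff=0.01):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    mse = sum((a - r) ** 2 for a, r in zip(x, recon)) / len(x)
    l1 = sum(features)  # features are non-negative, so the sum is the L1 norm
    return mse + l1_coeff * l1

# Toy setup (assumed sizes): 4 neurons expanded into 16 candidate features.
random.seed(0)
n_neurons, n_features = 4, 16
w_enc = [[random.gauss(0, 0.5) for _ in range(n_neurons)] for _ in range(n_features)]
w_dec = [[random.gauss(0, 0.5) for _ in range(n_features)] for _ in range(n_neurons)]
b_enc = [0.0] * n_features
b_dec = [0.0] * n_neurons

x = [0.3, -1.2, 0.8, 0.1]  # made-up activations for one token
features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, recon)
```

In a real setup the weights would be trained to minimize this loss over many activations; the sketch only shows the shape of the computation, with more features than neurons and a sparsity penalty pushing each input to use only a few of them.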

Find the transcript and more here: https://arize.com/blog/decomposing-language-models-with-dictionary-learning-paper-reading/

To learn more about ML observability, join the Arize AI Slack community or get the latest on our LinkedIn and Twitter.


