LLM Interpretability and Sparse Autoencoders: Research from OpenAI and Anthropic

Update: 2024-06-14

Description

It’s been an exciting couple of weeks for GenAI! Join us as we discuss the latest research from OpenAI and Anthropic. We’re excited to chat about this significant step forward in understanding how LLMs work and its implications for a deeper understanding of the neural activity of language models. We take a closer look at two recent papers, one each from OpenAI and Anthropic, that both focus on the sparse autoencoder: an unsupervised approach for extracting interpretable features from an LLM. In "Extracting Concepts from GPT-4," OpenAI researchers propose using k-sparse autoencoders to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. In "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," researchers at Anthropic show that scaling laws can be used to guide the training of sparse autoencoders, among other findings.
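For readers unfamiliar with the k-sparse idea discussed in the episode, here is a minimal, illustrative sketch of a k-sparse autoencoder forward pass. It is not the exact architecture or code from either paper; the dimensions, initialization, and value of k are hypothetical, and the point is only to show how a TopK activation enforces sparsity directly rather than through an L1 penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, k = 16, 64, 4  # activation dim, dictionary size, active latents (all hypothetical)

# Randomly initialized encoder/decoder weights for illustration only.
W_enc = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
b_enc = np.zeros(d_hidden)

def ksparse_forward(x):
    """Encode, keep only the top-k latents, then reconstruct."""
    pre = x @ W_enc + b_enc
    # TopK activation: zero out everything except the k largest
    # pre-activations, fixing the sparsity level by construction.
    drop_idx = np.argsort(pre)[:-k]
    latents = np.maximum(pre, 0.0)  # ReLU on the survivors
    latents[drop_idx] = 0.0
    recon = latents @ W_dec
    return latents, recon

x = rng.normal(size=d_model)
latents, recon = ksparse_forward(x)
```

In a real setup, `x` would be a residual-stream activation from the model, and the reconstruction error `x - recon` would be the training loss; because at most k latents fire, there is no separate sparsity coefficient to tune.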

To learn more about ML observability, join the Arize AI Slack community or get the latest on our LinkedIn and Twitter.




Arize AI