Signal and Noise: Evaluating Language Model Benchmarks

Update: 2025-08-23

Description

This paper introduces a framework for **evaluating language model benchmarks** by quantifying their **signal** and **noise**. Signal measures a benchmark's ability to separate better models from worse ones; noise reflects its susceptibility to random fluctuations during training. The authors show that benchmarks with a **higher signal-to-noise ratio (SNR)** make small-scale experiments more reliable predictors of large-model performance, and that lower noise reduces scaling-law prediction error. They propose three **interventions** to improve SNR: **filtering noisy subtasks**, **averaging scores across model checkpoints** to reduce variability, and using **bits-per-byte (BPB)** as a more stable evaluation metric. The work argues that SNR, rather than benchmark size alone, should guide how benchmarks are designed and selected for language model development.
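
The central quantity is a benchmark's signal-to-noise ratio. As a minimal sketch (not the authors' exact estimator), the snippet below treats signal as the spread of final scores across a set of models and noise as the score variability across one model's late training checkpoints; the function names and the smoothing window `k` are illustrative assumptions.

```python
import numpy as np

def signal_to_noise(final_model_scores, checkpoint_scores):
    """Rough benchmark SNR estimate (illustrative, not the paper's exact estimator).

    final_model_scores: benchmark scores of several different fully trained models.
    checkpoint_scores:  scores of one model's last few training checkpoints.
    """
    # Signal: how far apart the benchmark places different models,
    # taken here as the spread of their final scores.
    signal = float(np.max(final_model_scores) - np.min(final_model_scores))
    # Noise: step-to-step wobble during training, taken as the standard
    # deviation of scores across adjacent checkpoints of a single run.
    noise = float(np.std(checkpoint_scores))
    return signal / noise

def smoothed_score(checkpoint_scores, k=5):
    # One of the paper's interventions: averaging the last k checkpoint
    # scores reduces the noise term without changing the signal.
    return float(np.mean(checkpoint_scores[-k:]))
```

A benchmark with a larger SNR under this kind of estimate is the one where small-scale or early-training comparisons are more likely to rank models the same way a full-scale evaluation would.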

Enoch H. Kang