Best AI papers explained
Sample Efficient Preference Alignment in LLMs via Active Exploration

Update: 2025-09-06

Description

This research introduces an active exploration algorithm that improves the sample efficiency of preference alignment in large language models (LLMs) by strategically selecting which human feedback to collect. The authors frame the problem as an active contextual dueling bandit, in which the system actively chooses which "contexts" (prompts) and "actions" (LLM responses) to present to human evaluators. Their proposed method, AE-Borda, uses uncertainty estimation and a generalized Borda function to identify the most informative data points for training, leading to faster learning and lower data collection costs. The paper validates its theoretical guarantees with synthetic experiments and demonstrates practical improvements in LLM performance across several datasets, including two newly introduced ones: Jeopardy! for factual correctness and Haikus for creative writing.
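
To make the selection idea concrete, here is a minimal illustrative sketch in Python. It is not the paper's exact AE-Borda procedure: it assumes an ensemble of preference models as a stand-in for the paper's uncertainty estimates, computes a generalized Borda score for each action as its probability of beating a uniformly random opponent, and queries the prompt and duel where the ensemble disagrees most. All variable names and the specific selection heuristics are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_models, n_contexts, n_actions = 5, 8, 6

# Hypothetical stand-in for uncertainty estimation: an ensemble of preference
# models. pref[m, x, a, b] is model m's estimate of P(response a beats b | prompt x).
logits = rng.normal(size=(n_models, n_contexts, n_actions))
pref = 1.0 / (1.0 + np.exp(-(logits[:, :, :, None] - logits[:, :, None, :])))

# Generalized Borda score: probability an action beats a uniformly random opponent.
borda = pref.mean(axis=3)                    # (models, contexts, actions)
borda_mean = borda.mean(axis=0)              # point estimate, (contexts, actions)
borda_std = borda.std(axis=0)                # ensemble disagreement, (contexts, actions)

# Active exploration: pick the prompt whose Borda scores are most uncertain,
# then duel an optimistic candidate against its most contested opponent.
context = int(np.argmax(borda_std.max(axis=1)))
a1 = int(np.argmax(borda_mean[context] + borda_std[context]))   # optimistic action
duel_std = pref[:, context, a1, :].std(axis=0)                  # disagreement per duel
duel_std[a1] = -np.inf                                          # no self-duels
a2 = int(np.argmax(duel_std))

print(f"query human feedback on prompt {context}: response {a1} vs response {a2}")

In a full pipeline, the selected duel would be shown to a human evaluator, the preference models retrained on the recorded judgment, and the loop repeated, which is the sense in which active selection reduces the amount of feedback needed.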

Enoch H. Kang