Representation-Based Exploration for Language Models: From Test-Time to Post-Training

Update: 2025-10-18
Description

This paper investigates whether deliberate exploration enhances the reasoning capabilities of large language models (LLMs) trained with reinforcement learning (RL). The authors propose and evaluate a representation-based exploration (RepExp) strategy, which uses a bonus derived from the LLM's hidden states to encourage the discovery of diverse and novel behaviors. The study employs a two-pronged evaluation methodology: it first tests RepExp in an inference-time setting for selecting diverse responses, and then integrates it into the RL post-training pipeline. Key findings indicate that this exploration method significantly improves verifier efficiency and mitigates the "diversity collapse" observed in standard RL methods, suggesting that the approach goes beyond merely sharpening existing model capabilities. The results show that RepExp yields substantial improvements in pass@k rates and is especially beneficial for stronger models and harder problems on reasoning benchmarks such as MATH and GSM8K.
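
To make the inference-time use case concrete, the sketch below shows one way a hidden-state-derived exploration bonus can drive diverse response selection. It is an illustrative approximation, not the paper's exact formulation: the mean-pooled final-layer representation, the elliptical (inverse-covariance) bonus, the `select_diverse` helper, and the placeholder model name are all assumptions made for this example.

```python
# Minimal sketch: representation-based selection of diverse responses.
# Assumptions (not from the paper): mean-pooled last-layer hidden states as the
# representation, and an elliptical novelty bonus sqrt(phi^T Sigma^{-1} phi).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM exposing hidden states works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def response_representation(text: str) -> torch.Tensor:
    """Mean-pool the final-layer hidden states of a candidate response."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    last_hidden = out.hidden_states[-1]        # shape (1, seq_len, d)
    return last_hidden.mean(dim=1).squeeze(0)  # shape (d,)


def select_diverse(candidates: list[str], k: int, ridge: float = 1.0) -> list[str]:
    """Greedily pick k responses with the largest representation-space novelty.

    bonus(phi) = sqrt(phi^T Sigma^{-1} phi), where Sigma accumulates the outer
    products of representations already selected, so candidates far from the
    covered subspace receive a larger bonus.
    """
    reps = [response_representation(c) for c in candidates]
    d = reps[0].shape[0]
    sigma = ridge * torch.eye(d)
    chosen: list[int] = []
    for _ in range(min(k, len(candidates))):
        sigma_inv = torch.linalg.inv(sigma)
        bonuses = [
            float("-inf") if i in chosen
            else torch.sqrt(phi @ sigma_inv @ phi).item()
            for i, phi in enumerate(reps)
        ]
        best = max(range(len(candidates)), key=lambda i: bonuses[i])
        chosen.append(best)
        phi = reps[best]
        sigma = sigma + torch.outer(phi, phi)  # expand covered representation space
    return [candidates[i] for i in chosen]
```

In a post-training setting, the same kind of bonus could be added to the verifier reward during RL updates, but the specific weighting and integration details should be taken from the paper itself.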

Enoch H. Kang