Daily Paper Cast
Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR

Update: 2025-10-01
Description

🤗 Upvotes: 39 | cs.LG, cs.CL



Authors:

Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang



Title:

Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR



Arxiv:

http://arxiv.org/abs/2509.23808v1



Abstract:

A prevailing view in Reinforcement Learning with Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that this perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, named Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that at the hidden-state level, exploration and exploitation could be decoupled (Sec. 4). This finding suggests an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.
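
As a rough illustration of the abstract's core quantities, below is a minimal Python sketch of how Effective Rank could be computed from a hidden-state matrix, with discrete first- and second-order differences standing in for ERV and ERA. The function names, the SVD-entropy definition of effective rank, and the finite-difference approximation are assumptions made for illustration, not the paper's actual implementation (the advantage-shaping step of VERL is omitted here).

import numpy as np

def effective_rank(H):
    # Effective rank of a hidden-state matrix H (tokens x dim):
    # exponential of the Shannon entropy of its normalized singular values.
    s = np.linalg.svd(H, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def er_dynamics(hidden_states_per_step):
    # ER at each training step, with discrete first- and second-order
    # differences as stand-ins for ERV (velocity) and ERA (acceleration).
    er = np.array([effective_rank(H) for H in hidden_states_per_step])
    erv = np.diff(er)    # step-to-step change in ER
    era = np.diff(erv)   # step-to-step change in ERV
    return er, erv, era

# Toy usage: three checkpoints of random 128-token, 64-dim hidden states.
rng = np.random.default_rng(0)
states = [rng.normal(size=(128, 64)) for _ in range(3)]
er, erv, era = er_dynamics(states)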

Jingwen Liang, Gengyu Wang