Daily Paper Cast

Quantile Advantage Estimation for Entropy-Safe Reasoning

Update: 2025-09-30

Description

🤗 Upvotes: 102 | cs.LG, cs.AI



Authors:

Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He



Title:

Quantile Advantage Estimation for Entropy-Safe Reasoning



arXiv:

http://arxiv.org/abs/2509.22611v1



Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
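To make the baseline change concrete, below is a minimal NumPy sketch of how a group-wise K-quantile baseline could replace the mean baseline of GRPO/DAPO when computing response-level advantages. This is an illustration under stated assumptions, not the authors' implementation: the function name, the binary 0/1 rewards, the group size, and the choice K = 0.8 are all hypothetical.

```python
# Minimal sketch (not the authors' code): group-wise K-quantile baseline
# for response-level advantage estimation, as described in the abstract.
import numpy as np

def quantile_advantages(rewards, k=0.8):
    """Advantages for one query's group of sampled responses.

    rewards : 1-D array of verifiable rewards (e.g., 0/1 pass signals)
              for the G responses sampled for a single query.
    k       : quantile level of the baseline; replaces the group mean
              used by GRPO/DAPO-style value-free RL.
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = np.quantile(rewards, k)   # group-wise K-quantile baseline
    return rewards - baseline            # advantage relative to that baseline

# Toy illustration of the two-regime gate with binary rewards and K = 0.8:
# hard query (success rate p <= 1 - K): the rare success gets a positive
# advantage, while the many failures sit at the baseline (zero advantage).
hard = quantile_advantages([0, 0, 0, 0, 0, 0, 0, 1], k=0.8)
print(hard)  # [0. 0. 0. 0. 0. 0. 0. 1.]

# easy query (p > 1 - K): the remaining failure gets a negative advantage,
# while the frequent successes sit at the baseline.
easy = quantile_advantages([1, 1, 1, 1, 1, 1, 0, 1], k=0.8)
print(easy)  # [ 0.  0.  0.  0.  0.  0. -1.  0.]
```

In both toy groups, responses whose reward equals the quantile receive exactly zero advantage, which is consistent with the abstract's observation that, with a tuned K, roughly 80% of responses receive zero advantage and credit assignment becomes sparse.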

Jingwen Liang, Gengyu Wang