Daily Paper Cast

Quantile Advantage Estimation for Entropy-Safe Reasoning

Update: 2025-09-30

Description

🤗 Upvotes: 102 | cs.LG, cs.AI



Authors:

Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He



Title:

Quantile Advantage Estimation for Entropy-Safe Reasoning



arXiv:

http://arxiv.org/abs/2509.22611v1



Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
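To make the baseline change concrete, below is a minimal NumPy sketch of how a group-wise K-quantile baseline could replace the mean baseline of GRPO/DAPO when computing response-level advantages. This is an illustration under stated assumptions, not the authors' implementation: the function name, the binary 0/1 rewards, the group size, and the choice K = 0.8 are all hypothetical.

```python
# Minimal sketch (not the authors' code): group-wise K-quantile baseline
# for response-level advantage estimation, as described in the abstract.
import numpy as np

def quantile_advantages(rewards, k=0.8):
    """Advantages for one query's group of sampled responses.

    rewards : 1-D array of verifiable rewards (e.g., 0/1 pass signals)
              for the G responses sampled for a single query.
    k       : quantile level of the baseline; replaces the group mean
              used by GRPO/DAPO-style value-free RL.
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = np.quantile(rewards, k)   # group-wise K-quantile baseline
    return rewards - baseline            # advantage relative to that baseline

# Toy illustration of the two-regime gate with binary rewards and K = 0.8:
# hard query (success rate p <= 1 - K): the rare success gets a positive
# advantage, while the many failures sit at the baseline (zero advantage).
hard = quantile_advantages([0, 0, 0, 0, 0, 0, 0, 1], k=0.8)
print(hard)  # [0. 0. 0. 0. 0. 0. 0. 1.]

# easy query (p > 1 - K): the remaining failure gets a negative advantage,
# while the frequent successes sit at the baseline.
easy = quantile_advantages([1, 1, 1, 1, 1, 1, 0, 1], k=0.8)
print(easy)  # [ 0.  0.  0.  0.  0.  0. -1.  0.]
```

In both toy groups, responses whose reward equals the quantile receive exactly zero advantage, which is consistent with the abstract's observation that, with a tuned K, roughly 80% of responses receive zero advantage and credit assignment becomes sparse.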

Jingwen Liang, Gengyu Wang