ExGRPO: Learning to Reason from Experience

Update: 2025-10-04

Description

🤗 Upvotes: 50 | cs.LG, cs.AI, cs.CL

Authors:

Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng

Title:

ExGRPO: Learning to Reason from Experience

arXiv:

http://arxiv.org/abs/2510.02245v1

Abstract:

Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
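The two ingredients named in the abstract, group-relative advantages and an experience store ranked by correctness and entropy, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Rollout` and `ExperienceBuffer` names are hypothetical, the advantage formula is the usual GRPO within-group standardization, and the rule of keeping only correct rollouts and replaying the lowest-entropy one is an assumption, since the abstract says only that correctness and entropy indicate experience value.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Rollout:
    question_id: str
    reward: float   # verifiable reward: 1.0 if the answer checks out, else 0.0
    entropy: float  # mean token entropy of the trajectory (assumed precomputed)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one question's rollout group (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


@dataclass
class ExperienceBuffer:
    # Hypothetical store: question_id -> correct rollouts kept for reuse.
    store: dict = field(default_factory=dict)

    def add_group(self, group):
        # Assumption: only correct rollouts are worth keeping.
        for rollout in group:
            if rollout.reward > 0:
                self.store.setdefault(rollout.question_id, []).append(rollout)

    def sample(self, question_id):
        # Assumption: among stored rollouts, prefer the lowest-entropy one.
        candidates = self.store.get(question_id, [])
        if not candidates:
            return None
        return min(candidates, key=lambda r: r.entropy)


group = [
    Rollout("q1", 1.0, 0.9),
    Rollout("q1", 1.0, 0.3),
    Rollout("q1", 0.0, 0.1),
    Rollout("q1", 0.0, 0.7),
]
advantages = group_relative_advantages([r.reward for r in group])

buffer = ExperienceBuffer()
buffer.add_group(group)
replayed = buffer.sample("q1")
```

In a mixed-policy setup, a training batch would combine fresh on-policy groups with rollouts drawn from such a buffer; the mixing ratio and any off-policy correction are left out here because the abstract does not state them.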

Jingwen Liang, Gengyu Wang
