EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Update: 2025-09-30

Description

🤗 Upvotes: 98 | cs.LG, cs.CL



Authors:

Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris Metaxas



Title:

EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning



arXiv:

http://arxiv.org/abs/2509.22576v1



Abstract:

Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
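The abstract outlines three mechanisms: an entropy bonus for multi-turn exploration, an entropy smoothing regularizer that keeps policy entropy near its historical average, and adaptive phase-based weighting that shifts from exploration toward exploitation over training. The sketch below illustrates how such terms could be combined in a token-level policy-gradient loss; the hyperparameters (window, w_explore, w_exploit, the 0.1 penalty coefficient) and the linear schedule are assumptions for illustration, not the paper's actual formulation.

import torch
from collections import deque

class EntropySmoothingRegularizer:
    """Penalize deviation of current policy entropy from its historical
    running average, discouraging abrupt entropy spikes or crashes.
    (Illustrative sketch; window size is an assumed hyperparameter.)"""
    def __init__(self, window: int = 50):
        self.history = deque(maxlen=window)

    def __call__(self, entropy: torch.Tensor) -> torch.Tensor:
        if not self.history:
            self.history.append(entropy.item())
            return torch.zeros_like(entropy)
        hist_mean = sum(self.history) / len(self.history)
        penalty = (entropy - hist_mean) ** 2  # quadratic pull toward the historical average
        self.history.append(entropy.item())
        return penalty

def phase_weight(step: int, total_steps: int,
                 w_explore: float = 0.01, w_exploit: float = 0.001) -> float:
    """Phase-based weighting: favor exploration early in training, then
    anneal toward exploitation (a simple linear schedule for illustration)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return (1.0 - frac) * w_explore + frac * w_exploit

def entropy_regularized_loss(logits, actions, advantages, step, total_steps, smoother):
    """Policy-gradient loss with a phase-weighted entropy bonus and an
    entropy-smoothing penalty."""
    dist = torch.distributions.Categorical(logits=logits)
    logp = dist.log_prob(actions)
    entropy = dist.entropy().mean()

    pg_loss = -(logp * advantages).mean()      # standard policy-gradient term
    ent_bonus = phase_weight(step, total_steps) * entropy
    smooth_pen = 0.1 * smoother(entropy)       # keep entropy near its recent history

    return pg_loss - ent_bonus + smooth_pen

The intent mirrored here is that the entropy term is not a fixed bonus: its weight is annealed across training phases, and a quadratic penalty discourages the entropy from drifting far from its own recent history.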
