EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Update: 2025-09-30

Description

🤗 Upvotes: 98 | cs.LG, cs.CL



Authors:

Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris Metaxas



Title:

EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning



arXiv:

http://arxiv.org/abs/2509.22576v1



Abstract:

Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
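The abstract outlines three mechanisms: an entropy bonus for multi-turn exploration, an entropy smoothing regularizer that keeps policy entropy near its historical average, and adaptive phase-based weighting that shifts from exploration toward exploitation over training. The sketch below illustrates how such terms could be combined in a token-level policy-gradient loss; the hyperparameters (window, w_explore, w_exploit, the 0.1 penalty coefficient) and the linear schedule are assumptions for illustration, not the paper's actual formulation.

import torch
from collections import deque

class EntropySmoothingRegularizer:
    """Penalize deviation of current policy entropy from its historical
    running average, discouraging abrupt entropy spikes or crashes.
    (Illustrative sketch; window size is an assumed hyperparameter.)"""
    def __init__(self, window: int = 50):
        self.history = deque(maxlen=window)

    def __call__(self, entropy: torch.Tensor) -> torch.Tensor:
        if not self.history:
            self.history.append(entropy.item())
            return torch.zeros_like(entropy)
        hist_mean = sum(self.history) / len(self.history)
        penalty = (entropy - hist_mean) ** 2  # quadratic pull toward the historical average
        self.history.append(entropy.item())
        return penalty

def phase_weight(step: int, total_steps: int,
                 w_explore: float = 0.01, w_exploit: float = 0.001) -> float:
    """Phase-based weighting: favor exploration early in training, then
    anneal toward exploitation (a simple linear schedule for illustration)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return (1.0 - frac) * w_explore + frac * w_exploit

def entropy_regularized_loss(logits, actions, advantages, step, total_steps, smoother):
    """Policy-gradient loss with a phase-weighted entropy bonus and an
    entropy-smoothing penalty."""
    dist = torch.distributions.Categorical(logits=logits)
    logp = dist.log_prob(actions)
    entropy = dist.entropy().mean()

    pg_loss = -(logp * advantages).mean()      # standard policy-gradient term
    ent_bonus = phase_weight(step, total_steps) * entropy
    smooth_pen = 0.1 * smoother(entropy)       # keep entropy near its recent history

    return pg_loss - ent_bonus + smooth_pen

The intent mirrored here is that the entropy term is not a fixed bonus: its weight is annealed across training phases, and a quadratic penalty discourages the entropy from drifting far from its own recent history.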
