PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning

Update: 2025-09-03

Description

🤗 Upvotes: 21 | cs.LG, cs.AI

Authors:

Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, Hao Wang

Title:

PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning

Arxiv:

http://arxiv.org/abs/2508.21104v1

Abstract:

Critic-free reinforcement learning methods, particularly group policies, have attracted considerable attention for their efficiency in complex tasks. However, these methods rely heavily on multiple sampling and comparisons within the policy to estimate advantage, which may cause the policy to fall into a local optimum and increase computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we roll out the reference model in advance and use the resulting reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts. Meanwhile, the reference model can assess sample difficulty during data pre-sampling, enabling effective selection of high-gain data to improve training efficiency. Experiments conducted on nine datasets across two domains demonstrate that PVPO achieves state-of-the-art (SOTA) performance. Our approach not only generalizes robustly across multiple tasks, but also scales across models of varying sizes.
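
The abstract does not give formulas, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It contrasts a group-relative baseline (the mean reward of the current policy's own rollouts) with a fixed anchor taken from reference-model rollouts computed before training, and it filters prompts by reference-model difficulty. The function names, the use of a plain mean as the anchor, and the `max_mean` threshold are all hypothetical illustrations, not the authors' implementation.

```python
from statistics import mean


def group_relative_advantages(rewards):
    """Baseline is the mean reward of the current policy's own rollout group."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def anchored_advantages(rewards, reference_rewards):
    """Baseline is the mean reward of reference-model rollouts done in advance.

    Assumption: the anchor is a simple mean; the paper may define it differently.
    """
    anchor = mean(reference_rewards)
    return [r - anchor for r in rewards]


def select_training_prompts(reference_scores, max_mean=1.0):
    """Keep prompts whose mean reference reward is below max_mean.

    Assumption: prompts the reference model always solves carry little
    training signal. The threshold is hypothetical.
    """
    return [
        prompt
        for prompt, scores in reference_scores.items()
        if mean(scores) < max_mean
    ]


if __name__ == "__main__":
    policy_rewards = [1.0, 1.0, 1.0, 1.0]      # every rollout in the group succeeded
    reference_rewards = [0.0, 0.0, 1.0, 0.0]   # pre-computed reference rollouts

    # With a group-relative baseline, identical rewards give no learning signal.
    print(group_relative_advantages(policy_rewards))
    # With a fixed external anchor, the same rollouts still get a nonzero advantage.
    print(anchored_advantages(policy_rewards, reference_rewards))

    scores = {"q1": [1.0, 1.0], "q2": [0.0, 1.0], "q3": [0.0, 0.0]}
    print(select_training_prompts(scores))
```

In this toy case the group-relative advantages collapse to zero because every rollout received the same reward, while the anchored version still yields a signal. This is one way to read the abstract's claim that the anchor reduces reliance on the number of rollouts.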


Jingwen Liang, Gengyu Wang
