VIDEOP2R: Video Understanding from Perception to Reasoning

Update: 2025-11-20

Description

🤗 Upvotes: 70 | cs.CV, cs.AI, cs.LG



Authors:

Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan



Title:

VIDEOP2R: Video Understanding from Perception to Reasoning



Arxiv:

http://arxiv.org/abs/2511.11113v1



Abstract:

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO, and demonstrate that the model's perception output is information-sufficient for downstream reasoning.
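
The abstract does not spell out PA-GRPO's formulation, but a minimal sketch can illustrate the core idea it names: group-relative advantages, as in standard GRPO, computed separately for a perception reward and a reasoning reward. Everything below is an illustrative assumption rather than the paper's implementation; the function names, the toy reward values, and the token-level credit assignment described in the comments are hypothetical.

import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # Standard GRPO normalization: each sampled response's reward is
    # compared against the mean and std of its sampling group.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def pa_grpo_advantages(perception_rewards, reasoning_rewards):
    # Hypothetical process-aware variant: normalize the two reward
    # streams independently, so perception and reasoning each receive
    # their own group-relative learning signal.
    return grpo_advantages(perception_rewards), grpo_advantages(reasoning_rewards)

# Toy usage: one video-question pair with a group of 4 sampled rollouts.
perception_r = [1.0, 0.0, 1.0, 0.0]  # e.g., reward for a faithful perception output
reasoning_r = [1.0, 0.0, 0.0, 1.0]   # e.g., reward for a correct final answer
adv_p, adv_r = pa_grpo_advantages(perception_r, reasoning_r)
print("perception advantages:", adv_p)  # [ 1. -1.  1. -1.]
print("reasoning advantages:", adv_r)   # [ 1. -1. -1.  1.]

In a policy-gradient update, adv_p would plausibly weight the log-probabilities of the perception tokens and adv_r those of the reasoning tokens, which is one natural reading of "separate rewards for perception and reasoning."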

