REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Update: 2025-11-20

Description

🤗 Upvotes: 22 | cs.CV

Authors:

Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan

Title:

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Arxiv:

http://arxiv.org/abs/2511.13026v1

Abstract:

Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1) long-form video understanding involves richer and more dynamic visual input, meaning rethinking only the text information is insufficient and necessitates a further rethinking process specifically targeting visual information; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances the long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks: VideoMME, LongVideoBench, MLVU, and LVBench.
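The abstract says DADR is integrated into GRPO training but does not spell out how the reward is computed. As a rough illustration of the GRPO side only, the sketch below shows the group-relative advantage step that GRPO is generally described as using: each rollout's scalar reward is normalized against the other rollouts sampled for the same prompt. The function name, the reward values, the `eps` term, and the choice of population standard deviation are assumptions made for illustration; DADR's actual reward decomposition is not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Assumption: `rewards` holds one scalar per rollout for the same prompt.
    How REVISOR's DADR produces those scalars is not described in the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts answering the same video question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above their group's mean receive a positive advantage and those below receive a negative one, so a reward that depends on the selected video segment would, under this scheme, push the policy toward segment choices that beat the group average.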

In Channel

Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks

2025-11-21 25:59

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

2025-11-21 24:57

What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity

2025-11-21 22:40

VisPlay: Self-Evolving Vision-Language Models from Images

2025-11-21 22:28

Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset

2025-11-21 19:22

VIDEOP2R: Video Understanding from Perception to Reasoning

2025-11-20 25:08

Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

2025-11-20 24:58

AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

2025-11-20 23:48

A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space

2025-11-20 23:48

Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

2025-11-20 22:39

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

2025-11-20 24:27

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

2025-11-20 26:47

Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

2025-11-19 24:24

P1: Mastering Physics Olympiads with Reinforcement Learning

2025-11-19 22:16

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

2025-11-19 27:44

Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

2025-11-19 23:57

Part-X-MLLM: Part-aware 3D Multimodal Large Language Model

2025-11-19 25:57

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

2025-11-19 20:43

GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning

2025-11-19 23:49

TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

2025-11-19 23:11

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Jingwen Liang, Gengyu Wang