LongVideoAgent: Multi-Agent Reasoning with Long Videos

Update: 2025-12-25

Description

🤗 Upvotes: 38 | cs.AI, cs.CV, cs.LG, cs.MA

Authors:

Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen

Title:

LongVideoAgent: Multi-Agent Reasoning with Long Videos

Arxiv:

http://arxiv.org/abs/2512.20618v1

Abstract:

Recent advances in multimodal LLMs and tool-using systems for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.
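
The coordination loop described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the names `run_master`, `Trajectory`, `policy`, `ground`, and `look` are hypothetical, the three agents are replaced by toy stand-in functions, and the reinforcement-learning training of the master is left out entirely. It only illustrates a master that, within a step limit, either asks a grounding agent for a segment, asks a vision agent for a textual observation about that segment, or answers, while recording a readable trajectory.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical types for illustration only; the abstract does not specify interfaces.
Segment = Tuple[float, float]  # (start_sec, end_sec) within the episode
Action = Tuple[str, str]       # (kind, argument); kind is "ground", "look", or "answer"


@dataclass
class Trajectory:
    """Readable record of what the master did, one entry per step."""
    steps: List[str] = field(default_factory=list)
    answer: Optional[str] = None


def run_master(
    question: str,
    policy: Callable[[str, List[str]], Action],
    ground: Callable[[str], Segment],
    look: Callable[[Segment, str], str],
    max_steps: int = 5,
) -> Trajectory:
    """Run the master loop for at most max_steps actions."""
    traj = Trajectory()
    observations: List[str] = []
    segment: Optional[Segment] = None
    for _ in range(max_steps):
        kind, arg = policy(question, observations)
        if kind == "answer":
            traj.steps.append(f"answer: {arg}")
            traj.answer = arg
            return traj
        if kind == "ground":
            # Grounding agent: localize a question-relevant segment.
            segment = ground(arg)
            obs = f"segment {segment[0]:.0f}-{segment[1]:.0f}s"
        elif kind == "look" and segment is not None:
            # Vision agent: return a textual observation about the segment.
            obs = look(segment, arg)
        else:
            obs = "invalid action"
        observations.append(obs)
        traj.steps.append(f"{kind}({arg!r}) -> {obs}")
    return traj  # step budget exhausted; answer stays None


# Toy stand-ins (assumptions) so the loop can be run end to end.
def toy_policy(question: str, observations: List[str]) -> Action:
    if not observations:
        return ("ground", question)
    if len(observations) == 1:
        return ("look", "what is the person holding?")
    return ("answer", observations[-1])


def toy_ground(query: str) -> Segment:
    return (120.0, 135.0)


def toy_look(segment: Segment, prompt: str) -> str:
    return "a red mug"


result = run_master("What is the character holding in the kitchen?",
                    toy_policy, toy_ground, toy_look)
for step in result.steps:
    print(step)
```

In this sketch the step limit is a plain loop bound: if the policy has not answered by `max_steps`, the trajectory is returned with `answer` left as `None`. How the actual system handles an exhausted budget is not stated in the abstract.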



Jingwen Liang, Gengyu Wang
