Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

Update: 2025-01-08

Description

🤗 Upvotes: 23 | cs.CV

Authors:

Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang

Arxiv:

http://arxiv.org/abs/2501.03218v1

Abstract:

Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing. 2) Decision: raising proactive interaction at appropriate moments. 3) Reaction: continuous interaction with users. However, these capabilities inherently conflict. Decision and Reaction require Perception at opposing scales and granularities, and autoregressive decoding blocks real-time Perception and Decision during Reaction. To unify the conflicting capabilities within a harmonious system, we present Dispider, a system that disentangles Perception, Decision, and Reaction. Dispider features a lightweight proactive streaming video processing module that tracks the video stream and identifies optimal moments for interaction. Once an interaction is triggered, an asynchronous interaction module provides detailed responses, while the processing module continues to monitor the video in the meantime. Our disentangled and asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider well suited to active real-time interaction over long-duration video streams. Experiments show that Dispider not only maintains strong performance in conventional video QA tasks, but also significantly surpasses previous online models in streaming scenario responses, thereby validating the effectiveness of our architecture. The code and model are released at https://github.com/Mark12Ding/Dispider.
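
The asynchronous split described in the abstract can be illustrated with a toy sketch. This is not Dispider's implementation or API: the function names (`react`, `should_interact`, `monitor`, `run_demo`), the per-frame scores, the threshold, and the sleep-based latencies are all hypothetical stand-ins. It only shows the general pattern of a lightweight monitoring loop that keeps processing frames while a slower response is produced in the background.

```python
import asyncio

async def react(frame_id, log):
    """Stand-in for the slow, detailed response (e.g. autoregressive decoding)."""
    await asyncio.sleep(0.05)  # simulated decoding latency (assumed value)
    log.append(("response", frame_id))

def should_interact(score, threshold=0.8):
    """Hypothetical decision rule: trigger when a per-frame score is high enough."""
    return score >= threshold

async def monitor(scores, log):
    """Lightweight loop: perceive each frame, decide, and never wait on a reaction."""
    pending = []
    for frame_id, score in enumerate(scores):
        log.append(("seen", frame_id))
        if should_interact(score):
            # Launch the reaction as a background task so monitoring continues.
            pending.append(asyncio.create_task(react(frame_id, log)))
        await asyncio.sleep(0.01)  # simulated per-frame cost; lets reactions make progress
    await asyncio.gather(*pending)  # drain outstanding responses once the stream ends

def run_demo(scores):
    """Run the toy stream and return the ordered event log."""
    log = []
    asyncio.run(monitor(scores, log))
    return log

if __name__ == "__main__":
    for event in run_demo([0.1, 0.9, 0.2, 0.3]):
        print(event)
```

In this sketch the frames after the trigger are logged before the triggered response completes, which is the property the abstract attributes to the disentangled design: decoding does not stall perception and decision.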




Jingwen Liang, Gengyu Wang
