DiscoverDaily Paper Cast

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Update: 2025-12-23

Description

🤗 Upvotes: 31 | cs.CV



Authors:

Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen



Title:

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation



Arxiv:

http://arxiv.org/abs/2512.17012v1



Abstract:

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.
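
The Perceptual 4D Distillation (P4D) idea described above, transferring 4D representations from a frozen expert model into the student MLLM, follows the general pattern of feature-level knowledge distillation. Below is a minimal sketch of that pattern, assuming the frozen expert and the student's visual branch each expose per-frame token features and that a learned projection aligns the student's features to the expert's space; the class name, dimensions, and loss weighting (`P4DFeatureDistiller`, `student_dim`, the 0.5 weight) are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class P4DFeatureDistiller(nn.Module):
    """Hypothetical sketch of perceptual feature distillation:
    align student video features to a frozen 4D expert's features.
    Names and dimensions are assumptions, not taken from the paper."""

    def __init__(self, student_dim: int, expert_dim: int):
        super().__init__()
        # Learned projection from the student's feature space into the expert's.
        self.proj = nn.Linear(student_dim, expert_dim)

    def forward(self, student_feats: torch.Tensor, expert_feats: torch.Tensor) -> torch.Tensor:
        # student_feats: (B, T, N, student_dim) -- per-frame token features from the student's visual branch
        # expert_feats:  (B, T, N, expert_dim)  -- per-frame features from the frozen 4D expert
        projected = self.proj(student_feats)
        # Cosine alignment is a common choice for feature distillation; an L2 loss
        # on normalized features is an equivalent alternative.
        return 1.0 - F.cosine_similarity(projected, expert_feats.detach(), dim=-1).mean()


# Toy usage: combine the distillation term with the usual language-modeling loss.
if __name__ == "__main__":
    B, T, N = 2, 8, 16                            # batch, frames, tokens per frame (illustrative sizes)
    student = torch.randn(B, T, N, 768, requires_grad=True)
    expert = torch.randn(B, T, N, 1024)           # frozen expert output (no gradient needed)
    distiller = P4DFeatureDistiller(768, 1024)
    lm_loss = torch.tensor(0.0)                   # placeholder for the MLLM's next-token loss
    loss = lm_loss + 0.5 * distiller(student, expert)   # distillation weight is an assumption
    loss.backward()
```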


Jingwen Liang, Gengyu Wang