DiscoverDaily Paper Cast

RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics

Update: 2025-12-18

Description

🤗 Upvotes: 31 | cs.RO, cs.CV



Authors:

Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang, Shanyu Rong, Yi Han, Yuheng Ji, Mengzhen Liu, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang



Title:

RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics



Arxiv:

http://arxiv.org/abs/2512.13660v1



Abstract:

Spatial tracing, a fundamental embodied interaction ability for robots, is inherently challenging: it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. Existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs spanning outdoor, indoor, and tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark that fills the evaluation gap for spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
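The abstract mentions "metric-sensitive process rewards" that supervise intermediate perceptual cues during RFT but gives no formula. As a rough illustration only (an assumption, not the authors' method), such a reward could score each predicted intermediate metric quantity, e.g. a distance in meters, against a reference, with the reward decaying as relative metric error grows, then average over the steps of one reasoning trace:

```python
import math

def step_reward(pred_m: float, ref_m: float, sigma: float = 0.2) -> float:
    """Toy per-step reward in (0, 1]: 1 at zero error, decaying
    exponentially with relative metric error. `sigma` (tolerance scale)
    is a hypothetical parameter, not from the paper."""
    rel_err = abs(pred_m - ref_m) / max(abs(ref_m), 1e-6)
    return math.exp(-rel_err / sigma)

def process_reward(preds: list[float], refs: list[float]) -> float:
    """Mean per-step reward over the intermediate cues of one trace."""
    assert preds and len(preds) == len(refs)
    return sum(step_reward(p, r) for p, r in zip(preds, refs)) / len(preds)
```

A process-level signal like this rewards each intermediate measurement rather than only the final trace, which is the general idea behind process (as opposed to outcome) rewards in reinforcement fine-tuning.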


Jingwen Liang, Gengyu Wang