MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

Update: 2025-09-30

Description

🤗 Upvotes: 28 | cs.CV, cs.RO

Authors:

Jinkun Hao, Naifu Liang, Zhen Luo, Xudong Xu, Weipeng Zhong, Ran Yi, Yichen Jin, Zhaoyang Lyu, Feng Zheng, Lizhuang Ma, Jiangmiao Pang

Title:

MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

Arxiv:

http://arxiv.org/abs/2509.22281v1

Abstract:

The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Exhaustive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page is at https://mesatask.github.io/
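The three stages named in the abstract (object inference, spatial interrelation reasoning, scene graph construction) can be pictured with a small data-structure sketch. This is an illustration under stated assumptions, not MesaTask's implementation: the names `SceneObject`, `Relation`, `SceneGraph`, and `build_scene_graph`, the relation labels, and the hand-written example task are all hypothetical. In the paper these stages are carried out by an LLM, not by fixed rules.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class SceneObject:
    """One object inferred from the task description (stage 1)."""
    name: str


@dataclass(frozen=True)
class Relation:
    """A pairwise spatial relation between two objects (stage 2)."""
    subject: str
    predicate: str  # hypothetical labels such as "on" or "left_of"
    target: str


@dataclass
class SceneGraph:
    """Objects plus relations, the input to a later 3D layout step (stage 3)."""
    objects: list[SceneObject] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)

    def add_relation(self, subject: str, predicate: str, target: str) -> None:
        # Reject relations that mention objects missing from the scene.
        names = {obj.name for obj in self.objects}
        if subject not in names or target not in names:
            raise ValueError(f"unknown object in relation: {subject} -> {target}")
        self.relations.append(Relation(subject, predicate, target))

    def relations_of(self, name: str) -> list[Relation]:
        # All relations in which the named object is the subject.
        return [rel for rel in self.relations if rel.subject == name]


def build_scene_graph(object_names, relations) -> SceneGraph:
    """Assemble a graph from already-inferred objects and relations."""
    graph = SceneGraph(objects=[SceneObject(n) for n in object_names])
    for subject, predicate, target in relations:
        graph.add_relation(subject, predicate, target)
    return graph


# Hand-written stand-in for what an LLM might infer from the task
# "pour water from the kettle into the cup" (assumed example).
graph = build_scene_graph(
    ["kettle", "cup", "tray"],
    [("cup", "on", "tray"), ("kettle", "left_of", "tray")],
)
```

Validating each relation against the object list at insertion time keeps the graph consistent before any coordinates are assigned, which is one simple way to catch a reasoning step that refers to an object the earlier step never produced.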

Jingwen Liang, Gengyu Wang
