InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion

Update: 2025-12-30

Description

🤗 Upvotes: 74 | cs.CV, cs.AI

Authors:

Hoiyeong Jin, Hyojin Jang, Jeongho Kim, Junha Hyung, Kinam Kim, Dongjin Kim, Huijin Choi, Hyeonji Kim, Jaegul Choo

Title:

InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion

Arxiv:

http://arxiv.org/abs/2512.17504v1

Abstract:

Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D-aware mask generation module that reconstructs the scene geometry and propagates user-specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building upon this spatial foundation, we extend a diffusion-based video generation model to jointly synthesize the inserted object and its surrounding local variations such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination-aware synthetic dataset constructed by transforming the ROSE object removal dataset into triplets of object-removed video, object-present video, and a VLM-generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing research and commercial models.
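
The abstract does not say how the mask module keeps occlusion consistent. As a rough illustration only, the sketch below assumes a per-pixel depth test: an inserted-object pixel stays in the mask when the object is closer to the camera than the reconstructed scene surface. The names visible_mask and visible_masks_for_video, and the list-of-lists depth maps, are hypothetical and not taken from the paper.

```python
from typing import List

Mask = List[List[bool]]
Depth = List[List[float]]


def visible_mask(object_mask: Mask, object_depth: Depth, scene_depth: Depth) -> Mask:
    """Keep only object pixels that lie in front of the scene surface.

    Assumption: smaller depth means closer to the camera, and all three
    inputs share the same height and width.
    """
    return [
        [in_mask and obj_d < scene_d
         for in_mask, obj_d, scene_d in zip(mask_row, obj_row, scene_row)]
        for mask_row, obj_row, scene_row in zip(object_mask, object_depth, scene_depth)
    ]


def visible_masks_for_video(
    object_masks: List[Mask],
    object_depths: List[Depth],
    scene_depths: List[Depth],
) -> List[Mask]:
    """Apply the same depth test independently to every frame."""
    return [
        visible_mask(m, od, sd)
        for m, od, sd in zip(object_masks, object_depths, scene_depths)
    ]


# Toy 2x2 frame: the object sits at depth 2.0 everywhere it is placed.
object_mask = [[True, True], [False, True]]
object_depth = [[2.0, 2.0], [2.0, 2.0]]
# A foreground occluder at depth 1.0 covers the top-right pixel.
scene_depth = [[5.0, 1.0], [5.0, 5.0]]

result = visible_mask(object_mask, object_depth, scene_depth)
print(result)  # [[True, False], [False, True]]
```

In this toy frame the top-right pixel is dropped because the scene is closer than the object there. A real system would also need the temporally coherent placement the abstract mentions, which this per-frame test does not model.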


Jingwen Liang, Gengyu Wang
