Region-Constraint In-Context Generation for Instructional Video Editing

Update: 2025-12-24

Description

🤗 Upvotes: 40 | cs.CV, cs.MM

Authors:

Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, Tao Mei

Title:

Region-Constraint In-Context Generation for Instructional Video Editing

Arxiv:

http://arxiv.org/abs/2512.17650v1

Abstract:

The in-context generation paradigm has recently demonstrated strong power in instructional image editing, in both data efficiency and synthesis quality. Nevertheless, adapting such in-context learning to instruction-based video editing is not trivial. Without specified editing regions, the results can suffer from inaccurate editing regions and from token interference between editing and non-editing areas during denoising. To address these issues, we present ReCo, a new instructional video editing paradigm that models constraints between editing and non-editing regions during in-context generation. Technically, ReCo concatenates the source and target videos width-wise for joint denoising. To calibrate video diffusion learning, ReCo uses two regularization terms, latent regularization and attention regularization, applied to one-step backward denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between the source and target videos while reducing that of non-editing areas, which emphasizes the modification of the editing area and alleviates unexpected content generation outside it. The latter suppresses the attention of tokens in the editing region to the tokens in the corresponding region of the source video, thereby mitigating their interference during novel object generation in the target video. Furthermore, we propose a large-scale, high-quality video editing dataset, ReCo-Data, comprising 500K instruction-video pairs to benefit model training. Extensive experiments on four major instruction-based video editing tasks demonstrate the superiority of our proposal.
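The abstract names three mechanisms (width-wise concatenation, latent regularization, attention regularization) without giving formulas. The following is only a minimal NumPy sketch of how such terms could look, not the authors' implementation. The function names, the `(T, H, W, C)` latent layout, the binary `edit_mask`, the mean-squared discrepancy, and the "outside minus inside" form of the loss are all assumptions made for illustration.

```python
import numpy as np

def concat_width(source, target):
    """Concatenate two latent videos shaped (T, H, W, C) along the width axis."""
    return np.concatenate([source, target], axis=2)

def latent_regularizer(src_latent, tgt_latent, edit_mask, eps=1e-8):
    """Hypothetical latent term (lower is better).

    edit_mask has shape (T, H, W), with 1 inside the editing region.
    The value drops when source and target differ more inside the mask
    and differ less outside it.
    """
    # Per-position squared discrepancy, averaged over channels -> (T, H, W)
    diff = np.mean((src_latent - tgt_latent) ** 2, axis=-1)
    m = edit_mask.astype(diff.dtype)
    inside = (diff * m).sum() / (m.sum() + eps)
    outside = (diff * (1.0 - m)).sum() / ((1.0 - m).sum() + eps)
    return outside - inside

def attention_regularizer(attn, tgt_edit_idx, src_edit_idx):
    """Hypothetical attention term (lower is better).

    attn is an (N, N) attention map with queries on rows and keys on columns.
    Returns the mean attention mass that target editing-region queries
    place on source editing-region keys.
    """
    sub = attn[np.ix_(tgt_edit_idx, src_edit_idx)]
    return sub.sum(axis=1).mean()
```

In a training loop, terms like these would be added to the usual diffusion loss with weighting coefficients; the abstract does not state the weights or the exact discrepancy measure, so those are left out here.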

Region-Constraint In-Context Generation for Instructional Video Editing

Jingwen Liang, Gengyu Wang
