Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation

Update: 2025-01-08

Description

🤗 Upvotes: 12 | cs.CV, cs.AI, cs.LG

Authors:

Guy Yariv, Yuval Kirstain, Amit Zohar, Shelly Sheynin, Yaniv Taigman, Yossi Adi, Sagie Benaim, Adam Polyak

Title:

Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation

Arxiv:

http://arxiv.org/abs/2501.03059v1

Abstract:

We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) an explicit intermediate representation generation stage, followed by (ii) a video generation stage that is conditioned on this representation. Our key innovation is the introduction of a mask-based motion trajectory as an intermediate representation that captures both semantic object information and motion, enabling an expressive but compact representation of motion and semantics. To incorporate the learned representation in the second stage, we utilize object-level attention objectives. Specifically, we consider a spatial, per-object, masked cross-attention objective, integrating object-specific prompts into corresponding latent space regions, and a masked spatio-temporal self-attention objective, ensuring frame-to-frame consistency for each object. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art results in temporal coherence, motion realism, and text-prompt faithfulness. Additionally, we introduce a new challenging benchmark for single-object and multi-object I2V generation, and demonstrate our method's superiority on this benchmark. Project page is available at https://guyyariv.github.io/TTM/.
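
The two masked attention ideas named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the absence of learned query/key/value projections, and the residual update are all assumptions made to keep the example short. It only shows the masking pattern, where each object's prompt updates just the latent positions inside that object's mask, and self-attention across frames is restricted to positions that share an object id.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(latents, prompts, masks):
    """Hypothetical helper: per-object cross-attention restricted by masks.

    latents: (N, d) array of latent positions.
    prompts: list of (T_k, d) arrays, one prompt embedding per object.
    masks:   list of (N,) boolean arrays, one region per object.
    """
    d = latents.shape[1]
    out = latents.copy()
    for tokens, mask in zip(prompts, masks):
        if not mask.any():
            continue  # object not visible, nothing to update
        q = latents[mask]                       # (M, d) queries inside the mask
        scores = q @ tokens.T / np.sqrt(d)      # (M, T_k)
        attn = softmax(scores, axis=-1)
        out[mask] = q + attn @ tokens           # assumed residual update
    return out

def masked_self_attention(tokens, object_ids):
    """Hypothetical helper: self-attention only between same-object positions.

    tokens:     (F * N, d) array, all frames flattened into one sequence.
    object_ids: (F * N,) integer array giving the object at each position.
    """
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)
    same_object = object_ids[:, None] == object_ids[None, :]
    scores = np.where(same_object, scores, -np.inf)  # block cross-object attention
    return softmax(scores, axis=-1) @ tokens
```

In this sketch every position can always attend to itself, so no row of the masked score matrix is entirely blocked. A real model would add learned projections, multiple heads, and the text-conditioned video backbone described in the paper.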




Jingwen Liang, Gengyu Wang
