VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

Update: 2024-12-31

Description

🤗 Upvotes: 8 | cs.CV

Authors:

Tao Wu, Yong Zhang, Xiaodong Cun, Zhongang Qi, Junfu Pu, Huanzhang Dou, Guangcong Zheng, Ying Shan, Xi Li

Title:

VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

Arxiv:

http://arxiv.org/abs/2412.19645v2

Abstract:

Zero-shot customized video generation has gained significant attention due to its substantial application potential. Existing methods rely on additional models to extract and inject reference subject features, assuming that the Video Diffusion Model (VDM) alone is insufficient for zero-shot customized video generation. However, these methods often struggle to maintain consistent subject appearance due to suboptimal feature extraction and injection techniques. In this paper, we reveal that VDM inherently possesses the force to extract and inject subject features. Departing from previous heuristic approaches, we introduce a novel framework that leverages VDM's inherent force to enable high-quality zero-shot customized video generation. Specifically, for feature extraction, we directly input reference images into VDM and use its intrinsic feature extraction process, which not only provides fine-grained features but also significantly aligns with VDM's pre-trained knowledge. For feature injection, we devise an innovative bidirectional interaction between subject features and generated content through spatial self-attention within VDM, ensuring that VDM has better subject fidelity while maintaining the diversity of the generated video. Experiments on both customized human and object video generation validate the effectiveness of our framework.
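
The injection step in the abstract can be pictured as ordinary self-attention run over a longer token sequence: each frame's spatial tokens are concatenated with the reference image's tokens, so frame tokens can attend to subject tokens and subject tokens can attend to frame tokens. The NumPy sketch below is only an illustration under stated assumptions, not the authors' implementation: the function name `joint_spatial_self_attention`, the single-head formulation, the tensor shapes, and the choice to share one set of projection weights between frame and reference tokens are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_spatial_self_attention(frame_tokens, ref_tokens, w_q, w_k, w_v):
    """Single-head self-attention over [frame tokens ; reference tokens].

    frame_tokens: (F, N, C) spatial tokens for F frames (assumed layout)
    ref_tokens:   (M, C) tokens from the reference image (assumed layout)
    w_q, w_k, w_v: (C, D) projection matrices shared by both token types
    """
    num_frames, n, _ = frame_tokens.shape

    # Repeat the reference tokens for every frame, then concatenate them
    # with that frame's own tokens along the token axis.
    ref = np.broadcast_to(ref_tokens, (num_frames,) + ref_tokens.shape)
    x = np.concatenate([frame_tokens, ref], axis=1)  # (F, N + M, C)

    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)  # (F, N + M, N + M)
    out = attn @ v

    # Split the result back into frame and reference parts.
    return out[:, :n], out[:, n:], attn
```

In the returned attention matrix, the block `attn[:, :N, N:]` holds the weights with which frame tokens read from the reference, and `attn[:, N:, :N]` holds the weights in the opposite direction; both being non-zero is what "bidirectional" means in this sketch.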




Jingwen Liang, Gengyu Wang
