DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Update: 2024-12-26

Description

🤗 Upvotes: 10 | cs.CV, cs.AI, cs.MM

Authors:

Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, Xiangyu Yue

Title:

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Arxiv:

http://arxiv.org/abs/2412.18597v1

Abstract:

Sora-like video generation models have achieved remarkable progress with the Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single-prompt generation and struggle to generate coherent scenes from multiple sequential prompts, which better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges, including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method for MM-DiT architectures. Our key idea is to treat multi-prompt video generation as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism and find that its 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models, which enables mask-guided, precise semantic control across different prompts through attention sharing. With this design, videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts, without additional training. We also present MPVBench, a new benchmark designed specifically to evaluate multi-prompt video generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.
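
The abstract mentions mask-guided control with attention sharing across prompts. The toy sketch below is one way to picture that idea and is not the authors' implementation: the function names, the per-token mask, and the linear blending rule are all assumptions made for illustration. Each query token of the current prompt segment attends both to its own keys/values and to keys/values taken from the previous segment, and a mask value in [0, 1] decides how much of the shared result is kept.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def masked_kv_sharing(q_cur, k_cur, v_cur, k_prev, v_prev, mask):
    """Hypothetical blend of shared and own attention, one mask value per query.

    mask[i] = 1.0 keeps only the result of attending to the previous
    segment's keys/values; mask[i] = 0.0 keeps only the current segment's.
    """
    own = attention(q_cur, k_cur, v_cur)
    shared = attention(q_cur, k_prev, v_prev)
    return [
        [m * s + (1.0 - m) * o for s, o in zip(shared_row, own_row)]
        for m, shared_row, own_row in zip(mask, shared, own)
    ]


if __name__ == "__main__":
    # Two query tokens: the first is treated as "foreground" (mask 1.0) and
    # reuses the previous segment's content, the second keeps its own.
    q = [[1.0, 0.0], [0.0, 1.0]]
    k_cur, v_cur = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
    k_prev, v_prev = [[0.5, 0.5]], [[9.0, 9.0]]
    print(masked_kv_sharing(q, k_cur, v_cur, k_prev, v_prev, [1.0, 0.0]))
```

In this sketch the mask is given by hand; how a mask would actually be derived, and at which layers or denoising steps sharing would be applied, is left open here because the abstract does not specify it.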



DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Jingwen Liang, Gengyu Wang
