SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Update: 2025-10-01

Description

🤗 Upvotes: 36 | cs.CV, cs.AI

Authors:

Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie

Title:

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Arxiv:

http://arxiv.org/abs/2509.24695v1

Abstract:

We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, and is deployable on an RTX 5090 GPU. Two core designs ensure efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: We design a block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
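The constant-memory state described in design (2) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the ReLU feature map, the function names (`relu_feature`, `block_linear_attention`, `block_causal_reference`), the single-head layout, and the `eps` term are all assumptions made for illustration. The sketch only shows the general property the abstract relies on: because linear attention is cumulative, all earlier blocks can be summarized by a fixed-size running sum instead of a cache that grows with video length.

```python
import numpy as np

def relu_feature(x):
    # Assumed kernel feature map; the abstract does not specify one.
    return np.maximum(x, 0.0)

def block_linear_attention(q, k, v, block_size, eps=1e-6):
    """Block-wise autoregressive linear attention (illustrative sketch).

    Each block attends to itself and to all earlier blocks. Earlier
    blocks are represented only by `state` (d x d_v) and `norm` (d),
    whose sizes do not depend on how many blocks have been processed.
    """
    n, d = q.shape
    d_v = v.shape[1]
    state = np.zeros((d, d_v))  # running sum of phi(k)^T v
    norm = np.zeros(d)          # running sum of phi(k)
    out = np.empty((n, d_v))
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        phi_q = relu_feature(q[start:end])
        phi_k = relu_feature(k[start:end])
        # Fold the current block into the fixed-size state.
        state = state + phi_k.T @ v[start:end]
        norm = norm + phi_k.sum(axis=0)
        # Every token in the block reads from the same summary.
        out[start:end] = (phi_q @ state) / (phi_q @ norm + eps)[:, None]
    return out, state, norm

def block_causal_reference(q, k, v, block_size, eps=1e-6):
    # Quadratic-cost reference: explicit (n x n) scores with a block-causal mask.
    n = q.shape[0]
    scores = relu_feature(q) @ relu_feature(k).T
    block_id = np.arange(n) // block_size
    scores = scores * (block_id[:, None] >= block_id[None, :])
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 12, 4))
out, state, norm = block_linear_attention(q, k, v, block_size=4)
ref = block_causal_reference(q, k, v, block_size=4)
print(np.allclose(out, ref), state.shape, norm.shape)
```

In this sketch the recurrent version matches the masked quadratic reference while storing only a `d x d_v` matrix and a length-`d` vector between blocks, whereas a conventional KV cache would store keys and values for every previous token.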



Jingwen Liang, Gengyu Wang
