Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Update: 2025-10-27
Description

In this episode, we discuss Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset by Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen. The paper presents Ditto, a comprehensive framework that generates large-scale, high-quality training data for instruction-based video editing by combining an advanced image editor with an in-context video generator. Ditto uses an efficient, distilled model with a temporal enhancer and an intelligent agent to ensure scalable, diverse, and high-fidelity video edits. Leveraging this framework, the authors created the Ditto-1M dataset and trained the Editto model, achieving state-of-the-art performance in following editing instructions.
agibreakdown