Yume-1.5: A Text-Controlled Interactive World Generation Model

Update: 2025-12-31

Description

🤗 Upvotes: 50 | cs.CV

Authors:

Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang

Title:

Yume-1.5: A Text-Controlled Interactive World Generation Model

Arxiv:

http://arxiv.org/abs/2512.22096v1

Abstract:

Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, all of which severely limit real-time performance. They also lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 supports keyboard-based exploration of the generated worlds and comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.
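The abstract names linear attention as one ingredient of the long-video component but does not describe it. As a rough illustration only, the sketch below shows a generic kernelized linear attention in NumPy. The `elu(x) + 1` feature map, the non-causal form, and all function names are assumptions chosen for the example; none of it is taken from the paper's codebase.

```python
import numpy as np


def feature_map(x):
    # Assumed positive feature map: elu(x) + 1.
    # For x > 0 this is x + 1; for x <= 0 it is exp(x).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v):
    """Non-causal linear attention.

    q: (n_queries, d), k: (n_keys, d), v: (n_keys, d_v).
    Cost grows linearly with n_keys because the keys and values are
    summarized once into a (d, d_v) matrix and a (d,) vector.
    """
    q_f, k_f = feature_map(q), feature_map(k)
    kv = k_f.T @ v              # (d, d_v) summary of keys and values
    z = k_f.sum(axis=0)         # (d,) normalizer summary
    return (q_f @ kv) / (q_f @ z)[:, None]


def quadratic_reference(q, k, v):
    # Same computation written with the explicit (n_queries, n_keys)
    # weight matrix, used here only to check the linear form.
    q_f, k_f = feature_map(q), feature_map(k)
    w = q_f @ k_f.T
    return (w / w.sum(axis=1, keepdims=True)) @ v
```

The point of the rearrangement is that `kv` and `z` have a fixed size regardless of how many keys there are, which is why this family of attention is often paired with long or growing contexts.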

Jingwen Liang, Gengyu Wang
