LLaDA2.0: Scaling Up Diffusion Language Models to 100B

Update: 2025-12-20

Description

🤗 Upvotes: 54 | cs.LG, cs.AI, cs.CL

Authors:

Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Ling Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Liwang Zhu, Yihong Zhuang

Title:

LLaDA2.0: Scaling Up Diffusion Language Models to 100B

Arxiv:

http://arxiv.org/abs/2512.15745v1

Abstract:

This paper presents LLaDA2.0 -- a tuple of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the design principles of knowledge inheritance, progressive adaptation, and efficiency awareness, and seamlessly converts a pre-trained AR model into a dLLM with a novel 3-phase, block-level WSD (warm-up, stable, decay) training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models have been open-sourced.
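The three-phase block-size progression described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' training code: the class name `BlockWSDSchedule`, the sequence length, the block sizes, and the step counts are all hypothetical, since the abstract gives none of these values. The sketch shows only the shape of the schedule: block size grows in stages during warm-up, equals the full sequence length during the stable phase, and drops back to a compact size during decay.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BlockWSDSchedule:
    """Hypothetical block-size schedule; every number here is illustrative."""

    seq_len: int = 4096                                   # assumed sequence length
    warmup_block_sizes: Tuple[int, ...] = (32, 256, 1024)  # assumed progression
    decay_block_size: int = 32                            # assumed compact size
    warmup_steps: int = 1000                              # assumed phase lengths
    stable_steps: int = 8000
    decay_steps: int = 1000

    def phase(self, step: int) -> str:
        """Return which of the three phases a training step falls in."""
        if step < 0:
            raise ValueError("step must be non-negative")
        if step < self.warmup_steps:
            return "warm-up"
        if step < self.warmup_steps + self.stable_steps:
            return "stable"
        return "decay"

    def block_size(self, step: int) -> int:
        """Return the diffusion block size used at a training step."""
        phase = self.phase(step)
        if phase == "warm-up":
            # Split the warm-up evenly across the increasing block sizes.
            stage = step * len(self.warmup_block_sizes) // self.warmup_steps
            return self.warmup_block_sizes[stage]
        if phase == "stable":
            # Full-sequence diffusion: a single block covers the whole sequence.
            return self.seq_len
        # Decay: revert to compact block diffusion.
        return self.decay_block_size


schedule = BlockWSDSchedule()
for step in (0, 500, 999, 1000, 9000):
    print(step, schedule.phase(step), schedule.block_size(step))
```

A training loop could query `schedule.block_size(step)` before building each batch's attention mask; how the real system switches between phases is not specified in the abstract.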

Jingwen Liang, Gengyu Wang
