GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Update: 2025-12-31

Description

🤗 Upvotes: 21 | cs.CV

Authors:

Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing Zhang

Title:

GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Arxiv:

http://arxiv.org/abs/2512.15560v2

Abstract:

The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Notably, under our experimental setup, compared with training a diffusion model from scratch, evaluating with TED-6K is about 750× faster. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our TED-6K dataset and evaluation code are available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.
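
The abstract mentions a layer-wise weighting method but does not spell out its form. As a rough illustration only, the sketch below assumes one common variant: each layer of the language model gets a scalar score, the scores are softmax-normalized, and the per-token hidden states are summed with those weights. The function names, the softmax normalization, and the toy numbers are all assumptions made for this example, not details taken from the paper.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scalars."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def layerwise_weighted_embedding(
    layer_states: Sequence[Sequence[Vector]],
    layer_logits: Sequence[float],
) -> List[Vector]:
    """Mix per-layer token features with softmax-normalized layer weights.

    layer_states[l][t] is the hidden vector of token t at layer l.
    layer_logits[l] is a scalar score for layer l (hypothetical here;
    in a trained system it would be a learned parameter).
    """
    if len(layer_states) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits)
    num_tokens = len(layer_states[0])
    dim = len(layer_states[0][0])
    mixed = [[0.0] * dim for _ in range(num_tokens)]
    for weight, states in zip(weights, layer_states):
        for t, vec in enumerate(states):
            for d, value in enumerate(vec):
                mixed[t][d] += weight * value
    return mixed


# Toy input (made up): 3 layers, 2 tokens, 2 feature dimensions.
toy_states = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
# Equal logits give equal weights, so the result is the plain layer average.
uniform_mix = layerwise_weighted_embedding(toy_states, [0.0, 0.0, 0.0])
```

With equal scores the mix reduces to an average over layers; raising one layer's score shifts the output toward that layer's features, which is the intuition behind letting the weights be learned rather than always reading the final layer.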

Jingwen Liang, Gengyu Wang
