Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Update: 2025-12-23

Description

🤗 Upvotes: 30 | cs.CV

Authors:

Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo

Title:

Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Arxiv:

http://arxiv.org/abs/2512.17909v1

Abstract:

Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.
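To give a sense of how compact the latent described in the abstract is, here is a minimal sketch of the arithmetic. It assumes that "96 channels with 16x16 spatial downsampling" means each image side is divided by 16 and each latent position holds 96 values. The function names `latent_shape` and `compression_ratio` are hypothetical helpers for illustration and do not come from the paper.

```python
def latent_shape(height, width, channels=96, downsample=16):
    """Return (channels, h, w) of the latent grid for an image of the given size.

    Assumption: each spatial side shrinks by `downsample` and the latent
    carries `channels` values per position, as read from the abstract.
    """
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return channels, height // downsample, width // downsample


def compression_ratio(height, width, channels=96, downsample=16):
    """Ratio of RGB pixel values to latent values for one image."""
    c, h, w = latent_shape(height, width, channels, downsample)
    return (height * width * 3) / (c * h * w)


if __name__ == "__main__":
    print(latent_shape(256, 256))       # (96, 16, 16)
    print(compression_ratio(256, 256))  # 8.0
```

Under these assumptions, a 256x256 RGB image (196,608 values) maps to a 16x16 grid of 96-channel vectors (24,576 values), an 8x reduction in the number of values.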




Jingwen Liang, Gengyu Wang
