MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Update: 2025-09-23

Description

🤗 Upvotes: 37 | cs.CV, cs.CL, cs.LG

Authors:

Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen

Title:

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Arxiv:

http://arxiv.org/abs/2509.16197v1

Abstract:

Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.

Comments

In Channel

LongLive: Real-time Interactive Long Video Generation

2025-09-3024:54

Quantile Advantage Estimation for Entropy-Safe Reasoning

2025-09-3023:16

EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

2025-09-3027:27

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

2025-09-3025:00

ReviewScore: Misinformed Peer Review Detection with Large Language Models

2025-09-3021:58

Variational Reasoning for Language Models

2025-09-3022:33

Language Models Can Learn from Verbal Feedback Without Scalar Rewards

2025-09-3023:30

MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

2025-09-3025:33

CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

2025-09-3023:54

No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

2025-09-3027:53

VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models

2025-09-2722:17

SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

2025-09-2723:35

MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

2025-09-2728:47

Tree Search for LLM Agent Reinforcement Learning

2025-09-2724:50

Seedream 4.0: Toward Next-generation Multimodal Image Generation

2025-09-2721:30

Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets

2025-09-2725:09

AutoIntent: AutoML for Text Classification

2025-09-2722:28

Video models are zero-shot learners and reasoners

2025-09-2624:55

SIM-CoT: Supervised Implicit Chain-of-Thought

2025-09-2624:06

Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR

2025-09-2520:21

00:00

1.0x

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Jingwen Liang, Gengyu Wang

#box-pro-ellipsis-175921930033236{-webkit-line-clamp:2;}MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Jingwen Liang, Gengyu Wang

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer