Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Update: 2024-12-21

Description

🤗 Upvotes: 17 | cs.CV

Authors:

Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, Mannat Singh

Title:

Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Arxiv:

http://arxiv.org/abs/2412.15213v1

Abstract:

Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike Diffusion models, they are not constrained for the source distribution to be noise. Hence, in this paper, we propose a paradigm shift, and ask the question of whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic which results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state-of-the-art for various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
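
The core idea in the abstract, flow matching from one modality's distribution directly to another's instead of from Gaussian noise, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the common linear-interpolation form of flow matching, and the names `flow_matching_pair` and `euler_integrate` are hypothetical. The random arrays stand in for source-modality latents (for example, text encoded to the same shape as the image latents, which the abstract says is done with a Variational Encoder) and target-modality latents.

```python
import numpy as np

def flow_matching_pair(z_src, z_tgt, t):
    """Build one training pair for linear-path flow matching.

    z_src: source latents, shape (batch, dim). Assumed here to come from
           the source modality rather than from Gaussian noise.
    z_tgt: target latents, same shape as z_src.
    t:     per-sample times in [0, 1], shape (batch,).
    Returns the interpolated point x_t and the regression target for the
    velocity network (constant along a straight path).
    """
    t = np.asarray(t, dtype=z_src.dtype).reshape(-1, *([1] * (z_src.ndim - 1)))
    x_t = (1.0 - t) * z_src + t * z_tgt
    v_target = z_tgt - z_src
    return x_t, v_target

def euler_integrate(velocity_fn, z_src, num_steps=10):
    """Move source latents toward the target distribution by Euler steps
    of dx/dt = velocity_fn(x, t), from t = 0 to t = 1."""
    x = z_src.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full(x.shape[0], i * dt, dtype=x.dtype)
        x = x + dt * velocity_fn(x, t)
    return x

# Toy usage: random arrays stand in for paired text and image latents.
rng = np.random.default_rng(0)
z_text = rng.normal(size=(4, 8))
z_image = rng.normal(size=(4, 8))
x_t, v_target = flow_matching_pair(z_text, z_image, rng.uniform(size=4))

# A trained network would be fit to predict v_target from (x_t, t).
# Here an oracle velocity is plugged in only to show the sampling loop.
z_out = euler_integrate(lambda x, t: z_image - z_text, z_text)
```

In this sketch no noise sample and no separate conditioning input appear anywhere: the source latent itself carries the conditioning information, which is the property the abstract highlights.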




Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Jingwen Liang, Gengyu Wang
