MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Update: 2024-12-21

Description

🤗 Upvotes: 44 | cs.CV, cs.CL

Authors:

Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong

Title:

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Arxiv:

http://arxiv.org/abs/2412.14475v1

Abstract:

Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision language models (VLMs) and open-domain images, together with a massive synthetic dataset generated from this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70$\times$ more data from existing datasets. Moreover, since MegaPairs solely relies on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. In this stage, we produced more than 26 million training instances and trained several models of varying sizes using this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also demonstrate notable performance improvements with additional downstream fine-tuning. Our produced dataset, well-trained models, and data synthesis pipeline will be made publicly available to facilitate the future development of this field.
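
To make the evaluation setting concrete, here is a minimal sketch of how a composed image retrieval (CIR) query is typically scored: a query embedding (standing in for a reference image plus a modification text) is compared with candidate image embeddings by cosine similarity, and recall@k checks whether the target appears in the top k. This is not the authors' code. The vectors, image ids, and function names below are hypothetical toy values chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidates):
    """Return candidate ids sorted by descending similarity to the query."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored]


def recall_at_k(ranked_ids, target_id, k):
    """1.0 if the target is among the top-k ranked ids, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


# Hypothetical toy embeddings (assumption: a real retriever would produce
# these from a reference image plus a text instruction, and from each
# candidate image).
query_vec = [0.9, 0.1, 0.4]
candidates = {
    "img_target": [0.8, 0.2, 0.5],
    "img_distractor_a": [0.1, 0.9, 0.0],
    "img_distractor_b": [-0.5, 0.3, 0.2],
}

ranked = rank_candidates(query_vec, candidates)
print(ranked)
print(recall_at_k(ranked, "img_target", k=1))
```

In a zero-shot setting, the retriever that produces these embeddings has not been fine-tuned on the benchmark it is scored on. Only the ranking and metric steps are shown here.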

Jingwen Liang, Gengyu Wang
