2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Update: 2025-01-04
Description

🤗 Upvotes: 45 | cs.CV, cs.CL, cs.LG



Authors:

Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing



Title:

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining



arXiv:

http://arxiv.org/abs/2501.00958v1



Abstract:

Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, as humans do. However, existing interleaved datasets are crawled from webpages and suffer from low knowledge density, loose image-text relations, and poor logical coherence between images. Meanwhile, the internet hosts a vast number of instructional videos (e.g., online geometry courses) that humans widely use to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. We then progressively extract and refine visual (keyframes), audio (ASR), and textual (OCR) knowledge from the videos, and organize it into an image-text interleaved corpus in temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its strong pretraining performance, particularly on knowledge- and reasoning-intensive tasks such as ScienceQA and MathVista. Moreover, VLMs pretrained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at https://github.com/DAMO-NLP-SG/multimodal_textbook.
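
The pipeline summarized in the abstract — gather lecture videos under an LLM-proposed taxonomy, extract keyframes, transcribe the audio (ASR), run OCR on the frames, and merge everything into an interleaved image-text sequence in temporal order — can be sketched roughly as below. This is a minimal illustration under our own assumptions: the data classes and the build_interleaved_sample helper are hypothetical placeholders, not the authors' released implementation (see the repository linked above for that).

```python
# Hypothetical sketch of turning one lecture video's extracted signals into an
# interleaved image-text sample, as described in the abstract. Not the authors' code.
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Keyframe:
    timestamp: float   # seconds from the start of the video
    image_path: str    # a deduplicated, visually distinct frame
    ocr_text: str      # text rendered on the frame (slides, formulas)


@dataclass
class AsrSegment:
    timestamp: float
    text: str          # refined ASR transcript for this span of the lecture


def build_interleaved_sample(
    keyframes: List[Keyframe],
    asr_segments: List[AsrSegment],
) -> List[Union[str, Dict[str, str]]]:
    """Merge keyframes and transcript segments by timestamp so the sample
    alternates images and text in the order they appeared in the lecture."""
    events = [(k.timestamp, "image", k) for k in keyframes] + [
        (s.timestamp, "text", s) for s in asr_segments
    ]
    events.sort(key=lambda e: e[0])

    sample: List[Union[str, Dict[str, str]]] = []
    for _, kind, item in events:
        if kind == "image":
            sample.append({"image": item.image_path, "ocr": item.ocr_text})
        else:
            sample.append(item.text)
    return sample


if __name__ == "__main__":
    # Toy example: one ASR segment followed by one keyframe from the same lecture.
    demo = build_interleaved_sample(
        keyframes=[Keyframe(12.0, "frame_0012.jpg", "a^2 + b^2 = c^2")],
        asr_segments=[AsrSegment(5.0, "Today we prove the Pythagorean theorem.")],
    )
    print(demo)
```

The point this sketch tries to capture is the temporal merge the abstract emphasizes: because keyframes and transcript segments come from the same lecture timeline, sorting by timestamp yields interleaved samples whose images and surrounding text are naturally aligned, unlike loosely paired web-crawled data.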

Hosts: Jingwen Liang, Gengyu Wang