VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Update: 2025-01-07

Description

🤗 Upvotes: 23 | cs.CV, cs.SD, eess.AS

Authors:

Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He

Title:

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Arxiv:

http://arxiv.org/abs/2501.01957v1

Abstract:

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to the fundamental differences between the modalities. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.

Jingwen Liang, Gengyu Wang
