DiscoverDaily Paper Cast

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Update: 2025-01-07
Description

🤗 Upvotes: 23 | cs.CV, cs.SD, eess.AS



Authors:

Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He



Title:

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction



Arxiv:

http://arxiv.org/abs/2501.01957v1



Abstract:

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to fundamental differences between the modalities. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capability but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating end-to-end multimodal response speed. By comparing our method against state-of-the-art counterparts on image, video, and speech benchmarks, we demonstrate that our model combines strong visual and speech capabilities, enabling near real-time vision and speech interaction.
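
The abstract's central engineering idea is a progressive training schedule: the LLM backbone is first aligned with vision, then taught to accept speech input, and finally given a speech-generation head so that no external ASR or TTS module sits in the loop. The sketch below illustrates that staged freeze/unfreeze pattern in PyTorch; it is a minimal toy, not the authors' implementation, and the module names, stage boundaries, and placeholder loss are all assumptions made for illustration.

```python
# Illustrative sketch of progressive multi-stage multimodal training.
# NOT the VITA-1.5 code: module names, stages, and losses are assumptions.
import torch
import torch.nn as nn

class ToyOmniModel(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.vision_encoder = nn.Linear(dim, dim)  # stand-in for a ViT
        self.audio_encoder = nn.Linear(dim, dim)   # stand-in for a speech encoder
        self.llm = nn.Linear(dim, dim)             # stand-in for the LLM backbone
        self.speech_head = nn.Linear(dim, dim)     # direct speech output (no TTS)

    def forward(self, img, wav):
        # Fuse modalities, run the backbone, emit speech features end-to-end.
        return self.speech_head(
            self.llm(self.vision_encoder(img) + self.audio_encoder(wav)))

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model: ToyOmniModel, stage: int) -> None:
    # Freeze everything, then unfreeze only what this stage trains,
    # so capabilities learned earlier (e.g. vision-language) are preserved.
    for m in model.children():
        set_trainable(m, False)
    if stage == 1:    # vision-language alignment
        set_trainable(model.vision_encoder, True)
        set_trainable(model.llm, True)
    elif stage == 2:  # speech-input understanding
        set_trainable(model.audio_encoder, True)
    elif stage == 3:  # speech generation, replacing a separate TTS module
        set_trainable(model.speech_head, True)

model = ToyOmniModel()
img, wav = torch.randn(8, 32), torch.randn(8, 32)
for stage in (1, 2, 3):
    configure_stage(model, stage)
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=1e-4)
    loss = model(img, wav).pow(2).mean()  # placeholder objective
    loss.backward()
    opt.step()
    opt.zero_grad()
    print(f"stage {stage}: updated {sum(p.numel() for p in params)} parameters")
```

The point of the staging is visible in the printout: each stage updates a different, deliberately small parameter subset, which is one plausible way to add speech abilities without degrading the vision-language capacity trained earlier.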

Hosts: Jingwen Liang, Gengyu Wang