DiscoverPapers Read on AIMini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

Update: 2024-10-31
Share

Description

GPT-4o, an all-encompassing model, represents a milestone in the development of large multi-modal language models. It can understand visual, auditory, and textual modalities, directly output audio, and support flexible duplex interaction. Models from the open-source community often achieve some functionalities of GPT-4o, such as visual understanding and voice chat. Nevertheless, training a unified model that incorporates all modalities is challenging due to the complexities of multi-modal data, intricate model architectures, and training processes. In this paper, we introduce Mini-Omni2, a visual-audio assistant capable of providing real-time, end-to-end voice responses to visoin and audio queries. By integrating pretrained visual and auditory encoders, Mini-Omni2 maintains performance in individual modalities. We propose a three-stage training process to align modalities, allowing the language model to handle multi-modal inputs and outputs after training on a limited dataset. For interaction, we introduce a command-based interruption mechanism, enabling more flexible interaction with users. To the best of our knowledge, Mini-Omni2 is one of the closest reproductions of GPT-4o, which have similar form of functionality, and we hope it can offer valuable insights for subsequent research.



2024: Zhifei Xie, Changqiao Wu







https://arxiv.org/pdf/2410.11190
Comments 
In Channel
loading
00:00
00:00
x

0.5x

0.8x

1.0x

1.25x

1.5x

2.0x

3.0x

Sleep Timer

Off

End of Episode

5 Minutes

10 Minutes

15 Minutes

30 Minutes

45 Minutes

60 Minutes

120 Minutes

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

Rob