How AI Learned to Chat About Pictures: Inside the MoshiVis Model

Update: 2025-04-02

Description

How do you teach a sophisticated speech AI to understand and discuss images, especially when paired image-speech data is rare?


This episode unpacks MoshiVis, a new model that achieves just that. We explore the challenges of building Vision-Speech Models and how MoshiVis overcomes them with a unique one-stage training pipeline, synthetic dialogues, and efficient "perceptual augmentation" techniques built upon the Moshi speech LLM.


Join us for a deep dive into the tech that lets AI see, speak, and converse fluidly about the visual world.

GenAI Level UP