How AI Learned to Chat About Pictures: Inside the MoshiVis Model
Update: 2025-04-02
Description
How do you teach a sophisticated speech AI to understand and discuss images, especially when paired image-speech data is rare?
This episode unpacks MoshiVis, a new model that does just that. We explore the challenges of building Vision-Speech Models and how MoshiVis overcomes them with a one-stage training pipeline, synthetic dialogues, and efficient "perceptual augmentation" techniques built on the Moshi speech LLM.
Join us for a deep dive into the tech that lets AI see, speak, and converse fluidly about the visual world.