Code Generation & Synthetic Data With Loubna Ben Allal #51
Description
Our guest today is Loubna Ben Allal, Machine Learning Engineer at Hugging Face 🤗.
In our conversation, Loubna first explains how she built two impressive code generation models: StarCoder and StarCoder2. We dig into the importance of data when training large models and what can be done on the data side to improve LLM performance.
We then dive into synthetic data generation and discuss its pros and cons. Loubna explains how she built Cosmopedia, a fully synthetic dataset generated using Mixtral 8x7B.
Loubna also shares career mistakes, advice and her take on the future of developers and code generation.
If you enjoyed the episode, please leave a 5-star review and subscribe to the AI Stories YouTube channel.
Cosmopedia Dataset: https://huggingface.co/blog/cosmopedia
StarCoder blog post: https://huggingface.co/blog/starcoder
Follow Loubna on LinkedIn: https://www.linkedin.com/in/loubna-ben-allal-238690152/
Follow Neil on LinkedIn: https://www.linkedin.com/in/leiserneil/
---
(00:00) - Intro
(02:00) - How Loubna Got Into Data & AI
(03:57) - Internship at Hugging Face
(06:21) - Building A Code Generation Model: StarCoder
(12:14) - Data Filtering Techniques for LLMs
(18:44) - Training StarCoder
(21:35) - Will GenAI Replace Developers?
(25:44) - Synthetic Data Generation & Building Cosmopedia
(35:44) - Evaluating a 1B Params Model Trained on Synthetic Data
(43:43) - Challenges Faced & Career Advice