Code Generation & Synthetic Data With Loubna Ben Allal #51


Update: 2024-11-07

Description

Our guest today is Loubna Ben Allal, Machine Learning Engineer at Hugging Face 🤗.

In our conversation, Loubna first explains how she built two impressive code generation models: StarCoder and StarCoder2. We dig into the importance of data when training large models and what can be done on the data side to improve LLM performance.

We then dive into synthetic data generation and discuss its pros and cons. Loubna explains how she built Cosmopedia, a fully synthetic dataset generated using Mixtral 8x7B.

Loubna also shares career mistakes, advice, and her take on the future of developers and code generation.

If you enjoyed the episode, please leave a 5-star review and subscribe to the AI Stories YouTube channel.

Cosmopedia Dataset: https://huggingface.co/blog/cosmopedia

StarCoder blog post: https://huggingface.co/blog/starcoder

Follow Loubna on LinkedIn: https://www.linkedin.com/in/loubna-ben-allal-238690152/

Follow Neil on LinkedIn: https://www.linkedin.com/in/leiserneil/

---

(00:00) - Intro

(02:00) - How Loubna Got Into Data & AI

(03:57) - Internship at Hugging Face

(06:21) - Building A Code Generation Model: StarCoder

(12:14) - Data Filtering Techniques for LLMs

(18:44) - Training StarCoder

(21:35) - Will GenAI Replace Developers?

(25:44) - Synthetic Data Generation & Building Cosmopedia

(35:44) - Evaluating a 1B Params Model Trained on Synthetic Data

(43:43) - Challenges Faced & Career Advice


Neil Leiser