The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix

Updated: 2025-11-25
Description

Today's podcast is based on an article from Hugging Face detailing an extensive research project that addresses the high cost and scale of training modern large language models. The authors, through over 50 systematic experiments, sought to find an optimal data mixing strategy that would allow a GPT-2 model to achieve comparable performance to models trained on ten times the data. Their central finding is that a static dataset mix of 50% finePDFs, 30% DCLM-baseline, and 20% FineWeb-Edu significantly outperforms more complex curriculum learning approaches, which often led to catastrophic forgetting or overfitting. This optimal 50-30-20 mixture successfully trained a GPT-2-70M model that achieved over 90% of the original GPT-2's benchmark performance while using substantially fewer resources. The key takeaway is that dataset quality and intelligent composition are more critical than sheer quantity for training efficient language models.
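The static mixing strategy described above can be sketched in a few lines. The snippet below is a minimal illustration using the Hugging Face `datasets` library; the repository IDs, the "text" field, and the streaming setup are assumptions for illustration and may differ from the exact subsets and preprocessing used in the article.

```python
from datasets import load_dataset, interleave_datasets

# Stream the three corpora. Repo IDs and the "text" field are assumptions
# for illustration; the article's exact subsets may differ.
finepdfs = load_dataset("HuggingFaceFW/finepdfs", split="train", streaming=True)
dclm = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

# Static 50-30-20 mix: the sampling probabilities stay fixed for the whole
# run, with no curriculum schedule changing the ratio over time.
mixed = interleave_datasets(
    [finepdfs, dclm, fineweb_edu],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)

# Peek at a few interleaved examples.
for example in mixed.take(3):
    print(example["text"][:100])
```

In practice, tokenization and batching would sit on top of this stream; the key point from the article is simply that the proportions remain constant rather than following a curriculum.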


Read the full article at https://huggingface.co/blog/codelion/optimal-dataset-mixing

