Interconnects Audio

We aren't running out of training data, we are running out of open training data

Update: 2024-05-29
Description

Data licensing deals, scaling, human inputs, and repeating trends in open vs. closed.
This is AI-generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/the-data-wall

0:00 We aren't running out of training data, we are running out of open training data
2:51 Synthetic data: 1 trillion new tokens per day
4:18 Data licensing deals: High costs per token
6:33 Better tokens: Search and new frontiers

Nathan Lambert