Common Crawl: Archiving the Web for AI and Research

Update: 2025-04-01

Description

Common Crawl is a non-profit organisation established in 2007 with the aim of providing an openly accessible archive of the World Wide Web. This massive collection of crawled web data began in 2008 and has grown substantially, becoming a crucial resource for researchers and developers, particularly in the field of artificial intelligence. Milestones include Amazon Web Services hosting the archive from 2012, the adoption of the Nutch crawler in 2013, and the pivotal use of its data to train influential large language models like GPT-3 starting around 2020. The organisation continues to collect billions of web pages, offering raw HTML, metadata, and extracted text in formats like WARC, WAT, and WET, thereby facilitating diverse analyses and the training of sophisticated AI systems.
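As a minimal sketch (not taken from the source), the snippet below shows how one of the archive's WARC files might be read with the open-source warcio library, iterating over the HTTP response records to recover each page's URL and raw HTML; the file name is a hypothetical placeholder for a locally downloaded crawl segment.

from warcio.archiveiterator import ArchiveIterator

# Hypothetical local copy of a single Common Crawl WARC segment.
WARC_PATH = "example-segment.warc.gz"

with open(WARC_PATH, "rb") as stream:
    for record in ArchiveIterator(stream):
        # 'response' records carry the raw HTTP response, including the HTML body.
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            print(url, len(html), "bytes")

The WAT and WET files mentioned above carry the corresponding metadata and extracted plain text and are read in the same record-by-record fashion.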


Benjamin Alloul · NotebookLM