Jack Cushman: Libraries and the Data.gov Archive
Description
Welcome to the Captured podcast! For our first episode we interview Jack Cushman, Director of the Library Innovation Lab (LIL) at the Harvard Law School Library. Our conversation explores the field of digital preservation, through which organizations and communities capture, maintain, and share digital societal knowledge. This is a rapidly changing field, only becoming more essential as everything from schools and libraries to laws, infrastructure, and AI systems becomes dependent on our shared but impermanent digital records.
Jack talked to us about his lab’s work to preserve web pages and public data and protect them from link rot, and to make public domain works more meaningfully accessible: i.e., accessible to anyone with a browser. This includes the Caselaw Access Project, which brought 360 years of US case law to the public web, and the Data.gov Archive, which captures a rolling snapshot of public datasets compiled by federal and regional governments in the US. We also asked him about the changing role of libraries, and what individuals and smaller organizations can do to preserve and archive data in our own lives.
Overview
🗃️ Archiving in an era of transience : the End of Term Archive is a long-running project to capture the entirety of the federal web at each presidential transition. Jack describes his lab’s contribution to EOTA, the Data.gov Archive, capturing the datasets indexed by data.gov.
🔗 How can we fight robustly against link rot? Even Supreme Court decisions cite web URLs, and today nearly half those citations already don’t work. Enter projects like Perma.cc, which create reliable web archives so our legal and historical memory stays intact.
⚖️ Making the law accessible for everyday citizens : While in theory U.S. case law is freely available to those who can access a physical repository, the university-led Caselaw Access Project recently had to digitize 360 years (over 40 million pages) of case law, overcoming copyright hurdles and opening the door to new legal tech innovation and tools for access to justice.
🌍 The importance and fragility of public data : Public data gives us extraordinary insight into our world; yet sometimes critical city datasets are saved from oblivion on a volunteer’s hard drive. Preservation needs systemic efforts, and lots of copies.
🤝 Models for effective grassroots preservation : Projects like the Data Rescue Project and Safeguarding Research & Culture (safeguar.de) work alongside established institutions to preserve knowledge; community networks like r/DataHoarder highlight the breadth of public interest. But both need more support.
💾 How to get started as a community archivist : Jack shared some recommendations for how to contribute to global projects and how to start preserving the public knowledge you depend on in your own life and work.
Lawful, Good Limericks : The Case Law Limerick Generator draws each line from historical case law… because when data is truly free, anything becomes possible.
Timestamps
02:14 The Data.gov Archive and the End of Term Archive
05:34 Public data: Everything you need to make sense of your world
10:59 Storage: Where the data.gov archive is stored
14:02 Provenance and the historical role of libraries: BagIt, Library of Congress
22:23 The Caselaw Access Project: empowering people via access to law
36:15 The fragility of public access to essential data
37:29 Addressing link rot with Perma.cc
41:40 The current role of libraries for digital records
45:45 Grassroots preservation: Data Rescue Project, Safeguar.de, data hoarders
53:33 LOCKSS: Lots of Copies Keep Stuff Safe
57:37 AI applications for access and preservation
1:04:20 How to get started with personal or community archiving
1:08:44 Case Law limerick generator FTW
About Jack and Harvard LIL
The Library Innovation Lab at Harvard Law School is a software and design lab focused on digital legal research, preservation, and information access. Its wide-ranging projects include Perma.cc, an archiving service that protects scholarly citations from link rot; the H2O open casebook platform, for collaborative creation of legal casebooks; the Caselaw Access Project (CAP); and the Data.gov Archive.
Mentioned in this episode
Preserving national data
* End of Term Archive – Archiving US gov. sites and data, every four years
* SUCHO (Saving Ukrainian Cultural Heritage Online) – global volunteer effort
* Data Rescue Project – Preserving federal data in the US, building on SUCHO
* CourtListener / Free Law Project – US legal info, court opinions, and case.law
Preserving large datasets
* Internet Archive – The Web's largest digital library, since 1996
* safeguar.de – A German index of torrents of large scientific and cultural datasets.
* Source Co-op – Nonprofit facilitating archival S3 storage for public datasets.
Web crawlers and scrapers
* Wayback Machine – The largest web archive, hosted by the Internet Archive
* Webrecorder – Provider of high-fidelity web archiving, incl. ArchiveWeb.page
Other efforts
* Environmental Data Governance Initiative (EDGI) – For climate data.
* Public Environmental Data Partnership (PEDP) – For public environmental data.
* PubMed – Search and retrieval of biomedical and life sciences literature.
* The LOCKSS project (Lots of Copies Keep Stuff Safe) – Stanford library initiative
* r/DataHoarder – a community of 800,000 individual data collectors and archivists.
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit capturedpod.substack.com






