
Great Data Products

Author: Source Cooperative


Description

A podcast about the ergonomics and craft of data. Brought to you by Source Cooperative.
6 Episodes
Jed talks with Kwin Keuter and Brad Andrick, geospatial software engineers at Earth Genome, about the Storm Events Database Explorer. This collaborative project between Earth Genome, The Commons, and the Internet of Water Coalition provides access to over 1.9 million U.S. severe weather events, spanning 70+ years of storm records from NOAA's National Centers for Environmental Information (NCEI), including tornadoes, floods, hail, and hurricanes.

Links and Resources:
- Storm Events Database Explorer — interactive map and search interface
- Storm Events Database on Source Cooperative — cloud-optimized Parquet files
- Earth Genome blog post on the project — technical process and discovery work
- The Commons case study — project background
- NOAA Storm Events Database — original NOAA dataset and beta interface
- GeoParquet.io — Chris Holmes's project for working with Parquet files

More show notes and transcript at https://greatdataproducts.com/episodes/2026/02/keuter-andrick-storm-events/
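The storm events data is published as cloud-optimized Parquet, a format that suits object storage because Parquet keeps its metadata in a footer at the end of the file: a client can locate everything with a small HTTP range request instead of downloading the whole file. A minimal sketch of that mechanic; the byte values and helper names are illustrative, not from the episode.

```python
import struct

# Parquet files end with a fixed footer: a 4-byte little-endian length of the
# file metadata, followed by the 4-byte magic "PAR1". Cloud-optimized readers
# exploit this: a range request for the last few bytes locates the metadata,
# so a query can then fetch only the row groups and columns it needs.

PARQUET_MAGIC = b"PAR1"

def parse_footer(tail: bytes) -> int:
    """Given the last 8 bytes of a Parquet file, return the metadata length."""
    if len(tail) < 8 or tail[-4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file")
    (meta_len,) = struct.unpack("<I", tail[-8:-4])
    return meta_len

def range_header(start: int, end: int) -> str:
    """HTTP Range header for an inclusive byte range, e.g. a file's tail."""
    return f"bytes={start}-{end}"

# A fabricated 8-byte tail standing in for a real remote file:
fake_tail = struct.pack("<I", 1234) + PARQUET_MAGIC
assert parse_footer(fake_tail) == 1234
assert range_header(9992, 9999) == "bytes=9992-9999"
```

In practice a tool like DuckDB or pyarrow performs these range reads for you; the sketch only shows why the footer-at-the-end layout makes remote queries cheap.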
Jed talks with Denice Ross, Senior Fellow at the Federation of American Scientists and former U.S. Chief Data Scientist, about federal data's role in American life and what happens when government data tools sunset. Denice led efforts to use disaggregated data to drive better outcomes for all Americans during her time as Deputy U.S. Chief Technology Officer, and now works on a Federal Data Use Case Repository documenting how federal datasets affect everyday decisions.

The conversation explores how open data initiatives have evolved over the years and how administrative priorities shape the availability of public data tools. Denice emphasizes that federal data underpins economic growth, public health decisions, and governance at every level. She describes how data users can engage with data stewards to create feedback loops that improve data quality, and why nonprofits and civil society organizations play an essential role in both data collection and advocacy. Denice and Jed also examine the balance between official government data products and innovative tools built by external organizations, discussing creative solutions for filling data gaps, the value of labeling tools as "powered by federal data" to help preserve the datasets behind them, and strategies for protecting federal data accessibility for the long term.

Links and Resources:
- Denice Ross at the Federation of American Scientists: https://fas.org/expert/denice-ross/
- The federal data and tools that died this year (Marketplace): https://www.marketplace.org/episode/2025/11/25/the-federal-data-and-tools-that-died-this-year

Takeaways:
1. Federal data underpins daily life — from public health decisions to economic planning, federal datasets inform choices that affect Americans whether they realize it or not.
2. Data tools require active protection — when administrative priorities shift, public data tools can disappear. Building awareness of data dependencies helps preserve access.
3. Feedback loops improve data quality — data users should engage directly with data stewards. Public participation in the data lifecycle leads to better, more relevant datasets.
4. Civil society fills critical gaps — nonprofits and external organizations can collect data and advocate for data resources in ways government cannot.
5. Disaggregated data drives equity — breaking down aggregate statistics reveals disparities and enables targeted interventions that benefit underserved communities.
6. External innovation complements government stability — a healthy ecosystem keeps federal data stable while enabling community-driven tools to evolve and serve specific needs.

Great Data Products is brought to you by Source Cooperative. Learn more at https://greatdataproducts.com
[Jed's audio in this episode sounds terrible because of a hardware setting that Marshall Moutenot very kindly helped us identify. It will sound better in future episodes!]

Jed talks with Matt Hanson from Element 84 about the SpatioTemporal Asset Catalog (STAC) specification and its role in making geospatial data findable and usable. Matt describes STAC as "a simple, developer-friendly way to describe geospatial data so that people can actually find it and use it." The conversation covers how STAC emerged from a 2017 sprint in Boulder with 20 people and grew into a specification now adopted by NASA, USGS, and commercial satellite companies worldwide. Matt discusses the concept of "guerrilla standards," why adoption is the only metric that matters, the limitations of remote sensing, and why credibility can't be skipped when launching standards efforts.

Full show notes and transcript: https://greatdataproducts.com/episodes/2025/12/hanson-stac/

Links and Resources:
- STAC Specification: https://stacspec.org/
- STAC: A Retrospective, Part 2: https://element84.com/software-engineering/stac-a-retrospective-part-2-why-stac-was-successful/
- Emergent Standards white paper: https://tial.org/publications/white-paper-003-emergent-standards-enabling-collaborations-across-institutions/
- STAC Auth Proxy: https://github.com/developmentseed/stac-auth-proxy
- FilmDrop UI: https://console.demo.filmdrop.element84.com/
- Planet Planetary Variables: https://www.planet.com/products/planetary-variables/
- CommonSpace: https://www.commonspace.world/
- "You Just Haven't Earned It Yet Baby": https://www.youtube.com/watch?v=jc9F0bh5OXc

Great Data Products is brought to you by Source Cooperative: https://source.coop
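The "simple, developer-friendly" claim is concrete: a STAC Item is just a GeoJSON Feature with a handful of extra fields (`stac_version`, `id`, a `datetime` property, `assets`, `links`). A sketch of a minimal Item; the id, coordinates, and asset URL are invented for illustration.

```python
import json
from datetime import datetime, timezone

# A minimal STAC Item: plain GeoJSON plus a few STAC fields. Because it is
# ordinary JSON, it can be served as static files and crawled or indexed
# without any special server. All identifiers below are made up.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-20250101",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-105.3, 39.9], [-105.1, 39.9],
                         [-105.1, 40.1], [-105.3, 40.1], [-105.3, 39.9]]],
    },
    "bbox": [-105.3, 39.9, -105.1, 40.1],
    "properties": {
        # Acquisition time, RFC 3339 formatted.
        "datetime": datetime(2025, 1, 1, tzinfo=timezone.utc).isoformat(),
    },
    "assets": {
        "visual": {
            "href": "https://example.com/scenes/example-scene-20250101.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    },
    "links": [],
}

# Round-trips as ordinary JSON.
assert json.loads(json.dumps(item))["stac_version"] == "1.0.0"
```

Libraries like pystac build on exactly this structure, but nothing more than a JSON file is required to participate in a catalog.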
Jed talks with Jack Cushman from the Harvard Law School Library Innovation Lab about their project to archive and preserve more than 311,000 datasets from Data.gov. We explore how they use BagIt for long-term preservation, how they built a serverless search interface that makes 17.9 TB of data discoverable in the browser, and what this means for the future of online archives.
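BagIt, the packaging convention mentioned above (RFC 8493), is deliberately plain: payload files go under `data/`, a short declaration goes in `bagit.txt`, and `manifest-sha256.txt` records a checksum for every payload file so fixity can be re-verified years later. A minimal sketch of creating and re-verifying a bag; the payload file and its contents are invented.

```python
import hashlib
import os
import tempfile

def make_bag(bag_dir: str, payload: dict) -> None:
    """Write a minimal BagIt bag: declaration, data/ payload, sha256 manifest."""
    os.makedirs(os.path.join(bag_dir, "data"), exist_ok=True)
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
    lines = []
    for name, content in payload.items():
        with open(os.path.join(bag_dir, "data", name), "wb") as f:
            f.write(content)
        digest = hashlib.sha256(content).hexdigest()
        lines.append(f"{digest}  data/{name}\n")
    with open(os.path.join(bag_dir, "manifest-sha256.txt"), "w") as f:
        f.writelines(lines)

def verify_bag(bag_dir: str) -> bool:
    """Recompute each checksum listed in the manifest and compare."""
    with open(os.path.join(bag_dir, "manifest-sha256.txt")) as f:
        for line in f:
            digest, rel = line.strip().split("  ", 1)
            with open(os.path.join(bag_dir, rel), "rb") as payload_file:
                if hashlib.sha256(payload_file.read()).hexdigest() != digest:
                    return False
    return True

with tempfile.TemporaryDirectory() as d:
    make_bag(d, {"datasets.csv": b"id,title\n1,Example\n"})
    assert verify_bag(d)
```

Because a bag is just a directory layout plus checksums, it survives being copied across filesystems, tapes, and object stores without any special tooling on the other end.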
Protomaps and PMTiles

2025-11-01 01:17:14

Jed talks with Brandon Liu about building maps for the web with Protomaps and PMTiles. We cover why new formats won't work without a compelling application, how a single-file base map functions as a reusable data product, designing simple specs for long-term usability, and how object storage-based approaches can replace server-based stacks while staying fast and easy to integrate. Many thanks to our listeners from Norway and Egypt who stayed up very late for the live stream!

Links and Resources:
- Protomaps — a free, customizable base map you can self-host
- PMTiles Viewer — drag-and-drop viewer for .pmtiles files
- Google-Microsoft-OSM Open Buildings, combined by VIDA — browse 2.7 billion building footprints in PMTiles on Source
- Emergent standards white paper from the Institutional Architecture Lab

Key takeaways:
1. Ship a killer app if you want a new format to gain traction — the Protomaps base map is the product that makes the PMTiles format matter.
2. Single-file, object storage first — PMTiles runs from a bucket or an SD card, with a browser-based viewer for offline use.
3. Design simple, future-proof specifications — keep formats small and reimplementable with minimal dependencies; simplicity preserves longevity and portability.
4. Prioritize the developer experience — single-binary installs, easy local preview, and eliminating incidental complexity drive adoption more than raw capability.
5. Build the right pipeline for the job — separate visualization-optimized packaging from analysis-ready data; don't force one format to do everything.
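The "object storage first" point hinges on HTTP range requests: a PMTiles archive is one file with a fixed-size header at the start, so a client reads the header first and then fetches only the byte ranges it needs, with no tile server in between. A sketch under the assumption of the v3 layout (the magic bytes and spec-version byte are per the spec; every other header field is omitted here, and the fabricated bytes stand in for a real read from a bucket).

```python
# PMTiles packs an entire tile pyramid into one file. Clients never download
# the whole archive: they range-request the fixed-size header, then follow
# offsets in it to the directory entries and tile bytes they need.

HEADER_SIZE = 127  # PMTiles v3 reserves the first 127 bytes for the header

def check_header(header: bytes) -> int:
    """Validate the magic bytes and return the spec version byte."""
    if header[:7] != b"PMTiles":
        raise ValueError("not a PMTiles archive")
    return header[7]

def range_header(start: int, length: int) -> str:
    """HTTP Range header for `length` bytes starting at `start`."""
    return f"bytes={start}-{start + length - 1}"

# A fabricated header standing in for the first read against a real bucket:
fake = b"PMTiles" + bytes([3]) + bytes(HEADER_SIZE - 8)
assert check_header(fake) == 3
assert range_header(0, HEADER_SIZE) == "bytes=0-126"
```

The same two-step pattern (read a small fixed region, then seek by offset) is what lets a static bucket or an SD card stand in for a tile server.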
Jed Sundwall and Drew Breunig explore why LLM progress is getting harder by examining the foundational data products that powered AI breakthroughs. They discuss how we've consumed the "low-hanging fruit" of internet data and graphics innovations, and what this means for the future of AI development.

The conversation traces three datasets that shaped AI: MNIST (1994), the handwritten digits dataset that became machine learning's "Hello World"; ImageNet (2008), Fei-Fei Li's image dataset that launched deep learning through AlexNet's 2012 breakthrough; and Common Crawl (2007), Gil Elbaz's web crawling project that fueled 60% of GPT-3's training data. Drew argues that great data products create ecosystems around themselves, using the Enron email dataset as an example of how a single data release can generate thousands of research papers and enable countless startups. The episode concludes with a discussion of benchmarks as modern data products and the challenge of creating sustainable data infrastructure for the next generation of AI systems.

Links and Resources:
- Common Crawl Foundation Event — October 22nd event at Stanford!
- Cloud-Native Geospatial Forum Conference 2026 — 6-9 October 2026 at Snowbird in Utah!
- Why LLM Advancements Have Slowed: The Low-Hanging Fruit Has Been Eaten — Drew's blog post that inspired this conversation
- Unicorns, Show Ponies, and Gazelles — Jed's vision for sustainable data organizations
- ARC AGI Benchmark — François Chollet's reasoning benchmark
- Thinking Machines Lab — Mira Murati's reproducibility research lab
- Terminal Bench — Stanford's coding agent evaluation benchmark
- Data Science at the Singularity — David Donoho's masterful paper examining the power of frictionless reproducibility
- Rethinking Dataset Discovery with DataScout — new paper examining dataset discovery
- MNIST Dataset — the foundational machine learning dataset on Hugging Face

Key Takeaways:
1. Great data products create ecosystems — they don't just provide data, they enable entire communities and industries to flourish.
2. Benchmarks are data products with intent — they encode values and shape the direction of AI development.
3. We've consumed the easy wins — the internet and graphics innovations that powered early AI breakthroughs are largely exhausted.
4. The future is specialized — progress will come from domain-specific datasets, benchmarks, and applications rather than general models.
5. Data markets need new models — traditional approaches to data sharing may not work in the AI era.