Episode 57: AI Agents and LLM Judges at Scale: Processing Millions of Documents (Without Breaking the Bank)

Update: 2025-08-29

Description

While many people talk about “agents,” Shreya Shankar (UC Berkeley) has been building the systems that make them reliable. In this episode, she shares how AI agents and LLM judges can be used to process millions of documents accurately and cheaply.

Drawing from work on projects ranging from databases of police misconduct reports to large-scale customer transcripts, Shreya explains the frameworks, error analysis, and guardrails needed to turn flaky LLM outputs into trustworthy pipelines.

We talk through:

Treating LLM workflows as ETL pipelines for unstructured text

Error analysis: why you need humans reviewing the first 50–100 traces

Guardrails like retries, validators, and “gleaning”

How LLM judges work — rubrics, pairwise comparisons, and cost trade-offs

Cheap vs. expensive models: when to swap for savings

Where agents fit in (and where they don’t)

If you’ve ever wondered how to move beyond unreliable demos, this episode shows how to scale LLMs to millions of documents — without breaking the bank.

LINKS

Shreya's website

DocETL, A system for LLM-powered data processing

Upcoming Events on Luma

Watch the podcast video on YouTube

Shreya's AI evals course, which she teaches with Hamel "Evals" Husain

🎓 Learn more:

Hugo's course: Building LLM Applications for Data Scientists and Software Engineers — https://maven.com/s/course/d56067f338

Comments

In Channel

Episode 64: Data Science Meets Agentic AI with Michael Kennedy (Talk Python)

2025-12-0301:02:56

Episode 63: Why Gemini 3 Will Change How You Build AI Agents with Ravin Kumar (Google DeepMind)

2025-11-2201:00:12

Episode 62: Practical AI at Work: How Execs and Developers Can Actually Use LLMs

2025-10-3159:04

Episode 61: The AI Agent Reliability Cliff: What Happens When Tools Fail in Production

2025-10-1628:04

Episode 60: 10 Things I Hate About AI Evals with Hamel Husain

2025-09-3001:13:15

Episode 59: Patterns and Anti-Patterns For Building with AI

2025-09-2347:37

Episode 58: Building GenAI Systems That Make Business Decisions with Thomas Wiecki (PyMC Labs)

2025-09-0901:00:45

Episode 57: AI Agents and LLM Judges at Scale: Processing Millions of Documents (Without Breaking the Bank)

2025-08-2941:27

Episode 56: DeepMind Just Dropped Gemma 270M... And Here’s Why It Matters

2025-08-1445:40

Episode 55: From Frittatas to Production LLMs: Breakfast at SciPy

2025-08-1238:08

Episode 54: Scaling AI: From Colab to Clusters — A Practitioner’s Guide to Distributed Training and Inference

2025-07-1841:17

Episode 53: Human-Seeded Evals & Self-Tuning Agents: Samuel Colvin on Shipping Reliable LLMs

2025-07-0844:49

Episode 52: Why Most LLM Products Break at Retrieval (And How to Fix Them)

2025-07-0228:38

Episode 51: Why We Built an MCP Server and What Broke First

2025-06-2647:41

Episode 50: A Field Guide to Rapidly Improving AI Products -- With Hamel Husain

2025-06-1727:42

Episode 49: Why Data and AI Still Break at Scale (and What to Do About It)

2025-06-0501:21:45

Episode 48: HOW TO BENCHMARK AGI WITH GREG KAMRADT

2025-05-2301:04:25

Episode 47: The Great Pacific Garbage Patch of Code Slop with Joe Reis

2025-04-0701:19:12

Episode 46: Software Composition Is the New Vibe Coding

2025-04-0301:08:57

Episode 45: Your AI application is broken. Here’s what to do about it.

2025-02-2001:17:30

00:00

Episode 57: AI Agents and LLM Judges at Scale: Processing Millions of Documents (Without Breaking the Bank)

#box-pro-ellipsis-176507321453786{-webkit-line-clamp:2;}Episode 57: AI Agents and LLM Judges at Scale: Processing Millions of Documents (Without Breaking the Bank)

Episode 57: AI Agents and LLM Judges at Scale: Processing Millions of Documents (Without Breaking the Bank)

Hugo Bowne-Anderson

Episode 57: AI Agents and LLM Judges at Scale: Processing Millions of Documents (Without Breaking the Bank)