Vanishing Gradients

64 Episodes

Reverse

Episode 64: Data Science Meets Agentic AI with Michael Kennedy (Talk Python)

2025-12-0301:02:56

We have been sold a story of complexity. Michael Kennedy (Talk Python) argues we can escape this by relentlessly focusing on the problem at hand, reducing costs by orders of magnitude in software, data, and AI. In this episode, Michael joins Hugo to dig into the practical side of running Python systems at scale. They connect these ideas to the data science workflow, exploring which software engineering practices allow AI teams to ship faster and with more confidence. They also detail how to deploy systems without unnecessary complexity and how Agentic AI is fundamentally reshaping development workflows. We talk through: - Escaping complexity hell to reduce costs and gain autonomy - The specific software practices, like the "Docker Barrier", that matter most for data scientists - How to replace complex cloud services with a simple, robust $30/month stack - The shift from writing code to "systems thinking" in the age of Agentic AI - How to manage the people-pleasing psychology of AI agents to prevent broken code - Why struggle is still essential for learning, even when AI can do the work for you LINKS Talk Python In Production, the Book! (https://talkpython.fm/books/python-in-production) Just Enough Python for Data Scientists Course (https://training.talkpython.fm/courses/just-enough-python-for-data-scientists) Agentic AI Programming for Python Course (https://training.talkpython.fm/courses/agentic-ai-programming-for-python) Talk Python To Me (https://talkpython.fm/) and a recent episode with Hugo as guest: Building Data Science with Foundation LLM Models (https://talkpython.fm/episodes/show/526/building-data-science-with-foundation-llm-models) Python Bytes podcast (https://pythonbytes.fm/) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) Watch the podcast video on YouTube (https://youtube.com/live/jfSRxxO3aRo?feature=share) Join the final cohort of our Building AI Applications course starting Jan 12, 2026 (35% off for listeners) (https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgrav): https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgrav

Episode 63: Why Gemini 3 Will Change How You Build AI Agents with Ravin Kumar (Google DeepMind)

2025-11-2201:00:12

Gemini 3 is a few days old and the massive leap in performance and model reasoning has big implications for builders: as models begin to self-heal, builders are literally tearing out the functionality they built just months ago... ripping out the defensive coding and reshipping their agent harnesses entirely. Ravin Kumar (Google DeepMind) joins Hugo to breaks down exactly why the rapid evolution of models like Gemini 3 is changing how we build software. They detail the shift from simple tool calling to building reliable "Agent Harnesses", explore the architectural tradeoffs between deterministic workflows and high-agency systems, the nuance of preventing context rot in massive windows, and why proper evaluation infrastructure is the only way to manage the chaos of autonomous loops. They talk through: - The implications of models that can "self-heal" and fix their own code - The two cultures of agents: LLM workflows with a few tools versus when you should unleash high-agency, autonomous systems. - Inside NotebookLM: moving from prototypes to viral production features like Audio Overviews - Why Needle in a Haystack benchmarks often fail to predict real-world performance - How to build agent harnesses that turn model capabilities into product velocity - The shift from measuring latency to managing time-to-compute for reasoning tasks LINKS From Context Engineering to AI Agent Harnesses: The New Software Discipline, a podcast Hugo did with Lance Martin, LangChain (https://high-signal.delphina.ai/episode/context-engineering-to-ai-agent-harnesses-the-new-software-discipline) Context Rot: How Increasing Input Tokens Impacts LLM Performance (https://research.trychroma.com/context-rot) Effective context engineering for AI agents by Anthropic (https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) Watch the podcast video on YouTube (https://youtu.be/CloimQsQuJM) Join the final cohort of our Building AI Applications course starting Jan 12, 2026 (https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgrav): https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgrav

Episode 62: Practical AI at Work: How Execs and Developers Can Actually Use LLMs

2025-10-3159:04

Many leaders are trapped between chasing ambitious, ill-defined AI projects and the paralysis of not knowing where to start. Dr. Randall Olson argues that the real opportunity isn't in moonshots, but in the "trillions of dollars of business value" available right now. As co-founder of Wyrd Studios, he bridges the gap between data science, AI engineering, and executive strategy to deliver a practical framework for execution. In this episode, Randy and Hugo lay out how to find and solve what might be considered "boring but valuable" problems, like an EdTech company automating 20% of its support tickets with a simple retrieval bot instead of a complex AI tutor. They discuss how to move incrementally along the "agentic spectrum" and why treating AI evaluation with the same rigor as software engineering is non-negotiable for building a disciplined, high-impact AI strategy. They talk through: How a non-technical leader can prototype a complex insurance claim classifier using just photos and a ChatGPT subscription. The agentic spectrum: Why you should start by automating meeting summaries before attempting to build fully autonomous agents. The practical first step for any executive: Building a personal knowledge base with meeting transcripts and strategy docs to get tailored AI advice. Why treating AI evaluation with the same rigor as unit testing is essential for shipping reliable products. The organizational shift required to unlock long-term AI gains, even if it means a short-term productivity dip. LINKS Randy on LinkedIn (https://www.zenml.io/llmops-database) Wyrd Studios (https://thewyrdstudios.com/) Stop Building AI Agents (https://www.decodingai.com/p/stop-building-ai-agents) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) Watch the podcast video on YouTube (https://youtu.be/-YQjKH3wRvc) 🎓 Learn more: In Hugo's course: Building AI Applications for Data Scientists and Software Engineers (https://maven.com/hugo-stefan/building-llm-apps-ds-and-swe-from-first-principles?promoCode=AI20) — https://maven.com/hugo-stefan/building-llm-apps-ds-and-swe-from-first-principles?promoCode=AI20 Next cohort starts November 3: come build with us!

Episode 61: The AI Agent Reliability Cliff: What Happens When Tools Fail in Production

2025-10-1628:04

Most AI teams find their multi-agent systems devolving into chaos, but ML Engineer Alex Strick van Linschoten argues they are ignoring the production reality. In this episode, he draws on insights from the LLM Ops Database (750+ real-world deployments then; now nearly 1,000!) to systematically measure and engineer constraint, turning unreliable prototypes into robust, enterprise-ready AI. Drawing from his work at Zen ML, Alex details why success requires scaling down and enforcing MLOps discipline to navigate the unpredictable "Agent Reliability Cliff". He provides the essential architectural shifts, evaluation hygiene techniques, and practical steps needed to move beyond guesswork and build scalable, trustworthy AI products. We talk through: - Why "shoving a thousand agents" into an app is the fastest route to unmanageable chaos - The essential MLOps hygiene (tracing and continuous evals) that most teams skip - The optimal (and very low) limit for the number of tools an agent can reliably use - How to use human-in-the-loop strategies to manage the risk of autonomous failure in high-sensitivity domains - The principle of using simple Python/RegEx before resorting to costly LLM judges LINKS The LLMOps Database: 925 entries as of today....submit a use case to help it get to 1K! (https://www.zenml.io/llmops-database) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) Watch the podcast video on YouTube (https://youtu.be/-YQjKH3wRvc) 🎓 Learn more: -This was a guest Q&A from Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/hugo-stefan/building-llm-apps-ds-and-swe-from-first-principles?promoCode=AI20) — https://maven.com/hugo-stefan/building-llm-apps-ds-and-swe-from-first-principles?promoCode=AI20 Next cohort starts November 3: come build with us!

Episode 60: 10 Things I Hate About AI Evals with Hamel Husain

2025-09-3001:13:15

Most AI teams find "evals" frustrating, but ML Engineer Hamel Husain argues they’re just using the wrong playbook. In this episode, he lays out a data-centric approach to systematically measure and improve AI, turning unreliable prototypes into robust, production-ready systems. Drawing from his experience getting countless teams unstuck, Hamel explains why the solution requires a "revenge of the data scientists." He details the essential mindset shifts, error analysis techniques, and practical steps needed to move beyond guesswork and build AI products you can actually trust. We talk through: The 10(+1) critical mistakes that cause teams to waste time on evals Why "hallucination scores" are a waste of time (and what to measure instead) The manual review process that finds major issues in hours, not weeks A step-by-step method for building LLM judges you can actually trust How to use domain experts without getting stuck in endless review committees Guest Bryan Bischof's "Failure as a Funnel" for debugging complex AI agents If you're tired of ambiguous "vibe checks" and want a clear process that delivers real improvement, this episode provides the definitive roadmap. LINKS Hamel's website and blog (https://hamel.dev/) Hugo speaks with Philip Carter (Honeycomb) about aligning your LLM-as-a-judge with your domain expertise (https://vanishinggradients.fireside.fm/51) Hamel Husain on Lenny's pocast, which includes a live demo of error analysis (https://www.lennysnewsletter.com/p/why-ai-evals-are-the-hottest-new-skill) The episode of VG in which Hamel and Hugo talk about Hamel's "data consulting in Vegas" era (https://vanishinggradients.fireside.fm/9) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) Watch the podcast video on YouTube (https://youtube.com/live/QEk-XwrkqhI?feature=share) Hamel's AI evals course, which he teaches with Shreya Shankar (UC Berkeley): starts Oct 6 and this link gives 35% off! (https://maven.com/parlance-labs/evals?promoCode=GOHUGORGOHOME) https://maven.com/parlance-labs/evals?promoCode=GOHUGORGOHOME 🎓 Learn more: Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — https://maven.com/s/course/d56067f338

Episode 59: Patterns and Anti-Patterns For Building with AI

2025-09-2347:37

John Berryman (Arcturus Labs; early GitHub Copilot engineer; co-author of Relevant Search and Prompt Engineering for LLMs) has spent years figuring out what makes AI applications actually work in production. In this episode, he shares the “seven deadly sins” of LLM development — and the practical fixes that keep projects from stalling. From context management to retrieval debugging, John explains the patterns he’s seen succeed, the mistakes to avoid, and why it helps to think of an LLM as an “AI intern” rather than an all-knowing oracle. We talk through: - Why chasing perfect accuracy is a dead end - How to use agents without losing control - Context engineering: fitting the right information in the window - Starting simple instead of over-orchestrating - Separating retrieval from generation in RAG - Splitting complex extractions into smaller checks - Knowing when frameworks help — and when they slow you down A practical guide to avoiding the common traps of LLM development and building systems that actually hold up in production. LINKS: Context Engineering for AI Agents, a free, upcoming lightning lesson from John and Hugo (https://maven.com/p/4485aa/context-engineering-for-ai-agents) The Hidden Simplicity of GenAI Systems, a previous lightning lesson from John and Hugo (https://maven.com/p/a8195d/the-hidden-simplicity-of-gen-ai-systems) Roaming RAG – RAG without the Vector Database, by John (https://arcturus-labs.com/blog/2024/11/21/roaming-rag--rag-without-the-vector-database/) Cut the Chit-Chat with Artifacts, by John (https://arcturus-labs.com/blog/2024/11/11/cut-the-chit-chat-with-artifacts/) Prompt Engineering for LLMs by John and Albert Ziegler (https://amzn.to/4gChsFf) Relevant Search by John and Doug Turnbull (https://amzn.to/3TXmDHk) Arcturus Labs (https://arcturus-labs.com/) Watch the podcast on YouTube (https://youtu.be/mKTQGKIUq8M) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) 🎓 Learn more: Hugo's course (this episode was a guest Q&A from the course): Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — https://maven.com/s/course/d56067f338

Episode 58: Building GenAI Systems That Make Business Decisions with Thomas Wiecki (PyMC Labs)

2025-09-0901:00:45

While most conversations about generative AI focus on chatbots, Thomas Wiecki (PyMC Labs, PyMC) has been building systems that help companies make actual business decisions. In this episode, he shares how Bayesian modeling and synthetic consumers can be combined with LLMs to simulate customer reactions, guide marketing spend, and support strategy. Drawing from his work with Colgate and others, Thomas explains how to scale survey methods with AI, where agents fit into analytics workflows, and what it takes to make these systems reliable. We talk through: Using LLMs as “synthetic consumers” to simulate surveys and test product ideas How Bayesian modeling and causal graphs enable transparent, trustworthy decision-making Building closed-loop systems where AI generates and critiques ideas Guardrails for multi-agent workflows in marketing mix modeling Where generative AI breaks (and how to detect failure modes) The balance between useful models and “correct” models If you’ve ever wondered how to move from flashy prototypes to AI systems that actually inform business strategy, this episode shows what it takes. LINKS: The AI MMM Agent, An AI-Powered Shortcut to Bayesian Marketing Mix Insights (https://www.pymc-labs.com/blog-posts/the-ai-mmm-agent) AI-Powered Decision Making Under Uncertainty Workshop w/ Allen Downey & Chris Fonnesbeck (PyMC Labs) (https://youtube.com/live/2Auc57lxgeU) The Podcast livestream on YouTube (https://youtube.com/live/so4AzEbgSjw?feature=share) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) 🎓 Learn more: Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — https://maven.com/s/course/d56067f338

Episode 57: AI Agents and LLM Judges at Scale: Processing Millions of Documents (Without Breaking the Bank)

2025-08-2941:27

While many people talk about “agents,” Shreya Shankar (UC Berkeley) has been building the systems that make them reliable. In this episode, she shares how AI agents and LLM judges can be used to process millions of documents accurately and cheaply. Drawing from work on projects ranging from databases of police misconduct reports to large-scale customer transcripts, Shreya explains the frameworks, error analysis, and guardrails needed to turn flaky LLM outputs into trustworthy pipelines. We talk through: - Treating LLM workflows as ETL pipelines for unstructured text - Error analysis: why you need humans reviewing the first 50–100 traces - Guardrails like retries, validators, and “gleaning” - How LLM judges work — rubrics, pairwise comparisons, and cost trade-offs - Cheap vs. expensive models: when to swap for savings - Where agents fit in (and where they don’t) If you’ve ever wondered how to move beyond unreliable demos, this episode shows how to scale LLMs to millions of documents — without breaking the bank. LINKS Shreya's website (https://www.sh-reya.com/) DocETL, A system for LLM-powered data processing (https://www.docetl.org/) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) Watch the podcast video on YouTube (https://youtu.be/3r_Hsjy85nk) Shreya's AI evals course, which she teaches with Hamel "Evals" Husain (https://maven.com/parlance-labs/evals?promoCode=GOHUGORGOHOME) 🎓 Learn more: Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — https://maven.com/s/course/d56067f338

Episode 56: DeepMind Just Dropped Gemma 270M... And Here’s Why It Matters

2025-08-1445:40

While much of the AI world chases ever-larger models, Ravin Kumar (Google DeepMind) and his team build across the size spectrum, from billions of parameters down to this week’s release: Gemma 270M, the smallest member yet of the Gemma 3 open-weight family. At just 270 million parameters, a quarter the size of Gemma 1B, it’s designed for speed, efficiency, and fine-tuning. We explore what makes 270M special, where it fits alongside its billion-parameter siblings, and why you might reach for it in production even if you think “small” means “just for experiments.” We talk through: - Where 270M fits into the Gemma 3 lineup — and why it exists - On-device use cases where latency, privacy, and efficiency matter - How smaller models open up rapid, targeted fine-tuning - Running multiple models in parallel without heavyweight hardware - Why “small” models might drive the next big wave of AI adoption If you’ve ever wondered what you’d do with a model this size (or how to squeeze the most out of it) this episode will show you how small can punch far above its weight. LINKS Introducing Gemma 3 270M: The compact model for hyper-efficient AI (Google Developer Blog) (https://developers.googleblog.com/en/introducing-gemma-3-270m/) Full Model Fine-Tune Guide using Hugging Face Transformers (https://ai.google.dev/gemma/docs/core/huggingface_text_full_finetune) The Gemma 270M model on HuggingFace (https://huggingface.co/google/gemma-3-270m) The Gemma 270M model on Ollama (https://ollama.com/library/gemma3:270m) Building AI Agents with Gemma 3, a workshop with Ravin and Hugo (https://www.youtube.com/live/-IWstEStqok) (Code here (https://github.com/canyon289/ai_agent_basics)) From Images to Agents: Building and Evaluating Multimodal AI Workflows, a workshop with Ravin and Hugo (https://www.youtube.com/live/FNlM7lSt8Uk)(Code here (https://github.com/canyon289/ai_image_agent)) Evaluating AI Agents: From Demos to Dependability, an upcoming workshop with Ravin and Hugo (https://lu.ma/ezgny3dl) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) Watch the podcast video on YouTube (https://youtu.be/VZDw6C2A_8E) 🎓 Learn more: Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — https://maven.com/s/course/d56067f338 ($600 off early bird discount for November cohort availiable until August 16)

Episode 55: From Frittatas to Production LLMs: Breakfast at SciPy

2025-08-1238:08

Traditional software expects 100% passing tests. In LLM-powered systems, that’s not just unrealistic — it’s a feature, not a bug. Eric Ma leads research data science in Moderna’s data science and AI group, and over breakfast at SciPy we explored why AI products break the old rules, what skills different personas bring (and miss), and how to keep systems alive after the launch hype fades. You’ll hear the clink of coffee cups, the murmur of SciPy in the background, and the occasional bite of frittata as we talk (hopefully also a feature, not a bug!) We talk through: • The three personas — and the blind spots each has when shipping AI systems • Why “perfect” tests can be a sign you’re testing the wrong thing • Development vs. production observability loops — and why you need both • How curiosity about failing data separates good builders from great ones • Ways large organizations can create space for experimentation without losing delivery focus If you want to build AI products that thrive in the messy real world, this episode will help you embrace the chaos — and make it work for you. LINKS Eric' Website (https://ericmjl.github.io/) More about the workshops Eric and Hugo taught at SciPy (https://hugobowne.substack.com/p/stress-testing-llms-evaluation-frameworks) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) 🎓 Learn more: Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — https://maven.com/s/course/d56067f338 ($600 off early bird discount for November cohort availiable until August 16)

Episode 54: Scaling AI: From Colab to Clusters — A Practitioner’s Guide to Distributed Training and Inference

2025-07-1841:17

Colab is cozy. But production won’t fit on a single GPU. Zach Mueller leads Accelerate at Hugging Face and spends his days helping people go from solo scripts to scalable systems. In this episode, he joins me to demystify distributed training and inference — not just for research labs, but for any ML engineer trying to ship real software. We talk through: • From Colab to clusters: why scaling isn’t just about training massive models, but serving agents, handling load, and speeding up iteration • Zero-to-two GPUs: how to get started without Kubernetes, Slurm, or a PhD in networking • Scaling tradeoffs: when to care about interconnects, which infra bottlenecks actually matter, and how to avoid chasing performance ghosts • The GPU middle class: strategies for training and serving on a shoestring, with just a few cards or modest credits • Local experiments, global impact: why learning distributed systems—even just a little—can set you apart as an engineer If you’ve ever stared at a Hugging Face training script and wondered how to run it on something more than your laptop: this one’s for you. LINKS Zach on LinkedIn (https://www.linkedin.com/in/zachary-mueller-135257118/) Hugo's blog post on Stop Buliding AI Agents (https://www.linkedin.com/posts/hugo-bowne-anderson-045939a5_yesterday-i-posted-about-stop-building-ai-activity-7346942036752613376-b8-t/) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) Hugo's recent newsletter about upcoming events and more! (https://hugobowne.substack.com/p/stop-building-agents) 🎓 Learn more: Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — https://maven.com/s/course/d56067f338 Zach's course (45% off for VG listeners!): Scratch to Scale: Large-Scale Training in the Modern World (https://maven.com/walk-with-code/scratch-to-scale?promoCode=hugo39) -- https://maven.com/walk-with-code/scratch-to-scale?promoCode=hugo39 📺 Watch the video version on YouTube: YouTube link (https://youtube.com/live/76NAtzWZ25s?feature=share)

Episode 53: Human-Seeded Evals & Self-Tuning Agents: Samuel Colvin on Shipping Reliable LLMs

2025-07-0844:49

Demos are easy; durability is hard. Samuel Colvin has spent a decade building guardrails in Python (first with Pydantic, now with Logfire), and he’s convinced most LLM failures have nothing to do with the model itself. They appear where the data is fuzzy, the prompts drift, or no one bothered to measure real-world behavior. Samuel joins me to show how a sprinkle of engineering discipline keeps those failures from ever reaching users. We talk through: • Tiny labels, big leverage: how five thumbs-ups/thumbs-downs are enough for Logfire to build a rubric that scores every call in real time • Drift alarms, not dashboards: catching the moment your prompt or data shifts instead of reading charts after the fact • Prompt self-repair: a prototype agent that rewrites its own system prompt—and tells you when it still doesn’t have what it needs • The hidden cost curve: why the last 15 percent of reliability costs far more than the flashy 85 percent demo • Business-first metrics: shipping features that meet real goals instead of chasing another decimal point of “accuracy” If you’re past the proof-of-concept stage and staring down the “now it has to work” cliff, this episode is your climbing guide. LINKS Pydantic (https://pydantic.dev/) Logfire (https://pydantic.dev/logfire) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) Hugo's recent newsletter about upcoming events and more! (https://hugobowne.substack.com/p/stop-building-agents) 🎓 Learn more: Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — next cohort starts July 8: https://maven.com/s/course/d56067f338 📺 Watch the video version on YouTube: YouTube link (https://youtube.com/live/wk6rPZ6qJSY?feature=share)

Episode 52: Why Most LLM Products Break at Retrieval (And How to Fix Them)

2025-07-0228:381

Most LLM-powered features do not break at the model. They break at the context. So how do you retrieve the right information to get useful results, even under vague or messy user queries? In this episode, we hear from Eric Ma, who leads data science research in the Data Science and AI group at Moderna. He shares what it takes to move beyond toy demos and ship LLM features that actually help people do their jobs. We cover: • How to align retrieval with user intent and why cosine similarity is not the answer • How a dumb YAML-based system outperformed so-called smart retrieval pipelines • Why vague queries like “what is this all about” expose real weaknesses in most systems • When vibe checks are enough and when formal evaluation is worth the effort • How retrieval workflows can evolve alongside your product and user needs If you are building LLM-powered systems and care about how they work, not just whether they work, this one is for you. LINKS Eric's website (https://ericmjl.github.io/) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) Hugo's recent newsletter about upcoming events and more! (https://hugobowne.substack.com/p/stop-building-agents) 🎓 Learn more: Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — next cohort starts July 8: https://maven.com/s/course/d56067f338 📺 Watch the video version on YouTube: YouTube link (https://youtu.be/d-FaR5Ywd5k)

Episode 51: Why We Built an MCP Server and What Broke First

2025-06-2647:41

What does it take to actually ship LLM-powered features, and what breaks when you connect them to real production data? In this episode, we hear from Philip Carter — then a Principal PM at Honeycomb and now a Product Management Director at Salesforce. In early 2023, he helped build one of the first LLM-powered SaaS features to ship to real users. More recently, he and his team built a production-ready MCP server. We cover: • How to evaluate LLM systems using human-aligned judges • The spreadsheet-driven process behind shipping Honeycomb’s first LLM feature • The challenges of tool usage, prompt templates, and flaky model behavior • Where MCP shows promise, and where it breaks in the real world If you’re working on LLMs in production, this one’s for you! LINKS So We Shipped an AI Product: Did it Work? by Philip Carter (https://www.honeycomb.io/blog/we-shipped-ai-product) Vanishing Gradients YouTube Channel (https://www.youtube.com/channel/UC_NafIo-Ku2loOLrzm45ABA) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) Hugo's recent newsletter about upcoming events and more! (https://hugobowne.substack.com/p/ai-as-a-civilizational-technology) 🎓 Learn more: Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — next cohort starts July 8: https://maven.com/s/course/d56067f338 📺 Watch the video version on YouTube: YouTube link (https://youtu.be/JDMzdaZh9Ig)

Episode 50: A Field Guide to Rapidly Improving AI Products -- With Hamel Husain

2025-06-1727:42

If we want AI systems that actually work, we need to get much better at evaluating them, not just building more pipelines, agents, and frameworks. In this episode, Hugo talks with Hamel Hussain (ex-Airbnb, GitHub, DataRobot) about how teams can improve AI products by focusing on error analysis, data inspection, and systematic iteration. The conversation is based on Hamel’s blog post A Field Guide to Rapidly Improving AI Products, which he joined Hugo’s class to discuss. They cover: 🔍 Why most teams struggle to measure whether their systems are actually improving 📊 How error analysis helps you prioritize what to fix (and when to write evals) 🧮 Why evaluation isn’t just a metric — but a full development process ⚠️ Common mistakes when debugging LLM and agent systems 🛠️ How to think about the tradeoffs in adding more evals vs. fixing obvious issues 👥 Why enabling domain experts — not just engineers — can accelerate iteration If you’ve ever built an AI system and found yourself unsure how to make it better, this conversation is for you. LINKS * A Field Guide to Rapidly Improving AI Products by Hamel Husain (https://hamel.dev/blog/posts/field-guide/) * Vanishing Gradients YouTube Channel (https://www.youtube.com/channel/UC_NafIo-Ku2loOLrzm45ABA) * Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) * Hugo's recent newsletter about upcoming events and more! (https://hugobowne.substack.com/p/ai-as-a-civilizational-technology) 🎓 Learn more: Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — next cohort starts July 8: https://maven.com/s/course/d56067f338 Hamel & Shreya's course: AI Evals For Engineers & PMs (https://maven.com/parlance-labs/evals?promoCode=GOHUGORGOHOME) — use code GOHUGORGOHOME for $800 off 📺 Watch the video version on YouTube: YouTube link (https://youtu.be/rWToRi2_SeY)

Episode 49: Why Data and AI Still Break at Scale (and What to Do About It)

2025-06-0501:21:45

If we want AI systems that actually work in production, we need better infrastructure—not just better models. In this episode, Hugo talks with Akshay Agrawal (Marimo, ex-Google Brain, Netflix, Stanford) about why data and AI pipelines still break down at scale, and how we can fix the fundamentals: reproducibility, composability, and reliable execution. They discuss: 🔁 Why reactive execution matters—and how current tools fall short 🛠️ The design goals behind Marimo, a new kind of Python notebook ⚙️ The hidden costs of traditional workflows (and what breaks at scale) 📦 What it takes to build modular, maintainable AI apps 🧪 Why debugging LLM systems is so hard—and what better tooling looks like 🌍 What we can learn from decades of tools built for and by data practitioners Toward the end of the episode, Hugo and Akshay walk through two live demos: Hugo shares how he’s been using Marimo to prototype an app that extracts structured data from world leader bios, and Akshay shows how Marimo handles agentic workflows with memory and tool use—built entirely in a notebook. This episode is about tools, but it’s also about culture. If you’ve ever hit a wall with your current stack—or felt like your tools were working against you—this one’s for you. LINKS * marimo | a next-generation Python notebook (https://marimo.io/) * SciPy conference, 2025 (https://www.scipy2025.scipy.org/) * Hugo's face Marimo World Leader Face Embedding demo (https://www.youtube.com/watch?v=DO21QEcLOxM) * Vanishing Gradients YouTube Channel (https://www.youtube.com/channel/UC_NafIo-Ku2loOLrzm45ABA) * Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) * Hugo's recent newsletter about upcoming events and more! (https://hugobowne.substack.com/p/ai-as-a-civilizational-technology) * Watch the podcast here on YouTube! (https://youtube.com/live/WVxAz19tgZY?feature=share) 🎓 Want to go deeper? Check out Hugo's course: Building LLM Applications for Data Scientists and Software Engineers. Learn how to design, test, and deploy production-grade LLM systems — with observability, feedback loops, and structure built in. This isn’t about vibes or fragile agents. It’s about making LLMs reliable, testable, and actually useful. Includes over $800 in compute credits and guest lectures from experts at DeepMind, Moderna, and more. Cohort starts July 8 — Use this link for a 10% discount (https://maven.com/hugo-stefan/building-llm-apps-ds-and-swe-from-first-principles?promoCode=LLM10)

Episode 48: HOW TO BENCHMARK AGI WITH GREG KAMRADT

2025-05-2301:04:25

If we want to make progress toward AGI, we need a clear definition of intelligence—and a way to measure it. In this episode, Hugo talks with Greg Kamradt, President of the ARC Prize Foundation, about ARC-AGI: a benchmark built on Francois Chollet’s definition of intelligence as “the efficiency at which you learn new things.” Unlike most evals that focus on memorization or task completion, ARC is designed to measure generalization—and expose where today’s top models fall short. They discuss: 🧠 Why we still lack a shared definition of intelligence 🧪 How ARC tasks force models to learn novel skills at test time 📉 Why GPT-4-class models still underperform on ARC 🔎 The limits of traditional benchmarks like MMLU and Big-Bench ⚙️ What the OpenAI O₃ results reveal—and what they don’t 💡 Why generalization and efficiency, not raw capability, are key to AGI Greg also shares what he’s seeing in the wild: how startups and independent researchers are using ARC as a North Star, how benchmarks shape the frontier, and why the ARC team believes we’ll know we’ve reached AGI when humans can no longer write tasks that models can’t solve. This conversation is about evaluation—not hype. If you care about where AI is really headed, this one’s worth your time. LINKS * ARC Prize -- What is ARC-AGI? (https://arcprize.org/arc-agi) * On the Measure of Intelligence by François Chollet (https://arxiv.org/abs/1911.01547) * Greg Kamradt on Twitter (https://x.com/GregKamradt) * Hugo's High Signal Podcast with Fei-Fei Li (https://high-signal.delphina.ai/episode/fei-fei-on-how-human-centered-ai-actually-gets-built) * Vanishing Gradients YouTube Channel (https://www.youtube.com/channel/UC_NafIo-Ku2loOLrzm45ABA) * Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) * Hugo's recent newsletter about upcoming events and more! (https://hugobowne.substack.com/p/ai-as-a-civilizational-technology) * Watch the podcast here on YouTube! (https://youtu.be/wU82fz4iRfo) 🎓 Want to go deeper? Check out Hugo's course: Building LLM Applications for Data Scientists and Software Engineers. Learn how to design, test, and deploy production-grade LLM systems — with observability, feedback loops, and structure built in. This isn’t about vibes or fragile agents. It’s about making LLMs reliable, testable, and actually useful. Includes over $800 in compute credits and guest lectures from experts at DeepMind, Moderna, and more. Cohort starts July 8 — Use this link for a 10% discount (https://maven.com/hugo-stefan/building-llm-apps-ds-and-swe-from-first-principles?promoCode=LLM10)

Episode 47: The Great Pacific Garbage Patch of Code Slop with Joe Reis

2025-04-0701:19:121

What if the cost of writing code dropped to zero — but the cost of understanding it skyrocketed? In this episode, Hugo sits down with Joe Reis to unpack how AI tooling is reshaping the software development lifecycle — from experimentation and prototyping to deployment, maintainability, and everything in between. Joe is the co-author of Fundamentals of Data Engineering and a longtime voice on the systems side of modern software. He’s also one of the sharpest critics of “vibe coding” — the emerging pattern of writing software by feel, with heavy reliance on LLMs and little regard for structure or quality. We dive into: • Why “vibe coding” is more than a meme — and what it says about how we build today • How AI tools expand the surface area of software creation — for better and worse • What happens to technical debt, testing, and security when generation outpaces understanding • The changing definition of “production” in a world of ephemeral, internal, or just-good-enough tools • How AI is flattening the learning curve — and threatening the talent pipeline • Joe’s view on what real craftsmanship means in an age of disposable code This conversation isn’t about doom, and it’s not about hype. It’s about mapping the real, messy terrain of what it means to build software today — and how to do it with care. LINKS * Joe's Practical Data Modeling Newsletter on Substack (https://practicaldatamodeling.substack.com/) * Joe's Practical Data Modeling Server on Discord (https://discord.gg/HhSZVvWDBb) * Vanishing Gradients YouTube Channel (https://www.youtube.com/channel/UC_NafIo-Ku2loOLrzm45ABA) * Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) 🎓 Want to go deeper? Check out my course: Building LLM Applications for Data Scientists and Software Engineers. Learn how to design, test, and deploy production-grade LLM systems — with observability, feedback loops, and structure built in. This isn’t about vibes or fragile agents. It’s about making LLMs reliable, testable, and actually useful. Includes over $800 in compute credits and guest lectures from experts at DeepMind, Moderna, and more. Cohort starts July 8 — Use this link for a 10% discount (https://maven.com/hugo-stefan/building-llm-apps-ds-and-swe-from-first-principles?promoCode=LLM10)

Episode 46: Software Composition Is the New Vibe Coding

2025-04-0301:08:57

What if building software felt more like composing than coding? In this episode, Hugo and Greg explore how LLMs are reshaping the way we think about software development—from deterministic programming to a more flexible, prompt-driven, and collaborative style of building. It’s not just hype or grift—it’s a real shift in how we express intent, reason about systems, and collaborate across roles. Hugo speaks with Greg Ceccarelli—co-founder of SpecStory, former CPO at Pluralsight, and Director of Data Science at GitHub—about the rise of software composition and how it changes the way individuals and teams create with LLMs. We dive into: - Why software composition is emerging as a serious alternative to traditional coding - The real difference between vibe coding and production-minded prototyping - How LLMs are expanding who gets to build software—and how - What changes when you focus on intent, not just code - What Greg is building with SpecStory to support collaborative, traceable AI-native workflows - The challenges (and joys) of debugging and exploring with agentic tools like Cursor and Claude We’ve removed the visual demos from the audio—but you can catch our live-coded Chrome extension and JFK document explorer on YouTube. Links below. JFK Docs Vibe Coding Demo (YouTube) (https://youtu.be/JpXCkuV58QE) Chrome Extension Vibe Coding Demo (YouTube) (https://youtu.be/ESVKp37jDwc) Meditations on Tech (Greg’s Substack) (https://www.meditationsontech.com/) Simon Willison on Vibe Coding (https://simonwillison.net/2025/Mar/19/vibe-coding/) Johnno Whitaker: On Vibe Coding (https://johnowhitaker.dev/essays/vibe_coding.html) Tim O’Reilly – The End of Programming (https://www.oreilly.com/radar/the-end-of-programming-as-we-know-it/) Vanishing Gradients YouTube Channel (https://www.youtube.com/channel/UC_NafIo-Ku2loOLrzm45ABA) Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) Greg Ceccarelli on LinkedIn (https://www.linkedin.com/in/gregceccarelli/) Greg’s Hacker News Post on GOOD (https://news.ycombinator.com/item?id=43557698) SpecStory: GOOD – Git Companion for AI Workflows (https://github.com/specstoryai/getspecstory/blob/main/GOOD.md) 🎓 Want to go deeper? Check out my course: Building LLM Applications for Data Scientists and Software Engineers. Learn how to design, test, and deploy production-grade LLM systems — with observability, feedback loops, and structure built in. This isn’t about vibes or fragile agents. It’s about making LLMs reliable, testable, and actually useful. Includes over $2,500 in compute credits and guest lectures from experts at DeepMind, Moderna, and more. Cohort starts April 7 — Use this link for a 10% discount (https://maven.com/hugo-stefan/building-llm-apps-ds-and-swe-from-first-principles?promoCode=LLM10) 🔍 Want to help shape the future of SpecStory? Greg and the team are looking for design partners for their new SpecStory Teams product—built for collaborative, AI-native software development. If you're working with LLMs in a team setting and want to influence the next wave of developer tools, you can apply here: 👉 specstory.com/teams (https://specstory.com/teams)

Episode 45: Your AI application is broken. Here’s what to do about it.

2025-02-2001:17:30

Too many teams are building AI applications without truly understanding why their models fail. Instead of jumping straight to LLM evaluations, dashboards, or vibe checks, how do you actually fix a broken AI app? In this episode, Hugo speaks with Hamel Husain, longtime ML engineer, open-source contributor, and consultant, about why debugging generative AI systems starts with looking at your data. In this episode, we dive into: Why “look at your data” is the best debugging advice no one follows. How spreadsheet-based error analysis can uncover failure modes faster than complex dashboards. The role of synthetic data in bootstrapping evaluation. When to trust LLM judges—and when they’re misleading. Why most AI dashboards measuring truthfulness, helpfulness, and conciseness are often a waste of time. If you're building AI-powered applications, this episode will change how you approach debugging, iteration, and improving model performance in production. LINKS The podcast livestream on YouTube (https://youtube.com/live/Vz4--82M2_0?feature=share) Hamel's blog (https://hamel.dev/) Hamel on twitter (https://x.com/HamelHusain) Hugo on twitter (https://x.com/hugobowne) Vanishing Gradients on twitter (https://x.com/vanishingdata) Vanishing Gradients on YouTube (https://www.youtube.com/channel/UC_NafIo-Ku2loOLrzm45ABA) Vanishing Gradients on Twitter (https://x.com/vanishingdata) Vanishing Gradients on Lu.ma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk) Building LLM Application for Data Scientists and SWEs, Hugo course on Maven (use VG25 code for 25% off) (https://maven.com/s/course/d56067f338) Hugo is also running a free lightning lesson next week on LLM Agents: When to Use Them (and When Not To) (https://maven.com/p/ed7a72/llm-agents-when-to-use-them-and-when-not-to?utm_medium=ll_share_link&utm_source=instructor)

#box-pro-ellipsis-176506390779769{-webkit-line-clamp:2;}Vanishing Gradients