Vanishing Gradients

71 Episodes

Reverse

Episode 71: Durable Agents - How to Build AI Systems That Survive a Crash with Samuel Colvin

2026-02-1851:27

Our thesis is that AI is still just engineering… those people who tell us for fun and profit, that somehow AI is so, so profound, so new, so different from anything that’s gone before that it somehow eclipses the need for good engineering practice are wrong. We need that good engineering practice still, and for the most part, most things are not new. But there are some things that have become more important with AI. One of those is durability.Samuel Colvin, Creator of Pydantic AI, joins Hugo to talk about applying battle-tested software engineering principles to build durable and reliable AI agents.They Discuss:* Production agents require engineering-grade reliability: Unlike messy coding agents, production agents need high constraint, reliability, and the ability to perform hundreds of tasks without drifting into unusual behavior;* Agents are the new “quantum” of AI software: Modern architecture uses discrete “agentlets”: small, specialized building blocks stitched together for sub-tasks within larger, durable systems;* Stop building “chocolate teapot” execution frameworks: Ditch rudimentary snapshotting; use battle-tested durable execution engines like Temporal for robust retry logic and state management;* AI observability will be a native feature: In five years, AI observability will be integrated, with token counts and prompt traces becoming standard features of all observability platforms;* Split agents into deterministic workflows and stochastic activities: Ensure true durability by isolating deterministic workflow logic from stochastic activities (IO, LLM calls) to cache results and prevent redundant model calls;* Type safety is essential for enterprise agents: Sacrificing type safety for flexible graphs leads to unmaintainable software; professional AI engineering demands strict type definitions for parallel node execution and state recovery;* Standardize on OpenTelemetry for portability: Use OpenTelemetry (OTel) to ensure agent traces and logs are portable, preventing vendor lock-in and integrating seamlessly into existing enterprise monitoring.You can also find the full episode on Spotify, Apple Podcasts, and YouTube.You can also interact directly with the transcript here in NotebookLM: If you do so, let us know anything you find in the comments!👉 Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands on exercises and office hours. Here is a 25% discount code for listeners. 👈LINKS* Samuel Colvin on LinkedIn* Pydantic* Pydantic Stack Demo repo* Deep research example code* Temporal* DBOS (Postgres alternative to Temporal)* Upcoming Events on Luma* Vanishing Gradients on YouTube* Watch the podcast video on YouTube👉Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands on exercises and office hours. Our final cohort starts March 10, 2026. Here is a 25% discount code for listeners.👈https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgfs This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 70: 1,400 Production AI Deployments

2026-02-1201:09:52

There’s a company who spent almost $50,000 because an agent went into an infinite loop and they forgot about it for a month.It had no failures and I guess no one was monitoring these costs. It’s nice that people do write about that in the database as well. After it happened, they said: watch out for infinite loops. Watch out for cascading tool failures. Watch out for silent failures where the agent reports it has succeeded when it didn’t!We Discuss:* Why the most successful teams are ripping out and rebuilding their agent systems every few weeks as models improve, and why over-engineering now creates technical debt you can’t afford later;* The $50,000 infinite loop disaster and why “silent failures” are the biggest risk in production: agents confidently report success while spiraling into expensive mistakes;* How ELIOS built emergency voice agents with sub-400ms response times by aggressively throwing away context every few seconds, and why these extreme patterns are becoming standard practice;* Why DoorDash uses a three-tier agent architecture (manager, progress tracker, and specialists) with a persistent workspace that lets agents collaborate across hours or days;* Why simple text files and markdown are emerging as the best “continual learning” layer: human-readable memory that persists across sessions without fine-tuning models;* The 100-to-1 problem: for every useful output, tool-calling agents generate 100 tokens of noise, and the three tactics (reduce, offload, isolate) teams use to manage it;* Why companies are choosing Gemini Flash for document processing and Opus for long reasoning chains, and how to match models to your actual usage patterns;* The debate over vector databases versus simple grep and cat, and why giving agents standard command-line tools often beats complex APIs;* What “re-architect” as a job title reveals about the shift from 70% scaffolding / 30% model to 90% model / 10% scaffolding, and why knowing when to rip things out is the may be the most important skill today.You can also find the full episode on Spotify, Apple Podcasts, and YouTube.You can also interact directly with the transcript here in NotebookLM: If you do so, let us know anything you find in the comments!👉 Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands on exercises and office hours. Our final cohort starts March 10, 2026. Here is a 25% discount code for readers. 👈Show Notes Links* Alex Strick van Linschoten on LinkedIn* Alex Strick van Linschoten on Twitter/X* LLMOps Database* LLMOps Database Dataset on Hugging Face* Hugo’s MCP Server for LLMOps Database* Alex’s Blog: What 1,200+ Production Deployments Reveal About LLMOps in 2025* Previous Episode: Practical Lessons from 750 Real-World LLM Deployments* Previous Episode: Tales from 400 LLM Deployments* Context Rot Research by Chroma* Hugo’s Post: AI Agent Harness - 3 Principles for Context Engineering* Hugo’s Post: The Rise of Agentic Search* Episode with Nick Moy: The Post-Coding Era* Hugo’s Personal Podcast Prep Skill Gist* Claude Tool Search Documentation* Gastown on GitHub (Steve Yegge)* Welcome to Gastown by Steve Yegge* ZenML - Open Source MLOps & LLMOps Framework* Upcoming Events on Luma* Vanishing Gradients on YouTube* Watch the podcast livestream on YouTube* Join the final cohort of our Building AI Applications course in March, 2026 (25% off for listeners)👉 Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands on exercises and office hours. Our final cohort starts March 10, 2026. Here is a 25% discount code for readers. 👈 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 69: Python is Dead. Long Live Python! With the Creators of pandas & Parquet

2026-02-0355:27

> It’s the agent writing the code. And it’s the development loop of writing the code, building testing, write the code, build test and iterating. And so I do think we’ll see for many types of software, a shift away from Python towards other programming languages. I think Go is probably the best language for those like other types of software projects. And like I said, I haven’t written a line of Go code in my life.– Wes McKinney (creator of pandas Principal Architect at Posit),Wes McKinney, Marcel Kornacker, and Alison Hill join Hugo to talk about the architectural shift for multimodal AI, the rise of “agent ergonomics,” and the evolving role of developers in an AI-generated future.We Discuss:* Agent Ergonomics: Optimize for agent iteration speed, shifting from human coding to fast test environments, potentially favoring languages like Go;* Adversarial Code Review: Deploy diverse AI models to peer-review agent-generated code, catching subtle bugs humans miss;* Multimodal Data Verbs: Make operations like resizing and rotating native to your database to eliminate data-plumbing bottlenecks;* Taste as Differentiator: Value “taste”—the ability to curate and refine the best output from countless AI-generated options—over sheer execution speed;* 100x Software Volume: Embrace ephemeral, just-in-time software; prioritize aggressive generation and adversarial testing over careful planning for quality.You can also find the full episode on Spotify, Apple Podcasts, and YouTube.You can also interact directly with the transcript of the workshop & fireside chat here in NotebookLM: If you do so, let us know anything you find in the comments!👉 Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands on exercises and office hours. Here is a discount code for readers. 👈This was a fireside chat at the end of a livestreamed workshop we did on building multimodal AI systems with Pixeltable. Check out the full workshop below (all code here on Github):Links and Resources* Wes McKinney on LinkedIn* Marcel Kornacker on LinkedIn* Alison Hill on LinkedIn* Spicy Takes* Palmer Penguins* Pixeltable* Posit* Positron* Building Multimodal AI Systems Workshop Repository* Pixeltable Docs: LLM Tool Calling with MCP Servers* Pixeltable Docs: Working with Pydantic* Upcoming Events on Luma* Vanishing Gradients on YouTube* Watch the podcast video on YouTube* Join the final cohort of our Building AI Applications course in March, 2026 (25% off for listeners)https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgfsWhat people said during the workshop“I think the interface looks amazing/simple. Strong work! 🦾” — @goldentribe“This is quite amazing. Watching this I felt the same way when I first leant pandas, NumPy and scikit and how well i was able to manipulate and wrangle data. PixelTable feels seamless and looks as good as those legendary frameworks but for Multimodal Data.” — @vinod7“This is all extremely cool to see, I love the API and the approach.” — @steveb4191“Thanks so much, Hugo! That was very insightful! Great work Alison and Marcel!” — @vinod7“Just wrapped up watching a replay of the Pixeltable workshop. So cool!! Love the notebooks and working examples. The important parts were covered and worked beautifully 🕺” — @therobbrennan👉 Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands on exercises and office hours. Here is a discount code for readers. 👈 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 68: A Builder’s Guide to Agentic Search & Retrieval with Doug Turnbull & John Berryman

2026-01-2301:28:42

The best way to build a horrible search product? Don’t ever measure anything against what a user wants.Search veterans Doug Turnbull (Led Search at Reddit + Shopify; Wrote Relevant Search + AI Powered Search) and John Berryman (Early Engineer on Github Copilot; Author of Relevant Search + Prompt Engineering for LLMs), join Hugo to talk about how to build Agentic Search Applications.We Discuss:* The evolution of information retrieval as it moves from traditional keyword search toward “agentic search“ and what this means for builders.* John’s five-level maturity model (you can prototype today!) for AI adoption, moving from Trad Search to conversational AI to asynchronous research assistants that reason about result quality.* The Agentic Search Builders Playbook, including why and how you should “hand-roll” your own agentic loops to maintain control;* The importance of “revealed preferences” that LLM-judges often miss (evaluations must use real clickstream data to capture “revealed preferences” that semantic relevance alone cannot infer)* Patterns and Anti-Patterns for Agentic Search Applications* Learning and teaching Search in the Age of AgentsYou can find the full episode on Spotify, Apple Podcasts, and YouTube.You can also interact directly with the transcript here in NotebookLM: If you do so, let us know anything you find in the comments!👉 Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands on exercises and office hours. Here is a discount code for readers. 👈Doug and Hugo are also doing a free lightning lesson on Feb 20 about How To Build Your First Agentic Search Application! You’ll walk away with a framework & code to build your first agentic search app. Register here to join live or get the recording after.Links and ResourcesGuests* Arcturus Labs (John’s website)* Software Doug (Doug’s website)* John Berryman on LinkedIn* Doug Turnbull on LinkedInBooks* Relevant Search by Doug Turnbull & John Berryman (Manning)* AI-Powered Search by Doug Turnbull (Manning)* Prompt Engineering for LLMs by John Berryman (O’Reilly)Blog Posts* Incremental AI Adoption for E-commerce by John Berryman* Roaming RAG – RAG without the Vector Database by John Berryman* Agents Turn Simple Keyword Search into Compelling Search Experiences by Doug Turnbull* A Simple Agentic Loop with Just Python Functions by Doug Turnbull* Agentic Code Generation to Optimize a Search Reranker by Doug Turnbull* LLM Judges Aren’t the Shortcut You Think by Doug Turnbul (Hugo’s 5 minute video below)* Malleable Software by Ink & Switch (inc. Geoffrey Lit)* Patterns and Anti-Patterns for Building with AI by Hugo Bowne-AndersonOther Resources* The Rise of Agentic Search, a recent VG Podcast with Jeff Huber* Karpathy on Cognitive Core LLMs* Cheat at Search with Agents course by Doug Turnbull (use code: vanishinggradients for $200 off)* Upcoming Events on Luma* Vanishing Gradients on YouTube* Watch the podcast video on YouTube* Join the final cohort of our Building AI Applications course in Q1, 2026 (25% off for listeners)Timestamps (for YouTube livestream)00:00 How to Build Agentic Search & Retrieval Systems02:48 Defining Search and AI03:26 Evolution of Search Technologies08:46 Search in E-commerce and Other Domains12:15 Combining Search and AI: RAG and LLMs23:50 User Intent and Search Optimization29:47 Levels of AI Integration in Search32:25 Exploring the Complexity of Search in Various Domains33:49 The Evolution and Impact of Agentic Search34:07 Defining Terms: RAG and Agentic Search34:52 The Research Loop and Tool Interaction35:55 Formal Protocols and Structured Outputs38:39 Building Agentic Search Experiences: Tips and Advice41:50 The Importance of Empathy in AI and Search Development54:30 The Role of UX in Search Applications01:01:15 Future of Search: Malleable User Interfaces01:02:38 Exploring Malleable Software01:04:20 The Coordination Challenge in Software Development01:05:23 The Impact of Claude Code & Claude Cowork01:06:22 The Future of Knowledge Work with AI01:12:39 Evaluating Search Algorithms with AI01:15:15 The Role of Agents in Search Optimization01:29:55 Teaching AI and Search Techniques01:34:25 Final Thoughts and Farewell👉 Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands on exercises and office hours. Here is a discount code for readers. 👈https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgpod This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 67: Saving Hundreds of Hours of Dev Time with AI Agents That Learn

2026-01-1401:18:22

This is continual learning, right? Everyone has been talking about continual learning as the next challenge in AI. Actually, it’s solved. Just tell it to keep some notes somewhere. Sure, it’s not, it’s not machine learning, but in some ways it is because when it will load this text file again, it will influence what it does … And it works so well: it’s easy to understand. It’s easy to inspect, it’s easy to evolve and modify!Eleanor Berger and Isaac Flaath, the minds behind Elite AI Assisted Coding, join Hugo to talk about how to redefine software development through effective AI-assisted coding, leveraging “specification-first” approaches and advanced agentic workflows.We Discuss:* Markdown learning loops: Use simple agents.md files for agents to self-update rules and persist context, creating inspectable, low-cost learning;* Intent-first development: As AI commoditizes syntax, defining clear specs and what makes a result “good” becomes the core, durable developer skill;* Effortless documentation: Leverage LLMs to distill messy “brain dumps” or walks-and-talks into structured project specifications, offloading context faster;* Modular agent skills: Transition from MCP servers to simple markdown-based “skills” with YAML and scripts, allowing progressive disclosure of tool details;* Scheduled async agents: Break the chat-based productivity ceiling by using GitHub Actions or Cron jobs for agents to work on issues, shifting humans to reviewers;* Automated tech debt audits: Deploy background agents to identify duplicate code, architectural drift, or missing test coverage, leveraging AI to police AI-induced messiness;* Explicit knowledge culture: AI agents eliminate “cafeteria chat” by forcing explicit, machine-readable documentation, solving the perennial problem of lost institutional knowledge;* Tiered model strategy: Optimize token spend by using high-tier “reasoning” models (e.g., Opus) for planning and low-cost, high-speed models (e.g., Flash) for execution;* Ephemeral software specs: With near-zero generation costs, software shifts from static products to dynamic, regenerated code based on a permanent, underlying specification.You can also find the full episode on Spotify, Apple Podcasts, and YouTube.You can also interact directly with the transcript here in NotebookLM: If you do so, let us know anything you find in the comments!👉 Eleanor & Isaac are teaching their next cohort of their Elite AI Assisted Coding course starting this week. They’re kindly giving readers of Vanishing Gradients 25% off. Use this link.👈👉 Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands on exercises and office hours. Here is a discount code for readers. 👈Show Notes* Elite AI Assisted Coding Substack* Eleanor Berger on LinkedIn* Isaac Flaath on LinkedIn* Elite AI Assisted Coding Course (Use the code HUGO for 25% off)* How to Build an AI Agent with AI-Assisted Coding* Eleanor/Isaac’s blog post “The SpecFlow Process for AI Coding”* Eleanor’s growing list of (free) tutorials on Agent Skills* Eleanor’s YouTube playlist on agent skills* Eleanor’s blog post “Are (Agent) Skills the New Apps”* Simon Willison’s blog post on skills/general computer automation/data journalism agents* Eleanor/Isaac’s blog post about asynchronous client agents in GitHub actions* Eleanor/Isaac’s blog post on agentic coding workflows with Hang Yu, Product Lead for Qoder @ Alibaba* Upcoming Events on Luma* Vanishing Gradients on YouTube* Watch the podcast video on YouTube* Join the final cohort of our Building AI Applications course in Q1, 2026 (25% off for listeners)Timestamps (for YouTube livestream)00:00 Introduction to Elite AI Assisted Coding02:24 Starting a New AI Project: Best Practices03:19 The Importance of Context in AI Projects07:19 Specification-First Planning12:01 Sharing Intent and Documentation18:27 Living Documentation and Continual Learning24:36 Choosing the Right Tools and Models29:18 Managing Costs and Token Usage40:16 Using Different Models for Different Tasks43:41 Mastering One Model for Better Results44:54 The Rise of Agent Skills in 202645:34 Understanding the Importance of Skills47:18 Practical Applications of Agent Skills01:11:43 Security Concerns with AI Agents01:15:02 Collaborative AI-Assisted Coding01:18:59 Future of AI-Assisted Coding01:22:27 Key Takeaways for Effective AI-Assisted CodingLive workshop with Eleanor, Isaac, & HugoWe also recently did a 90-minute workshop on How to Build an AI Agent with AI-Assisted Coding.We wrote a blog post on it for those who don’t have 90 minutes right now. Check it out here.I then made a 4 min video about it all for those who don’t have time to read the blog post.👉 Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands on exercises and office hours. Here is a discount code for readers. 👈https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vg-ei This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 66: The Agent Paradox - Why Moderna's Most Productive AI Systems Aren't Agents

2026-01-0842:58

Surprise. We don’t have agents. I actually went in and did an audit of all the LLM applications that we’ve developed internally. And if you were to take Anthropic’s definition of workflow versus agent, we don’t have agents. I would not classify any of our applications as agents. xEric Ma, who leads Research Data Science in the Data Science and AI group at Moderna, joins Hugo on moving past the hype of autonomous agents to build reliable, high-value workflows.We discuss:* Reliable Workflows: Prioritize rigid workflows over dynamic AI agents to ensure reliability and minimize stochasticity in production environments;* Permission Mapping: The true challenge in regulated environments is security, specifically mapping permissions across source documents, vector stores, and model weights;* Trace Log Risk: LLM execution traces pose a regulatory risk, inadvertently leaking restricted data like trade secrets or personal information;* High-Value Data Work: LLMs excel at transforming archived documents and freeform forms into required formats, offloading significant “janitorial” work from scientists;* “Non-LLM” First: Solve problems with simpler tools like Python or ML models before LLMs to ensure robustness and eliminate generative AI stochasticity;* Contextual Evaluation: Tailor evaluation rigor to consequences; low-stakes tools can be “vibe-checked,” while patient safety outputs demand exhaustive error characterization;* Serverless Biotech Backbone: Serverless infrastructure like Modal and reactive notebooks such as Marimo empowers biotech data scientists for rapid deployment without heavy infrastructure overhead.You can also find the full episode on Spotify, Apple Podcasts, and YouTube.You can also interact directly with the transcript here in NotebookLM: If you do so, let us know anything you find in the comments!👉 Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands on exercises and office hours. Our final cohort is in Q1, 2206. Here is a 35% discount code for readers. 👈https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgch👉 Eric & Hugo have a free upcoming livestream workshop: Building Tools for Thinking with AI (register to join live or get the recording afterwards) 👈Show notes* Eric’s website* Eric Ma on LinkedIn* Eric’s blog* Eric’s data science newsletter* Building Effective AI Agents by the Anthropic team* Wow, Marimo from Eric’s blog* Wow, Modal from Eric’s blog* Upcoming Events on Luma* Watch the podcast video on YouTube* Join the final cohort of our Building AI Applications course in Q1, 2026 (35% off for listeners)Timestamps00:00 Defining Agents and Workflows02:04 Challenges in Regulated Environments04:24 Eric Ma's Role at Moderna, Leading Research Data Science in the Data Science and AI Group12:37 Document Reformatting and Automation15:42 Data Security and Permission Mapping20:05 Choosing the Right Model for Production20:41 Evaluating Model Changes with Benchmarks23:10 Vibe-Based Evaluation vs. Formal Testing27:22 Security and Fine-Tuning in LLMs28:45 Challenges and Future of Fine-Tuning34:00 Security Layers and Information Leakage37:48 Wrap-Up and Final Remarks👉 Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands on exercises and office hours. Our final cohort is in Q1, 2026. Here is a 35% discount code for readers. 👈https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgch This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 65: The Rise of Agentic Search

2025-12-1951:53

We’re really moving from a world where humans are authoring search queries and humans are executing those queries and humans are digesting the results to a world where AI is doing that for us.Jeff Huber, CEO and co-founder of Chroma, joins Hugo to talk about how agentic search and retrieval are changing the very nature of search and software for builders and users alike.We Discuss:* “Context engineering”, the strategic design and engineering of what context gets fed to the LLM (data, tools, memory, and more), which is now essential for building reliable, agentic AI systems;* Why simply stuffing large context windows is no longer feasible due to “context rot” as AI applications become more goal-oriented and capable of multi-step tasks* A framework for precisely curating and providing only the most relevant, high-precision information to ensure accurate and dependable AI systems;* The “agent harness”, the collection of tools and capabilities an agent can access, and how to construct these advanced systems;* Emerging best practices for builders, including hybrid search as a robust default, creating “golden datasets” for evaluation, and leveraging sub-agents to break down complex tasks* The major unsolved challenge of agent evaluation, emphasizing a shift towards iterative, data-centric approaches.You can also find the full episode on Spotify, Apple Podcasts, and YouTube.You can also interact directly with the transcript here in NotebookLM: If you do so, let us know anything you find in the comments!👉 Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands on exercises and office hours. Our final cohort is in Q1, 2206. Here is a 35% discount code for readers. 👈Oh! One more thing: we’ve just announced a Vanishing Gradients livestream for January 21 that you may dig:* A Builder’s Guide to Agentic Search & Retrieval with Doug Turnbull and John Berryman (register to join live or get the recording afterwards.Show notes* Jeff Huber on Twitter* Jeff Huber on LinkedIn* Try Chroma!* Context Rot: How Increasing Input Tokens Impacts LLM Performance by The Chroma Team* AI Agent Harness, 3 Principles for Context Engineering, and the Bitter Lesson Revisited* From Context Engineering to AI Agent Harnesses: The New Software Discipline* Generative Benchmarking by The Chroma Team* Effective context engineering for AI agents by The Anthropic Team* Making Sense of Millions of Conversations for AI Agents by Ivan Leo (Manus) and Hugo* How we built our multi-agent research system by The Anthropic Team* Upcoming Events on Luma* Watch the podcast video on YouTube👉 Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands on exercises and office hours. Our final cohort is in Q1, 2206. Here is a 35% discount code for readers. 👈https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgch This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 64: Data Science Meets Agentic AI with Michael Kennedy (Talk Python)

2025-12-0301:02:56

We have been sold a story of complexity. Michael Kennedy (Talk Python) argues we can escape this by relentlessly focusing on the problem at hand, reducing costs by orders of magnitude in software, data, and AI.In this episode, Michael joins Hugo to dig into the practical side of running Python systems at scale. They connect these ideas to the data science workflow, exploring which software engineering practices allow AI teams to ship faster and with more confidence. They also detail how to deploy systems without unnecessary complexity and how Agentic AI is fundamentally reshaping development workflows.We talk through:- Escaping complexity hell to reduce costs and gain autonomy- The specific software practices, like the "Docker Barrier", that matter most for data scientists- How to replace complex cloud services with a simple, robust $30/month stack- The shift from writing code to "systems thinking" in the age of Agentic AI- How to manage the people-pleasing psychology of AI agents to prevent broken code- Why struggle is still essential for learning, even when AI can do the work for youLINKSTalk Python In Production, the Book! (https://talkpython.fm/books/python-in-production)Just Enough Python for Data Scientists Course (https://training.talkpython.fm/courses/just-enough-python-for-data-scientists)Agentic AI Programming for Python Course (https://training.talkpython.fm/courses/agentic-ai-programming-for-python)Talk Python To Me (https://talkpython.fm/) and a recent episode with Hugo as guest: Building Data Science with Foundation LLM Models (https://talkpython.fm/episodes/show/526/building-data-science-with-foundation-llm-models)Python Bytes podcast (https://pythonbytes.fm/)Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)Watch the podcast video on YouTube (https://youtube.com/live/jfSRxxO3aRo?feature=share)Join the final cohort of our Building AI Applications course starting Jan 12, 2026 (35% off for listeners) (https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgrav): https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgrav This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 63: Why Gemini 3 Will Change How You Build AI Agents with Ravin Kumar (Google DeepMind)

2025-11-2201:00:13

Gemini 3 is a few days old and the massive leap in performance and model reasoning has big implications for builders: as models begin to self-heal, builders are literally tearing out the functionality they built just months ago... ripping out the defensive coding and reshipping their agent harnesses entirely.Ravin Kumar (Google DeepMind) joins Hugo to breaks down exactly why the rapid evolution of models like Gemini 3 is changing how we build software. They detail the shift from simple tool calling to building reliable "Agent Harnesses", explore the architectural tradeoffs between deterministic workflows and high-agency systems, the nuance of preventing context rot in massive windows, and why proper evaluation infrastructure is the only way to manage the chaos of autonomous loops.They talk through:- The implications of models that can "self-heal" and fix their own code- The two cultures of agents: LLM workflows with a few tools versus when you should unleash high-agency, autonomous systems.- Inside NotebookLM: moving from prototypes to viral production features like Audio Overviews- Why Needle in a Haystack benchmarks often fail to predict real-world performance- How to build agent harnesses that turn model capabilities into product velocity- The shift from measuring latency to managing time-to-compute for reasoning tasksLINKSFrom Context Engineering to AI Agent Harnesses: The New Software Discipline, a podcast Hugo did with Lance Martin, LangChain (https://high-signal.delphina.ai/episode/context-engineering-to-ai-agent-harnesses-the-new-software-discipline)Context Rot: How Increasing Input Tokens Impacts LLM Performance (https://research.trychroma.com/context-rot)Effective context engineering for AI agents by Anthropic (https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)Watch the podcast video on YouTube (https://youtu.be/CloimQsQuJM)Join the final cohort of our Building AI Applications course starting Jan 12, 2026 (https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgrav): https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgrav This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 62: Practical AI at Work: How Execs and Developers Can Actually Use LLMs

2025-10-3159:04

Many leaders are trapped between chasing ambitious, ill-defined AI projects and the paralysis of not knowing where to start. Dr. Randall Olson argues that the real opportunity isn't in moonshots, but in the "trillions of dollars of business value" available right now. As co-founder of Wyrd Studios, he bridges the gap between data science, AI engineering, and executive strategy to deliver a practical framework for execution.In this episode, Randy and Hugo lay out how to find and solve what might be considered "boring but valuable" problems, like an EdTech company automating 20% of its support tickets with a simple retrieval bot instead of a complex AI tutor. They discuss how to move incrementally along the "agentic spectrum" and why treating AI evaluation with the same rigor as software engineering is non-negotiable for building a disciplined, high-impact AI strategy.They talk through:How a non-technical leader can prototype a complex insurance claim classifier using just photos and a ChatGPT subscription.The agentic spectrum: Why you should start by automating meeting summaries before attempting to build fully autonomous agents.The practical first step for any executive: Building a personal knowledge base with meeting transcripts and strategy docs to get tailored AI advice.Why treating AI evaluation with the same rigor as unit testing is essential for shipping reliable products.The organizational shift required to unlock long-term AI gains, even if it means a short-term productivity dip.LINKSRandy on LinkedIn (https://www.zenml.io/llmops-database)Wyrd Studios (https://thewyrdstudios.com/)Stop Building AI Agents (https://www.decodingai.com/p/stop-building-ai-agents)Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)Watch the podcast video on YouTube (https://youtu.be/-YQjKH3wRvc)🎓 Learn more:In Hugo's course: Building AI Applications for Data Scientists and Software Engineers (https://maven.com/hugo-stefan/building-llm-apps-ds-and-swe-from-first-principles?promoCode=AI20) — https://maven.com/hugo-stefan/building-llm-apps-ds-and-swe-from-first-principles?promoCode=AI20 Next cohort starts November 3: come build with us! This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 61: The AI Agent Reliability Cliff: What Happens When Tools Fail in Production

2025-10-1628:04

Most AI teams find their multi-agent systems devolving into chaos, but ML Engineer Alex Strick van Linschoten argues they are ignoring the production reality. In this episode, he draws on insights from the LLM Ops Database (750+ real-world deployments then; now nearly 1,000!) to systematically measure and engineer constraint, turning unreliable prototypes into robust, enterprise-ready AI.Drawing from his work at Zen ML, Alex details why success requires scaling down and enforcing MLOps discipline to navigate the unpredictable "Agent Reliability Cliff". He provides the essential architectural shifts, evaluation hygiene techniques, and practical steps needed to move beyond guesswork and build scalable, trustworthy AI products.We talk through:- Why "shoving a thousand agents" into an app is the fastest route to unmanageable chaos- The essential MLOps hygiene (tracing and continuous evals) that most teams skip- The optimal (and very low) limit for the number of tools an agent can reliably use- How to use human-in-the-loop strategies to manage the risk of autonomous failure in high-sensitivity domains- The principle of using simple Python/RegEx before resorting to costly LLM judgesLINKSThe LLMOps Database: 925 entries as of today....submit a use case to help it get to 1K! (https://www.zenml.io/llmops-database)Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)Watch the podcast video on YouTube (https://youtu.be/-YQjKH3wRvc)🎓 Learn more:-This was a guest Q&A from Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/hugo-stefan/building-llm-apps-ds-and-swe-from-first-principles?promoCode=AI20) — https://maven.com/hugo-stefan/building-llm-apps-ds-and-swe-from-first-principles?promoCode=AI20 Next cohort starts November 3: come build with us! This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 60: 10 Things I Hate About AI Evals with Hamel Husain

2025-09-3001:13:16

Most AI teams find "evals" frustrating, but ML Engineer Hamel Husain argues they’re just using the wrong playbook. In this episode, he lays out a data-centric approach to systematically measure and improve AI, turning unreliable prototypes into robust, production-ready systems.Drawing from his experience getting countless teams unstuck, Hamel explains why the solution requires a "revenge of the data scientists." He details the essential mindset shifts, error analysis techniques, and practical steps needed to move beyond guesswork and build AI products you can actually trust.We talk through: The 10(+1) critical mistakes that cause teams to waste time on evals Why "hallucination scores" are a waste of time (and what to measure instead) The manual review process that finds major issues in hours, not weeks A step-by-step method for building LLM judges you can actually trust How to use domain experts without getting stuck in endless review committees Guest Bryan Bischof's "Failure as a Funnel" for debugging complex AI agentsIf you're tired of ambiguous "vibe checks" and want a clear process that delivers real improvement, this episode provides the definitive roadmap.LINKSHamel's website and blog (https://hamel.dev/)Hugo speaks with Philip Carter (Honeycomb) about aligning your LLM-as-a-judge with your domain expertise (https://vanishinggradients.fireside.fm/51)Hamel Husain on Lenny's pocast, which includes a live demo of error analysis (https://www.lennysnewsletter.com/p/why-ai-evals-are-the-hottest-new-skill)The episode of VG in which Hamel and Hugo talk about Hamel's "data consulting in Vegas" era (https://vanishinggradients.fireside.fm/9)Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)Watch the podcast video on YouTube (https://youtube.com/live/QEk-XwrkqhI?feature=share)Hamel's AI evals course, which he teaches with Shreya Shankar (UC Berkeley): starts Oct 6 and this link gives 35% off! (https://maven.com/parlance-labs/evals?promoCode=GOHUGORGOHOME) https://maven.com/parlance-labs/evals?promoCode=GOHUGORGOHOME🎓 Learn more:Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — https://maven.com/s/course/d56067f338 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 59: Patterns and Anti-Patterns For Building with AI

2025-09-2347:37

John Berryman (Arcturus Labs; early GitHub Copilot engineer; co-author of Relevant Search and Prompt Engineering for LLMs) has spent years figuring out what makes AI applications actually work in production. In this episode, he shares the “seven deadly sins” of LLM development — and the practical fixes that keep projects from stalling. From context management to retrieval debugging, John explains the patterns he’s seen succeed, the mistakes to avoid, and why it helps to think of an LLM as an “AI intern” rather than an all-knowing oracle. We talk through: - Why chasing perfect accuracy is a dead end - How to use agents without losing control - Context engineering: fitting the right information in the window - Starting simple instead of over-orchestrating - Separating retrieval from generation in RAG - Splitting complex extractions into smaller checks - Knowing when frameworks help — and when they slow you down A practical guide to avoiding the common traps of LLM development and building systems that actually hold up in production.LINKS:Context Engineering for AI Agents, a free, upcoming lightning lesson from John and Hugo (https://maven.com/p/4485aa/context-engineering-for-ai-agents)The Hidden Simplicity of GenAI Systems, a previous lightning lesson from John and Hugo (https://maven.com/p/a8195d/the-hidden-simplicity-of-gen-ai-systems)Roaming RAG – RAG without the Vector Database, by John (https://arcturus-labs.com/blog/2024/11/21/roaming-rag--rag-without-the-vector-database/)Cut the Chit-Chat with Artifacts, by John (https://arcturus-labs.com/blog/2024/11/11/cut-the-chit-chat-with-artifacts/)Prompt Engineering for LLMs by John and Albert Ziegler (https://amzn.to/4gChsFf)Relevant Search by John and Doug Turnbull (https://amzn.to/3TXmDHk)Arcturus Labs (https://arcturus-labs.com/)Watch the podcast on YouTube (https://youtu.be/mKTQGKIUq8M)Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)🎓 Learn more:Hugo's course (this episode was a guest Q&A from the course): Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — https://maven.com/s/course/d56067f338 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 58: Building GenAI Systems That Make Business Decisions with Thomas Wiecki (PyMC Labs)

2025-09-0901:00:45

While most conversations about generative AI focus on chatbots, Thomas Wiecki (PyMC Labs, PyMC) has been building systems that help companies make actual business decisions. In this episode, he shares how Bayesian modeling and synthetic consumers can be combined with LLMs to simulate customer reactions, guide marketing spend, and support strategy. Drawing from his work with Colgate and others, Thomas explains how to scale survey methods with AI, where agents fit into analytics workflows, and what it takes to make these systems reliable. We talk through: Using LLMs as “synthetic consumers” to simulate surveys and test product ideas How Bayesian modeling and causal graphs enable transparent, trustworthy decision-making Building closed-loop systems where AI generates and critiques ideas Guardrails for multi-agent workflows in marketing mix modeling Where generative AI breaks (and how to detect failure modes) The balance between useful models and “correct” models If you’ve ever wondered how to move from flashy prototypes to AI systems that actually inform business strategy, this episode shows what it takes. LINKS:The AI MMM Agent, An AI-Powered Shortcut to Bayesian Marketing Mix Insights (https://www.pymc-labs.com/blog-posts/the-ai-mmm-agent)AI-Powered Decision Making Under Uncertainty Workshop w/ Allen Downey & Chris Fonnesbeck (PyMC Labs) (https://youtube.com/live/2Auc57lxgeU)The Podcast livestream on YouTube (https://youtube.com/live/so4AzEbgSjw?feature=share)Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)🎓 Learn more:Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — https://maven.com/s/course/d56067f338 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 57: AI Agents and LLM Judges at Scale: Processing Millions of Documents (Without Breaking the Bank)

2025-08-2941:28

While many people talk about “agents,” Shreya Shankar (UC Berkeley) has been building the systems that make them reliable. In this episode, she shares how AI agents and LLM judges can be used to process millions of documents accurately and cheaply. Drawing from work on projects ranging from databases of police misconduct reports to large-scale customer transcripts, Shreya explains the frameworks, error analysis, and guardrails needed to turn flaky LLM outputs into trustworthy pipelines. We talk through: - Treating LLM workflows as ETL pipelines for unstructured text - Error analysis: why you need humans reviewing the first 50–100 traces - Guardrails like retries, validators, and “gleaning” - How LLM judges work — rubrics, pairwise comparisons, and cost trade-offs - Cheap vs. expensive models: when to swap for savings - Where agents fit in (and where they don’t) If you’ve ever wondered how to move beyond unreliable demos, this episode shows how to scale LLMs to millions of documents — without breaking the bank.LINKSShreya's website (https://www.sh-reya.com/)DocETL, A system for LLM-powered data processing (https://www.docetl.org/)Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)Watch the podcast video on YouTube (https://youtu.be/3r_Hsjy85nk)Shreya's AI evals course, which she teaches with Hamel "Evals" Husain (https://maven.com/parlance-labs/evals?promoCode=GOHUGORGOHOME)🎓 Learn more:Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — https://maven.com/s/course/d56067f338 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 56: DeepMind Just Dropped Gemma 270M... And Here’s Why It Matters

2025-08-1445:41

While much of the AI world chases ever-larger models, Ravin Kumar (Google DeepMind) and his team build across the size spectrum, from billions of parameters down to this week’s release: Gemma 270M, the smallest member yet of the Gemma 3 open-weight family. At just 270 million parameters, a quarter the size of Gemma 1B, it’s designed for speed, efficiency, and fine-tuning. We explore what makes 270M special, where it fits alongside its billion-parameter siblings, and why you might reach for it in production even if you think “small” means “just for experiments.” We talk through: - Where 270M fits into the Gemma 3 lineup — and why it exists - On-device use cases where latency, privacy, and efficiency matter - How smaller models open up rapid, targeted fine-tuning - Running multiple models in parallel without heavyweight hardware - Why “small” models might drive the next big wave of AI adoption If you’ve ever wondered what you’d do with a model this size (or how to squeeze the most out of it) this episode will show you how small can punch far above its weight.LINKSIntroducing Gemma 3 270M: The compact model for hyper-efficient AI (Google Developer Blog) (https://developers.googleblog.com/en/introducing-gemma-3-270m/)Full Model Fine-Tune Guide using Hugging Face Transformers (https://ai.google.dev/gemma/docs/core/huggingface_text_full_finetune)The Gemma 270M model on HuggingFace (https://huggingface.co/google/gemma-3-270m)The Gemma 270M model on Ollama (https://ollama.com/library/gemma3:270m)Building AI Agents with Gemma 3, a workshop with Ravin and Hugo (https://www.youtube.com/live/-IWstEStqok) (Code here (https://github.com/canyon289/ai_agent_basics))From Images to Agents: Building and Evaluating Multimodal AI Workflows, a workshop with Ravin and Hugo (https://www.youtube.com/live/FNlM7lSt8Uk)(Code here (https://github.com/canyon289/ai_image_agent))Evaluating AI Agents: From Demos to Dependability, an upcoming workshop with Ravin and Hugo (https://lu.ma/ezgny3dl)Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)Watch the podcast video on YouTube (https://youtu.be/VZDw6C2A_8E)🎓 Learn more:Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — https://maven.com/s/course/d56067f338 ($600 off early bird discount for November cohort availiable until August 16) This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 55: From Frittatas to Production LLMs: Breakfast at SciPy

2025-08-1238:09

Traditional software expects 100% passing tests. In LLM-powered systems, that’s not just unrealistic — it’s a feature, not a bug. Eric Ma leads research data science in Moderna’s data science and AI group, and over breakfast at SciPy we explored why AI products break the old rules, what skills different personas bring (and miss), and how to keep systems alive after the launch hype fades. You’ll hear the clink of coffee cups, the murmur of SciPy in the background, and the occasional bite of frittata as we talk (hopefully also a feature, not a bug!)We talk through: • The three personas — and the blind spots each has when shipping AI systems • Why “perfect” tests can be a sign you’re testing the wrong thing • Development vs. production observability loops — and why you need both • How curiosity about failing data separates good builders from great ones • Ways large organizations can create space for experimentation without losing delivery focus If you want to build AI products that thrive in the messy real world, this episode will help you embrace the chaos — and make it work for you.LINKSEric' Website (https://ericmjl.github.io/)More about the workshops Eric and Hugo taught at SciPy (https://hugobowne.substack.com/p/stress-testing-llms-evaluation-frameworks)Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)🎓 Learn more:Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — https://maven.com/s/course/d56067f338 ($600 off early bird discount for November cohort availiable until August 16) This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 54: Scaling AI: From Colab to Clusters — A Practitioner’s Guide to Distributed Training and Inference

2025-07-1841:18

Colab is cozy. But production won’t fit on a single GPU.Zach Mueller leads Accelerate at Hugging Face and spends his days helping people go from solo scripts to scalable systems. In this episode, he joins me to demystify distributed training and inference — not just for research labs, but for any ML engineer trying to ship real software.We talk through: • From Colab to clusters: why scaling isn’t just about training massive models, but serving agents, handling load, and speeding up iteration • Zero-to-two GPUs: how to get started without Kubernetes, Slurm, or a PhD in networking • Scaling tradeoffs: when to care about interconnects, which infra bottlenecks actually matter, and how to avoid chasing performance ghosts • The GPU middle class: strategies for training and serving on a shoestring, with just a few cards or modest credits • Local experiments, global impact: why learning distributed systems—even just a little—can set you apart as an engineerIf you’ve ever stared at a Hugging Face training script and wondered how to run it on something more than your laptop: this one’s for you.LINKSZach on LinkedIn (https://www.linkedin.com/in/zachary-mueller-135257118/)Hugo's blog post on Stop Buliding AI Agents (https://www.linkedin.com/posts/hugo-bowne-anderson-045939a5_yesterday-i-posted-about-stop-building-ai-activity-7346942036752613376-b8-t/)Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)Hugo's recent newsletter about upcoming events and more! (https://hugobowne.substack.com/p/stop-building-agents)🎓 Learn more:Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — https://maven.com/s/course/d56067f338Zach's course (45% off for VG listeners!): Scratch to Scale: Large-Scale Training in the Modern World (https://maven.com/walk-with-code/scratch-to-scale?promoCode=hugo39) -- https://maven.com/walk-with-code/scratch-to-scale?promoCode=hugo39📺 Watch the video version on YouTube: YouTube link (https://youtube.com/live/76NAtzWZ25s?feature=share) This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 53: Human-Seeded Evals & Self-Tuning Agents: Samuel Colvin on Shipping Reliable LLMs

2025-07-0844:50

Demos are easy; durability is hard. Samuel Colvin has spent a decade building guardrails in Python (first with Pydantic, now with Logfire), and he’s convinced most LLM failures have nothing to do with the model itself. They appear where the data is fuzzy, the prompts drift, or no one bothered to measure real-world behavior. Samuel joins me to show how a sprinkle of engineering discipline keeps those failures from ever reaching users.We talk through: • Tiny labels, big leverage: how five thumbs-ups/thumbs-downs are enough for Logfire to build a rubric that scores every call in real time • Drift alarms, not dashboards: catching the moment your prompt or data shifts instead of reading charts after the fact • Prompt self-repair: a prototype agent that rewrites its own system prompt—and tells you when it still doesn’t have what it needs • The hidden cost curve: why the last 15 percent of reliability costs far more than the flashy 85 percent demo • Business-first metrics: shipping features that meet real goals instead of chasing another decimal point of “accuracy”If you’re past the proof-of-concept stage and staring down the “now it has to work” cliff, this episode is your climbing guide.LINKSPydantic (https://pydantic.dev/)Logfire (https://pydantic.dev/logfire)Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)Hugo's recent newsletter about upcoming events and more! (https://hugobowne.substack.com/p/stop-building-agents)🎓 Learn more:Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — next cohort starts July 8: https://maven.com/s/course/d56067f338📺 Watch the video version on YouTube: YouTube link (https://youtube.com/live/wk6rPZ6qJSY?feature=share) This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

Episode 52: Why Most LLM Products Break at Retrieval (And How to Fix Them)

2025-07-0228:38

Most LLM-powered features do not break at the model. They break at the context. So how do you retrieve the right information to get useful results, even under vague or messy user queries?In this episode, we hear from Eric Ma, who leads data science research in the Data Science and AI group at Moderna. He shares what it takes to move beyond toy demos and ship LLM features that actually help people do their jobs.We cover:• How to align retrieval with user intent and why cosine similarity is not the answer• How a dumb YAML-based system outperformed so-called smart retrieval pipelines• Why vague queries like “what is this all about” expose real weaknesses in most systems• When vibe checks are enough and when formal evaluation is worth the effort• How retrieval workflows can evolve alongside your product and user needsIf you are building LLM-powered systems and care about how they work, not just whether they work, this one is for you.LINKSEric's website (https://ericmjl.github.io/)Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)Hugo's recent newsletter about upcoming events and more! (https://hugobowne.substack.com/p/stop-building-agents)🎓 Learn more:Hugo's course: Building LLM Applications for Data Scientists and Software Engineers (https://maven.com/s/course/d56067f338) — next cohort starts July 8: https://maven.com/s/course/d56067f338📺 Watch the video version on YouTube: YouTube link (https://youtu.be/d-FaR5Ywd5k) This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit hugobowne.substack.com

#box-pro-ellipsis-177345434429970{-webkit-line-clamp:2;}Vanishing Gradients