AI Evals & Discovery
Updated: 2025-09-23
Description
What you’ll learn in this episode:
- What “evals” actually mean in the AI/ML world
- Why evals are more than just quality assurance
- The difference between golden datasets, synthetic data, and real-world traces
- How to identify error modes and turn them into evals
- When to use code-based evals vs. LLM-as-judge evals (see the sketch after this list)
- How discovery practices inform every step of AI product evaluation
- Why evals require continuous maintenance (and what “criteria drift” means for your product)
- The relationship between evals, guardrails, and ongoing human oversight
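To make the code-based vs. LLM-as-judge distinction concrete, here is a minimal sketch in Python. It is not from the episode: the trace format, the grading criterion, and the `call_llm` helper are illustrative placeholders, not Teresa's actual Interview Coach rubric. The first check validates a structural criterion directly in code; the second delegates a subjective criterion to a judge model via a prompt.

```python
import json

# --- Code-based eval: deterministic, cheap, runs on every trace ---
# Checks a structural criterion that code can verify directly, e.g. that the
# coach's reply is valid JSON and contains exactly one follow-up question.
def code_based_eval(model_output: str) -> bool:
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    question = parsed.get("follow_up_question", "")
    return isinstance(question, str) and question.count("?") == 1

# --- LLM-as-judge eval: for subjective criteria code can't easily check ---
# The judge prompt asks another model to grade the output against a rubric.
# `call_llm` is a placeholder for whichever LLM client you actually use.
JUDGE_PROMPT = """You are grading an interview-coaching assistant.
Criterion: the follow-up question must be open-ended and story-based
(asking about a specific past experience), not hypothetical or leading.

Assistant output:
{output}

Answer with a single word: PASS or FAIL."""

def llm_as_judge_eval(model_output: str, call_llm) -> bool:
    verdict = call_llm(JUDGE_PROMPT.format(output=model_output))
    return verdict.strip().upper().startswith("PASS")

if __name__ == "__main__":
    sample = '{"follow_up_question": "Tell me about the last time you booked a trip?"}'
    print("code-based:", code_based_eval(sample))
    # llm_as_judge_eval(sample, call_llm=my_client)  # needs a real LLM client
```

A common rule of thumb in the evals literature is to reach for code-based checks first (fast, deterministic, free to run on every trace) and reserve LLM judges for criteria that genuinely require judgment.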
Resources & Links:
- Follow Teresa Torres: https://ProductTalk.org
- Follow Petra Wille: https://Petra-Wille.com
Mentioned in the episode:
- How I Designed & Implemented Evals for Product Talk’s Interview Coach by Teresa Torres
- Teresa’s Interview Coach
- ML (Machine learning)
- Story-Based Customer Interviews on-demand course by Teresa
- LLM (Large language model)
- AI Evals for Engineers and PMs course (get 35% off through Teresa’s link) on Maven
- V0
- JSON (JavaScript Object Notation)
- Anthropic
- The Product Leadership Wheel - A Framework for Defining and Growing Product Leadership at Scale by Petra Wille
- Lovable
- Behind the Scenes: Building the Product Talk Interview Coach by Teresa
- Previous episode: Building AI Products
Coming soon from Teresa:
- Weekly Monday posts sharing lessons learned while building AI products
- A new podcast interviewing cross-functional teams about real-world AI product development stories