Data Engineering Podcast

509 Episodes


The AI-First Data Engineer: 10–50x Productivity and What Changes Next

2026-04-07 | 59:24

Summary
In this episode, I sit down with Gleb Mezhanskiy, CEO and co-founder of Datafold, to explore how agentic AI is reshaping data engineering. We unpack the leap from chat-assisted coding to truly agentic workflows where AI not only writes SQL and dbt models but also executes queries, debugs, runs tests, and ships production-ready outcomes. Gleb explains why teams that master this AI-first loop can see 10–50x gains, how security/compliance concerns can be addressed with platform-native LLM endpoints, and why the role of data engineers is shifting from code authors to operators of autonomous agents. We dig into the consolidation of the modern data stack, the economics driving more data products (Jevons paradox), and why product thinking, domain knowledge, and cross-functional skills will define the next wave of standout data professionals. We also cover practical steps for leaders and ICs: modernizing off legacy platforms, establishing safe AI adoption paths, codifying reusable “skills” and context for agents, and building validation utilities that keep the inner loop fast and trustworthy. Finally, Gleb shares how Datafold moved to fully AI-driven software delivery and why “outcomes over tools” is the emerging model for complex initiatives like data platform migrations, and how this reframes data quality for the AI era, emphasizing broad data access plus rich context over brittle human-centric tests.

Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management.
- If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest: we all need to Retool how we handle data requests.
- Your host is Tobias Macey and today I'm bringing back Gleb Mezhanskiy to talk about our predictions for the impact of AI on data engineering for 2026.

Interview
- Introduction
- How did you get involved in the area of data management?
- What are the concrete steps that teams need to be taking today to take advantage of agentic AI capabilities?
- What are the new guardrails/constraints/workflows that need to be in place before you let AI loose on your data systems?
- How do you balance the potential cost savings and productivity increases with the up-front investment and variability in inference spend?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links
- Blog Post
- Datafold
- Claude Opus 4.5
- Harry Potter - Muggles
- Jevons Paradox
- Modern Data Stack
- Dagster Compass
- Gravity Orion
- MCP == Model Context Protocol
- Qwen

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Treat Metering Like Finance: Building Data Platforms for Consumption Economics

2026-03-29 | 50:19

Summary
In this episode Himant Goyal, Senior Product Manager at Salesforce, talks about how data platform investments enable reliable, accurate metering for consumption-based business models. Himant explains why consumption turns operations into a real-time optimization problem spanning metering, cost attribution, billing, governance, and cross-functional ownership. He explores the richness required in usage data to support sophisticated pricing, the importance of treating metering like a financial system, and the architectural foundations - event schemas, durable ingestion, normalization/validation, a usage ledger, and clear serving layers - needed to power near-real-time visibility with fine-grained drilldowns. He also digs into anti-patterns and reliability concerns such as late or duplicate data, time zone pitfalls, SLAs, and automated policy decisions for pipeline failures. Himant shares practical guidance for capturing usage events from products and logs, balancing push vs. pull and real-time vs. batch processing to manage costs. He highlights configurable metering and rate-card versioning for rapid onboarding of new products, and the cultural shift required for finance, product, and engineering to co-own metering.

Announcements
- Your host is Tobias Macey and today I'm interviewing Himant Goyal about how data platform investments support consumption-based business models.

Interview
- Introduction
- How did you get involved in managing data products or data management?
- Can you start by outlining the types of businesses and products that are "consumption based" and the impact that it has on the economics of the company?
- What are the unique operational challenges that are presented by having consumption as the unit of cost?
- How does the availability and accessibility of metering data impact the level of detail/nuance that the business can employ in their pricing strategies?
- When we talk about the infrastructure for usage tracking, it often feels like a high-stakes stream processing problem. What are the core architectural components required to build a reliable metering pipeline?
- How do you think about the trade-offs between "push" models (application emits events) vs. "pull" models (the platform scrapes resource usage)?
- Accuracy is non-negotiable when data is tied directly to revenue. What are the strategies for ensuring idempotency and handling deduplication in the ingestion layer?
- How do you address the "late-arriving data" problem in a usage-based world, especially when dealing with monthly billing cycles or credit exhaustion?
- From an uptime and reliability perspective, should the metering system be in the critical path of the service itself?
- If the metering service is down, do you "fail open" and provide free service, or "fail closed" and impact availability? How do you build for that kind of resilience?
- One of the common pitfalls is treating metering like logging or observability. How do you ensure that usage metering is treated as a first-class product priority rather than an afterthought for the platform team?
- What does the interface look like for product engineers to "register" a new billable event without breaking the downstream data contract?
- Once you have this data, there is often a requirement for real-time visibility for the end user. What are the data modeling requirements to support both "high-volume ingestion" and "low-latency querying" for customer-facing billing dashboards?
- How do you bridge the gap between the raw event stream and the aggregated "billable unit" in the data warehouse or lakehouse?
- What are the most interesting, innovative, or unexpected ways that you have seen usage-based metering used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on building consumption-based data platforms?
- When is usage-based metering the wrong choice? (e.g., When does the complexity of the data platform outweigh the economic benefits?)
- What are your predictions for the future of consumption-based data architectures?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
- Hackernoon Post
- COGS == Cost of Goods Sold
- Medallion Architecture
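
As a rough illustration of the metering concerns raised in this episode (duplicate deliveries, late-arriving data, and time zone pitfalls), here is a minimal sketch of an idempotent usage ledger. It is not taken from the episode: the `UsageEvent` fields, the `UsageLedger` class, and the in-memory dictionary are all hypothetical stand-ins for whatever event schema and durable store a real platform would use.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class UsageEvent:
    event_id: str          # hypothetical producer-assigned idempotency key
    customer_id: str
    quantity: int
    occurred_at: datetime  # event time; must be timezone-aware


class UsageLedger:
    """In-memory stand-in for a durable usage ledger (illustrative only)."""

    def __init__(self):
        self._events = {}  # event_id -> UsageEvent

    def ingest(self, event: UsageEvent) -> bool:
        """Record an event once; return False when it is a duplicate delivery."""
        if event.occurred_at.tzinfo is None:
            # Reject naive timestamps rather than guess a time zone.
            raise ValueError("occurred_at must be timezone-aware")
        if event.event_id in self._events:
            return False
        self._events[event.event_id] = event
        return True

    def total(self, customer_id: str, start: datetime, end: datetime) -> int:
        """Sum usage by event time in [start, end), so that late arrivals
        are attributed to the period in which the usage actually happened."""
        return sum(
            e.quantity
            for e in self._events.values()
            if e.customer_id == customer_id and start <= e.occurred_at < end
        )
```

Keying on a producer-assigned identifier makes retries safe, and aggregating by event time rather than arrival time is one simple way to keep a late event inside the correct billing period; a production system would also need a policy for events that arrive after a period has been invoiced.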

Beyond the PDF: Rowan Cockett on Reproducible, Composable Science

2026-03-22 | 42:40

Summary
In this episode Rowan Cockett, co-founder and CEO of Curvenote and co-founder of the Continuous Science Foundation, talks about building data systems that make scientific research reproducible, reusable, and easier to communicate. He digs into the sociotechnical roots of the reproducibility crisis - from data integrity and access to entrenched publishing incentives and PDF-bound workflows. He explores open standards and tools like Jupyter, Jupyter Book, and the push toward cloud-optimized formats (e.g., Zarr), along with graceful degradation strategies that keep interactive research usable over time. Rowan details how Curvenote enables interactive, reproducible articles that spin up compute on demand while delegating large dataset storage to specialized partners, and how community efforts like the Continuous Science Foundation and initiatives with Creative Commons aim to fix credit, licensing, and attribution. He also discusses the Open Exchange Architecture (OXA) initiative to establish a modular, computational standard for sharing science, the momentum in computational biosciences and neuroscience, and why true progress hinges on interoperability and composability across data, code, and narrative.

Announcements
- Your host is Tobias Macey and today I'm interviewing Rowan Cockett about building data systems that make scientific research easier to reproduce.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what your interest is in reproducibility of scientific research?
- What role does data play in the set of challenges that plague reproducibility of published research?
- What are some of the notable changes in the areas of scientific process and data systems that have contributed to the current crisis of reproducibility?
- Beyond technological shortcomings, what are the processes that lead to problematic experiment/research design, and how does that complicate the work of other teams trying to build on the experimental findings?
- How does a monolithic approach change the types of research that would be possible with more modular/composable experimentation and research?
- Focusing now on the data-oriented aspects of research, what are the habits of research teams that lead to friction and waste in storing, processing, publishing, and ultimately consuming the information that supports the research findings?
- What are the elements of the work that you are doing at the Continuous Science Foundation and Curvenote to break the status quo?
- Are there any areas of study that are more susceptible to friction and siloing of their data?
- What does a typical engagement with a research group look like as you try to improve the accessibility of their work?
- What are the most interesting, innovative, or unexpected ways that you have seen research data (re-)used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on reproducibility of scientific research?
- What is the next set of challenges that you are focused on addressing in the research/reproducibility space?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
- Continuous Science Foundation
- Curvenote
- Zenodo
- Dryad
- HDF5
- Iceberg
- Zarr
- Myst Markdown
- Jupyter Notebook
- ArXiv
- Journal of Open Source Software (JOSS)
- Data Carpentry
- Software Carpentry
- Open Rxiv
- Bio Rxiv
- Med Rxiv
- Force 11
- JupyterBook
- Open Exchange Architecture (OXA)

Beyond Prompts: Practical Paths to Self‑Improving AI

2026-03-16 | 01:01:50

Summary
In this episode Raj Shukla, CTO of SymphonyAI, explores what it really takes to build self‑improving AI systems that work in production. Raj unpacks how agentic systems interact with real-world environments, the feedback loops that enable continuous learning, and why intelligent memory layers often provide the most practical middle ground between prompt tweaks and full Reinforcement Learning. He discusses the architecture needed around models - data ingestion, sensors, action layers, sandboxes, RBAC, and agent lifecycle management - to reach enterprise-grade reliability, as well as the policy alignment steps required for regulated domains like financial crime. Raj shares hard-won lessons on tool use evolution (from bespoke tools to filesystem and Unix primitives), dynamic code-writing subagents, model version brittleness, and how organizations can standardize process and entity graphs to accelerate time-to-value. He also dives into pitfalls such as policy gaps and tribal knowledge, strategies for staged rollouts and monitoring, and where small models and cost optimization make sense. Raj closes with a vision for bringing RL-style improvement to enterprises without requiring a research team - letting businesses own the reasoning and memory layers that truly differentiate their AI systems.

Announcements
- Your host is Tobias Macey, and today I’m interviewing Raj Shukla about building self-improving AI systems and how they enable AI scalability in real production environments.

Interview
- Introduction
- How did you get involved in AI/ML?
- Can you start by outlining what actually improves over time in a self-improving AI system?
- How is that different from simply improving a model or an agent?
- How would you differentiate between an agent/agentic system vs. a self-improving system?
- One of the components that is becoming common in agentic architectures is a "memory" layer. What are some of the ways that it contributes to a self-improvement feedback loop?
- In what ways are memory layers insufficient for a generalized self-improvement capability?
- For engineering and technology leaders, what are the key architectural and operational steps you recommend to build AI that can move from pilots into scalable, production systems?
- One of the perennial challenges for technology leaders is how to build AI systems that scale over time. How has AI changed the way you think about long-term advantage?
- How do self-improvement feedback loops contribute to AI scalability in real systems?
- What are some of the other key elements necessary to build a truly evolutionary AI system?
- What are the hidden costs of building these AI systems that teams should know before starting? I’m talking about enterprises that are deploying AI into their internal mission-critical workflows.
- What are the most interesting, innovative, or unexpected ways that you have seen self-improving AI systems implemented?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on evolutionary AI systems?
- What are some of the ways that you anticipate agentic architectures and frameworks evolving to be more capable of self-improvement?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?

Links
- Symphony AI
- Reinforcement Learning
- Agentic Memory
- In-Context Learning
- Context Engineering
- Few-Shot Learning
- OpenClaw
- Deep Research Agent
- RAG == Retrieval Augmented Generation
- Agentic Search
- Google Gemma Models
- Ollama

The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra / CC BY-SA 3.0

Orion at Gravity: Trustworthy AI Analysts for the Enterprise

2026-03-08 | 01:05:01

Summary
In this episode of the Data Engineering Podcast, Lucas Thelosen and Drew Gilson, co-founders of Gravity, discuss their vision for agentic analytics in the enterprise, enabled by semantic layers and broader context engineering. They share their journey from Looker and Google to building Orion, an AI analyst that combines data semantics with rich business context to deliver trustworthy and actionable insights. Lucas and Drew explain how Orion uses governed, role-specific "custom agents" to drive analysis, recommendations, and proactive preparation for meetings, while maintaining accuracy, lineage transparency, and human-in-the-loop feedback. The conversation covers evolving views on semantic layers, agent memory, retrieval, and operating across messy data, multiple warehouses, and external context like documents and weather. They emphasize the importance of trust, governance, and the path to AI coworkers that act as reliable colleagues. Lucas and Drew also share field stories from public companies where Orion has surfaced board-level issues, accelerated executive prep with last-minute research, and revealed how BI investments are actually used, highlighting a shift from static dashboards to dynamic, dialog-driven decisions. They stress the need for accessible (non-proprietary) models, managing context and technical debt over time, and focusing on business actions - not just metrics - to unlock real ROI.

Announcements
- Your host is Tobias Macey and today I'm interviewing Lucas Thelosen and Drew Gilson about the application of semantic layers to context engineering for agentic analytics.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by digging into the practical elements of what is involved in the creation and maintenance of a "semantic layer"?
- How does the semantic layer relate to and differ from the physical schema of a data warehouse?
- In generative AI and agentic systems the latest term of art is "context engineering". How does a semantic layer factor into the context management for an agentic analyst?
- What are some of the ways that LLMs/agents can help to populate the semantic layer?
- What are the cases where you want to guard against hallucinations by keeping a human in the loop?
- Beyond a physical semantic layer, what are the other elements of context that you rely on for guiding the activities of your agents?
- What are some utilities that you have found helpful for bootstrapping the structural guidelines for an existing warehouse environment?
- What are the most interesting, innovative, or unexpected ways that you have seen Orion used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Orion?
- When is Orion the wrong choice?
- What do you have planned for the future of Orion?

Contact Info
- Lucas: LinkedIn
- Drew: LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
- Gravity
- Orion
- Looker
- Semantic Layer
- dbt
- LookML
- Tableau
- OpenClaw
- Pareto Distribution

From Models to Momentum: Uniting Architects and Engineers with ER/Studio

2026-03-02 | 45:02

Summary
In this episode of the Data Engineering Podcast, Jamie Knowles (Product Director) and Ryan Hirsch (Product Marketing Manager) discuss the importance of enterprise data modeling with ER/Studio. They highlight how clear, shared semantic models are a foundational discipline for modern data engineering, preventing semantic drift, speeding up delivery, and reducing rework. Jamie explains that ER/Studio helps teams define logical models that translate into physical designs and code across warehouses and analytics platforms, while maintaining traceability and governance. The conversation also touches on how AI increases the tolerance for ambiguity, but doesn't fix unclear definitions - it amplifies them. Jamie and Ryan describe ER/Studio's integrations with governance tools, collaboration features like TeamServer, reverse engineering, and metadata bridges, as well as new AI-assisted modeling capabilities. They emphasize that most data problems are meaning problems, and investing in architecture and a semantic backbone can make engineering faster, governance simpler, and analytics more reliable.

Announcements
- Your host is Tobias Macey and today I'm interviewing Jamie Knowles and Ryan Hirsch about ER/Studio and the foundational role of enterprise data modeling in modern data engineering.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what ER/Studio is and the story behind it?
- How has it evolved to handle the shift from traditional on-prem databases to modern, complex, and highly regulated enterprise environments?
- How do you define "Enterprise Data Architecture" today, and how does it differ from just managing a collection of pipelines in a modern data stack?
- In your view, what are the distinct responsibilities of a Data Architect versus a Data Engineer, and where is the critical overlap where they typically succeed or fail together?
- From what you see in the field, how often are the technical struggles of data engineering teams, like tool sprawl or "broken" pipelines, actually just "data meaning" problems in disguise?
- What is a logical data model, and why do you advocate for framing these as "knowledge models" rather than just technical diagrams?
- What are the long-term consequences, such as "semantic drift" or the erosion of trust, when organizations skip logical modeling to go straight to physical implementation and pipelines?
- What is the intersection of data modeling and data governance?
- What are the elements of integration between ER/Studio and governance platforms that reduce friction and time to delivery?
- For the engineers who worry that architecture and modeling slow down development, how does having a central design authority actually help teams scale and reduce downstream rework?
- What does a typical workflow look like across data architecture and data engineering for individuals and teams who are using ER/Studio as a core part of their modeling?
- What are the most interesting, innovative, or unexpected ways that you have seen ER/Studio used? (Context: specifically regarding grounding AI initiatives or defining enterprise ontologies.)
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on ER/Studio?
- When is ER/Studio the wrong choice for a data team or a specific project?
- What do you have planned for the future of ER/Studio, particularly regarding AI and the "design-time" foundation of the data stack?

Contact Info
- Jamie: LinkedIn
- Ryan: LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
- Idera
- Wherescape
- ER/Studio
- Entity-Relation Diagram (ERD)
- Business Keys
- Medallion Architecture
- RDF == Resource Description Framework
- Collibra
- Martin Fowler
- DB2

From Data Models to Mind Models: Designing AI Memory at Scale

2026-02-22 | 57:47

Summary
In this episode of the Data Engineering Podcast, Vasilije "Vas" Markovic, founder of Cognee, discusses building agentic memory, a crucial aspect of artificial intelligence that enables systems to learn, adapt, and retain knowledge over time. He explains the concept of agentic memory, highlighting the importance of distinguishing between permanent and session memory, graph+vector layers, latency trade-offs, and multi-tenant isolation to ensure safe knowledge sharing or protection. The conversation covers practical considerations such as storage choices (Redis, Qdrant, LanceDB, Neo4j), metadata design, temporal relevance and decay, and emerging research areas like trace-based scoring and reinforcement learning for improving retrieval. Vas shares real-world examples of agentic memory in action, including applications in pharma hypothesis discovery, logistics control towers, and cybersecurity feeds, as well as scenarios where simpler approaches may suffice. He also offers guidance on when to add memory, pitfalls to avoid (naive summarization, uncontrolled fine-tuning), human-in-the-loop realities, and Cognee's future plans: revamped session/long-term stores, decision-trace research, and richer time and transformation mechanisms. Additionally, Vas touches on policy guardrails for agent actions and the potential for more efficient "pseudo-languages" for multi-agent collaboration.

Announcements
- Your host is Tobias Macey and today I'm interviewing Vasilije Markovic about agentic memory architectures and applications.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the different elements of "memory" in an agentic context?
  - storage and retrieval mechanisms
  - how to model memories
  - how does that change as you go from short-term to long-term?
  - managing scope and retrieval triggers
- What are some of the useful triggers in an agent architecture to identify whether/when/what to create a new memory?
- How do things change as you try to build a shared corpus of memory across agents?
- What are the most interesting, innovative, or unexpected ways that you have seen agentic memory used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cognee?
- When is a dedicated memory layer the wrong choice?
- What do you have planned for the future of Cognee?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
- Cognee
- AI Engineering Podcast Episode
- Kimball Memory
- Cognitive Science
- Context Window
- RAG == Retrieval Augmented Generation
- Memory Types
- Redis Vector Store
- Qdrant
- Vector on Edge
- Milvus
- LanceDB
- KuzuDB
- Neo4J
- Mem0
- Zep Graphiti
- A2A (Agent-to-Agent) Protocol
- Snowplow
- Reinforcement Learning
- Model Finetuning
- OpenClaw

Prompt Management, Tracing, and Evals: The New Table Stakes for GenAI Ops

2026-02-15 | 50:43

Summary
In this episode of the Data Engineering Podcast, Aman Agarwal, creator of OpenLit, discusses the operational groundwork required to run LLM-powered applications reliably and cost-effectively. He highlights common blind spots that teams face, including opaque model behavior, runaway token costs, and brittle prompt management, and explains how OpenTelemetry-native observability can turn these black-box interactions into stepwise, debuggable traces across models, tools, and data stores. Aman showcases OpenLit's approach to open standards, vendor-neutral integrations, and practical features such as fleet-managed OTEL collectors, zero-code Kubernetes instrumentation, prompt and secret management, and evaluation workflows. They also explore experimentation patterns, routing across models, and closing the loop from evals to prompt/dataset improvements, demonstrating how better visibility reshapes design choices from prototype to production. Aman shares lessons learned building in the open, where OpenLit fits and doesn't, and what's next in context management, security, and ecosystem integrations, providing resources and examples of multi-database observability deployments for listeners.

Announcements
Your host is Tobias Macey and today I'm interviewing Aman Agarwal about the operational investments that are necessary to ensure you get the most out of your AI models

Interview
Introduction
How did you get involved in the area of AI/data management?
Can you start by giving your assessment of the main blind spots that are common in the existing AI application patterns?
As teams adopt agentic architectures, how common is it to fall prey to those same blind spots?
There are numerous tools/services available now focused on various elements of "LLMOps". What are the major components necessary for a minimum viable operational platform for LLMs?
There are several areas of overlap, as well as disjoint features, in the ecosystem of tools (both open source and commercial). How do you advise teams to navigate the selection process? (point solutions vs. integrated tools, and handling frameworks with only partial overlap)
Can you describe what OpenLit is and the story behind it?
How would you characterize the feature set and focus of OpenLit compared to what you view as the "major players"?
Once you have invested in a platform like OpenLit, how does that change the overall development workflow for the lifecycle of AI/agentic applications?
What are the most complex/challenging elements of change management for LLM-powered systems? (e.g. prompt tuning, model changes, data changes, etc.)
How can the information collected in OpenLit be used to develop a self-improvement flywheel for agentic systems?
Can you describe the architecture and implementation of OpenLit?
How have the scope and goals of the project changed since you started working on it?
Given the foundational aspects of the project that you have built, what are some of the adjacent capabilities that OpenLit is situated to expand into?
What are the sharp edges and blind spots that are still challenging even when you have OpenLit or similar integrated?
What are the most interesting, innovative, or unexpected ways that you have seen OpenLit used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on OpenLit?
When is OpenLit the wrong choice?
What do you have planned for the future of OpenLit?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data/AI management today?

Links
OpenLit
Fleet Hub
OpenTelemetry
LangFuse
LangSmith
TensorZero
AI Engineering Podcast Episode
Traceloop
Helicone
Clickhouse

From Legacy to AI-Ready: How MongoDB AMP Accelerates Modernization

2026-02-08 | 46:45

Summary
In this episode, Shilpa Kolhar, SVP of Product and Engineering at MongoDB, discusses using MongoDB as a unified foundation for AI-driven and agentic applications. She explains how the Application Modernization Platform (AMP) accelerates the transition from legacy relational systems to a document-first architecture, driven by the need for AI-readiness and speed of change. Shilpa highlights MongoDB features such as its native JSON document model, Atlas Vector Search, auto-embeddings, and integrated search, which help eliminate drift and latency across operational data, indexing, and vectors, and she emphasizes the importance of keeping context, transactions, and embeddings together for real-time AI use cases. She shares best practices for re-architecting legacy systems, including schema validation and versioning patterns to tame schema drift, aggregation pipelines for consistent reads, and pragmatic standardization across services. She also details AMP's approach to scoping large estates and the balance of LLM-powered automation with human-in-the-loop governance.

Announcements
Your host is Tobias Macey and today I'm interviewing Shilpa Kolhar about using MongoDB as the foundation for AI-driven applications

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what MongoDB is and the core primitives that it offers?
The MongoDB engine has gone through substantial evolution since it was first introduced over 20 years ago. What are some of the most notable features that have been added in recent years?
You recently launched the MongoDB Application Modernization Platform (AMP). What are the key elements of modernization that it is focused on?
How do the core primitives of the MongoDB engine align with modernization objectives?
There is a lot of attention being paid now to AI applications where data is the most critical element for success. What are the features of MongoDB that lend themselves to being the context store for generative AI services?
Besides the data used for context and grounding, AI applications also want to track user interactions and form short and long term memory to improve the system over time. How can MongoDB assist in that work as well?
While the lack of schema enforcement on write can be beneficial to rapid evolution of software, it can also be a detriment if not managed well. How can MongoDB help in avoiding schema drift over time that leads to old data being incompatible with current code?
What are the most interesting, innovative, or unexpected ways that you have seen MongoDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on MongoDB and application modernization?
When is MongoDB/AMP the wrong choice?
What do you have planned for the future of AMP?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
MongoDB
MongoDB AMP
Google Gemini
Voyage AI
Qdrant
ChromaDB
Weaviate
Pinecone
MongoDB Autoembedding
Retool
ODM == Object Document Mapper
RAG == Retrieval Augmented Generation
Agentic Memory

Branches, Diffs, and SQL: How Dolt Powers Agentic Workflows

2026-02-01 | 56:53

Summary
In this episode Tim Sehn, founder and CEO of DoltHub, talks about Dolt - the world's first version-controlled SQL database - and why Git-style semantics belong at the heart of data systems and AI workflows. Tim explains how Dolt combines a MySQL/Postgres-compatible interface with a novel storage engine built on a "Prolly Tree" to enable fast, row-level branching, merging, and diffs of both schema and data. He digs into real production use cases: powering applications that expose version control to end users, reproducible ML feature stores, managing massive configuration for games, and enabling safe agentic writes via branch-based review flows. He compares Dolt's approach to LakeFS, Neon, and PlanetScale, and explores developer workflows unlocked by decentralized clones, full audit logs, and PR-style data reviews.

Announcements
Your host is Tobias Macey and today I'm interviewing Tim Sehn about Dolt, a version controlled database engine and its applications for agentic workflows

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Dolt is and the story behind it?
What are the key use cases that you are focused on solving by adding version control to the database layer?
There are numerous projects related to different aspects of versioning in different data contexts (e.g. LakeFS, Datomic, etc.). What are the versioning semantics that you are focused on?
You position Dolt as "the database for AI". How does data versioning relate to AI use cases?
What types of AI systems are able to make best use of Dolt's versioning capabilities?
Can you describe how Dolt and Doltgres are implemented?
How have the design and scope of the project changed since you first started working on it?
What are some of the architecture and integration patterns around relational databases that change when you introduce version control semantics as a core primitive?
What are some anti-patterns that you have seen teams develop around Dolt's versioning functionality?
What are the most interesting, innovative, or unexpected ways that you have seen Dolt used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dolt?
When is Dolt the wrong choice?
What do you have planned for the future of Dolt?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Dolt
DoltHub
Stockmarket Data
LakeFS
Datomic
Git
MySQL
Prolly Tree
Neon
Django
Feature Store
MCP Server
Nessie
Iceberg
PlanetScale
O(NlogN) Big O Complexity
B-Tree
Git Merge
Git Rebase
AST == Abstract Syntax Tree
Supabase
CockroachDB
Document Database
MongoDB
Gastown
Beads

Logical First, Physical Second: A Pragmatic Path to Trusted Data

2026-01-25 | 40:50

Summary
In this episode of the Data Engineering Podcast, Jamie Knowles, Product Director for ER/Studio, talks about data architecture and its role in conveying business meaning. He argues that data architecture should start with business meaning, not just physical schemas, and explores the pitfalls of jumping straight to physical designs. Jamie shares his practical definition of data architecture centered on shared semantic models that anchor transactional, analytical, and event-driven systems. The conversation covers strategies for evolving an architecture in tandem with delivery, including defining core concepts, aligning teams through governance, and treating the model as a living product. He also examines how generative AI can both help and harm data architecture, accelerating first drafts but amplifying risk without a human-approved ontology. Jamie emphasizes the importance of doing the hard work upfront to make meaning explicit, keeping models simple and business-aligned, and using tools and patterns to reuse that meaning everywhere.

Announcements
You're a developer who wants to innovate; instead, you're stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It's a flexible, unified platform that's built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps, fast. That's why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
Your host is Tobias Macey and today I'm interviewing Jamie Knowles about the impact that a well-developed data architecture (or lack thereof) has on data engineering work

Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving your definition of "data architecture" and what it encompasses?
How does the nuance change depending on the type of system you are designing? (e.g. data warehouse vs. transactional application database vs. event-driven streaming service)
In application teams that are large enough there is typically a software architect, but that work often ends up happening organically through trial and error. Who is the responsible party for designing and enforcing a proper data architecture?
There have been several generational shifts in approach to data warehouse projects in particular. What are some of the anti-patterns that crop up when there is no one forming a strong opinion on the design/architecture of the warehouse?
The current stage is largely defined by the ELT pattern. What are some of the ways that workflow can encourage shortcuts?
Often the need for a proper architecture isn't felt until an organic architecture has developed. What are some of the ways that teams can short-circuit that pain and iterate toward a more sustainable design?
The common theme in all of the data architecture conversations that I've had is the need for business involvement. There is also a strong push for the business to just want the engineers to deliver data. What are some of the ways that AI utilities can help to accelerate delivery while also capturing business context?
For teams that are already neck deep in a messy architecture, what are the strategies and tactics that they need to start working toward today to get to a better data architecture?
What are the most interesting, innovative, or unexpected ways that you have seen teams approach the creation and implementation of their data architecture?
What are the most interesting, unexpected, or challenging lessons that you have learned while working in data architecture?
How do you see the introduction of AI at each stage of the data lifecycle changing the ways that teams think about their architectural needs?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Idera
ER Studio
ELT
RDF == Resource Description Framework
ORM == Object-Relational Mapping

Your Data, Your Lake: How Observe Uses Iceberg and Streaming ETL for Observability

2026-01-18 | 01:12:21

Summary
In this episode Jacob Leverich, cofounder and CTO of Observe, talks about applying lakehouse architectures to observability workloads. Jacob discusses Observe's decision to leverage cloud-native warehousing and open table formats for scale and cost efficiency. He digs into the core pain points teams face with fragmented tools, soaring costs, and data silos, and how a lakehouse approach - paired with streaming ingest via OpenTelemetry, Kafka-backed durability, curated/columnarized tables, and query orchestration - can deliver low-latency, interactive troubleshooting across logs, metrics, and traces at petabyte scale. He also explores the practicalities of loading and organizing telemetry by use case to reduce read amplification, the role of Iceberg (including v3's JSON shredding) and Snowflake's implementation, and why open table formats enable "your data in your lake" strategies.

Announcements
Your host is Tobias Macey and today I'm interviewing Jacob Leverich about how data lakehouse technologies can be applied to observability for unlimited scale and orders of magnitude improvement on economics

Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what the major pain points have been in the observability space? (e.g. limited scale/retention, costs, integration fragmentation)
What are the elements of the ecosystem and tech stacks that led to that state of the world?
What are you building at Observe that circumvents those pain points?
What are the major ecosystem evolutions that make this a feasible architecture? (e.g. columnar storage, distributed compute, protocol consolidation)
Can you describe the architecture of the Observe platform?
How has the design of the platform evolved/changed direction since you first started working on it?
What was your process for determining which core technologies to build on top of?
What were the missing pieces that you had to engineer around to get a cohesive and performant platform?
The perennial problem with observability systems and data lakes is their tendency to succumb to entropy. What are the guardrails that you are relying on to help customers maintain a well-structured and usable repository of information?
Data lakehouses are excellent for flexibility and scaling to massive data volumes, but they're not known for being fast. What are the areas of investment in the ecosystem that are changing that narrative?
As organizations overcome the constraints of limited retention periods and anxiety over cost, what new use cases does that unlock for their observability data?
How do AI applications/agents change the requirements around observability data? (collection, scale, complexity, applications, etc.)
What are the most interesting, innovative, or unexpected ways that you have seen Observe/lakehouse technologies used for observability?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Observe?
When are Observe/lakehouse technologies the wrong choice?
What do you have planned for the future of Observe?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Observe Inc.
Lakehouse Architecture
Splunk
Observability
RSyslog
GlusterFS
Dremel
Drill
BigQuery
Snowflake SIGMOD Paper
Prometheus
Datadog
NewRelic
AppDynamics
DynaTrace
Loki
Cortex
Mimir
Tempo
Cardinality
FluentBit
FluentD
OpenTelemetry
OTLP == OpenTelemetry Protocol
Kafka
VPC Flow Logs
Read Amplification
Lance
Iceberg
Hudi
PromQL

Semantic Operators Meet Dataframes: Building Context for Agents with FENIC

2026-01-12 | 56:42

Summary
In this episode Kostas Pardalis talks about Fenic - an open-source, PySpark-inspired dataframe engine designed to bring LLM-powered semantics into reliable data engineering workflows. Kostas shares why today's data infrastructure assumptions (BI-first, expert-operated, CPU-bound) fall short for AI-era tasks that are increasingly inference- and IO-bound. He explores how Fenic introduces semantic operators (e.g., semantic filter, extract, join) as first-class citizens in the logical plan so the optimizer can reason about inference, costs, and constraints. This enables developers to turn unstructured data into explicit schemas, compose transformations lazily, and offload LLM work safely and efficiently. He digs into Fenic's architecture (lazy dataframe API, logical/physical plans, Polars execution, DuckDB/Arrow SQL path), how it exposes tools via MCP for agent integration, and where it fits in context engineering as a companion for memory/state management in agentic systems.

Announcements
Your host is Tobias Macey and today I'm interviewing Kostas Pardalis about Fenic, an opinionated, PySpark-inspired DataFrame framework for building AI and agentic applications

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Fenic is and the story behind it?
What are the core problems that you are trying to address with Fenic?
Dataframes have become a popular interface for doing chained transformations on structured data. What are the benefits of using that paradigm for LLM use-cases?
Can you describe the architecture and implementation of Fenic?
How have the design and scope of the project changed since you first started working on it?
You position Fenic as a means of bringing reliability to LLM-powered transformations. What are some of the anti-patterns that teams should be aware of when getting started with Fenic?
What are some of the most common first steps that teams take when integrating Fenic into their pipelines or applications?
What are some of the ways that teams should be thinking about using Fenic and semantic operations for data pipelines and transformations?
How does Fenic help with context engineering for agentic use cases?
What are some examples of toolchains/workflows that could be replaced with Fenic?
How does Fenic integrate with the broader ecosystem of data and AI frameworks? (e.g. Polars, Arrow, Qdrant, LangChain/Pydantic AI)
What are the most interesting, innovative, or unexpected ways that you have seen Fenic used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Fenic?
When is Fenic the wrong choice?
What do you have planned for the future of Fenic?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Fenic
RudderStack
Podcast Episode
Trino
Starburst
Trino Project Tardigrade
Typedef AI
dbt
PySpark
UDF == User-Defined Function
LOTUS
Pandas
Polars
Relational Algebra
Arrow
DuckDB
Markdown
Pydantic AI
AI Engineering Podcast Episode
LangChain
Ray
Dask

Beyond Dashboards: How Data Teams Earn a Seat at the Table

2026-01-0549:21

Summary In this episode Goutham Budati talks about his Data-Perspective-Action framework and how it empowers data teams to become true business partners. Goutham traces his path from automating Excel reports to leading high‑impact data organizations, then breaks down why technical excellence alone isn’t enough: teams must pair reliable data systems with deliberate storytelling, clear problem framing, and concrete action plans. He digs into tactics for moving from reactive ticket-taking to proactive influence: weekly one‑page narratives, design-first discovery, sampling stakeholders for real pain points, and treating dashboards as living roadmaps. He also explores how to right-size technical scope, preserve trust in core metrics, organize teams as “build” and “storytelling” duos, and translate business macros and micros into resilient system designs. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud. You’re a developer who wants to innovate; instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. 
MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/BuildYour host is Tobias Macey and today I'm interviewing Goutham Budati about his data-perspective-action framework for empowering data teams to be more influential in the businessInterview IntroductionHow did you get involved in the area of data management?Can you describe what the Data-Perspective-Action framework is and the story behind it?What does it look like when someone operates at each of those three levels?How does that change the day-to-day work of an individual contributor?Why does technically excellent data work sometimes fail to drive decisions?How do you identify whether a data system or pipeline is actually creating value versus just existing?What's the moment when you realized that building reliable systems wasn't the same as enabling better decisions?Better decisions still need to be powered by reliable systems. How do you manage the tension of focusing on up-time against focusing on impact?What does it mean to add "Perspective" to data? How is that different from analysis or insights?How do you know when you're overwhelming stakeholders versus giving them what they need?What changes when you start designing systems to surface signal rather than just providing comprehensive data?How do you learn what business context matters for turning data into something actionable?What does it mean to design for Action from day one? 
How does that change what you build?How do you get stakeholders to actually act on data instead of just consuming it?Walk us through how you structure collaboration with business partners when you're trying to drive decisions, not just inform them.What's the relationship between iteration and trust when you're building data products?What does the transition from order-taker to strategic partner actually look like? What has to change?How do you position data work as driving the business rather than supporting it?Why does storytelling matter for data professionals? What role does it play that technical communication doesn't cover?What organizational structures or team setups help data people gain influence?Tell us about a time when you built something technically sound that failed to create impact. What did you learn?What are the common patterns in dysfunctional data organizations? What causes the breakdown?How do you rebuild credibility when you inherit a data function that's lost trust with the business?What's the relationship between technical excellence and stakeholder trust? Can you have one without the other?When is this framework the wrong lens? What situations call for a different approach?How do you balance the demand for technical depth with the need to develop business and communication skills?How should data professionals position themselves as AI and ML tools become more accessible?What shifts do you see coming in how businesses think about data work?How is your thinking about data impact evolving?For someone who recognizes they're focused purely on the technical work and wants to expand their impact—where should they start?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Unfreezing The Data Lake: The Future-Proof File Format

2025-12-2959:24

Summary In this episode PhD researcher Xinyu Zeng talks about F3, the “future-proof file format” designed to address today’s hardware realities and evolving workloads. He digs into the limitations of Parquet and ORC - especially CPU-bound decoding, metadata overhead for wide-table projections, and poor random-access behavior for ML training and serving - and how F3 rethinks layout and encodings to be efficient, interoperable, and extensible. Xinyu explains F3’s two major ideas: a decoupled, flexible layout that separates IO units, dictionary scope, and encoding choices; and self-decoding files that embed WebAssembly kernels so new encodings can be adopted without waiting on every engine to upgrade. He discusses how table formats and file formats should increasingly be decoupled, potential synergies between F3 and table layers (including centralizing and verifying WASM kernels), and future directions such as extending WASM beyond encodings to indexing or filtering. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementYou’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/BuildComposable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. 
Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.Your host is Tobias Macey and today I'm interviewing Xinyu Zeng about the future-proof file formatInterview IntroductionHow did you get involved in the area of data management?Can you describe what the F3 project is and the story behind it?We have several widely adopted file formats (Parquet, ORC, Avro, etc.). Why do we keep creating new ones?Parquet is the format with perhaps the broadest adoption. What are the challenges that such wide use poses when trying to modify or extend the specification?The recent focus on vector data is perhaps the most visible change in storage requirements. What are some of the other custom types of data that might need to be supported in the file storage layer?Can you describe the key design principles of the F3 format?What are the engineering challenges that you faced while developing your implementation of the F3 proof-of-concept?The key challenge of introducing a new format is that of adoption. What are the provisions in F3 that might simplify the adoption of the format in the broader ecosystem? (e.g. 
integration with compute frameworks) What are some examples of features in data lake use cases that could be enabled by F3? What are some of the other ideas/hypotheses that you developed and discarded in the process of your research? What are the most interesting, unexpected, or challenging lessons that you have learned while working on F3? What do you have planned for the future of F3? Contact Info Personal Website Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links F3 Paper, Formats Evaluation Paper, F3 Github, SAL Paper, RisingWave, Tencent Cloud, Parquet, Arrow, Andy Pavlo, Wes McKinney, CMU Public Seminar, VLDB, ORC, Protocol Buffers, Lance, PAX == Partition Attributes Across, WASM == Web Assembly, DataFusion, DuckDB, DuckLake, Velox, Vortex File Format. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

From Context to Semantics: How Metadata Powers Agentic AI

2025-12-2101:06:17

Summary In this episode Suresh Srinivas and Sriharsha Chintalapani explore how metadata platforms are evolving from human-centric catalogs into the foundational context layer for AI and agentic systems. They discuss the origins and growth of OpenMetadata and Collate, why “context” is necessary but “semantics” is critical for precise AI outcomes, and how a schema-first, API-first, unified platform enables discovery, observability, and governance in one workflow. They share how AI agents can now automate documentation, classification, data quality testing, and enforcement of policies, and why aligning governance with user identity and intent is essential as agentic access scales. They also dig into scalability strategies, MCP-based agent workflows, AI governance (including model/agent tracking), and the emerging convergence of big data with ontologies to deliver machine-understandable meaning. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. 
If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? 
Start building at MongoDB.com/BuildYour host is Tobias Macey and today I'm interviewing Suresh Srinivas and Sriharsha Chintalapani about how metadata catalogs provide the context clues necessary to give meaning to your data for AI systemsInterviewIntroductionHow did you get involved in the area of data management?Can you start by giving an overview of the roles that metadata catalogs are playing in the current state of the ecosystem?How has the OpenMetadata platform evolved over the past 4 years?How has the focus on LLMs/generative AI changed the trajectory of services like OpenMetadata?The initial set of use cases for data catalogs was to facilitate discovery and documentation of data assets for human consumption. What are the structural elements of that effort that have paid dividends for an AI audience?How does the AI audience change the requirements around the cataloging and presentation of metadata?One of the constant challenges in data infrastructure now is the tension of making data accessible to AI systems (agentic or otherwise) and incorporating AI into the inner loop of the service. What are the opportunities for bringing AI inside the boundaries of a system like OpenMetadata vs. as a client or consumer of the platform?The key phrase of the past ~2 years is "context engineering". 
What role does the metadata catalog play in that undertaking?What are the capabilities that the catalog needs to be able to effectively populate and curate that context?How much awareness does the LLM or agent need to have to be able to use the catalog effectively?What does a typical workflow/agent loop look like when it is using something like OpenMetadata in pursuit of knowledge that it needs to achieve an objective?How do agentic use cases strain the existing set of governance frameworks?What new considerations (procedural or technical) need to be factored into governance practices to balance velocity with security?What are the most interesting, innovative, or unexpected ways that you have seen OpenMetadata/Collate used in AI/agentic contexts?What are the most interesting, unexpected, or challenging lessons that you have learned while working on OpenMetadata/Collate?When is OpenMetadata/Collate the wrong choice?What do you have planned for the future of OpenMetadata?Contact InfoSureshLinkedInSriharshaLinkedInParting QuestionFrom your perspective, what is the biggest gap in the tooling or technology for data management today?LinksOpenMetadata Podcast EpisodeHadoopHortonworksContext EngineeringMCP == Model Context ProtocolJSON SchemadbtLangSmithOpenMetadata MCP ServerAPI GatewayThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

From Data Engineering to AI Engineering: Where the Lines Blur

2025-12-1426:59

Summary In this solo episode of the Data Engineering Podcast, host Tobias Macey reflects on how AI has transformed the practice and pace of data engineering over time. Starting from its origins in the Hadoop and cloud warehouse era, he explores the discipline's evolution through ML engineering and MLOps to today's blended boundaries between data, ML, and AI engineering. The conversation covers how unstructured data is becoming more prominent, vectors and knowledge graphs are emerging as key components, and reliability expectations are changing due to interactive user-facing AI. The host also delves into process changes, including tighter collaboration, faster dataset onboarding, new governance and access controls, and the importance of treating experimentation and evaluation as fundamental testing practices. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.Composable data infrastructure is great, until you spend all of your time gluing it together. 
Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? 
Start building at MongoDB.com/Build. Your host is Tobias Macey and today I'm reflecting on the increasingly blurry boundaries between data engineering and AI engineering. Interview Introduction I started this podcast in 2017, right when the term "Data Engineer" was becoming widely used for a specific job title with a reasonably well-understood set of responsibilities. This was in response to the massive hype around "data science" and consequent hiring sprees that characterized the mid-2000s to mid-2010s. The introduction of generative AI and AI Engineering to the technical ecosystem is changing the scope of responsibilities for data engineers and other data practitioners. Of note is the fact that: AI models can be used to process unstructured data sources into structured data assets; AI applications require new types of data assets; the SLAs for data assets related to AI serving are different from BI/warehouse use cases; the technology stacks for AI applications aren't necessarily the same as for analytical data pipelines; because everything is so new there is not a lot of prior art, and the prior art that does exist isn't necessarily easy to find because of differences in terminology; experimentation has moved from being just an MLOps capability into being a core need for organizations. Contact Info Email Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links AI Engineering Podcast. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Malloy: Hierarchical Data, Semantic Models, and the Future of Analytics

2025-12-0858:48

Summary In this episode Michael Toy, co-creator of Malloy, talks about rethinking how we work with data beyond SQL. Michael shares the origins of Malloy from his and Lloyd Tabb’s experience at Looker, why SQL’s mental model often fights human problem solving, and how Malloy aims to be a composable, maintainable language that treats SQL as the assembly layer rather than something humans should write. He explores Malloy’s core ideas — semantic modeling tightly coupled with a query language, hierarchical data as the default mental model, and preserving context so analysis stays interactive and open-ended. He also digs into the developer experience and ecosystem: Malloy’s TypeScript implementation, VS Code integration, CLI, emerging notebook support, and how Malloy can sit alongside or replace parts of existing transformation workflows. Michael discusses practical trade-offs in language design, the surprising fit for LLM-generated queries, and near-term roadmap areas like dimensional filtering, better aggregation strategies across levels, and closing gaps that still require escaping to SQL. He closes with an invitation to contribute to the open-source project and help shape its evolution. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. 
ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. 
That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build. Your host is Tobias Macey and today I'm interviewing Michael Toy about Malloy, a modern language for building composable and maintainable analytics and data models on relational engines. Interview Introduction How did you get involved in the area of data management? Can you describe what Malloy is and the story behind it? What is the core problem that you are trying to solve with Malloy? There are countless projects that aim to reimagine/reinvent/replace SQL. What are the factors that make Malloy stand out in your mind? Who are the target personas for the Malloy language? One of the key success factors for any language is the ecosystem around it and the integrations available to it. How does Malloy fit in the toolchains and workflows for data engineers and analysts? Can you describe the key design and syntax elements of Malloy? How have the scope and focus of the language evolved since you first started working on it? How do the structure and semantics of Malloy change the ways that teams think about their data models? SQL-focused tools have gained prominence as the means of building the transformation stage of data pipelines. 
How would you characterize the capabilities of Malloy as a tool for building translation pipelines? What are the most interesting, innovative, or unexpected ways that you have seen Malloy used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Malloy? When is Malloy the wrong choice? What do you have planned for the future of Malloy? Contact Info Website Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Malloy, Lloyd Tabb, SQL, Looker, LookML, dbt, Relational Algebra, Typescript, Ruby, Truffle, Malloy VSCode Plugin, Malloy CLI, Malloy Pick Statement. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Blurring Lines: Data, AI, and the New Playbook for Team Velocity

2025-11-2401:00:57

Summary In this crossover episode, Max Beauchemin explores how multiplayer, multi‑agent engineering is transforming the way individuals and teams build data and AI systems. He digs into the shifting boundary between data and AI engineering, the rise of “context as code,” and how just‑in‑time retrieval via MCP and CLIs lets agents gather what they need without bloating context windows. Max shares hard‑won practices from going “AI‑first” for most tasks, where humans focus on orchestration and taste, and the new bottlenecks that appear (code review, QA, async coordination) when execution accelerates 2–10x. He also dives deep into Agor, his open‑source agent orchestration platform: a spatial, multiplayer workspace that manages Git worktrees and live dev environments, templatizes prompts by workflow zones, supports session forking and sub‑sessions, and exposes an internal MCP so agents can schedule, monitor, and even coordinate other agents. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. 
If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. 
And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.Your host is Tobias Macey and today I'm interviewing Maxime Beauchemin about the impact of multi-player multi-agent engineering on individual and team velocity for building better data systemsInterviewIntroductionHow did you get involved in the area of data management?Can you start by giving an overview of the types of work that you are relying on AI development agents for?As you bring agents into the mix for software engineering, what are the bottlenecks that start to show up?In my own experience there are a finite number of agents that I can manage in parallel. How does Agor help to increase that limit?How does making multi-agent management a multi-player experience change the dynamics of how you apply agentic engineering workflows?Contact InfoLinkedInLinksAgorApache AirflowApache SupersetPresetClaude CodeCodexPlaywright MCPTmuxGit WorktreesOpencode.aiGitHub CodespacesOnaThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

State, Scale, and Signals: Rethinking Orchestration with Durable Execution

2025-11-16 51:46

Summary
In this episode Preeti Somal, EVP of Engineering at Temporal, talks about the durable execution model and how it reshapes the way teams build reliable, stateful systems for data and AI. She explores Temporal's code-first programming model (workflows, activities, task queues, and replay) and how it eliminates hand-rolled retry, checkpoint, and error-handling scaffolding while letting data remain where it lives. Preeti shares real-world patterns for replacing DAG-first orchestration, integrating application and data teams through signals and Nexus for cross-boundary calls, and using Temporal to coordinate long-running, human-in-the-loop, and agentic AI workflows with full observability and auditability. She also discusses heuristics for choosing Temporal alongside (or instead of) traditional orchestrators, managing scale without moving large datasets, and lessons from running durable execution as a cloud service.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed: flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming: Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations.
If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started.
And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
Your host is Tobias Macey and today I'm interviewing Preeti Somal about how to incorporate durable execution and state management into AI application architectures.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what durable execution is and how it impacts system architecture?
With the strong focus on state maintenance and high reliability, what are some of the most impactful ways that data teams are incorporating tools like Temporal into their work?
One of the core primitives in Temporal is a "workflow". How does that compare to similar primitives in common data orchestration systems such as Airflow, Dagster, Prefect, etc.?
What are the heuristics that you recommend when deciding which tool to use for a given task, particularly in data/pipeline oriented projects?
Even if a team is using a more data-focused orchestration engine, what are some of the ways that Temporal can be applied to handle the processing logic of the actual data?
AI applications are also very dependent on reliable data to be effective in production contexts. What are some of the design patterns where durable execution can be integrated into RAG/agent applications?
What are some of the conceptual hurdles that teams experience when they are starting to adopt Temporal or other durable execution frameworks?
What are the most interesting, innovative, or unexpected ways that you have seen Temporal/durable execution used for data/AI services?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Temporal?
When is Temporal/durable execution the wrong choice?
What do you have planned for the future of Temporal for data and AI systems?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links
Temporal
Durable Execution
Flink
Machine Learning Epoch
Spark Streaming
Airflow
Directed Acyclic Graph (DAG)
Temporal Nexus
TensorZero
AI Engineering Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
