How AI Is Built

Author: Nicolay Gerold


Description

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.
26 Episodes
Text embeddings have limitations when it comes to handling long documents and out-of-domain data.

Today, we are talking to Nils Reimers. He is one of the researchers who kickstarted the field of dense embeddings, developed Sentence Transformers, started Hugging Face's neural search team, and now leads the development of search foundation models at Cohere. Tbh, he has too many accolades to count off here.

We talk about the main limitations of embeddings:
Failing out of domain
Struggling with long documents
Very hard to debug
Hard to formalize what actually is similar

Are you still not sure whether to listen? Here are some teasers:
Interpreting embeddings can be challenging, and current models are not easily explainable.
Fine-tuning is necessary to adapt embeddings to specific domains, but it requires careful consideration of the data and objectives.
Re-ranking is an effective approach to handle long documents and incorporate additional factors like recency and trustworthiness (a minimal re-ranking sketch follows these notes).
The future of embeddings lies in addressing scalability issues and exploring new research directions.

Nils Reimers: LinkedIn | X (Twitter) | Website | Cohere
Nicolay Gerold: LinkedIn | X (Twitter)

text embeddings, limitations, long documents, interpretation, fine-tuning, re-ranking, future research

Chapters
00:00 Introduction and Guest Introduction
00:43 Early Work with BERT and Argument Mining
02:24 Evolution and Innovations in Embeddings
03:39 Contrastive Learning and Hard Negatives
05:17 Training and Fine-Tuning Embedding Models
12:48 Challenges and Limitations of Embeddings
18:16 Adapting Embeddings to New Domains
22:41 Handling Long Documents and Re-Ranking
31:08 Combining Embeddings with Traditional ML
45:16 Conclusion and Upcoming Episodes
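For listeners who want to try the re-ranking idea from this episode, here is a minimal sketch of the usual two-stage pattern with the sentence-transformers library: a bi-encoder retrieves candidates, a cross-encoder re-scores them. The model names are common public checkpoints, not necessarily the ones discussed in the episode.

```python
# Two-stage retrieval: dense retrieval for recall, cross-encoder re-ranking for precision.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")                   # bi-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")       # cross-encoder

docs = ["Doc about embedding limitations...", "Doc about re-ranking...", "Unrelated doc..."]
query = "Why do embeddings fail on long documents?"

# Stage 1: embed the corpus and the query, retrieve a broad candidate set.
doc_emb = retriever.encode(docs, convert_to_tensor=True)
query_emb = retriever.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_emb, top_k=50)[0]

# Stage 2: re-score (query, document) pairs with the cross-encoder and sort.
pairs = [(query, docs[h["corpus_id"]]) for h in hits]
scores = reranker.predict(pairs)
reranked = sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True)
print(reranked[0][0][1])  # best document after re-ranking
```

The cross-encoder is too slow to score the whole corpus, which is why it only sees the candidates from stage 1; extra signals like recency can be blended into the final score afterwards.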
Hey! Welcome back.

Today we look at how we can get our RAG system ready for scale. We discuss common problems and their solutions when you introduce more users and more requests to your system.

For this we are joined by Nirant Kasliwal, the author of fastembed. Nirant shares practical insights on metadata extraction, evaluation strategies, and emerging technologies like ColPali. This episode is a must-listen for anyone looking to level up their RAG implementations.

"Naive RAG has a lot of problems on the retrieval end and then there's a lot of problems on how LLMs look at these data points as well."
"The first 30 to 50% of gains are relatively quick. The rest 50% takes forever."
"You do not want to give the same answer about the company's history to the co-founding CEO and the intern who has just joined."
"Embedding similarity as the signal on which you want to build your entire search is just not quite complete."

Key insights:
Naive RAG often fails due to limitations of embeddings and LLMs' sensitivity to input ordering.
Query profiling and expansion: use clustering and tools like Latent Scope to identify problematic query types; expand queries offline and use parallel searches for better results (see the sketch after these notes).
Metadata extraction: extract temporal, entity, and other relevant information from queries; use LLMs for extraction, with checks against libraries like Stanford NLP.
User personalization: include user role, access privileges, and conversation history; adapt responses based on user expertise and readability scores.
Evaluation and improvement: create synthetic datasets and use real user feedback; employ tools like DSPy for prompt engineering.
Advanced techniques: query routing based on type and urgency; use smaller models (1-3B parameters) for easier iteration and error spotting; implement error handling and cross-validation for extracted metadata.

Nirant Kasliwal: X (Twitter) | LinkedIn | Search in the LLM Era for AI Engineers (course)
Nicolay Gerold: LinkedIn | X (Twitter)

query understanding, AI-powered search, LambdaMART, e-commerce ranking, networking, experts, recommendation, search
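A rough sketch of the "expand queries offline, search in parallel" idea mentioned above. `expand_query` and `search_index` are hypothetical stand-ins for your own LLM call and search backend; the point is the fan-out/gather pattern, not any specific API.

```python
# Fan out one user query into several variants and search them concurrently.
import asyncio

def expand_query(query: str) -> list[str]:
    # Placeholder: in practice an LLM generates paraphrases / sub-queries offline.
    return [query, f"{query} definition", f"{query} examples"]

async def search_index(query: str) -> list[dict]:
    # Placeholder: call your vector store or keyword index here.
    await asyncio.sleep(0)
    return [{"query": query, "doc": f"result for {query!r}"}]

async def expanded_search(query: str) -> list[dict]:
    variants = expand_query(query)
    batches = await asyncio.gather(*(search_index(v) for v in variants))
    # Flatten; de-duplicate and re-rank downstream.
    return [hit for batch in batches for hit in batch]

print(asyncio.run(expanded_search("prompt compression")))
```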
In this episode of How AI is Built, Nicolay Gerold interviews Doug Turnbull, a search engineer at Reddit and author of "Relevant Search". They discuss how methods and technologies, including large language models (LLMs) and semantic search, contribute to relevant search results.

Key Highlights:
Defining relevance is challenging and depends heavily on user intent and context.
Combining multiple search techniques (keyword, semantic, etc.) in tiers can improve results (a fusion sketch follows these notes).
LLMs are emerging as a powerful tool for augmenting traditional search approaches.
Operational concerns often drive architectural decisions in large-scale search systems.
Underappreciated techniques like LambdaMART may see a resurgence.

Key Quotes:
"There's not like a perfect measure or definition of what a relevant search result is for a given application. There are a lot of really good proxies, and a lot of really good like things, but you can't just like blindly follow the one objective, if you want to build a good search product." - Doug Turnbull
"I think 10 years ago, what people would do is they would just put everything in Solr, Elasticsearch or whatever, and they would make the query to Elasticsearch pretty complicated to rank what they wanted... What I see people doing more and more these days is that they'll use each retrieval source as like an independent piece of infrastructure." - Doug Turnbull on the evolution of search architecture
"Honestly, I feel like that's a very practical and underappreciated thing. People talk about RAG and I talk, I call this GAR - generative AI augmented retrieval, so you're making search smarter with generative AI." - Doug Turnbull on using LLMs to enhance search
"LambdaMART and gradient boosted decision trees are really powerful, especially for when you're expressing your re-ranking as some kind of structured learning problem... I feel like we'll see that and like you're seeing papers now where people are like finding new ways of making BM25 better." - Doug Turnbull on underappreciated techniques

Doug Turnbull: LinkedIn | X (Twitter) | Web
Nicolay Gerold: LinkedIn | X (Twitter)

Chapters
00:00 Introduction and Guest Introduction
00:52 Understanding Relevant Search Results
01:18 Search Behavior on Social Media
02:14 Challenges in Defining Relevance
05:12 Query Understanding and Ranking Signals
10:57 Evolution of Search Technologies
15:15 Combining Search Techniques
21:49 Leveraging LLMs and Embeddings
25:49 Operational Considerations in Search Systems
39:09 Concluding Thoughts and Future Directions
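One concrete way to combine retrieval sources as independent tiers, as discussed above, is reciprocal rank fusion. A minimal, library-free sketch; the document ids and the constant k=60 are illustrative defaults:

```python
# Reciprocal rank fusion: merge ranked lists from independent retrieval sources.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each ranking is an ordered list of doc ids from one retrieval source."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)   # high ranks contribute more
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_3", "doc_1", "doc_7"]     # keyword tier
vector_hits = ["doc_1", "doc_9", "doc_3"]   # semantic tier
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # doc_1 and doc_3 float to the top
```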
In this episode, we talk data-driven search optimizations with Charlie Hull.

Charlie is a search expert from Open Source Connections. He has built Flax, one of the leading open source search companies in the UK, has written "Searching the Enterprise", and is one of the main voices on data-driven search. We discuss strategies to improve search systems quantitatively and much more.

Key Points:
Relevance in search is subjective and context-dependent, making it challenging to measure consistently.
Common mistakes in assessing search systems include overemphasizing processing speed and relying solely on user complaints.
Three main methods to measure search system performance: human evaluation, user interaction data analysis, and AI-assisted judgment (with caution). A minimal scoring sketch follows these notes.
Importance of balancing business objectives with user needs when optimizing search results.
Technical components for assessing search systems: query logs analysis, source data quality examination, test queries and cases setup.

Resources mentioned:
Quepid: open-source tool for search quality testing
Haystack conference: upcoming event in Berlin (September 30 - October 1)
Relevance Slack community
OpenSource Connections

Charlie Hull: LinkedIn | X (Twitter)
Nicolay Gerold: LinkedIn | X (Twitter)

search results, search systems, assessing, evaluation, improvement, data quality, user behavior, proactive, test dataset, search engine optimization, SEO, search quality, metadata, query classification, user intent, metrics, business objectives, user objectives, experimentation, continuous improvement, data modeling, embeddings, machine learning, information retrieval

Chapters
00:00 Introduction
01:35 Challenges in Measuring Search Relevance
02:19 Common Mistakes in Search System Assessment
03:22 Methods to Measure Search System Performance
04:28 Human Evaluation in Search Systems
05:18 Leveraging User Interaction Data
06:04 Implementing AI for Search Evaluation
09:14 Technical Components for Assessing Search Systems
12:07 Improving Search Quality Through Data Analysis
17:16 Proactive Search System Monitoring
24:26 Balancing Business and User Objectives in Search
25:08 Search Metrics and KPIs: A Contract Between Teams
26:56 The Role of Recency and Popularity in Search Algorithms
28:56 Experimentation: The Key to Optimizing Search
30:57 Offline Search Labs and A/B Testing
34:05 Simple Levers to Improve Search
37:38 Data Modeling and Its Importance in Search
43:29 Combining Keyword and Vector Search
44:24 Bridging the Gap Between Machine Learning and Information Retrieval
47:13 Closing Remarks and Contact Information
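As a starting point for the quantitative evaluation Charlie advocates, here is a minimal sketch that scores test queries against human relevance judgments. The metric (precision@k) and the sample data are illustrative; tools like Quepid manage this workflow at scale.

```python
# Score a set of test queries against human relevance judgments.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k if top else 0.0

judgments = {"red shoes": {"sku_12", "sku_98"}}           # curated by human raters
results = {"red shoes": ["sku_12", "sku_44", "sku_98"]}   # what the engine returned

for query, relevant in judgments.items():
    print(query, precision_at_k(results[query], relevant, k=3))
```

Tracking a metric like this per query before and after a change turns "the search feels worse" into a measurable regression.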
Welcome back to How AI Is Built. We have got a very special episode to kick off season two. Daniel Tunkelang is a search consultant currently working with Algolia. He is a leader in the field of information retrieval, recommender systems, and AI-powered search. He worked for Canva, Algolia, Cisco, Gartner, and Handshake, to pick a few. His core focus is query understanding.

**Query understanding is about focusing less on the results and more on the query.** The query of the user is the first-class citizen. It is about figuring out what the user wants and then finding, scoring, and ranking results based on it. So most of the work happens before you hit the database.

**Key Takeaways:**

- The "bag of documents" model for queries and "bag of queries" model for documents are useful approaches for representing queries and documents in search systems.
- Query specificity is an important factor in query understanding. It can be measured using cosine similarity between query vectors and document vectors (a small sketch follows these notes).
- Query classification into broad categories (e.g., product taxonomy) is a high-leverage technique for improving search relevance and can act as a guardrail for query expansion and relaxation.
- Large Language Models (LLMs) can be useful for search, but simpler techniques like query similarity using embeddings can often solve many problems without the complexity and cost of full LLM implementations.
- Offline processing to enhance document representations (e.g., filling in missing metadata, inferring categories) can significantly improve search quality.

**Daniel Tunkelang**

- [LinkedIn](https://www.linkedin.com/in/dtunkelang/)
- [Medium](https://queryunderstanding.com/)

**Nicolay Gerold:**

- [⁠LinkedIn⁠](https://www.linkedin.com/in/nicolay-gerold/)
- [⁠X (Twitter)](https://twitter.com/nicolaygerold)
- [Substack](https://nicolaygerold.substack.com/)

Query understanding, search relevance, bag of documents, bag of queries, query specificity, query classification, named entity recognition, pre-retrieval processing, caching, large language models (LLMs), embeddings, offline processing, metadata enhancement, FastText, MiniLM, sentence transformers, visualization, precision, recall

[00:00:00] 1. Introduction to Query Understanding - Definition and importance in search systems; evolution of query understanding techniques
[00:05:30] 2. Query Representation Models - The "bag of documents" model for queries; the "bag of queries" model for documents; advantages of holistic query representation
[00:12:00] 3. Query Specificity and Classification - Measuring query specificity using cosine similarity; importance of query classification in search relevance; implementing and leveraging query classifiers
[00:19:30] 4. Named Entity Recognition in Query Understanding - Role of NER in query processing; challenges with unique or tail entities
[00:24:00] 5. Pre-Retrieval Query Processing - Importance of early-stage query analysis; balancing computational resources and impact
[00:28:30] 6. Performance Optimization Techniques - Caching strategies for query understanding; offline processing for document enhancement
[00:33:00] 7. Advanced Techniques: Embeddings and Language Models - Using embeddings for query similarity; role of Large Language Models (LLMs) in search; when to use simpler techniques vs. complex models
[00:39:00] 8. Practical Implementation Strategies - Starting points for engineers new to query understanding; tools and libraries for query understanding (FastText, MiniLM, etc.); balancing precision and recall in search systems
[00:44:00] 9. Visualization and Analysis of Query Spaces - Discussion on t-SNE, UMAP, and other visualization techniques; limitations and alternatives to embedding visualizations
[00:47:00] 10. Future Directions and Closing Thoughts - Emerging trends in query understanding; key takeaways for search system engineers
[00:53:00] End of Episode
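A hedged sketch of the query-specificity signal described above: embed the query and its top results and use the average cosine similarity as a specificity proxy. The sentence-transformers model name is an assumption, not necessarily the one discussed in the episode.

```python
# Query specificity: how tightly do the top results cluster around the query?
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def query_specificity(query: str, top_docs: list[str]) -> float:
    q = model.encode([query])[0]
    d = model.encode(top_docs)
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    return float(sims.mean())  # higher = more specific, results agree with the query

# A narrow product query should score higher than a broad, ambiguous one.
print(query_specificity("iphone 13 pro max case",
                        ["Case for iPhone 13 Pro Max", "iPhone 13 accessories"]))
print(query_specificity("gifts",
                        ["Kitchen gadgets", "Board games", "Scented candles"]))
```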
Today we are launching season 2 of How AI Is Built.

Over the last few weeks, we spoke to a lot of regular listeners and past guests, collected feedback, and analyzed our episode data. We will be applying those learnings to season 2.

This season will be all about search. We are trying to make it better, more actionable, and more in-depth. The goal is that at the end of this season, you have a full-fledged course on search in podcast form, with mini-courses on specific elements like RAG.

We will be talking to experts from information retrieval, information architecture, recommendation systems, and RAG; from academia and industry. Fields that do not really talk to each other. We will try to unify and transfer the knowledge and give you a full tour of search, so you can build your next search application or feature with confidence.

We will be talking to Charlie Hull on how to systematically improve search systems, to Nils Reimers on the fundamental flaws of embeddings and how to fix them, to Daniel Tunkelang on how to actually understand the queries of the user, and many more.

We will try to bridge the gaps: how to use decades of research and practice in iteratively improving traditional search and apply it to RAG; how to take new methods from recommendation systems and vector databases and bring them into traditional search systems; how to use all of the different methods as search signals and combine them to deliver the results your user actually wants.

We will be using two types of episodes:
Traditional deep dives, like we have done so far. Each one will dive into one specific topic within search, interviewing an expert on that topic.
Supplementary episodes, which answer one additional question; often complementary or precursory knowledge for the deep dive that we did not get to.

We will be starting with episodes next week, looking at the first, last, and overarching action in search: understanding user intent and understanding queries, with Daniel Tunkelang.

I am really excited to kick this off. I would love to hear from you:
What would you love to learn in this season?
What guest should I have on?
What topics should I make a deep dive on (try to be specific)?

Let me know in the comments or just slide into my DMs on Twitter or LinkedIn. I want to be more interactive, so anytime anything is unclear or a question pops up in one of the episodes, give me a shout and I will try to answer it for you and for everyone.

Enough of me rambling. Let's kick this off. I will see you next Thursday, when we start with query understanding.

Shoot me a message and stay up to date: LinkedIn | X (Twitter)
In this episode of "How AI is Built," host Nicolay Gerold interviews Jonathan Yarkoni, founder of Reach Latent. Jonathan shares his expertise in extracting value from unstructured data using AI, discussing challenging projects, the impact of ChatGPT, and the future of generative AI. From weather prediction to legal tech, Jonathan provides valuable insights into the practical applications of AI across various industries.Key TakeawaysGenerative AI projects often require less data cleaning due to the models' tolerance for "dirty" data, allowing for faster implementation in some cases.The success of AI projects post-delivery is ensured through monitoring, but automatic retraining of generative AI applications is not yet common due to evaluation challenges.Industries ripe for AI disruption include text-heavy fields like legal, education, software engineering, and marketing, as well as biotech and entertainment.The adoption of AI is expected to occur in waves, with 2024 likely focusing on internal use cases and 2025 potentially seeing more customer-facing applications as models improve.Synthetic data generation, using models like GPT-4, can be a valuable approach for training AI systems when real data is scarce or sensitive.Evaluation frameworks like RAGAS and custom metrics are essential for assessing the quality of synthetic data and AI model outputs.Jonathan’s ideal tech stack for generative AI projects includes tools like Instructor, Guardrails, Semantic Routing, DSPY, LangChain, and LlamaIndex, with a growing emphasis on evaluation stacks.Key Quotes"I think we're going to see another wave in 2024 and another one in 2025. And people are familiarized. That's kind of the wave of 2023. 2024 is probably still going to be a lot of internal use cases because it's a low risk environment and there was a lot of opportunity to be had.""To really get to production reliably, we have to have these tools evolve further and get more standardized so people can still use the old ways of doing production with the new technology."Jonathan YarkoniLinkedInYouTubeX (Twitter)Reach LatentNicolay Gerold:⁠LinkedIn⁠⁠X (Twitter)Chapters00:00 Introduction: Extracting Value from Unstructured Data 03:16 Flexible Tailoring Solutions to Client Needs 05:39 Monitoring and Retraining Models in the Evolving AI Landscape 09:15 Generative AI: Disrupting Industries and Unlocking New Possibilities 17:47 Balancing Immediate Results and Cutting-Edge Solutions in AI Development 28:29 Dream Tech Stack for Generative AIunstructured data, textual data, automation, weather prediction, data cleaning, chat GPT, AI disruption, legal, education, software engineering, marketing, biotech, immediate results, cutting-edge solutions, tech stack
This episode of "How AI Is Built" is all about data processing for AI. Abhishek Choudhary and Nicolay discuss Spark and alternatives to process data so it is AI-ready.Spark is a distributed system that allows for fast data processing by utilizing memory. It uses a dataframe representation "RDD" to simplify data processing.When should you use Spark to process your data for your AI Systems?→ Use Spark when:Your data exceeds terabytes in volumeYou expect unpredictable data growthYour pipeline involves multiple complex operationsYou already have a Spark cluster (e.g., Databricks)Your team has strong Spark expertiseYou need distributed computing for performanceBudget allows for Spark infrastructure costs→ Consider alternatives when:Dealing with datasets under 1TBIn early stages of AI developmentBudget constraints limit infrastructure spendingSimpler tools like Pandas or DuckDB sufficeSpark isn't always necessary. Evaluate your specific needs and resources before committing to a Spark-based solution for AI data processing.In today’s episode of How AI Is Built, Abhishek and I discuss data processing:When to use Spark vs. alternatives for data processingKey components of Spark: RDDs, DataFrames, and SQLIntegrating AI into data pipelinesChallenges with LLM latency and consistencyData storage strategies for AI workloadsOrchestration tools for data pipelinesTips for making LLMs more reliable in productionAbhishek Choudhary:LinkedInGitHubX (Twitter)Nicolay Gerold:⁠LinkedIn⁠⁠X (Twitter)
In this episode, Nicolay talks with Rahul Parundekar, founder of AI Hero, about the current state and future of AI agents. Drawing from over a decade of experience working on agent technology at companies like Toyota, Rahul emphasizes the importance of focusing on realistic, bounded use cases rather than chasing full autonomy. They dive into the key challenges, like effectively capturing expert workflows and decision processes, delivering seamless user experiences that integrate into existing routines, and managing costs through techniques like guardrails and optimized model choices. The conversation also explores potential new paradigms for agent interactions beyond just chat. Key Takeaways: Agents need to focus on realistic use cases rather than trying to be fully autonomous. Enterprises are unlikely to allow agents full autonomy anytime soon. Capturing the logic and workflows in the user's head is the key challenge. Shadowing experts and having them demonstrate workflows is more effective than asking them to document processes. User experience is crucial - agents must integrate seamlessly into existing user workflows without major disruptions. Interfaces beyond just chat may be needed. Cost control is important - techniques like guardrails, context windowing, model choice optimization, and dev vs production modes can help manage costs. New paradigms beyond just chat could be powerful - e.g. workflow specification, state/declarative definition of desired end-state. Prompt engineering and dynamic prompt improvement based on feedback remain an open challenge. Key Quotes: "Empowering users to create their own workflows is essential for effective agent usage." "Capturing workflows accurately is a significant challenge in agent development." "Preferences, right? So a lot of the work becomes like, hey, can you do preference learning for this user so that the next time the user doesn't have to enter the same information again, things like that." Rahul Parundekar: AI Hero AI Hero Docs Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) 00:00 Exploring the Potential of Autonomous Agents 02:23 Challenges of Accuracy and Repeatability in Agents 08:31 Capturing User Workflows and Improving Prompts 13:37 Tech Stack for Implementing Agents in the Enterprise agent development, determinism, user experience, agent paradigms, private use, human-agent interaction, user workflows, agent deployment, human-in-the-loop, LLMs, declarative ways, scalability, AI Hero
In this conversation, Nicolay and Richmond Alake discuss various topics related to building AI agents and using MongoDB in the AI space. They cover the use of agents and multi-agents, the challenges of controlling agent behavior, and the importance of prompt compression. When you are building agents, build them iteratively: start with simple LLM calls before moving to multi-agent systems.

Main Takeaways:
Prompt Compression: Using techniques like prompt compression can significantly reduce the cost of running LLM-based applications by reducing the number of tokens sent to the model. This becomes crucial when scaling to production.
Memory Management: Effective memory management is key for building reliable agents. Consider different memory components like long-term memory (knowledge base), short-term memory (conversation history), semantic cache, and operational data (system logs). Store each in separate collections for easy access and reference (a small sketch follows these notes).
Performance Optimization: Optimize performance across multiple dimensions - output quality (by tuning context and knowledge base), latency (using semantic caching), and scalability (using auto-scaling databases like MongoDB).
Prompting Techniques: Leverage prompting techniques like ReAct (observe, plan, act) and structured prompts (JSON, pseudo-code) to improve agent predictability and output quality.
Experimentation: Continuous experimentation is crucial in this rapidly evolving field. Try different frameworks (LangChain, CrewAI, Haystack), models (Claude, Anthropic, open-source), and techniques to find the best fit for your use case.

Richmond Alake: LinkedIn | Medium | Find Richmond on MongoDB | X (Twitter) | YouTube | GenAI Showcase MongoDB | MongoDB AI Stack
Nicolay Gerold: LinkedIn | X (Twitter)

Chapters
00:00 Reducing the Scope of AI Agents
01:55 Seamless Data Ingestion
03:20 Challenges and Considerations in Implementing Multi-Agents
06:05 Memory Modeling for Robust Agents with MongoDB
15:05 Performance Optimization in AI Agents
18:19 RAG Setup

AI agents, multi-agents, prompt compression, MongoDB, data storage, data ingestion, performance optimization, tooling, generative AI
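A minimal sketch of the memory layout Richmond describes, assuming a local MongoDB instance and pymongo: separate collections for long-term knowledge, conversation history, semantic cache, and operational logs. Field names and values are illustrative, not a prescribed schema.

```python
# One database per agent, one collection per memory type.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumption: local MongoDB instance
db = client["agent_memory"]

# Long-term memory: the knowledge base the agent retrieves from.
db["knowledge_base"].insert_one({"text": "Refund policy is 30 days.", "embedding": [0.1, 0.2]})

# Short-term memory: the running conversation.
db["conversation_history"].insert_one({
    "session_id": "abc123",
    "role": "user",
    "content": "What is the refund policy?",
    "ts": datetime.now(timezone.utc),
})

# Semantic cache: previously answered queries, keyed by embedding.
db["semantic_cache"].insert_one({"query_embedding": [0.1, 0.2], "cached_answer": "30 days."})

# Operational data: system logs for debugging and cost tracking.
db["operational_logs"].insert_one({"event": "tool_call", "latency_ms": 420})
```

Keeping the four concerns in separate collections makes it easy to index, expire, and query each one on its own terms (for example, a TTL index on the semantic cache).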
In this episode, Kirk Marple, CEO and founder of Graphlit, shares his expertise on building efficient data integrations. Kirk breaks down his approach using relatable concepts: The "Two-Sided Funnel": This model streamlines data flow by converting various data sources into a standard format before distributing it. Universal Data Streams: Kirk explains how he transforms diverse data into a single, manageable stream of information. Parallel Processing: Learn about the "competing consumer model" that allows for faster data handling. Building Blocks for Success: Discover the importance of well-defined interfaces and actor models in creating robust data systems. Tech Talk: Kirk discusses data normalization techniques and the potential shift towards a more streamlined "Kappa architecture." Reusable Patterns: Find out how Kirk's methods can speed up the integration of new data sources. Kirk Marple: LinkedIn X (Twitter) Graphlit Graphlit Docs Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) Chapters 00:00 Building Integrations into Different Tools 00:44 The Two-Sided Funnel Model for Data Flow 04:07 Using Well-Defined Interfaces for Faster Integration 04:36 Managing Feeds and State with Actor Models 06:05 The Importance of Data Normalization 10:54 Tech Stack for Data Flow 11:52 Progression towards a Kappa Architecture 13:45 Reusability of Patterns for Faster Integration data integration, data sources, data flow, two-sided funnel model, canonical format, stream of ingestible objects, competing consumer model, well-defined interfaces, actor model, data normalization, tech stack, Kappa architecture, reusability of patterns
In our latest episode, we sit down with Derek Tu, Founder and CEO of Carbon, a cutting-edge ETL tool designed specifically for large language models (LLMs). Carbon is streamlining AI development by providing a platform for integrating unstructured data from various sources, enabling businesses to build innovative AI applications more efficiently while addressing data privacy and ethical concerns. "I think people are trying to optimize around the chunking strategy... But for me, that seems a bit maybe not focusing on the right area of optimization. These embedding models themselves have gone just like, so much more advanced over the past five to 10 years that regardless of what representation you're passing in, they do a pretty good job of being able to understand that information semantically and returning the relevant chunks." - Derek Tu on the importance of embedding models over chunking strategies "If you are cost conscious and if you're worried about performance, I would definitely look at quantizing your embeddings. I think we've probably been able to, I don't have like the exact numbers here, but I think we might be saving at least half, right, in storage costs by quantizing everything." - Derek Tu on optimizing costs and performance with vector databases Derek Tu: LinkedIn Carbon Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) Key Takeaways: Understand your data sources: Before building your ETL pipeline, thoroughly assess the various data sources you'll be working with, such as Slack, Email, Google Docs, and more. Consider the unique characteristics of each source, including data format, structure, and metadata. Normalize and preprocess data: Develop strategies to normalize and preprocess the unstructured data from different sources. This may involve parsing, cleaning, and transforming the data into a standardized format that can be easily consumed by your AI models. Experiment with chunking strategies: While there's no one-size-fits-all approach to chunking, it's essential to experiment with different strategies to find what works best for your specific use case. Consider factors like data format, structure, and the desired granularity of the chunks. Leverage metadata and tagging: Metadata and tagging can play a crucial role in organizing and retrieving relevant data for your AI models. Implement mechanisms to capture and store important metadata, such as document types, topics, and timestamps, and consider using AI-powered tagging to automatically categorize your data. Choose the right embedding model: Embedding models have advanced significantly in recent years, so focus on selecting the right model for your needs rather than over-optimizing chunking strategies. Consider factors like model performance, dimensionality, and compatibility with your data types. Optimize vector database usage: When working with vector databases, consider techniques like quantization to reduce storage costs and improve performance. Experiment with different configurations and settings to find the optimal balance for your specific use case. 
00:00 Introduction and Optimizing Embedding Models 03:00 The Evolution of Carbon and Focus on Unstructured Data 06:19 Customer Progression and Target Group 09:43 Interesting Use Cases and Handling Different Data Representations 13:30 Chunking Strategies and Normalization 20:14 Approach to Chunking and Choosing a Vector Database 23:06 Tech Stack and Recommended Tools 28:19 Future of Carbon: Multimodal Models and Building a Platform Carbon, LLMs, RAG, chunking, data processing, global customer base, GDPR compliance, AI founders, AI agents, enterprises
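A hedged illustration of the embedding quantization Derek mentions: scalar-quantize float32 vectors to 8-bit integers to cut storage roughly 4x, accepting a small accuracy hit. This is a generic numpy sketch, not Carbon's implementation, and the exact savings he cites are not reproduced here.

```python
# Scalar (int8) quantization of embeddings: 4 bytes per dimension -> 1 byte per dimension.
import numpy as np

embeddings = np.random.default_rng(0).normal(size=(1000, 768)).astype(np.float32)

# Per-dimension min/max scaling into the 0..255 range.
lo, hi = embeddings.min(axis=0), embeddings.max(axis=0)
scale = (hi - lo) / 255.0
quantized = np.round((embeddings - lo) / scale).astype(np.uint8)

print(embeddings.nbytes, "bytes float32 ->", quantized.nbytes, "bytes uint8")

# Dequantize approximately at query time; the small error usually costs little recall.
restored = quantized.astype(np.float32) * scale + lo
```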
In this episode, Nicolay sits down with Hugo Lu, founder and CEO of Orchestra, a modern data orchestration platform. As data pipelines and analytics workflows become increasingly complex, spanning multiple teams, tools and cloud services, the need for unified orchestration and visibility has never been greater. Orchestra is a serverless data orchestration tool that aims to provide a unified control plane for managing data pipelines, infrastructure, and analytics across an organization's modern data stack. The core architecture involves users building pipelines as code which then run on Orchestra's serverless infrastructure. It can orchestrate tasks like data ingestion, transformation, AI calls, as well as monitoring and getting analytics on data products. All with end-to-end visibility, data lineage and governance even when organizations have a scattered, modular data architecture across teams and tools. Key Quotes: Find the right level of abstraction when building data orchestration tasks/workflows. "I think the right level of abstraction is always good. I think like Prefect do this really well, right? Their big sell was, just put a decorator on a function and it becomes a task. That is a great idea. You know, just make tasks modular and have them do all the boilerplate stuff like error logging, monitoring of data, all of that stuff.” Modularize data pipeline components: "It's just around understanding what that dev workflow should look like. I think it should be a bit more modular." Having a modular architecture where different components like data ingestion, transformation, model training are decoupled allows better flexibility and scalability. Adopt a streaming/event-driven architecture for low-latency AI use cases: "If you've got an event-driven architecture, then, you know, that's not what you use an orchestration tool for...if you're having a conversation with a chatbot, like, you know, you're sending messages, you're sending events, you're getting a response back. That I would argue should be dealt with by microservices." Hugo Lu: LinkedIn Newsletter Orchestra Orchestra Docs Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) 00:00 Introduction to Orchestra and its Focus on Data Products 08:03 Unified Control Plane for Data Stack and End-to-End Control 14:42 Use Cases and Unique Applications of Orchestra 19:31 Retaining Existing Dev Workflows and Best Practices in Orchestra 22:23 Event-Driven Architectures and Monitoring in Orchestra 23:49 Putting Data Products First and Monitoring Health and Usage 25:40 The Future of Data Orchestration: Stream-Based and Cost-Effective data orchestration, Orchestra, serverless architecture, versatility, use cases, maturity levels, challenges, AI workloads
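The decorator pattern Hugo praises in Prefect, as a minimal sketch (Prefect 2.x API assumed): a plain function becomes a task with retries and logging by adding a decorator, and a flow wires the tasks together. Function bodies are illustrative.

```python
# Tasks stay modular; the framework handles retries, logging, and observability.
from prefect import flow, task

@task(retries=2)
def extract() -> list[dict]:
    return [{"id": 1, "text": "raw record"}]

@task
def transform(rows: list[dict]) -> list[dict]:
    return [{**r, "text": r["text"].upper()} for r in rows]

@flow
def pipeline():
    rows = extract()
    return transform(rows)

if __name__ == "__main__":
    pipeline()  # runs locally; the same flow can be deployed to a scheduler
```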
Ever wondered how AI systems handle images and videos, or how they make lightning-fast recommendations? Tune in as Nicolay chats with Zain Hassan, an expert in vector databases from Weaviate. They break down complex topics like quantization, multi-vector search, and the potential of multimodal search, making them accessible for all listeners. Zain even shares a sneak peek into the future, where vector databases might connect our brains with computers! Zain Hasan: LinkedIn X (Twitter) Weaviate Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) Key Insights: Vector databases can handle not just text, but also image, audio, and video data Quantization is a powerful technique to significantly reduce costs and enable in-memory search Binary quantization allows efficient brute force search for smaller datasets Multi-vector search enables retrieval of heterogeneous data types within the same index The future lies in multimodal search and recommendations across different senses Brain-computer interfaces and EEG foundation models are exciting areas to watch Key Quotes: "Vector databases are pretty much the commercialization and the productization of representation learning." "I think quantization, it builds on the assumption that there is still noise in the embeddings. And if I'm looking, it's pretty similar as well to the thought of Matryoshka embeddings that I can reduce the dimensionality." "Going from text to multimedia in vector databases is really simple." "Vector databases allow you to take all the advances that are happening in machine learning and now just simply turn a switch and use them for your application." Chapters 00:00 - 01:24 Introduction 01:24 - 03:48 Underappreciated aspects of vector databases 03:48 - 06:06 Quantization trade-offs and techniques Various quantization techniques: binary quantization, product quantization, scalar quantization 06:06 - 08:24 Binary quantization Reducing vectors from 32-bits per dimension down to 1-bit Enables efficient in-memory brute force search for smaller datasets Requires normally distributed data between negative and positive values 08:24 - 10:44 Product quantization and other techniques Alternative to binary quantization, segments vectors and clusters each segment Scalar quantization reduces vectors to 8-bits per dimension 10:44 - 13:08 Quantization as a "superpower" to reduce costs 13:08 - 15:34 Comparing quantization approaches 15:34 - 17:51 Placing vector databases in the database landscape 17:51 - 20:12 Pruning unused vectors and nodes 20:12 - 22:37 Improving precision beyond similarity thresholds 22:37 - 25:03 Multi-vector search 25:03 - 27:11 Impact of vector databases on data interaction 27:11 - 29:35 Interesting and weird use cases 29:35 - 32:00 Future of multimodal search and recommendations 32:00 - 34:22 Extending recommendations to user data 34:22 - 36:39 What's next for Weaviate 36:39 - 38:57 Exciting technologies beyond vector databases and LLMs vector databases, quantization, hybrid search, multi-vector support, representation learning, cost reduction, memory optimization, multimodal recommender systems, brain-computer interfaces, weather prediction models, AI applications
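A minimal numpy sketch of the binary quantization Zain describes: keep one bit per dimension (the sign of the value) and compare vectors with Hamming distance, which makes brute-force search over smaller datasets cheap. This is a generic illustration, not Weaviate's implementation.

```python
# Binary quantization: 768 float32 values (3 KB) -> 96 packed bytes per vector.
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 768)).astype(np.float32)   # roughly zero-centered, as required
query = rng.normal(size=768).astype(np.float32)

corpus_bits = np.packbits(corpus > 0, axis=1)   # 1 bit per dimension
query_bits = np.packbits(query > 0)

# Hamming distance via XOR over the packed bytes, then popcount.
xor = np.bitwise_xor(corpus_bits, query_bits)
hamming = np.unpackbits(xor, axis=1).sum(axis=1)
print("closest ids:", np.argsort(hamming)[:5])
```

In practice the binary pass is used as a fast first filter, with the full-precision vectors re-scoring a small candidate set to recover accuracy.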
In this episode of "How AI is Built", data architect Anjan Banerjee provides an in-depth look at the world of data architecture and building complex AI and data systems. Anjan breaks down the basics using simple analogies, explaining how data architecture involves sorting, cleaning, and painting a picture with data, much like organizing Lego bricks to build a structure. Summary by Section Introduction Anjan Banerjee, a data architect, discusses building complex AI and data systems Explains the basics of data architecture using Lego and chat app examples Sources and Tools Identifying data sources is the first step in designing a data architecture Pick the right tools to extract data based on use cases (block storage for images, time series DB, etc.) Use one tool for most activities if possible, but specialized tools offer benefits Multi-modal storage engines are gaining popularity (Snowflake, Databricks, BigQuery) Airflow and Orchestration Airflow is versatile but has a learning curve; good for orgs with Python/data engineering skills For less technical orgs, GUI-based tools like Talend, Alteryx may be better AWS Step Functions and managed Airflow are improving native orchestration capabilities For multi-cloud, prefer platform-agnostic tools like Astronomer, Prefect, Airbyte AI and Data Processing ML is key for data-intensive use cases to avoid storing/processing petabytes in cloud TinyML and edge computing enable ML inference on device (drones, manufacturing) Cloud batch processing still dominates for user targeting, recommendations Data Lakes and Storage Storage choice depends on data types, use cases, cloud ecosystem Delta Lake excels at data versioning and consistency; Iceberg at partitioning and metadata Pulling data into separate system often needed for advanced analytics beyond source system Data Quality and Standardization "Poka-yoke" error-proofing of input screens is vital for downstream data quality Impose data quality rules and unified schemas (e.g. UTC timestamps) during ingestion Complexity arises with multi-region compliance (GDPR, CCPA) requiring encryption, sanitization Hot Takes and Wishes Snowflake is overhyped; great UX but costly at scale. Databricks is preferred. Automated data set joining and entity resolution across systems would be a game-changer Anjan Banerjee: LinkedIn Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) 00:00 Understanding Data Architecture 12:36 Choosing the Right Tools 20:36 The Benefits of Serverless Functions 21:34 Integrating AI in Data Acquisition 24:31 The Trend Towards Single Node Engines 26:51 Choosing the Right Database Management System and Storage 29:45 Adding Additional Storage Components 32:35 Reducing Human Errors for Better Data Quality 39:07 Overhyped and Underutilized Tools Data architecture, AI, data systems, data sources, data extraction, data storage, multi-modal storage engines, data orchestration, Airflow, edge computing, batch processing, data lakes, Delta Lake, Iceberg, data quality, standardization, poka-yoke, compliance, entity resolution
Jorrit Sandbrink, a data engineer specializing in open table formats, discusses the advantages of decoupling storage and compute, the importance of choosing the right table format, and strategies for optimizing your data pipelines. This episode is full of practical advice for anyone looking to build a high-performance data analytics platform.

Lake house architecture: a blend of data warehouse and data lake, addressing their shortcomings and providing a unified platform for diverse workloads.
Key components and decisions: storage options (cloud or on-prem), table formats (Delta Lake, Iceberg, Apache Hudi), and query engines (Apache Spark, Polars).
Optimizations: partitioning strategies, file size considerations, and auto-optimization tools for efficient data layout and query performance (a partitioning sketch follows these notes).
Orchestration tools: Airflow, Dagster, Prefect, and their roles in triggering and managing data pipelines.
Data ingress with dlt: an open-source Python library for building data pipelines, focusing on efficient data extraction and loading.

Key Takeaways:
Lake houses offer a powerful and flexible architecture for modern data analytics.
Open-source solutions provide cost-effective and customizable alternatives.
Carefully consider your specific use cases and preferences when choosing tools and components.
Tools like dlt simplify data ingress and can be easily integrated with serverless functions.
The data landscape is constantly evolving, so staying informed about new tools and trends is crucial.

Sound Bites:
"The lake house is sort of a modular setup where you decouple the storage and the compute."
"A lake house is an architecture, an architecture for data analytics platforms."
"The most popular table formats for a lake house are Delta, Iceberg, and Apache Hudi."

Jorrit Sandbrink: LinkedIn | dlt
Nicolay Gerold: LinkedIn | X (Twitter)

Chapters
00:00 Introduction to the Lake House Architecture
03:59 Choosing Storage and Table Formats
06:19 Comparing Compute Engines
21:37 Simplifying Data Ingress
25:01 Building a Preferred Data Stack

lake house, data analytics, architecture, storage, table format, query execution engine, document store, DuckDB, Polars, orchestration, Airflow, Dagster, dlt, data ingress, data processing, data storage
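A small sketch of the decoupled storage idea under stated assumptions: the `deltalake` (delta-rs) and Polars packages, a local path, and a partition column chosen for pruning. The table format choice here is illustrative, not a recommendation of Delta over Iceberg or Hudi.

```python
# Write a partitioned Delta table, then query it back with a lightweight engine.
import polars as pl
from deltalake import write_deltalake

df = pl.DataFrame({
    "event_date": ["2024-05-01", "2024-05-01", "2024-05-02"],
    "user_id": [1, 2, 3],
    "amount": [9.99, 4.50, 12.00],
})

# Partitioning by a low-cardinality column keeps file pruning effective at query time.
write_deltalake("./lake/events", df.to_arrow(), partition_by=["event_date"], mode="overwrite")

# Any engine that speaks the table format can read it; compute is decoupled from storage.
print(pl.read_delta("./lake/events").filter(pl.col("event_date") == "2024-05-01"))
```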
Kirk Marple, CEO and founder of Graphlit, discusses the evolution of his company from a data cataloging tool to a platform designed for ETL (Extract, Transform, Load) and knowledge retrieval for Large Language Models (LLMs). Graphlit empowers users to build custom applications on top of its API that go beyond naive RAG.

Key Points:
Knowledge Graphs: Graphlit utilizes knowledge graphs as a filtering layer on top of keyword metadata and vector search, aiding in information retrieval.
Storage for KGs: A single piece of content in their data model resides across multiple systems: a document store with JSON, a graph node, and a search index. This hybrid approach creates a virtual entity with representations in different databases.
Entity Extraction: Azure Cognitive Services and other models are employed to extract entities from text for improved understanding.
Metadata-first approach: The metadata-first strategy involves extracting comprehensive metadata from various sources, ensuring it is canonicalized and filterable. This approach aids in better indexing and retrieval of data, crucial for effective RAG.
Challenges: Entity resolution and deduplication remain significant challenges in knowledge graph development.

Notable Quotes:
"Knowledge graphs is a filtering [mechanism]...but then I think also the kind of spidering and pulling extra content in is the other place this comes into play."
"Knowledge graphs to me are kind of like index per se...you're providing a new type of index on top of that."
"[For RAG]...you have to find constraints to make it workable."
"Entity resolution, deduping, I think is probably the number one thing."
"I've essentially built a connector infrastructure that would be like a Fivetran or something that Airflow would have..."
"One of the reasons is because we're a platform as a service, the burstability of it is really important. We can spin up to a hundred instances without any problem, and we don't have to think about it."
"Once cost and performance become a no-brainer, we're going to start seeing LLMs be more of a compute tool. I think that would be a game-changer for how applications are built in the future."

Kirk Marple: LinkedIn | X (Twitter) | Graphlit | Graphlit Docs
Nicolay Gerold: LinkedIn | X (Twitter)

Chapters
00:00 Graphlit's Hybrid Approach
02:23 Use Cases and Transition to Graphlit
04:19 Knowledge Graphs as a Filtering Mechanism
13:23 Using Gremlin for Querying the Graph
32:36 XML in Prompts for Better Segmentation
35:04 The Future of LLMs and Graphlit
36:25 Getting Started with Graphlit

Graphlit, knowledge graphs, AI, document store, graph database, search index, co-pilot, entity extraction, Azure Cognitive Services, XML, event-driven architecture, serverless architecture, graph RAG, developer portal
From Problem to Requirements to Architecture. In this episode, Nicolay Gerold and Jon Erich Kemi Warghed discuss the landscape of data engineering, sharing insights on selecting the right tools, implementing effective data governance, and leveraging powerful concepts like software-defined assets. They discuss the challenges of keeping up with the ever-evolving tech landscape and offer practical advice for building sustainable data platforms. Tune in to discover how to simplify complex data pipelines, unlock the power of orchestration tools, and ultimately create more value from your data. "Don't overcomplicate what you're actually doing." "Getting your basic programming software development skills down is super important to becoming a good data engineer." "Who has time to learn 500 new tools? It's like, this is not humanly possible anymore." Key Takeaways: Data Governance: Data governance is about transparency and understanding the data you have. It's crucial for organizations as they scale and data becomes more complex. Tools like dbt and Dagster can help achieve this. Open Source Tooling: When choosing open source tools, assess their backing, commit frequency, community support, and ease of use. Agile Data Platforms: Focus on the capabilities you want to enable and prioritize solving the core problems of your data engineers and analysts. Software Defined Assets: This concept, exemplified by Dagster, shifts the focus from how data is processed to what data should exist. This change in mindset can greatly simplify data orchestration and management. The Importance of Fundamentals: Strong programming and software development skills are crucial for data engineers, and understanding the basics of data management and orchestration is essential for success. The Importance of Versioning Data: Data has to be versioned so you can easily track changes, revert to previous states if needed, and ensure reproducibility in your data pipelines. lakeFS applies the concepts of Git to your data lake. This gives you the ability to create branches for different development environments, commit changes to specific versions, and merge branches together once changes have been tested and validated. Jon Erik Kemi Warghed: LinkedIn Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) Chapters 00:00 The Problem with the Modern Data Stack: Too many tools and buzzwords 00:57 How to Choose the Right Tools: Considerations for startups and large companies 03:13 Evaluating Open Source Tools: Background checks and due diligence 07:52 Defining Data Governance: Transparency and understanding of data 10:15 The Importance of Data Governance: Challenges and solutions 12:21 Data Governance Tools: dbt and Dagster 17:05 The Impact of Dagster: Software-defined assets and declarative thinking 19:31 The Power of Software Defined Assets: How Dagster differs from Airflow and Mage 21:52 State Management and Orchestration in Dagster: Real-time updates and dependency management 26:24 Why Use Orchestration Tools?: The role of orchestration in complex data pipelines 28:47 The Importance of Tool Selection: Thinking about long-term sustainability 31:10 When to Adopt Orchestration: Identifying the need for orchestration tools
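A minimal sketch of the software-defined-assets idea discussed here, using Dagster's `@asset` decorator: you declare what data should exist, and dependencies are inferred from function arguments. Asset names and logic are illustrative.

```python
# Declarative assets: "daily_revenue depends on raw_orders" is expressed by the argument name.
from dagster import asset, materialize

@asset
def raw_orders() -> list[dict]:
    return [{"order_id": 1, "amount": 42.0}]

@asset
def daily_revenue(raw_orders: list[dict]) -> float:
    # Depends on raw_orders simply by naming it as an argument.
    return sum(o["amount"] for o in raw_orders)

if __name__ == "__main__":
    materialize([raw_orders, daily_revenue])
```

The shift from "run these steps in this order" to "these assets should exist" is what makes lineage, freshness checks, and selective re-materialization straightforward.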
In this episode, Nicolay Gerold interviews John Wessel, the founder of Agreeable Data, about data orchestration. They discuss the evolution of data orchestration tools, the popularity of Apache Airflow, the crowded market of orchestration tools, and the key problem that orchestrators solve. They also explore the components of a data orchestrator, the role of AI in data orchestration, and how to choose the right orchestrator for a project. They touch on the challenges of managing orchestrators, the importance of monitoring and optimization, and the need for product people to be more involved in the orchestration space. They also discuss data residency considerations and the future of orchestration tools.

Sound Bites:
"The modern era, definitely Airflow. Took the market share, a lot of people running it themselves."
"It's like people are launching new orchestrators every day. This is a funny one. This was like two weeks ago, somebody launched an orchestrator that was like a meta-orchestrator."
"The DAG introduced two other components. It's directed acyclic graph is what DAG means, but directed is like there's a start and there's a finish and the acyclic is there's no loops."

Key Topics:
The evolution of data orchestration: from basic task scheduling to complex DAG-based solutions (a minimal DAG sketch follows these notes).
What is a data orchestrator and when do you need one? Understanding the role of orchestrators in handling complex dependencies and scaling data pipelines.
The crowded market: a look at popular options like Airflow, Dagster, Prefect, and more.
Best practices: choosing the right tool, prioritizing serverless solutions when possible, and focusing on solving the use case before implementing complex tools.
Data residency and GDPR: how regulations influence tool selection, especially in Europe.
Future of the field: the need for consolidation and finding the right balance between features and usability.

John Wessel: LinkedIn | Data Stack Show | Agreeable Data
Nicolay Gerold: LinkedIn | X (Twitter)

data orchestration, data movement, Apache Airflow, orchestrator selection, DAG, AI in orchestration, serverless, Kubernetes, infrastructure as code, monitoring, optimization, data residency, product involvement, generative AI

Chapters
00:00 Introduction and Overview
00:34 The Evolution of Data Orchestration Tools
04:54 Components and Flow of Data in Orchestrators
08:24 Deployment Options: Serverless vs. Kubernetes
11:14 Considerations for Data Residency and Security
13:02 The Need for a Clear Winner in the Orchestration Space
20:47 Optimization Techniques for Memory and Time-Limited Issues
23:09 Integrating Orchestrators with Infrastructure-as-Code
24:33 Bridging the Gap Between Data and Engineering Practices
27:22 Exciting Technologies Outside of Data Orchestration
30:09 The Future of Dagster
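A minimal sketch of the DAG concept from the episode, using Airflow's TaskFlow API (Airflow 2.4+ assumed for the `schedule` argument): a directed start-to-finish chain of tasks with no cycles. Schedule and task bodies are illustrative.

```python
# A tiny DAG: extract -> load, directed and acyclic.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def nightly_pipeline():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]

    @task
    def load(values: list[int]) -> None:
        print(f"loaded {len(values)} rows")

    load(extract())   # the directed edge: load runs only after extract succeeds

nightly_pipeline()
```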
In this episode of "How AI is Built", we learn how to build and evaluate real-world language model applications with Shahul and Jithin, creators of Ragas. Ragas is a powerful open-source library that helps developers test, evaluate, and fine-tune Retrieval Augmented Generation (RAG) applications, streamlining their path to production readiness. Main Insights Challenges of Open-Source Models: Open-source large language models (LLMs) can be powerful tools, but require significant post-training optimization for specific use cases. Evaluation Before Deployment: Thorough testing and evaluation are key to preventing unexpected behaviors and hallucinations in deployed RAGs. Ragas offers metrics and synthetic data generation to support this process. Data is Key: The quality and distribution of data used to train and evaluate LLMs dramatically impact their performance. Ragas is enabling novel synthetic data generation techniques to make this process more effective and cost-efficient. RAG Evolution: Techniques for improving RAGs are continuously evolving. Developers must be prepared to experiment and keep up with the latest advancements in chunk embedding, query transformation, and model alignment. Practical Takeaways Start with a solid testing strategy: Before launching, define the quality metrics aligned with your RAG's purpose. Ragas helps in this process. Embrace synthetic data: Manually creating test data sets is time-consuming. Tools within Ragas help automate the creation of synthetic data to mirror real-world use cases. RAGs are iterative: Be prepared for continuous improvement as better techniques and models emerge. Interesting Quotes "...models are very stochastic and grading it directly would rather trigger them to give some random number..." - Shahul, on the dangers of naive model evaluation. "Reducing the developer time in acquiring these test data sets by 90%." - Shahul, on the efficiency gains of Ragas' synthetic data generation. "We want to ensure maximum diversity..." - Shahul, on creating realistic and challenging test data for RAG evaluation. Ragas: Web Docs Jithin James: LinkedIn Shahul ES: LinkedIn X (Twitter) Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) 00:00 Introduction 02:03 Introduction to Open Assistant project 04:05 Creating Customizable and Fine-Tunable Models 06:07 Ragas and the LLM Use Case 08:09 Introduction to Language Model Metrics (LLMs) 11:12 Reducing the Cost of Data Generation 13:19 Evaluation of Components at Melvess 15:40 Combining Ragas Metrics with AutoML Providers 20:08 Improving Performance with Fine-tuning and Reranking 22:56 End-to-End Metrics and Component-Specific Metrics 25:14 The Importance of Deep Knowledge and Understanding 25:53 Robustness vs Optimization 26:32 Challenges of Evaluating Models 27:18 Creating a Dream Tech Stack 27:47 The Future Roadmap for Ragas 28:02 Doubling Down on Grid Data Generation 28:12 Open-Source Models and Expanded Support 28:20 More Metrics for Different Applications RAG, Ragas, LLM, Evaluation, Synthetic Data, Open-Source, Language Model Applications, Testing.