Hanselminutes with Scott Hanselman
Inference Engineering with Baseten's Philip Kiely

Update: 2026-02-26
Digest

This podcast features Philip Kiely, author of "Inference Engineering," discussing why he wrote his book by hand amid a flood of AI-generated "slop." The conversation presents AI as an assistant, not an author, used for tasks like code generation and alphabetizing lists. Kiely emphasizes the rigorous writing process and the value of dense technical content over superficial online material. The discussion covers practical uses of AI for "toil" rather than "slop," focusing on optimizing existing applications for speed and cost-efficiency. Key concepts like "time to first token" versus "throughput," multi-variable optimization, and the "efficient frontier" in AI performance are explained. The podcast also differentiates between prompt engineering and inference engineering, explaining how inference engines guarantee structured output using logit biasing and speed up inference with KV cache reuse. The importance of understanding AI internals for effective usage and troubleshooting is stressed, using the analogy of driving a stick shift. The conversation concludes by emphasizing the collaborative nature of AI development and the rapid productionization of AI research.

Outlines

00:00:00
Introduction and the Genesis of "Inference Engineering"

The podcast begins with a sponsor message for TX Text Control's platform-independent .NET applications. Scott Hanselman introduces Philip Kiely, author of "Inference Engineering," discussing the book's timely release and the decision to write it manually because AI-generated drafts were unusable.

00:02:40
AI as a Tool, Not an Author: The Writing Process

Philip Kiely clarifies that while AI was explored for outlining and drafting, "Inference Engineering" was primarily written by him, with AI used for minor tasks like code generation and alphabetizing lists. He details the intense six-week writing sprint and presents the book's creation as a collaboration between humans and AI tools.

00:09:07
The Philosophy and Value of "Inference Engineering"

Kiely explains the decision to write a dense, challenging book on AI, aiming to provide technical depth for motivated readers. He acknowledges the book's broad survey nature, explaining it as a trade-off for quick production and a way to prompt further questions. Because AI evolves rapidly, future editions may be needed.

00:12:09
Contributing to AI and Practical Applications

Kiely discusses overcoming imposter syndrome in AI writing, arguing that valuable contributions are possible without extensive academic credentials. The conversation emphasizes inclusivity in AI, regardless of age or experience, and highlights AI's true value in performing helpful work, focusing on optimizing existing applications for speed and cost.

00:17:13
Optimizing AI Performance: Speed, Cost, and Reliability

Baseten's focus on speed and performance in AI is detailed, with "Inference Engineering" extensively covering AI speed and cost-effectiveness. The distinction between "time to first token" and "throughput" is explained, along with multi-variable optimization involving speed, cost, and quality, visualized by the "efficient frontier" and "performance sphere."

00:20:21
The Economics and Layers of AI Efficiency

Faster, cheaper AI leads to serving more traffic with less hardware, though demand often offsets cost reductions. AI efficiency operates on micro and macro levels, challenging the "good, fast, or cheap" dilemma as the market increasingly expects all three.

00:22:21
Reliability and Productionization in AI

Reliability in AI encompasses model consistency and infrastructure uptime, especially GPU reliability. AI research is rapidly hardened for production, moving from paper to deployment in weeks, with significant effort to make research-grade AI suitable for critical applications.

00:23:57
Prompt Engineering vs. Inference Engineering

The distinction between prompt engineering (interacting via prompts) and inference engineering (optimizing AI execution) is explored. Inference engineering acts as the underlying infrastructure that enables prompt engineers, and it can guarantee structured output using logit biasing and vocabulary masking.

00:26:23
Optimizing AI Output and Understanding Internals

While the inference engine guarantees output structure, content quality still depends on prompting. KV cache reuse significantly boosts inference speed. Over-reliance on prompts is cautioned against, as the inference engine acts as a "firewall" for deterministic output. Understanding AI internals enhances usage and leads to building more effective systems.

00:29:30
The Necessity of Deep AI Understanding

Learning AI internals is compared to driving a stick shift, providing deeper control and enabling effective troubleshooting when automated systems fail. A collaborative approach between those who understand inference and those who handle AI interaction is proposed for navigating complex AI challenges. The podcast concludes with congratulations on the publication of "Inference Engineering."

Keywords

Inference Engineering


The process of optimizing and deploying AI models for efficient execution and prediction, focusing on making AI models faster, cheaper, and more reliable in production environments.

AI Slop


Refers to low-quality, unoriginal, or AI-generated content that lacks depth and value, often produced without human oversight or critical thinking.

Logit Biasing


A technique used in large language models (LLMs) to steer token generation by adjusting or masking the logits (pre-softmax scores) of candidate tokens, ensuring adherence to predefined rules or structures, like JSON output.
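The masking described above can be illustrated with a toy example. This is a minimal sketch, not how any particular inference engine implements it: the vocabulary, the logit values, and the `mask_logits` helper are all hypothetical, and a real engine would derive the allowed set from a grammar or JSON schema at each decoding step.

```python
import math

def mask_logits(logits, allowed_ids):
    """Set the logit of every disallowed token to -inf.

    After softmax, those tokens have probability exactly 0,
    so they can never be sampled.
    """
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical six-token vocabulary and raw model scores.
vocab = ["{", "}", '"name"', ":", "hello", "sure"]
logits = [1.0, 0.5, 0.2, 0.1, 2.0, 3.0]

# Assumption: at the start of a JSON object, only "{" is structurally valid.
allowed_ids = [0]

masked = mask_logits(logits, allowed_ids)
probs = softmax(masked)
next_token = vocab[probs.index(max(probs))]
print(next_token)
```

Without the mask, the highest-scoring token would be "sure"; with it, the only token that can be emitted is the one the structure permits. The mask constrains form only, so the content still depends on the prompt and the model.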

KV Cache Reuse


An optimization technique in LLMs where previously computed key-value states from prompt tokens are reused across multiple requests, significantly speeding up inference.
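A toy model of prefix reuse may make the speedup concrete. This sketch assumes a hypothetical `PrefixKVCache` class and replaces the real key/value computation with a counter; production engines manage cache memory far more carefully than storing every prefix as done here.

```python
class PrefixKVCache:
    """Toy prefix cache: reuses per-token state for shared prompt prefixes."""

    def __init__(self):
        self._cache = {}          # tuple of prompt tokens -> tuple of KV entries
        self.tokens_computed = 0  # how many tokens needed fresh computation

    def _compute_kv(self, token):
        # Stand-in for the real attention key/value computation.
        self.tokens_computed += 1
        return ("kv", token)

    def prefill(self, tokens):
        tokens = tuple(tokens)
        # Find the longest prefix of this prompt that is already cached.
        best = 0
        for n in range(len(tokens), 0, -1):
            if tokens[:n] in self._cache:
                best = n
                break
        kv = list(self._cache[tokens[:best]]) if best else []
        # Compute only the tokens past the cached prefix.
        for tok in tokens[best:]:
            kv.append(self._compute_kv(tok))
            # Toy simplification: remember every prefix for later requests.
            self._cache[tokens[:len(kv)]] = tuple(kv)
        return kv

# Hypothetical shared system prompt followed by different user messages.
system = ["You", "are", "a", "helpful", "assistant"]
cache = PrefixKVCache()

cache.prefill(system + ["Hi"])
first = cache.tokens_computed

cache.prefill(system + ["Summarize", "this"])
second = cache.tokens_computed - first

print(first, second)
```

The first request pays for all six tokens; the second shares the five-token system prompt and only computes its two new tokens.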

Prompt Engineering


The practice of designing and refining input prompts to guide AI models, particularly LLMs, towards desired outputs and behaviors.

Throughput


A measure of the amount of work an AI system can perform over a given period, often contrasted with "time to first token" as a key performance indicator.
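As a rough illustration of how the two metrics differ, the sketch below derives both from the arrival times of streamed tokens for a single request. The `latency_metrics` function and the timestamps are hypothetical, and per-request tokens per second is only one way to define throughput; system-level throughput is usually aggregated across many concurrent requests.

```python
def latency_metrics(request_start, token_times):
    """Return (time to first token, tokens per second) for one request.

    request_start: time the request was sent, in seconds.
    token_times:   arrival time of each generated token, in seconds.
    """
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    tokens_per_second = len(token_times) / total
    return ttft, tokens_per_second

# Hypothetical request: first token after 0.5 s, then one token every 0.1 s.
request_start = 0.0
token_times = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

ttft, tps = latency_metrics(request_start, token_times)
print(ttft, tps)
```

A system can score well on one metric and poorly on the other: large batches tend to raise total tokens per second while making each user wait longer for the first token.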

Efficient Frontier


In AI performance, a curve representing the optimal trade-offs between competing variables like speed, cost, and quality, indicating the best achievable performance for a given set of constraints.
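One way to picture the frontier is as the set of configurations that no other configuration beats on every axis. The sketch below assumes quality is held fixed and compares only latency and cost, where lower is better for both; the deployment names and numbers are invented for illustration.

```python
def efficient_frontier(configs):
    """Return names of configs not dominated on (latency, cost).

    A config is dominated if another config is at least as good on
    both axes and strictly better on at least one.
    """
    frontier = []
    for name, latency, cost in configs:
        dominated = any(
            l2 <= latency and c2 <= cost and (l2 < latency or c2 < cost)
            for _, l2, c2 in configs
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical deployments: (name, latency in ms, cost per million tokens).
configs = [
    ("A", 100, 4.0),
    ("B", 200, 2.0),
    ("C", 250, 3.0),  # slower and pricier than B, so it is dominated
    ("D", 400, 1.0),
]

frontier = efficient_frontier(configs)
print(frontier)
```

Every point on the frontier is a defensible choice depending on priorities; a point off it, like C here, can be improved on both axes at once.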

TX Text Control


A sponsor offering platform-independent .NET applications for document editing, signing, collaboration, and PDF processing, deployable on Windows, Linux, and cloud services.

Q&A

  • How was AI used in the creation of the book "Inference Engineering"?

    While AI was explored for tasks like outlining and drafting, the book was primarily written manually. AI was used for specific, time-saving tasks such as generating code snippets and alphabetizing lists, ensuring the quality and integrity of the content.

  • What is the difference between prompt engineering and inference engineering?

    Prompt engineering focuses on crafting effective inputs (prompts) to guide AI models. Inference engineering, on the other hand, is about optimizing the AI model's execution for speed, cost, and reliability in production environments.

  • How does logit biasing improve AI output?

    Logit biasing masks the scores of tokens in an LLM's vocabulary so that only tokens valid under a predefined structure, such as JSON, can be generated. This guarantees the structural integrity of the output.

  • What is KV cache reuse and why is it important for AI performance?

    KV cache reuse involves reusing computed states from initial prompt tokens across multiple requests. This significantly speeds up inference, especially when prompts share common context, leading to better overall system performance.

  • Can AI truly write a book, or is it just "AI slop"?

    Currently, AI struggles with the nuanced, coherent, and original long-form writing required for a high-quality book. While AI can assist with specific tasks, the "slop" refers to low-quality, unoriginal AI-generated content, contrasting with the deliberate and human-driven creation of valuable works.

Show Notes

This week on the show, Scott talks to Philip Kiely about his new book, Inference Engineering. Inference Engineering is your guide to becoming an expert in inference. It contains everything that Philip has learned in four years of working at Baseten. This book is based on the hundreds of thousands of words of documentation, blogs, and talks he's written on inference; interviews with dozens of experts from Baseten's engineering team; and countless conversations with customers and builders around the world.

