Saving 10s of thousands of dollars deploying AI at scale with Kubernetes, with John McBride

Update: 2025-03-18

Description

Curious about running AI models on Kubernetes without breaking the bank? This episode delivers practical insights from someone who's done it successfully at scale.

John McBride, VP of Infrastructure and AI Engineering at the Linux Foundation, shares how his team at OpenSauced built StarSearch, an AI feature that uses natural language processing to analyze GitHub contributions and provide insights through semantic queries. By using open-source models instead of commercial APIs, the team saved tens of thousands of dollars.

You will learn:

How to deploy vLLM on Kubernetes to serve open-source LLMs such as Mistral and Llama, including configuration challenges with GPU drivers and DaemonSets
Why smaller models (7-14B parameters), with proper prompt engineering, can reach 95% of the effectiveness of larger commercial models for many tasks
How running inference workloads on your own infrastructure with T4 GPUs can cut costs from tens of thousands of dollars to just a couple of thousand dollars per month
Practical approaches to monitoring GPU workloads in production, including handling unpredictable failures and VRAM consumption issues
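A minimal sketch of the kind of vLLM deployment discussed in the episode, written in Python so the manifest can be generated and piped to `kubectl apply -f -` as JSON. Everything specific here is an assumption for illustration, not the OpenSauced setup: the `vllm-mistral` name, the `mistralai/Mistral-7B-Instruct-v0.2` model ID, the `vllm/vllm-openai:latest` image tag, and the probe timings. It relies on the NVIDIA device plugin (usually run as a DaemonSet) to expose the `nvidia.com/gpu` resource on GPU nodes. Whether a given model fits in a T4's 16 GB depends on its size, precision, and context length, and a gated Hugging Face model would also need an access token, which this sketch omits.

```python
import json


# Hypothetical defaults: the name, model ID, and image tag are illustrative
# assumptions, not the configuration discussed in the episode.
def vllm_deployment(
    name: str = "vllm-mistral",
    model: str = "mistralai/Mistral-7B-Instruct-v0.2",
    image: str = "vllm/vllm-openai:latest",
    gpus: int = 1,
) -> dict:
    """Build a Kubernetes Deployment manifest (as a dict) running one vLLM replica."""
    labels = {"app": name}
    container = {
        "name": "vllm",
        "image": image,
        # These args go to vLLM's OpenAI-compatible API server entrypoint.
        # T4 GPUs lack bfloat16 support, so request float16 ("half") weights.
        "args": ["--model", model, "--dtype", "half"],
        "ports": [{"containerPort": 8000}],
        # Request GPUs via the extended resource advertised by the NVIDIA
        # device plugin DaemonSet; GPUs are specified under limits.
        "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
        "readinessProbe": {
            "httpGet": {"path": "/health", "port": 8000},
            # Downloading and loading weights can take minutes (assumed timings).
            "initialDelaySeconds": 60,
            "periodSeconds": 10,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": 1,
            # The selector must match the pod template's labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests: python vllm_manifest.py | kubectl apply -f -
    print(json.dumps(vllm_deployment(), indent=2))
```

Generating the manifest from a function keeps the model ID and GPU count as parameters, so one template can serve several models; a plain YAML file or a Helm chart would be the more common choice in practice.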
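For the VRAM consumption issues mentioned above, one lightweight check is to poll `nvidia-smi` and flag GPUs whose memory use crosses a threshold. This is a hypothetical sketch, not the monitoring the team described: `GpuMemory`, `parse_gpu_memory`, `over_threshold`, `query_gpus`, and the 0.95 default threshold are illustrative names and values. In a cluster, NVIDIA's DCGM exporter is a common way to ship the same metrics to Prometheus instead.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class GpuMemory:
    index: int
    used_mib: int
    total_mib: int

    @property
    def fraction_used(self) -> float:
        return self.used_mib / self.total_mib


def parse_gpu_memory(csv_text: str) -> list[GpuMemory]:
    """Parse output of `nvidia-smi --query-gpu=index,memory.used,memory.total
    --format=csv,noheader,nounits`: one GPU per line, values in MiB."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        gpus.append(GpuMemory(int(index), int(used), int(total)))
    return gpus


def over_threshold(gpus: list[GpuMemory], threshold: float = 0.95) -> list[GpuMemory]:
    """Return GPUs whose used/total memory ratio is at or above the
    (hypothetical) alerting threshold."""
    return [gpu for gpu in gpus if gpu.fraction_used >= threshold]


def query_gpus() -> list[GpuMemory]:
    """Run nvidia-smi on the current node and parse its CSV output."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_memory(out)
```

On a GPU node, `over_threshold(query_gpus())` returns the GPUs above the threshold. vLLM preallocates most of a GPU's memory at startup (governed by its `--gpu-memory-utilization` flag, 0.9 by default), so high usage is expected on a vLLM node; a threshold set just above that reservation is better at catching unexpected extra consumers than at measuring load.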

Sponsor

This episode is brought to you by StackGen! Don't let infrastructure block your teams. StackGen deterministically generates secure cloud infrastructure from any input: existing cloud environments, IaC, or application code.

More info

Find all the links and info for this episode here: https://ku.bz/wP6bTlrFs
Interested in sponsoring an episode? Learn more.


Saving 10s of thousands of dollars deploying AI at scale with Kubernetes, with John McBride

2025-03-18 · 52:09
