KubeFM

71 Episodes
This episode unpacks the technical and governance milestones that secured Flux's place in the cloud-native ecosystem, from a 45-minute production outage that led to the birth of GitOps to the CNCF process that defines project maturity and the handover of stewardship after Weaveworks' closure.

You will learn:
- How a single incident pushed Weaveworks to adopt Git as the source of truth, creating the foundation of GitOps.
- How Flux sustained continuity after Weaveworks shut down through community governance.
- Where Flux is heading next with security guidance, Flux v2, and an enterprise-ready roadmap.

Sponsor: Join the Flux maintainers and community at FluxCon, November 11th in Salt Lake City—register here.

More info: Find all the links and info for this episode here: https://ku.bz/5Sf5wpd8y
Ben Poland walks through Faire's complete CI transformation, from a single Jenkins instance struggling with thousands of lines of Groovy to a distributed Buildkite system running across multiple Kubernetes clusters.

He details the technical challenges of running CI workloads at scale, including API rate limiting, etcd pressure points, and the trade-offs of splitting monolithic pipelines into service-scoped ones.

You will learn:
- How to architect CI systems that match team ownership and eliminate shared failure points across services
- Kubernetes scaling patterns for CI workloads, including multi-cluster strategies, predictive node provisioning, and handling API throttling
- Performance optimization techniques like Git mirroring, node-level caching, and spot instance management for variable CI demands (see the sketch after this episode's links)
- Migration strategies and lessons learned from moving away from monolithic CI, including proof-of-concept approaches and avoiding the sunk cost fallacy

Sponsor: This episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.io

More info: Find all the links and info for this episode here: https://ku.bz/klBmzMY5-
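As a rough sketch of the Git-mirroring technique mentioned above (the repository URL and cache path are hypothetical, not Faire's actual setup): each CI node keeps a persistent bare mirror, so a job's checkout fetches only the delta instead of re-cloning a large repository over the network on every build.

```python
import subprocess
from pathlib import Path

MIRROR = Path("/var/cache/git/app.git")      # persistent per-node cache (assumed)
REPO = "git@github.com:example/app.git"      # hypothetical monorepo

def checkout(workdir: str, ref: str) -> None:
    # Create the mirror once per node; later jobs reuse it.
    if not MIRROR.exists():
        subprocess.run(["git", "clone", "--mirror", REPO, str(MIRROR)], check=True)
    # Refresh the mirror: only new objects cross the network.
    subprocess.run(["git", "-C", str(MIRROR), "fetch", "--prune"], check=True)
    # Clone against the local mirror: fast, and light on the Git server.
    subprocess.run(["git", "clone", "--reference", str(MIRROR), REPO, workdir],
                   check=True)
    subprocess.run(["git", "-C", workdir, "checkout", ref], check=True)
```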
Danyl Novhorodov, a veteran .NET engineer and architect at Eneco, presents his controversial thesis that 90% of teams don't actually need Kubernetes. He walks through practical decision-making frameworks, explores powerful alternatives like BEAM runtimes and Actor models, and explains why starting with modular monoliths often beats premature microservices adoption.

You will learn:
- The COST decision framework - How to evaluate infrastructure choices based on Complexity, Ownership, Skills, and Time rather than industry hype
- Platform engineering vs. managed services - How to honestly assess whether your team can compete with AWS, Azure, and Google's managed container platforms
- Evolutionary architecture approach - Why modular monoliths with clear boundaries often provide better foundations than distributed systems from day one

Sponsor: This episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.io

More info: Find all the links and info for this episode here: https://ku.bz/BYhFw8RwW
Running 30 Kubernetes clusters serving 300,000 requests per second sounds impressive until your Vertical Pod Autoscaler goes rogue and starts evicting critical system pods in an endless loop.

Thibault Jamet shares the technical details of debugging a complex VPA failure at Adevinta, where webhook timeouts triggered continuous pod evictions across their multi-tenant Kubernetes platform.

You will learn:
- VPA architecture deep dive - How the recommender, updater, and mutating webhook components interact and what happens when the webhook fails
- Hidden Kubernetes limits - How default QPS and burst rate limits in the Kubernetes Go client can cause widespread failures (sketched below), and why these aren't well documented in Helm charts
- Monitoring strategies for autoscaling - What metrics to track for webhook latency and pod eviction rates to catch similar issues before they become critical

Sponsor: This episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.io

More info: Find all the links and info for this episode here: https://ku.bz/rf1pbWXdN
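To make the rate-limit discussion concrete: client-go defaults to a client-side throttle of roughly 5 queries per second with a burst of 10. This minimal token-bucket sketch (an illustration, not Adevinta's code) shows how a storm of eviction-related API calls queues up behind those defaults.

```python
import time

class TokenBucket:
    """Minimal token bucket mirroring client-go's default client-side
    throttle (QPS=5, burst=10). Illustrative only."""
    def __init__(self, qps: float = 5.0, burst: int = 10):
        self.qps = qps
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def wait(self) -> float:
        """Block until a token is available; return seconds waited."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.qps)
        self.last = now
        if self.tokens < 1:
            delay = (1 - self.tokens) / self.qps
            time.sleep(delay)          # call sits in the throttle queue
            self.tokens = 0.0
            self.last = time.monotonic()
            return delay
        self.tokens -= 1
        return 0.0

bucket = TokenBucket()
# Simulate 50 back-to-back API calls, as a misbehaving controller might issue:
total_wait = sum(bucket.wait() for _ in range(50))
print(f"50 calls spent a combined {total_wait:.1f}s waiting behind the throttle")
```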
Jorrick Stempher shares how his team of eight students built a complete predictive scaling system for Kubernetes clusters using machine learning.

Rather than waiting for nodes to become overloaded, their system uses the Prophet forecasting model to proactively anticipate load patterns and scale infrastructure, giving them the 8-9 minutes needed to provision new nodes on Vultr.

You will learn:
- How to implement predictive scaling using the Prophet ML model, Prometheus metrics, and custom APIs to forecast Kubernetes workload patterns (a minimal sketch follows below)
- The Node Ranking Index (NRI) - a unified metric that combines CPU, RAM, and request data into a single comparable number for efficient scaling decisions
- Real-world implementation challenges, including data validation, node startup timing constraints, load testing strategies, and the importance of proper research before building complex scaling solutions

Sponsor: This episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.io

More info: Find all the links and info for this episode here: https://ku.bz/clbDWqPYp
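Here is a minimal sketch of the forecast-then-provision idea, assuming a Prometheus endpoint and the open-source prophet package; the query, capacity stub, and thresholds are illustrative assumptions, not the students' actual code.

```python
import time

import pandas as pd
import requests
from prophet import Prophet  # pip install prophet

PROM = "http://prometheus:9090"  # assumed Prometheus address
QUERY = 'sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))'

def current_capacity() -> float:
    return 24.0  # stub: cores currently provisioned across the pool

# Pull a week of cluster-wide CPU usage at one-minute resolution.
end = time.time()
resp = requests.get(f"{PROM}/api/v1/query_range", params={
    "query": QUERY, "start": end - 7 * 86400, "end": end, "step": "60s"})
samples = resp.json()["data"]["result"][0]["values"]  # [[ts, "value"], ...]

# Prophet expects a frame with 'ds' (timestamps) and 'y' (values).
df = pd.DataFrame(samples, columns=["ds", "y"])
df["ds"] = pd.to_datetime(df["ds"], unit="s")
df["y"] = df["y"].astype(float)

model = Prophet()
model.fit(df)

# Look 12 minutes ahead: just past the 8-9 minutes a new Vultr node
# needs, so capacity ordered now is ready before the predicted load.
future = model.make_future_dataframe(periods=12, freq="min")
forecast = model.predict(future)
predicted_peak = float(forecast.tail(12)["yhat_upper"].max())

if predicted_peak > current_capacity():
    print("Provision a node now: forecast exceeds capacity within the startup window")
```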
If you're running Java applications in Kubernetes, you've likely experienced the pain of slow pod startups affecting user experience during deployments and scaling events.

Frédéric Gaudet, Senior SRE at BlaBlaCar, shares how his team solved the cold start problem for their 1,500 Java microservices using Istio's warm-up capabilities.

You will learn:
- Why Java applications struggle with cold starts and how JIT compilation affects initial request latency in Kubernetes environments
- How Istio's warm-up feature works to gradually ramp up traffic to new pods (see the sketch below)
- Why other common solutions fail, including resource over-provisioning, init containers, and tools like GraalVM
- Real production impact from implementing this solution, including dramatic improvements in message moderation SLOs at BlaBlaCar's scale of 4,000 pods

Sponsor: This episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.io

More info: Find all the links and info for this episode here: https://ku.bz/grxcypt9j
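For reference, Istio exposes warm-up through the DestinationRule API (trafficPolicy.loadBalancer.warmupDurationSecs in recent releases). A sketch of applying one via the Kubernetes Python client follows; the service name, namespace, and duration are hypothetical, not BlaBlaCar's configuration.

```python
from kubernetes import client, config

destination_rule = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "DestinationRule",
    "metadata": {"name": "orders-warmup", "namespace": "prod"},
    "spec": {
        "host": "orders.prod.svc.cluster.local",
        "trafficPolicy": {
            "loadBalancer": {
                "simple": "ROUND_ROBIN",
                # Ramp traffic to a newly ready pod over 5 minutes instead
                # of handing it a full share immediately, giving the JIT
                # compiler time to warm hot code paths.
                "warmupDurationSecs": "300s",
            }
        },
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="networking.istio.io", version="v1beta1",
    namespace="prod", plural="destinationrules", body=destination_rule)
```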
Brian Donelan, VP Cloud Platform Engineering at JPMorgan Chase, shares his ingenious side project that automatically scales Kubernetes workloads based on whether his MacBook is open or closed.

By connecting macOS screen lock events to CloudWatch, KEDA, and Karpenter, he built a system that achieves 80% cost savings by scaling pods and nodes to zero when he's away from his laptop.

You will learn:
- How KEDA differs from traditional Kubernetes HPA - including its scale-to-zero capabilities, event-driven scaling, and extensive ecosystem of 60+ built-in scalers (a ScaledObject sketch follows below)
- The technical architecture connecting macOS notifications through CloudWatch to trigger Kubernetes autoscaling using Swift, AWS SDKs, and custom metrics
- Cost optimization strategies including how to calculate actual savings, account for API costs, and identify leading indicators of compute demand
- Creative approaches to autoscaling signals beyond CPU and memory, including examples from financial services and e-commerce that could revolutionize workload management

Sponsor: This episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.io

More info: Find all the links and info for this episode here: https://ku.bz/sFd8TL1cS
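A minimal KEDA ScaledObject in this spirit might look like the sketch below, applied with the Kubernetes Python client. The CloudWatch namespace, metric, and workload names are hypothetical stand-ins, not Brian's actual setup; KEDA's aws-cloudwatch scaler polls the metric and drives replicas, including down to zero.

```python
from kubernetes import client, config

scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "dev-env-scaler", "namespace": "dev"},
    "spec": {
        "scaleTargetRef": {"name": "dev-environment"},
        "minReplicaCount": 0,    # scale to zero while the laptop lid is closed
        "maxReplicaCount": 3,
        "pollingInterval": 30,   # seconds between CloudWatch polls
        "cooldownPeriod": 120,   # wait before scaling back to zero
        "triggers": [{
            "type": "aws-cloudwatch",
            "metadata": {
                "namespace": "Custom/Laptop",    # hypothetical CW namespace
                "metricName": "LidOpen",         # 1 = open, 0 = closed
                "dimensionName": "Host",
                "dimensionValue": "my-macbook",
                "targetMetricValue": "1",
                "minMetricValue": "0",
                "awsRegion": "us-east-1",
            },
        }],
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="keda.sh", version="v1alpha1", namespace="dev",
    plural="scaledobjects", body=scaled_object)
```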
Data centers consume over 4% of global electricity, and this figure is projected to triple in the next few years due to AI workloads.

Dave Masselink, founder of Compute Gardener, discusses how he built a Kubernetes scheduler that makes scheduling decisions based on real-time carbon intensity data from power grids.

You will learn:
- How carbon-aware scheduling works - Using real-time grid data to shift workloads to periods when electricity generation has lower carbon intensity, without changing energy consumption (see the sketch below)
- Technical implementation details - Building custom Kubernetes schedulers using the scheduler plugin framework, including pre-filter and filter stages for carbon and time-of-use pricing optimization
- Energy measurement strategies - Approaches for tracking power consumption across CPUs, memory, and GPUs

Sponsor: This episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.io

More info: Find all the links and info for this episode here: https://ku.bz/zk2xM1lfW
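The decision logic at the core of carbon-aware scheduling fits in a few lines. This is illustrative only: the real plugin runs in Go inside the scheduler framework's PreFilter/Filter stages, and the API endpoint and threshold here are assumptions (services like WattTime or Electricity Maps expose similar data).

```python
import requests

CARBON_API = "https://grid.example.com/v1/intensity"  # hypothetical endpoint
THRESHOLD_G_CO2_PER_KWH = 300

def pre_filter(pod_is_deferrable: bool) -> bool:
    """Gate mirroring a PreFilter stage: defer flexible workloads while
    the local grid's carbon intensity (gCO2/kWh) is above the threshold."""
    intensity = requests.get(CARBON_API, timeout=5).json()["gco2_per_kwh"]
    if not pod_is_deferrable:
        return True  # latency-sensitive pods always pass the gate
    return intensity <= THRESHOLD_G_CO2_PER_KWH
```

The key point the episode makes is that this shifts *when* work runs without changing how much energy it uses: the same joules are drawn when the grid's generation mix is cleaner.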
Alessandro Pomponio from IBM Research explains how his team transformed their chaotic bare-metal clusters into a well-governed, self-service platform for AI and scientific workloads. He walks through their journey from manual cluster interventions to a fully automated GitOps-first architecture using ArgoCD, Kyverno, and Kueue to handle everything from policy enforcement to GPU scheduling.

You will learn:
- How to implement GitOps workflows that reduce administrative burden while maintaining governance and visibility across multi-tenant research environments
- Practical policy enforcement strategies using Kyverno to prevent GPU monopolization, block interactive pod usage, and automatically inject scheduling constraints (a policy sketch follows below)
- Fair resource sharing techniques with Kueue to manage scarce GPU resources across different hardware types while supporting both specific and flexible allocation requests
- Organizational change management approaches for gaining stakeholder buy-in, upskilling admin teams, and communicating policy changes to research users

Sponsor: This episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.io

More info: Find all the links and info for this episode here: https://ku.bz/5sK7BFZ-8
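A Kyverno ClusterPolicy in the spirit of this setup might look like the sketch below (the namespace pattern and enforcement choice are assumptions, not IBM Research's actual policies): pods in research namespaces must name a Kueue local queue, which keeps GPU work inside the quota system at admission time.

```python
from kubernetes import client, config

policy = {
    "apiVersion": "kyverno.io/v1",
    "kind": "ClusterPolicy",
    "metadata": {"name": "require-kueue-queue"},
    "spec": {
        "validationFailureAction": "Enforce",  # reject, don't just audit
        "rules": [{
            "name": "require-queue-label",
            "match": {"any": [{"resources": {
                "kinds": ["Pod"],
                "namespaces": ["research-*"],  # hypothetical tenant namespaces
            }}]},
            "validate": {
                "message": "Workloads must be submitted through a Kueue queue.",
                # "?*" means the label must exist and be non-empty.
                "pattern": {"metadata": {"labels": {
                    "kueue.x-k8s.io/queue-name": "?*",
                }}},
            },
        }],
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_cluster_custom_object(
    group="kyverno.io", version="v1", plural="clusterpolicies", body=policy)
```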
Mac Chaffee, a platform engineer and security champion, examines why developers often underestimate the complexity of running modern applications and how overconfidence leads to expensive technical mistakes.

You will learn:
- Why teams reject Kubernetes then rebuild it piece by piece - understanding the psychological factors, like overconfidence, that drive initial rejection of complex but proven tools
- How to identify the tipping point when DIY solutions become more complex than adopting established orchestration tools, especially around scaling and high availability challenges
- The right approach to abstracting Kubernetes complexity - why hiding the Kubernetes API often backfires and how to build effective guardrails instead of reinventing interfaces
- Why mentorship gaps lead to poor technical decisions - how the lack of proper apprenticeship programs in tech results in teams making expensive mistakes when building infrastructure

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/9nFPmG85f
Marc Campora, a systems consultant with experience in high-throughput platforms, shares his analysis of a real customer deployment with 500+ microservices. He breaks down the cost implications, technical constraints, and operational trade-offs between Kubernetes containers and AWS Lambda functions based on actual production data and migration assessments.

You will learn:
- Cost analysis frameworks for comparing Lambda vs Kubernetes across different traffic patterns, including specific examples of 3x savings potential and the 80/20 rule for service utilization (a back-of-the-envelope sketch follows below)
- Migration complexity factors when moving existing microservices to Lambda, including cold start issues, runtime model changes, and why it's often a complete rewrite rather than a simple port
- Decision criteria for choosing between platforms based on traffic consistency, computational requirements, and operational overhead tolerance

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/5gMTkzLhV
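The shape of the cost comparison is easy to reproduce on a napkin. Prices below are illustrative assumptions (roughly AWS us-east-1 list prices), not figures from the episode; the point is how the crossover flips with traffic consistency.

```python
LAMBDA_GB_SECOND = 0.0000166667   # $ per GB-second (assumed list price)
LAMBDA_PER_REQUEST = 0.20 / 1e6   # $ per request (assumed list price)
NODE_HOURLY = 0.192               # e.g., an m5.xlarge on-demand (assumed)

def lambda_monthly(req_per_sec: float, avg_ms: float, mem_gb: float) -> float:
    reqs = req_per_sec * 86400 * 30
    compute = reqs * (avg_ms / 1000) * mem_gb * LAMBDA_GB_SECOND
    return compute + reqs * LAMBDA_PER_REQUEST

def k8s_monthly(node_count: int) -> float:
    return node_count * NODE_HOURLY * 24 * 30

# A spiky, low-traffic service: pay-per-invocation wins by a wide margin.
print(f"2 rps  -> Lambda ${lambda_monthly(2, 120, 0.5):,.0f}/mo "
      f"vs Kubernetes ${k8s_monthly(2):,.0f}/mo")
# A steady, high-traffic service: always-on nodes win instead.
print(f"500 rps -> Lambda ${lambda_monthly(500, 120, 0.5):,.0f}/mo "
      f"vs Kubernetes ${k8s_monthly(6):,.0f}/mo")
```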
Molly Sheets, Director of Engineering for Kubernetes at Zynga, discusses her team's approach to platform engineering. She explains why their initial one-cluster-per-team model became unsustainable and how they're transitioning to multi-tenant architectures.

You will learn:
- Why slowing down deployments actually increases risk and how manual approval gates can make systems less resilient than faster, smaller deployments
- The operational reality of cluster proliferation - why managing hundreds of clusters becomes unsustainable and when multi-tenancy becomes necessary
- Practical multi-tenancy implementation strategies including resource quotas, priority classes, and namespace organization patterns that work in production (sketched below)
- Better metrics for multi-tenant environments - why control plane uptime doesn't matter and how to build meaningful SLOs for distributed platform health

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/Rmpl8948_
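The basic multi-tenancy building blocks mentioned above look like this with the Kubernetes Python client. Tenant name, limits, and priority value are hypothetical examples, not Zynga's configuration.

```python
from kubernetes import client, config

config.load_kube_config()

# Per-tenant quota: each team namespace gets a bounded slice of the
# shared cluster instead of its own cluster.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="team-a-quota", namespace="team-a"),
    spec=client.V1ResourceQuotaSpec(hard={
        "requests.cpu": "20",
        "requests.memory": "64Gi",
        "limits.cpu": "40",
        "limits.memory": "128Gi",
        "pods": "200",
    }),
)
client.CoreV1Api().create_namespaced_resource_quota("team-a", quota)

# A priority class so customer-facing workloads preempt batch jobs
# when the shared cluster comes under pressure.
critical = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="tenant-critical"),
    value=100000,
    description="Customer-facing workloads in the multi-tenant cluster",
)
client.SchedulingV1Api().create_priority_class(critical)
```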
A sophisticated GitLab CI/CD pipeline integrated with Argo CD was ultimately rejected in favour of simple FTP deployment, offering crucial insights into the real barriers facing cloud-native adoption in traditional organisations.

David Pech, Staff Cloud Ops Engineer at Wrike and holder of all CNCF certifications, shares his experience supporting a PHP team after a company merger. He details how he built a complete cloud-native platform with Kubernetes, Helm charts, and GitOps workflows, only to see it fail against cultural and organizational resistance despite its technical superiority.

You will learn:
- The hidden costs of sophisticated tooling - How GitOps pipelines with multiple moving parts can create trust issues when developers lose local control and must rely on remote processes they don't understand
- Cultural factors that trump technical benefits - Why customer expectations, existing Windows-based infrastructure, and team readiness matter more than the elegance of your Kubernetes solution
- Practical strategies for incremental adoption - The importance of starting small, building in-house operational expertise, and ensuring management advocacy at all levels before attempting cloud-native transformations

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/_MWX5m6G_
If you're tasked with performance testing Kubernetes workloads without much guidance, this episode offers clear, experience-based strategies that go beyond theory.

Stephan Schwarz, a DevOps engineer at iits-consulting, walks through his systematic approach to performance testing Kubernetes applications. He covers everything from defining what performance actually means, to the practical methodology of breaking individual pods to understand their limits, and navigating the complexities of Kubernetes-specific components that affect test results.

You will learn:
- How to establish baseline performance metrics by systematically testing individual pods, disabling autoscaling features, and documenting each incremental change to understand real application limits
- Why shared Kubernetes components skew results and how ingress controllers, service meshes, and monitoring stacks create testing challenges that require careful consideration of the entire request chain
- Practical approaches to HPA configuration, including how to account for scaling latency, the time delays inherent in Kubernetes scaling operations, and planning for spare capacity based on your SLA requirements (a headroom sketch follows below)
- The role of observability tools like OpenTelemetry in production environments where load testing isn't feasible, and how distributed tracing helps isolate performance bottlenecks across interdependent services

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/yY-FnmGfH
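The spare-capacity reasoning can be captured in one rough formula: autoscaling only helps after metrics are scraped, the HPA reacts, and new pods become ready, so you need enough headroom to absorb traffic growth during that window. The parameter values below are assumptions for illustration, not figures from the episode.

```python
def required_headroom(growth_per_min: float,
                      metrics_delay_s: float = 60,
                      hpa_sync_s: float = 15,
                      pod_ready_s: float = 90) -> float:
    """Fraction of extra capacity to keep warm.

    growth_per_min: worst-case traffic growth, e.g. 0.2 == +20%/minute.
    The delay terms approximate the metric scrape interval, the HPA
    sync period, and pod scheduling + startup + readiness time.
    """
    reaction_window_min = (metrics_delay_s + hpa_sync_s + pod_ready_s) / 60
    return growth_per_min * reaction_window_min

# Traffic that can spike 20% per minute against ~3 minutes of scaling
# latency needs roughly 55% spare capacity to stay inside the SLA.
print(f"{required_headroom(0.20):.0%}")
```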
Discover how to manage Kubernetes at scale with declarative infrastructure and automation principles.

Zain Malik shares his experience managing multi-tenant Kubernetes clusters with up to 30,000 pods across clusters capped at 950 nodes. He explains how his team transitioned from Terraform to Cluster API for declarative cluster lifecycle management, contributing upstream to improve AKS support while implementing GitOps workflows.

You will learn:
- How to address challenges in large-scale Kubernetes operations, including node pool management inconsistencies and lengthy provisioning times
- Why Cluster API provides a powerful foundation for multi-cloud cluster management, and how to extend it with custom operators for production-specific needs (a skeletal example follows below)
- How implementing GitOps principles eliminates manual intervention in critical operations like cluster upgrades
- Strategies for handling production incidents and bugs when adopting emerging technologies like Cluster API

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/5PLksqVlk
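For a sense of what "a cluster as a declarative object" means, here is a skeletal Cluster API resource. Names and CIDRs are hypothetical; the AKS-flavoured control-plane and infrastructure kinds come from the CAPZ provider, and the exact kinds/versions depend on the provider release in use.

```python
# Committed to Git and reconciled by a management cluster, an edit to
# this object (say, a version bump on the referenced control plane)
# becomes the upgrade workflow instead of a manual runbook.
cluster = {
    "apiVersion": "cluster.x-k8s.io/v1beta1",
    "kind": "Cluster",
    "metadata": {"name": "prod-eu-1", "namespace": "clusters"},
    "spec": {
        "clusterNetwork": {"pods": {"cidrBlocks": ["192.168.0.0/16"]}},
        "controlPlaneRef": {
            "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
            "kind": "AzureManagedControlPlane",   # CAPZ managed-AKS kind
            "name": "prod-eu-1",
        },
        "infrastructureRef": {
            "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
            "kind": "AzureManagedCluster",
            "name": "prod-eu-1",
        },
    },
}
```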
Dive into the technical challenges of scaling authorization in Kubernetes with this in-depth conversation about Open Policy Agent (OPA).

Nicholaos Mouzourakis, Staff Product Security Engineer at Gusto, explains how his team re-architected Kubernetes-native authorization using OPA to support scale, latency guarantees, and audit requirements across services. He shares detailed insights about their journey optimizing OPA performance through batch queries and solving unexpected interactions between Kubernetes resource limits and Go's runtime behavior.

You will learn:
- Why traditional authorization approaches (code-driven and data-driven) fall short in microservice architectures, and how OPA provides a more flexible, decoupled solution
- How batch authorization can improve performance by up to 18x by reducing network round-trips (sketched below)
- The unexpected interaction between Kubernetes CPU limits and Go's thread management (GOMAXPROCS) that can severely impact OPA performance
- Practical deployment strategies for OPA in production environments, including considerations for sidecars, daemon sets, and WASM modules

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/S-2vQ_j-4
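The batching win comes from collapsing N HTTP round-trips into one. The sketch below uses OPA's standard Data API; the batched variant assumes a custom Rego rule (here called authz.batch_allow) written to evaluate a list of inputs and return a list of decisions — the package and rule names are hypothetical, not Gusto's policy.

```python
import requests

OPA = "http://localhost:8181"  # OPA sidecar/daemonset address (assumed)

# Naive: one round-trip per authorization check.
def allowed(user: str, action: str, resource: str) -> bool:
    r = requests.post(f"{OPA}/v1/data/authz/allow",
                      json={"input": {"user": user, "action": action,
                                      "resource": resource}})
    return r.json().get("result", False)

# Batched: one round-trip for N checks against a list-aware rule.
def allowed_batch(checks: list[dict]) -> list[bool]:
    r = requests.post(f"{OPA}/v1/data/authz/batch_allow",
                      json={"input": {"checks": checks}})
    return r.json().get("result", [])

# Rendering a page with 50 protected elements: 50 round-trips vs. 1.
checks = [{"user": "alice", "action": "read", "resource": f"doc-{i}"}
          for i in range(50)]
decisions = allowed_batch(checks)
```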
Discover how Adevinta manages Kubernetes upgrades at scale in this episode with Tanat Lokejaroenlarb.

Tanat shares his team's journey from time-consuming blue-green deployments to efficient in-place upgrades for their multi-tenant Kubernetes platform SHIP, detailing the engineering decisions and operational challenges they overcame.

You will learn:
- How to transition from blue-green to in-place Kubernetes upgrades while maintaining service reliability
- Techniques for tracking and addressing API deprecations using tools like Pluto and Kube-no-trouble
- Strategies for minimizing SLO impact during node rebuilds through serialized approaches and proper PDB configuration (sketched below)
- Why a phased upgrade approach with "cluster waves" provides safer production deployments even with thorough testing

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/VVHFfXGl_
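A PodDisruptionBudget is the piece that makes serialized node rebuilds safe: the eviction API refuses to take out more than the budget allows, so drains proceed one replica at a time. A minimal sketch with the Kubernetes Python client (the app label and budget are hypothetical):

```python
from kubernetes import client, config

config.load_kube_config()

pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="api-pdb", namespace="prod"),
    spec=client.V1PodDisruptionBudgetSpec(
        # During a node drain, never evict more than one replica at once.
        max_unavailable=1,
        selector=client.V1LabelSelector(match_labels={"app": "api"}),
    ),
)
client.PolicyV1Api().create_namespaced_pod_disruption_budget("prod", pdb)
```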
Discover how to build resilient Kubernetes environments at scale with practical automation strategies from an engineer who's tackled complex production challenges.

Grzegorz Głąb, Kubernetes Engineer at Cloud Kitchens, shares his team's journey developing a comprehensive self-healing framework. He explains how they addressed issues ranging from spot node preemptions to network packet drops caused by unbalanced IRQs, providing concrete examples of automation that prevents downtime and improves reliability.

You will learn:
- How managed Kubernetes services like AKS provide benefits but require customization for specific use cases
- The architecture of an effective self-healing framework using DaemonSets and deployments with Kubernetes-native components
- Practical solutions for common challenges like StatefulSet pods stuck on unreachable nodes and cleaning up orphaned pods (a simplified sketch follows below)
- Techniques for workload-level automation, including throttling CPU-hungry pods and automating diagnostic data collection

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/yg_fkP0LN
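One self-healing task — unsticking StatefulSet pods whose node went away — reduces to a small loop. This is a simplified illustration of the pattern, not Cloud Kitchens' actual framework: a pod stuck terminating on a NotReady node will never be confirmed gone by its kubelet, so a force delete lets the controller reschedule it.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Collect the set of nodes currently reporting Ready.
ready_nodes = set()
for node in v1.list_node().items:
    for cond in node.status.conditions or []:
        if cond.type == "Ready" and cond.status == "True":
            ready_nodes.add(node.metadata.name)

# Force-delete pods stuck terminating on nodes that are gone or NotReady.
for pod in v1.list_pod_for_all_namespaces().items:
    node = pod.spec.node_name
    if node and node not in ready_nodes and pod.metadata.deletion_timestamp:
        v1.delete_namespaced_pod(
            pod.metadata.name, pod.metadata.namespace,
            grace_period_seconds=0)  # skip the kubelet confirmation that will never come
        print(f"Force-deleted {pod.metadata.namespace}/{pod.metadata.name}")
```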
Discover why standard Kubernetes StatefulSets might not be sufficient for your database workloads and how custom operators can provide better solutions for stateful applications.

Andrew Charlton, Staff Software Engineer at Timescale, explains how they replaced Kubernetes StatefulSets with a custom operator called Popper for their PostgreSQL Cloud Platform. He details the technical limitations they encountered with StatefulSets and how their custom approach provides more intelligent management of database clusters.

You will learn:
- Why StatefulSets fall short for managing high-availability PostgreSQL clusters, particularly around pod ordering and volume management
- How Timescale's instance matching approach solves complex reconciliation challenges when managing heterogeneous database workloads
- The benefits of implementing discrete, idempotent actions rather than workflows in Kubernetes operators (a toy sketch follows below)
- Real-world examples of operations that became possible with their custom operator, including volume downsizing and availability zone consolidation

Sponsor: This episode is brought to you by mirrord — run local code like in your Kubernetes cluster without deploying first.

More info: Find all the links and info for this episode here: https://ku.bz/fhZ_pNXM3
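A toy sketch of the "discrete, idempotent actions" pattern (not Timescale's actual Popper operator): each action observes state and performs at most one convergent step, so the reconcile loop can be re-run safely after any crash, retry, or requeue — no workflow state to resume.

```python
def scale_cluster(replicas: int) -> None:
    print(f"scaling to {replicas} replicas")  # stub side effect

def resize_volume(gb: int) -> None:
    print(f"resizing volume to {gb}Gi")       # stub side effect

def ensure_replicas(observed: dict, desired: dict) -> bool:
    if observed["replicas"] == desired["replicas"]:
        return False                    # already converged: no-op
    scale_cluster(desired["replicas"])  # exactly one discrete step
    return True

def ensure_volume(observed: dict, desired: dict) -> bool:
    if observed["volume_gb"] >= desired["volume_gb"]:
        return False
    resize_volume(desired["volume_gb"])
    return True

def reconcile(observed: dict, desired: dict) -> None:
    # No state carried between runs: whichever action still has work
    # performs one step, then the controller re-observes and requeues.
    for action in (ensure_replicas, ensure_volume):
        if action(observed, desired):
            return

reconcile({"replicas": 2, "volume_gb": 100},
          {"replicas": 3, "volume_gb": 100})
```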
Curious about running AI models on Kubernetes without breaking the bank? This episode delivers practical insights from someone who's done it successfully at scale.

John McBride, VP of Infrastructure and AI Engineering at the Linux Foundation, shares how his team at OpenSauced built StarSearch, an AI feature that uses natural language processing to analyze GitHub contributions and provide insights through semantic queries. By using open-source models instead of commercial APIs, the team saved tens of thousands of dollars.

You will learn:
- How to deploy vLLM on Kubernetes to serve open-source LLMs like Mistral and Llama, including configuration challenges with GPU drivers and daemon sets (a client-side sketch follows below)
- Why smaller models (7-14B parameters) can achieve 95% effectiveness for many tasks compared to larger commercial models, with proper prompt engineering
- How running inference workloads on your own infrastructure with T4 GPUs can reduce costs from tens of thousands to just a couple thousand dollars monthly
- Practical approaches to monitoring GPU workloads in production, including handling unpredictable failures and VRAM consumption issues

Sponsor: This episode is brought to you by StackGen! Don't let infrastructure block your teams. StackGen deterministically generates secure cloud infrastructure from any input - existing cloud environments, IaC or application code.

More info: Find all the links and info for this episode here: https://ku.bz/wP6bTlrFs
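Once vLLM is running in-cluster (typically a Deployment that requests nvidia.com/gpu: 1 and starts vLLM's OpenAI-compatible server), any workload can call it like a hosted API. The service hostname below is a hypothetical in-cluster address and the prompt is illustrative, not OpenSauced's actual StarSearch query.

```python
import requests

# Assumed in-cluster service for the vLLM deployment.
VLLM_URL = "http://vllm.ai-infra.svc.cluster.local:8000/v1/chat/completions"

resp = requests.post(VLLM_URL, json={
    "model": "mistralai/Mistral-7B-Instruct-v0.2",  # the weights vLLM was started with
    "messages": [
        {"role": "system",
         "content": "Summarize a contributor's recent open-source work."},
        {"role": "user",
         "content": "What has user octocat worked on lately?"},
    ],
    "max_tokens": 256,
    "temperature": 0.2,
})
print(resp.json()["choices"][0]["message"]["content"])
```

Because the endpoint speaks the OpenAI wire format, swapping a commercial API for self-hosted open-source weights is mostly a URL change — which is where the cost savings discussed in the episode come from.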