KubeFM

Author: KubeFM


Description

Discover all the great things happening in the world of Kubernetes, learn (controversial) opinions from the experts and explore the successes (and failures) of running Kubernetes at scale.
92 Episodes
In this closing episode, Bryan Ross (Field CTO at GitLab), Jane Yan (Principal Program Manager at Microsoft), Sean O’Meara (CTO at Mirantis) and William Rizzo (Strategy Lead, CTO Office at Mirantis) discuss how GitOps evolves in practice.

- How enterprises are embedding Flux into developer platforms and managed cloud services.
- Why bridging CI/CD and infrastructure remains a core challenge—and how GitOps addresses it.
- What leading platform teams (GitLab, Microsoft, Mirantis) see as the next frontier for GitOps.

Sponsor: Join the Flux maintainers and community at FluxCon, November 11th in Atlanta—register here.

More info: Find all the links and info for this episode here: https://ku.bz/tVqKwNYQH
Interested in sponsoring an episode? Learn more.
In this episode, Philippe Ensarguet, VP of Software Engineering at Orange, and Arnab Chatterjee, Global Head of Container & AI Platforms at Nomura, share how large enterprises are adopting Flux to drive reliable, compliant, and scalable platforms.

- How Orange uses Flux to manage bare-metal Kubernetes through its SYLVR project.
- Why FSIs rely on GitOps to balance agility with governance.
- How Flux helps enterprises achieve resilience, compliance, and repeatability at scale.

Sponsor: Join the Flux maintainers and community at FluxCon, November 11th in Atlanta—register here.

More info: Find all the links and info for this episode here: https://ku.bz/tWcHlJm7M
Interested in sponsoring an episode? Learn more.
In this episode, Michael Bridgen (the engineer who wrote Flux's first lines) and Stefan Prodan (the maintainer who led the V2 rewrite) share how Flux grew from a fragile hack-day script into a production-grade GitOps toolkit.

- How early Flux addressed the risks of manual, unsafe Kubernetes upgrades
- Why the complete V2 rewrite was critical for stability, scalability, and adoption
- What the maintainers learned about building a sustainable, community-driven open-source project

Sponsor: Join the Flux maintainers and community at FluxCon, November 11th in Atlanta—register here.

More info: Find all the links and info for this episode here: https://ku.bz/bgkgn227-
Interested in sponsoring an episode? Learn more.
This episode unpacks the technical and governance milestones that secured Flux's place in the cloud-native ecosystem, from a 45-minute production outage that led to the birth of GitOps to the CNCF process that defines project maturity and the handover of stewardship after Weaveworks' closure.

You will learn:

- How a single incident pushed Weaveworks to adopt Git as the source of truth, creating the foundation of GitOps.
- How Flux sustained continuity after Weaveworks shut down through community governance.
- Where Flux is heading next with security guidance, Flux v2, and an enterprise-ready roadmap.

Sponsor: Join the Flux maintainers and community at FluxCon, November 11th in Atlanta—register here.

More info: Find all the links and info for this episode here: https://ku.bz/5Sf5wpd8y
Interested in sponsoring an episode? Learn more.
Build failures in Kubernetes CI/CD pipelines are a silent productivity killer. Developers spend 45+ minutes scrolling through cryptic logs, often just hitting rerun and hoping for the best.

Ron Matsliah, DevOps engineer at Next Insurance, built an AI-powered assistant that cut build debugging time by 75% — not as a dashboard, but delivered directly in Slack where developers already work.

In this episode:

- Why combining deterministic rules with AI produces better results than letting an LLM guess alone
- How correlating Kubernetes events with build logs catches spot instance terminations that produce misleading errors
- Why integrating into existing workflows and building feedback loops from day one drove adoption
- The prompt engineering lessons learned from testing with real production data instead of synthetic examples

The takeaway: simple rules plus rich context consistently outperform complex AI queries on their own.

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/PDdYfC00w
Interested in sponsoring an episode? Learn more.
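The rules-first, AI-fallback pattern the episode describes can be sketched in a few lines. This is an illustrative outline, not Next Insurance's actual implementation: the failure signatures, diagnoses, and the `ask_llm` callback are all hypothetical.

```python
import re

# Deterministic rules run first: known failure signatures map straight to a
# diagnosis, so the LLM is only consulted for logs no rule recognizes.
RULES = [
    (re.compile(r"SIGTERM|instance .* terminated", re.I),
     "Spot instance was reclaimed mid-build; rerun on on-demand capacity."),
    (re.compile(r"OOMKilled|exit code 137"),
     "Build container ran out of memory; raise the memory limit."),
    (re.compile(r"connection (refused|timed out)", re.I),
     "Transient network failure; check dependency availability before rerunning."),
]

def triage(log: str, ask_llm=None) -> str:
    """Classify a failed build log: rules first, AI fallback second."""
    for pattern, diagnosis in RULES:
        if pattern.search(log):
            return diagnosis
    # No deterministic match: hand the log (plus gathered context) to the model.
    if ask_llm is not None:
        return ask_llm(log)
    return "Unrecognized failure; inspect the full log."
```

A log containing `exit code 137` is diagnosed without any model call, which is the point: the model only sees the cases the rules cannot settle.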
Managed Kubernetes on a major cloud provider can cost hundreds or even thousands of dollars a month — and much of that spending hides behind defaults, minimum resource ratios, and auxiliary services you didn't ask for.

Fernando Duran, founder of SadServers, shares how his GKE Autopilot proof of concept ran close to $1,000/month on a fraction of the CPU of the actual workload and how he cut that to roughly $30/month by moving to Hetzner with Edka as a managed control plane.

In this interview:

- Why Kubernetes hasn't delivered on its original promise of cost savings through bin packing — and what it actually provides instead
- A real cost comparison: $1,000/month on GKE vs. $30/month on Hetzner with Edka for the same nominal capacity
- What you need to bring with you (observability, logging, dashboards) when leaving a fully managed cloud provider

The decision comes down to how tightly coupled you are to cloud-specific services and whether your team can spare the cycles to manage the gaps.

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/6nSDbz9m4
Interested in sponsoring an episode? Learn more.
Running multiple Kubernetes clusters on AWS with the cluster autoscaler? Every four months, you face the same grind: upgrading Kubernetes versions, recreating auto scaling groups, and hoping instance type changes stick.

Adhi Sutandi, DevOps Engineer at Beekeeper by LumApps, shares how his team migrated from the cluster autoscaler to Karpenter across eight EKS clusters — and the hard lessons they learned along the way.

In this episode:

- Why AWS auto scaling groups are immutable and how that creates upgrade bottlenecks at scale
- How the latest AMI tag accidentally turned less critical clusters into chaos engineering environments, dropping SLOs before anyone realized Karpenter was the cause
- Why pre-stop sleep hooks solved pod restartability problems that Quarkus's built-in graceful shutdown couldn't
- The case for pod disruption budgets over Karpenter annotations when protecting critical workloads during node rotations
- How Karpenter's implicit 10% disruption budget caught the team off guard — and the explicit configuration that fixed it

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/XyVfsSQPr
Interested in sponsoring an episode? Learn more.
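The pod disruption budget recommendation above comes down to simple arithmetic that the Kubernetes eviction API enforces: a voluntary eviction (such as Karpenter draining a node) is allowed only while enough healthy replicas remain. A minimal sketch of that check, with illustrative replica counts:

```python
def allowed_disruptions(healthy: int, min_available: int) -> int:
    """How many voluntary evictions a PDB with `minAvailable` permits right now."""
    return max(0, healthy - min_available)

# A 5-replica critical workload with minAvailable: 4 lets node rotation
# evict at most one pod at a time; at 4 healthy replicas, evictions block.
print(allowed_disruptions(healthy=5, min_available=4))
```

This is why a PDB protects a workload regardless of which controller does the draining, whereas a Karpenter-specific annotation only influences Karpenter.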
Migrating from ECS to Kubernetes sounds straightforward — until you hit spot capacity failures, firewall rules silently dropping traffic, and memory metrics that lie to your autoscaler.

Radosław Miernik, Head of Engineering at aleno, walks through a real production migration: what broke, what they missed, and the fixes that made it work.

In this interview:

- Running Flux and Argo CD together — Flux for the infra team, Argo CD's UI for developers who don't want to touch YAML
- How the wrong memory metric caused OOM errors, and why switching to jemalloc cut memory usage by 20%
- Splitting WebSocket and API containers into separate deployments with independent autoscaling

Four months of migration, over 100 configuration changes in the first month, and a concrete breakdown of what platform work looks like when you can't afford downtime.

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/x6wFMhVsx
Interested in sponsoring an episode? Learn more.
Kubernetes nodes on EKS can take over a minute to become ready, and pods often wait even longer — but most teams never look into why.

Jan Ludvik, Senior Staff Reliability Engineer at Outreach, shares how he cut node startup from 65 to 45 seconds and reduced P90 pod startup by 30 seconds across ~1,000 nodes — by tackling overlooked defaults and EBS bottlenecks.

In this episode:

- Why Kubelet's serial image pull default quietly blocks pod startup, and how parallel pulls fix it
- How EBS lazy loading can silently negate image caching in AMIs — and the critical path workaround
- A Lambda-based automation that temporarily boosts EBS throughput during startup, then reverts to save cost
- The kubelet metrics and logs that expose pod and node startup latency most teams never monitor

Every second saved translates to faster scaling, lower AWS bills, and better end-user experience.

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/B7TzKXyxf
Interested in sponsoring an episode? Learn more.
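To see why the serial-pull default matters, consider a fresh node that must fetch several images before its first pods can start: pulled serially (the kubelet's `serializeImagePulls: true` default), the delays add up; pulled in parallel, the slowest image dominates. A toy model with made-up pull times, not measurements from the episode:

```python
# Hypothetical pull times (seconds) for the images a new node needs.
pull_times = {"app": 12.0, "sidecar-proxy": 8.0, "log-agent": 5.0}

# Default kubelet behavior: image pulls queue behind one another.
serial_latency = sum(pull_times.values())

# With serializeImagePulls: false, pulls overlap; the slowest one gates startup.
parallel_latency = max(pull_times.values())

print(f"serial: {serial_latency}s, parallel: {parallel_latency}s")
```

The gap widens with every extra image on the node's critical path, which is why this single kubelet flag shows up in pod startup P90s.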
You self-host services at home, but upgrades break things, rollbacks require SSH-ing in to kill containers manually, and there's no safety net if your hardware fails.

Thibault Martin, Director of Program Development at the Matrix Foundation, walked this exact path — from Docker Compose to Podman with Ansible to Kubernetes on a single server — and explains why each transition happened and what it solved.

In this interview:

- Why Ansible's declarative promise fell short with the Podman collection, forcing sequential imperative steps instead of desired-state definitions
- How community Helm charts replace the need to write and maintain every manifest yourself
- Why GitOps isn't just a deployment workflow — it's a disaster recovery strategy when your infrastructure lives in your living room
- How k3s removes the barrier to entry by bundling opinionated defaults so you can skip choosing CNI plugins and storage providers

Kubernetes doesn't have to be enterprise-scale — with the right distribution and community tooling, it can be a practical, low-overhead choice for anyone who cares about their data.

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/Xk5S7VqXz
Interested in sponsoring an episode? Learn more.
Your database backup strategy shouldn't be the thing that takes your production systems down.

Ziv Yatzik manages 600+ Postgres clusters in a closed network environment with no public cloud. After existing backup solutions proved unreliable — causing downtime when disks filled up — his team built a new architecture using pgBackRest, Argo CD, and Kubernetes CronJobs.

In this episode:

- Why storing WAL files on shared NAS storage prevents backup failures from cascading into database outages
- How GitOps with Argo CD lets them manage backups for hundreds of clusters by adding a single YAML file
- The Ansible + Kubernetes hybrid approach that keeps VM-based Patroni clusters in sync with Kubernetes-orchestrated backups

A practical blueprint for making database backups boring, reliable, and safe.

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/Rg_sQYSmw
Interested in sponsoring an episode? Learn more.
Most developers assume Kubernetes requires an enterprise budget. Varnit Goyal proves otherwise — he built a full three-node Kubernetes cluster for $2.16/month using Rackspace Spot Instances.

The trick: pick non-default instance types, distribute nodes across low-demand regions, and let Kubernetes handle rescheduling when nodes get preempted. For service exposure, he replaced the $10/month load balancer with Tailscale Funnel — free.

In this episode:

- How Spot Instance bidding works and which strategies keep costs and preemption low
- Using Tailscale Kubernetes operator as a free alternative to traditional load balancers
- Running real development dependencies (Kafka, Elasticsearch, Postgres) on a budget cluster

A practical walkthrough of what Kubernetes actually needs to function — and what you can strip away.

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/HpVyQMVv0
Interested in sponsoring an episode? Learn more.
Dilshan Wijesooriya, Senior Cloud Engineer, discusses a real incident where migrating EKS nodes to AL2023 caused the cluster autoscaler to lose AWS permissions silently.

You will learn:

- Why AL2023 blocks pod access to instance metadata by default, breaking components that relied on node IAM roles (like cluster autoscaler, external-DNS, and AWS Load Balancer Controller)
- How to implement IRSA correctly by configuring IAM roles, Kubernetes service accounts, and OIDC trust relationships, and why both AWS IAM and Kubernetes RBAC must be configured independently
- The recommended migration strategy: move critical system components to IRSA before changing AMIs, test aggressively in non-production, and decouple identity changes from OS upgrades
- How to audit which pods currently rely on node roles and clean up legacy IAM permissions to reduce attack surface after migration

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/T_YPfTfDb
Interested in sponsoring an episode? Learn more.
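The IRSA wiring mentioned above hinges on an IAM trust policy whose condition pins a role to exactly one Kubernetes service account via the cluster's OIDC provider. A minimal sketch of the check AWS performs on the token's `sub` claim; the issuer URL, namespace, and service account name here are made-up examples:

```python
# Illustrative OIDC issuer for an EKS cluster (not a real cluster ID).
OIDC_ISSUER = "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE1234"

# StringEquals condition from a typical IRSA trust policy: only tokens whose
# `sub` claim names this exact service account may assume the role.
trust_condition = {
    f"{OIDC_ISSUER}:sub": "system:serviceaccount:kube-system:cluster-autoscaler",
    f"{OIDC_ISSUER}:aud": "sts.amazonaws.com",
}

def may_assume(namespace: str, service_account: str) -> bool:
    """Mimic the trust-policy comparison against a pod token's sub claim."""
    sub = f"system:serviceaccount:{namespace}:{service_account}"
    return trust_condition[f"{OIDC_ISSUER}:sub"] == sub
```

This is also why the AWS side alone is not enough: the pod still needs the annotated service account (and its Kubernetes RBAC) on the cluster side, configured independently.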
Fabián Sellés Rosa, Tech Lead of the Runtime team at Adevinta, walks through a real engineering investigation that started with a simple request: allowing tenants to use third-party Kafka services. What seemed straightforward turned into a complex DNS resolution problem that required testing seven different approaches before a working solution was found.

You will learn:

- Why Kafka's multi-step DNS resolution creates unique challenges in multi-tenant environments, where bootstrap servers and dynamic broker lists complicate standard DNS approaches
- The iterative debugging process from Route 53 split DNS through Kubernetes native pod DNS config, custom DNS servers, Kafka proxies, and CoreDNS solutions
- How to implement the final solution using node-local DNS and CoreDNS templating with practical details including ndots configuration and Kyverno automation
- Platform engineering evaluation criteria for assessing solutions based on maintainability, self-service capability, and evolvability in multi-tenant environments

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/NsBZ-FwcJ
Interested in sponsoring an episode? Learn more.
Amos Wenger walks through his production incident where adding a home computer as a Kubernetes node caused TLS certificate renewals to fail. The discussion covers debugging techniques using tools like netshoot and K9s, and explores the unexpected interactions between Kubernetes overlay networks and consumer routers.

You will learn:

- How Kubernetes networking assumptions break when mixing cloud VMs with nodes behind consumer routers, and why cert-manager challenges fail in NAT environments
- The differences between CNI plugins like Flannel and Calico, particularly how they handle IPv6 translation
- Debugging techniques for network issues using tools like netshoot, K9s, and iproute2
- Best practices for mixed infrastructure including proper node labeling, taints, and scheduling controls

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/6Ll_7slr9
Interested in sponsoring an episode? Learn more.
Tanat Lokejaroenlarb shares the complete journey of replacing EKS Managed Node Groups and Cluster Autoscaler with AWS Karpenter. He explains how this migration transformed their Kubernetes operations, from eliminating brittle upgrade processes to achieving significant cost savings of €30,000 per month through automated instance selection and AMD adoption.

You will learn:

- How to decouple control plane and data plane upgrades using Karpenter's asynchronous node rollout capabilities
- Cost optimization strategies including flexible instance selection, automated AMD migration, and the trade-offs between cheapest-first selection versus performance considerations
- Scaling and performance tuning techniques such as implementing over-provisioning with low-priority placeholder pods
- Policy automation and operational practices using Kyverno for user experience simplification, implementing proper Pod Disruption Budgets

Sponsor: This episode is sponsored by StormForge by CloudBolt — automatically rightsize your Kubernetes workloads with ML-powered optimization

More info: Find all the links and info for this episode here: https://ku.bz/T6hDSWYhb
Interested in sponsoring an episode? Learn more.
Festus Owumi walks through his project of building a lightweight version of Kubernetes in Go. He removed etcd (replacing it with in-memory storage), skipped containers entirely, dropped authentication, and focused purely on the control plane mechanics. Through this process, he demonstrates how the reconciliation loop, API server concurrency handling, and scheduling logic actually work at their most basic level.

You will learn:

- How the reconciliation loop works: the core concept of desired state vs current state that drives all Kubernetes operations
- Why the API server is the gateway to etcd: how Kubernetes prevents race conditions using optimistic concurrency control and why centralized validation matters
- What the scheduler actually does: beyond simple round-robin assignment, understanding node affinity, resource requirements, and the complex scoring algorithms that determine pod placement
- The complete pod lifecycle: step-by-step walkthrough from kubectl command to running pod, showing how independent components work together like an orchestra

Sponsor: This episode is sponsored by StormForge by CloudBolt — automatically rightsize your Kubernetes workloads with ML-powered optimization

More info: Find all the links and info for this episode here: https://ku.bz/pf5kK9lQF
Interested in sponsoring an episode? Learn more.
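The two mechanics highlighted above — the reconciliation loop and the API server's optimistic concurrency control — can be condensed into a toy model. This is an illustrative Python sketch in the spirit of the project, not Festus's code (his implementation is in Go), and the names are made up:

```python
# Toy control plane: in-memory store with resourceVersion-based optimistic
# concurrency, plus one pass of a desired-vs-current reconcile loop.
class Conflict(Exception):
    pass

store = {}  # name -> {"spec": ..., "resourceVersion": int}

def update(name, spec, expected_version):
    """Reject any write based on a stale read, as the API server does."""
    current = store.get(name, {"resourceVersion": 0})
    if current["resourceVersion"] != expected_version:
        raise Conflict(f"{name}: stale resourceVersion {expected_version}")
    store[name] = {"spec": spec, "resourceVersion": expected_version + 1}

def reconcile(desired, current):
    """Compute the actions that close the gap between the two states."""
    actions = []
    for name in desired.keys() - current.keys():
        actions.append(("create", name))
    for name in current.keys() - desired.keys():
        actions.append(("delete", name))
    return sorted(actions)
```

Running the loop repeatedly against fresh observations — rather than executing a one-shot plan — is what makes the real system self-healing: any drift shows up as a new gap on the next pass.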
Understanding what's actually happening inside a complex Kubernetes system is one of the biggest challenges architects face.

Oleksii Kolodiazhnyi, Senior Architect at Mirantis, shares his structured approach to Kubernetes workload assessment. He breaks down how to move from high-level business understanding to detailed technical analysis, using visualization tools and systematic documentation.

You will learn:

- A top-down assessment methodology that starts with business cases and use cases before diving into technical details
- Practical visualization techniques using tools like KubeView, K9s, and Helm dashboard to quickly understand resource interactions
- Systematic resource discovery approaches for different scenarios, from well-documented Helm-based deployments to legacy applications with hard-coded configurations buried in containers
- Documentation strategies for creating consumable artifacts that serve different audiences, from business stakeholders to new team members joining the project

Sponsor: This episode is sponsored by StormForge by CloudBolt — automatically rightsize your Kubernetes workloads with ML-powered optimization

More info: Find all the links and info for this episode here: https://ku.bz/zDThxGQsP
Interested in sponsoring an episode? Learn more.
Andrew Jeffree from SafetyCulture walks through their complete migration of 250+ microservices from a fragile Helm-based setup to GitOps with ArgoCD, all without any downtime. He explains how they replaced YAML configurations with a domain-specific language built in CUE, creating a better developer experience while adding stronger validation and reducing operational pain points.

You will learn:

- Zero-downtime migration techniques using temporary deployments with prune-last sync options to ensure healthy services before removing legacy ones
- How CUE lang improves on YAML by providing schema validation, early error detection, and a cleaner interface for developers
- Human-centric platform engineering approaches that prioritize developer experience and reduce on-call burden through empathy-driven design decisions

Sponsor: This episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.io

More info: Find all the links and info for this episode here: https://ku.bz/Xvyp1_Qcv
Interested in sponsoring an episode? Learn more.
Mai Nishitani, Director of Enterprise Architecture at NTT Data and AWS Community Builder, demonstrates how Model Context Protocol (MCP) enables Claude to directly interact with Kubernetes clusters through natural language commands.

You will learn:

- How MCP servers work and why they're significant for standardizing AI integration with DevOps tools, moving beyond custom integrations to a universal protocol
- The practical capabilities and critical limitations of AI in Kubernetes operations
- Why fundamental troubleshooting skills matter more than ever as AI abstractions can fail in unexpected ways, especially during crisis scenarios and complex system failures
- How DevOps roles are evolving from manual administration toward strategic architecture and orchestration

Sponsor: This episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.io

More info: Find all the links and info for this episode here: https://ku.bz/3hWvQjXxp
Interested in sponsoring an episode? Learn more.