Platform Engineering Playbook Podcast

129 Episodes


Why Forward-Deployed Engineers Are Making $300K+ (And Why Companies Are Desperate for Them)

2026-01-31 · 11:41

Why are forward-deployed engineers making 40% more than traditional backend developers, and why can't companies hire enough of them? In today's Platform Engineering Playbook, we dive deep into tech's hottest new role and explore three critical platform engineering developments reshaping the industry. **What You'll Learn:** • The explosive rise of forward-deployed engineers and why they're commanding premium salaries • Real-world case studies from Snowflake and financial services implementations • Three essential skill areas every successful FDE needs to master • How Artera is revolutionizing prostate cancer diagnostics with AWS architecture • Cloudflare's innovative approach to vertical microfrontends • Advanced PostgreSQL debugging techniques with Datadog's EXPLAIN ANALYZE **Episode Chapters:** 0:00 Cold Open - The 40% salary premium mystery 1:30 Introduction & Today's Focus 3:00 Deep Dive Act 1 - Forward-Deployed Engineers Explained 8:45 Deep Dive Act 2 - Real-World Analysis & Case Studies Whether you're a platform engineer looking to advance your career or an engineering leader trying to understand this emerging role, this episode provides actionable insights backed by real industry data and case studies. **Sources & References:** - Why the forward-deployed engineer is tech's hottest job: https://thenewstack.io/why-the-forward-deployed-engineer-is-techs-hottest-job/ - How Artera enhances prostate cancer diagnostics using AWS: https://aws.amazon.com/blogs/architecture/how-artera-enhances-prostate-cancer-diagnostics-using-aws/ - Building vertical microfrontends on Cloudflare's platform: https://blog.cloudflare.com/vertical-microfrontends/ - Debug PostgreSQL query latency faster with EXPLAIN ANALYZE in Datadog Database Monitoring: https://www.datadoghq.com/blog/database-monitoring-explain-analyze/ #PlatformEngineering #DevOps #CloudNative #Kubernetes

AWS DevOps Agent in Production: What Most Teams Get Wrong

2026-01-30 · 16:55

**Why do 73% of AWS DevOps Agent deployments crash and burn in their first week?** It's not what you think. In this episode of Platform Engineering Playbook, we uncover the hidden culprits behind these shocking failure rates and reveal the systematic approach that separates successful platform teams from the rest. **What You'll Learn:** • The real reasons AWS DevOps Agent deployments fail (hint: it's not the code) • How to transform your incident response from "crowded stadium chaos" to "conference room clarity" • A practical framework for optimizing on-call rotations and team structure • Production-ready deployment strategies that actually work Whether you're struggling with agent deployments, drowning in incident noise, or trying to scale your platform team's effectiveness, this episode delivers actionable strategies you can implement immediately. **Sources & References:** • Best Practices for Deploying AWS DevOps Agent in Production: https://aws.amazon.com/blogs/devops/best-practices-for-deploying-aws-devops-agent-in-production/ #PlatformEngineering #DevOps #CloudNative #Kubernetes

AI Agents Are Rewriting the SRE Playbook (For Better or Worse)

2026-01-29 · 15:50

What if AI agents could flip the script on SRE work, turning 87% of firefighting into 87% prevention? That's exactly what's happening in the "agentic revolution" transforming platform engineering teams. In today's Platform Engineering Playbook, we dive deep into how AI agents are reshaping SRE workflows and what this means for your platform strategy. We'll cut through the hype to examine the real-world gap between vision and current reality, then identify which SRE tasks are actually ready for agent automation. **What You'll Learn:** • The three characteristics that make SRE tasks perfect candidates for AI automation • Why most "agentic SRE" implementations fail and how to avoid common pitfalls • Practical strategies for identifying repetitive, well-documented processes in your platform • Real data on how leading teams are shifting from reactive firefighting to proactive prevention **Timestamps:** 0:00 Cold Open - The 87% Problem 2:15 Intro & Today's Focus 5:30 Act 1: The Agentic Revolution Setup 12:45 Act 2: Reality vs. Vision Analysis 21:20 Act 3: Practical Implementation Takeaways 28:10 Outro & Key Actions Whether you're an SRE looking to level up your automation game or a platform engineer evaluating AI tools, this episode delivers actionable insights you can implement immediately. **Sources & References:** - The agentic revolution: A new vision for SREs: https://thenewstack.io/the-agentic-revolution-a-new-vision-for-sres/ #PlatformEngineering #DevOps #CloudNative #Kubernetes

DevOps Is Dead — Platform Engineering Replaced It

2026-01-28 · 19:14

**DevOps is dead - and the companies that created it are the ones pulling the trigger.** But what's replacing it might be the most significant shift in software delivery since containerization. In today's Platform Engineering Playbook, we dive deep into how Internal Developer Platforms are fundamentally reshaping the DevOps landscape. We'll explore why platform engineering has shed its experimental status and become the new standard for scaling development teams. **What You'll Learn:** • The five critical red flags that signal your platform needs immediate attention • Why the "black box problem" is derailing developer productivity • How to navigate the ingress-nginx archival and transition to Cilium • OpenAI's surprising pivot toward scientific research partnerships • Real-world AWS regional outage insights from HCP Vault testing **Chapters:** 5:58: The Black Box problem 10:00: Is platform engineering right for your org? 13:55: Where is platform engineering headed? 15:50: Platform engineering daily news Perfect for platform engineers, DevOps professionals, and engineering leaders navigating the shift from traditional ops to modern platform thinking. **Sources & References:** • OpenAI Scientific Research Partnership: https://www.axios.com/2026/01/26/openai-scientific-research-partner • Ingress-nginx to Cilium Migration: https://www.cncf.io/blog/2026/01/27/navigating-the-ingress-nginx-archival-why-now-is-the-time-to-move-to-cilium/ • HCP Vault AWS Resilience: https://www.hashicorp.com/blog/how-resilient-is-hcp-vault-during-real-aws-regional-outages • Platform Engineering Rise: https://feeds.dzone.com/link/23568/17264430/rise-of-platform-engineering-how-internal-dev-platforms • Cedar CNCF Sandbox: https://www.infoq.com/news/2026/01/cedar-joins-cncf-sandbox/ #PlatformEngineering #DevOps #CloudNative #Kubernetes

Two Missing Characters Nearly Compromised AWS’s Supply Chain

2026-01-26 · 15:12

**What if two missing characters could compromise every AWS-managed GitHub repository?** That's exactly what happened in a critical regex vulnerability that exposed massive supply-chain risks. In today's Platform Engineering Playbook, we break down this shocking security flaw and explore how platform engineers can protect their infrastructure from similar attacks. You'll discover the technical details behind the vulnerability, learn essential webhook security practices, and understand why regex validation is more critical than ever. **What You'll Learn:** ✅ How a simple regex pattern flaw created enterprise-wide security risks ✅ Webhook signature verification best practices ✅ AI-powered Linux security analysis from 20 years of bug data ✅ Cloud 2.0 predictions and what they mean for platform teams ✅ Latest developments in AI-assisted development tools **Episode Chapters:** 0:00 Cold Open - The Two Character Catastrophe 2:15 Today's Platform Engineering News 8:30 Deep Dive: AWS GitHub Vulnerability Analysis 15:45 Security Takeaways for Platform Teams 22:10 Linux Security AI Tool Breakdown 26:30 Cloud 2.0 Discussion Perfect for platform engineers, DevOps professionals, and security-conscious developers who need to stay ahead of emerging threats while building resilient infrastructure. **Sources & References:** - AWS GitHub Vulnerability Analysis: https://www.infoq.com/news/2026/01/aws-github-vulnerability/?utm_campaign=infoq_content&utm_source=infoq&utm_medium=feed&utm_term=global - Linux Security AI Tool: https://thenewstack.io/hacker-jenny-qus-ai-tool-analyzes-20-years-of-linux-bugs/ - Cloud 2.0 Predictions: https://www.thecloudcast.net/2026/01/10-questions-about-what-cloud-20-might.html - Apple Gemini-Powered Siri: https://techcrunch.com/2026/01/25/apple-will-reportedly-unveil-its-gemini-powered-siri-assistant-in-february/ #PlatformEngineering #DevOps #CloudNative #Kubernetes
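The two security themes in this episode, regex validation and webhook signature verification, can be sketched briefly. The episode notes do not say which two characters were missing; the sketch below assumes the common form of this bug class, a pattern without the `^` and `$` anchors. The actor ID and function names are hypothetical. The signature check assumes the `sha256=<hex digest>` format that GitHub sends in its `X-Hub-Signature-256` header.

```python
import hashlib
import hmac
import re

# Hypothetical allow-listed actor ID (illustration only, not from the incident).
TRUSTED_ACTOR_ID = "123456"

# Unanchored pattern: matches anywhere inside the candidate string.
UNANCHORED = re.compile(TRUSTED_ACTOR_ID)
# Anchored pattern: "^" and "$" are the two extra characters.
ANCHORED = re.compile("^" + TRUSTED_ACTOR_ID + "$")


def is_trusted_unanchored(actor_id: str) -> bool:
    """Flawed check: any ID that merely contains the trusted ID passes."""
    return UNANCHORED.search(actor_id) is not None


def is_trusted_anchored(actor_id: str) -> bool:
    """Stricter check: the ID must match from start to end."""
    return ANCHORED.search(actor_id) is not None


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw body and compare in constant time."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), header_value.encode())
```

With the unanchored pattern, an attacker-controlled ID such as `99123456` passes because it contains the trusted ID as a substring. In Python specifically, `re.fullmatch` is a slightly stricter alternative to `^...$`, since `$` also matches just before a trailing newline.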

Kubernetes Just Became Essential for AI Growth (CNCF Report)

2026-01-25 · 18:34

**Why will 90% of AI workloads fail without Kubernetes in the next 18 months?** Most platform teams are walking into a disaster they can't see coming. In today's Platform Engineering Playbook, we break down the CNCF's shocking new survey results showing 82% of organizations are unprepared for AI infrastructure demands. Plus, we cover the Cloudflare BGP incident that took down major services and what it means for your platform resilience. **What You'll Learn:** ✅ Why Kubernetes is becoming make-or-break for AI workloads ✅ The hidden performance bottlenecks killing AI model deployments ✅ Actionable audit checklist for your current K8s setup ✅ How organizational culture trumps technology choices ✅ Critical lessons from the latest Cloudflare outage **Timestamps:** 00:00 - Cold Open: The AI Infrastructure Crisis 01:30 - Welcome & Today's Focus 03:15 - Deep Dive: CNCF Survey Breakdown 12:45 - The Real Performance Killers 18:20 - Your Action Plan for 2026 22:10 - Breaking News: Cloudflare BGP Incident 26:35 - Chromium's C++ Ban Impact Whether you're scaling AI workloads or just trying to keep the lights on, this episode gives you the insights and practical steps to build platforms that actually work when it matters.
**Sources & References:** - CNCF: Kubernetes is 'foundational' infrastructure for AI: https://thenewstack.io/cncf-kubernetes-is-foundational-infrastructure-for-ai/ - Kubernetes Fuels AI Growth; Organizational Culture Remains the Decisive Factor: https://www.linuxfoundation.org/blog/kubernetes-fuels-ai-growth-organizational-culture-remains-the-decisive-factor - Banned C++ features in Chromium: https://chromium.googlesource.com/chromium/src/+/main/styleguide/c++/c++-features.md - Route leak incident on January 22, 2026: https://blog.cloudflare.com/route-leak-incident-january-22-2026/ #PlatformEngineering #DevOps #CloudNative #Kubernetes

ChatGPT Scales PostgreSQL to power 800 million users

2026-01-24 · 19:45

OpenAI is running ChatGPT for ~800 million users on PostgreSQL — and according to their own disclosures, it’s actually working. In this episode of the Platform Engineering Playbook Daily Podcast, we break down how PostgreSQL was pushed to hyperscale, the architectural tradeoffs behind a single-primary model, and the operational playbook that makes this kind of scale possible. This isn’t a generic “Postgres is great” story. It’s a real-world look at what it takes to run open-source databases at extreme scale, and what platform engineers can learn from it. ⏱️ Episode Timeline 00:31 – Intro 00:54 – Deep Dive (Part 1): Why PostgreSQL at Hyperscale 05:55 – Deep Dive (Part 2): Architecture & Scaling Patterns 10:54 – Deep Dive (Part 3): The Platform Engineering Playbook 16:02 – News 18:57 – Outro 🧠 Key Takeaways •PostgreSQL can scale far beyond “mid-size” workloads •Sharding, pooling, and operational discipline matter more than database choice •Open-source databases reduce vendor lock-in at hyperscale •Engineering expertise can outperform enterprise licensing If you’re designing or operating large-scale systems, this episode breaks down what actually matters when scaling relational databases. Subscribe for daily platform engineering deep dives covering infrastructure, databases, Kubernetes, and the systems behind modern AI. Links and sources discussed are in the show notes. 
**Sources & References:** - Scaling PostgreSQL to power 800 million ChatGPT users: https://openai.com/index/scaling-postgresql/ - Northrop Grumman scales enterprise Kubernetes for AI and hybrid cloud with Red Hat OpenShift: https://www.redhat.com/en/blog/northrop-grumman-scales-enterprise-kubernetes-ai-and-hybrid-cloud-red-hat-openshift - Top Infrastructure as Code Tools and Terraform Alternatives: https://www.env0.com/blog/best-infrastructure-as-code-tools-and-terraform-alternatives - I replaced Redis with PostgreSQL (and it's faster): https://dev.to/polliog/i-replaced-redis-with-postgresql-and-its-faster-4942 - CTO Chris Aniszczyk on the CNCF push for AI interoperability: https://thenewstack.io/cto-chris-aniszczyk-on-the-cncf-push-for-ai-interoperability/ - OpenTelemetry eBPF Instrumentation 2026 Goals: https://opentelemetry.io/blog/2026/obi-goals/

3 Skills You Need to Transition to Platform Engineer

2026-01-23 · 16:47

**Will 70% of DevOps engineers disappear in the next 5 years?** That's the bold prediction kicking off today's deep dive into the massive career shift happening in tech right now. In this episode of Platform Engineering Playbook, we explore the critical transition from DevOps to Platform Engineering and what it means for your career survival. You'll discover why traditional DevOps roles are evolving, how companies like Spotify are leading this transformation, and the concrete roadmap you need to navigate this shift successfully. **What You'll Learn:** • Why the DevOps-to-Platform Engineering transition is inevitable • Real-world examples from industry leaders like Spotify's Backstage platform • A practical career roadmap for making the transition • Breaking news: Railway's $100M funding to challenge AWS with AI-native infrastructure • GitHub Actions' new 1 vCPU Linux runner and what it means for CI/CD • The AI slop problem plaguing Kubernetes communities **Timestamps:** 0:00 Cold Open - The 70% Prediction 2:15 Industry News Roundup 8:30 Deep Dive: DevOps Career Evolution 15:45 Platform Engineering Success Stories 22:10 Your Career Transition Roadmap 28:30 Wrap-up & Key Takeaways Whether you're a DevOps engineer feeling uncertain about the future or a platform engineering leader building your team, this episode provides the insights and actionable strategies you need. 
**Sources & References:** - From DevOps to Platform Engineer: https://platformengineering.org/blog/from-devops-to-platform-engineering - Railway secures $100 million: https://venturebeat.com/infrastructure/railway-secures-usd100-million-to-challenge-aws-with-ai-native-cloud - GitHub Actions 1 vCPU runner: https://github.blog/changelog/2026-01-22-1-vcpu-linux-runner-now-generally-available-in-github-actions - r/kubernetes AI discussion: https://www.reddit.com/r/kubernetes/comments/1qiezxc/rkubernetes_over_taken_with_ai_slop_projects/ - OpenStack & OpenShift monitoring: https://developers.redhat.com/articles/2026/01/22/monitoring-openstack-and-openshift-together #PlatformEngineering #DevOps #CloudNative #Kubernetes

The Infrastructure Monitoring Tools Teams Regret Choosing

2026-01-22 · 17:30

The monitoring tool everyone trusts is actually blind to 40% of your infrastructure failures, and the vendor knows it. Are you using an industry standard that misses almost half of all incidents? In this episode, we unravel the mystery of infrastructure monitoring tools and why your choice could be costing you dearly. As platform engineering teams grapple with an overwhelming array of options, from battle-tested open source tools to shiny SaaS platforms, the stakes have never been higher. The shift in focus from simple server monitoring to comprehensive observability is crucial for modern development. 🔑 What you’ll learn in this episode: - The shocking truth about popular monitoring tools that leave critical gaps in your observability. - Key indicators that signal it’s time to consider paid solutions for your monitoring needs. - A strategic playbook for evaluating vendors without falling into the lock-in trap. - Real-world examples of how companies manage their monitoring expenses, including a mid-sized SaaS company facing an $85,000 monthly bill. Don't miss out on understanding how to choose the right tools that catch what others miss. Tune in as we dive deep into the world of infrastructure monitoring and equip you with the insights you need to make informed decisions. ⏱️ TIMESTAMPS: 00:00:03 - Cold Open 00:00:30 - Intro 00:00:49 - Deep Dive - Act 1: The Setup 00:05:32 - Deep Dive - Act 2: The Analysis 00:09:44 - Deep Dive - Act 3: Takeaways 00:14:21 - News 00:16:57 - Outro --- 🎙️ Platform Engineering Playbook 🔗 https://platformengineeringplaybook.com 📧 Subscribe for weekly platform engineering insights #PlatformEngineering #DevOps #Kubernetes #CloudNative #SRE #Podcast

Your CI/CD Pipeline is a Debt Trap

2026-01-21 · 11:26

**73% of engineering teams are drowning in technical debt because of their CI/CD pipelines. Not despite them—because of them.** Are your automation tools secretly sabotaging your codebase? Today's Platform Engineering Playbook dives deep into the hidden ways CI/CD pipelines create technical debt and reveals practical strategies to break the cycle. **What You'll Learn:** • Why inheritance beats copying in platform design • Docker's new hardened images for bulletproof container security • How OpenTelemetry's log deduplication processor can slash your log volume • Critical vulnerabilities in Chainlit and Cloudflare you need to patch NOW • Actionable steps to audit and optimize your CI/CD debt **Episode Chapters:** 0:00 Cold Open - The 73% Problem 1:30 Platform Engineering News Roundup 8:45 Deep Dive: CI/CD Technical Debt Crisis 15:20 The Theory vs Reality of Inheritance 22:10 Practical Solutions for Your Organization 28:30 Security Alert Roundup Whether you're a platform engineer drowning in legacy pipelines or a team lead trying to prevent future debt, this episode gives you the frameworks and tools to build sustainable automation that actually reduces complexity instead of adding to it. **Sources & References:** • CI/CD Technical Debt Analysis: https://thenewstack.io/are-your-ci-cd-pipelines-accidentally-increasing-technical-debt/ • Docker Hardened Images: https://feeds.dzone.com/link/23568/17258698/docker-hardened-images-container-security • OpenTelemetry Log Deduplication: https://opentelemetry.io/blog/2026/log-deduplication-processor/ • Chainlit Security Advisory: https://www.securityweek.com/chainlit-vulnerabilities-may-leak-sensitive-information/ • Cloudflare Zero-Day Alert: https://cybersecuritynews.com/cloudflare-zero-day-vulnerability/ #PlatformEngineering #DevOps #CloudNative #Kubernetes
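The log-volume point above rests on a simple idea: within a time window, identical log records are collapsed into a single record that carries a count. The sketch below illustrates only that idea in plain Python. It is not the OpenTelemetry processor's implementation, and the `LogRecord` type and `deduplicate` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical minimal log record; frozen so it can be used as a dict key."""
    severity: str
    body: str


def deduplicate(window: Iterable[LogRecord]) -> List[Tuple[LogRecord, int]]:
    """Collapse identical records from one time window into (record, count) pairs.

    Records are emitted in the order they were first seen.
    """
    counts: Dict[LogRecord, int] = {}
    for record in window:
        counts[record] = counts.get(record, 0) + 1
    return list(counts.items())
```

A window of three identical connection errors plus one startup message would leave this function as two entries, one of them with a count of 3, which is where the volume reduction comes from when noisy errors repeat.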

Kubernetes Just Revolutionized Learning — Get Ahead Now!

2026-01-20 · 17:01

**Are major tech companies secretly abandoning Kubernetes certifications?** What we discovered about the future of K8s learning will change how you approach platform engineering in 2026. In today's Platform Engineering Playbook, we uncover why traditional Kubernetes education is becoming obsolete and what platform teams are doing instead. Plus, breaking news that could revolutionize your infrastructure stack. **What You'll Learn:** • Why the volume of Kubernetes resources reveals a hidden shift in the industry • Microsoft's game-changing Azure Functions announcement for Model Context Protocol servers • How Pinterest's Moka is rewriting big data processing rules with Kubernetes • Practical strategies for platform engineers navigating the evolving K8s landscape • Critical vulnerability insights from Cloudflare's ACME validation logic Whether you're leading a platform team or building cloud-native infrastructure, this episode delivers actionable insights you can implement immediately. **Sources & References:** - Microsoft Azure Functions Model Context Protocol announcement - Cloudflare ACME vulnerability mitigation report - Pinterest Moka Kubernetes big data processing case study #PlatformEngineering #DevOps #CloudNative #Kubernetes #Azure #CloudSecurity #BigData #InfrastructureAsCode

How AWS's New Euro Cloud Changes Data Control Forever

2026-01-19 · 16:43

"92% of European companies don’t trust US cloud providers with their data anymore. So, AWS just locked itself out of its own Euro Cloud! This shocking move raises critical questions about data sovereignty and compliance for businesses operating in Europe. In this episode, we dive deep into AWS's groundbreaking decision to create a completely isolated European cloud infrastructure, one that even Amazon employees can't access. Why would they cut off their own access, and what does this mean for your data strategy? 🔑 Learn about the implications of AWS's European Sovereign Cloud and how it represents a shift in data sovereignty. 🔑 Discover the parent company structure AWS is using with local subsidiaries in Germany and what that means for compliance. 🔑 Get actionable insights on navigating data classification and regulatory exposure in a post-CLOUD Act world. 🔑 Understand how this decision impacts your compliance strategy if you store customer data in AWS. This episode unpacks the complexities of data sovereignty and the geopolitical risks that come with using US cloud providers. Don't miss this crucial information that could change your approach to cloud infrastructure forever. --- 🎙️ Platform Engineering Playbook 🔗 https://platformengineeringplaybook.com ---

Astro Joins Cloudflare: What It Means for Platform Engineers

2026-01-17 · 13:18

Cloudflare acquires the Astro Technology Company, adding a 1M-downloads-per-week web framework to their edge platform. We analyze the strategic implications, what stays open source, and lessons about framework sustainability for platform engineering teams. Key Topics: - Astro framework overview: islands architecture, framework-agnostic components, content-first approach - Why Cloudflare acquired Astro: Developer ecosystem capture, edge compute alignment, workerd integration - Open source sustainability: MIT license preserved, historical patterns (Gatsby, Remix) - What changes for platform teams: Framework evaluation criteria, portability concerns, exit strategies - News: AWS European Sovereign Cloud, Let's Encrypt 6-day certs, Datadog LLM Observability Duration: 13 minutes Subscribe and share with colleagues who'd find this valuable! #PlatformEngineering #Astro #Cloudflare #WebFramework #EdgeComputing #OpenSource #CloudflareWorkers #DevOps #SRE

ScyllaDB X Cloud Challenges DynamoDB Cost and Performance

2026-01-16 · 11:09

ScyllaDB just launched X Cloud with claims of double the performance at half the cost compared to DynamoDB. This episode breaks down the technical architecture behind their tablet-based approach, how they're achieving 80% data compression on ARM Graviton4 instances, and when this actually makes sense for platform engineering teams running high-throughput workloads. Key Topics: - ScyllaDB X Cloud tablet-based architecture (5GB chunks) vs traditional consistent hashing - Claims of 6x performance improvement with 50% cost reduction vs DynamoDB - 80% compression on ARM Graviton4 instances, 25x faster data streaming - High-throughput workload targets: Discord, Disney, Starbucks use cases - News: TerraFormer AI IaC generation, AWS supply chain vulnerability, ML pipeline security Duration: 11 minutes Subscribe and share with colleagues who'd find this valuable! #PlatformEngineering #ScyllaDB #DynamoDB #NoSQL #CloudNative #DatabaseArchitecture #AWS #DevOps #SRE

Invisible Linux Malware: The Undetectable Threat to Your Cloud Infrastructure

2026-01-15 · 16:40

Your Linux servers aren't just running containers anymore—they're hosting invisible tenants that security teams can't even detect. In this episode, we deep dive into VoidLink, the new cloud-native malware framework that Check Point Research just uncovered. This isn't your typical malware that got retrofitted for the cloud—this thing was born in the cloud, designed from the ground up to evade every detection tool in your security stack. We explore: • How VoidLink achieves its terrifying persistence in cloud environments • Why every major cloud provider is vulnerable to this new threat class • eBPF-based rootkits and kernel-level persistence techniques • Why traditional security tools fail against cloud-native threats • How VoidLink learns and adapts to your environment over time • Defense-in-depth strategies for cloud-native infrastructure Key takeaway: VoidLink represents a new generation of threats built specifically for the cloud. Platform teams must evolve their security posture to include runtime detection, eBPF observability, and defense-in-depth strategies. --- Platform Engineering Podcast provides deep dives into infrastructure, DevOps, and cloud-native security. New episodes weekly. Subscribe: https://platformengineering.org/podcast

The AI-Cloud Native Symbiosis - How Intelligent Infrastructure is Transforming Platform Engineering

2026-01-14 · 14:42

By 2025, 90% of new enterprise applications will be AI-powered and cloud-native. This episode explores the symbiotic relationship between AI and Kubernetes - where AI isn't just another workload, but is fundamentally transforming how we build and operate cloud native platforms. We cover real-world examples like Netflix's predictive scaling achieving 92% accuracy, the emergence of AI-driven observability platforms, and why platform engineers need to evolve from infrastructure operators to AI-infrastructure orchestrators. In this episode: - AI transforming the Kubernetes control plane with predictive scheduling - Netflix's AI-driven traffic management: 92% prediction accuracy, 35% resource reduction - AI-native observability: anomaly detection on metric relationships, not just metrics - GPU orchestration: NVIDIA GPU Operator achieving 80%+ utilization vs 30-40% baseline - Edge AI patterns: federated learning, model distillation, intermittent connectivity - Skills evolution: Understanding AI workload characteristics without becoming ML experts - News: Red Hat connects AI to Istio via Kiali MCP Server, AWS CloudWatch adds Apache Iceberg support Perfect for senior platform engineers, SREs, DevOps engineers looking to understand the convergence of AI and cloud native technologies. New episodes every week. Subscribe wherever you listen to stay current on platform engineering. Episode URL: https://platformengineeringplaybook.com/podcasts/00090-ai-cloud-native-symbiosis Duration: 15 minutes Host: Alex and Jordan Category: Technology Subcategory: Software How-To Keywords: AI, cloud native, Kubernetes, symbiosis, intelligent infrastructure, platform engineering, GPU orchestration, predictive scaling, observability, machine learning, Netflix, edge AI, federated learning

MIT 10 Breakthrough Technologies 2026 - The Platform Engineering Perspective

2026-01-13 · 20:36

MIT just released their 10 Breakthrough Technologies for 2026 - and three of them are infrastructure problems that platform engineers are solving right now. This episode explores hyperscale AI data centers consuming 96 GW globally by 2026, vibe coding with 41% of code now AI-generated, and LLM interpretability research from Anthropic. We break down how platform engineers enable these breakthroughs through power-aware scheduling, AI coding guardrails, and new observability patterns for ML systems. In this episode: - Hyperscale AI data centers: 96 GW capacity, $600B capex, 100+ kW per rack - Vibe coding: 92% developer AI adoption, GitHub Copilot at 20M users - LLM interpretability: Anthropic's sparse autoencoders for debugging AI - Platform skills needed: power management, GPU orchestration, ML observability - News: Cloudflare IaC security, AWS CloudWatch Iceberg, SSL certificate dangers Perfect for senior platform engineers, SREs, DevOps engineers looking to understand the infrastructure behind 2026's biggest tech breakthroughs. New episodes every week. Subscribe wherever you listen to stay current on platform engineering. Episode URL: https://platformengineeringplaybook.com/podcasts/00089-mit-10-breakthrough-technologies-2026 Duration: 21 minutes Host: Alex and Jordan Category: Technology Subcategory: Software How-To Keywords: MIT, breakthrough technologies, 2026, AI, hyperscale, data centers, vibe coding, LLM, interpretability, platform engineering, infrastructure, GPU, Copilot, Cursor

AWS Route 53 Global Resolver - Enterprise DNS Security at the Edge

2026-01-12 · 20:04

Every DNS query your hybrid environment makes could be exposing sensitive data. AWS Route 53 Global Resolver, announced at re:Invent 2025, combines anycast routing, encrypted DNS protocols (DoH/DoT), and managed threat filtering in a single service. In this episode, we cover: - Anycast DNS architecture routing to nearest of 11 AWS regions - DoH and DoT encrypted DNS protocol support - AWS RAM authorization for multi-account private hosted zones - DNS filtering with managed threat lists - Implementation patterns for hybrid environments and remote workforces - Query logging for security visibility and threat hunting Plus news on Claude Code creator workflows, UK encryption backdoors, K8s EU hosting costs, PostgreSQL replacing Redis, and Rust ecosystem security. Links: - Episode page: https://playbook.platformengineering.org/podcasts/00088-aws-route-53-global-resolver - AWS Route 53 Global Resolver docs: https://docs.aws.amazon.com/route53/latest/userguide/resolver-global-resolver.html #AWS #Route53 #DNS #DoH #DoT #HybridCloud #Security #PlatformEngineering #DevOps

Kubernetes Upcoming Features Deep Dive - Extended Toleration Operators and Mutable PV Node Affinity

2026-01-11 · 41:19

There's a Kubernetes cluster out there right now burning ten thousand dollars a month on GPU nodes that sit idle sixty percent of the time. Why? Because the scheduler can't say "only schedule pods on nodes with MORE than four GPUs." It's 2026, and our scheduler still can't count. But that's about to change. In this episode, we dive deep into two alpha features in Kubernetes 1.35 that represent a fundamental shift in how Kubernetes handles scheduling and storage: **Extended Toleration Operators (KEP-5471)** - Finally, numeric threshold-based scheduling with taints. New Gt (greater than) and Lt (less than) operators let you express "I can tolerate risk up to 5%" or "schedule me on nodes with at least 4 GPUs." **Mutable PersistentVolume Node Affinity (KEP-5381)** - Storage topology that adapts to reality. When you migrate volumes between availability zones, you no longer need to recreate pods and PVs - just update the nodeAffinity. Plus platform engineering news: - OpenEverest: Percona's database platform goes open governance - GKE Agent Sandbox: Kernel-level isolation for AI agent code execution - MongoBleed (CVE-2025-14847): Critical vulnerability with 87,000 exposed servers - Predictive capacity planning and the shift from reactive to proactive infrastructure This is Kubernetes evolving from reactive feedback loops to truly predictive infrastructure. Listen on the web: https://platformengineering.org/podcasts/00087-kubernetes-upcoming-features-deep-dive
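The numeric toleration operators described above can be modeled in a few lines. This is a simplified sketch, not the scheduler's code. It assumes the comparison reads "taint value `Gt`/`Lt` toleration value" and that both values must parse as integers; check KEP-5471 for the exact semantics. The taint keys `example.com/gpu-count` and `example.com/failure-risk` are hypothetical, and special cases such as an empty key with `Exists` are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str
    operator: str      # "Equal", "Exists", "Gt", or "Lt"
    value: str = ""
    effect: str = ""   # empty string matches any effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (simplified model)."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    if toleration.operator == "Equal":
        return toleration.value == taint.value
    if toleration.operator in ("Gt", "Lt"):
        try:
            taint_number = int(taint.value)
            threshold = int(toleration.value)
        except ValueError:
            return False  # assumption: non-numeric values never match
        # Assumption: the taint's value is compared against the toleration's.
        if toleration.operator == "Gt":
            return taint_number > threshold
        return taint_number < threshold
    return False
```

Under these assumptions, a pod with `Toleration("example.com/gpu-count", "Gt", "4")` tolerates a node tainted with a GPU count of 8 but not one tainted with 2, and a toleration with `Lt` and value `5` expresses the "risk up to 5%" case.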

Why Is a 2016 AWS Instance Still the Best Value? (Cloudspecs Research)

2026-01-10 · 20:58

New research from TUM reveals uncomfortable truths about cloud hardware stagnation. The paper "Cloudspecs: Cloud Hardware Evolution Through the Looking Glass" shows that the best-performing AWS instance for NVMe I/O per dollar was released in 2016 - and nothing since has come close. In this episode: • CIDR 2026 research from Technical University of Munich • AWS i3 instances from 2016 still beat all newer options for storage price-performance • CPU gains: 10x cores, but only 2-3x cost-adjusted improvement • Memory crisis: DRAM capacity per dollar has "effectively flatlined" • Network is the only bright spot: 10x improvement per dollar • Interactive tool at cloudspecs.fyi using DuckDB-WASM News segment covers AI coding tool challenges, Kubernetes updates (Dashboard archived, CoreDNS 1.14), Windows Secure Boot certificate expiration, AWS Lambda .NET 10, Amazon MQ mTLS, MCP criticism, and NVIDIA Rubin announcement. Episode page: https://platformengineering.org/podcasts/00086-cloudspecs-cloud-hardware-evolution #PlatformEngineering #CloudComputing #AWS #FinOps #CostOptimization #DevOps
