Ship It Weekly - DevOps, SRE, and Platform Engineering News

7 Episodes

Ship It Interviews: The WHY Behind DevOps, Upskilling, and Agentic AI (with Maz Islam)

2025-12-21 · 30:38

This is a Ship It Weekly interview episode. The weekly news recaps are still weekly. These interviews drop in between when I find someone worth talking to and the convo feels useful.

In this episode I’m joined by Mazharul “Maz” Islam (DevOps with Maz). Maz is a UK-based DevOps Engineer who shares practical, real-world DevOps content on YouTube and LinkedIn. We talk about the stuff that actually matters when you’re building systems, running infrastructure, owning reliability, and living in on-call.

We hit three big things: the importance of understanding the WHY behind DevOps (not just the tools), how to upskill and keep up with the industry without burning out, and what the agentic AI era might look like for DevOps, SRE, and platform engineering teams. We also touch on MCPs and AI agents, and what “leveling up” looks like for companies that want to move faster without breaking everything.

If you’re into DevOps culture, SRE practices, platform engineering, CI/CD, infrastructure automation, and how teams should think about reliability and security as things keep changing, this one should land.

Guest
Mazharul Islam (DevOps with Maz)
UK-based DevOps Engineer. Posts a lot of hands-on content around cloud, DevOps fundamentals, and leveling up as an engineer.

Links (Maz)
YouTube: https://m.youtube.com/@devopswithmaz
LinkedIn: https://www.linkedin.com/in/mazharul419

Topics we covered
• WHY behind DevOps, and why “tools” is the smallest part of it
• DevOps fundamentals vs tool-chasing
• Upskilling strategies for DevOps Engineers and SREs
• How to keep learning cloud and automation without drowning
• What strong teams measure and what “good” actually looks like (delivery, reliability, feedback loops)
• Agentic AI, AI agents in operations, and the next era of DevOps
• MCPs, automation guardrails, and safe ways to scale change
• How companies can “level up” their engineering org without turning it into chaos

We also discussed the previous episode of Ship It Weekly - GitHub Runner Pricing Pause, Terraform Cloud Limits, and AI in CI
https://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/

Book Maz recommended
The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations (Paperback, Oct 6, 2016)
Gene Kim, Jez Humble, Patrick Debois, John Willis

About Ship It Weekly (format)
Ship It Weekly is for people running infrastructure and owning reliability. Most episodes are quick weekly news recaps for DevOps, SRE, and platform engineering. In between those weekly drops, I’ll publish interview episodes like this one.

Subscribe / help the show
If you want the weekly DevOps news recaps plus these interviews, hit follow or subscribe in your podcast app. And if you’re feeling generous, leave a rating or review and share this episode with a coworker (especially your on-call buddy). That stuff genuinely helps the show get discovered.

GitHub Runner Pricing Pause, Terraform Cloud Limits, and AI in CI

2025-12-20 · 12:06

This week on Ship It Weekly, Brian looks at how the “platform tax” is showing up everywhere: pricing model shifts, CI dependencies, and new security boundaries thanks to AI agents.

We start with GitHub Actions. GitHub announced a new “cloud platform” charge for self-hosted runners in private/internal repos… then hit pause after backlash. Hosted runner price reductions for 2026 are still planned. We also got the perfect timing joke: a GitHub incident the same week.

Next up is HashiCorp. Legacy HCP Terraform (Terraform Cloud) Free is reaching end-of-life in 2026, with orgs moving to the newer Free tier capped at 500 managed resources. If you’re running real infrastructure, this is a good moment to audit what you’re actually managing and decide whether you’re cleaning up, paying, or planning a migration.

Then we talk PromptPwnd: why stuffing untrusted PR/issue text into AI agent prompts (inside CI) can turn into a supply chain/security problem. The short version: treat AI inputs like hostile user input, keep tokens/permissions minimal, and don’t let agents “run with scissors.”

We also cover the Home Depot report about long-lived access exposure as a reminder that secrets hygiene, blast radius, and detection still matter more than the shiny tools.

In the lightning round: CDKTF is sunset/archived, Bitbucket is cleaning up free unused workspaces, and SourceHut is proposing pricing changes. We wrap with a human note on “platform whiplash” and why a simple watchlist beats carrying all this stuff in your head.

Links from this episode

GitHub Actions pricing + pause
https://runs-on.com/blog/github-self-hosted-runner-fee-2026/
https://x.com/github/status/2001372894882918548
https://www.githubstatus.com/incidents/x696x0g4t85l

HashiCorp / Terraform Cloud free plan changes
https://github.com/hashicorp/terraform-cdk?tab=readme-ov-file#sunset-notice
https://www.reddit.com/r/Terraform/s/slYm77wzYr

PromptPwnd / AI agents in CI
https://www.aikido.dev/blog/promptpwnd-github-actions-ai-agents

Home Depot access exposure report
https://techcrunch.com/2025/12/12/home-depot-exposed-access-to-internal-systems-for-a-year-says-researcher/

Bitbucket cleanup
https://community.atlassian.com/forums/Bitbucket-articles/Bitbucket-cleanup-of-free-unused-workspaces-what-you-need-to/ba-p/3144063

SourceHut pricing proposal
https://sourcehut.org/blog/2025-12-01-proposed-pricing-changes/
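
For the "audit what you’re actually managing" step against the 500-managed-resource cap, here is a minimal sketch of one way to get a rough count. It assumes you have saved the output of `terraform show -json` for a workspace and that the JSON follows Terraform's documented state representation (`values.root_module` with nested `child_modules`). The function names are hypothetical, and the excluded-type list is an assumption about how the cap is counted that you should check against HashiCorp's own definition.

```python
import json

# Assumption: these types are not billed as managed resources.
# Verify against HashiCorp's current definition before relying on it.
EXCLUDED_TYPES = {"null_resource", "terraform_data"}


def count_managed(module: dict) -> int:
    """Recursively count resources with mode == "managed" in one module."""
    own = sum(
        1
        for resource in module.get("resources", [])
        if resource.get("mode") == "managed"
        and resource.get("type") not in EXCLUDED_TYPES
    )
    children = sum(count_managed(child) for child in module.get("child_modules", []))
    return own + children


def count_from_state_json(text: str) -> int:
    """Count managed resources in the text produced by `terraform show -json`."""
    state = json.loads(text)
    root = (state.get("values") or {}).get("root_module") or {}
    return count_managed(root)
```

Running this per workspace and summing the results gives a ballpark figure to compare against the cap. Data sources (`mode` of `data`) are skipped on purpose, since they are read, not managed.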

IBM Buys Confluent, React2Shell, and Netflix on Aurora

2025-12-12 · 16:14

In this episode of Ship It Weekly, Brian powers through a cold and digs into a very “infra grown-up” week in DevOps.

First up, IBM is buying Confluent for $11B. We talk about what that means if you’re on Confluent Cloud today, still running your own Kafka, or trying to choose between Confluent, MSK, and DIY. It’s part of a bigger pattern after IBM’s HashiCorp deal, and it has real implications for vendor concentration and “plan B” strategies.

Then we shift to React2Shell, a 10.0 RCE in React Server Components that’s already being exploited in the wild. Even if you never touch React, if you run platforms or Kubernetes for teams using Next.js or RSC, you’re on the hook for patching windows, WAF rules, and blast-radius thinking.

We also look at Netflix’s write-up on consolidating relational databases onto Aurora PostgreSQL, with big performance gains and cost savings. It’s a good excuse to step back and ask whether your own Postgres fleet still makes sense at the scale you’re at now.

In the lightning round, we hit OpenTofu 1.11’s new language features, practical Terraform “tips from the trenches,” Ghostty becoming a non-profit project, and two spec-driven dev tools (Spec Kit and OpenSpec) that show what sane AI-assisted development might look like.

For the human side, we close with “Your Brain on Incidents” and what high-stress outages actually do to people, plus a few concrete ideas for making on-call less brutal.

If you’re on a platform team, own SLOs, or you’re the person people ping when “something is wrong with prod,” this one should give you a mix of immediate to-dos and longer-term questions for your roadmap.

Links:

IBM + Confluent
https://www.confluent.io/blog/ibm-to-acquire-confluent/
https://newsroom.ibm.com/2025-12-08-ibm-to-acquire-confluent-to-create-smart-data-platform-for-enterprise-generative-ai

React2Shell (CVE-2025-55182)
https://react.dev/blog/2025/12/03/critical-security-vulnerability-in-react-server-components

Netflix on Aurora PostgreSQL
https://aws.amazon.com/blogs/database/netflix-consolidates-relational-database-infrastructure-on-amazon-aurora-achieving-up-to-75-improved-performance/

Tools & tips
https://opentofu.org/blog/opentofu-1-11-0/
https://rosesecurity.dev/2025/12/04/terraform-tips-and-tricks.html
https://mitchellh.com/writing/ghostty-non-profit
https://github.com/github/spec-kit
https://github.com/Fission-AI/OpenSpec

Human side
https://uptimelabs.io/your-brain-on-incidents/

AWS re:Invent for Platform Teams, GKE at 130k Nodes, and Killing Staging

2025-12-04 · 22:00

In this episode of Ship It Weekly, Brian looks at re:Invent through a platform/SRE lens and pulls out the updates that actually change how you design and run systems.

We talk about regional NAT Gateways and Route 53 Global Resolver on the networking side, ECS Express Mode and EKS Capabilities as new paved roads for app teams, S3 Vectors GA and 50 TB S3 objects for AI and data lakes, Aurora PostgreSQL dynamic data masking, CodeCommit’s return to full GA, and IAM Policy Autopilot for AI-assisted IAM policies. This was recorded mid–re:Invent, so consider it a “what matters so far” pass, not a full recap.

Outside AWS, we get into Google’s 130,000-node GKE cluster and what actually applies if you’re running normal-sized clusters, plus the “It’s time to kill staging” argument and what responsible testing in production looks like with feature flags, progressive delivery, and solid observability.

In the lightning round, we hit Zachary Loeber’s Terraform MCP server and terraform-ingest (letting AI tools speak your real Terraform modules), Runs-On’s EC2 instance rankings so you stop picking instance types by vibes, and Airbnb’s adaptive traffic management for their key-value store. We close with Nolan Lawson’s “The fate of small open source” and what it means when your platform quietly depends on one-maintainer libraries.

Links from this episode:

AWS highlights:
https://aws.amazon.com/about-aws/whats-new/2025/11/aws-nat-gateway-regional-availability
https://aws.amazon.com/blogs/aws/introducing-amazon-route-53-global-resolver-for-secure-anycast-dns-resolution-preview
https://aws.amazon.com/about-aws/whats-new/2025/11/announcing-amazon-ecs-express-mode
https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-s3-vectors-generally-available/

Other topics:
https://cloud.google.com/blog/products/containers-kubernetes/how-we-built-a-130000-node-gke-cluster
https://thenewstack.io/its-time-to-kill-staging-the-case-for-testing-in-production/
https://blog.zacharyloeber.com/article/terraform-custom-module-mcp-server/
https://go.runs-on.com/instances/ranking
https://medium.com/airbnb-engineering/from-static-rate-limiting-to-adaptive-traffic-management-in-airbnbs-key-value-store-29362764e5c2
https://nolanlawson.com/2025/11/16/the-fate-of-small-open-source/
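
One way to picture the "feature flags plus progressive delivery" side of testing in production is a deterministic percentage rollout. The sketch below is only an illustration and is not taken from the linked article: `in_rollout`, the flag name, and the user IDs are all hypothetical. It hashes the flag and user together so the same user always lands in the same bucket, which lets you widen the percentage without flipping people back and forth.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` of 100 buckets."""
    # Hash flag and user together so each flag gets an independent bucketing.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    return bucket < percent


# Hypothetical usage: expose a new code path to roughly 5% of users.
if in_rollout("new-checkout", "user-123", 5):
    pass  # serve the new path and watch your dashboards
```

Because the bucket never changes for a given flag and user, raising `percent` from 5 to 25 only adds users; nobody who already had the feature loses it.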

Kubernetes Config Reality Check, EKS Control Planes, and GitHub Guardrails

2025-11-26 · 16:40

In this episode of Ship It Weekly, Brian digs into what’s new for people actually running infra: Kubernetes config, EKS control planes and networking, and GitHub’s latest CI/CD and Copilot updates.

We start with Kubernetes’ new configuration good practices post and how to turn it into a checklist to clean up Helm/Kustomize and kill off “hotfix from my laptop” manifests.

Then we hit AWS: EKS Provisioned Control Plane to size control plane capacity for big or noisy clusters, plus new network observability so you can see who’s talking to what across clusters and AZs instead of guessing from node metrics.

On the GitHub side, Actions OIDC tokens now include a check_run_id for tighter access control, and Copilot adds instructions files and custom agents so you can encode platform and security expectations directly into reviews and workflows.

In the lightning round, we touch on Terrascan being archived, Microsoft’s write-up of a 15.72 Tbps Aisuru DDoS attack against Azure, and AWS flat-rate CloudFront plans that bundle CDN and security into more predictable pricing.

We close with Lorin Hochstein’s “Two thought experiments” and what it looks like to write incident reports as if an AI (and your future teammates) will rely on them to debug the next outage.

If you run Kubernetes in prod, this one should give you a few concrete ideas for your roadmap.

Links from this episode
https://kubernetes.io/blog/2025/11/25/configuration-good-practices/
https://aws.amazon.com/about-aws/whats-new/2025/11/amazon-eks-provisioned-control-plane/
https://aws.amazon.com/blogs/aws/monitor-network-performance-and-traffic-across-your-eks-clusters-with-container-network-observability/
https://github.blog/changelog/2025-11-13-github-actions-oidc-token-claims-now-include-check_run_id/
https://github.blog/ai-and-ml/unlocking-the-full-power-of-copilot-code-review-master-your-instructions-files/
https://docs.github.com/en/copilot/how-tos/use-copilot-agents/coding-agent/create-custom-agents

Lightning Round
https://github.com/tenable/terrascan
https://www.bleepingcomputer.com/news/microsoft/microsoft-aisuru-botnet-used-500-000-ips-in-15-tbps-azure-ddos-attack/
https://aws.amazon.com/about-aws/whats-new/2025/11/aws-flat-rate-pricing-plans/
https://sreweekly.com/sre-weekly-issue-498/ (Lorin's Article)
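
To make the "tighter access control" point about `check_run_id` concrete, here is a minimal sketch of a claims check on the receiving side. It assumes the OIDC token's signature, issuer, and audience have already been verified elsewhere and that `claims` is the decoded payload as a dict. The function name and the example values are hypothetical; `repository` is a standard claim in GitHub's Actions OIDC tokens and `check_run_id` is the claim described in the changelog linked above.

```python
def is_allowed(claims: dict, expected_repo: str, expected_check_run_id: str) -> bool:
    """Allow access only for one specific check run in one specific repository.

    Assumes `claims` comes from a token whose signature was already verified.
    """
    check_run_id = claims.get("check_run_id")
    if check_run_id is None:
        # Tokens without the claim are rejected, not treated as a wildcard.
        return False
    return (
        claims.get("repository") == expected_repo
        and str(check_run_id) == str(expected_check_run_id)
    )


# Hypothetical usage with made-up values.
claims = {"repository": "example-org/example-repo", "check_run_id": "1234567890"}
allowed = is_allowed(claims, "example-org/example-repo", "1234567890")
```

The comparison is done on strings so it works whether the claim arrives as a string or a number. Failing closed when the claim is absent is a deliberate choice in this sketch.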

Kubernetes Shake-ups, Platform Reality, and AI-Native SRE

2025-11-21 · 15:53

In this episode of Ship It Weekly, Brian digs into 3 big themes for anyone running Kubernetes or building internal platforms.

First, Kubernetes is officially retiring Ingress NGINX and moving it into best-effort maintenance until March 2026. We talk about what that actually means if you’re still using it and how to think about choosing and rolling out a replacement ingress.

Second, we look at how CNCF is defining platform engineering and what “platform as a product” looks like in practice, plus some hard-earned lessons from running Kubernetes in production.

Third, we talk about AI as a first-class workload on Kubernetes. CNCF’s new Certified Kubernetes AI Conformance Program aims to standardize how AI runs on K8s, and recent writing on SRE in the age of AI looks at what reliability means when systems learn and drift.

In the lightning round, we hit good reads on database migrations, Postgres upgrades, and a distributed priority queue on Kafka. We wrap with the human side of incidents: fixation during incident response and using incidents as landmarks for the tradeoffs you’ve been making over time.

If you’re on a platform team, responsible for SLOs, or the person people ping when “Kubernetes is weird,” this one should give you concrete questions to take back to your roadmap and runbooks.

Links from this episode
https://kubernetes.io/blog/2025/11/11/ingress-nginx-retirement/
https://www.haproxy.com/blog/ingress-nginx-is-retiring
https://www.cncf.io/blog/2025/11/19/what-is-platform-engineering/
https://www.cncf.io/announcements/2025/11/11/cncf-launches-certified-kubernetes-ai-conformance-program-to-standardize-ai-workloads-on-kubernetes/
https://devops.com/sre-in-the-age-of-ai-what-reliability-looks-like-when-systems-learn/

Lightning round
https://www.cncf.io/blog/2025/11/18/top-5-hard-earned-lessons-from-the-experts-on-managing-kubernetes/
https://www.tines.com/blog/zero-downtime-database-migrations-lessons-from-moving-a-live-production
https://palark.com/blog/postgresql-upgrade-no-data-loss-downtime/
https://klaviyo.tech/building-a-distributed-priority-queue-in-kafka-1b2d8063649e
https://sreweekly.com/sre-weekly-issue-497/
https://ferd.ca/ongoing-tradeoffs-and-incidents-as-landmarks.html

Special: When the Cloud Has a Bad Day: Cloudflare, AWS us-east-1 & GitHub Outages

2025-11-20 · 12:54

In this special kickoff episode of Ship It Weekly, Brian walks through three major outages from the last few weeks and what they actually mean for DevOps, SRE, and platform teams.

Instead of just reading status pages, we look at how each incident exposes assumptions in our own architectures and runbooks:

Topics in this episode:
• Cloudflare’s global outage and what happens when your CDN/WAF becomes a single point of failure
• The AWS us-east-1 incident and why “multi-AZ in one region” isn’t a full disaster recovery strategy
• GitHub’s Git operations / Codespaces outage and how fragile our CI/CD and GitOps flows can be
• Practical questions to ask about your own setup: CDN bypass, cross-region readiness, backups for Git and CI

This episode is more of a themed “special” to kick things off.

Going forward, most episodes will follow a lighter news format: a couple of main stories from the week in DevOps/SRE/platform engineering, a quick tools and releases segment, and one culture/on-call or burnout topic. Specials like this will pop up when there’s a big incident or theme worth unpacking.

If you’re the person people DM when production is acting weird, or you’re responsible for the platform everyone ships on, this one’s for you.

Links from this episode

Cloudflare outage – November 18, 2025
https://blog.cloudflare.com/18-november-2025-outage/
https://www.thousandeyes.com/blog/cloudflare-outage-analysis-november-18-2025

AWS us-east-1 outage – October 2025
https://aws.amazon.com/message/101925/
https://www.thousandeyes.com/blog/aws-outage-analysis-october-20-2025

GitHub outage – November 18, 2025
https://us.githubstatus.com/incidents/f3f7sg2d1m20
https://currently.att.yahoo.com/att/github-down-now-not-just-211700617.html
