Ship It Weekly - DevOps, SRE, and Platform Engineering News

24 Episodes

Reverse

Cloudflare BYOIP BGP Withdrawals, Clerk’s Postgres Query-Plan Flip Outage, and AWS Kiro Permissions Lessons (Grafana Privesc + runc CVEs)

2026-02-2717:38

This week on Ship It Weekly, Brian covers three “automation meets reality” stories that every DevOps, SRE, and platform team can learn from.Cloudflare accidentally withdrew customer BYOIP prefixes due to a buggy cleanup task, Clerk got knocked over by a Postgres auto-analyze query plan flip, and AWS responded to reports about its internal Kiro tooling by framing the incident as misconfigured access controls. Plus: a quick EKS node monitoring update, and a tight security lightning round.LinksCloudflare BYOIP outage postmortem https://blog.cloudflare.com/cloudflare-outage-february-20-2026/ Clerk outage postmortem (Feb 19, 2026) https://clerk.com/blog/2026-02-19-system-outage-postmortem AWS outage report (Reuters) https://www.reuters.com/business/retail-consumer/amazons-cloud-unit-hit-by-least-two-outages-involving-ai-tools-ft-says-2026-02-20/ AWS response on Kiro + access controls https://www.aboutamazon.com/news/aws/aws-service-outage-ai-bot-kiroEKS Node Monitoring Agent (open source) https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-eks-node-monitoring-agent-open-source/Grafana CVE-2026-21721 https://grafana.com/security/security-advisories/cve-2026-21721/runc CVEs (AWS-2025-024) https://aws.amazon.com/security/security-bulletins/rss/aws-2025-024/ GitLab patch releases https://about.gitlab.com/releases/2025/11/26/patch-release-gitlab-18-6-1-released/ Atlassian Feb 2026 security bulletin https://confluence.atlassian.com/security/security-bulletin-february-17-2026-1722256046.htmlHuman story: SRE Is Anti-Transactional (ACM Queue) https://queue.acm.org/detail.cfm?id=3773094More episodes and show notes at https://shipitweekly.fmOn Call Briefs at: https://oncallbrief.com

Ship It Conversations: Mike Lady on Day Two Readiness + Guardrails in the AI Era

2026-02-2434:38

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).In this Ship It: Conversations episode I talk with Mike Lady (Senior DevOps Engineer, distributed systems) from Enterprise Vibe Code on YouTube. We talk day two readiness, guardrails/quality gates, and why shipping safely matters even more now that AI can generate code fast.HighlightsDay 0 vs Day 1 vs Day 2 (launching vs operating and evolving safely)What teams look like without guardrails (“hope is not a strategy”)Why guardrails speed you up long-term (less firefighting, more predictable delivery)Day-two audit checklist: source control/branches/PRs, branch protection, CI quality gates, secrets/config, staging→prod flowAI agents: they’ll “lie, cheat, and steal” to satisfy the goal unless you gate themMulti-model reviews (Claude/Gemini/Codex) as different perspectivesAI in prod: start read-only (logs/traces), then earn trust slowlyMike’s linksYouTube: https://www.youtube.com/@EnterpriseVibeCodeSite: https://www.enterprisevibecode.com/LinkedIn: https://www.linkedin.com/in/mikelady/Stuff mentionedVibe Coding (Gene Kim + Steve Yegge): https://www.simonandschuster.com/books/Vibe-Coding/Gene-Kim/9781966280026Beads (agent memory/issue tracker): https://github.com/steveyegge/beadsGas Town (agent orchestration): https://github.com/steveyegge/gastownAGENTS.md (agent instructions file): https://agents.md/OpenAI Codex: https://openai.com/codex/More episodes + details: https://shipitweekly.fm

Ship It Weekly – DevOps and SRE News for Engineers Who Run Production

2026-02-2200:53

Ship It Weekly is a DevOps and SRE news podcast for engineers who run real systems.Every week I break down what actually matters in cloud, Kubernetes, CI/CD, infrastructure as code, and production reliability. No hype. No vendor spin. Just practical analysis from someone who’s been on call and shipped systems at scale.This isn’t a tutorial show. It’s a signal filter.I cover major industry shifts, security incidents, cloud provider changes, and tooling updates, then explain what they mean for platform teams and engineers operating in production.If you work in DevOps, SRE, platform engineering, or cloud infrastructure and want context instead of clickbait, you’re in the right place.New episodes weekly.You can also find detailed write-ups at: https://shipitweekly.fmAnd curated production-focused briefs at: https://oncallbrief.comSubscribe, and let’s ship.

GitHub Agentic Workflows, Gentoo Leaves GitHub, Argo CD 3.3 Upgrade Gotcha, AWS Config Scope Creep

2026-02-2019:21

This week on Ship It Weekly, Brian hits five stories where the “defaults” are shifting under ops teams.GitHub is bringing Agentic Workflows into Actions, Gentoo is migrating off GitHub to Codeberg, Argo CD upgrades are forcing Server-Side Apply in some paths, AWS Config quietly expanded coverage again, and EC2 nested virtualization is now possible on virtual instances.LinksYouTube episodes https://www.youtube.com/watch?v=tuuLlo2rbI0&list=PLYLi5KINFnO7dVMbhsJQTKRFXfSSwPmuL&pp=sAgCOnCallBrief https://oncallbrief.comTeller’s Tech Substack https://tellerstech.substack.com/GitHub Agentic Workflows (preview) https://github.blog/changelog/2026-02-13-github-agentic-workflows-are-now-in-technical-preview/Gentoo moves to Codeberg https://www.theregister.com/2026/02/17/gentoo_moves_to_codeberg_amid/Argo CD upgrade guide: 3.2 -> 3.3 (SSA) https://argo-cd.readthedocs.io/en/latest/operator-manual/upgrading/3.2-3.3/AWS Config: 30 new resource types https://aws.amazon.com/about-aws/whats-new/2026/02/aws-config-new-resource-typesEC2 nested virtualization (virtual instances) https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-ec2-nested-virtualization-on-virtual/GitHub status page update https://github.blog/changelog/2026-02-13-updated-status-experience/GitHub Actions: early Feb updates https://github.blog/changelog/2026-02-05-github-actions-early-february-2026-updates/Runner min version enforcement extended https://github.blog/changelog/2026-02-05-github-actions-self-hosted-runner-minimum-version-enforcement-extended/Open Build Service postmortem https://openbuildservice.org/2026/02/02/post-mortem/Human story: AI SRE vs incident management https://surfingcomplexity.blog/2026/02/14/lots-of-ai-sre-no-ai-incident-management/More episodes and show info on https://shipitweekly.fm

Special: OpenClaw Security Timeline and Fallout: CVE-2026-25253 One-Click Token Leak, Malicious ClawHub Skills, Exposed Agent Control Panels, and Why Local AI Agents Are a New DevOps/SRE Control Plane (OpenAI Hires Founder)

2026-02-1718:49

In this Ship It Weekly special, Brian breaks down the OpenClaw situation and why it’s bigger than “another CVE.”OpenClaw is a preview of what platform teams are about to deal with: autonomous agents running locally, wired into real tools, real APIs, and real credentials. When the trust model breaks, it’s not just data exposure. It’s an operator compromise.We walk through the recent timeline: mass internet exposure of OpenClaw control panels, CVE-2026-25253 (a one-click token leak that can turn your browser into the bridge to your local gateway), a skills marketplace that quickly became a malware delivery channel, and the Moltbook incident showing how “agent content” becomes a new supply chain problem. We close with the signal that agents are going mainstream: OpenAI hiring the OpenClaw creator.Chapters1. What OpenClaw Actually Is2. The Situation in One Line3. Localhost Is Not a Boundary (The CVE Lesson)4. Exposed Control Panels (How “Local” Went Public)5. The Marketplace Problem (Skills Are Supply Chain)6. The Ecosystem Spills (Agent Platforms Leaking Real Data)7. Minimum Viable Safety for Local Agents8. The Plot Twist (OpenAI Hires the Creator)Links from this episodeCensys exposure research https://censys.com/blog/openclaw-in-the-wild-mapping-the-public-exposure-of-a-viral-ai-assistantGitHub advisory (CVE-2026-25253) https://github.com/advisories/GHSA-g8p2-7wf7-98mqNVD entry https://nvd.nist.gov/vuln/detail/CVE-2026-25253Koi Security: ClawHavoc / malicious skills https://www.koi.ai/blog/clawhavoc-341-malicious-clawedbot-skills-found-by-the-bot-they-were-targetingMoltbook leak coverage (Reuters) https://www.reuters.com/legal/litigation/moltbook-social-media-site-ai-agents-had-big-security-hole-cyber-firm-wiz-says-2026-02-02/OpenClaw security docs https://docs.openclaw.ai/gateway/securityOpenAI hire coverage (FT) https://www.ft.com/content/45b172e6-df8c-41a7-bba9-3e21e361d3aaMore information and past episodes on https://shipitweekly.fm

When guardrails break prod: GitHub “Too Many Requests” from legacy defenses, Kubernetes nodes/proxy GET RCE, HCP Vault resilience in an AWS regional outage, and PCI DSS scope creep

2026-02-1315:49

This week on Ship It Weekly, Brian hits four stories where the guardrails become the incident.GitHub had “Too Many Requests” caused by legacy abuse protections that outlived their moment. Takeaway: controls need owners, visibility, and a retirement plan.Kubernetes has a nasty edge case where nodes/proxy GET can turn into command execution via WebSocket behavior. If you’ve ever handed out “telemetry” RBAC broadly, go audit it.HashiCorp shared how HCP Vault handled a real AWS regional disruption: control plane wobbled, Dedicated data planes kept serving. Control plane vs data plane separation paying off.AWS expanded its PCI DSS compliance package with more services and the Asia Pacific (Taipei) region. Scope changes don’t break prod today, but they turn into evidence churn later if you don’t standardize proof.Human story: “reasonable assurance” turning into busywork.LinksGitHub: When protections outlive their purpose (legacy defenses + lifecycle)https://github.blog/engineering/infrastructure/when-protections-outlive-their-purpose-a-lesson-on-managing-defense-systems-at-scale/Kubernetes nodes/proxy GET → RCE (analysis)https://grahamhelton.com/blog/nodes-proxy-rceOpenFaaS guidance / mitigation noteshttps://www.openfaas.com/blog/kubernetes-node-proxy-rce/HCP Vault resilience during real AWS regional outageshttps://www.hashicorp.com/blog/how-resilient-is-hcp-vault-during-real-aws-regional-outagesAWS: Fall 2025 PCI DSS compliance package updatehttps://aws.amazon.com/blogs/security/fall-2025-pci-dss-compliance-package-available-now/GitHub Actions: self-hosted runner minimum version enforcement extendedhttps://github.blog/changelog/2026-02-05-github-actions-self-hosted-runner-minimum-version-enforcement-extended/Headlamp in 2025: Project Highlights (SIG UI)https://kubernetes.io/blog/2026/01/22/headlamp-in-2025-project-highlights/AWS Network Firewall Active Threat Defense (MadPot)https://aws.amazon.com/blogs/security/real-time-malware-defense-leveraging-aws-network-firewall-active-threat-defense/Reasonable assurance turning into busywork (r/sre)https://www.reddit.com/r/sre/comments/1qvwbgf/at_what_point_does_reasonable_assurance_turn_into/More episodes + details: https://shipitweekly.fm

Azure VM Control Plane Outage, GitHub Agent HQ (Claude + Codex), Claude Opus 4.6, Gemini CLI, MCP

2026-02-0620:53

This week on Ship It Weekly, Brian hits four “control plane + trust boundary” stories where the glue layer becomes the incident.Azure had a platform incident that impacted VM management operations across multiple regions. Your app can be up, but ops is degraded.GitHub is pushing Agent HQ (Claude + Codex in the repo/CI flow), and Actions added a case() function so workflow logic is less brittle.MCP is becoming platform plumbing: Miro launched an MCP server and Kong launched an MCP Registry.LinksAzure status incident (VM service management issues) https://azure.status.microsoft/en-us/status/history/?trackingId=FNJ8-VQZGitHub Agent HQ: Claude + Codex https://github.blog/news-insights/company-news/pick-your-agent-use-claude-and-codex-on-agent-hq/GitHub Actions update (case() function) https://github.blog/changelog/2026-01-29-github-actions-smarter-editing-clearer-debugging-and-a-new-case-function/Claude Opus 4.6 https://www.anthropic.com/news/claude-opus-4-6How Google SREs use Gemini CLI https://cloud.google.com/blog/topics/developers-practitioners/how-google-sres-use-gemini-cli-to-solve-real-world-outagesMiro MCP server announcement https://www.businesswire.com/news/home/20260202411670/en/Miro-Launches-MCP-Server-to-Connect-Visual-Collaboration-With-AI-Coding-ToolsKong MCP Registry announcement https://konghq.com/company/press-room/press-release/kong-introduces-mcp-registryGitHub Actions hosted runners incident thread https://github.com/orgs/community/discussions/186184DockerDash / Ask Gordon research https://noma.security/blog/dockerdash-two-attack-paths-one-ai-supply-chain-crisis/Terraform 1.15 alpha https://github.com/hashicorp/terraform/releases/tag/v1.15.0-alpha20260204Wiz Moltbook write-up https://www.wiz.io/blog/exposed-moltbook-database-reveals-millions-of-api-keysChainguard “EmeritOSS” https://www.chainguard.dev/unchained/introducing-chainguard-emeritossMore episodes + details: https://shipitweekly.fm

CodeBreach in AWS CodeBuild, Bazel TLS Certificate Expiry Breaks Builds, Helm Charts Reliability Audit, and New n8n Sandbox Escape RCE

2026-01-3018:39

This week on Ship It Weekly, Brian looks at four “glue failures” that can turn into real outages and real security risk.We start with CodeBreach: AWS disclosed a CodeBuild webhook filter misconfig in a small set of AWS-managed repos. The takeaway is simple: CI trigger logic is part of your security boundary now.Next is the Bazel TLS cert expiry incident. Cert failures are a binary cliff, and “auto renew” is only one link in the chain.Third is Helm chart reliability. Prequel reviewed 105 charts and found a lot of demo-friendly defaults that don’t hold up under real load, rollouts, or node drains.Fourth is n8n. Two new high-severity flaws disclosed by JFrog. “Authenticated” still matters because workflow authoring is basically code execution, and these tools sit next to your secrets.Lightning round: Fence, HashiCorp agent-skills, marimo, and a cautionary agent-loop story.LinksAWS CodeBreach bulletin https://aws.amazon.com/security/security-bulletins/2026-002-AWS/ Wiz research https://www.wiz.io/blog/wiz-research-codebreach-vulnerability-aws-codebuild Bazel postmortem https://blog.bazel.build/2026/01/16/ssl-cert-expiry.html Helm report https://www.prequel.dev/blog-post/the-real-state-of-helm-chart-reliability-2025-hidden-risks-in-100-open-source-charts n8n coverage https://thehackernews.com/2026/01/two-high-severity-n8n-flaws-allow.html Fence https://github.com/Use-Tusk/fence agent-skills https://github.com/hashicorp/agent-skills marimo https://marimo.io/ Agent loop story https://www.theregister.com/2026/01/27/ralph_wiggum_claude_loops/ Related n8n episodes: https://www.tellerstech.com/ship-it-weekly/n8n-critical-cve-cve-2026-21858-aws-gpu-capacity-blocks-price-hike-netflix-temporal/ https://www.tellerstech.com/ship-it-weekly/n8n-auth-rce-cve-2026-21877-github-artifact-permissions-and-aws-devops-agent-lessons/More episodes + details: https://shipitweekly.fm

Ship It Conversations: AI Automation for SMBs: What to Automate (And What Not To) (with Austin Reed)

2026-01-2724:56

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).In this Ship It: Conversations episode I talk with Austin Reed from horizon.dev about AI and automation for small and mid-sized businesses, and what actually works once you leave the demo world.We get into the most common automation wins he sees (sales and customer service), why a lot of projects fail due to communication and unclear specs more than the tech, and the trap of thinking “AI makes it cheap.” Austin shares how they push teams toward quick wins first, then iterate with prototypes so you don’t spend $10k automating a thing that never even happens.We also talk guardrails: when “human-in-the-loop” makes sense, what he avoids automating (finance-heavy logic, HIPAA/medical, government), and why the goal is usually leverage, not replacing people. On the dev side, we nerd out a bit on the tooling they’re using day to day: GPT and Claude, Cursor, PR review help, CI/CD workflows, and why knowing how to architect and validate output matters way more than people think.If you’re a DevOps/SRE type helping the business “do AI,” or you’re just tired of automation hype that ignores real constraints like credentials, scope creep, and operational risk, this one is very much about the practical middle ground.Links from the episode:Austin on LinkedIn: https://www.linkedin.com/in/automationsexpert/horizon.dev: horizon.devYouTube: https://www.youtube.com/@horizonsoftwaredevSkool: https://www.skool.com/automation-mastersIf you found this useful, share it with the person on your team who keeps saying “we should automate that” but hasn’t dealt with the messy parts yet.More information on our website: https://shipitweekly.fm

curl Shuts Down Bug Bounties Due to AI Slop, AWS RDS Blue/Green Cuts Switchover Downtime to ~5 Seconds, and Amazon ECR Adds Cross-Repository Layer Sharing

2026-01-2415:39

This week on Ship It Weekly, Brian looks at three different versions of the same problem: systems are getting faster, but human attention is still the bottleneck.We start with curl shutting down their bug bounty program after getting flooded with low-quality “AI slop” reports. It’s not a “security vs maintainers” story, it’s an incentives and signal-to-noise story. When the cost to generate reports goes to zero, you basically DoS the people doing triage.Next, AWS improved RDS Blue/Green Deployments to cut writer switchover downtime to typically ~5 seconds or less (single-region). That’s a big deal, but “fast switchover” doesn’t automatically mean “safe upgrade.” Your connection pooling, retries, and app behavior still decide whether it’s a blip or a cascade.Third, Amazon ECR added cross-repository layer sharing. Sounds small, but if you’ve got a lot of repos and you’re constantly rebuilding/pushing the same base layers, this can reduce storage duplication and speed up pushes in real fleets.Lightning round covers a practical Kubernetes clientcmd write-up, a solid “robust Helm charts” post, a traceroute-on-steroids style tool, and Docker Kanvas as another signal that vendors are trying to make “local-to-cloud” workflows feel less painful.We wrap with Honeycomb’s interim report on their extended EU outage, and the part that always hits hardest in long incidents: managing engineer energy and coordination over multiple days is a first-class reliability concern.Links from this episodecurl bug bounties shutdown https://github.com/curl/curl/pull/20312RDS Blue/Green faster switchover https://aws.amazon.com/about-aws/whats-new/2026/01/amazon-rds-blue-green-deployments-reduces-downtime/ECR cross-repo layer sharing https://aws.amazon.com/about-aws/whats-new/2026/01/amazon-ecr-cross-repository-layer-sharing/Kubernetes clientcmd apiserver access https://kubernetes.io/blog/2026/01/19/clientcmd-apiserver-access/Building robust Helm charts https://www.willmunn.xyz/devops/helm/kubernetes/2026/01/17/building-robust-helm-charts.htmlttl tool https://github.com/lance0/ttlDocker Kanvas (InfoQ) https://www.infoq.com/news/2026/01/docker-kanvas-cloud-deployment/Honeycomb EU interim report https://status.honeycomb.io/incidents/pjzh0mtqw3vtSRE Weekly issue #504 https://sreweekly.com/sre-weekly-issue-504/More episodes + details: https://shipitweekly.fm

n8n Auth RCE (CVE-2026-21877), GitHub Artifact Permissions, and AWS DevOps Agent Lessons

2026-01-1612:28

This week on Ship It Weekly, the theme is simple: the automation layer has become a control plane, and that changes how you should think about risk.We start with n8n’s latest critical vulnerability, CVE-2026-21877. This one is different from the unauth “Ni8mare” issue we covered in Episode 12. It’s authenticated RCE, which means the real question isn’t only “is it internet exposed,” it’s who can log in, who can create or modify workflows, and what those workflows can reach. Takeaway: treat workflow automation tools like CI systems. They run code, they hold credentials, and they can pivot into real infrastructure.Next is GitHub’s new fine-grained permission for artifact metadata. Small change, big least-privilege implications for Actions workflows. It’s also a good forcing function to clean up permission sprawl across repos.Third is AWS’s DevOps Agent story, and the best part is that it’s not hype. It’s a real look at what it takes to operationalize agents: evaluation, observability into tool calls/decisions, and control loops with brakes and approvals. Prototype is cheap. Reliability is the work.Lightning round: GitHub secret scanning changes that can quietly impact governance, a punchy Claude Code “guardrails aren’t guaranteed” reminder, Block’s Goose as another example of agent workflows getting productized, and OpenCode as an “agent runner” pattern worth watching if you’re experimenting locally.Linksn8n CVE-2026-21877 (authenticated RCE) https://thehackernews.com/2026/01/n8n-warns-of-cvss-100-rce-vulnerability.html?m=1Episode 12 (n8n “Ni8mare” / CVE-2026-21858) https://www.tellerstech.com/ship-it-weekly/n8n-critical-cve-cve-2026-21858-aws-gpu-capacity-blocks-price-hike-netflix-temporal/GitHub: fine-grained permission for artifact metadata (GA) https://github.blog/changelog/2026-01-13-new-fine-grained-permission-for-artifact-metadata-is-now-generally-available/GitHub secret scanning: extended metadata auto-enabled (Feb 18) https://github.blog/changelog/2026-01-15-secret-scanning-extended-metadata-to-be-automatically-enabled-for-certain-repositories/Claude Code issue thread (Bedrock guardrails gap) https://github.com/anthropics/claude-code/issues/17118Block Goose (tutorial + sessions/context) https://block.github.io/goose/docs/tutorials/rpi https://block.github.io/goose/docs/guides/sessions/smart-context-managementOpenCode https://opencode.aiMore episodes + details: https://shipitweekly.fm

Ship It Conversations: Human-in-the-Loop Fixer Bots and AI Guardrails in CI/CD (with Gracious James)

2026-01-1222:02

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).In this Ship It: Conversations episode I talk with Gracious James Eluvathingal about TARS, his “human-in-the-loop” fixer bot wired into CI/CD.We get into why he built it in the first place, how he stitches together n8n, GitHub, SSH, and guardrailed commands, and what it actually looks like when an AI agent helps with incident response without being allowed to nuke prod. We also dig into rollback phases, where humans stay in the loop, and why validating every LLM output before acting on it is the single most important guardrail.If you’re curious about AI agents in pipelines but hate the idea of a fully autonomous “ops bot,” this one is very much about the middle ground: segmenting workflows, limiting blast radius, and using agents to reduce toil instead of replace engineers.Gracious also walks through where he’d like to take TARS next (Terraform, infra-level decisions, more tools) and gives some solid advice for teams who want to experiment with agents in CI/CD without starting with “let’s give it root and see what happens.”Links from the episode:Gracious on LinkedIn: https://www.linkedin.com/in/gracious-james-eluvathingalTARS overview post: https://www.linkedin.com/posts/gracious-james-eluvathingal_aiagents-devops-automation-activity-7391064503892987904-psQ4If you found this useful, share it with the person on your team who’s poking at AI automation and worrying about guardrails.More information on our website: https://shipitweekly.fm

n8n Critical CVE (CVE-2026-21858), AWS GPU Capacity Blocks Price Hike, Netflix Temporal

2026-01-0916:18

This week on Ship It Weekly, Brian’s theme is basically: the “automation layer” is not a side tool anymore. It’s part of your perimeter, part of your reliability story, and sometimes part of your budget problem too.We start with the n8n security issue. A lot of teams use n8n as glue for ops workflows, which means it tends to collect credentials and touch real systems. When something like this drops, the right move is to treat it like production-adjacent infra: patch fast, restrict exposure, and assume anything stored in the tool is high value.Next is AWS quietly raising prices on EC2 Capacity Blocks for ML. Even if you’re not a GPU-heavy shop, it’s a useful signal: scarce compute behaves like a market. If you do rely on scheduled GPU capacity, it’s time to revisit forecasts and make sure your FinOps tripwires catch rate changes before the end-of-month surprise.Third is Netflix’s write-up on using Temporal for reliable cloud operations. The best takeaway is not “go adopt Temporal tomorrow.” It’s the pattern: long-running operational workflows should be resumable, observable, and safe to retry. If your critical ops are still bash scripts and brittle pipelines, you’re one transient failure away from a very dumb day.In the lightning round: Kubernetes Dashboard getting archived and the “ops dependencies die” reality check, Docker pushing hardened images as a safer baseline and Pipedash.LinksSRE Weekly issue 504 (source roundup) https://sreweekly.com/sre-weekly-issue-504/n8n CVE (NVD) https://nvd.nist.gov/vuln/detail/CVE-2026-21858n8n community advisory https://community.n8n.io/t/security-advisory-security-vulnerability-in-n8n-versions-1-65-1-120-4/247305AWS price increase coverage (The Register) https://www.theregister.com/2026/01/05/aws_price_increase/Netflix: Temporal powering reliable cloud operations https://netflixtechblog.com/how-temporal-powers-reliable-cloud-operations-at-netflix-73c69ccb5953Kubernetes SIG-UI thread (Dashboard archiving) https://groups.google.com/g/kubernetes-sig-ui/c/vpYIRDMysek/m/wd2iedUKDwAJKubernetes Dashboard repo (archived) https://github.com/kubernetes/dashboardPipedash https://github.com/hcavarsan/pipedashDocker Hardened Images https://www.docker.com/blog/docker-hardened-images-for-every-developer/More episodes and more details on this episode can be found on our website: https://shipitweekly.fm

Ship It Conversations: Backstage vs Internal IDPs, and Why DevEx Muscle Matters (with Danny Teller)

2026-01-0626:29

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).I sat down with Danny Teller, a DevOps Architect and Tech Lead Manager at Tipalti, to talk about internal developer platforms and the reality behind “just set up a developer portal.” We get into Backstage versus internal IDPs, why adoption is the real battle, and why platform/DevEx maturity matters more than whatever tool you pick.What we coveredBackstage vs internal IDPs Backstage is a solid starting point for a developer portal, but it doesn’t magically create standards, ownership, or platform maturity. We talk about when Backstage fits, and when teams end up building internal tooling anyway.DevEx muscle (the make-or-break) Danny’s take: the portal UI is the easy part. The hard part is the ongoing work that makes it useful: paved roads, sane defaults, support, and keeping the catalog/data accurate so engineers trust it.Where teams get burned Common failure mode: teams ship a portal first, then realize they don’t have the resourcing, ownership, or workflows behind it. Adoption fades fast if the portal doesn’t remove real friction.A build vs buy gut check We walk through practical signals that push you toward open source Backstage, a managed Backstage offering, or a commercial portal. We also hit the maintenance trap: if you build too much, you’ve created a second product.Links and resources Danny Teller's Linkedin: https://www.linkedin.com/in/danny-teller/matlas — one CLI for Atlas and MongoDB: https://github.com/teabranch/matlas-cliBackstage: https://backstage.io/ Roadie (managed Backstage): https://roadie.io/ Port: https://www.port.io/ Cortex: https://www.cortex.io/ OpsLevel: https://www.opslevel.com/ Atlassian Compass: https://www.atlassian.com/software/compass Humanitec Platform Orchestrator: https://humanitec.com/products/platform-orchestrator Northflank: https://northflank.com/If you enjoyed this episode Ship It Weekly is still the weekly news recap, and I’m dropping these guest convos in between. Follow/subscribe so you catch both, and if this was useful, share it with a platform/devex friend and leave a quick rating or review. It helps more than it should.Visit our website at https://www.shipitweekly.fm

Fail Small, IaC Control Planes, and Automated RCA

2026-01-0317:45

This week on Ship It Weekly, Brian kicks off the new year with one theme: automation is getting faster, and that makes blast radius and oversight matter more than ever.We start with Cloudflare’s “fail small” mindset. The core idea is simple: big outages usually come from correlated failure, not one box dying. If a bad change lands everywhere at once, you’re toast. “Fail small” is about forcing problems to stay local so you can stop the bleeding before it becomes global.Next is Pulumi’s push to be the control plane for all your IaC, including Terraform and HCL. The interesting part isn’t syntax wars. It’s the workflow layer: approvals, policy enforcement, audit trails, drift, and how teams standardize without signing up for a multi-year rewrite.Third is Meta’s DrP, a root cause analysis platform that turns repeated incident investigation steps into software. Even if you’re not Meta, the pattern is worth stealing: automate the first 10–15 minutes of your most common incident types so on-call is consistent no matter who’s holding the pager.In the lightning round: a follow-up on GitHub Actions direction (and a quick callback to Episode 6’s runner pricing pause), AWS ECR creating repos on push, a smarter take on incident metrics, Terraform drift visibility, and parallel “coding agent” workflows.We wrap with a human reminder about the ironies of automation: automation doesn’t remove responsibility, it moves it. Faster systems require better brakes, better observability, and easier rollback.Links from this episodeSRE Weekly issue 503 (source roundup - CloudFlare) https://sreweekly.com/sre-weekly-issue-503/Pulumi: all IaC, including Terraform and HCL https://www.pulumi.com/blog/all-iac-including-terraform-and-hcl/Meta DrP: https://engineering.fb.com/2025/12/19/data-infrastructure/drp-metas-root-cause-analysis-platform-at-scale/GitHub Actions: “Let’s talk about GitHub Actions” https://github.blog/news-insights/product-news/lets-talk-about-github-actions/Episode 6 (GitHub runner pricing pause, Terraform Cloud limits, AI in CI) https://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/AWS ECR: create repositories on push https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-ecr-creating-repositories-on-push/DriftHound https://drifthound.io/Superset https://superset.sh/More episodes + contact info, and more details on this episode can be found on our website: https://shipitweekly.fm

Ship It Conversations: From Full-Stack to Cloud/DevOps, One Project at a Time (with Eric Paatey)

2025-12-3023:25

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).I sat down with Eric Paatey, a Cloud & DevOps Engineer who’s been transitioning from full-stack web development into cloud/devops, and building real skills through hands-on projects instead of just collecting tools and buzzwords.We talk about what that transition actually feels like, what’s helped most, and why you don’t need a rack of servers to learn DevOps.What we covered Eric’s path into DevOps How he moved from building web apps to caring about pipelines, infra, scalability, reliability, and automation. The “oh… code is only part of the job” moment that pushes a lot of people toward DevOps.The WHY behind DevOps Eric’s take: DevOps is mainly about breaking down silos and improving communication between dev, ops, security, and the business. We also hit the idea from The DevOps Handbook: small batches win. The bigger the release, the harder it is to recover when something breaks.Leveling up without drowning in tools DevOps has an endless tool list, so we talked about how to stay current without burning out. Eric’s recommendation: stay connected to the industry. Meet people, join user groups, go to events, and don’t silo yourself.The homelab mindset (and why simple is fine) Eric shared his “homelab on the go” setup and why the hardware isn’t the point. It’s about using a safe environment to build habits: automation, debugging, systems thinking, monitoring, breaking things, recovering, and improving the design.A practical first project for aspiring DevOps engineers We talked through a starter project you can actually show in interviews: Dockerize a simple app, deploy it behind an ALB, and learn basic networking/security along the way. You don’t need to understand everything on day one, but you do need to build things and learn what breaks.Agentic AI and guardrails We also touched on AI agents and MCPs, what they could mean for ops teams, and why you should not give agents full access to anything. Least privilege and policy guardrails matter, because “non-deterministic” and “prod permissions” is a scary combo.Links and resources Eric Paatey on LinkedIn: https://www.linkedin.com/in/eric-paatey-72a87799/Eric’s website/portfolio: https://ericpaatey.com/If you enjoyed this episode Ship It Weekly is still the weekly news recap, and I’m dropping these guest convos in between. Follow/subscribe so you catch both, and if this was useful, share it with a coworker or your on-call buddy and leave a quick rating or review. It helps more than it should.Visit our website at https://www.shipitweekly.fm

Cloudflare’s Workers Scheduler, AWS DBs on Vercel, and JIT Admin Access

2025-12-2715:39

This week on Ship It Weekly, Brian looks at real platform engineering in the wild.We start with Cloudflare’s write-up on building an internal maintenance scheduler on Workers. It’s not marketing fluff. It’s “we hit memory limits, changed the model, and stopped pulling giant datasets into the runtime.”Next up: AWS databases are now available inside the Vercel Marketplace. This is a quiet shift with loud consequences. Devs can click-button real AWS databases from the same place they deploy apps, and platform teams still own the guardrails: account sprawl, billing/tagging, audit trails, region choices, and networking posture.Third story: TEAM (Temporary Elevated Access Management) for IAM Identity Center. Time-bound elevation with approvals, automatic expiry, and auditing. We cover how this fits alongside break-glass and why auto-expiry is the difference between least-privilege and privilege creep.Lightning round: GitHub Actions workflow page performance improvements, Lambda Managed Instances (slightly cursed but interesting), a quick atmos tooling blip, and k8sdiagram.fun for explaining k8s to humans.We close with Marc Brooker’s “What Now? Handling Errors in Large Systems” and the takeaway: error handling isn’t a local code decision, it’s architecture. Crashing vs retrying vs continuing only makes sense when you understand correlation and blast radius.shipitweekly.fm has links + the contact email. Want to be a guest? Reach out. And if you’re enjoying the show, follow/subscribe and leave a quick rating or review. It helps a ton.Links from this episodeCloudflare https://blog.cloudflare.com/building-our-maintenance-scheduler-on-workers/ AWS on Vercel https://aws.amazon.com/about-aws/whats-new/2025/12/aws-databases-are-available-on-the-vercel/ https://vercel.com/changelog/aws-databases-now-available-on-the-vercel-marketplace TEAM https://aws-samples.github.io/iam-identity-center-team/ https://github.com/aws-samples/iam-identity-center-team GitHub Actions https://github.blog/changelog/2025-12-22-improved-performance-for-github-actions-workflows-page/ Lambda Managed Instances https://docs.aws.amazon.com/lambda/latest/dg/lambda-managed-instances.html Atmos https://github.com/cloudposse/atmos/issues k8sdiagram.fun https://k8sdiagram.fun/ Marc Brooker https://brooker.co.za/blog/2025/11/20/what-now.html

Ship It Conversations: The WHY Behind DevOps, Upskilling, and Agentic AI (with Maz Islam)

2025-12-2130:38

This is a Ship It Weekly conversation episode. The weekly news recaps are still weekly. These interviews drop in between when I find someone worth talking to and the convo feels useful.In this episode I’m joined by Mazharul “Maz” Islam (DevOps with Maz). Maz is a UK-based DevOps Engineer who shares practical, real-world DevOps content on YouTube and LinkedIn. We talk about the stuff that actually matters when you’re building systems, running infrastructure, owning reliability, and living in on-call.We hit three big things: the importance of understanding the WHY behind DevOps (not just the tools), how to upskill and keep up with the industry without burning out, and what the agentic AI era might look like for DevOps, SRE, and platform engineering teams. We also touch on MCPs and AI agents, and what “leveling up” looks like for companies that want to move faster without breaking everything.If you’re into DevOps culture, SRE practices, platform engineering, CI/CD, infrastructure automation, and how teams should think about reliability and security as things keep changing, this one should land.Guest Mazharul Islam (DevOps with Maz) UK-based DevOps Engineer. Posts a lot of hands-on content around cloud, DevOps fundamentals, and leveling up as an engineer.Links (Maz) YouTube: https://m.youtube.com/@devopswithmaz LinkedIn: https://www.linkedin.com/in/mazharul419Topics we covered WHY behind DevOps, and why “tools” is the smallest part of it DevOps fundamentals vs tool-chasing Upskilling strategies for DevOps Engineers and SREs How to keep learning cloud and automation without drowning What strong teams measure and what “good” actually looks like (delivery, reliability, feedback loops) Agentic AI, AI agents in operations, and the next era of DevOps MCPs, automation guardrails, and safe ways to scale change How companies can “level up” their engineering org without turning it into chaosWe also discussed the previous episode of Ship It Weekly - GitHub Runner Pricing Pause, Terraform Cloud Limits, and AI in CIhttps://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/Book Maz recommended The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations (Paperback, Oct 6, 2016) Gene Kim, Jez Humble, Patrick Debois, John WillisAbout Ship It Weekly (format) Ship It Weekly is for people running infrastructure and owning reliability. Most episodes are quick weekly news recaps for DevOps, SRE, and platform engineering. In between those weekly drops, I’ll publish interview episodes like this one.Subscribe / help the show If you want the weekly DevOps news recaps plus these interviews, hit follow or subscribe in your podcast app. And if you’re feeling generous, leave a rating or review and share this episode with a coworker (especially your on-call buddy). That stuff genuinely helps the show get discovered.

GitHub Runner Pricing Pause, Terraform Cloud Limits, and AI in CI

2025-12-2012:06

This week on Ship It Weekly, Brian looks at how the “platform tax” is showing up everywhere: pricing model shifts, CI dependencies, and new security boundaries thanks to AI agents.We start with GitHub Actions. GitHub announced a new “cloud platform” charge for self-hosted runners in private/internal repos… then hit pause after backlash. Hosted runner price reductions for 2026 are still planned. We also got the perfect timing joke: a GitHub incident the same week.Next up is HashiCorp. Legacy HCP Terraform (Terraform Cloud) Free is reaching end-of-life in 2026, with orgs moving to the newer Free tier capped at 500 managed resources. If you’re running real infrastructure, this is a good moment to audit what you’re actually managing and decide whether you’re cleaning up, paying, or planning a migration.Then we talk PromptPwnd: why stuffing untrusted PR/issue text into AI agent prompts (inside CI) can turn into a supply chain/security problem. The short version: treat AI inputs like hostile user input, keep tokens/permissions minimal, and don’t let agents “run with scissors.”We also cover the Home Depot report about long-lived access exposure as a reminder that secrets hygiene, blast radius, and detection still matter more than the shiny tools.In the lightning round: CDKTF is sunset/archived, Bitbucket is cleaning up free unused workspaces, and SourceHut is proposing pricing changes. We wrap with a human note on “platform whiplash” and why a simple watchlist beats carrying all this stuff in your head.Links from this episodeGitHub Actions pricing + pause https://runs-on.com/blog/github-self-hosted-runner-fee-2026/ https://x.com/github/status/2001372894882918548 https://www.githubstatus.com/incidents/x696x0g4t85lHashiCorp / Terraform Cloud free plan changes https://github.com/hashicorp/terraform-cdk?tab=readme-ov-file#sunset-notice https://www.reddit.com/r/Terraform/s/slYm77wzYrPromptPwnd / AI agents in CI https://www.aikido.dev/blog/promptpwnd-github-actions-ai-agentsHome Depot access exposure report https://techcrunch.com/2025/12/12/home-depot-exposed-access-to-internal-systems-for-a-year-says-researcher/Bitbucket cleanup https://community.atlassian.com/forums/Bitbucket-articles/Bitbucket-cleanup-of-free-unused-workspaces-what-you-need-to/ba-p/3144063SourceHut pricing proposal https://sourcehut.org/blog/2025-12-01-proposed-pricing-changes/

IBM Buys Confluent, React2Shell, and Netflix on Aurora

2025-12-1216:14

In this episode of Ship It Weekly, Brian powers through a cold and digs into a very “infra grown-up” week in DevOps.First up, IBM is buying Confluent for $11B. We talk about what that means if you’re on Confluent Cloud today, still running your own Kafka, or trying to choose between Confluent, MSK, and DIY. It’s part of a bigger pattern after IBM’s HashiCorp deal, and it has real implications for vendor concentration and “plan B” strategies.Then we shift to React2Shell, a 10.0 RCE in React Server Components that’s already being exploited in the wild. Even if you never touch React, if you run platforms or Kubernetes for teams using Next.js or RSC, you’re on the hook for patching windows, WAF rules, and blast-radius thinking.We also look at Netflix’s write-up on consolidating relational databases onto Aurora PostgreSQL, with big performance gains and cost savings. It’s a good excuse to step back and ask whether your own Postgres fleet still makes sense at the scale you’re at now.In the lightning round, we hit OpenTofu 1.11’s new language features, practical Terraform “tips from the trenches,” Ghostty becoming a non-profit project, and two spec-driven dev tools (Spec Kit and OpenSpec) that show what sane AI-assisted development might look like.For the human side, we close with “Your Brain on Incidents” and what high-stress outages actually do to people, plus a few concrete ideas for making on-call less brutal.If you’re on a platform team, own SLOs, or you’re the person people ping when “something is wrong with prod,” this one should give you a mix of immediate to-dos and longer-term questions for your roadmap.Links:IBM + Confluent https://www.confluent.io/blog/ibm-to-acquire-confluent/ https://newsroom.ibm.com/2025-12-08-ibm-to-acquire-confluent-to-create-smart-data-platform-for-enterprise-generative-aiReact2Shell (CVE-2025-55182) https://react.dev/blog/2025/12/03/critical-security-vulnerability-in-react-server-componentsNetflix on Aurora PostgreSQL https://aws.amazon.com/blogs/database/netflix-consolidates-relational-database-infrastructure-on-amazon-aurora-achieving-up-to-75-improved-performance/Tools & tips https://opentofu.org/blog/opentofu-1-11-0/ https://rosesecurity.dev/2025/12/04/terraform-tips-and-tricks.html https://mitchellh.com/writing/ghostty-non-profit https://github.com/github/spec-kit https://github.com/Fission-AI/OpenSpecHuman side https://uptimelabs.io/your-brain-on-incidents/

#box-pro-ellipsis-177224755047392{-webkit-line-clamp:2;}Ship It Weekly - DevOps, SRE, and Platform Engineering News

Ship It Weekly - DevOps, SRE, and Platform Engineering News