Fail Small, IaC Control Planes, and Automated RCA
Description
This week on Ship It Weekly, Brian kicks off the new year with one theme: automation is getting faster, and that makes blast radius and oversight matter more than ever.
We start with Cloudflare’s “fail small” mindset. The core idea is simple: big outages usually come from correlated failure, not one box dying. If a bad change lands everywhere at once, you’re toast. “Fail small” is about forcing problems to stay local so you can stop the bleeding before it becomes global.
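One way to picture "fail small" is a staged rollout that checks health after each stage and stops before the change reaches everyone. The sketch below is illustrative only and is not Cloudflare's implementation; `staged_rollout`, the stage names, and the `apply_change`, `is_healthy`, and `rollback` callbacks are all hypothetical stand-ins for whatever your deploy tooling provides.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class RolloutResult:
    completed_stages: List[str]
    halted_at: Optional[str]  # stage where a health check failed, if any


def staged_rollout(
    stages: Sequence[Tuple[str, Sequence[str]]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> RolloutResult:
    """Apply a change one stage at a time, halting at the first unhealthy stage."""
    completed: List[str] = []
    for stage_name, targets in stages:
        touched: List[str] = []
        for target in targets:
            apply_change(target)
            touched.append(target)
        # Gate: every target in this stage must be healthy before widening.
        if not all(is_healthy(t) for t in touched):
            # Undo only the failing stage, most recent target first.
            for t in reversed(touched):
                rollback(t)
            return RolloutResult(completed, halted_at=stage_name)
        completed.append(stage_name)
    return RolloutResult(completed, halted_at=None)
```

With stages such as canary, one region, then global, a bad change that fails in the regional stage never reaches the global one. This sketch leaves earlier, healthy stages in place; whether to roll those back as well is a policy choice it does not make for you.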
Next is Pulumi’s push to be the control plane for all your IaC, including Terraform and HCL. The interesting part isn’t syntax wars. It’s the workflow layer: approvals, policy enforcement, audit trails, drift, and how teams standardize without signing up for a multi-year rewrite.
Third is Meta’s DrP, a root cause analysis platform that turns repeated incident investigation steps into software. Even if you’re not Meta, the pattern is worth stealing: automate the first 10–15 minutes of your most common incident types so on-call is consistent no matter who’s holding the pager.
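As a rough illustration of "investigation steps as software", the sketch below registers a few checks per incident type and runs them all when a page fires. It is not DrP's API; `playbook`, `run_playbook`, the `high_error_rate` incident type, and both example checks are hypothetical, and the checks return placeholder text where a real one would query your deploy or monitoring systems.

```python
from typing import Callable, Dict, List

Check = Callable[[], str]

# Maps an incident type to the checks on-call would otherwise run by hand.
PLAYBOOKS: Dict[str, List[Check]] = {}


def playbook(incident_type: str) -> Callable[[Check], Check]:
    """Decorator that registers a check under an incident type."""
    def register(check: Check) -> Check:
        PLAYBOOKS.setdefault(incident_type, []).append(check)
        return check
    return register


def run_playbook(incident_type: str) -> List[str]:
    """Run every registered check; a failing check is reported, not fatal."""
    findings: List[str] = []
    for check in PLAYBOOKS.get(incident_type, []):
        try:
            findings.append(f"{check.__name__}: {check()}")
        except Exception as exc:
            findings.append(f"{check.__name__}: check failed ({exc})")
    return findings


@playbook("high_error_rate")
def recent_deploys() -> str:
    # Placeholder: a real check would query your deploy history.
    return "no deploy data source configured"


@playbook("high_error_rate")
def dependency_status() -> str:
    # Simulated failure, to show one broken check does not stop the rest.
    raise RuntimeError("status API unreachable")
```

Calling `run_playbook("high_error_rate")` returns one finding per check in registration order, so every responder starts from the same summary regardless of experience.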
In the lightning round: a follow-up on where GitHub Actions is headed (with a quick callback to Episode 6’s runner pricing pause), AWS ECR creating repositories on push, a smarter take on incident metrics, Terraform drift visibility, and parallel “coding agent” workflows.
We wrap with a human reminder about the ironies of automation: automation doesn’t remove responsibility, it moves it. Faster systems require better brakes, better observability, and easier rollback.
Links from this episode
SRE Weekly issue 503 (source roundup - Cloudflare) https://sreweekly.com/sre-weekly-issue-503/
Pulumi: all IaC, including Terraform and HCL https://www.pulumi.com/blog/all-iac-including-terraform-and-hcl/
GitHub Actions: “Let’s talk about GitHub Actions” https://github.blog/news-insights/product-news/lets-talk-about-github-actions/
Episode 6 (GitHub runner pricing pause, Terraform Cloud limits, AI in CI) https://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/
AWS ECR: create repositories on push https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-ecr-creating-repositories-on-push/
DriftHound https://drifthound.io/
Superset https://superset.sh/
More episodes, contact info, and further details on this episode are on our website: https://shipitweekly.fm