Solving incidents with one-time ephemeral runbooks

Update: 2025-10-20

Description

Episode Sponsor: Attribute - https://dev0ps.fyi/attribute

In the wake of one of the worst AWS incidents in history, we're joined by Lawrence Jones, Founding Engineer at Incident.io. The conversation focuses on the challenges of managing incidents in highly regulated environments like FinTech, where the penalties for downtime are harsh and require a high level of rigor and discipline in the response process. Lawrence details the company's evolution, from running a monolithic Go binary on Heroku to moving to a more secure, robust setup in GCP, prioritizing the use of native security primitives like GCP Secret Manager and Kubernetes to meet the obligations of their growing customer base.

We spotlight exactly how a system can crawl GitHub pull requests, Slack channels, telemetry data, and past incident post-mortems to dynamically generate an ephemeral runbook for the current incident. Also discussed are the technical challenges of using RAG (Retrieval-Augmented Generation). Lawrence notes that they rely heavily on pre-processing data with tags and a service catalog, rather than solely on less consistent vector embeddings, to keep search results fast and accurate during a crisis.
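To make the tag-and-catalog idea concrete, here is a minimal Python sketch of filtering by service catalog tags before ranking. It is not incident.io's implementation. The `Document` type, the `SERVICE_CATALOG` contents, the document IDs, and the word-overlap ranking are all hypothetical stand-ins chosen for illustration. A real system would likely combine a filter like this with embeddings or a proper search index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A pre-processed source item (pull request, Slack thread, post-mortem)."""
    doc_id: str
    kind: str
    text: str
    tags: frozenset


# Hypothetical service catalog: service name -> tags assigned at ingestion time.
SERVICE_CATALOG = {
    "payments-api": {"payments", "postgres"},
    "web-dashboard": {"frontend", "cdn"},
}

# Hypothetical corpus; in practice this would be crawled and tagged ahead of time.
CORPUS = [
    Document("PR-1", "pull_request", "Raise postgres connection pool limit",
             frozenset({"postgres"})),
    Document("PM-7", "post_mortem",
             "Payments latency caused by connection pool exhaustion",
             frozenset({"payments"})),
    Document("PR-2", "pull_request", "Update CDN cache headers for dashboard",
             frozenset({"cdn"})),
]


def candidate_documents(service, corpus):
    """Keep only documents that share at least one tag with the affected service."""
    service_tags = SERVICE_CATALOG.get(service, set())
    return [doc for doc in corpus if doc.tags & service_tags]


def rank_by_overlap(query, docs):
    """Rank candidates by naive word overlap with the incident summary.

    This stands in for whatever relevance scoring a real system uses.
    """
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(d.text.lower().split())), d) for d in docs]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored]


def build_runbook(service, summary, corpus, limit=3):
    """Assemble a one-off runbook from the most relevant tagged documents."""
    hits = rank_by_overlap(summary, candidate_documents(service, corpus))[:limit]
    lines = [f"Ephemeral runbook for {service}: {summary}"]
    lines += [f"- [{d.kind}] {d.doc_id}: {d.text}" for d in hits]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_runbook("payments-api",
                        "connection pool exhaustion on payments", CORPUS))
```

The tag filter runs first, so it shrinks the search space deterministically before any fuzzier scoring happens. That is one plausible reading of why tags and a service catalog would give more consistent results than embeddings alone.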

Finally, Lawrence stresses that frontier models are no longer the limiting factor in building these complex systems; rather, success hinges on building structured, modular systems, and doing the hard work of defining objective metrics for improvement.

💡 Notable Links:

🎯 Picks:

Warren - Anker Adaptable Wall-Charger - PowerPort Atom III
Lawrence - Rocktopus & The Checklist Manifesto

Will Button, Warren Parad
