This is Fine! A podcast about resilience engineering and software
Author: Colette Alexander and Clint Byrum
© Colette Alexander and Clint Byrum
Description
A podcast about resilience engineering and software.
Ever wondered why things on the internet break? Do you work in software and wish that you could have a Dear-Abby-Like call-in show that could answer your deepest questions about how to make your workplace suck less? We're here to help!
Write us anonymously at our open question form
Email us at: thisisfine.softwarepodcast@gmail.com
Call us and leave a voicemail, or text us at: (401) 592-7574
28 Episodes
If you’re feeling like you need to do more to respond to our moment:
Lots of places to donate to in the Twin Cities are listed here: https://mspmag.com/arts-and-culture/general-interest/ice-minnesota-support-immigrant-communities-fundraisers-food-drives-trainings/
You can always find mutual aid networks in your own area, including immigrant aid networks
https://immigrantdefensenetwork.org/ does good work, too
The Hometown Holler podcast with Tressie McMillan Cottom was a wonderful discussion: https://www.youtube.com/watch?v=2gr4mW8aR-g
The Ruth Wilson Gilmore interview that I quoted clumsily is here: https://www.nytimes.com/2019/04/17/magazine/prison-abolition-ruth-wilson-gilmore.html
The paper itself: https://qualitysafety.bmj.com/content/14/2/130.short
If you haven’t seen The Pitt, you should, it’s super good: https://en.wikipedia.org/wiki/The_Pitt
Charles Perrow’s Normal Accidents has more definitions/examples of coupling: https://bookshop.org/p/books/normal-accidents-living-with-high-risk-technologies-updated-edition-professor-charles-perrow/cad38a43fcffa1f8?ean=9780691004129&next=t
Some stuff on microservices and coupling here: https://microservices.io/post/architecture/2023/03/28/microservice-architecture-essentials-loose-coupling.html
Colette’s #notanad endorsement for paper organizing is https://paperpile.com/
Rasmussen’s boundary model comes initially from his paper here: https://www.sciencedirect.com/science/article/abs/pii/S0925753597000520
And if you want a good writeup explaining Rasmussen’s boundary model, you can always read Lorin’s blog: https://surfingcomplexity.blog/2021/05/31/transgressing-the-boundaries-rasmussen-and-woods/
Dr Cook’s talk at Velocity is a classic, and goes over Rasmussen’s boundary model really well: https://www.youtube.com/watch?v=PGLYEDpNu60
Fred does a great job writing about the Law of Stretched Systems and how it applies to his own work on his blog: https://ferd.ca/the-law-of-stretched-cognitive-systems.html
“Plans are nothing, but planning is everything” is a paraphrase of Eisenhower: https://www.presidency.ucsb.edu/documents/remarks-the-national-defense-executive-reserve-conference
Want to chat about this paper with other folks? Come to the RISF live event for a Paper Party! https://resilienceinsoftware.org/events/157553

Seriously though, can’t wait to gtfo of this year.
Palisades fire links:
https://www.nbclosangeles.com/investigations/anonymous-letter-demands-independent-palisades-fire-investigations/3800442/
https://internationalfireandsafetyjournal.com/palisades-fire-report/
https://www.latimes.com/california/story/2025-12-20/lafd-report-on-palisades-fire-was-watered-down-in-editing-process-records-show
Corey Quinn’s commentary on the AWS outage in October is here: https://www.theregister.com/2025/10/20/aws_outage_amazon_brain_drain_corey_quinn/
Time to reset the clock on how many episodes it’s been since we’ve mentioned the Ironies of Automation: https://ckrybus.com/static/papers/Bainbridge_1983_Automatica.pdf
Also on Rasmussen’s Boundary Model, which Lorin does a great write up on: https://surfingcomplexity.blog/2021/05/31/transgressing-the-boundaries-rasmussen-and-woods/
Lorin’s Law is our favorite law: https://surfingcomplexity.blog/2017/06/24/a-conjecture-on-why-reliable-systems-fail/
You can ask us questions or write to us using our form linked from our website: thisisfinepod.com
Resilience in Software Foundation is at resilienceinsoftware.org

Mentioned multiple times, Em Ruppe’s amazing talk on incident severity: https://www.usenix.org/conference/srecon24americas/presentation/ruppe
We talk about the RIS Slack sometimes - you can join us in the Slack by joining the Foundation here: https://resilienceinsoftware.org/
Please ask us a question at thisisfinepod.com

The writeup on the AWS outage from AWS themselves, if you haven’t seen it: https://aws.amazon.com/message/101925/
Dave’s department at OSU, Cognitive Systems Engineering: https://ise.osu.edu/human-systems-integration/cognitive-systems-engineering is a part of the larger Integrated Systems Engineering school: https://ise.osu.edu/human-systems-integration
Dave was talking early on about the discussion on the war on expertise, it was this webinar through the NDM association: https://vimeo.com/1129606494?fl=pl&fe=sh&mc_cid=c807a504fb
Dave was a part of the Paul Feltovich got a shout out - he wrote a lot, but one of the best is with Gary Klein on Common Ground and Coordination in Joint Activity: https://www.academia.edu/download/31764257/Common_Ground_Single.pdf
And Studies of Expertise from Psychological Perspectives: https://www.researchgate.net/profile/Paul-J-Feltovich/publication/200772882_Studies_of_expertise_from_psychological_perspectives/links/58bd18b2aca27261e528de07/Studies-of-Expertise-from-Psychological-Perspectives.pdf
Dave mentions his “Command-Adapt Paradox chapter” - you can find that here: https://library.oapen.org/bitstream/handle/20.500.12657/88327/1/978-3-031-45055-6.pdf#page=77
Shout out to Norbert Wiener, the godfather of cybernetics: https://www.jstor.org/stable/24945913
For just two studies on how private equity in hospitals causes worse outcomes for patients you can see: https://hsph.harvard.edu/news/private-equitys-appetite-for-hospitals-may-put-patients-at-risk/
And https://www.sciencedirect.com/science/article/pii/S0304405X25001151
Dave talks a bit about saturation and crossing boundaries towards failure - it’s worth familiarizing yourself with Rasmussen’s boundary model - Lorin Hochstein writes a good summary over at his blog: https://surfingcomplexity.blog/2021/05/31/transgressing-the-boundaries-rasmussen-and-woods/
Dave also mentions graceful extensibility - this is a concept he’s written quite a bit about, you can start here: https://link.springer.com/article/10.1007/s10669-018-9708-3
Shout out to Slight Reliability: https://slightreliability.com/
One of the great Woods/Cook write ups on anticipation in anesthesiology: https://www.sciencedirect.com/science/article/pii/S0952818096900094
In case you’re unfamiliar with the Chicago Seven: https://en.wikipedia.org/wiki/Chicago_Seven
The Messy 9 are:
congestion
cascade
conflict
lag
saturation
friction
tempo
surprise
tangles
Keep an eye on the merch store over at https://www.bonfire.com/store/risf/ if you want the t-shirt.

It’s Spamton G (not J) Spamton, Clint! Get hip to the game characters! https://deltarune.fandom.com/wiki/Spamton
There are a couple of incident command trainers out there who tend to get recommended in the tech world (that we know of): https://www.blackrock3.com/ and Great Circle: https://greatcircle.com/im/

A history of the 5 whys and root cause analysis from papers
Some critiques of the 5 whys:
From John Allspaw: https://www.oreilly.com/radar/the-infinite-hows/
From Alan J Card: https://qualitysafety.bmj.com/content/26/8/671
James Reason and the Swiss Cheese Model: https://pmc.ncbi.nlm.nih.gov/articles/PMC8514562/
James Reason’s book Human Error: https://bookshop.org/p/books/human-error/9e06d8a100a07537?ean=9780521314190&next=t
And a classic from Sidney Dekker (et al.) on the implication of complexity within safety investigations: https://www.sciencedirect.com/science/article/abs/pii/S0925753511000105?via%3Dihub
We always recommend the Howie Guide: https://howie-guide.pagerduty.com/
STAMP is starting to get popular: https://functionalsafetyengineer.com/introduction-to-stamp/
Google’s STAMP paper: https://www.usenix.org/publications/loginonline/evolution-sre-google
Google’s STAMP discussion on ProdCast: https://sre.google/prodcast/#season4-episode7
And presentation at SRECon: https://www.usenix.org/conference/srecon25americas/presentation/klein
Nancy Leveson’s Google Scholar is always worth browsing: https://scholar.google.com/citations?user=78y4sEcAAAAJ&hl=en
Allspaw’s LinkedIn post that we quoted: https://www.linkedin.com/posts/jallspaw_important-reminders-about-learning-effectively-activity-7378775591447183360-c_eD
Lorin’s Law: https://surfingcomplexity.blog/2017/06/24/a-conjecture-on-why-reliable-systems-fail/
Want to talk more about this subject? We’re doing a live event co-sponsored by RISF and you can sign up for it here: https://resilienceinsoftware.org/networks/events/146485

More robustness than resilience, but worth repeating that you should always check your earthquake go-bag: https://www.earthquakeauthority.com/blog/2019/how-to-make-an-earthquake-emergency-kit
Clint did ASA 103: https://americansailing.com/learn-to-sail/certifications/asa-103-coastal-cruising/
Since this is a science podcast, there is a scientific reason people get emotional on airplanes: https://www.cntraveler.com/story/why-do-we-always-cry-on-planes
52 Hertz Whale documentary: https://en.wikipedia.org/wiki/The_Loneliest_Whale:_The_Search_for_52
And Leslie Jamison wrote 52 Blue as a chapter in one of her essay collections (you can read it excerpted here: https://slate.com/technology/2014/08/52-blue-the-loneliest-whale-in-the-world.html )
Colette was wrong, Jamison referenced a famous Kathryn Schulz piece in one of her own essays, which was the source of confusion - The Big One: https://www.newyorker.com/magazine/2015/07/20/the-really-big-one about a cataclysmic earthquake on the west coast.
In case you’re curious, Colette uses scholar.google.com and paperpile.com shamelessly live.
We reference A Tale of Two Stories: Contrasting Views of Patient Safety by Richard Cook and Dave Woods: https://www.researchgate.net/publication/245102691_A_Tale_of_Two_Stories_Contrasting_Views_of_Patient_Safety?enrichId=rgreq-a699511fb5bc518bf1584a0a6613d8d0-XXX&enrichSource=Y292ZXJQYWdlOzI0NTEwMjY5MTtBUzoyMDYyMjM2NjExMTMzNDdAMTQyNjE3ODk2MDQ4NA%3D%3D&el=1_x_2&_esc=publicationCoverPdf
The Beaumaiden report (that dives into a deeper, second story) is here: https://dmaib.com/reports/2021/beaumaiden-grounding-on-18-october-2021
We will continue to point to DORA’s organizational model page: https://dora.dev/capabilities/generative-organizational-culture/
Some Wikipedia on double loop learning: https://en.wikipedia.org/wiki/Double-loop_learning
Colette mentioned Mads Møller’s Lund HFSS thesis on deaths and accountability: https://lup.lub.lu.se/student-papers/search/publication/9106422
And Bram Couteaux’s Lund HFSS thesis on the drunk flight attendants/pilots court: https://lup.lub.lu.se/student-papers/search/publication/9111661
J Paul Reed wrote about being ‘Blame Aware’ - https://medium.com/@jpaulreed/why-blameless-postmortems-might-feel-wrong-cbeee00d51b2

Lorikeets are pretty: https://en.wikipedia.org/wiki/Rainbow_lorikeet
You think Colette’s kidding about the kangaroo? https://www.youtube.com/watch?v=DQjHVRHXbc8
The Mackinac Bridge is long: https://en.wikipedia.org/wiki/Mackinac_Bridge
Michelle’s blog post: https://resilienceinsoftware.org/news/1288714
DORA has some good writing on Westrum’s cultural models if you’re wondering about it: https://dora.dev/capabilities/generative-organizational-culture/
The link to our TiF live event with Michelle where we will be discussing the blog post! https://resilienceinsoftware.org/networks/events/143194
Please ask us questions! You can go to thisisfinepod.com to get the link to our anonymous Google form!

Corn sweat is a real thing: https://www.scientificamerican.com/article/humidity-from-corn-sweat-intensifies-extreme-heat-wave-in-midwest-u-s/
Also, plugging Tajin here, because: https://en.wikipedia.org/wiki/Taj%C3%ADn_seasoning
Wikipedia tells me Tajin is Mexican. I dunno, Clint.
Beaumaiden report, for those that didn’t listen to the prior episode where we mentioned it: https://dmaib.com/reports/2021/beaumaiden-grounding-on-18-october-2021
John Allspaw’s talk at Spotify that we referenced: https://www.youtube.com/watch?v=M8mYPyRG1fQ
Lorin’s Law is always a good plug: https://surfingcomplexity.blog/2017/06/24/a-conjecture-on-why-reliable-systems-fail/
Clint’s book recommendation: https://bookshop.org/p/books/the-15-commitments-of-conscious-leadership-a-new-paradigm-for-sustainable-success-diana-chapman/14574335?ean=9780990976905&next=t
Send us questions at thisisfinepod.com or find us on LinkedIn here: https://www.linkedin.com/company/this-is-fine-a-podcast-about-software-and-resilience-engineering/
You can come to the Lund panelist event for RISF by signing up here: https://resilienceinsoftware.org/networks/events/133948

A huge thanks to our panelists:
John Allspaw
Jed Needle
Chad Todd
RISF and TiF will host a live follow up to this episode on July 31st! You can sign up here: https://resilienceinsoftware.org/networks/events/133948
If you’re interested in Lund’s Masters of Science program in Human Factors and Systems Safety, or any of their learning labs, you can check out more info here: https://www.humanfactors.lth.se/
Adaptive Capacity Labs is how Jed was introduced to some of the concepts of LFI & Resilience Engineering, which eventually landed him at Lund.
John mentioned SciShow Tangents, a podcast by Hank Green and Ceri Riley: https://www.youtube.com/c/scishowtangents
As well as Conway’s Law: https://en.wikipedia.org/wiki/Conway%27s_law
And Dunbar’s Number: https://en.wikipedia.org/wiki/Dunbar%27s_number
And the Theory of Graceful Extensibility, which you can read about here: https://infoscience.epfl.ch/server/api/core/bitstreams/87cfe245-c138-43cb-87c9-4062dc1a0519/content
Lund theses list: https://www.humanfactors.lth.se/ny-sajt/msc-programme/msc-theses/
Our panel’s select theses that they love:
Colette’s pick: https://lup.lub.lu.se/student-papers/search/publication/9106422
Chad’s pick: https://lup.lub.lu.se/student-papers/search/publication/9009930
John’s picks were all of the software theses, I’m probably missing some but this is my attempt:
John’s (was the first): https://lup.lub.lu.se/student-papers/search/publication/8084520
J Paul Reed: https://lup.lub.lu.se/student-papers/search/publication/8966930
Chad’s thesis on handovers in software: https://lup.lub.lu.se/student-papers/search/publication/9076274
Michael Wettick: https://lup.lub.lu.se/student-papers/search/publication/9150096
Colette’s thesis on QRA: https://lup.lub.lu.se/student-papers/search/publication/9148570
Jessica De Vita: https://lup.lub.lu.se/student-papers/search/publication/9149521
Dr. Raymer’s I want to Treat the Patient and Not the Alarm: https://lup.lub.lu.se/student-papers/search/publication/2861164

Dave Woods’s talk at SRECon 25 was on Complexification and SRE: https://www.youtube.com/watch?v=lmBvUJnGUX4
Jens Rasmussen’s model is really well explained by Richard Cook’s talk at Velocity: https://www.youtube.com/watch?v=PGLYEDpNu60&t=3s
Lorin’s blog also has a good summary: https://surfingcomplexity.blog/2021/05/31/transgressing-the-boundaries-rasmussen-and-woods/
And finally, Jens Rasmussen’s original paper on the subject: Risk Management in a Dynamic Society https://linkinghub.elsevier.com/retrieve/pii/S0925753597000520
SRECon 25 talk on Incident Metrics that Matter that was awesome - https://www.youtube.com/watch?v=QrR2SvpWvdg
Want to read about how things are getting a bit fash-y in tech these days?
https://www.newyorker.com/culture/infinite-scroll/techno-fascism-comes-to-america-elon-musk
https://www.theguardian.com/technology/ng-interactive/2025/jan/29/silicon-valley-rightwing-technofascism
Perrow/Normal Accidents: https://bookshop.org/p/books/normal-accidents-living-with-high-risk-technologies-updated-edition-revised-charles-perrow/10369279?ean=9780691004129
High Reliability Organizations (HROs):
Started (ish) with “A Rejoinder to Perrow” https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1468-5973.1994.tb00047.x
And you can find Rochlin & La Porte behind a lot of the early writing on HROs, including https://www.jstor.org/stable/44637690?seq=1 and https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1468-5973.1996.tb00078.x
As well as Weick and Sutcliffe: https://bookshop.org/p/books/managing-the-unexpected-sustained-performance-in-a-complex-world-kathleen-m-sutcliffe/11267666?ean=9781118862414&
https://journals.sagepub.com/doi/10.2307/41165243

You can register for the After-the-Episode chat with Andrew at https://resilienceinsoftware.org/networks/events/129997
Tickets are free for members, $10 for non-members. You can join the Foundation at https://resilienceinsoftware.org/signup
Zuul is what Volvo uses for their CI, and it’s part of the OpenInfra Foundation, it’s rad.
You can find Andrew on LinkedIn here.

Michael Wettick’s Lund thesis is great, and Laura Maguire’s paper on the Costs of Coordination that is a shortened version of her dissertation is worth a read!
Clint’s SRECon talk that he mentioned a couple times: https://www.youtube.com/watch?v=k4UaDDkLOhw
Lorin wrote a great article on incidents and improvisation: https://surfingcomplexity.blog/2023/06/11/when-theres-no-plan-for-this-scenario-youve-got-to-improvise/
Incident.io and the people who work there have hilarious LinkedIn posts about how people use incidents in their org.
We talked about BlackRock3 who do incident command training: https://www.blackrock3.com/
Brent Chapman has also done great incident command training and has done some talks on why IT incident management can learn from fire/emergency response management processes.
We have a LinkedIn! https://www.linkedin.com/company/this-is-fine-a-podcast-about-software-and-resilience-engineering/
And you can ask us questions here: https://forms.gle/rggrbGG6aFVrgZsv9

The O’Reilly book on Chaos Engineering by Casey and Nora Jones is here: https://www.oreilly.com/library/view/chaos-engineering/9781492043850/
Some of the Netflix posts introducing Chaos Monkey and Simian Army are here and here.
You can see Lorin Hochstein talking about Chaos Engineering at Netflix here.
The Void is an awesome collection of information on incidents throughout tech and you can find it here.
Casey mentioned Rasmussen’s model. Lorin has a great summary of that on his blog, but you can read the original paper by Rasmussen introducing this model here.
A report on the Netflix outage during Christmas of 2012.
A reminder - you can ask us questions for the podcast at www.thisisfinepod.com

Clint wrote the Socio-Technical Reality Engineer as a blog post; it’s a good read.
The Burnout book by the Nagoski sisters is A+++ reading.
Those Found Responsible Have Been Sacked is by the late, great, Dr. Richard Cook and Chris Nemeth.
The Perverse Incentives of Reliability by Katie Wilde from Snyk at this year’s SRECon was just an incredible talk.
Colette mentioned the Beaumaiden report from DMAIB. She gave a talk for the DORA community on resilience engineering that you can see here.

Ben (Goodheart), Dave (Provan) and Ron (Gantt) have the very awesome podcast Punk Rock Safety (punkrocksafety.com) - you can get your own punk rock safety merch at punkrocksafetymerch.com
Charles Perrow wrote Normal Accidents and talks about safety and power in his essay (book, really), Complex Organizations.
Drew Rae lit the stage on fire about safety work as soothing rather than actually improving safety: EHS Congress Berlin 2024 - Day2
Dr. Richard Cook’s concepts of ‘Above the Line/Below the Line’ got a shout out - here’s the paper, and here’s John Allspaw giving a talk about the concept.

No video for this one because it didn’t really end up working.
We had some awesome people with us for this show:
Eric Dobbs
Will Gallego
Juan Carlos Ramirez
Martin Smith
Dr Richard Cook’s talk on The Marvelous Resilience of Bone (one of our absolute favorites)
You can see the schedule for the SRECon 2025 Americas conference here
The keynotes from the day we recorded were Dr David Woods and Katie Wilde (from Snyk)

The XKCD comic that’s in Colette’s thesis is Dependency
Justin Reock is at DX
https://punkrocksafety.com/ are our mutual podcast friends

You can find John at Adaptive Capacity Labs or his (old) blog at Kitchen Soap.
ITIL is… well, it’s a thing.
Colette’s “You’re surprised it works in the first place” comes from Richard Cook’s brilliant Velocity talk in 2013.
FYI, John wasn’t talking about Franz Kafka, we think he was talking about Apache Kafka. But they are pretty similar, we think.

You can find ACL (Adaptive Capacity Labs), the folks who train software engineers how to do LFI and who we speak so fondly of here.
Colette mentioned Allspaw’s take on Five Whys - if you want to know why we think there are better options for learning out there, you can read it here.
Alex did a great talk with Sarah Butt on some LFI related things at LFI Conf in 2023: https://www.youtube.com/watch?v=CbSiKAtO7Fk
And at SRECon: SREcon20 Americas - Are We Getting Better Yet? Progress Toward Safer Operations
Colette went to go see whales in the Baja through this tour, it was awesome
Write to us at thisisfine.softwarepodcast@gmail.com or go fill out our form with a question at Thisisfinepod.com





