We Have to Talk About CrowdStrike! Hot Takes and Quality Debates
Description
The conversation discusses the CrowdStrike outage, which was caused by a faulty content configuration update to CrowdStrike's Falcon sensor that crashed Windows machines at the kernel level. The impact of the outage was widespread, affecting airports, medical professionals, banking, and even news channels.
The hosts emphasize the need to understand the complexity of software testing and not jump to conclusions or blame testers. They highlight the importance of continuous improvement, learning from mistakes, and taking ownership of problems.
They also touch on the debate around releasing software on Fridays and the need for context-specific decision-making, then explore the impact of software bugs and the importance of quality in software development: the ability to turn off software in critical situations, the challenges of working on low-level or embedded software, and the need for risk mitigation.
The episode then covers CrowdStrike's response to the bug and the potential human impact of such incidents, examines what quality in software means, and concludes with a discussion of the increasing prevalence of software across industries.
Links & Mentions
- 01:01 - CrowdStrike
- 01:17 - What is a kernel?
- 01:56 - What happened?
- BBC article - CrowdStrike release causes "Mass IT outage affects airlines, hospitals, media and banks"
- Preliminary Post Incident Review from CrowdStrike
- 05:32 - Dave's Garage explanation of what happened 😙🤌🏾 (ex-Microsoft dev)
- 12:46 - Rich's LinkedIn post about jumping to conclusions in the wake of the CrowdStrike issue
- 15:38 - Mark Winteringham
- Mark's website and blog
- Mark's excellent blog post about "Quality Engineering, Digital Employees and Job Security"
- Mark's LinkedIn
- 28:00 - Article: CrowdStrike CEO called to Congress
- 37:59 - CrowdStrike updates
- Their blog
- Their Remediation and Guidance Hub: Falcon Content Update for Windows Hosts
- Their Preliminary Post Incident Review (PIR): Content Configuration Update Impacting the Falcon Sensor and the Windows Operating System (BSOD)
- 44:47 - Dame Anita Frew
00:00 Introduction and Appreciation for Listeners
00:33 - Did anything interesting happen in the last week?
01:01 - CrowdStrike (what else?!)
01:56 - Vernon & Richard describe what happened with the CrowdStrike shenanigans
04:23 Realizing the Global Impact of the Outage
06:16 Explaining the Kernel Bug and its Effects
07:44 The Process of Getting a Kernel-Based Application
08:40 The Kernel's Response to Errors and Risks
09:29 The Significance of the Kernel in Software
10:35 Updates and News from CrowdStrike
11:11 The Importance of Software Testing and Quality
12:12 The Fallacy of Blaming Testers and Testing
12:46 - Vern reads out Rich's LinkedIn post in the immediate wake of the issue
14:29 Recognizing Process Shortcomings and Risks
15:38 - The danger of "hot takes"
16:24 Taking Ownership and Learning from Mistakes
19:15 - Common CrowdStrike Hot Takes: Thou shalt not release on Friday!
19:46 Alternative Explanations and Hot Takes
21:16 The Danger of Treating Hot Takes as Facts
22:20 The Debate Around Releasing on Fridays
23:17 Mitigating Risks and Context-Specific Decision-Making
24:42 The Need for Continuous Improvement and Learning
26:18 - Common CrowdStrike Hot Takes: Clearly this hasn't been tested!
26:37 - Common CrowdStrike Hot Takes: Obvious risk mitigation steps they should have taken
28:00 - CrowdStrike CEO called to Congress
28:45 The Impact of Software Bugs and the Importance of Quality
30:54 - What might have happened if CrowdStrike didn't release a critical update?
36:22 Mitigating Risks and Turning Off Software in Critical Situations
37:59 - Updates directly from CrowdStrike
38:39 - Rich's Columbo question
43:48 - The miracle of ubiquitous software
45:42 The Response of CrowdStrike and the Potential Human Impact
46:22 - One Final Hot Take from Rich