Slight Reliability

Learning SRE, one day at a time.

Slight Reliability Episode 82 - CI/CD with Amin Astaneh

This week I sit down and have a discussion with Amin Astaneh (from Certo Modo) about CI/CD. We cover the power of the standard change as a way to navigate ITIL while still implementing DevOps practices, what to monitor to make your CI/CD observable, single piece flow, testing in production, and so much more. You can find Amin on his company website https://certomodo.io, LinkedIn: https://www.linkedin.com/in/aminastaneh/ and Twitter: https://twitter.com/aastaneh You can find t...

02-13
25:47

Slight Reliability Episode 81 - Incident Management in Non-Prod Environments

"Environment issues are just incidents that happened to occur in a non-production environment"... so why do we treat them so differently? In this first episode of the 2024 season I reflect on how we handle incidents in non-prod environments. (Note: Had a few issues with noise suppression in OBS Studio cutting off the start of some words, will sort it for the next episode) You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitt...

02-06
10:09

The Implications of AI on Observability with Aaron "Checo" Pacheco (Episode 109)

How could AI help human beings negotiate the mountains of telemetry we collect to get simple and fast insight? This week I'm joined by Ottermon AI CEO and founder Checo Pacheco to talk about the lifecycle of observability coverage and tooling within organisations and how AI is helping to find signals amongst the noise and reduce cognitive load for SREs. We discuss... 🎂 The need for a layer of logic on top of our telemetry data 🚲 The observability lifecycle of a DevOps team 🎶 How most o...

11-04
38:27

Chaos Engineering with Kolton Andrus (Episode 108)

What is chaos engineering and how is it being used in 2025? This week I'm joined by Gremlin CEO and founder Kolton Andrus to discuss... 🌪️ What is chaos engineering and what are its origins? 🪴 How has it evolved over the years? 🤖 The role of AI agents in SRE work 💰 Justifying the value of chaos engineering 🏃‍♀️‍➡️ How do I get started? ...and much more. You can find Kolton on: LinkedIn: https://www.linkedin.com/in/kolton-andrus-77315a2/ And you can find out more about Gremlin's n...

10-25
31:16

Team Topologies with Luke McManus (Episode 107)

What are Team Topologies? How can they be used to deliver value more simply and effectively (and in a more humane way)? This week I'm joined by Luke McManus to discuss... ⛰️ What are the four team topologies? 🏆 Can we have too much collaboration? ⌚ Team interaction models 🌏 Cognitive load 🏃‍♀️‍➡️ Value dynamics mapping ...and much more. You can find Luke on: LinkedIn: https://www.linkedin.com/in/luke-mcmanus-agile/ Check out the recently released second edition of the Team Top...

10-07
23:10

Contributing to Open Source with Wendy Ha (Episode 106)

How do you begin contributing to an open source project? What's it like? What do you get out of it? This week I'm joined by Wendy Ha who shares her unique story of joining the Kubernetes project and becoming a contributor. We explore... ⛰️ What it's like working on one of the biggest open source projects in the world 🏆 The benefits of contributing to open source ⌚ How much time and effort does it take? 🌏 The unique challenges of contributing from APAC (and the need for more con...

09-23
43:52

Influencing Leadership with Nora Jones (Episode 105)

As an #SRE how do you influence senior leadership to get support and priority for the things you care about? To answer this question I'm joined by Nora Jones, founder of Jeli and now Head of Pricing, Product Strategy and Growth at PagerDuty. Our conversation touches on... 🤝 How understanding needs to flow both ways (between engineers and leaders) 🎨 Reliability is as much an art as a science 📝 Using napkin math to start conversations 🧠 Understand the system (your org) before try...

09-09
28:16

Slight Reliability Podcast Retrospective (Episode 104)

This week I do a retrospective on the Slight Reliability podcast. 👂 How many people listen to it? ❤️ How do I feel about the show? 🎉 What's going well? 🪴 What could be better? ❔ What's next for the show? If you want to check out the podcast that came before Slight Reliability, you can find Performance Time archived on YouTube here: https://www.youtube.com/@performance-time You can find Stephen on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Bluesky: https://bsky.app...

08-26
27:28

Burnout with Colette Alexander (Episode 103)

Have you burned out at work? What was your experience? How did you work through it? This week I'm joined by the incredible Colette Alexander to discuss what burnout is, what it means, and we both share our personal experiences burning out at work. We cover... 🔥 What is burnout? ❓ Why does it happen? 🫀 What are the symptoms? 🥊 Fight, flight, or freeze 🧑‍🚒 Advice on how to recover ...and much more. Resources from the show... Why you're so angry at work (and what to do about it) b...

08-12
38:36

Mobile Observability with Hanson Ho (Episode 102)

This week I'm joined by the wonderful Hanson Ho to discuss the unique challenges and opportunities in making our mobile apps observable! We cover... 📱 The mobile/backend observability divide ✍️ The challenge of distributed tracing on mobile apps 🌏 The entire device runtime environment matters for your app 👀 The quest for user-centric mobile observability ✅ Advice on how to get started with mobile observability ...and much more. You can find Hanson on: LinkedIn: https://www.link...

07-29
31:57

Intro to Resilience Engineering with Michelle Casey (Episode 101)

This week I'm joined once more by SRE leader Michelle Casey who gives a broad and shallow introduction to resilience engineering. We cover... 🏋️‍♀️ Reliability VS Robustness VS Resilience 🧩 What is a complex system? 🔒 Safety one/safety two 🧠 Mental models 😩 Human error ...and so much more. Resources from this episode: Four concepts for resilience (paper) by Dr. David Woods https://www.researchgate.net/publication/276139783_Four_concepts_for_resilience_and_the_implication...

07-15
39:36

Learning with John Allspaw (Episode 100)

This week on the 100th episode I'm joined by DevOps and Resilience Engineering legend John Allspaw to talk about learning (especially from incidents). We discuss... 📒 Classroom VS situated learning 🤝 The myth of the perfect handover ITIL as a coping strategy to try and make sense of the organic, wild, and messy 🥕 How you cannot incentivise to avoid incidents (it doesn't work that way) ❤️‍🩹 You can't understand how something is broken unless you know how it's supposed to work i...

06-24
48:17

Focusing on What Matters with Trent Hornibrook (Episode 99)

This week I'm joined by SRE leader Trent Hornibrook who shares a story about how he improved on-call early in his career, and then we explore the broader theme of focusing on the things that matter in observability, incident response, on-call, and beyond. We discuss... 🔌 Empowering engineers to implement change in your org 🧑‍🍼 Focusing on what matters (customer & business > technology) 👀 Not just adding more monitoring as the output of each PIR 😎 How autonomy can lead to...

06-03
29:28

The Root Cause Fallacy with Andrew Hatch (Episode 98)

This week I'm joined by SRE leader Andrew Hatch from Cisco ThousandEyes to talk about a dirty word in the resilience community... root cause. In this excellent conversation we explore... 🌌 Is the root cause of every incident the big bang? 🦖 How the value of root cause degrades as complexity increases 🫣 That if the culture is not blameless, people will hide things 🌳 Alternative approaches to root cause analysis such as branching timelines 🙋 Getting someone without skin in the ga...

05-20
32:22

Synthetic Monitoring with David Dick (Episode 97)

This week I'm joined by David Dick from 2 Steps to (finally!) discuss synthetic monitoring. We cover... 🤖 What is synthetic monitoring? 🦾 What are the benefits and drawbacks to using it? ☢️ Non-web based synthetics (the tough stuff) 🍹 Combining RUM and synthetics 🫢 Do synthetics need an OTEL-like framework? ...and much more. You can find David on: LinkedIn: https://www.linkedin.com/in/david-dick/ You can find more about 2 Steps at https://2steps.io/# You can find Stephen on: ...

05-06
33:04

Tech Leadership with Milan Brown (Episode 96)

This week I'm joined by Cin7 Engineering Director Milan Brown to unpack the challenges of technology management and leadership. We discuss... ✖️ Theory X vs Theory Y management 🗣️ Intention based leadership and communication 🏢 Conditions in an org for people to thrive 😵‍💫 How do you learn to manage and lead? 🫤 Managing people when you're not an expert in what they do ...and much more. Resources mentioned during the episode: Turn The Ship Around! (book): https://davidmarquet.com...

04-23
31:27

Finding Tech Work with Leon Adato (Episode 95)

This week Leon Adato and I break down the state of applying for roles in tech. We cover... 📝 What a resume or CV is and is not 🤝 Leveraging your connections rather than relying on applying cold 🪄 How most job descriptions are works of fiction 🦾 White-fonting to game AI resume assessment 🧪 Experimental ways we could recruit ...and our pitch for Kubernetes the Rock Opera (and much more) You can find Leon's job postings weekly on his website: https://www.adatosystems.com/category/...

03-29
36:26

Getting a Start in SRE with Priyam Kumar (Episode 94)

This week Priyam Kumar shares his story of moving from a massive organisation to a startup and the challenges and growth that came from that. We discuss... 🪖 War stories and examples of production incidents 🩹 The "hacks" we build to keep things running (and how maybe that's just normal) 😎 Keeping it simple... YAGNI (You Ain't Gonna Need It!) 🧯 The perils of getting stuck in reactive mode 📖 Areas of learning if you want to get into SRE ...and much much more. You can find Priy...

03-22
31:09

SRE Leadership with Michelle Casey (Episode 93)

This week Michelle Casey shares her insights as a 'head of' engineering manager in the SRE context. This was one of my favourite conversations on the podcast so far. We cover topics such as... 🤷🏽 Why move into leadership? 👁️ Learning from other leaders 💎 What is unique about SRE leadership? 👑 Women in engineering leadership ...and we go through some feedback I got as a leader recently. Resources that Michelle mentions during the episode: The Five Dysfunctions of a Team (book): ...

03-11
39:29

Observability Maturity with Ádám Tóth (Episode 92)

This week Adam and I get philosophical about what constitutes maturity in the field of observability. We tackle questions such as... 💸 Does your org treat observability as a cost centre or a value add? 🔥 Are you using observability reactively to solve problems? Or proactively to build better products and services? 👀 Is your observability connected to your users and business in a meaningful way? 🌐 Is monitoring the social media sentiment of your product part of observability? .....

02-25
30:09
