Slight Reliability
120 Episodes
This week I sit down and have a discussion with Amin Astaneh (from Certo Modo) about CI/CD. We cover the power of the standard change as a way to navigate ITIL while still implementing DevOps practices, what to monitor to make your CI/CD observable, single piece flow, testing in production, and so much more. You can find Amin on his company website https://certomodo.io, LinkedIn: https://www.linkedin.com/in/aminastaneh/ and Twitter: https://twitter.com/aastaneh You can find the ...
"Environment issues are just incidents that happened to occur in a non-production environment"... so why do we treat them so differently? In this first episode of the 2024 season I reflect on how we handle incidents in non-prod environments. (Note: Had a few issues with noise suppression in OBS Studio cutting off the start of some words, will sort it for the next episode) You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter....
How do you ingest and store petabytes of telemetry every day in a cost-effective and high-performing way? How can you do this in a way that gives engineers the operational data they need to keep services running? How has this challenge been tackled in the past, and how has the approach evolved? This week I'm joined by Observe co-founder Jacob Leverich to go deep into this topic. We discuss... • A deep-dive into the evolution of telemetry storage and where it's going • The advent of gen...
How do you take all the utopian ideas you read about in books and apply them to the reality of the organisations we work in? This week I'm joined by leader, mentor, and coach Rob Roe to tackle this question. We discuss... • The pitfalls of functional silos • Is the annual budget a load of rubbish? • How our management promotion systems are often broken • The power of virtual teams • Team interaction models ...and much more. You can find Rob on... LinkedIn: https://www.linkedin.co...
We spend a third of our life at work. It needs to be something we enjoy and something with purpose. Our work experience also impacts our family, friends, and our personal lives. This week I'm joined by tech engineer, leader, and author Richard Bown to explore this and many other topics including... • The difficulty in applying the ideas we read in books in real organisations • When you want to implement a thing you can't talk directly about the thing • Does change require senior ...
When you become a people leader there is no manual. How can we not only learn leadership skills but practice them and build leadership muscle? This week I'm joined by Orion Group Limited co-founder Xiao Zhang to discuss... • The challenge of transitioning into people leadership • How we don't get fit by watching other people work out • Pausing as an act of active leadership • The power of slack time for creativity and systems thinking • Going below the waterline ...and much more. ...
This week I kick off the 2026 season with some news and we explore how to prepare for a new role. You can buy Slight Reliability merch here (Note: you cannot order the mugs outside of New Zealand): https://slightreliability.digitees.co.nz/ You can find Stephen on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Bluesky: https://bsky.app/profile/slightreliability.bsky.social YouTube: https://www.youtube.com/c/SlightReliability Instagram: https://www.instagram.com/slight_rel...
From the day we invented computers we've been struggling to keep applications running and delivering services to the business. Is this latest wave of AI helping or hurting us? This week I'm joined by Causely founder Shmuel Kliger to dive into... • The three waves of AI hype over the decades (the history of AI) • The dangers of over-promising and under-delivering what AI can do • What is causal reasoning? • Is AI replacing SREs? • AI as a way to allow humans to solve higher level ...
What is operational intelligence and how is it different from observability or BI? This week I'm joined by SquaredUp's VP of Innovation Adam Kinniburgh to answer that question and many more including... • What is operational intelligence? • Relating observability back to customer, business, or revenue • The value of giving stakeholders confidence • Who bridges the gap between tech and business or engineers and leadership? • Correlation VS causation and our innate desire to build c...
How does leading platform teams differ from leading product teams? This week I'm joined by experienced technology leader Dinesh Sukhija to answer that question and many more including... • What is a platform team? • Coaching engineers to focus on outcomes • Connecting platform initiatives to business goals • Identifying the limiters in your team • Spreading knowledge and avoiding single points of failure ...and much more. You can find Dinesh on: LinkedIn: https://www.linkedin.com...
How have my first two years as a manager in tech been? What have I learned? What do I need to work on? This week I share my experiences over the past couple of years. I cover: • My recent close call with burnout • How I attempted to build a team culture • The importance of tough conversations • How roles and responsibilities might be boring to think about but are critical • What's next? ...and much more. You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Blu...
How could AI help human beings negotiate the mountains of telemetry we collect to get simple and fast insight? This week I'm joined by Ottermon AI CEO and founder Checo Pacheco to talk about the lifecycle of observability coverage and tooling within organisations and how AI is helping to find signals amongst the noise and reduce cognitive load for SREs. We discuss... • The need for a layer of logic on top of our telemetry data • The observability lifecycle of a DevOps team • How most orgs...
What is chaos engineering and how is it being used in 2025? This week I'm joined by Gremlin CEO and founder Kolton Andrus to discuss... • What is chaos engineering and what are its origins? • How has it evolved over the years? • The role of AI agents in SRE work • Justifying the value of chaos engineering • How do I get started? ...and much more. You can find Kolton on: LinkedIn: https://www.linkedin.com/in/kolton-andrus-77315a2/ And you can find out more about Gremlin's new ...
What are Team Topologies? How can they be used to deliver value more simply and effectively (and in a more humane way)? This week I'm joined by Luke McManus to discuss... • What are the four team topologies? • Can we have too much collaboration? • Team interaction models • Cognitive load • Value dynamics mapping ...and much more. You can find Luke on: LinkedIn: https://www.linkedin.com/in/luke-mcmanus-agile/ Check out the recently released second edition of the Team Topolo...
How do you begin contributing to an open source project? What's it like? What do you get out of it? This week I'm joined by Wendy Ha who shares her unique story of joining the Kubernetes project and becoming a contributor. We explore... • What it's like working on one of the biggest open source projects in the world • The benefits of contributing to open source • How much time and effort does it take? • The unique challenges of contributing from APAC (and the need for more contri...
As an #SRE how do you influence senior leadership to get support and priority for the things you care about? To answer this question I'm joined by Nora Jones, founder of Jeli and now Head of Pricing, Product Strategy and Growth at PagerDuty. Our conversation touches on... • How understanding needs to flow both ways (between engineers and leaders) • Reliability is as much an art as a science • Using napkin math to start conversations • Understand the system (your org) before trying...
This week I do a retrospective on the Slight Reliability podcast. • How many people listen to it? • How do I feel about the show? • What's going well? • What could be better? • What's next for the show? If you want to check out the podcast that came before Slight Reliability, you can find Performance Time archived on YouTube here: https://www.youtube.com/@performance-time You can find Stephen on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Bluesky: https://bsky.app/pr...
Have you burned out at work? What was your experience? How did you work through it? This week I'm joined by the incredible Colette Alexander to discuss what burnout is, what it means, and we both share our personal experiences burning out at work. We cover... • What is burnout? • Why does it happen? • What are the symptoms? • Fight, flight, or freeze • Advice on how to recover ...and much more. Resources from the show... Why you're so angry at work (and what to do about it) by N...
This week I'm joined by the wonderful Hanson Ho to discuss the unique challenges and opportunities in making our mobile apps observable! We cover... • The mobile/backend observability divide • The challenge of distributed tracing on mobile apps • The entire device runtime environment matters for your app • The quest for user-centric mobile observability • Advice on how to get started with mobile observability ...and much more. You can find Hanson on: LinkedIn: https://www.linkedi...
This week I'm joined once more by SRE leader Michelle Casey, who gives a broad and shallow introduction to resilience engineering. We cover... • Reliability VS Robustness VS Resilience • What is a complex system? • Safety one/safety two • Mental models • Human error ...and so much more. Resources from this episode: Four concepts for resilience (paper) by Dr. David Woods https://www.researchgate.net/publication/276139783_Four_concepts_for_resilience_and_the_implications_f...