Slight Reliability

Author: Stephen Townshend

Description

Learning SRE, one day at a time.
115 Episodes
This week I sit down and have a discussion with Amin Astaneh (from Certo Modo) about CI/CD. We cover the power of the standard change as a way to navigate ITIL while still implementing DevOps practices, what to monitor to make your CI/CD observable, single piece flow, testing in production, and so much more. You can find Amin on his company website https://certomodo.io, LinkedIn: https://www.linkedin.com/in/aminastaneh/ and Twitter: https://twitter.com/aastaneh You can find t...
"Environment issues are just incidents that happened to occur in a non-production environment"... so why do we treat them so differently? In this first episode of the 2024 season I reflect on how we handle incidents in non-prod environments. (Note: I had a few issues with noise suppression in OBS Studio cutting off the start of some words; I will sort it for the next episode.) You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitt...
From the day we invented computers we've been struggling to keep applications running and delivering services to the business. Is this latest wave of AI helping or hurting us? This week I'm joined by Causely CEO and founder Shmuel Kliger to dive into... 🌊 The three waves of AI hype over the decades (the history of AI) ☠️ The dangers of over-promising and under-delivering what AI can do 🧠 What is causal reasoning? 😱 Is AI replacing SREs? 🔮 AI as a way to allow humans to solve hi...
What is operational intelligence and how is it different from observability or BI? This week I'm joined by SquaredUp's VP of Innovation Adam Kinniburgh to answer that question and many more including... ❓ What is operational intelligence? 🙈 Relating observability back to customer, business, or revenue 😎 The value of giving stakeholders confidence 🌉 Who bridges the gap between tech and business or engineers and leadership? 🦋 Correlation VS causation and our innate desire to buil...
How does leading platform teams differ from leading product teams? This week I'm joined by experienced technology leader Dinesh Sukhija to answer that question and many more including... ❓ What is a platform team? ⚽ Coaching engineers to focus on outcomes ☀️ Connecting platform initiatives to business goals ✋ Identifying the limiters in your team 🎤 Spreading knowledge and avoiding single points of failure ...and much more. You can find Dinesh on: LinkedIn: https://www.linkedin....
How have my first two years as a manager in tech been? What have I learned? What do I need to work on? This week I share my experiences over the past couple of years. I cover: 🔥 My recent close call with burnout 🫶 How I attempted to build a team culture 💪 The importance of tough conversations 🥱 How roles and responsibilities might be boring to think about but are critical ❓ What's next? ...and much more. You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ ...
How could AI help human beings negotiate the mountains of telemetry we collect to get simple and fast insight? This week I'm joined by Ottermon AI CEO and founder Checo Pacheco to talk about the lifecycle of observability coverage and tooling within organisations and how AI is helping to find signals amongst the noise and reduce cognitive load for SREs. We discuss... 🎂 The need for a layer of logic on top of our telemetry data 🚲 The observability lifecycle of a DevOps team 🎶 How most o...
What is chaos engineering and how is it being used in 2025? This week I'm joined by Gremlin CEO and founder Kolton Andrus to discuss... 🌪️ What is chaos engineering and what are its origins? 🪴 How has it evolved over the years? 🤖 The role of AI agents in SRE work 💰 Justifying the value of chaos engineering 🏃‍♀️‍➡️ How do I get started? ...and much more. You can find Kolton on: LinkedIn: https://www.linkedin.com/in/kolton-andrus-77315a2/ And you can find out more about Gremlin's n...
What are Team Topologies? How can they be used to deliver value more simply and effectively (and in a more humane way)? This week I'm joined by Luke McManus to discuss... ⛰️ What are the four team topologies? 🏆 Can we have too much collaboration? ⌚ Team interaction models 🌏 Cognitive load 🏃‍♀️‍➡️ Value dynamics mapping ...and much more. You can find Luke on: LinkedIn: https://www.linkedin.com/in/luke-mcmanus-agile/ Check out the recently released second edition of the Team Top...
How do you begin contributing to an open source project? What's it like? What do you get out of it? This week I'm joined by Wendy Ha who shares her unique story of joining the Kubernetes project and becoming a contributor. We explore... ⛰️ What it's like working on one of the biggest open source projects in the world 🏆 The benefits of contributing to open source ⌚ How much time and effort does it take? 🌏 The unique challenges of contributing from APAC (and the need for more con...
As an #SRE how do you influence senior leadership to get support and priority for the things you care about? To answer this question I'm joined by Nora Jones, founder of Jeli and now Head of Pricing, Product Strategy and Growth at PagerDuty. Our conversation touches on... 🤝 How understanding needs to flow both ways (between engineers and leaders) 🎨 Reliability is as much an art as a science 📝 Using napkin math to start conversations 🧠 Understand the system (your org) before try...
This week I do a retrospective on the Slight Reliability podcast. 👂 How many people listen to it? ❤️ How do I feel about the show? 🎉 What's going well? 🪴 What could be better? ❔ What's next for the show? If you want to check out the podcast that came before Slight Reliability, you can find Performance Time archived on YouTube here: https://www.youtube.com/@performance-time You can find Stephen on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Bluesky: https://bsky.app...
Have you burned out at work? What was your experience? How did you work through it? This week I'm joined by the incredible Colette Alexander to discuss what burnout is, what it means, and we both share our personal experiences burning out at work. We cover... 🔥 What is burnout? ❓ Why does it happen? 🫀 What are the symptoms? 🥊 Fight, flight, or freeze 🧑‍🚒 Advice on how to recover ...and much more. Resources from the show... Why you're so angry at work (and what to do about it) b...
This week I'm joined by the wonderful Hanson Ho to discuss the unique challenges and opportunities in making our mobile apps observable! We cover... 📱 The mobile/backend observability divide ✏️ The challenge of distributed tracing on mobile apps 🌏 The entire device runtime environment matters for your app 👤 The quest for user-centric mobile observability ✅ Advice on how to get started with mobile observability ...and much more. You can find Hanson on: LinkedIn: https://www.link...
This week I'm joined once more by SRE leader Michelle Casey who gives a broad and shallow introduction to resilience engineering. We cover... 🏋️‍♀️ Reliability VS Robustness VS Resilience 🧩 What is a complex system? 🔢 Safety one/safety two 🧠 Mental models 😩 Human error ...and so much more. Resources from this episode: Four concepts for resilience (paper) by Dr. David Woods https://www.researchgate.net/publication/276139783_Four_concepts_for_resilience_and_the_implication...
This week on the 100th episode I'm joined by DevOps and Resilience Engineering legend John Allspaw to talk about learning (especially from incidents). We discuss... 📒 Classroom VS situated learning 🤝 The myth of the perfect handover ITIL as a coping strategy to try and make sense of the organic, wild, and messy 🥕 How you cannot incentivise to avoid incidents (it doesn't work that way) ❤️‍🩹 You can't understand how something is broken unless you know how it's supposed to work i...
This week I'm joined by SRE leader Trent Hornibrook who shares a story about how he improved on-call early in his career, and then we explore the broader theme of focusing on the things that matter in observability, incident response, on-call, and beyond. We discuss... 🔌 Empowering engineers to implement change in your org 🧑‍🍼 Focusing on what matters (customer & business > technology) 👀 Not just adding more monitoring as the output of each PIR 😎 How autonomy can lead to...
This week I'm joined by SRE leader Andrew Hatch from Cisco ThousandEyes to talk about a dirty word in the resilience community... root cause. In this excellent conversation we explore... 🌌 Is the root cause of every incident the big bang? 🦖 How the value of root cause degrades as complexity increases 🫣 That if the culture is not blameless, people will hide things 🌳 Alternative approaches to root cause analysis such as branching timelines 🙋 Getting someone without skin in the ga...
This week I'm joined by David Dick from 2 Steps to (finally!) discuss synthetic monitoring. We cover... 🤖 What is synthetic monitoring? 🦾 What are the benefits and drawbacks of using it? ☢️ Non-web based synthetics (the tough stuff) 🍹 Combining RUM and synthetics 🫢 Do synthetics need an OTEL-like framework? ...and much more. You can find David on: LinkedIn: https://www.linkedin.com/in/david-dick/ You can find more about 2 Steps at https://2steps.io/# You can find Stephen on: ...
This week I'm joined by Cin7 Engineering Director Milan Brown to unpack the challenges of technology management and leadership. We discuss... ✖️ Theory X vs Theory Y management 🗣️ Intention based leadership and communication 🏢 Conditions in an org for people to thrive 😵‍💫 How do you learn to manage and lead? 🫤 Managing people when you're not an expert in what they do ...and much more. Resources mentioned during the episode: Turn The Ship Around! (book): https://davidmarquet.com...