#113 - Faster Incident Response feat. Tim Armandpour // CTO @ PagerDuty
Description
Plan and PRACTICE for better incident response with insights from Tim Armandpour, CTO of PagerDuty. Learn the secrets to resilience from the team that mitigated the impact of a major outage—handling a 250% traffic surge while delivering on their SLA.
Listen to find out:
- 🛠️ Why planning AND practice are both critical for incident response.
- 🚧 How to practice for incident response (e.g., Failure Fridays with Chaos Engineering).
- 🧑‍🤝‍🧑 Ownership: Why tech AND business teams must join post-mortems.
- ☁️ How to mitigate the impact of your cloud provider’s lower SLA.
- ⚓ Which architectural patterns are more resilient?
- ⚖️ WARNING: “bend” the CAP theorem at your own risk.
Timestamps:
(00:00:00) Introduction to Alphalist Podcast
(00:01:00) Meet Tim Armandpour
(00:01:56) Tim's Early Career
(00:06:22) Handling Major Incidents at PagerDuty
(00:09:21) The Importance of Preparedness
(00:13:54) Practicing Failure Scenarios
(00:18:16) Resilient Infrastructure and Architectural Patterns
(00:22:44) Standardization and Data Management
(00:25:48) Exploring Infrastructure Resilience
(00:26:20) Achieving High Availability with Lower SLA Cloud Platforms
(00:29:38) Defining Meaningful SLIs
(00:32:15) Assessing Incident Readiness
(00:35:15) The Importance of Ownership
(00:41:30) Continuous Improvement
(00:43:53) Lessons from a Yogurt Business
(00:48:18) Final Thoughts and Takeaways