How Instagram Reels manages reliability | Jack Li (Instagram, Shopify)
Update: 2023-02-16
Description
Jack Li explains how his production engineering team rolled out a new incident review process, how they made the case for investing in reliability, and which tools the team has built to improve it.
—
Discussion points:
- (1:25) How Jack became interested in reliability
- (3:24) Where the Instagram Reels team fits into the broader organization
- (4:05) What Jack’s team focuses on
- (4:55) The role of production engineering at Instagram versus Shopify
- (8:32) The essence of DevOps
- (10:44) Pros and cons of having product-focused teams
- (13:35) How Jack’s team defines and tracks quality
- (15:46) Signals the team monitors outside of systems
- (18:10) Revamping Instagram Reels’ incident management process
- (19:46) Making the case for improving the incident review process
- (28:10) How their incident review process works
- (31:55) The roles involved in an incident review
- (33:40) The value of having incident reviews
- (35:55) Why leaders should be part of incident reviews
- (38:34) Why Jack’s team builds tools for driving reliability goals
- (40:06) The types of tools Jack’s team focuses on
- (43:09) What a merge queue is and why it was built at Shopify
- (51:20) Using a Slack bot for ‘failed build’ alerts
- (52:32) When a company should consider implementing a merge queue
—
Mentions and links:
Follow Jack on LinkedIn
Jack’s article from his time at Shopify about their Merge Queue
Jack’s talk on Shopify’s Merge Queue at GitHub Universe 2019