It's Down! Simulating Incidents in Production

Who loves getting paged at 3am? No one.

When responding to incidents, whether at 3am or in the middle of the day, we want to feel prepared and practiced in resolving production issues. In this talk, you'll learn how to practice incident response by simulating outages in your application.

This talk is a case study of the first few gamedays at Stitch Fix. A handful of our senior leaders spearheaded “Chaos Engineering” at Netflix, Twilio and Etsy. We are using those strategies to set up the basic infrastructure and plans for gamedays on the Styling Engineering team at Stitch Fix. By simulating incidents in production, teams can learn from and gather feedback on their incident response and on how the application reacts.

I'll show a few code examples, highlight the additional metrics we implemented before starting gamedays to measure data specific to each gameday, and discuss the outcomes of these gamedays.

The key takeaway from this talk is that incident response practice helps ensure we are building stable software that the entire team can maintain and support.

This talk is open to everyone! There will be a sprinkling of Ruby code, but most of the talk focuses on process, strategies and learnings.
