Learning From Failure: How We Learned a Better Way

Have you been blamed for an outage? Was the root cause “Human Error”? Running software services is hard. We all work with complex systems that fail in different ways. How can we reduce the impact of future failures? By learning from other industries, we can improve the services we run.

Attendees will learn the following:

  • Why we experience failures
  • Why the Five Whys and root cause analysis (RCA) methods should not be used
  • Etsy’s Open Source Debriefing Method
  • How to facilitate learning sessions
  • How to improve from failures
  • How to avoid jumping into “fix it now” mode

This talk will be useful for any team that experiences failures, whether large or small, distributed or cross-functional. Attendees will walk away with a plan for addressing future failures and a foundation for continuing to improve their post-failure responses.



Craig Cook

Craig has 20+ years of experience in infrastructure and monitoring. He coaches and advises development squads on improving operational efficiency. He has implemented various open source tools and ...