Learning From Failures

Running software services is hard. We all work with complex systems that fail in different and unexpected ways. How can we reduce the impact of future failures? By learning from our incidents, we can improve the services we run.

Attendees will learn the following:

- Why we experience failures
- Why Five Whys and RCA methods should no longer be used
- Jeli.io's Howie: The Post-Incident Guide
- How to facilitate learning sessions
- How to improve from failures

This talk will be useful for any team that experiences failures, whether large or small, distributed, or cross-functional. The audience will walk away with a plan for addressing future failures and the foundation to keep improving their post-failure responses.

What will the audience get from the talk? A more effective method for teams to follow to learn from failures.

Teams of any size: large, small, cross-functional, etc.

Notes to speaker review: This is an expanded version of my previous lightning talk; “Howie” is a newer guide I will reference.

Speaker

Craig Cook

Craig has a 25+ year background in infrastructure and monitoring. He manages two IBM teams, one focused on PagerDuty and Instana (Observability Platform) for all of IBM, and the other develops ...