Failures are inevitable, and every once in a great while they become epic disasters. Do you remember the time your cloud provider lost an entire region? What about the time your teams couldn’t check in any of their code, or when your favorite social networking site was down for half a day? And yes, even the time your alerting provider couldn’t send you alerts for over a day? Disasters like these erode customer trust, and learning how to respond appropriately is critical if you expect to earn it back.
At PagerDuty, we decided that managing communications during a crisis is just another type of incident we may encounter. We applied many of the DevOps principles we learned from managing technical incidents to other parts of our organization. Marketing, Support, Sales, and even the Executive Leadership Team now have trained on-call responders.
In this talk, we’ll examine the role of technical responders during a massive outage. We’ll look at what happens during major outages and compare that with what happens during catastrophic outages, when additional cross-company responders are mobilized. I’ll provide a step-by-step framework you can use to establish your own crisis communications plan, along with tips and lessons for getting a process like this deployed in your own organization.