Xero originally had a single operations team that managed all production incidents. As Xero has grown, it has become necessary to empower product teams to support their own services. To do this, Xero’s Site Reliability Engineering (SRE) team has developed a set of best practices around incident management. The challenge was to make it easy for other teams to adopt these practices, and that is how Xero’s incident management chatbot was born.
“Multivac” is our automated guide through Xero SRE’s incident management framework. It helps users define roles and responsibilities for an incident, communicate with a wider audience, and track down other teams to help, all with the aim of reducing the time to service restoration. In this talk, I’ll discuss why we built Multivac and how it has become an indispensable aid in managing our production environment.