Reporting on Reliability - Improving stakeholder conversations is a best-practices presentation on how to communicate about reliability: during incidents, immediately after, and in periodic planning sessions.
“Phew! That incident is resolved. Now let’s never speak of it again.” Tempting, right?
After a production outage, rehashing it is often the last thing we want to do, especially since the conversation is likely to be a tense “face the music” situation full of finger-pointing and apologies. As an alternative, reliability engineers have long practiced blameless retrospectives, having discovered that they are the best way to learn from the past, lest we repeat it. Unfortunately, stakeholders outside the operations team may not be so forgiving. But as organizations increasingly realize the importance of reliability to their users and their bottom line, we find an opportunity to engage our colleagues in developing a more mature, sustainable approach to discussing reliability, and the occasional lack thereof. By normalizing incidents, communicating systematically, and aligning on goals and investments, we mature from an avoidant, reactive approach toward a collaborative, strategic one.
Ramón is a Staff Site Reliability Engineer at Google, where he works on the Identity team. He started back in 2011 as an intern and has since become
...