Maintaining Reliable Systems: How to Minimize Incident Impact?

Incidents are expensive to the business, especially when customers leave because they perceive us as unreliable. But failures will happen; it's not a question of if, but of when. So how can we reduce the impact on our users? In this talk, I will review the production incident cycle, meaning the time during which we are unreliable and our users are unhappy, and its components: time to detect, time to repair, and time between failures. I'll share a few methods for tackling each of those parts to minimize incident impact, from both the technical and the people side, and expand on incident response and postmortems. The goal is to know what matters most to us and to make those decisions in a data-driven way.

Speaker

Ayelet Sachto

Ayelet is a Site Reliability Engineer @Google UK, formerly a Strategic Cloud Engineer leading PSO-SRE efforts in EMEA @Google, and a passionate problem solver. Throughout her 17-year career, Ayelet ...