As a system reaches production, the engineering teams holding the pager can keep their focus on the value it provides to customers by setting relevant SLOs and responding to alerts. However, as the system changes over time, it may gain more points of failure or grow in complexity, or customers may simply use it differently.
If SLOs aren’t kept up to date, teams can find themselves responding to more and more alerts about issues that are increasingly invisible to customers. Even the best teams can end up firefighting toilsome alerts, with no time left to improve the system as a whole.
This talk is based on a true story. You will learn about pitfalls encountered when setting SLOs and how they directly affected both the day-to-day developer experience of engineers at Red Hat and the systems they worked on. We’ll also discuss how avoiding and climbing out of these pitfalls can bring about a better understanding of the system and reduce burnout.