Reliability Advocacy

We can live by reliability principles ourselves, but we benefit far more when those ideas spread throughout our organizations and others work alongside us. Getting teams interested in using error budgets and having discussions about implementing feature freezes are some of our goals. SLOs are powerful on their own, but they’re even more impactful when everyone follows the process and builds reliable systems. Working alone is an ongoing battle of priorities, one that becomes a lot easier when the systems we depend on also track and improve their reliability.

Getting SLOs off the ground can be very hard. Gathering metrics and setting up dashboards and alerts can often be done in a few days; changing how an organization works is a lot more difficult. Inspired by Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets, Site Reliability Engineering, and The Site Reliability Workbook, and because we’re not the first to venture out on this journey, we’ll tackle Reliability Advocacy by dividing it into three separate stages, Crawl, Walk, and Run, and see how we can implement each of them.



Ricardo Castro

Lead Site Reliability Engineer at Anova. MSc in Computer Science from the University of Porto. CK{AD, A, S} by Cloud Native Computing Foundation (CNCF) | Linux Foundation. {Terraform, Consul, Vault} ...