Making sense of metrics, alerts and dashboards

In this presentation we'll learn which metrics matter most in our systems (upper and lower bounds, SLAs/SLOs), what dashboards are for, why different consumers need different dashboards, and why dashboards are for gathering more information about an outage rather than for discovering that one is happening. And then, sadly, alerting: what to think about before adding a new alert (Can we automate the response? Is it really actionable? Do we have expectations for when it will trigger?) and how to avoid alert burnout. The main goal is to help teams and managers make sense of their data by collecting meaningful information, presenting it in a way that is useful for all parties involved, and not drowning teams in noise.



Maurício Linhares

Maurício is a software engineer at DigitalOcean building the Cloud. He's passionate about uptime, metrics, and dashboards, though not that passionate about alerts, and leads the team responsible for DO's ...