Planet-Scale Dashboards

Talk

Google runs hundreds of thousands of services globally, often interdependent and with shared concerns. At that scale, classical Federated Observability (a platform team providing foundations and/or building blocks for each team to assemble on their own) no longer works.

In this talk, we will demonstrate how Google managed to cut toil dramatically while providing best-in-class monitoring out-of-the-box:

  • The unique circumstances that contributed to Google’s scaling problem
  • A data model for re-usable dashboards
  • Impact on both configuration overhead and incident response
  • Looking beyond dashboards, how such re-use can be facilitated in the broader observability space

Robert will draw on Google’s research paper on Planet-Scale Dashboards (to be published in mid-2025) and more than a decade of experience in SRE.

Speaker

Robert Lehmann

Staff SRE @ Google

Robert has spent the past decade with Google SRE, building the tools that a majority of Google’s incident response uses for production monitoring. In a previous life, he was a

...