A Pint-Sized Introduction to SLOs

Athletes, firefighters, and doctors train every day to be the best at their chosen profession. As engineers, we spend much of our time getting stuff to production and making sure our infrastructure doesn’t burn down outright. We spend very little time, however, learning to understand and respond to outages. Does our platform degrade gracefully? What does a high CPU load really mean? What can we learn from level 1 outages to run our platforms more reliably?

Plenty of people are jumping on the new hype, observability, and many of them are replacing their “legacy” monitoring stack. Not all of them achieve the goals they set. Observability is not a tool, however; it is a property of a system. Achieving it means moving from many small black boxes to a more data-driven view of your system.

Furthermore, we’ll discuss the need for, and the options for, monitoring not only our platforms and their inevitable outages, but also the (potential) length and impact of those outages. We’ll look at using tools like Service Level Objectives (SLOs) to prepare teams to tweak their testing and monitoring setups and runbooks so they can quickly observe, react to, and resolve problems.



Bram Vogelaar

Bram Vogelaar spent the first part of his career as a molecular biologist. He then moved on to supporting his peers by building tools and platforms for them with a lot of open source technologies. He ...