As engineers we spend much of our time getting software to production and making sure our infrastructure doesn’t burn down outright. Yet we spend very little time learning to understand and respond to outages. Does our platform degrade gracefully? What does a high CPU load really mean? What can we learn from level 1 outages to run our platforms more reliably?
Plenty of people are jumping on the new hype, Observability, and many of them are replacing their “legacy” monitoring stack. Not all of them achieve the goals they set. But observability is not a tool; it is a property of a system: moving from many small black boxes to a more holistic view of your system.
In this talk we’ll discuss how to prepare teams to tweak their testing and monitoring setup and their work instructions so they can quickly observe, react to, and resolve problems. We look at improving your monitoring by adapting your culture first, and then perhaps your tooling: a practice where we as engineers not only write, maintain, and operate our software platforms but actively pursue ways to learn and predict their (non-functional) behavior.
Furthermore, we’ll discuss the need for, and the options of, monitoring not only our platforms and their inevitable outages, but also the (potential) length and impact of those outages. We’ll look at tools such as Service Level Objectives and runbooks that help teams quickly observe, react to, and resolve problems.
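To make the Service Level Objective idea concrete, here is a minimal, hypothetical sketch (illustrative only, not taken from the talk) of turning an availability SLO into an error budget, the quantity teams typically track to decide how urgently to react:

```python
# Hypothetical sketch: turning an availability SLO into an error budget.
# Function names and numbers are illustrative assumptions.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is blown)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

# A 99.9% availability SLO over 30 days allows about 43.2 minutes of downtime.
print(error_budget_minutes(0.999))
print(budget_remaining(0.999, 20.0))
```

A team with most of its budget left can ship faster; a team near zero should slow down and invest in reliability.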