The monitoring systems at our companies are among the widest windows into how our services are behaving. They are where we go when things are going wrong. They are also how we communicate with colleagues and our future selves about how our systems are composed. And because we tweak them over time, they hold archeological information about past events.
However, dashboards are often an afterthought. They get left as a task for later, a low-priority item that never gets finished. It shouldn’t be that way!
We’ll explore some of the methods that humans use to investigate outages and incidents. With that knowledge in hand, we’ll talk through some techniques you can use in your company to:
improve your dashboards to reduce incident response times
learn more about your company’s services through existing dashboards
teach others as you go!