In today's world, a company must be a "Learning Organization" in order to be successful and innovative. Learning from both failure and success in order to implement small, incremental improvements is critical. But until you implement and apply new information, you haven't truly "learned" anything, and you certainly haven't improved.
According to the 2015 Monitoring Survey, most companies leverage metrics from monitoring and logging purely for performance analytics and trending. If high availability and reliability are important, they also use metrics to alert on faults and anomalies. Despite these "best practices", the metrics serve primarily as context to keep things "running" or to return them to "normal" when there's a problem. Rarely is that data used to identify areas of improvement once services have been restored. When your system suffers an outage, you will absolutely repair and restore services as best you know how, but are you paying attention to the data from the recovery efforts? What were operators seeing during diagnosis and remediation? What were their actions? What was going on with everyone involved, including their conversations? Together, the answers amount to a step-by-step replay of exactly what took place during that outage.
This "old-view" perspective on the purpose of monitoring, logging, and alerting leaves the full value of metrics unrealized. It fails to address what matters to the overall business objective, and it leaves no room for seeking out innovation or disrupting the status quo.
This talk will illustrate how to tell whether your company is making the best use of its metrics, and how to not only learn from failure but also become a "Learning Organization".
Speaker: Jason Hand