A fully automated DevOps workflow gives teams a false sense of invulnerability to failure. When a failure does occur, it triggers an “all hands on deck” reaction, usually with a lot of raw data but few real clues about the cause. Too often, rebooting the instances or some similarly unrelated action makes the problem go away but doesn’t lead to a root cause, so the failure may happen again and again without resolution.
There are two ways to become comfortable with failure. The first is to ensure that it is not sudden: active monitoring can often provide predictive indicators of failure. The second is to practice failures. Chaos engineering was originally intended to give DevOps teams confidence in the ability of an application to withstand unexpected shocks.
This presentation discusses the reasons why we are not prepared for sudden failure, as well as techniques for addressing failures when they occur. It emphasizes that practice can help make failure just another day at the office.