Making Disaster Routine: Practicing Failures Using Active Monitoring and Chaos Engineering

A fully automated DevOps workflow can give teams a false sense of invulnerability to failure. When a failure does occur, the reaction is “all hands on deck,” usually with a lot of raw data but few real clues about the cause. Too often, rebooting the instances or some similarly unrelated action makes the problem go away without revealing a root cause, so it may happen again and again without resolution.

There are two ways to become comfortable with failure. The first is to ensure that it is not sudden: active monitoring can often provide predictive indicators of failures. The second is to practice failures. Chaos engineering was originally intended to give DevOps teams confidence in the ability of an application to withstand unexpected shocks.

This presentation discusses reasons why we are not prepared for sudden failure, as well as techniques for addressing those failures when they occur. It emphasizes that practice can help make failure just another day at the office.

Speakers

Peter Varhol

Peter Varhol is a well-known writer and speaker on software and technology topics, having authored dozens of articles and spoken at a number of industry conferences and webcasts. He has advanced ...

Gerie Owen

Gerie Owen is Vice President, Knowledge and Innovation-US at QualiTest Group, Inc. She is a Certified Scrum Master, Conference Presenter and Author on technology and testing topics. She enjoys ...