Athletes, firemen and doctors train every day to be the best at their chosen profession. As engineers, we spend much of our time getting things to production and making sure our infrastructure doesn’t burn down outright. We spend very little time, however, learning to understand and respond to outages.
Practices like infrastructure as code, service discovery and configuration management have helped us to quickly build and rebuild infrastructure, but we haven’t spent nearly enough time training ourselves to review, monitor and respond to outages. Does our platform degrade gracefully? What does a high CPU load really mean? What can we learn from level 1 outages so that we can run our platforms more reliably?
In this talk we’ll discuss the need for, and the options for, creating a game day culture: one where we as engineers not only write, maintain and operate our software platforms but also actively pursue ways to learn and predict their (non-functional) behavior. We’ll look at tools like Toxiproxy and the Simian Army as ways to help teams tune their testing and monitoring setup and their work instructions, so they can quickly observe, react to and resolve problems.