etcd debugging saga

Imagine you have set up a Kubernetes cluster on premises. It works fine for a few months. Then, for no apparent reason, the master nodes come under high load and you are unable to fix it. There is nothing obvious in the logs, or at least they are inconclusive. After a few hours the problem resolves on its own. The situation repeats a few weeks later. This is the story of a difficult investigation with no witnesses, only circumstantial evidence, and lots of suspects.



