Designing distributed systems means considering failure scenarios—both likely and less so. Will the network let you down? (Almost assuredly.) Will some portion of your IaaS misbehave? (Have you met computers?) We build in graceful degradation for much of our automation but often neglect the (just-as-essential) human interactions.
The classic “hard problems” of cache invalidation and naming things revolve around our understanding of what’s correct and true and our agreements with one another on scope and relevance. Communication is essential for making context-dependent decisions.
Whether we’re attempting to determine the current state of reality or distinguish logical boundaries, democratized observability is key to answering our questions. As the fractal complexity of our distributed systems grows, we need to mindfully choose practices that work with our tooling. You can’t buy a silver bullet, but you can forge one from the collaborative efforts of your team.