Increasingly complex architectures generate a lot of noise, which creates a catch-22: the monitoring systems we rely on to surface incidents add to that noise, yet we need them to generate a level of noise appropriate to each incident's severity. In this talk, I’ll discuss sanity-saving steps that teams can use to design lean monitoring systems.
Integrations have been key in helping teams signal each other about all kinds of valuable information: when code is ready for review, when features are ready for release, when resources are running low or there is an outage, or even when the team is getting ready to go to lunch. All this information can lead to overload, and engineering teams may find that they frequently filter or mute individual alerts, channels, or even entire alert sources so they can focus long enough to complete other tasks. Unfortunately, this can mean missing critical events, with potentially costly consequences. I’ll be discussing:
Sources of noise
Sources of silence
How to collect and direct information to appropriate teams
How to handle garbage collection / avoid unnecessary notifications
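
As a rough illustration of the last two topics, here is a minimal sketch in Python of directing alerts by severity and suppressing repeats. Everything in it is an assumption made for illustration and not something taken from the talk: the `Alert` fields, the `ROUTES` mapping, the destination names, and the five-minute suppression window are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical severity-to-destination mapping; names are illustrative only.
ROUTES = {"critical": "pager", "warning": "team-channel", "info": "digest"}

# Unknown severities go to a channel people read, so they are not silently lost.
DEFAULT_ROUTE = "team-channel"


@dataclass(frozen=True)
class Alert:
    source: str       # e.g. "db", "ci"
    severity: str     # e.g. "critical", "warning", "info"
    message: str
    timestamp: float  # seconds, any monotonic reference


class AlertRouter:
    """Routes alerts by severity and drops repeats inside a time window."""

    def __init__(self, suppress_seconds: float = 300.0):
        self.suppress_seconds = suppress_seconds
        self._last_sent = {}  # (source, severity, message) -> last sent time

    def route(self, alert: Alert):
        """Return the destination for the alert, or None if it is suppressed."""
        key = (alert.source, alert.severity, alert.message)
        last = self._last_sent.get(key)
        if last is not None and alert.timestamp - last < self.suppress_seconds:
            return None  # same alert was sent recently: avoid the duplicate
        self._last_sent[key] = alert.timestamp
        return ROUTES.get(alert.severity, DEFAULT_ROUTE)


router = AlertRouter(suppress_seconds=300)
first = router.route(Alert("db", "critical", "disk full", 0.0))
repeat = router.route(Alert("db", "critical", "disk full", 100.0))
later = router.route(Alert("db", "critical", "disk full", 400.0))
```

In this sketch the repeat at 100 seconds is suppressed, while the one at 400 seconds is sent again because the window has elapsed. Whether critical alerts should ever be suppressed is a design decision for each team; a window that is too long is itself a source of silence.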