Lessons learned from monitoring a large production network

At Facebook, we have a large internal network spanning multiple datacenters across multiple geographic regions and continents. This means a large number of devices and links to monitor. We have developed a range of strategies to make sure our network is healthy, from SNMP counter collection to sophisticated synthetic traffic injection tools. The aim of this talk is to review how we approached the problem over the years, how we dealt with common problems in the monitoring space and, most importantly, what we learned while doing so and how we applied those concepts to other problems.

Speaker

Giacomo Bagnoli

Giacomo Bagnoli is a Production Engineer at Facebook in Dublin, where he works on network monitoring tools. He previously worked at Etsy, Amazon, and various small startups, and has been breaking and fixing systems for more than a decade.