Rethinking On-Call

Did you know that on-call is short for “onerous and callous”? You didn’t, because it’s not. But it feels like it could be, doesn’t it? On-call is often lonely, thankless, and underinvested. Many talks tell you that you should feel bad about it. This isn’t one of those talks.

On-call is often organically developed and minimally structured. It evolves from the discomfort of being unprepared for surprises and the pressure to avoid customer-angering events. These drivers often mean on-call schedules aren’t well thought out, handoffs are non-existent or poorly handled, and tooling gets glommed together into a giant ball of pain.

In this talk, we’ll describe a novel approach inspired by decades of experience, gobs of research, and a year of interviews: on-call as an organizational program. The on-call program’s goal is to create the capacity to deal with operational disturbances in a sustainable and adaptable way. By focusing on managing capacity instead of reacting to pages, a programmatic approach to on-call makes better use of available resources, resulting in better responses and less stress for the engineers involved. We’ll discuss the elements of a capacity-management on-call program. These include facing the tradeoffs between dedicated on-call support and side-of-the-desk support, managing attention and minimizing disruptions, sharing capacity across organizational boundaries, and continuously clarifying responsibilities as conditions change.



Cory Watson

Cory is a founder, leader, and engineer who’s worked at places like Stripe, Splunk, and Twitter. He helped popularize observability and is passionate about the people who create resilience in ...