By August 2022, we had finished training and onboarding all of our product teams to go on-call for the services they were responsible for. Since then, we’ve seen a tremendous positive impact on our engineering culture and a dramatic reduction in workload for our SRE team. Making this transition was difficult and took six months of hard work and planning. Although it was a success in the end, we made many mistakes along the way.
In this talk, I’ll outline our many successes and failures during our transition to a distributed on-call model and highlight a number of things we should have done differently. My goal is to help you improve incident management at your company by sharing what we learned at Pleo while revamping our on-call.