Imagine an on-call shift where you don’t start your day sifting through routine capacity alerts, nudging stuck rollouts, or closing noisy, low-impact tickets. Instead, you get to tackle things that matter. This is the goal we’re chasing at Google.
We are developing a system where software agents can autonomously handle a significant chunk of operational toil. The key is to do this generically and horizontally, so that the solutions apply broadly across the line between development and operations.
In this session, I’ll share our journey and lessons learned. We’ll cover the significant challenges, including evaluation, ensuring safe and secure operations, and how to codify complex, sometimes opinionated, remediation steps. I’ll also outline the infrastructure we’ve put in place in response to those challenges and requirements.
This talk aims to provide a practical perspective on leveraging automation and agents in a production environment. You’ll leave with critical questions to consider for your own agents that interact with production.

Today: Managing a team of Site Reliability Engineers
Before:
10 years of being an IC SRE @ Google
PhD in Information Retrieval
Master in AI

Ramón is a Senior Staff Site Reliability Engineer at Google, where he works on the Identity team. He started back in 2011 as an intern and has since become team Technical Lead (TL), Engineering
...