GitOps has distinct benefits for managing cloud native operations at scale, and many people talk about why modern engineering organizations should adopt it. What is rarely discussed are the tradeoffs made once the practice is adopted and the novel challenges it introduces, such as how it makes on-call harder. Greater policy and governance come at the expense of developer control.
This can adversely impact the ability to troubleshoot incidents in real time when urgent patches and updates are required. It raises questions such as: How do urgent incident remediations lead to infrastructure drift? How can we avoid or mitigate each of the negative effects?
In this talk we’ll travel down the one-way street of GitOps, examine known problems of operational overhead and friction, and offer practical suggestions for handling rollback challenges and optimizing incident root cause analysis, so that users understand how to unlock the benefits of GitOps while minimizing its challenges.