In this talk, we’ll look at how to troubleshoot complex systems made up of hundreds or even thousands of services. We’ll address key questions such as: What do we need to know about each service? How should dashboards be structured so that the system’s current state can be understood at a glance? We’ll also cover how to use the observability signals (metrics, logs, traces, and profiles) effectively, discussing where each one applies and what insights it provides. Finally, we’ll examine strategies for automating root cause analysis, both fully and partially.