Readiness probes, health checks, autoscaling, chaos: you’ve followed all the known best practices and checked it all off. And yet, your application is still unreliable! So you add in auxiliary services to provide replication, sharding, work queues, and all the other buzzwords. Now your straightforward and easy-to-understand process has exploded in complexity. In this talk, we’ll examine how to build application-layer durability on top of the infrastructure durability you’ve already built. Along the way, you’ll learn how to simplify your architecture and seamlessly mitigate a few specific catastrophic failures.