As site reliability and infrastructure engineers, we do a lot of things to improve reliability. Some of them are easy, especially if your cloud provider supports them, like adding HA to your database. Others require more thorough process and planning, for example reducing blast radius. You can start small with a multi-zone or multi-region setup for your compute, but you will most likely still end up with SPOFs such as the database and the load balancer. Sharding is not new, and there are many ways to accomplish it. In this talk, we'll go over what Twingate did for its own sharding strategy, which eliminated all single points of failure and reduced the impact of database migrations and infrastructure changes.
Birol Ertekin is a Director of Engineering at Twingate, a new API-first security platform that helps companies manage how identities and devices access networks, applications, and data. He has 20+ years of experience in the industry, having previously worked for companies like Meta, Vudu - Movies & TV, and ign.com. He is highly enthusiastic about uptime and sees everything in life as a learning experience, including outages. He is not happy if the root cause(s) cannot be identified after incidents or outages.