Looking at scaling SRE? You might need more than SREs. Let’s dive into Booking.com’s SRE journey and discuss our successes and failures in supporting reliability in a continuously growing infrastructure - from the on-premises monoliths of the early days to the thousands of services running in a hybrid cloud today.
This talk highlights Booking.com’s journey in implementing SRE, from a two-team setup in 2016 to our current organization of over 200 SREs spread around the world. The main constant of our journey has been the need to continuously scale reliability practices. While hiring more SREs was the way to support more products at the beginning, it quickly became clear that more was needed: more automation, more process, more collaboration, … and more delegation back to the developers. Looking back, we will discuss the different models we tried, the challenges we faced, and the changes we implemented, working our way up to our current setup. This session is aimed at SREs, SRE managers, and anyone interested in growing an SRE organization who could benefit from insights into our challenges and our successes.
Yoann is a Senior SRE Manager with experience in building and operating resilient applications at high scale. He joined Booking.com in 2018, where he supports the company’s core services on performance, reliability, disaster recovery, and security, with a continuous focus on making SRE practices scale through efficient tooling and processes.