SREDAY

Site Reliability, DevOps and Cloud

April 11, 2025, San Francisco, CA, USA

1 Day, 16+ Speakers, 2 Tracks, 100 Attendees

A Tale of Two Outages

David Argent
Salesforce

Having been at ground zero for two outages that you can still look up on Google, I designed this talk as an oral history of the causes, reactions, solutions, and aftermath of the Danger Sidekick outage and of another major outage at a company I cannot mention by name.

The first is a story about how a single individual decision cascaded into a massive failure that took six weeks to clean up and was deemed an impossible recovery by both Oracle and the SAN vendor. It was arguably one of the largest cases of putting Humpty Dumpty back together again, as a 15 TB Oracle DB was reassembled, one 4K page at a time, to achieve 99.9% data recovery on what had been deemed “impossible”. We called it the Lazarus Project for a reason. We had PC desktop tower units littering the aisles of our otherwise Unix datacenter, as they were the recovery fleet. The third-party data recovery team that helped achieve the impossible built their recovery tools on Windows.

The second is a story about how several choices and design flaws can all come together in a perfect storm. We had poor designs involving infinite retries on a non-caching interface. We had no testing of content posted minutes after the start of a major sale, which led to the back-end load from one internal service on the main NoSQL DB exceeding what the entire service was designed to handle. There were design constraints in the NoSQL platform that made it unable to shed “dead” transactions in the queue, lengthening the time to recover. We had an overall rendering engine that required hundreds to thousands of calls to all succeed in order to render a single retail page, and that needed to retry all calls in the event of a single failure. The human decision to start the major sale globally at the same time, rather than staggering it according to time zones, was the capstone in a Taj Mahal of an outage. While you couldn’t see it from the outside, over 98% of all transactions were successful against the NoSQL DB even at the worst part of the outage, though the design decisions elsewhere led to a much lower observed availability than that.
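As a rough illustration of why a high per-transaction success rate can still produce a much lower observed availability, the sketch below works through the arithmetic under some loudly stated assumptions: every call needed for a page succeeds independently with the same probability, a page renders only if all of its calls succeed, and any failure triggers an unbounded retry of the whole page. The function names are hypothetical, and the 98% figure is borrowed from the abstract purely as an illustrative input, not as a claim about the actual per-call rate or the real system's behavior.

```python
def page_success_probability(per_call_success: float, calls_per_page: int) -> float:
    """Chance that every call for one page succeeds.

    Assumes calls are independent and share one success rate,
    which is a simplification for illustration only.
    """
    return per_call_success ** calls_per_page


def expected_calls_per_rendered_page(per_call_success: float, calls_per_page: int) -> float:
    """Expected backend calls spent per successfully rendered page.

    Assumes a failed page retries all of its calls with no retry limit,
    so the number of attempts follows a geometric distribution with
    mean 1 / p_page. Also assumes the extra load does not itself lower
    the success rate, which is optimistic.
    """
    p_page = page_success_probability(per_call_success, calls_per_page)
    if p_page == 0.0:
        return float("inf")
    return calls_per_page / p_page


if __name__ == "__main__":
    per_call = 0.98  # illustrative input only
    for calls in (1, 10, 100, 1000):
        p_page = page_success_probability(per_call, calls)
        load = expected_calls_per_rendered_page(per_call, calls)
        print(f"{calls:>5} calls/page: page success {p_page:.4%}, "
              f"expected calls per rendered page {load:,.0f}")
```

Under these assumptions, a page that needs 100 calls at a 98% per-call success rate renders cleanly only about 13% of the time, and retrying the whole page until it succeeds costs roughly 750 calls per rendered page instead of 100. In a real system the added retry load would likely push the success rate down further, so this simple model understates the effect.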

I'm a former chemist who wandered away from his science degree and found a home in the tech industry, where I've spent better than two decades working to design, implement, deploy, and support highly scalable, reliable online services in a wide variety of roles. I grew up in New Jersey, where they teach sarcasm as your first language, and I've never been able to do the accent no matter how many times I watch The Sopranos. In my free time I used to be a competitive bridge player (competed in the World Junior Pairs and the Reisinger), but now I occupy myself more with bowling (still haven't thrown a perfect game) and playing tabletop RPGs with my friends, both in person and online.
