Mahesh Venkataraman & Koushik Vijayaraghavan
Accenture
Why should SREs care about the systems thinking used in aviation safety engineering?

Modern distributed systems face the same challenges as complex safety-critical systems: emergent failures, cascading outages, and the gap between system design and runtime behavior. While traditional monitoring focuses on known failure modes, aviation safety engineering provides systematic approaches for identifying unknown risks.

In this talk, you will learn how we applied MIT's System-Theoretic Process Analysis (STPA), a methodology from aviation safety, to analyze reliability risks in a large-scale eCommerce platform processing millions of transactions daily.

What you will learn:

- How to map STPA concepts (hazards, constraints, control actions) to distributed systems components
- A systematic framework for identifying cascading failure scenarios before they occur
- Practical techniques for analyzing interactions between auto-scaling, load balancing, and circuit breakers
- How cascading failures were reduced using insights from this analysis
- Actionable methods you can apply to your own systems

Real examples covered:

- Circuit breaker coordination failures that created retry storms
- Auto-scaling feedback loops that amplified rather than dampened failures
- Security policy interactions that blocked legitimate traffic during incidents
- Configuration drift detection that prevented silent reliability degradation

This is not theoretical: we'll show concrete code examples, architecture diagrams, and actual incident data. You'll leave with a practical toolkit for systematic reliability analysis that goes beyond traditional SRE approaches.

Whether you are dealing with microservices, serverless architectures, or hybrid cloud deployments, this methodology will help you build and maintain more resilient systems by thinking systematically about failure modes and control structures.

Perfect for: SREs, Platform Engineers, and Engineering Managers who want to move beyond reactive incident response to proactive reliability engineering.
Mahesh Venkataraman leads innovation in applying artificial intelligence, data mining, and machine learning to software engineering. He has led successful implementations of natural-language-processing-driven test automation, usage and failure modeling using log analytics, empirical analysis of technical debt, and knowledge graphs that discover patterns and relationships to optimize test suites and improve decision-making in system integration projects. His passion is bridging the gap between theory and practice and between academia and industry, and encouraging creative thinking in software. He is a regular keynote speaker at conferences. He is currently working on addressing uncertainty in fault prognosis and diagnosis.
Koushik Vijayaraghavan is a Senior Managing Director at Accenture, where he has spent over 20 years driving product innovation, engineering, and digital transformation for global clients. He began his career at Cognizant and has completed Harvard Business School’s Disruptive Strategy program, strengthening his expertise in guiding organizations through change.