Modern production systems can no longer rely on static dashboards and reactive on-call rotations to ensure uptime. At large scale — with billions of requests flowing through mission-critical services — reliability must be engineered into the system through autonomous detection, mitigation, and recovery. In this session, I’ll share how our platform team evolved from traditional observability stacks to an integrated, self-defending resilience architecture that transforms metrics into real-time, automated mitigations.

Key topics include:
Actionable observability: Designing high-fidelity Prometheus instrumentation that surfaces actionable SLO breaches and capacity anomalies — not just vanity metrics.
Closed-loop alerting: Building alert pipelines that automatically trigger mitigations, including traffic shaping, circuit breaking, and dynamic configuration changes.
Continuous delivery at scale: How we implemented fully automated CI/CD pipelines with canary deployments, progressive rollouts, and automatic rollback — eliminating manual gates while preserving production stability.
Dynamic rate limiting: Using adaptive throttling to contain abusive or runaway workloads before they impact critical-path services.
Proactive incident response: Real-world learnings from production incidents that shaped our automated safeguards, including post-incident automation improvements and resilience patterns.
Operational trust: Governance strategies for enabling engineers to trust self-healing automation, from progressive rollout policies to guardrails for fail-safe operation.

Attendees will gain a practical blueprint for evolving traditional monitoring into an autonomous resilience layer — with concrete patterns, architectural considerations, and lessons learned operating a high-volume, always-on platform. Whether you’re modernizing your incident response playbooks, tightening your feedback loops, or scaling continuous delivery for critical systems, you’ll leave with actionable strategies to move beyond dashboards — and build a production environment that can defend itself.

Key Takeaways:
How to evolve from passive observability to automated corrective action.
Designing metrics pipelines that detect and trigger real-time mitigations.
Safe automation of deployments at scale without sacrificing reliability.
Implementing dynamic safeguards like adaptive rate limiting and circuit breaking.
Practical leadership and governance approaches for building trust in self-healing systems.
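To make the closed-loop idea concrete, here is a minimal sketch of a watcher that polls Prometheus for an error-rate SLI and calls a mitigation hook when a threshold is breached. The PromQL expression, the Prometheus address, and the gateway admin endpoint are illustrative assumptions, not the platform described in this talk.

# Minimal closed-loop sketch: poll an SLI from Prometheus and trigger a
# mitigation when it breaches a threshold. Endpoints and queries below are
# illustrative assumptions, not a specific production system.
import time
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"  # assumed Prometheus address
SLI_QUERY = (
    'sum(rate(http_requests_total{code=~"5.."}[5m])) '
    '/ sum(rate(http_requests_total[5m]))'
)
ERROR_THRESHOLD = 0.01  # 1% error rate, placeholder SLO threshold

def current_error_ratio() -> float:
    """Read the error-rate SLI from the Prometheus HTTP API."""
    resp = requests.get(PROM_URL, params={"query": SLI_QUERY}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def apply_rate_limit(requests_per_second: int) -> None:
    """Hypothetical mitigation hook: call your gateway's admin API."""
    requests.post("http://gateway:8080/admin/rate-limit",  # assumed endpoint
                  json={"rps": requests_per_second}, timeout=5)

if __name__ == "__main__":
    while True:
        if current_error_ratio() > ERROR_THRESHOLD:
            apply_rate_limit(requests_per_second=500)  # shed load until healthy
        time.sleep(30)

In practice the mitigation hook could equally flip a circuit breaker or push a dynamic configuration change; the point is that the alert pipeline ends in an action, not a page.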
Sureshkumar Karuppuchamy is a technology leader with more than two decades of experience designing and modernizing large-scale, AI-enabled infrastructure for some of the world’s most complex platforms. In his current role as a senior engineering leader at eBay, he has led critical modernization efforts across core systems—revamping legacy platforms, transitioning to cloud-native data solutions, and reimagining API architectures to improve agility, reliability, and scalability. His work includes the development of advanced compliance systems that support real-time moderation and auditing in alignment with global regulations like the EU Digital Services Act. He’s also helped shape seller experience through intuitive listing flows and AI-powered tools that streamline product onboarding, such as transforming product images into fully generated listings.

Sureshkumar’s contributions have been featured in publications including The Guardian, Deloitte WSJ Insights, and marketscreener.com. He began his career at Oracle, building enterprise solutions for global supply chains, and is a graduate of Anna University’s College of Engineering, Guindy. Passionate about knowledge-sharing, he mentors technologists, contributes to peer-reviewed research, and regularly speaks at international conferences on system architecture, AI, data platforms, and compliance tech.
Real-time machine learning inference platforms present unique SRE challenges that traditional monitoring and reliability practices often can’t address. This talk provides a comprehensive framework for applying SRE principles to ML inference systems, drawing from hands-on experience scaling platforms that serve billions of daily predictions with sub-100ms latency requirements.

We’ll explore how to establish meaningful SLIs and SLOs for ML systems, where traditional availability metrics fall short in capturing model performance degradation, data drift, and inference quality issues. Learn practical approaches to incident response for ML platforms, including automated fallback mechanisms, circuit breakers for model failures, and graceful degradation strategies that maintain user experience during outages.

The session covers essential reliability patterns including blue-green deployments for model updates, canary releases with statistical significance testing, and rollback strategies that account for model warming and feature pipeline dependencies. We’ll examine monitoring and observability strategies that go beyond traditional metrics, incorporating model performance tracking, feature drift detection, and business impact correlation.

Infrastructure reliability techniques will be demonstrated through real-world examples: implementing request batching for throughput optimization while maintaining latency SLAs, designing feature stores for consistency and disaster recovery, and orchestrating Kubernetes-based serving infrastructure with proper resource allocation and auto-scaling policies. Critical operational aspects include capacity planning for ML workloads with variable computational requirements, managing dependencies between feature generation pipelines and serving systems, and implementing effective on-call procedures for ML-specific incidents.

Attendees will gain practical tools for building resilient ML inference platforms including monitoring dashboards, alerting strategies, and runbook templates. This session bridges the gap between traditional SRE practices and modern ML operations, providing actionable frameworks for maintaining reliable AI systems that deliver consistent business value while meeting stringent performance requirements.
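As one illustration of the circuit-breaker and graceful-degradation patterns discussed above, here is a minimal, generic sketch. The primary and baseline model callables, the failure threshold, and the cooldown are placeholders rather than the platform described in the talk.

# Minimal circuit-breaker sketch for model serving: after repeated failures
# of the primary model, route traffic to a cheaper baseline until a cooldown
# elapses. Names and thresholds are illustrative placeholders.
import time

class ModelCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # timestamp when the circuit opened

    def _is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at > self.cooldown_seconds:
            # Half-open: allow the next call to probe the primary model again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def predict(self, features, primary_model, baseline_model):
        if self._is_open():
            return baseline_model(features)  # graceful degradation
        try:
            prediction = primary_model(features)
            self.failures = 0
            return prediction
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            return baseline_model(features)

# Usage sketch: breaker.predict(x, primary_model=ranker.predict,
#                               baseline_model=lambda x: popularity_fallback(x))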
Gangadharan Venkataraman is a highly accomplished technology leader based in Bellevue, Washington, with a strong record of envisioning, architecting, and delivering cutting-edge software and AI/ML solutions for leading global enterprises. With over 18 years of experience across diverse domains including e-commerce, telecommunications, healthcare, and enterprise technology, Gangadharan brings deep technical expertise and strategic leadership to every initiative.

Currently serving as Senior Engineer - AI/ML Platform at Starbucks, he leads the design and optimization of large-scale machine learning infrastructure. He recently spearheaded the launch of ML Platform v2.0, drove the Databricks-to-Unity Catalog migration, and championed data governance enhancements, all to improve scalability, performance, and model reliability enterprise-wide.

Previously, at UST Global (client: T-Mobile), Gangadharan led key modernization efforts, including Kubernetes-based Gen4 migrations and critical database upgrades for T-Mobile’s Customer Hub platform. Prior to that, he had a transformative 11+ year tenure at eBay, where as a Member of Technical Staff II, Core AI, he led the development of scalable ML infrastructure, real-time personalization pipelines, and video commerce platforms. His contributions powered over 30 billion daily inferences and $200M in annual GMV from event-triggered campaigns. Earlier roles at Computer Science Corporation and GE Healthcare saw him building resilient email platforms and clinical decision-support systems, with a strong focus on automation, compliance, and user-centric design.

Gangadharan holds an M.S. in Computer Science from the Georgia Institute of Technology and a B.Tech in Information Technology from the University of Madras. His technical toolkit includes expertise in distributed systems, MLOps, big data pipelines, and cloud-native development using Java, Scala, Python, Kubernetes, and Azure. A recognized leader with a passion for innovation, Gangadharan thrives on solving complex problems, building high-performing teams, and delivering impactful products that scale.
Many hyper-growth startups hit a point where the current systems just aren’t enough.
Racing toward product–market fit, they skip best practices around observability, monitoring, and alerting—and pay for it later.
This talk is about going from 0 → 1 and protecting your company, team, and customers when the pressure mounts.
I’ll cover:
Getting started from the ground floor
Foundational work, the three pillars of observability, and—more importantly—how to get hands-on with all three.
This won’t be another high-level “logs/traces/metrics” sermon; we’ll actually use them.
Building your first monitor, alert, and dashboard
Move to offense: catch full outages, errors, latency spikes, and change alerts before your customers do.
We’ll touch more than just the “golden signals”; a minimal instrumentation sketch follows this outline.
Iterating
Going from 0 → 1 is only the beginning. You’ll need to tune false alarms, coach engineers on response, add new metrics, and prune stale ones—plenty of hidden gotchas here.
Teams often get bogged down by dogma and decision paralysis.
I’ll share tactics for keeping a bias toward action and steadily moving the reliability needle.
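The sketch mentioned above, as one minimal starting point: it exposes request count and latency from a Python service with the prometheus_client library, and the trailing comment shows the kind of PromQL expression a first error-rate alert might use. Metric names, the demo loop, and the threshold are illustrative assumptions, and an APM agent or any other metrics library would work just as well.

# Minimal instrumentation sketch (illustrative names and thresholds): expose
# request count and latency so a first monitor, alert, and dashboard can be
# built on top of them.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency")

@LATENCY.time()
def handle_request():
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
    status = "500" if random.random() < 0.02 else "200"
    REQUESTS.labels(status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # scrape target at :8000/metrics
    while True:
        handle_request()

# A first alert could then fire on an expression such as:
#   sum(rate(http_requests_total{status=~"5.."}[5m]))
#     / sum(rate(http_requests_total[5m])) > 0.05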
No sales pitch. I may demo with an APM tool (no affiliations) purely to illustrate what’s possible.
Aditya Bansal is a Staff Engineer at Cortex, an Internal Developer Portal that helps engineering teams catalog their services. He joined as the company’s first employee over four years ago and has since grown into his current Staff Engineer role.
Aditya began his career at Poynt, helping scale the engineering team from 15 to 60+ people. There, he built and maintained much of the company-wide infrastructure and watched the platform make the classic transition from a monolith to microservices.
He later joined Curebase as one of the earliest hires, working directly with the founders to build the engineering organisation from scratch before moving on to Cortex.
Credentials allow human-to-machine and machine-to-machine communication. According to CyberArk's recent research, 93% of organizations had two or more identity-related breaches in the past year. It is clear that we need to address this growing issue. Unfortunately, many organizations are OK with using plaintext credentials, which we should all know not to do by now. Given the scope of the problem, what can we do? Let's make a plan!
Secrets Detection
Secrets Management
Developer Workflows
Secrets Scanning
Automatic Rotation
By the end of this session, you should have a clear roadmap for taming the machine identity mess in your code and pipelines.
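As a concrete starting point for the scanning step, here is a minimal sketch of a regex-based secrets check that could run as a pre-commit hook or CI job. The patterns and file handling are illustrative assumptions; dedicated scanners ship far larger rule sets plus entropy checks and should be preferred in practice.

# Minimal secrets-scanning sketch for a pre-commit hook or CI step.
# The regex patterns are illustrative; production scanners use much
# larger rule sets and entropy-based detection.
import re
import sys
from pathlib import Path

PATTERNS = {
    "AWS access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "Private key header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "Generic assignment": re.compile(
        r"(?i)(password|secret|api[_-]?key)\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}

def scan(paths):
    findings = []
    for path in paths:
        text = Path(path).read_text(errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    findings.append(f"{path}:{lineno}: possible {name}")
    return findings

if __name__ == "__main__":
    hits = scan(sys.argv[1:])       # e.g. the files staged for commit
    for hit in hits:
        print(hit)
    sys.exit(1 if hits else 0)      # non-zero exit fails the commit/CI job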
Dwayne has been working as a Developer Advocate since 2014 and has been involved in tech communities since 2005. His entire mission is to “help people figure stuff out.” He loves sharing his knowledge, and he has done so by giving talks at hundreds of events worldwide. He has been fortunate enough to speak at institutions like MIT and Stanford and internationally in Paris and Iceland. Dwayne currently lives in Chicago. Outside of tech, he loves karaoke, live music, and crochet.
The race to the cloud is on, with enterprises everywhere migrating core infrastructure to stay competitive and cost-effective. But when it comes to the messaging systems that power cross-component communication, a simple "lift and shift" isn't adequate and can be a recipe for failure. The migration path is riddled with complex decisions and design pitfalls unique to every use case. In this session, AWS Cloud Support expert Tom will walk you through the critical stages of rehosting, replatforming, and refactoring, showing you how to unlock maximum performance and reliability for messaging systems. Additionally, Tom will compare traditional message brokers with more modern serverless messaging services on AWS. By the end of the session, you will have a much more comprehensive understanding of the migration process, the key questions to ask, and best practices for harnessing the benefits of the cloud.
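To ground the broker-versus-serverless comparison, here is a minimal boto3 sketch of producing to and consuming from an Amazon SQS queue. The queue URL is a placeholder, and retries, batching, and dead-letter queues are deliberately omitted; this is an illustration of the serverless messaging model, not a migration recipe.

# Minimal Amazon SQS sketch with boto3: send one message, then poll for it.
# The queue URL is a placeholder; retries, batching, and DLQs are omitted.
import json

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

# Producer side
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"order_id": 42}))

# Consumer side: long polling, then delete once processed
response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=10)
for message in response.get("Messages", []):
    print("received:", message["Body"])
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])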
Tom is a Cloud Support Engineer at AWS with five years of dedicated experience. He has gained extensive expertise in cloud-based messaging systems by guiding hundreds of customers through migrating their on-premises systems to the cloud and troubleshooting any issues that arise.