SREDAY

Site Reliability, DevOps and Cloud

October 3, 2025, San Francisco, CA, USA

1 Day
16+ Speakers
2 Tracks
100 Attendees

Schedule

Day 1

09:30

Coffee break

Main lobby


10:00

Container Live Migration in Kubernetes: Why and How

CAST AI
What if Kubernetes could move a running container to another node without a single second of downtime? In this session, we’ll dive into Container Live Migration—a game-changing capability that brings a new level of availability to your workloads. Say goodbye to disruption from Spot instance...

“We’re Down!” to “We’re Good.” — Shipping observability in 2 weeks

Cortex
**Many hyper-growth startups hit a point where the current systems just aren’t enough.** Racing toward product–market fit, they skip best practices around observability, monitoring, and alerting—and pay for it later. This talk is about going from **0 → 1** and protecting your company, team, and...

Building Self-Healing Data Pipelines: How Reinforcement Learning Reduces Operational Overhead While Improving Performance

Nike
Site Reliability Engineers face escalating challenges managing data pipelines that must adapt to dynamic workloads, handle traffic spikes, and maintain high availability across distributed cloud environments. Traditional static optimization and rule-based approaches fall short when dealing with...

11:30

Building Bulletproof AI: Site Reliability Engineering for Financial AI Infrastructure at Scale

IMC Trading
As the financial AI market grows rapidly toward a projected $78.6 billion by 2030, Site Reliability Engineers face unprecedented challenges in maintaining high availability for mission-critical AI systems processing millions of transactions daily. This presentation reveals how SRE practices are...

12:00

Lunch & networking

Main lobby


Transform chaos experiments into actionable insights using generative AI

AWS
Tired of manual chaos experiment analysis? Discover how to leverage generative AI to analyze test results and validate experiment hypotheses. Learn to integrate Amazon Bedrock with AWS FIS to transform your chaos engineering experiments and game days into efficient, data-driven exercises that...

Secrets Security End-To-End

GitGuardian
Credentials allow human-to-machine and machine-to-machine communication. According to CyberArk's recent research, 93% of organizations had two or more identity-related breaches in the past year. It is clear that we need to address this growing issue. Unfortunately, many organizations are OK with...

14:00

Data Lakehouse Architecture: Reducing Operational Complexity for SRE Teams

Microsoft
Modern enterprise data infrastructure creates significant operational overhead for SRE teams, with organizations spending the majority of their engineering cycles managing ETL pipelines, data replication, and maintaining multiple storage systems across data warehouses, lakes, and specialized...

Beyond 100 Petabytes: Why We Built a Custom Exporter to Replace Our OTel Pipeline

ClickHouse
Are your observability signals trapped in separate pillars? Logs in one place, metrics in another, both losing context? At ClickHouse, we faced this challenge at a massive scale. Our solution was to abandon the traditional model and embrace a new philosophy: store everything, aggregate nothing....

15:00

15:30

Agentic Access: OAuth Gets You In. Zero Trust Keeps You Safe

Pomerium
AI agents are no longer experimental. Developers are already using them to query APIs, modify content, and chain services using emerging protocols like **MCP (Model Context Protocol)**. The latest MCP specification introduces modern **OAuth 2.1 authentication** and support for **Resource...

16:00

Wrap up

Scan each other's QR codes & head to a nearby pub!


09:30

Coffee break

Main lobby


The Human Factor in Site Reliability: Designing Automation That Amplifies Engineering

SiriusXM Radio
As automation sophistication increases across SRE practices, organizations face a critical inflection point: whether to pursue lights-out operations or embrace human-centered reliability engineering that delivers measurably superior outcomes. This presentation reveals how leading tech...

10:30

Migration from On-Prem Messaging System to The Cloud: What, How and Why

AWS
The race to the cloud is on, with enterprises everywhere migrating core infrastructure to stay competitive and cost effective. But when it comes to the messaging systems that power cross-component communications, a simple "lift and shift" isn't adequate and can be a recipe for failure. The...

From Dashboard to Defense: Automating Resilience at Large Scale

eBay
Modern production systems can no longer rely on static dashboards and reactive on-call rotations to ensure uptime. At large scale — with billions of requests flowing through mission-critical services — reliability must be engineered into the system through autonomous detection, mitigation, and...

Building Bulletproof ML Inference Platforms: SRE Principles for Real-Time AI at Scale

Starbucks
Real-time machine learning inference platforms present unique SRE challenges that traditional monitoring and reliability practices often can't address. This talk provides a comprehensive framework for applying SRE principles to ML inference systems, drawing from hands-on experience scaling...

12:00

Lunch & networking

Main lobby


13:00

Bulletproofing Trillion-Parameter Training: SRE Strategies for Ultra-Large AI Infrastructure at Scale

Meta
Training trillion-parameter language models presents unique site reliability challenges that dwarf traditional distributed systems complexity. With training costs exceeding millions of dollars and runs spanning months across thousands of GPUs, even minor infrastructure failures can result in...

13:30

How We Built ClickStack - an open source, OpenTelemetry-native observability stack

ClickHouse
Modern observability is built on a flawed foundation: three siloed pillars - logs, metrics, and traces - each powered by different engines with separate query models, storage formats, and operational costs. Users are forced to manually correlate across systems, accept duplication, or pay high...

14:00

Microfrontend Reliability: SRE Strategies for Distributed Frontend Systems

Castlight Health
Microfrontend architectures with Module Federation introduce distributed system complexity to frontend applications, creating new reliability challenges that traditional SRE practices must adapt to address. This talk explores how to apply Site Reliability Engineering principles to microfrontend...

14:30

Cloud-Native SRE: Scaling Reliable Insurance Platforms with Duck Creek Technologies

Cognizant
The insurance technology landscape demands exceptional reliability as critical Policy Administration Systems handle millions of daily transactions. This presentation explores how implementing cloud-native SRE practices transformed traditional Duck Creek environments into resilient, scalable...

15:00

15:30

Shadow Dependencies - The Rising Role (Risk?) of Data

Gable
Some of the largest outages on the internet can be traced back not only to changes in code, but also to how the code changed underlying data models. In countless discussions, software engineers noted the importance of the underlying data model for quality development, yet also...

16:00

Wrap up

Scan each other's QR codes & head to a nearby pub!


Speakers

Aditya Bansal, Cortex
Anjan Dash, Meta
Deepika Annam, Nike
Dwayne McDaniel, GitGuardian
Gangadharan Venkataraman, Starbucks
Jimmy Katiyar, SiriusXM Radio
Justin Davis, Castlight Health
Mark Freeman, Gable
Matt Schillerstrom, Harness
Mike Shi, ClickHouse
Nick Taylor, Pomerium
Parul Purwar, IMC Trading
Piyush Dubey, Microsoft
Ran Tao, AWS
Sameer Joshi, Cognizant
Saurabh Kumar & Ruskin Dantra, AWS
Steve Poyer, CAST AI
Sureshkumar Karuppuchamy, eBay
Vlad Seliverstov, ClickHouse

Venue

The offices of Harness.io

55 Stockton St, San Francisco,
CA 94108, United States
