SREDAY

Site Reliability, DevOps and Cloud

October 3, 2025 San Francisco, CA, USA

1 day, 16+ speakers, 2 tracks, 100 attendees

Schedule

Day 1

09:30

Keynote: What Observability Can Learn From BI: Decoupling for Speed, Scale, and Flexibility

Imply
Today’s observability platforms are often vertically integrated—binding data storage, query, and visualization layers into a single stack. This tight coupling drives up costs, makes integrations painful, and slows teams down. But it doesn’t have to be this way. In this talk, we’ll explore how SRE...

10:00

Coffee break

Main lobby


10:30

Container Live Migration in Kubernetes: Why and How

CAST AI
What if Kubernetes could move a running container to another node without a single second of downtime? In this session, we’ll dive into Container Live Migration—a game-changing capability that brings a new level of availability to your workloads. Say goodbye to disruption from Spot instance...

“We’re Down!” to “We’re Good.” — Shipping observability in 2 weeks

Cortex
**Many hyper-growth startups hit a point where the current systems just aren’t enough.** Racing toward product–market fit, they skip best practices around observability, monitoring, and alerting—and pay for it later. This talk is about going from **0 → 1** and protecting your company, team, and...

Building Self-Healing Data Pipelines: How Reinforcement Learning Reduces Operational Overhead While Improving Performance

Nike
Site Reliability Engineers face escalating challenges managing data pipelines that must adapt to dynamic workloads, handle traffic spikes, and maintain high availability across distributed cloud environments. Traditional static optimization and rule-based approaches fall short when dealing with...

12:00

Lunch & networking

Main lobby


13:00

Context Engineering in Observability: The next SRE Superpower

CardinalHQ
As AI systems grow more capable, their usefulness hinges not just on what they know, but on what context they understand. In this talk, Ruchir Jha, CEO of Cardinal, an AI-powered observability company, unpacks the emerging discipline of context engineering: the art and science of feeding AI the...

13:30

10 Billion Downloads: Insights and Trends in Open Source

Scarf
In this talk, we share up-to-date results and fresh insights from an in-depth analysis of data gathered from over 10 billion events across thousands of projects. The analysis reveals a clearer view of the latest emerging trends. Our findings offer valuable insights into user behaviors...

Transform chaos experiments into actionable insights using generative AI

AWS
Tired of manual chaos experiment analysis? Discover how to leverage generative AI to analyze test results and validate experiment hypotheses. Learn to integrate Amazon Bedrock with AWS FIS to transform your chaos engineering experiments and game days into efficient, data-driven exercises that...

14:30

Beyond 100 Petabytes: Why We Built a Custom Exporter to Replace Our OTel Pipeline

ClickHouse
Are your observability signals trapped in separate pillars? Logs in one place, metrics in another, both losing context? At ClickHouse, we faced this challenge at a massive scale. Our solution was to abandon the traditional model and embrace a new philosophy: store everything, aggregate nothing....

Secrets Security End-To-End

GitGuardian
Credentials allow human-to-machine and machine-to-machine communication. According to CyberArk's recent research, 93% of organizations had two or more identity-related breaches in the past year. It is clear that we need to address this growing issue. Unfortunately, many organizations are OK with...

16:30

Wrap up

Scan each other's QR codes & head to a nearby pub!


10:00

Coffee break

Main lobby


10:30

Data Lakehouse Architecture: Reducing Operational Complexity for SRE Teams

Microsoft
Modern enterprise data infrastructure creates significant operational overhead for SRE teams, with organizations spending the majority of their engineering cycles managing ETL pipelines, data replication, and maintaining multiple storage systems across data warehouses, lakes, and specialized...

The Human Factor in Site Reliability: Designing Automation That Amplifies Engineering

SiriusXM Radio
As automation sophistication increases across SRE practices, organizations face a critical inflection point: whether to pursue lights-out operations or embrace human-centered reliability engineering that delivers measurably superior outcomes. This presentation reveals how leading tech...

11:30

Migration from On-Prem Messaging System to The Cloud: What, How and Why

AWS
The race to the cloud is on, with enterprises everywhere migrating core infrastructure to stay competitive and cost-effective. But when it comes to the messaging systems that power cross-component communications, a simple "lift and shift" isn't adequate and can be a recipe for failure. The...

12:00

Lunch & networking

Main lobby


From Dashboard to Defense: Automating Resilience at Large Scale

eBay
Modern production systems can no longer rely on static dashboards and reactive on-call rotations to ensure uptime. At large scale — with billions of requests flowing through mission-critical services — reliability must be engineered into the system through autonomous detection, mitigation, and...

Building Bulletproof ML Inference Platforms: SRE Principles for Real-Time AI at Scale

Starbucks
Real-time machine learning inference platforms present unique SRE challenges that traditional monitoring and reliability practices often can't address. This talk provides a comprehensive framework for applying SRE principles to ML inference systems, drawing from hands-on experience scaling...

14:00

Bulletproofing Trillion-Parameter Training: SRE Strategies for Ultra-Large AI Infrastructure at Scale

Meta
Training trillion-parameter language models presents unique site reliability challenges that dwarf traditional distributed systems complexity. With training costs exceeding millions of dollars and runs spanning months across thousands of GPUs, even minor infrastructure failures can result in...

14:30

15:00

Microfrontend Reliability: SRE Strategies for Distributed Frontend Systems

Castlight Health
Microfrontend architectures with Module Federation introduce distributed system complexity to frontend applications, creating new reliability challenges that traditional SRE practices must adapt to address. This talk explores how to apply Site Reliability Engineering principles to microfrontend...

15:30

How We Built ClickStack - an open source, OpenTelemetry-native Observability stack

ClickHouse
Modern observability is built on a flawed foundation: three siloed pillars - logs, metrics, and traces - each powered by different engines with separate query models, storage formats, and operational costs. Users are forced to manually correlate across systems, accept duplication, or pay high...

16:00

Shadow Dependencies - The Rising Role (Risk?) of Data

Gable
Some of the largest outages on the internet can be traced back not only to changes in code, but also to how the code changed underlying data models. Through countless discussions with software engineers, many noted the importance of the underlying data model for quality development, yet also...

17:00

Wrap up

Scan each other's QR codes & head to a nearby pub!


Speakers

Aditya Bansal, Cortex
Anjan Dash, Meta
Avi Press, Scarf
Deepika Annam, Nike
Dwayne McDaniel, GitGuardian
Gangadharan Venkataraman, Starbucks
Gian Merlino, Imply
Jimmy Katiyar, SiriusXM Radio
Justin Davis, Castlight Health
Mark Freeman, Gable
Matt Schillerstrom, Harness
Mike Shi, ClickHouse
Piyush Dubey, Microsoft
Ran Tao, AWS
Ruchir Jha, CardinalHQ
Saurabh Kumar & Ruskin Dantra, AWS
Steve Poyer, CAST AI
Sureshkumar Karuppuchamy, eBay
Vlad Seliverstov, ClickHouse

Venue

The offices of Harness.io

55 Stockton St, San Francisco,
CA 94108, United States

Sponsors & Partners

Want to become a sponsor? Get in touch!