SREDAY

Site Reliability, DevOps and Cloud

September 18-19, 2025 London, UK

2 Days
50+ Speakers
6 Tracks
200 Attendees

Tickets

Schedule

Day 1

Keynote: TBD

SRE Author


Keynote: What Observability Can Learn From BI: Decoupling for Speed, Scale, and Flexibility

Imply
Today’s observability platforms are often vertically integrated—binding data storage, query, and visualization layers into a single stack. This tight coupling drives up costs, makes integrations painful, and slows teams down. But it doesn’t have to be this way. In this talk, we’ll explore how SRE...

Keynote: Strategic AI Agents for Kubernetes SRE: Beyond Ad-hoc Prompting

Komodor
While LLMs make AI agent creation seem simple, building reliable production solutions requires structured methodology. This talk shares key insights from my year-long project developing an AI-powered root cause analysis system for Kubernetes clusters. I'll present three strategies that...

10:30

Coffee break

Main lobby


Optimizing Kubernetes and unlock blockers with Pod Live Migration

CAST AI
Kubernetes is easy, isn't it? Creating a Kubernetes cluster in a public cloud takes a few minutes, and deploying an application is just as quick. But how can we make sure that our cluster scales efficiently? And what can we do with workloads that are not really a good fit for Kubernetes?

A DBA's Surprising Journey To Databases On Kubernetes

EverythingDevOps & ING
If you ask 10 DBAs at a conference about putting data on Kubernetes, most will say that’s a bad idea – Divine. Divine is an advocate for data on Kubernetes, and a Data on Kubernetes Ambassador. Shivadeep, on the other hand, as an Oracle DBA, traditionally believed that containers and...

Automagic Observability with eBPF

Grafana Labs
Instrumenting legacy or closed-source applications can be a pain. But it doesn't have to be! This talk offers an introduction to the open source Beyla tool and shows you how to use it to instrument different types of applications without touching a single line of code. Beyla hooks...

Beyond Single-Cloud SRE: Live Multi-Agent Incident Investigation and Resolution Across Azure and Beyond

Microsoft & Neubird.ai
Watch AI agents autonomously solve production crises! Azure SRE Agent and Hawkeye collaborate to investigate a critical error, diving into Azure and multi-cloud telemetry. See real-time coordination as agents leverage source code to deploy fixes, demonstrating the future of cross-cloud SRE.

13:00

Lunch & networking

Main lobby


14:00

Beyond 100 Petabytes: Why We Built a Custom Exporter to Replace Our OTel Pipeline

ClickHouse
Are your observability signals trapped in separate pillars? Logs in one place, metrics in another, both losing context? At ClickHouse, we faced this challenge at a massive scale. Our solution was to abandon the traditional model and embrace a new philosophy: store everything, aggregate nothing....

14:30

The Airgap Paradox: Adversarial Tactics & Defensible Design

Blis
Air-gapped systems are seen as the pinnacle of security, but are they truly untouchable? This talk explores real-world breaches—from Stuxnet to electromagnetic attacks—highlighting modern threats like supply chain risks and social engineering. Attendees will learn practical strategies to...

15:30

Wrap up

Scan each other's QR codes & head to a nearby pub!


10:30

Coffee break

Main lobby


Overcoming the Fallacies of Distributed Systems with Chaos Mesh and Kubernetes

TeraSky
Platform engineering must embrace resilience, not just scale. This talk shows how Kubernetes, ArgoCD, Chaos Mesh, and Pixie, using eBPF, enable real-time observability, chaos testing, and automated deployments, making platforms stable, chaos-ready, and easy to manage.

11:30

Breaking Bad (Systems): a short journey through chaos engineering

GlobalLogic
The shift from traditional testing to chaos engineering marks a revolution in building reliable systems. This session unpacks the concept’s history and role in ensuring system resilience. We’ll look at some of the approaches to chaos engineering, before looking at Chaos Engineering as a Service...

Making AI Accountable: Evaluators in Action with Azure AI Foundry

Microsoft
As AI systems scale in complexity and impact, observability becomes essential—not just for performance monitoring, but for ensuring quality, safety, and trust. In Azure AI Foundry, evaluators are the backbone of this observability layer, enabling continuous, automated assessments across the AI...

Nanakorobi yaoki: Learnings from Attacking and Defending GPUs

Bitso
The main focus here is to truly teach the audience the mindset of attack/defence, especially in an environment as underexplored as GPUs and HPC as a whole. We will present real-life examples and have a small ecosystem in the cloud to demonstrate and raise awareness about this crucial topic. We...

13:00

Lunch & networking

Main lobby


14:00

From Blind Spots to Total Vision: Observability at Massive Scale

AWS
In today's digital space, downtime adversely impacts customer trust and can lead to lost revenue. This session will show you how to rethink your infrastructure monitoring with logs, metrics, and traces for complete visibility. By utilizing better practices, applying machine learning for anomaly...

14:30

Kepler - Optimized power usage monitoring for Kubernetes

Okta
Sustainability is increasingly becoming a priority in the Information Technology sector, which has fueled the demand for energy-efficient solutions in all computing environments. Effective management of resources in Kubernetes environments requires proper monitoring and optimisation of power...

15:00

Drowning in Observability Costs? Build a Cost-Aware Telemetry Pipeline to Keep You Afloat ft. OpenTelemetry

Independent
Observability is the cornerstone of reliable systems. It lets teams identify and resolve issues before they impact a broader group of users. Yet building an ideal observability stack is far from easy. It demands time and effort, instrumenting every app, service, and component that emits...

16:00

Wrap up

Scan each other's QR codes & head to a nearby pub!


10:30

Coffee break

Main lobby


11:00

Beginner's Guide to Terragrunt

Sufle
If you've worked with Terraform in production, you've likely encountered the pain of managing multiple environments, duplicate configuration files, and complex remote state setups. Terragrunt solves these common Infrastructure as Code challenges by providing a thin wrapper around Terraform that...

Why internships and entry-level roles matter in Tech

Teesside University London
In the age of AI, efficiency and budget cuts, the idea of hosting internships or early-career grads may seem redundant - but this rhetoric couldn't be more misplaced! This session aims to demystify working with students, graduates and educational institutions. Instead, come prepared to...

Who Owns Your SRE Stack? The Rise of AI Agents

Independent
SREs are the guardians of reliability — we build for failover, redundancy, and scale. But in today’s AI-native systems, there’s a quiet shift happening: automation scripts, AI-driven agents, and machine identities are running critical operations with increasing autonomy. They auto-remediate...

DORA Metrics Decoded: The Science of High-Performing Teams

GlobalLogic
Most teams fail to optimize their SDLC because they ignore the four key metrics that drive speed and stability: Deployment Frequency, Lead Time, Change Failure Rate, and MTTR. This talk will reveal how to leverage DORA metrics for measurable SDLC improvements and deliver better software, faster.

13:00

Lunch & networking

Main lobby


From Git Push to Exit

countX
A real-world case study of a B2B fintech startup, powered by a lean team and a fully serverless, AWS-native architecture. Continuous Deployment was implemented from day one, enabling dozens of automated releases per week with minimal overhead. This delivery setup was tightly integrated with...

Empowering Developers through Open-Source AI

SambaNova Systems
In the ever-evolving AI landscape, organizations are faced with the choice between open-source and closed-source models. While many developers find it easier to get started in the closed-source ecosystem, they quickly realize it is ultimately more expensive, inefficient, and doesn’t have the...

15:00

Automating Kibana Alerting: When GitOps meets UI

Auto1 Group
Managing alerts across hundreds of services can quickly become a challenge: fragmented workflows, inconsistent configurations, and high operational costs. In this session, we will share how we streamlined alerting by adopting Kibana Alerts and building bidirectional automation on top of it,...

15 NGINX Metrics to Monitor

NGINX
Let's take a look at some of the basic NGINX metrics to monitor and what they indicate. We start with the application layer and move down through process, server, hosting provider, external services, and user activity. With these metrics, you get coverage for active and incipient problems with NGINX.

16:30

Cutting Through Metrics Cardinality Noise with VictoriaMetrics

VictoriaMetrics
In high-scale environments, metrics cardinality isn’t just a resource concern, it’s an architectural challenge. Left unchecked, it can impact performance, query latency, and even system stability. This talk takes a deep technical dive into how VictoriaMetrics enables advanced observability...

17:00

Wrap up

Scan each other's QR codes & head to a nearby pub!


Day 2

09:00

Coffee break

Main lobby


Driving Platform Adoption with Embedded SREs

Chainalysis & Rootly
Platform engineers work hard to build great tooling and automations for developers, but often struggle to get feature teams to adopt the platform to its full potential. Meanwhile, SREs are buried in incident firefighting and can’t keep up with onboarding new services or proactive reliability...

10 Common Things People Do Wrong in Kubernetes Environments

CAST AI
In this talk, we’ll explore ten frequent mistakes developers and operators make when working with Kubernetes. From misconfigured resources and insecure deployments to overlooked observability and poor scaling practices — this session will highlight real-world pitfalls and offer practical advice...

OTel You It's Not Just for Backend!

Elastic
Observability is the ability to measure the current state of a system. SREs are familiar with OpenTelemetry signals and how they can be used to diagnose issues. Yet in the frontend world, we're behind the curve. Join me as I dive into the current state of OpenTelemetry for Web. I'll cover a...

11:00

Platform Engineering and AI - Two Buzzwords Finally Meet!

Tanzu
How should we manage AI in large organizations? What are the "services" developers need to add AI to enterprise apps, and what role do platform engineers take? Where do data scientists fit in? How do MCP, A2A, and whatever the latest AI API is fit in? There are so...

11:30

Lunch & networking

Main lobby


12:30

Move from Traditional Ops to Integrated SRE Ops at a Retail organization

H&M Group
Our traditional operations setup supported monolithic systems, and as the tech implementation scaled, the cost of supporting and maintaining those systems rose proportionately. This was primarily due to segregation in the organizational setup for delivering business outcomes. While one part...

13:00

Surviving Failure with Confidence: Observability and Chaos Testing in OceanBase Database

OceanBase
In large-scale distributed systems, failure is not a matter of if, but when. What sets a reliable platform apart is its ability to detect, isolate, and recover from these failures — without compromising performance or consistency. In this session, we’ll dive into how OceanBase, a distributed...

When PaaS isn’t enough: Building your own internal platform in Multi Cloud

Ministry of Housing, Communities and Local Government
Multi-cloud adoption breaks the one-size-fits-all promise of traditional PaaS. This talk dives into how we built an internal developer platform that balances flexibility, control, and speed across clouds.

Aggregating Metrics In-Flight: Challenges and Opportunities

VictoriaMetrics
One of the common practices for improving query speed in Prometheus is to create recording rules for commonly used queries. While this usually works great, recording rules have a cost: the raw metrics still need to be stored in Prometheus even if we don't need them, and recording rules need to...

Inference is the New Exfil: How Cloud AI Leaks What It Learns

Your AI might be the biggest insider threat you’ve ever deployed without even knowing it. As large language models (LLMs) become embedded in cloud-native apps and infrastructure, a new kind of risk is emerging: data leakage through inference. With the right prompt, an attacker can extract...

15:30

Wrap up

Scan each other's QR codes & head to a nearby pub!


09:00

Coffee break

Main lobby


From experimentation to continuous verification: how to benefit from the entire spectrum of Chaos Engineering

Datadog
Chaos Engineering is often misunderstood as simply “breaking things on purpose.” This talk challenges that perception and repositions Chaos Engineering as a critical pillar of reliability and resilience engineering. Rather than focusing on failure injection alone, we explore how to leverage...

10:00

Echoes in the Core: Designing Resilient Platforms

Mirantis
In complex platform environments, incidents don’t always begin with loud failures; they start with subtle drifts, degraded assumptions, and silent breaks in mental models. This talk tells the story of a narrative-driven coordination system inspired by Resilience Engineering principles, and how we...

Reliability at the Edge: Building SRE Culture in Resource-Constrained Environments

Payble Technologies
How do you build reliable systems where downtime is not just an inconvenience, but a threat to trust, income, and even safety? In this talk, Roosevelt Elias, founder of Payble, explores how to establish SRE principles in markets where infrastructure is unreliable, cloud access is intermittent,...

Going Beyond Code: Igniting Sociotechnical Systems through Reliability Advocacy

ING
In our world where everything is code, reliability extends beyond clean and reliable code running on the right infrastructure. It requires a robust sociotechnical system, the dynamic interplay between social and technical components. Our North Star is an engineering culture, built on shared...

11:30

Lunch & networking

Main lobby


12:30

Scale or Fail as Spotify's Growth Exposed the Abstraction Paradox

Spotify
When Spotify scaled from millions to hundreds of millions of users, we discovered that our carefully crafted abstractions—designed to simplify our systems—had become our biggest operational liability. New engineers could ship features but couldn't debug failures. Our beautiful, clean interfaces...

13:00

State in a stateless world: data serving in the cloud

Riskified
The promise of cloud computing—elasticity, agility, and reduced operational overhead—often hinges on the concept of stateless application design. Yet, every meaningful application relies on persistent data, creating a fundamental tension: how do you manage and serve "state" effectively in a...

Using OpenTelemetry to Improve Service Level Objectives

Coralogix
As systems grow more complex, engineering teams must move beyond guesswork when resolving production issues and instead focus on delivering consistent, measurable user value. This talk explores how OpenTelemetry empowers teams to define, measure and improve Service Level Objectives with precision...

14:30

Wrap up

Scan each other's QR codes & head to a nearby pub!


09:00

Coffee break

Main lobby


09:30

There's no AI without APIs

Postman
We’re in the midst of an AI revolution—and APIs are its unsung heroes. While LLMs and AI agents grab headlines, it's APIs that power their ability. Behind every AI-generated insight, recommendation, or automated task is an API call connecting the model to the tools, services, and data it needs to...

Flexible Kubernetes Multitenancy with vCluster

vCluster Labs
In this session, we'll explore how to implement flexible multitenancy in Kubernetes using vCluster. You'll learn how to design platforms that can adapt to different isolation requirements, resource sharing needs, and trust models within a single cluster. We'll explore: Multitenancy...

Beyond the Pipeline: Getting Started—and Thriving—in DevSecOps

Seccl
DevSecOps success isn’t just about picking the right tools—it’s about driving change across people, process, and technology. In this talk, we’ll go beyond the buzzwords to explore what it truly takes to embed security into modern software delivery. We’ll start by demystifying the DevSecOps...

11:00

Observability Made Simple: Monitoring Plants with IoT and Grafana

Grafana Labs
Have you ever wondered what observability really means and how it can be useful, even if you’re just starting out? In this session, we’ll explore the basics of observability through a fun, relatable project: monitoring the health of plants using simple IoT sensors and Grafana dashboards. Using a...

11:30

Lunch & networking

Main lobby


TLA+: Your Secret Weapon Against Concurrency Hell

OSInet
Let's be honest: writing correct concurrent and distributed code is hard. Race conditions? Deadlocks? They're the bane of modern software development, especially with tools like Kafka, Kubernetes, and microservices. Testing helps, but it rarely finds all the subtle, timing-dependent bugs. Imagine...

How We Built ClickStack - an open source, open telemetry native Observability stack

ClickHouse
Modern observability is built on a flawed foundation: three siloed pillars - logs, metrics, and traces - each powered by different engines with separate query models, storage formats, and operational costs. Users are forced to manually correlate across systems, accept duplication, or pay high...

14:00

Wrap up

Scan each other's QR codes & head to a nearby pub!


Speakers

Alecia Cotterell (Teesside University London)
Amit Kushwaha (SambaNova Systems)
Andrei Pokhilko (Komodor)
Anthony Ekpechue (GlobalLogic)
Ashish Upadhyay (Ministry of Housing, Communities and Local Government)
Ayd Asraf (Auto1 Group)
Bharav Patel (AWS)
Carly Richmond (Elastic)
Ceyda Duzgec (Sufle)
Cynthia Akiotu (Independent)
Dale McDiarmid (ClickHouse)
Dave McAllister (NGINX)
Dheeraj Bandaru & Francois Martel (Microsoft & Neubird.ai)
Diana Todea (VictoriaMetrics)
Dima Malyshenko (countX)
Divine Odazie & Shivadeep Gundoju (EverythingDevOps & ING)
Dominik Süß (Grafana Labs)
Frederic G. Marand (OSInet)
Harel Safra (Riskified)
Joris Bonnefoy (Datadog)
Kim-Norman Sahm (CAST AI)
Leena Mooneeram & Jorge Lainfiesta (Chainalysis & Rootly)
Marcus Tenorio (Bitso)
Marie Cruz (Grafana Labs)
Martin McLarnon (Coralogix)
Mayank Goyal (Okta)
Meletius Mgbeodichimma Igbokwe
Michael Cote (Tanzu)
Miko Pawlikowski (SRE Author)
Mitul Jain (H&M Group)
Peng Wang (OceanBase)
Peter Marshall (Imply)
Piotr Zaniewski (vCluster Labs)
Pooja Mistry (Postman)
Prithvi Raj (Mirantis)
Roman Khavronenko (VictoriaMetrics)
Roosevelt Elias (Payble Technologies)
Rory Crispin (ClickHouse)
Scott Rosenberg (TeraSky)
Sean Behan (Blis)
Sebastian Coles (Seccl)
Simon Hanmer (GlobalLogic)
Stephan Mousset (ING)
Stratos Kourtzanidis (Microsoft)
Stuart Clark (Spotify)
Yash Verma (Independent)

Venue

Everyman Canary Wharf

Crossrail Place,
Canary Wharf,
E14 5AR, London, UK
Level -2

Tube access
Jubilee, Elizabeth and DLR lines: Canary Wharf station

Sponsors & Partners

Want to become a sponsor? Get in touch!