Crossrail Place,
Canary Wharf,
E14 5AR, London, UK
Level -2
Tube access
Jubilee and Elizabeth lines and the DLR: Canary Wharf station
“Platform engineering” is the art of building and managing the infrastructure that powers your applications: a mix of cloud, a handful of DevOps, a pinch of SRE, and a thick glaze of product management. While it’s “nothing new,” many organizations are just starting to practice it—and for good reason. But what happens when your platform runs on a private cloud?
With around 50% of enterprise apps still running on private clouds, platform engineering for private platforms is surprisingly under-discussed. This talk dives into real-world examples and stories from large organizations tackling this challenge. The hurdles often lie in adapting platform engineering to existing IT stacks and processes—most organizations can’t simply start from scratch, nor would they want to abandon what’s currently driving revenue.
If you support your organization’s apps, platform engineering is something you’ll probably be doing soon. Come learn how your peers are navigating these challenges and share your own experiences.
Michael Cote studies how large organizations get better at building software to run better and grow their business. His books Changing Mindsets, Monolithic Transformation, and The Business Bottleneck cover these topics. He's been an industry analyst at RedMonk and 451 Research, done corporate strategy and M&A, and was a programmer. He also co-hosts several podcasts, including Software Defined Talk. His daily-ish newsletter is at newsletter.cote.io.
In this talk, we’ll explore how we revolutionized our SLO practices by introducing User Objectives—customer-experience-focused metrics that transcend individual services. This approach transformed our SRE function from a traditional embedded model to a centralized Application SRE team, fostering collaboration with product and engineering teams across the organization.
We'll share how we:
- Collaboratively defined User Objectives with PMs and Engineering Leaders.
- Mapped dependencies across services and datastores to create a robust dependency graph.
- Built SLOs centered on User Objectives, with secondary metrics for individual services.
- Established effective processes like weekly SLO reviews for product teams and monthly Production Reviews with senior leaders.
- Introduced meaningful alerting using error budgets and burn rates (a brief illustrative sketch follows this list).
- Developed an SLO framework that automates dashboards, monitors, and metrics.
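To make the error-budget and burn-rate item concrete, here is a minimal, hedged sketch of the underlying arithmetic; the numbers, window choices, and paging threshold are illustrative assumptions, not the framework described in this talk.

```python
# Illustrative burn-rate calculation for an availability SLO.
# Numbers, windows, and the paging threshold are hypothetical examples.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent relative to plan.

    error_ratio: fraction of failed requests in the observed window.
    slo_target:  e.g. 0.999 leaves an error budget of 0.1%.
    """
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget

# A 99.9% SLO with 0.5% of requests failing burns budget 5x faster than
# sustainable. Multi-window alerting pages only when both a long and a
# short window burn fast, which filters out brief blips.
long_window = burn_rate(error_ratio=0.005, slo_target=0.999)   # 5.0
short_window = burn_rate(error_ratio=0.006, slo_target=0.999)  # 6.0
PAGE_THRESHOLD = 14.4  # a threshold commonly cited for a 1h/5m window pair

print(long_window > PAGE_THRESHOLD and short_window > PAGE_THRESHOLD)
```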
This evolution redefined SRE’s role in our company, establishing a true partnership with product teams. By creating feedback loops that balance features and stability, we’ve elevated our understanding of product reliability and improved the customer experience.
Tomasz Czajka is a seasoned professional with extensive experience in DevOps, DataOps, and Site Reliability Engineering (SRE). His career spans leadership roles, automation projects, and performance optimization across high-profile organizations like Tenable, Deutsche Bank, and Citrix. Tomasz has led significant projects, including the development of scalable analytics workflows, robust CI/CD pipelines, and performance-enhancing solutions on large-scale infrastructure. His technical expertise includes Python, Bash, Docker, Kubernetes, and tools for hypervisor platforms. Tomasz is also adept at driving collaboration, implementing cutting-edge tools, and improving process efficiencies. With a strong academic foundation in IT from Wrocław University of Science and Technology and Politechnika Opolska, he embodies continuous learning and excellence in engineering practices.
Ciaran Gaffney brings over a decade of diverse experience in Site Reliability Engineering (SRE) and software development. His journey spans key roles at Tenable and Hosted Graphite, where he demonstrated expertise in designing and maintaining distributed systems, enhancing reliability, and implementing innovative solutions such as dynamic load balancing and a gRPC-based aggregation layer. At Hosted Graphite, he was integral in scaling a system that processed 160 billion data points daily while ensuring SLAs were consistently met. Notable achievements include developing Hosted Graphite's alerting feature, creating customer-facing APIs, and leading automation and hardware provisioning efforts. With a strong foundation in Python, DevOps tools, and infrastructure management, Ciaran is a skilled problem solver passionate about system scalability, performance optimization, and customer satisfaction.
Pascal Schlumpf brings extensive expertise in site reliability engineering, software development, and monitoring systems to our event. With experience spanning over a decade in leading organizations such as Tenable and AT&T, Pascal has honed his skills in automation, monitoring integration, and reducing MTTR through innovative solutions. From building exporters for Prometheus and Grafana to integrating systems with AppDynamics and Splunk, Pascal has demonstrated a commitment to advancing observability and system reliability. We're excited to have him share his insights and practical approaches at the conference.
SRE teams often face challenges with a high volume of routine tasks and requests, making it difficult to focus on critical, high-priority issues. At Electrolux, we faced the same challenge, which led us to develop InfraAssistant, a multi-agent AI-powered solution designed to automate key operational tasks such as infrastructure management, user onboarding, and responding to internal requests. This shift reduced our manual workload and significantly improved operational efficiency. InfraAssistant is built on specialized agents that coordinate to autonomously manage complex tasks, reducing the need for continuous manual involvement. This session will cover the design and orchestration of these agents, showcasing how InfraAssistant helps SRE teams by automating day-to-day operations, minimizing repetitive tasks, and enhancing the management of complex infrastructure.
Alina Astapovich is a Site Reliability Engineer with a strong background in IoT development. Holding a master's degree in 'Neurotechnologies and Software Engineering,' she excels in designing and architecting backend platforms and microservices, and she has a strong track record of building scalable and robust solutions. With a Solutions Architect certification, Alina leverages her expertise in cloud technologies to optimize system performance and reliability. She is dedicated to continuous professional growth, always exploring the latest innovations and trends in technology. As a skilled public speaker, Alina regularly shares her insights at industry events and conferences, contributing to the broader tech community.
Markus Makela is a skilled professional with a background in software development and site reliability engineering (SRE), currently working at Electrolux Group in Stockholm. He holds a Civilingenjörsexamen in Engineering Physics from KTH Royal Institute of Technology (2015–2021) and is experienced in machine learning and MLOps, complemented by an AWS Certified Machine Learning Engineer – Associate credential (valid through 2027). He was formerly a Software Developer and Programmer at Prevas AB, delivering innovative solutions in the Stockholm region.
We work in the IoT space at Electrolux Group, a leader in the home appliance industry, where we scaled from 10 to 300 developers with just 5 Ops engineers in 4 years. Along the way, we faced challenges in promoting SRE principles to development teams, which led us to transition from SRE to Platform Engineering. In this talk, we’ll share how we built an Internal Developer Platform (IDP) integrated with our cloud and toolchains, embedding SRE principles into its foundation. This platform enables developers to autonomously create and manage infrastructure and services while adhering to best practices and security standards. This newfound autonomy boosted developer productivity, enabling teams to spin up new regions in a single day rather than relying on approval processes. We'll discuss Electrolux's journey from a traditional SRE model to platform engineering, covering the challenges we faced with our initial SRE model, including scalability and a lack of self-service capabilities for developers. We'll then delve into how we addressed these challenges by automating common requests, moving workloads to SaaS solutions, and building an internal developer platform. Finally, we'll explore how we open-sourced our developer platform, making it available to the wider community.
Kristina Kondrashevich is an SRE Product Manager at Electrolux. Over her career she has been responsible for writing code, managing teams, and improving delivery processes. Today, as a PM, she supports her Platform team in bringing traditional product management practices into their way of working. Her primary objective is to prioritize the satisfaction of developers within her organization, viewing them as consumers and striving to enhance their overall experience. She likes to read and has a profound love for lifelong learning; exploring new ideas, concepts, and perspectives is something that truly ignites her curiosity.
Gang Luo has spent over six years building and leading distributed teams, introducing Chaos Engineering, and adopting cloud-native and SRE best practices to enhance service reliability and scalability. His tenure at Salesforce included designing large-scale automated testing and deployment platforms, saving significant resources while maintaining exceptional system performance and availability. Gang holds a Master's in Software Engineering and Management from Linköping University and has earned multiple awards for his technical contributions and innovation.
The goal of this talk is to show the source of the information many tools use to display process information. I will go through the most interesting files in the /proc filesystem, show what information lives there, and demonstrate the standard tools for displaying it. This comes in handy when you don't have permission to install extra tools, or when the tool you need doesn't exist yet. I will also show the most common use cases: restoring binaries deleted by accident or intentionally, restoring deleted log files, finding disk space leaks, debugging process limits and environment variables, and many more.
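As a taste of the kind of /proc spelunking the talk promises, here is a small sketch of my own (Linux only, with a placeholder PID) that reads a process's environment and limits and flags open file descriptors whose backing file was deleted, which is the trick behind recovering deleted logs and binaries while the process is still running.

```python
# Minimal /proc inspection sketch (Linux only); the PID is a placeholder.
import os

pid = 1234  # hypothetical target process

# Environment variables are NUL-separated in /proc/<pid>/environ.
with open(f"/proc/{pid}/environ", "rb") as f:
    env = [e.decode(errors="replace") for e in f.read().split(b"\0") if e]

# Resource limits are plain text in /proc/<pid>/limits.
with open(f"/proc/{pid}/limits") as f:
    limits = f.read()

# Open file descriptors are symlinks; a "(deleted)" suffix means the file
# is gone from the filesystem but its contents remain reachable through
# the descriptor, e.g. by copying /proc/<pid>/fd/<n> somewhere safe.
fd_dir = f"/proc/{pid}/fd"
for fd in os.listdir(fd_dir):
    target = os.readlink(os.path.join(fd_dir, fd))
    if target.endswith("(deleted)"):
        print(f"fd {fd} -> {target}  (recoverable while the process lives)")

print(env[:3], limits.splitlines()[0], sep="\n")
```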
Aivars Kalvāns is a FinTech developer, software architect, and consultant. He spent more than 18 years at Tieto developing and architecting payment card software for acquiring and issuing, accounting and utility payments through mobile phones, ATMs, and POS terminals. At the moment he is a contractor for Ebury exploring the Foreign Exchange area of the FinTech landscape.
Serverless breaches expose dangerous missteps in securing function chains, IAM policies, and API gateways. We unravel serverless compromises to reveal the overlooked risks lurking in your infrastructureless apps. Arm yourself with actionable lessons to lock down your functions and avoid headlines.
As organizations rapidly adopt serverless, they expose themselves to new risks not secured by traditional controls. Real-world serverless breaches have already exposed overlooked flaws and misconfigurations resulting in catastrophic data theft and service disruption.
In this talk, we conduct a deep forensic analysis of high-profile serverless compromises involving misconfigured S3 buckets, overly permissive functions, vulnerable web frameworks, and common missteps in serverless permissions and secrets management.
Walk away with actionable lessons to avoid becoming the next serverless breach headline. We’ll provide concrete steps to reduce your attack surface, implement least privilege access, monitor anomalous activity, and instill a “secure by default” posture across your infrastructureless apps.
Babar Khan Akhunzada is a 24-year-old cyber wizard and the Founder & CEO of SecurityWall, which builds on AI and big data technology to help enterprises and individuals enhance their security capabilities through capability building, risk management, and hybrid security audits.
Babar has helped numerous enterprises tackle financial cybercrime issues at both the business application and infrastructure scale. These enterprises improved 95% of their security alignment and protected themselves against cybercriminals exploiting zero-paid orders that could have led to a financial crisis.
Babar has been acknowledged by well-known tech companies for contributing to their products' security, including Adobe, eBay, Apple, Nokia, Microsoft, Oracle, Sony, Redhat, Yahoo, DuckDuckGo, StackOverflow, NextCloud, and 100+ more.
Babar has many years of experience in bug bounty hunting and ethical security work, which makes him stand out at such a young age. He has served as a Security Consultant for the Directorate of Information Technology, where he was responsible for the security of the Government Data Center during his tenure.
Recently, Babar was featured in a 25-Under-25 list as a young high achiever. He has attended many conferences and events, including BlackHatMEA, the High Technology Crime Investigation Association, UC EXPO, OWASP Romania, Cyber Security Indonesia, the EC-Council Annual Halted Conference, HITCON Taiwan, and many more.
Pepr simplifies Kubernetes operations by consolidating admission controllers and operators into one lightweight framework. Enforce global security postures, leverage a full-fledged programming language, and offload operational expertise into code. Pepr makes administering Kubernetes clusters easy!
Think of Pepr as a mix between Operator SDK and Kyverno/OPA Gatekeeper. It takes a fresh approach to building Kubernetes controllers with extremely simple interfaces, and it takes no time to get up and running in the cluster, automating away tasks that would have taken hours. The only requirement is a Kubernetes cluster.
Casey is currently a lead software engineer focused on hybrid cloud around Kubernetes. He solves problems in distributed systems and loves chatting and connecting with people.
Today there are thousands of AI and SaaS services out there, and they are used throughout your business. This session will explain the what, why, and how of platform teams productising their external AI and SaaS providers to provide flexibility and control over these external applications.
These AI and SaaS providers are accessed through APIs, and by managing them in a centralised platform you gain the same advantages as managing your internal APIs. These advantages include:
- Tracking spend (and usage) for all external services from one central location
- Protecting against data exposure
- Understanding SaaS-to-SaaS communication
- Consistency
- Minimising security exposures
- Controlled self-service onboarding and credential management
- Centralised caching
This session will go through the what, why, and how of this approach, and how it can be done with tools already available.
As an Integration SME, Chris has travelled the world working with clients on how to utilise APIs and how to ensure they are following the best strategies to support both internal and external consumers.
AI products are becoming critical for businesses to maintain a competitive edge, yet integrating them into an organization’s ecosystem brings unique challenges. Ensuring the reliability, security, and alignment of AI systems with business goals and ethical standards demands new approaches and tools.
This talk explores the concept of AI Reliability Engineering (AIRE), which adapts SRE principles to AI systems. Based on our experience as an AI-based startup building a mentoring platform, we’ll discuss the challenges and solutions encountered when managing LLMs, language-chain tracing, and AI gateways.
Key challenges include:
- Lack of visibility and control over the AI lifecycle: data collection, model deployment, and monitoring.
- Ensuring the quality and robustness of AI models, addressing issues like prompt attacks, data drift, and evolving performance.
- Managing complexity in dependencies, configurations, and resources across environments.
The AIRE approach combines the CNCF ecosystem with established SRE practices to address these challenges. By leveraging tools like OpenInference and AI gateways, AIRE introduces processes that enhance reliability, mitigate risks, and improve the security of AI systems.
A seasoned professional with over 15 years of experience in Software Development, DevOps, Site Reliability Engineering (SRE), AIOps and Kubernetes.
Proven track record in leadership roles, including Technical Manager, Operations Team Lead, and CTO, plus five years as a co-founder of cloud B2B/B2C application projects.
DevOps/SRE/Kubernetes Coach and Public Speaker
Tired of DBaaS lock-in? Divine explores Sovereign DBaaS, a model putting you back in control. Learn how to build your own private DBaaS with Dapr, overcoming compliance and licensing headaches. Gain insights from real-world feedback and design considerations.
DBaaS has existed for over a decade. Since Amazon launched its Relational Database Service (RDS) in 2009, it has become one of its most popular services, and the other cloud providers followed suit. Managed DBaaS services are popular because databases are complex at scale, requiring specialized knowledge, and there is a shortage of experienced DBAs.
Though RDS and the like make managing databases easier, they create new problems ranging from data compliance issues and license instability to vendor lock-in. Hence, sovereignty is needed, not just for the data, but for its stack – infrastructure and tooling.
Divine will start this talk by elaborating on the critical problems with traditional DBaaS today, introducing the concept of Sovereign DBaaS and its benefits.
After that, Divine will discuss the actual building of a private DBaaS using Dapr, from the fundamental points you need to consider to system design considerations and what a provisional architecture will look like when developed.
Finally, Divine will wrap up by sharing feedback on building a private DBaaS from the CNCF DBaaS community and talking to database decision makers in organizations.
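To give a flavour of what building on Dapr can look like in practice (a minimal sketch of my own, not the provisional architecture Divine will present), an application talks to whatever database backs a Dapr state store component through one portable API; the store name and key below are hypothetical, and a running Dapr sidecar is assumed.

```python
# Minimal Dapr state-store sketch; assumes a running Dapr sidecar and a
# configured state store component named "customer-db" (both hypothetical).
from dapr.clients import DaprClient

STORE = "customer-db"  # could be backed by PostgreSQL, Redis, etc.

with DaprClient() as client:
    # Application code stays the same regardless of which database the
    # platform team wires in behind the component definition.
    client.save_state(store_name=STORE, key="customer:42", value='{"plan": "pro"}')
    item = client.get_state(store_name=STORE, key="customer:42")
    print(item.data)  # raw bytes of the stored value
```

The portability of the state API is arguably one reason Dapr is attractive for a sovereign DBaaS: swapping the underlying database becomes a component configuration change rather than an application rewrite.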
Divine Odazie is a Technology Evangelist at Severalnines with over five years of experience in technology and a track record in backend engineering, DevOps, cloud native, and developer relations on a global scale. He has given talks and workshops at developer conferences such as Open Source Summit Europe, KubeCon North America, Cloud Native Rejekts, and the Ansible Contributor Summit.
This session delivers critical insights into leveraging Kubernetes Operators in the data space. I'll cover the Kubernetes operator pattern, KubeBuilder, data governance, how we built our connector ecosystem, and more.
Building a data streaming platform that can scale with your organization while maintaining developer productivity is challenging. In this talk, I'll unpack the architectural decisions and engineering challenges of building a cloud-native streaming platform from scratch using Kubernetes Operators. Through this journey, from initial design to production scale, we'll explore how we transformed the operator pattern into a foundational piece of our data infrastructure, powering thousands of streaming workflows daily. Beyond the technical implementation, I'll share insights about building developer experiences that scale. We will cover the architectural patterns used and the extensible connector ecosystem that enabled our platform to handle millions of daily events without changing its core design.
Elad Leev is a Staff Engineer and open-source enthusiast with over 10 years of experience managing complex production systems. He has expertise in distributed systems, databases, and data streaming technologies. Elad has been using Kafka in production since 2016, and he is a strong advocate for data streaming solutions, particularly technologies like Kafka, Flink, Debezium, and Kafka Connect.
What does it take to design and launch a DevOps platform capable of running alien code—written by both users and AI—fast, securely, and at scale? This talk dives deep into the journey of creating Triform, a platform redefining DevOps for a new era of AI-driven development.
Through real-world insights, you’ll learn how we transitioned from concept to launch, crafting a platform designed to support complex, dynamic workflows written by humans and AI alike.
Whether you're an engineer building scalable systems, a founder dreaming of your next big idea, or just someone intrigued by cutting-edge technology, this session will provide inspiration, practical takeaways, and a fresh perspective on what DevOps can be in the age of AI.
Join me for a story about innovation, challenges, and how Triform was built to empower the next generation of developers.
Iggy Gullstrand is the CEO and co-founder of Triform, a pioneering platform for building, deploying, and orchestrating large-scale AI agents tailored to meet today’s dynamic production demands. Iggy and Josef Nilsen are currently out on a road trip through Europe with a tiny campervan visiting events and building the platform while meeting up with developers from all over the world.
I'm a member of a platform team that has run three different service mesh solutions over the past 7 years. We switched between them seamlessly for the other 1,500 engineers who work at Avito; our solution manages over 3,000 microservices and more than 3 million RPS. I will share the difficulties we hit and the lifehacks that made it possible.
Over the past seven years, our team at Avito has operated two self-implemented service meshes before completing a two-year migration to Istio. Today, our service mesh supports over 3,000 services across dozens of Kubernetes clusters, processing millions of requests per second. This solution is maintained by a single team, while more than 1,500 developers seamlessly work without needing to understand the internals of the service mesh or write Kubernetes manifests. This talk will delve into:
- Our two-year journey migrating from one service mesh to another.
- Challenges in organizing ingress communication rules.
- Accelerating the adoption of Mutual TLS (mTLS) and service authorization.
- Testing changes within a service mesh.
- Should developers be aware of the service mesh? How we’ve made it transparent to them.
- The new possibilities and advantages unlocked by adopting a service mesh.
- What can you expect from a service mesh at scale, and how can you choose the right one?
Igor works at Avito, the largest classified advertisement website (MAU over 50 million).
A Platform Engineer who is strongly interested in Kubernetes, observability tools, service meshes, and other modern cloud-native approaches.
Before working as a software engineer, Igor studied distributed systems and participated in ACM competitions.
Building an event-driven system is the easy part. You build producers that produce messages and consumers that consume messages, and you leverage managed services as the message channels between your systems. But what does this mean for your operations? The things that keep your systems online, your users delighted, and your pager quiet at 3 am.
In this session, you'll gain practical knowledge you can apply when building and operating event-driven systems. You'll learn how to test systems that prioritize asynchronous communication, evolve them over time, and, most importantly, observe them and recover from failures when things go wrong. This session will walk through the software development lifecycle for an actual application, and it includes the theory behind the different practices and practical things you can take away and implement in your architectures.
James Eastham is a Serverless Developer Advocate at Datadog. He has over 10 years of experience in software, at all layers of the application stack.
He has worked in front-line support, database administration, and backend development, with everything from startups to some of the biggest companies in the world, architecting systems using AWS technologies.
James produces content on YouTube focused on building applications with serverless technologies using .NET, Java & Rust.
In this talk, I present a novel meta-operating-system approach to the cloud continuum, showcasing the NebulOuS project vision and the first results that enable cloud continuum ops.
NebulOuS makes substantial research contributions in the realm of cloud continuum brokerage by introducing methods and tools for enabling secure and optimal application provisioning and reconfiguration over the cloud continuum.
The NebulOuS project develops a platform that seamlessly exploits edge nodes in conjunction with multi-cloud resources to cope with the requirements posed by low-latency applications. We call this kind of platform a Meta Operating System (MOS), as it works as the layer above traditional operating systems.
The NebulOuS software is published on GitHub at https://github.com/eu-nebulous under an open-source licence: the Mozilla Public License 2.0. It extensively uses other open-source software published under permissive licences such as the Apache License 2.0. Most notably, the core of deployment relies on Kubernetes, KubeVela and Knative. The solution includes the following directions of work:
- Modelling methods and tools for describing the cloud continuum, application requirements, and data streams, and for assuring the QoS of the provisioned services.
- Efficient comparison of available offerings, using appropriate multi-criteria decision-making methods.
- Addressing the security aspects emerging in the cloud continuum.
- Conducting and monitoring smart-contract-based service level agreements.
Jan is a dedicated Java Developer and Cloud Engineer, currently advancing in his computer science degree. With over three years of involvement in the HORIZON 2020 programs, he has gained substantial experience in developing Java applications and managing cloud infrastructures. Janek's practical knowledge and hands-on approach to cloud technologies and software development make him a valuable asset. His work primarily focuses on implementing scalable cloud-native solutions and ensuring robust system performance. Janek is known for his ability to navigate complex technical challenges, a skill he continuously refines through his academic and project-based experiences.
We think of containers as providing isolation for our applications, however a major source of performance interference remains unaddressed, significantly degrading performance. Contention for CPU caches and memory bandwidth has been shown to increase tail response times by 4-13x and reduce compute efficiency by over 25% – even with per-application CPU and memory limits in place. With current telemetry, affected applications simply show high CPU utilization, leading operators to "throw more hardware at the problem", which is expensive and ineffective at mitigating the high response times.
In this talk, we'll cover three key areas:
1. Real-world triggers such as garbage collection and container image decompression.
2. How modern CPU features allow detecting interference and identifying noisy neighbors.
3. Practical approaches to mitigating these effects, including findings from Google's and Alibaba's production environments.
Finally, we'll provide a status report on our open source effort to measure memory interference and discuss future directions.
Jonathan Perry is a Texas-based maintainer of the OpenTelemetry eBPF-based network collector. He researched performance isolation in datacenter and cloud networks at MIT, then founded Flowmill, which developed an eBPF-based Network Performance Monitoring collector, now part of OpenTelemetry (the company was acquired by Splunk).
Taking Machine Learning to production: Cloud MLOps for speed and efficiency
I work with startups that innovate on their algorithms but hit a scaling wall as they succeed. I'll show how cloud platforms like Google Cloud, AWS, and Azure provide a full spectrum of MLOps services, and how to decide when to leverage each.
Joshua Fox advises tech startups and growth companies about the cloud: Google Cloud, AWS, and Azure. He also writes open source, publishes technical articles, and speaks to cloud engineers as a Google Developer Expert. Before that, he was a software architect in innovative technology companies in Israel for 20 years. He has a PhD from Harvard University and a BA in math from Brandeis.
Fancy a peek into the crystal ball for 2025's resilience planning? Join us as we unpack the valuable lessons gleaned from Amazon's best practices and customers' experiences in 2024. We'll explore the Chaos Engineering mechanisms AWS has developed to fortify your workload's resilience. Ready to turbocharge your AWS journey? This session is your ticket to staying ahead of the curve. Don't miss this opportunity to future-proof your operations – what will you discover?
Laura Thomson is a Principal Product Manager with AWS Reliability Services and has been with AWS for 7+ years. She has worked on a variety of EC2 components, including launch templates, AMIs, and the EC2 console (i.e., front end), and recently led an early-stage initiative for sustainability data exchange across supply chains. Currently her focus is on Fault Injection Service and supporting customers in testing and improving their applications' resilience. Laura obtained a B.S. in Mechanical Engineering from the University of Southern California and an MBA from Carnegie Mellon University.
Vladislav Nedosekin is a Principal Solutions Architect at Amazon Web Services with over 20 years of experience pioneering resilience engineering and chaos testing practices in regulated industries. Based in London, he specializes in helping financial services organizations build resilient cloud-native platforms using cutting-edge technologies. He is passionate about resilience and chaos engineering, and his current focus areas include implementing chaos engineering practices in serverless architectures and leveraging generative AI to enhance system resilience.
From Spot Ocean to Karpenter: adjoe's zero-downtime migration story. Learn how we switched autoscalers in production, the challenges we faced along the way, and why we built a custom controller to fix broken nodes.
Is it possible to seamlessly hand over your Kubernetes cluster from one node-autoscaler to another without causing downtime — and, more importantly, should you? In this talk, we’ll dive into how adjoe accomplished this transition, tackling challenges such as scaling test environments outside working hours, navigating the release of Karpenter V1 mid-migration, and even creating a custom controller to manage broken nodes. Join to learn about the obstacles we faced and the valuable benefits we gained along the way.
Marius is a dedicated DevOps Engineer at adjoe, Hamburg’s fast-growing adtech company. Since joining over three years ago, Marius has been on a mission to enhance the reliability and scalability of our backend, an event-driven microservice architecture written in Go and powered by Kubernetes. His passion for coding and Kubernetes doesn’t stop there: Marius also actively contributes to open-source projects, including CoreDNS and Karpenter, constantly pushing his skills and knowledge forward. When he’s not scaling systems or optimizing code, you’ll find him hosting board game nights, tackling bouldering walls, or setting off on adventures across Asia.
DevOps has many benefits for software engineering, but is rarely talked about outside of that context. In this talk we’ll explore why DevOps is not a purely technical endeavour, what it means to apply DevOps across the whole organisation, and how you can use these ideas to deliver change where you work.
DevOps is one of those terms whose meaning varies greatly depending on the background, job role, and lived experience of those being asked, but most people would agree that it is something that applies to the development and operation of software. But at some point, DevOps outgrew this simple coming together, and so we combined other functions to get terms like ‘DevSecOps’, ‘DataOps’, ‘MLOps’ and so on. All these terms are, however, too simplistic to capture the actual essence of DevOps and the ways of working that have developed around it. However, there is no reason that these principles should only be applied to the development and operation of digital products and services. They can actually be applied across organisations as a whole to create ‘The DevOps Organisation’. This talk explores how DevOps principles can be taken and applied outside of the context of software development to provide better outcomes, both internally, and externally, for an organisation.
Mark has over 15 years of experience in software engineering, working across the SDLC, starting out as a software engineer and later transitioning into DevOps and Cloud focused roles. He has been leading Communities of Practice for around 5 years and has worked with numerous technology stacks across a wide range of industries, including Healthcare, FinTech, and Logistics. He has worked as an AWS consultant to some of the biggest Financial and Insurance firms in the U.K. and is now the DevOps Practice Lead at RiverSafe. In this role he is responsible for leading and building a highly skilled team of DevOps experts, delivering expert services and exceptional outcomes for customers. Mark is especially passionate about serverless technology and sustainability.
The concept of “Shift Left” has long guided developers to address issues early in the software development lifecycle (SDLC), catching bugs before they reach production. But as modern software ecosystems become more complex—with microservices, serverless architectures, and global deployments—testing alone no longer suffices. It’s time to rethink Shift Left as more than just pre-release testing—it must evolve. Traditional testing strategies like Test-Driven Development (TDD) and Behavior-Driven Development (BDD) have been invaluable, ensuring code correctness and expected behavior. However, they miss critical aspects of real-world performance, scalability, and user experience. A passing test suite can’t guarantee an application won’t crash under peak loads or degrade due to subtle performance bottlenecks. Enter observability: the missing piece in the Shift Left equation. By integrating observability practices early in the SDLC, teams gain deep, actionable insights into how applications behave in production-like environments. Metrics, traces, and logs reveal performance slowdowns, scalability constraints, and user experience hiccups long before end users are affected.
Martin McLarnon is an engineer with almost 30 years of experience working in many different roles within IT. He has worked as a Network Manager, Software Engineer, Cloud Architect, and Engineering Lead, and as an engineering consultant before joining Coralogix.
In this talk, I’ll share how focusing on business metrics, not just technical ones, can transform Site Reliability Engineering. By tracking business-centric metrics, we identified issues early and resolved them before they significantly impacted users or revenue.
Real-World Cases from Experience
SREs shouldn’t focus only on infrastructure and tech metrics. Sometimes, tracking core business metrics and spotting anomalies can uncover critical issues—often more effectively than monitoring millions of technical metrics.
Attendees will leave with practical insights and strategies to incorporate business-driven monitoring into their workflows, aligning technical operations with business success.
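As a minimal illustration of the idea (a toy example of my own, not the speaker's implementation), even a simple rolling baseline over a business metric such as completed orders per minute can surface an incident before infrastructure dashboards do.

```python
# Toy anomaly check on a business metric; names and thresholds are illustrative.
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag the latest value if it sits more than z_threshold standard
    deviations away from the recent baseline."""
    if len(history) < 10:
        return False  # not enough data for a meaningful baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

orders_per_minute = [120, 118, 125, 119, 121, 123, 117, 122, 120, 124]
print(is_anomalous(orders_per_minute, latest=60))   # True: orders dropped sharply
print(is_anomalous(orders_per_minute, latest=121))  # False: within the normal range
```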
Mike Rykmas is a skilled software engineer with years of expertise in database administration, cloud computing, and IT infrastructure. Thanks to his strong background in data management and performance optimization, Mike has successfully led and implemented scalable solutions, managing petabytes of data to meet diverse business needs.
This talk covers a topic that's universal across any team, company, and industry that deals with technology: Team Composition.
With this talk, I bring relevant data and proven sources to the discussion to explain what the key concepts are and why they matter so much for the outcomes delivered by any team that deals with software in any way.
During the talk, my main goals are to:
- Explain the key concepts in an accessible and interesting way;
- Use examples to demonstrate the concepts in practice;
- Share valuable sources along the ride (books, studies, etc.);
- Finish with actionable points for both engineers and team leaders/managers/etc.
Throughout my 12-year tech career, I've worked in production support, DevOps, and now leadership. I've worked for companies like IBM, Luxoft, and now Pegasystems.
I lived and worked most of my life in South America, and I've now been living in Poland for some years.
These days, my focus is on Cloud, Observability, DevOps and, mostly, leadership. I also hold a few certifications, including Kubernetes CKA, AWS, GCP, Scrum, and Pega Architecture.
When I'm not catching up with my Steam games backlog, I love participating in conferences, hackathons, learning new things and meeting new people.
In the cloud-native space, there is a plethora of tools available for observing Kubernetes applications and infrastructure. However, the choices often involve either opting for service meshes that increase architectural complexity or selecting tools with exorbitant costs. What if there was a one-stop solution that covers monitoring, tracing, and profiling seamlessly at zero cost and with minimal changes?
Join Prerit in this session as he takes you on a journey with Pixie, a CNCF sandbox project offering comprehensive observability of your cluster and its workloads. Pixie enables you to monitor cluster resources and application traffic, and delve into detailed views like pod state and flame graphs. Remarkably, Pixie achieves all this without introducing any overhead, using eBPF to automatically collect telemetry data, including full-body requests and application profiles. Explore Pixie's capabilities in monitoring network activity and infrastructure health with flame graphs. Don't miss this session for a deep dive into Pixie's observability features and how it simplifies the monitoring landscape for Kubernetes applications, all followed by a hands-on demo with a live application.
Prerit is working as a Software Architect, directing his expertise towards harnessing Cloud Native Technologies to design resilient architectures that can seamlessly scale in the future, all while prioritizing technical cost, security, availability and end-user experience. As the driving force behind the YouTube channel 'Tech with Prerit' and KubeCloud, an umbrella company with multiple products in the Cloud-Native Space, he envisions creating an ecosystem focused on Cloud Native Technologies.
July 18th 2020: Max Verstappen qualifies 7th for the Hungarian Grand Prix. With Red Bull fighting Mercedes for the Constructors’ Championship and Max fighting Lewis Hamilton and Valtteri Bottas for the Drivers’ Championship, this wasn’t his best qualifying session.
July 19th 2020: during the formation lap, Max crashed his car, damaging his front wing and suspension. In the next 22 minutes, Red Bull mechanics performed an absolute miracle.
How did they do it? How did they go from a crashed car to giving Max a chance to fight for a podium in under 23 minutes? And what can we learn about incident management from one of the most demanding engineering disciplines in the world?
Principal Engineer, SRE at FanDuel/Blip.pt. MSc in Computer Science from the University of Porto. CK{AD, A, S} from the Cloud Native Computing Foundation (CNCF) | Linux Foundation. {Terraform, Consul, Vault} Associate from HashiCorp. Working daily to build high-performance, reliable, and scalable systems. DevOps Porto meetup co-organizer and DevOpsDays Portugal co-organizer. A strong believer in culture and teamwork. Passionate about open source, a martial arts amateur, and a metal lover.
Parenthood often arrives with little time to prepare. Our idea of 'good' parenting usually involves emulating others, whilst hoping we don’t do any permanent damage.
Stepping into management is remarkably similar: we emulate others, but it never feels quite right.
Fortunately, there’s a better way.
Simon Copsey is a change consultant on a mission to help organisations understand and unwind the complex, cross-functional obstacles that get in the way of their staff and reduce their ability to serve their customers.
Simon’s career has taken him from being a developer in the trenches to helping various organisations take pragmatic steps from a place of chaos and paralysis, to one where it becomes a little easier to see the wood for the trees.
This session will guide you through the foundational concepts, setup, and best practices to effectively implement OpenTelemetry in your environment. We’ll explore key components like instrumentation libraries and collector configurations, showcasing practical examples and integrations.
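For readers who want a concrete starting point before the session, here is a minimal, hedged example of manual instrumentation with the OpenTelemetry Python SDK exporting spans over OTLP; the service name, span names, and collector endpoint are placeholders.

```python
# Minimal OpenTelemetry tracing setup; endpoint and names are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify the service and ship spans to a (hypothetical) local collector.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout.instrumentation")

with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")
    # ... business logic would be traced here ...
```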
Steve Flanders is a Senior Director of Engineering at Splunk, a Cisco company, responsible for the Observability Platform team, which includes contributions to the OpenTelemetry project. He was previously the Head of Product at Omnition (acquired by Splunk). Prior to Omnition, he led log analytics and data collection at VMware. Steve is writing a book called Mastering OpenTelemetry and Observability: Enhancing Application and Infrastructure Performance and Avoiding Outages, which will be available this fall.
When a legacy bank's fintech venture faced operational paralysis, one "simple" question triggered a revolution. Discover how GitOps transformed a tech stack and an entire organisational DNA. This isn't the story of another tool – it's a blueprint for turning chaos into your competitive advantage.
Every transformation story has a trigger point. For one of Britain's oldest banks, it wasn't a system failure or a security breach – it was a single question nobody could answer. This seemingly innocent inquiry exposed the fragility of their fintech venture's operational foundation and sparked a journey that would reshape their entire organisation. This isn't just another technical implementation story. It's a narrative about how GitOps principles catalysed a cultural renaissance within a centuries-old institution. You'll witness how a team transformed from firefighters to innovators and how version control evolved from a code repository to a single source of truth for the entire organisation. Whether you're drowning in operational complexity or simply seeking to future-proof your organisation, this talk will arm you with actionable insights to lead your own transformation. You'll leave understanding how GitOps can be more than a methodology – it can be the foundation of your competitive advantage in the digital age. Leave inspired, equipped, and ready to transform your own chaos into clarity.
Steve Wade founded The Cloud Native Club, a global community for cloud-native enthusiasts. He is also a maintainer of the Flux Terraform Provider. As an experienced conference speaker, independent cloud-native consultant, and trainer, Steve shares his expertise worldwide. He has held platform leadership roles across various industries, including real estate, gaming, fintech, and the UK Parliament. With a BSc in Computer Science, Steve is passionate about cloud-native software development and distributed computing.
In a world increasingly defined by complex technology and rapid innovation, it is easy to focus entirely on the technical aspects of success. Yet, the most advanced cloud infrastructure, the most cutting-edge tools, and the most sophisticated algorithms are only as effective as the people behind them. The truth is, it is not just the technical expertise that drives successful cloud initiatives; it is the ability to communicate, collaborate, and build trust that truly sets teams apart. In this talk, I will explore why the human side of the cloud is often the missing piece in achieving transformational success.
The cloud is not just a technical platform; it is a shared space where people, ideas, and goals converge. Whether you are working on a multi-cloud strategy or deploying serverless applications, success hinges on the ability of individuals and teams to work together seamlessly. It is about understanding diverse perspectives, breaking down silos, and fostering an environment where innovation thrives. Yet, too often, organizations overlook the importance of soft skills in favor of purely technical competencies. The result? Misaligned priorities, communication breakdowns, and missed opportunities to fully leverage the power of the cloud.
Soft skills such as empathy, active listening, and conflict resolution are not just “nice to have” in cloud projects—they are essential. Imagine a scenario where a development team and an operations team have differing priorities during a cloud migration. Without clear communication and a shared understanding, this disconnect can lead to delays, inefficiencies, or worse, failure to deliver. On the other hand, when team members can articulate their needs, understand each other's challenges, and find common ground, the result is a project that runs more smoothly and achieves better outcomes.
But soft skills extend beyond team dynamics. They are also critical when engaging with stakeholders. Whether you are pitching a new cloud initiative to executives or addressing concerns from non-technical colleagues, your ability to frame ideas in ways that resonate is crucial. It is not enough to understand the technical value of what you are building; you must also communicate its impact in human terms. The most innovative cloud solutions will fall flat if their value is not clearly articulated to those who depend on them.
This talk will also explore how soft skills can enhance leadership in cloud-focused environments. In fast-paced, high-stakes projects, effective leaders need more than technical know-how; they need to inspire, motivate, and guide their teams through uncertainty. Leadership is about creating a culture where people feel empowered to take risks, share ideas, and work together to solve complex problems. It is about fostering a sense of purpose that transcends technical goals and aligns with broader organizational values.
As organizations increasingly adopt cloud technologies, the demand for professionals with strong soft skills is only growing. Employers are looking for individuals who can not only manage infrastructure and write code but also lead conversations, solve conflicts, and drive collaboration across departments. These are the qualities that turn technical experts into invaluable team players and transformational leaders.
Soft skills also play a crucial role in bridging the gap between technical teams and end-users. Cloud technologies often bring about change, which can be met with resistance or hesitation. Professionals who can communicate with empathy and clarity help reduce friction, ensuring smoother transitions and greater adoption of cloud solutions. They can demystify the technical aspects, making the benefits of the cloud accessible and relatable to all.
This talk is for anyone who wants to unlock the full potential of their cloud projects by harnessing the power of human connection. Whether you are a cloud architect, a project manager, or an IT professional, you will leave with actionable insights on how to cultivate the soft skills that matter most. From practical tips on improving communication to strategies for building trust within teams, this session will equip you to succeed in the increasingly collaborative world of the cloud.
The human side of the cloud is not a secondary concern—it is the foundation upon which all successful cloud initiatives are built. By prioritizing soft skills, you can drive innovation, improve outcomes, and create a lasting impact in your organization. Join me for this talk and discover how the key to mastering the cloud lies not just in the technology you use but in the people who bring it to life.
Victor Onyenagubom is a dynamic Lecturer in Cybersecurity at Teesside University London, where he develops and delivers cybersecurity modules to international master’s students, fostering an engaging and inclusive learning environment.
Victor has attended the Pre-Doctoral school organized by the Max Planck Institute of Software Systems in Germany, in partnership with the University of Maryland and Cornell University. He also volunteers as a Lead IT Trainer at CodeYourFuture, where he empowers refugees and asylum seekers with digital and cybersecurity skills.
Victor is a frequent speaker at high-profile tech events, where he shares his expertise on cybersecurity and other emerging technologies. His combined expertise in cybersecurity, research, and community service positions him as a key figure in both academic and social spheres, dedicated to equipping others with the tools to navigate and secure the digital world.
As an IT Security Analyst at Pennon Group Plc, Victor performed security audits, designed cybersecurity training materials, and conducted vulnerability assessments, significantly enhancing the organization's security posture. In his role as a Research Assistant at the Nuffield Trust, he extracted, cleaned, and analyzed healthcare data, developed dashboards, and presented data-driven insights to improve NHS processes.
In our upcoming presentation, we'll explore a cutting-edge architectural solution for real-time SMS and email notifications, particularly geared towards responding to earthquake events. This system is designed to handle rapid data transmission, listening for event changes every second, making it ideal for real-time critical alert scenarios. Central to our discussion will be the integration of Lambda functions and Confluent Kafka, coupled with advanced multithreading techniques and DynamoDB lock strategies. A focal point of our presentation will be addressing the challenges and innovative solutions involved in integrating Confluent Kafka with Lambda functions to enable serverless operation of both producers and consumers. This is a key element in ensuring the quick and efficient distribution of notifications through parallel methods. Additionally, we will delve into the implementation of an automated scaling mechanism, which is vital for optimising the performance of the Serverless Notification ecosystem. Our aim is to provide a comprehensive insight into how these technologies can be effectively combined to develop a robust and efficient system, capable of delivering critical real-time alerts for situations like earthquake occurrences, ultimately playing a crucial role in saving human lives.
Vlad Onetiu, a DevSecOps and Software Automation Engineer from Cluj-Napoca, Romania, is renowned for his expertise in cloud technology, cybersecurity, and software automation. Since embarking on his career in 2018, he has been instrumental in conducting security research for Romania's major banks, significantly bolstering their cybersecurity measures. Vlad has also contributed to the field through his research papers on malware and phishing, shedding light on these critical cyber threats. His proficiency in employing cloud-based solutions for system automation, combined with his skillful handling of CI/CD processes and cloud architecture, reflects his commitment to fostering secure and resilient digital environments. Known for his passion for technology and relentless innovation, Vlad stands out as a leading figure in cybersecurity, continuously exploring and implementing cutting-edge strategies to address the challenges of evolving cyber threats.
Dive into the world of serverless and explore common, costly mistakes and learn actionable tips for cutting down waste and reducing your AWS bill. Whether you're looking to cut down on CloudWatch costs or improve cost-efficiency for your serverless application, we've got some helpful tips for you.
Helpful tips to cut down on waste and reduce AWS cost, including how to keep CloudWatch costs in check, how to implement caching, how to pick the right services for your workload, how to right-size Lambda functions, and so on.
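As a rough, back-of-the-envelope illustration of why right-sizing matters (my own sketch, not taken from the talk; the prices are example figures, so check current AWS pricing), Lambda compute cost scales with allocated memory multiplied by duration, which means more memory can cost the same or less if it cuts duration enough.

```python
# Back-of-the-envelope Lambda cost comparison. Prices are illustrative
# placeholders; always check current AWS pricing for your region.
PRICE_PER_GB_SECOND = 0.0000166667    # example on-demand price
PRICE_PER_REQUEST = 0.20 / 1_000_000  # example per-invocation price

def monthly_cost(memory_mb: int, avg_duration_ms: float, invocations: int) -> float:
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000) * invocations
    return gb_seconds * PRICE_PER_GB_SECOND + invocations * PRICE_PER_REQUEST

# Hypothetical function: doubling memory halves the duration, so the compute
# cost stays flat while latency improves; right-sizing means finding the
# point where extra memory stops paying for itself.
print(monthly_cost(memory_mb=512, avg_duration_ms=800, invocations=10_000_000))
print(monthly_cost(memory_mb=1024, avg_duration_ms=400, invocations=10_000_000))
```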
Yan is an experienced engineer who has run production workloads at scale in AWS for 10 years. He has been an architect and principal engineer in a variety of industries ranging from banking, e-commerce, and sports streaming to mobile gaming. He has worked extensively with AWS Lambda in production and has been helping companies around the world adopt AWS and serverless as a consultant. Yan is an AWS Serverless Hero and a regular speaker at user groups and conferences internationally, and he is also the author of Production-Ready Serverless and co-author of Serverless Architectures on AWS 2nd Edition, both by Manning. Yan also keeps an active blog at https://theburningmonk.com.
Monitoring the behavior of a system is essential to ensuring its long-term effectiveness. However, managing an end-to-end observability stack can feel like stepping into quicksand: without a clear plan, you risk sinking deeper into system complexities.
In this talk, we’ll explore how combining two worlds—developer platforms and observability—can help tackle the feeling of being off the beaten cloud native path. We’ll discuss how to build paved paths, ensuring that adopting new developer tooling feels as seamless as possible. Further, we’ll show how to avoid getting lost in the sea of telemetry data generated by our systems. Implementing the right strategies and centralizing data on a platform ensures both developers and SREs stay on top of things. Practical examples are used to map out creating your very own Internal Developer Platform (IDP) with observability integrated from day 1.
Eric is Chronosphere's Director of Evangelism. He's renowned in the development community as a speaker, lecturer, author, baseball expert, and CNCF Ambassador. His current role allows him to help the world understand the challenges it faces with observability. He brings a unique perspective to the stage with a professional life dedicated to sharing his deep expertise of open source technologies and organizations. More at https://www.schabell.org.
Graziano is a Developer Relations Engineer at Mia-Platform, specializing in content creation, technical evangelism, and bridging communication between users and R&D teams. He is also a recognized Green Software Champion, actively contributing to sustainable software practices. With a background in developing distributed systems and product management, Casto is an active speaker at industry events and a published author on topics like cloud-native technologies. He has previous experience as a Technical Product Owner and Software Engineer, focusing on agile practices and product management.
Imagine a self-healing system that handles surprises, letting you sleep peacefully. If that sounds appealing, chaos engineering could be the answer. Trusted by Netflix, LinkedIn, Google, and Facebook, it's key for business resilience. In this session, we'll explore its history, learn how to apply its principles to stress-test applications, and review tools for fault injection in real-world scenarios.
Agnieszka is a dedicated engineer specializing in cloud solutions, with expertise in cutting-edge technologies such as Kubernetes and Terraform. She also brings valuable insights from her experience in testing activities. Agnieszka is passionate about enhancing people’s wellbeing and promoting a healthy work-life balance. Outside of her professional endeavors, she is a sports freak, board game lover and owner of two adorable cats.
Observability is the ability to measure the current state of a system. Backend engineers are becoming more familiar with the primary signals and technologies, such as OpenTelemetry, that can be used to instrument applications and diagnose issues. Yet, in the frontend world, we're behind the curve.
Join me as I dive into the tools and techniques we can use to instrument, monitor, and diagnose issues in our production frontend applications. I'll cover RUM agents, the metrics and traces they provide, and how to combine them with backend tracing for a holistic picture. We'll dive into the state of client instrumentation in OpenTelemetry and what you can do now. Finally, we'll show how Synthetic Monitoring and alerting in observability platforms can help us be alerted to issues impacting users in the UIs we build and maintain.
Carly is a Developer Advocate and Manager at Elastic, based in London, UK. Before joining Elastic in 2022, she spent over 10 years working as a technologist at a large investment bank, specialising in Frontend Web development and agility. She is an agile evangelist, UI enthusiast, and regular blogger.
She enjoys cooking, photography, drinking tea, and chasing after her young son in her spare time.