Crossrail Place,
Canary Wharf,
E14 5AR, London, UK
Level -2
Tube access
Jubilee and Elizabeth lines and the DLR: Canary Wharf station
“Platform engineering” is the art of building and managing the infrastructure that powers your applications: a mix of cloud, a handful of DevOps, a pinch of SRE, and a thick glaze of product management. While it’s “nothing new,” many organizations are just starting to practice it—and for good reason. But what happens when your platform runs on a private cloud?
With around 50% of enterprise apps still running on private clouds, platform engineering for private platforms is surprisingly under-discussed. This talk dives into real-world examples and stories from large organizations tackling this challenge. The hurdles often lie in adapting platform engineering to existing IT stacks and processes—most organizations can’t simply start from scratch, nor would they want to abandon what’s currently driving revenue.
If you support your organization’s apps, platform engineering is something you’ll probably be doing soon. Come learn how your peers are navigating these challenges and share your own experiences.
Michael Cote studies how large organizations get better at building software to run better and grow their business. His books Changing Mindsets, Monolithic Transformation, and The Business Bottleneck cover these topics. He's been an industry analyst at RedMonk and 451 Research, done corporate strategy and M&A, and was a programmer. He also co-hosts several podcasts, including Software Defined Talk. His daily-ish newsletter is at newsletter.cote.io.
In this talk, we’ll explore how we revolutionized our SLO practices by introducing User Objectives—customer-experience-focused metrics that transcend individual services. This approach transformed our SRE function from a traditional embedded model to a centralized Application SRE team, fostering collaboration with product and engineering teams across the organization.
We'll share how we:
- Collaboratively defined User Objectives with PMs and Engineering Leaders.
- Mapped dependencies across services and datastores to create a robust dependency graph.
- Built SLOs centered on User Objectives, with secondary metrics for individual services.
- Established effective processes like weekly SLO reviews for product teams and monthly Production Reviews with senior leaders.
- Introduced meaningful alerting using error budgets and burn rates (a brief illustrative sketch follows this list).
- Developed an SLO framework that automates dashboards, monitors, and metrics.
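To make the error-budget and burn-rate item concrete, here is a minimal, hedged sketch of the underlying arithmetic; the numbers, window choices, and paging threshold are illustrative assumptions, not the framework described in this talk.

```python
# Illustrative burn-rate calculation for an availability SLO.
# Numbers, windows, and the paging threshold are hypothetical examples.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent relative to plan.

    error_ratio: fraction of failed requests in the observed window.
    slo_target:  e.g. 0.999 leaves an error budget of 0.1%.
    """
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget

# A 99.9% SLO with 0.5% of requests failing burns budget 5x faster than
# sustainable. Multi-window alerting pages only when both a long and a
# short window burn fast, which filters out brief blips.
long_window = burn_rate(error_ratio=0.005, slo_target=0.999)   # 5.0
short_window = burn_rate(error_ratio=0.006, slo_target=0.999)  # 6.0
PAGE_THRESHOLD = 14.4  # a threshold commonly cited for a 1h/5m window pair

print(long_window > PAGE_THRESHOLD and short_window > PAGE_THRESHOLD)
```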
This evolution redefined SRE’s role in our company, establishing a true partnership with product teams. By creating feedback loops that balance features and stability, we’ve elevated our understanding of product reliability and improved the customer experience.
Tomasz Czajka is a seasoned professional with extensive experience in DevOps, DataOps, and Site Reliability Engineering (SRE). His career spans leadership roles, automation projects, and performance optimization across high-profile organizations like Tenable, Deutsche Bank, and Citrix. Tomasz has led significant projects, including the development of scalable analytics workflows, robust CI/CD pipelines, and performance-enhancing solutions on large-scale infrastructure. His technical expertise includes Python, Bash, Docker, Kubernetes, and tools for hypervisor platforms. Tomasz is also adept at driving collaboration, implementing cutting-edge tools, and improving process efficiencies. With a strong academic foundation in IT from Wrocław University of Science and Technology and Politechnika Opolska, he embodies continuous learning and excellence in engineering practices.
Ciaran Gaffney brings over a decade of diverse experience in Site Reliability Engineering (SRE) and software development. His journey spans key roles at Tenable and Hosted Graphite, where he demonstrated expertise in designing and maintaining distributed systems, enhancing reliability, and implementing innovative solutions such as dynamic load balancing and a gRPC-based aggregation layer. At Hosted Graphite, he was integral in scaling a system that processed 160 billion data points daily while ensuring SLAs were consistently met. Notable achievements include developing Hosted Graphite's alerting feature, creating customer-facing APIs, and leading automation and hardware provisioning efforts. With a strong foundation in Python, DevOps tools, and infrastructure management, Ciaran is a skilled problem solver passionate about system scalability, performance optimization, and customer satisfaction.
Pascal Schlumpf brings extensive expertise in site reliability engineering, software development, and monitoring systems to our event. With experience spanning over a decade in leading organizations such as Tenable and AT&T, Pascal has honed his skills in automation, monitoring integration, and reducing MTTR through innovative solutions. From building exporters for Prometheus and Grafana to integrating systems with AppDynamics and Splunk, Pascal has demonstrated a commitment to advancing observability and system reliability. We're excited to have him share his insights and practical approaches at the conference.
SRE teams often face challenges with a high volume of routine tasks and requests, making it difficult to focus on critical, high-priority issues. At Electrolux, we faced the same challenge, which led us to develop InfraAssistant, a multi-agent AI-powered solution designed to automate key operational tasks such as infrastructure management, user onboarding, and responding to internal requests. This shift reduced our manual workload and significantly improved operational efficiency. InfraAssistant is built on specialized agents that coordinate to autonomously manage complex tasks, reducing the need for continuous manual involvement. This session will cover the design and orchestration of these agents, showcasing how InfraAssistant helps SRE teams by automating day-to-day operations, minimizing repetitive tasks, and enhancing the management of complex infrastructure.
Alina Astapovich is a Site Reliability Engineer with a strong background in IoT development. Holding a master's degree in 'Neurotechnologies and Software Engineering,' she excels in designing and architecting backend platforms and microservices, and she has a strong track record of building scalable and robust solutions. With a Solutions Architect certification, Alina leverages her expertise in cloud technologies to optimize system performance and reliability. She is dedicated to continuous professional growth, always exploring the latest innovations and trends in technology. As a skilled public speaker, Alina regularly shares her insights at industry events and conferences, contributing to the broader tech community.
Markus Makela is a skilled professional with a background in software development and site reliability engineering (SRE), currently working at Electrolux Group in Stockholm. He holds a Civilingenjörsexamen in Engineering Physics from KTH Royal Institute of Technology (2015–2021) and is experienced in machine learning and MLOps, complemented by an AWS Certified Machine Learning Engineer – Associate credential (valid through 2027). He was formerly a Software Developer and Programmer at Prevas AB, delivering innovative solutions in the Stockholm region.
We work in the IoT space at Electrolux Group, a leader in the home appliance industry, where we scaled from 10 to 300 developers with just 5 Ops engineers in 4 years. Along the way, we faced challenges in promoting SRE principles to development teams, which led us to transition from SRE to Platform Engineering. In this talk, we’ll share how we built an Internal Developer Platform (IDP) integrated with our cloud and toolchains, embedding SRE principles into its foundation. This platform enables developers to autonomously create and manage infrastructure and services while adhering to best practices and security standards. This newfound autonomy boosted developer productivity, enabling teams to spin up new regions in a single day rather than relying on approval processes. We'll discuss Electrolux's journey from a traditional SRE model to platform engineering, covering the challenges we faced with our initial SRE model, including scalability and a lack of self-service capabilities for developers. We'll then delve into how we addressed these challenges by automating common requests, moving workloads to SaaS solutions, and building an internal developer platform. Finally, we'll explore how we open-sourced our developer platform, making it available to the wider community.
Kristina Kondrashevich is an SRE Product Manager at Electrolux. Over her career she has been responsible for writing code, managing teams, and improving delivery processes. Today, as a PM, she supports her Platform team in bringing traditional product management practices into their way of working. Her primary objective is to prioritize the satisfaction of developers within her organization, viewing them as consumers and striving to enhance their overall experience. She likes to read and has a profound love for lifelong learning; exploring new ideas, concepts, and perspectives is something that truly ignites her curiosity.
Gang Luo has spent over six years building and leading distributed teams, introducing Chaos Engineering, and adopting cloud-native and SRE best practices to enhance service reliability and scalability. His tenure at Salesforce included designing large-scale automated testing and deployment platforms, saving significant resources while maintaining exceptional system performance and availability. Gang holds a Master's in Software Engineering and Management from Linköping University and has earned multiple awards for his technical contributions and innovation.
The goal of this talk is to show the source of the information many tools use to display process information. I will go through the most interesting files in the /proc filesystem, show what information lives there, and demonstrate the standard tools for displaying it. This comes in handy when you don't have permission to install extra tools, or when the tool you need doesn't exist yet. I will also show the most common use cases: restoring binaries deleted by accident or intentionally, restoring deleted log files, finding disk space leaks, debugging process limits and environment variables, and many more.
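As a taste of the kind of /proc spelunking the talk promises, here is a small sketch of my own (Linux only, with a placeholder PID) that reads a process's environment and limits and flags open file descriptors whose backing file was deleted, which is the trick behind recovering deleted logs and binaries while the process is still running.

```python
# Minimal /proc inspection sketch (Linux only); the PID is a placeholder.
import os

pid = 1234  # hypothetical target process

# Environment variables are NUL-separated in /proc/<pid>/environ.
with open(f"/proc/{pid}/environ", "rb") as f:
    env = [e.decode(errors="replace") for e in f.read().split(b"\0") if e]

# Resource limits are plain text in /proc/<pid>/limits.
with open(f"/proc/{pid}/limits") as f:
    limits = f.read()

# Open file descriptors are symlinks; a "(deleted)" suffix means the file
# is gone from the filesystem but its contents remain reachable through
# the descriptor, e.g. by copying /proc/<pid>/fd/<n> somewhere safe.
fd_dir = f"/proc/{pid}/fd"
for fd in os.listdir(fd_dir):
    target = os.readlink(os.path.join(fd_dir, fd))
    if target.endswith("(deleted)"):
        print(f"fd {fd} -> {target}  (recoverable while the process lives)")

print(env[:3], limits.splitlines()[0], sep="\n")
```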
Aivars Kalvāns is a FinTech developer, software architect, and consultant. He spent more than 18 years at Tieto developing and architecting payment card software for acquiring and issuing, accounting and utility payments through mobile phones, ATMs, and POS terminals. At the moment he is a contractor for Ebury exploring the Foreign Exchange area of the FinTech landscape.
Serverless breaches expose dangerous missteps in securing function chains, IAM policies, and API gateways. We unravel serverless compromises to reveal the overlooked risks lurking in your infrastructureless apps. Arm yourself with actionable lessons to lock down your functions and avoid headlines.
As organizations rapidly adopt serverless, they expose themselves to new risks not secured by traditional controls. Real-world serverless breaches have already exposed overlooked flaws and misconfigurations resulting in catastrophic data theft and service disruption.
In this talk, we conduct a deep forensic analysis of high-profile serverless compromises involving misconfigured S3 buckets, overly permissive functions, vulnerable web frameworks, and common missteps in serverless permissions and secrets management.
Walk away with actionable lessons to avoid becoming the next serverless breach headline. We’ll provide concrete steps to reduce your attack surface, implement least privilege access, monitor anomalous activity, and instill a “secure by default” posture across your infrastructureless apps.
Babar Khan Akhunzada is a 24-year-old cyber wizard and the Founder & CEO of SecurityWall, which builds on AI and big data technology to help enterprises and individuals enhance their security capabilities through capability building, risk management, and hybrid security audits.
Babar has helped numerous enterprises tackle financial cybercrime issues at both the business application and infrastructure scale. These enterprises improved 95% of their security alignment and protected themselves against cybercriminals exploiting zero-paid orders that could have led to a financial crisis.
Babar has been acknowledged by well-known tech companies for contributing to their products' security, including Adobe, eBay, Apple, Nokia, Microsoft, Oracle, Sony, Redhat, Yahoo, DuckDuckGo, StackOverflow, NextCloud, and 100+ more.
Babar has many years of experience in bug bounty hunting and ethical security work, which makes him stand out at such a young age. He has served as a Security Consultant for the Directorate of Information Technology, where he was responsible for the security of the Government Data Center during his tenure.
Recently, Babar was featured in a 25-Under-25 list as a young high achiever. He has attended many conferences and events, including BlackHatMEA, the High Technology Crime Investigation Association, UC EXPO, OWASP Romania, Cyber Security Indonesia, the EC-Council Annual Halted Conference, HITCON Taiwan, and many more.
Pepr simplifies Kubernetes operations by consolidating admission controllers and operators into one lightweight framework. Enforce global security postures, leverage a full-fledged programming language, and offload operational expertise into code. Pepr makes administering Kubernetes clusters easy!
Think of Pepr as a mix between Operator SDK and Kyverno/OPA Gatekeeper. It takes a fresh approach to building Kubernetes controllers with extremely simple interfaces, and it takes no time to get up and running in the cluster, automating away tasks that would have taken hours. The only requirement is a Kubernetes cluster.
Casey is currently a lead software engineer focused on hybrid cloud around Kubernetes. He solves problems in distributed systems and loves chatting and connecting with people.
Today there are thousands of AI and SaaS services out there, and they are used throughout your business. This session will explain the what, why, and how of platform teams productising their external AI and SaaS providers to provide flexibility and control over these external applications.
These AI and SaaS providers are accessed through APIs, and by managing them in a centralised platform you gain the same advantages as managing your internal APIs. These advantages include:
- Tracking spend (and usage) for all external services from one central location
- Protecting against data exposure
- Understanding SaaS-to-SaaS communication
- Consistency
- Minimising security exposures
- Controlled self-service onboarding and credential management
- Centralised caching
This session will go through the what, why, and how of this approach, and how it can be done with tools already available.
As an Integration SME, Chris has travelled the world working with clients on how to utilise APIs and how to ensure they are following the best strategies to support both internal and external consumers.
AI products are becoming critical for businesses to maintain a competitive edge, yet integrating them into an organization’s ecosystem brings unique challenges. Ensuring the reliability, security, and alignment of AI systems with business goals and ethical standards demands new approaches and tools.
This talk explores the concept of AI Reliability Engineering (AIRE), which adapts SRE principles to AI systems. Based on our experience as an AI-based startup building a mentoring platform, we’ll discuss the challenges and solutions encountered when managing LLMs, language-chain tracing, and AI gateways.
Key challenges include:
- Lack of visibility and control over the AI lifecycle: data collection, model deployment, and monitoring.
- Ensuring the quality and robustness of AI models, addressing issues like prompt attacks, data drift, and evolving performance.
- Managing complexity in dependencies, configurations, and resources across environments.
The AIRE approach combines the CNCF ecosystem with established SRE practices to address these challenges. By leveraging tools like OpenInference and AI gateways, AIRE introduces processes that enhance reliability, mitigate risks, and improve the security of AI systems.
A seasoned professional with over 15 years of experience in Software Development, DevOps, Site Reliability Engineering (SRE), AIOps and Kubernetes.
Proven track record in leadership roles, including Technical Manager, Operations Team Lead, and CTO, plus five years as a co-founder of cloud B2B/B2C application projects.
DevOps/SRE/Kubernetes Coach and Public Speaker
Tired of DBaaS lock-in? Divine explores Sovereign DBaaS, a model putting you back in control. Learn how to build your own private DBaaS with Dapr, overcoming compliance and licensing headaches. Gain insights from real-world feedback and design considerations.
DBaaS has existed for over a decade. Since Amazon launched its Relational Database Service (RDS) in 2009, it has become one of its most popular services, and the other cloud providers followed suit. Managed DBaaS services are popular because databases are complex at scale, requiring specialized knowledge, and there is a shortage of experienced DBAs.
Though RDS and the like make managing databases easier, they create new problems ranging from data compliance issues and license instability to vendor lock-in. Hence, sovereignty is needed, not just for the data, but for its stack – infrastructure and tooling.
Divine will start this talk by elaborating on the critical problems with traditional DBaaS today, introducing the concept of Sovereign DBaaS and its benefits.
After that, Divine will discuss the actual building of a private DBaaS using Dapr, from the fundamental points you need to consider to system design considerations and what a provisional architecture will look like when developed.
Finally, Divine will wrap up by sharing feedback on building a private DBaaS from the CNCF DBaaS community and talking to database decision makers in organizations.
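To give a flavour of what building on Dapr can look like in practice (a minimal sketch of my own, not the provisional architecture Divine will present), an application talks to whatever database backs a Dapr state store component through one portable API; the store name and key below are hypothetical, and a running Dapr sidecar is assumed.

```python
# Minimal Dapr state-store sketch; assumes a running Dapr sidecar and a
# configured state store component named "customer-db" (both hypothetical).
from dapr.clients import DaprClient

STORE = "customer-db"  # could be backed by PostgreSQL, Redis, etc.

with DaprClient() as client:
    # Application code stays the same regardless of which database the
    # platform team wires in behind the component definition.
    client.save_state(store_name=STORE, key="customer:42", value='{"plan": "pro"}')
    item = client.get_state(store_name=STORE, key="customer:42")
    print(item.data)  # raw bytes of the stored value
```

The portability of the state API is arguably one reason Dapr is attractive for a sovereign DBaaS: swapping the underlying database becomes a component configuration change rather than an application rewrite.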
Divine Odazie is a Technology Evangelist at Severalnines with over five years of experience in technology and a track record in backend engineering, DevOps, cloud native, and developer relations on a global scale. He has given talks and workshops at developer conferences such as Open Source Summit Europe, KubeCon North America, Cloud Native Rejekts, and the Ansible Contributor Summit.
This session delivers critical insights into leveraging Kubernetes Operators in the data space. I'll cover the Kubernetes operator pattern, KubeBuilder, data governance, how we built our connector ecosystem, and more.
Building a data streaming platform that can scale with your organization while maintaining developer productivity is challenging. In this talk, I'll unpack the architectural decisions and engineering challenges of building a cloud-native streaming platform from scratch using Kubernetes Operators. Through this journey, from initial design to production scale, we'll explore how we transformed the operator pattern into a foundational piece of our data infrastructure, powering thousands of streaming workflows daily. Beyond the technical implementation, I'll share insights about building developer experiences that scale. We will cover the architectural patterns used and the extensible connector ecosystem that enabled our platform to handle millions of daily events without changing its core design.
Elad Leev is a Staff Engineer and open-source enthusiast with over 10 years of experience managing complex production systems. He has expertise in distributed systems, databases, and data streaming technologies. Elad has been using Kafka in production since 2016, and he is a strong advocate for data streaming solutions, particularly technologies like Kafka, Flink, Debezium, and Kafka Connect.
What does it take to design and launch a DevOps platform capable of running alien code—written by both users and AI—fast, securely, and at scale? This talk dives deep into the journey of creating Triform, a platform redefining DevOps for a new era of AI-driven development.
Through real-world insights, you’ll learn how we transitioned from concept to launch, crafting a platform designed to support complex, dynamic workflows written by humans and AI alike.
Whether you're an engineer building scalable systems, a founder dreaming of your next big idea, or just someone intrigued by cutting-edge technology, this session will provide inspiration, practical takeaways, and a fresh perspective on what DevOps can be in the age of AI.
Join me for a story about innovation, challenges, and how Triform was built to empower the next generation of developers.
Iggy Gullstrand is the CEO and co-founder of Triform, a pioneering platform for building, deploying, and orchestrating large-scale AI agents tailored to meet today’s dynamic production demands. Iggy and Josef Nilsen are currently out on a road trip through Europe with a tiny campervan visiting events and building the platform while meeting up with developers from all over the world.
I'm a member of a platform team that has run three different service mesh solutions over the past 7 years. We switched between them seamlessly for the other 1,500 engineers who work at Avito; our solution manages over 3,000 microservices and more than 3 million RPS. I will share the difficulties we hit and the lifehacks that made it possible.
Over the past seven years, our team at Avito has operated two self-implemented service meshes before completing a two-year migration to Istio. Today, our service mesh supports over 3,000 services across dozens of Kubernetes clusters, processing millions of requests per second. This solution is maintained by a single team, while more than 1,500 developers seamlessly work without needing to understand the internals of the service mesh or write Kubernetes manifests. This talk will delve into:
- Our two-year journey migrating from one service mesh to another.
- Challenges in organizing ingress communication rules.
- Accelerating the adoption of Mutual TLS (mTLS) and service authorization.
- Testing changes within a service mesh.
- Should developers be aware of the service mesh? How we’ve made it transparent to them.
- The new possibilities and advantages unlocked by adopting a service mesh.
- What can you expect from a service mesh at scale, and how can you choose the right one?
Igor works at Avito, the largest classified advertisement website (MAU over 50 million).
A Platform Engineer who is strongly interested in Kubernetes, observability tools, service meshes, and other modern cloud-native approaches.
Before working as a software engineer, Igor studied distributed systems and participated in ACM competitions.
Building an event-driven system is the easy part. You build producers that produce messages and consumers that consume messages, and you leverage managed services as the message channels between your systems. But what does this mean for your operations? The things that keep your systems online, your users delighted, and your pager quiet at 3 am.
In this session, you'll gain practical knowledge you can apply when building and operating event-driven systems. You'll learn how to test systems that prioritize asynchronous communication, evolve them over time, and, most importantly, observe them and recover from failures when things go wrong. This session will walk through the software development lifecycle for an actual application, and it includes the theory behind the different practices and practical things you can take away and implement in your architectures.
James Eastham is a Serverless Developer Advocate at Datadog. He has over 10 years of experience in software, at all layers of the application stack.
He has worked in front-line support, database administration, and backend development, with everything from startups to some of the biggest companies in the world, architecting systems using AWS technologies.
James produces content on YouTube focused on building applications with serverless technologies using .NET, Java & Rust.
In this talk, I present a novel meta-operating-system approach to the cloud continuum, showcasing the NebulOuS project vision and the first results that enable cloud continuum ops.
NebulOuS makes substantial research contributions in the realm of cloud continuum brokerage by introducing methods and tools for enabling secure and optimal application provisioning and reconfiguration over the cloud continuum.
The NebulOuS project develops a platform that seamlessly exploits edge nodes in conjunction with multi-cloud resources to cope with the requirements posed by low-latency applications. We call this kind of platform a Meta Operating System (MOS), as it works as the layer above traditional operating systems.
The NebulOuS software is published on GitHub at https://github.com/eu-nebulous under an open-source licence: the Mozilla Public License 2.0. It extensively uses other open-source software published under permissive licences such as the Apache License 2.0. Most notably, the core of deployment relies on Kubernetes, KubeVela and Knative. The solution includes the following directions of work:
- Modelling methods and tools for describing the cloud continuum, application requirements, and data streams, and for assuring the QoS of the provisioned services.
- Efficient comparison of available offerings, using appropriate multi-criteria decision-making methods.
- Addressing the security aspects emerging in the cloud continuum.
- Conducting and monitoring smart-contract-based service level agreements.
Jan is a dedicated Java Developer and Cloud Engineer, currently advancing in his computer science degree. With over three years of involvement in the HORIZON 2020 programs, he has gained substantial experience in developing Java applications and managing cloud infrastructures. Janek's practical knowledge and hands-on approach to cloud technologies and software development make him a valuable asset. His work primarily focuses on implementing scalable cloud-native solutions and ensuring robust system performance. Janek is known for his ability to navigate complex technical challenges, a skill he continuously refines through his academic and project-based experiences.
We think of containers as providing isolation for our applications, however a major source of performance interference remains unaddressed, significantly degrading performance. Contention for CPU caches and memory bandwidth has been shown to increase tail response times by 4-13x and reduce compute efficiency by over 25% – even with per-application CPU and memory limits in place. With current telemetry, affected applications simply show high CPU utilization, leading operators to "throw more hardware at the problem", which is expensive and ineffective at mitigating the high response times.
In this talk, we'll cover three key areas:
1. Real-world triggers such as garbage collection and container image decompression.
2. How modern CPU features allow detecting interference and identifying noisy neighbors.
3. Practical approaches to mitigating these effects, including findings from Google's and Alibaba's production environments.
Finally, we'll provide a status report on our open source effort to measure memory interference and discuss future directions.
Jonathan Perry is a Texas-based maintainer of the OpenTelemetry eBPF-based network collector. He researched performance isolation in datacenter and cloud networks at MIT, then founded Flowmill, which developed an eBPF-based Network Performance Monitoring collector, now part of OpenTelemetry (the company was acquired by Splunk).
Taking Machine Learning to production: Cloud MLOps for speed and efficiency
I work with startups that innovate on their algorithms but hit a scaling wall as they succeed. I'll show how cloud platforms like Google Cloud, AWS, and Azure provide a full spectrum of MLOps services, and how to decide when to leverage each.
Joshua Fox advises tech startups and growth companies about the cloud: Google Cloud, AWS, and Azure. He also writes open source, publishes technical articles, and speaks to cloud engineers as a Google Developer Expert. Before that, he was a software architect in innovative technology companies in Israel for 20 years. He has a PhD from Harvard University and a BA in math from Brandeis.
Fancy a peek into the crystal ball for 2025's resilience planning? Join us as we unpack the valuable lessons gleaned from Amazon's best practices and customers' experiences in 2024. We'll explore the Chaos Engineering mechanisms AWS has developed to fortify your workload's resilience. Ready to turbocharge your AWS journey? This session is your ticket to staying ahead of the curve. Don't miss this opportunity to future-proof your operations – what will you discover?
Laura Thomson is a Principal Product Manager with AWS Reliability Services and has been with AWS for 7+ years. She has worked on a variety of EC2 components, including launch templates, AMIs, and the EC2 console (i.e., front end), and recently led an early-stage initiative for sustainability data exchange across supply chains. Currently her focus is on Fault Injection Service and supporting customers in testing and improving their applications' resilience. Laura obtained a B.S. in Mechanical Engineering from the University of Southern California and an MBA from Carnegie Mellon University.
Vladislav Nedosekin is a Principal Solutions Architect at Amazon Web Services with over 20 years of experience pioneering resilience engineering and chaos testing practices in regulated industries. Based in London, he specializes in helping financial services organizations build resilient cloud-native platforms using cutting-edge technologies. He is passionate about resilience and chaos engineering, and his current focus areas include implementing chaos engineering practices in serverless architectures and leveraging generative AI to enhance system resilience.
From Spot Ocean to Karpenter: adjoe's zero-downtime migration story. Learn how we switched autoscalers in production, the challenges we faced along the way, and why we built a custom controller to fix broken nodes.
Is it possible to seamlessly hand over your Kubernetes cluster from one node-autoscaler to another without causing downtime — and, more importantly, should you? In this talk, we’ll dive into how adjoe accomplished this transition, tackling challenges such as scaling test environments outside working hours, navigating the release of Karpenter V1 mid-migration, and even creating a custom controller to manage broken nodes. Join to learn about the obstacles we faced and the valuable benefits we gained along the way.
Marius is a dedicated DevOps Engineer at adjoe, Hamburg’s fast-growing adtech company. Since joining over three years ago, Marius has been on a mission to enhance the reliability and scalability of our backend, an event-driven microservice architecture written in Go and powered by Kubernetes. His passion for coding and Kubernetes doesn’t stop there: Marius also actively contributes to open-source projects, including CoreDNS and Karpenter, constantly pushing his skills and knowledge forward. When he’s not scaling systems or optimizing code, you’ll find him hosting board game nights, tackling bouldering walls, or setting off on adventures across Asia.
DevOps has many benefits for software engineering, but is rarely talked about outside of that context. In this talk we’ll explore why DevOps is not a purely technical endeavour, what it means to apply DevOps across the whole organisation, and how you can use these ideas to deliver change where you work.
DevOps is one of those terms whose meaning varies greatly depending on the background, job role, and lived experience of those being asked, but most people would agree that it is something that applies to the development and operation of software. But at some point, DevOps outgrew this simple coming together, and so we combined other functions to get terms like ‘DevSecOps’, ‘DataOps’, ‘MLOps’ and so on. All these terms are, however, too simplistic to capture the actual essence of DevOps and the ways of working that have developed around it. However, there is no reason that these principles should only be applied to the development and operation of digital products and services. They can actually be applied across organisations as a whole to create ‘The DevOps Organisation’. This talk explores how DevOps principles can be taken and applied outside of the context of software development to provide better outcomes, both internally, and externally, for an organisation.
Mark has over 15 years of experience in software engineering, working across the SDLC, starting out as a software engineer and later transitioning into DevOps and Cloud focused roles. He has been leading Communities of Practice for around 5 years and has worked with numerous technology stacks across a wide range of industries, including Healthcare, FinTech, and Logistics. He has worked as an AWS consultant to some of the biggest Financial and Insurance firms in the U.K. and is now the DevOps Practice Lead at RiverSafe. In this role he is responsible for leading and building a highly skilled team of DevOps experts, delivering expert services and exceptional outcomes for customers. Mark is especially passionate about serverless technology and sustainability.
The concept of “Shift Left” has long guided developers to address issues early in the software development lifecycle (SDLC), catching bugs before they reach production. But as modern software ecosystems become more complex—with microservices, serverless architectures, and global deployments—testing alone no longer suffices. It’s time to rethink Shift Left as more than just pre-release testing—it must evolve. Traditional testing strategies like Test-Driven Development (TDD) and Behavior-Driven Development (BDD) have been invaluable, ensuring code correctness and expected behavior. However, they miss critical aspects of real-world performance, scalability, and user experience. A passing test suite can’t guarantee an application won’t crash under peak loads or degrade due to subtle performance bottlenecks. Enter observability: the missing piece in the Shift Left equation. By integrating observability practices early in the SDLC, teams gain deep, actionable insights into how applications behave in production-like environments. Metrics, traces, and logs reveal performance slowdowns, scalability constraints, and user experience hiccups long before end users are affected.
Martin McLarnon is an engineer with almost 30 years of experience working in many different roles within IT. He has worked as a Network Manager, Software Engineer, Cloud Architect, and Engineering Lead, and as an engineering consultant before joining Coralogix.
In this talk, I’ll share how focusing on business metrics, not just technical ones, can transform Site Reliability Engineering. By tracking business-centric metrics, we identified issues early and resolved them before they significantly impacted users or revenue.
Real-World Cases from Experience
SREs shouldn’t focus only on infrastructure and tech metrics. Sometimes, tracking core business metrics and spotting anomalies can uncover critical issues—often more effectively than monitoring millions of technical metrics.
Attendees will leave with practical insights and strategies to incorporate business-driven monitoring into their workflows, aligning technical operations with business success.
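As a minimal illustration of the idea (a toy example of my own, not the speaker's implementation), even a simple rolling baseline over a business metric such as completed orders per minute can surface an incident before infrastructure dashboards do.

```python
# Toy anomaly check on a business metric; names and thresholds are illustrative.
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag the latest value if it sits more than z_threshold standard
    deviations away from the recent baseline."""
    if len(history) < 10:
        return False  # not enough data for a meaningful baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

orders_per_minute = [120, 118, 125, 119, 121, 123, 117, 122, 120, 124]
print(is_anomalous(orders_per_minute, latest=60))   # True: orders dropped sharply
print(is_anomalous(orders_per_minute, latest=121))  # False: within the normal range
```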
Mike Rykmas is a skilled software engineer with years of expertise in database administration, cloud computing, and IT infrastructure. Thanks to his strong background in data management and performance optimization, Mike has successfully led and implemented scalable solutions, managing petabytes of data to meet diverse business needs.
This talk covers a topic that's universal across any team, company, and industry that deals with technology: Team Composition.
With this talk, I bring relevant data and proven sources to the discussion to explain what the key concepts are and why they matter so much for the outcomes delivered by any team that deals with software in any way.
During the talk, my main goals are to:
- Explain the key concepts in an accessible and interesting way;
- Use examples to demonstrate the concepts in practice;
- Share valuable sources along the ride (books, studies, etc.);
- Finish with actionable points for both engineers and team leaders/managers/etc.
Throughout my 12-year tech career, I've worked in production support, DevOps, and now leadership. I've worked for companies like IBM, Luxoft, and now Pegasystems.
I lived and worked most of my life in South America, and I've now been living in Poland for some years.
These days, my focus is on Cloud, Observability, DevOps and, mostly, leadership. I also hold a few certifications, including Kubernetes CKA, AWS, GCP, Scrum, and Pega Architecture.
When I'm not catching up with my Steam games backlog, I love participating in conferences, hackathons, learning new things and meeting new people.
In the cloud-native space, there is a plethora of tools available for observing Kubernetes applications and infrastructure. However, the choices often involve either opting for service meshes that increase architectural complexity or selecting tools with exorbitant costs. What if there was a one-stop solution that covers monitoring, tracing, and profiling seamlessly at zero cost and with minimal changes?
Join Prerit in this session as he takes you on a journey with Pixie, a CNCF sandbox project offering comprehensive observability of your cluster and its workloads. Pixie enables you to monitor cluster resources and application traffic, and delve into detailed views like pod state and flame graphs. Remarkably, Pixie achieves all this without introducing any overhead, using eBPF to automatically collect telemetry data, including full-body requests and application profiles. Explore Pixie's capabilities in monitoring network activity and infrastructure health with flame graphs. Don't miss this session for a deep dive into Pixie's observability features and how it simplifies the monitoring landscape for Kubernetes applications, all followed by a hands-on demo with a live application.
Prerit is working as a Software Architect, directing his expertise towards harnessing Cloud Native Technologies to design resilient architectures that can seamlessly scale in the future, all while prioritizing technical cost, security, availability and end-user experience. As the driving force behind the YouTube channel 'Tech with Prerit' and KubeCloud, an umbrella company with multiple products in the Cloud-Native Space, he envisions creating an ecosystem focused on Cloud Native Technologies.
July 18th 2020: Max Verstappen qualifies 7th for the Hungarian Grand Prix. With Red Bull fighting Mercedes for the Constructors’ Championship and Max fighting Lewis Hamilton and Valtteri Bottas for the Drivers’ Championship, this wasn’t his best qualifying session.
July 19th 2020: during the formation lap, Max crashed his car, damaging his front wing and suspension. In the next 22 minutes, Red Bull mechanics performed an absolute miracle.
How did they do it? How did they go from a crashed car to giving Max a chance to fight for a podium in under 23 minutes? And what can we learn about incident management from one of the most demanding engineering disciplines in the world?
Principal Engineer, SRE at FanDuel/Blip.pt. MSc in Computer Science from the University of Porto. CK{AD, A, S} from the Cloud Native Computing Foundation (CNCF) | Linux Foundation. {Terraform, Consul, Vault} Associate from HashiCorp. Working daily to build high-performance, reliable, and scalable systems. DevOps Porto meetup co-organizer and DevOpsDays Portugal co-organizer. A strong believer in culture and teamwork. Passionate about open source, a martial arts amateur, and a metal lover.
Parenthood often arrives with little time to prepare. Our idea of 'good' parenting usually involves emulating others, whilst hoping we don’t do any permanent damage.
Stepping into management is remarkably similar: we emulate others, but it never feels quite right.
Fortunately, there’s a better way.
Simon Copsey is a change consultant on a mission to help organisations understand and unwind the complex, cross-functional obstacles that get in the way of their staff and reduce their ability to serve their customers.
Simon’s career has taken him from being a developer in the trenches to helping various organisations take pragmatic steps from a place of chaos and paralysis, to one where it becomes a little easier to see the wood for the trees.
This session will guide you through the foundational concepts, setup, and best practices to effectively implement OpenTelemetry in your environment. We’ll explore key components like instrumentation libraries and collector configurations, showcasing practical examples and integrations.
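For readers who want a concrete starting point before the session, here is a minimal, hedged example of manual instrumentation with the OpenTelemetry Python SDK exporting spans over OTLP; the service name, span names, and collector endpoint are placeholders.

```python
# Minimal OpenTelemetry tracing setup; endpoint and names are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify the service and ship spans to a (hypothetical) local collector.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout.instrumentation")

with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")
    # ... business logic would be traced here ...
```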
Steve Flanders is a Senior Director of Engineering at Splunk, a Cisco company, responsible for the Observability Platform team, which includes contributions to the OpenTelemetry project. He was previously the Head of Product at Omnition (acquired by Splunk). Prior to Omnition, he led log analytics and data collection at VMware. Steve is writing a book called Mastering OpenTelemetry and Observability: Enhancing Application and Infrastructure Performance and Avoiding Outages, which will be available this fall.
When a legacy bank's fintech venture faced operational paralysis, one "simple" question triggered a revolution. Discover how GitOps transformed a tech stack and an entire organisational DNA. This isn't the story of another tool – it's a blueprint for turning chaos into your competitive advantage.
Every transformation story has a trigger point. For one of Britain's oldest banks, it wasn't a system failure or a security breach – it was a single question nobody could answer. This seemingly innocent inquiry exposed the fragility of their fintech venture's operational foundation and sparked a journey that would reshape their entire organisation. This isn't just another technical implementation story. It's a narrative about how GitOps principles catalysed a cultural renaissance within a centuries-old institution. You'll witness how a team transformed from firefighters to innovators and how version control evolved from a code repository to a single source of truth for the entire organisation. Whether you're drowning in operational complexity or simply seeking to future-proof your organisation, this talk will arm you with actionable insights to lead your own transformation. You'll leave understanding how GitOps can be more than a methodology – it can be the foundation of your competitive advantage in the digital age. Leave inspired, equipped, and ready to transform your own chaos into clarity.
Steve Wade founded The Cloud Native Club, a global community for cloud-native enthusiasts. He is also a maintainer of the Flux Terraform Provider. As an experienced conference speaker, independent cloud-native consultant, and trainer, Steve shares his expertise worldwide. He has held platform leadership roles across various industries, including real estate, gaming, fintech, and the UK Parliament. With a BSc in Computer Science, Steve is passionate about cloud-native software development and distributed computing.
In a world increasingly defined by complex technology and rapid innovation, it is easy to focus entirely on the technical aspects of success. Yet, the most advanced cloud infrastructure, the most cutting-edge tools, and the most sophisticated algorithms are only as effective as the people behind them. The truth is, it is not just the technical expertise that drives successful cloud initiatives; it is the ability to communicate, collaborate, and build trust that truly sets teams apart. In this talk, I will explore why the human side of the cloud is often the missing piece in achieving transformational success.
The cloud is not just a technical platform; it is a shared space where people, ideas, and goals converge. Whether you are working on a multi-cloud strategy or deploying serverless applications, success hinges on the ability of individuals and teams to work together seamlessly. It is about understanding diverse perspectives, breaking down silos, and fostering an environment where innovation thrives. Yet, too often, organizations overlook the importance of soft skills in favor of purely technical competencies. The result? Misaligned priorities, communication breakdowns, and missed opportunities to fully leverage the power of the cloud.
Soft skills such as empathy, active listening, and conflict resolution are not just “nice to have” in cloud projects—they are essential. Imagine a scenario where a development team and an operations team have differing priorities during a cloud migration. Without clear communication and a shared understanding, this disconnect can lead to delays, inefficiencies, or worse, failure to deliver. On the other hand, when team members can articulate their needs, understand each other's challenges, and find common ground, the result is a project that runs more smoothly and achieves better outcomes.
But soft skills extend beyond team dynamics. They are also critical when engaging with stakeholders. Whether you are pitching a new cloud initiative to executives or addressing concerns from non-technical colleagues, your ability to frame ideas in ways that resonate is crucial. It is not enough to understand the technical value of what you are building; you must also communicate its impact in human terms. The most innovative cloud solutions will fall flat if their value is not clearly articulated to those who depend on them.
This talk will also explore how soft skills can enhance leadership in cloud-focused environments. In fast-paced, high-stakes projects, effective leaders need more than technical know-how; they need to inspire, motivate, and guide their teams through uncertainty. Leadership is about creating a culture where people feel empowered to take risks, share ideas, and work together to solve complex problems. It is about fostering a sense of purpose that transcends technical goals and aligns with broader organizational values.
As organizations increasingly adopt cloud technologies, the demand for professionals with strong soft skills is only growing. Employers are looking for individuals who can not only manage infrastructure and write code but also lead conversations, solve conflicts, and drive collaboration across departments. These are the qualities that turn technical experts into invaluable team players and transformational leaders.
Soft skills also play a crucial role in bridging the gap between technical teams and end-users. Cloud technologies often bring about change, which can be met with resistance or hesitation. Professionals who can communicate with empathy and clarity help reduce friction, ensuring smoother transitions and greater adoption of cloud solutions. They can demystify the technical aspects, making the benefits of the cloud accessible and relatable to all.
This talk is for anyone who wants to unlock the full potential of their cloud projects by harnessing the power of human connection. Whether you are a cloud architect, a project manager, or an IT professional, you will leave with actionable insights on how to cultivate the soft skills that matter most. From practical tips on improving communication to strategies for building trust within teams, this session will equip you to succeed in the increasingly collaborative world of the cloud.
The human side of the cloud is not a secondary concern—it is the foundation upon which all successful cloud initiatives are built. By prioritizing soft skills, you can drive innovation, improve outcomes, and create a lasting impact in your organization. Join me for this talk and discover how the key to mastering the cloud lies not just in the technology you use but in the people who bring it to life.
Victor Onyenagubom is a dynamic Lecturer in Cybersecurity at Teesside University London, where he develops and delivers cybersecurity modules to international master’s students, fostering an engaging and inclusive learning environment.
Victor has attended the Pre-Doctoral school organized by the Max Planck Institute of Software Systems in Germany, in partnership with the University of Maryland and Cornell University. He also volunteers as a Lead IT Trainer at CodeYourFuture, where he empowers refugees and asylum seekers with digital and cybersecurity skills.
Victor is a frequent speaker at high-profile tech events, where he shares his expertise on cybersecurity and other emerging technologies. His combined expertise in cybersecurity, research, and community service positions him as a key figure in both academic and social spheres, dedicated to equipping others with the tools to navigate and secure the digital world.
As an IT Security Analyst at Pennon Group Plc, Victor performed security audits, designed cybersecurity training materials, and conducted vulnerability assessments, significantly enhancing the organization's security posture. In his role as a Research Assistant at the Nuffield Trust, he extracted, cleaned, and analyzed healthcare data, developed dashboards, and presented data-driven insights to improve NHS processes.
In our upcoming presentation, we'll explore a cutting-edge architectural solution for real-time SMS and email notifications, particularly geared towards responding to earthquake events. This system is designed to handle rapid data transmission, listening for event changes every second, making it ideal for real-time critical alert scenarios. Central to our discussion will be the integration of Lambda functions and Confluent Kafka, coupled with advanced multithreading techniques and DynamoDB lock strategies. A focal point of our presentation will be addressing the challenges and innovative solutions involved in integrating Confluent Kafka with Lambda functions to enable serverless operation of both producers and consumers. This is a key element in ensuring the quick and efficient distribution of notifications through parallel methods. Additionally, we will delve into the implementation of an automated scaling mechanism, which is vital for optimising the performance of the Serverless Notification ecosystem. Our aim is to provide a comprehensive insight into how these technologies can be effectively combined to develop a robust and efficient system, capable of delivering critical real-time alerts for situations like earthquake occurrences, ultimately playing a crucial role in saving human lives.
Vlad Onetiu, a DevSecOps and Software Automation Engineer from Cluj-Napoca, Romania, is renowned for his expertise in cloud technology, cybersecurity, and software automation. Since embarking on his career in 2018, he has been instrumental in conducting security research for Romania's major banks, significantly bolstering their cybersecurity measures. Vlad has also contributed to the field through his research papers on malware and phishing, shedding light on these critical cyber threats. His proficiency in employing cloud-based solutions for system automation, combined with his skillful handling of CI/CD processes and cloud architecture, reflects his commitment to fostering secure and resilient digital environments. Known for his passion for technology and relentless innovation, Vlad stands out as a leading figure in cybersecurity, continuously exploring and implementing cutting-edge strategies to address the challenges of evolving cyber threats.
Dive into the world of serverless and explore common, costly mistakes and learn actionable tips for cutting down waste and reducing your AWS bill. Whether you're looking to cut down on CloudWatch costs or improve cost-efficiency for your serverless application, we've got some helpful tips for you.
Helpful tips to cut down on waste and reduce AWS cost, including how to keep CloudWatch costs in check, how to implement caching, how to pick the right services for your workload, how to right-size Lambda functions, and so on.
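As a rough, back-of-the-envelope illustration of why right-sizing matters (my own sketch, not taken from the talk; the prices are example figures, so check current AWS pricing), Lambda compute cost scales with allocated memory multiplied by duration, which means more memory can cost the same or less if it cuts duration enough.

```python
# Back-of-the-envelope Lambda cost comparison. Prices are illustrative
# placeholders; always check current AWS pricing for your region.
PRICE_PER_GB_SECOND = 0.0000166667    # example on-demand price
PRICE_PER_REQUEST = 0.20 / 1_000_000  # example per-invocation price

def monthly_cost(memory_mb: int, avg_duration_ms: float, invocations: int) -> float:
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000) * invocations
    return gb_seconds * PRICE_PER_GB_SECOND + invocations * PRICE_PER_REQUEST

# Hypothetical function: doubling memory halves the duration, so the compute
# cost stays flat while latency improves; right-sizing means finding the
# point where extra memory stops paying for itself.
print(monthly_cost(memory_mb=512, avg_duration_ms=800, invocations=10_000_000))
print(monthly_cost(memory_mb=1024, avg_duration_ms=400, invocations=10_000_000))
```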
Yan is an experienced engineer who has run production workloads at scale in AWS for 10 years. He has been an architect and principal engineer in a variety of industries ranging from banking, e-commerce, and sports streaming to mobile gaming. He has worked extensively with AWS Lambda in production and has been helping companies around the world adopt AWS and serverless as a consultant. Yan is an AWS Serverless Hero and a regular speaker at user groups and conferences internationally, and he is also the author of Production-Ready Serverless and co-author of Serverless Architectures on AWS 2nd Edition, both by Manning. Yan also keeps an active blog at https://theburningmonk.com.
Monitoring the behavior of a system is essential to ensuring its long-term effectiveness. However, managing an end-to-end observability stack can feel like stepping into quicksand: without a clear plan, you risk sinking deeper into system complexities.
In this talk, we’ll explore how combining two worlds—developer platforms and observability—can help tackle the feeling of being off the beaten cloud native path. We’ll discuss how to build paved paths, ensuring that adopting new developer tooling feels as seamless as possible. Further, we’ll show how to avoid getting lost in the sea of telemetry data generated by our systems. Implementing the right strategies and centralizing data on a platform ensures both developers and SREs stay on top of things. Practical examples are used to map out creating your very own Internal Developer Platform (IDP) with observability integrated from day 1.
Eric is Chronosphere's Director of Evangelism. He's renowned in the development community as a speaker, lecturer, author, baseball expert, and CNCF Ambassador. His current role allows him to help the world understand the challenges it faces with observability. He brings a unique perspective to the stage with a professional life dedicated to sharing his deep expertise of open source technologies and organizations. More at https://www.schabell.org.
Graziano is a Developer Relations Engineer at Mia-Platform, specializing in content creation, technical evangelism, and bridging communication between users and R&D teams. He is also a recognized Green Software Champion, actively contributing to sustainable software practices. With a background in developing distributed systems and product management, Casto is an active speaker at industry events and a published author on topics like cloud-native technologies. He has previous experience as a Technical Product Owner and Software Engineer, focusing on agile practices and product management.
Imagine a self-healing system that handles surprises, letting you sleep peacefully. If that sounds appealing, chaos engineering could be the answer. Trusted by Netflix, LinkedIn, Google, and Facebook, it's key for business resilience. In this session, we'll explore its history, learn how to apply its principles to stress-test applications, and review tools for fault injection in real-world scenarios.
Agnieszka is a dedicated engineer specializing in cloud solutions, with expertise in cutting-edge technologies such as Kubernetes and Terraform. She also brings valuable insights from her experience in testing activities. Agnieszka is passionate about enhancing people’s wellbeing and promoting a healthy work-life balance. Outside of her professional endeavors, she is a sports freak, board game lover and owner of two adorable cats.
Observability is the ability to measure the current state of a system. Backend engineers are becoming more familiar with the primary signals and technologies, such as OpenTelemetry, that can be used to instrument applications and diagnose issues. Yet, in the frontend world, we're behind the curve.
Join me as I dive into the tools and techniques we can use to instrument, monitor, and diagnose issues in our production frontend applications. I'll cover RUM agents, the metrics and traces they provide, and how to combine them with backend tracing for a holistic picture. We'll dive into the state of client instrumentation in OpenTelemetry and what you can do now. Finally, we'll show how Synthetic Monitoring and alerting in observability platforms can help us be alerted to issues impacting users in the UIs we build and maintain.
Carly is a Developer Advocate and Manager at Elastic, based in London, UK. Before joining Elastic in 2022, she spent over 10 years working as a technologist at a large investment bank, specialising in Frontend Web development and agility. She is an agile evangelist, UI enthusiast, and regular blogger.
She enjoys cooking, photography, drinking tea, and chasing after her young son in her spare time.