Crossrail Place,
Canary Wharf,
E14 5AR, London, UK
Level -2
Tube access
Jubilee, Elizabeth and DLR lines: Canary Wharf station
Today’s observability platforms are often vertically integrated—binding data storage, query, and visualization layers into a single stack. This tight coupling drives up costs, makes integrations painful, and slows teams down. But it doesn’t have to be this way. In this talk, we’ll explore how SRE teams can benefit from a more modular approach to observability—one inspired by the evolution of Business Intelligence. Just as BI stacks evolved to separate ETL, data warehouses, and dashboards, observability stacks can be designed around clear boundaries: interoperable tools, technology-neutral query layers, and plug-and-play storage. You’ll learn why decoupled observability architecture is essential for cost control, agility, and tool flexibility—and how to move toward a stack that meets the real-world needs of today’s SRE teams.
Peter Marshall is a technology leader and community builder with a background in developer relations, data architecture, and digital transformation. As Director of Developer Relations at Imply, he leads programs that grow and engage the global Apache Druid community through education, support, and events. With experience across startups, enterprises, and the public sector, Peter brings a blend of technical expertise and strategic vision to help organizations connect with developers and drive impact through open source.
Let's take a look at some of the basic NGINX metrics to monitor and what they indicate. We start with the application layer and move down through process, server, hosting provider, external services, and user activity. With these metrics, you get coverage for active and incipient problems with NGINX
Currently providing technical evangelism for NGINX, Dave works with DevOps, developers and architects to understand the advantages of modern microservice architectures and orchestration to solve large-scale distributed systems challenges, especially with open source and its innovation. Dave has been a champion for open systems and open source from the early days of Linux, through open distributed file systems like XFS, GFS, and GlusterFS to today's world of clouds and containers. He often speaks on topics such as the real-world issues associated with emerging software architectures and practices, on open source software and on creating new technology companies. Dave has spoken on technical topics like distributed request tracing, modern monitoring practices, open sources projects from both corporations and foundation views, and on how open source innovations powers todays world. Dave was named as one of the top ten pioneers in open source by Computer Business Review, having cut his teeth on Linux and compilers before the phrase "open source" was coined. Well versed in trivia, he won a Golden Penguin in 2002. When he's not talking, you can find him hiking with his trusty camera, trying to keep up with his wife.
The shift from traditional testing to chaos engineering marks a revolution in building reliable systems. This session unpacks the concept’s history and role in ensuring system resilience. We’ll look at some of the approaches to chaos engineering, before looking at Chaos Engineering as a Service with Amazon’s Fault Injection Service. You’ll leave with insights into crafting and running Fault Injection Service experiment templates, plus a live demo looking at how we can now test serverless code in AWS with Chaos engineering.
Starting out in a very traditional background in the data-centres of the fabled M4 corridor, Simon eventually realised it was time to give up on the problems of manually babysitting servers, racks and UPS's and migrated to the Cloud (and in the process, the Scottish Highlands) and now works to enable clients to transition their workloads, processes and swag requirements to the same route. Simon is a member of the AWS Community builder program, and these days enjoys coaching and mentoring as much getting hands-on with code, automation and head-scratching.
As AI systems scale in complexity and impact, observability becomes essential—not just for performance monitoring, but for ensuring quality, safety, and trust. In Azure AI Foundry, evaluators are the backbone of this observability layer, enabling continuous, automated assessments across the AI lifecycle. This session provides a deep technical dive into the evaluator framework in Azure AI Foundry, covering: 🔍 Evaluator Categories and Use Cases RAG Evaluators (Retrieval-Augmented Generation) Retrieval: Measures relevance of retrieved documents. Groundedness / Groundedness Pro: Assesses factual alignment with context. Relevance & Completeness: Evaluates how well the response answers the query and covers all necessary information. Agent Evaluators Intent Resolution: Accuracy in understanding user intent. Task Adherence: Measures how well the agent completes assigned tasks. Tool Call Accuracy: Validates correct tool invocation and usage. General Purpose Evaluators Fluency: Natural language quality and readability. Coherence: Logical consistency and flow. QA: Comprehensive quality checks for question-answering tasks. 📊 Operational Observability Real-time metrics: latency, token usage, error rates. Visualizing evaluator outputs in the Foundry Observability Dashboard. Connecting evaluations to traces for root cause analysis. By the end of this session, attendees will understand how to design, implement, and scale evaluator-driven observability in AI Foundry to build robust, transparent, and production-grade AI systems.
Senior Azure Cloud & AI Consultant at Microsoft
Security isn’t just about prevention—it’s also about investigation. This talk dives into the forensic side of container security. We’ll explore how to analyze container images using Syft (SBOM generation), Grype (vulnerability scanning), and Trivy (multi-purpose scanner). Learn how to detect hidden risks, trace vulnerabilities back to their source, and build a repeatable forensic workflow that strengthens your DevSecOps pipeline. This isn’t just static analysis—it’s detective work for containers.
Hello, I'm Mert Polat. I'm currently working as an Cloud and Platform Engineer at Sufle. I started my career as a Jr. DevOps Engineer at Zip Turkey, where I gained extensive experience in Infrastructure and DevOps domains. At Duzce MEKATEK, I worked in the software team for an autonomous vehicle project and took on a leadership role. Additionally, I honed my skills in technologies like Docker, Kubernetes, and Ansible during a DevOps internship at Formica. I am passionate about technology and knowledge sharing, so I write various articles for @DevopsTurkiye and @Bulut Bilişimciler publications on Medium. I graduated from Duzce University with a degree in Computer Programming, and I'm currently pursuing a bachelor's degree in Management Information Systems at Anadolu University.
In the ever evolving AI landscape, organizations are faced with the choice between open-source versus closed-source models. While many developers find it easier to get started in the closed source ecosystem, they quickly realize it is ultimately more expensive, inefficient, and doesn’t have the security and controls required for their applications. Alternatively, open source models have quickly caught up to their closed counterparts and now deliver a cheaper and just as accurate solution that provides organizations with the security they need to run their operations. In addition, the pace of the open source community is iterating at speeds that are beyond belief and not only improving the accuracy of these models, but making them run even faster, especially on hardware that is better suited for AI beyond the GPU. With new open source models dropping every week and an active community fine-tuning even better versions daily, open source is making AI a commodity, with a wide range of cloud services for developers to choose from that make it even easier for them to use. Join Amit Kushwaha, Director of AI Engineering, SambaNova, as he breaks down for attendees the advantages of open-source models, why they are critical for fast product iteration, and how organizations can use them to tap into the best minds globally, accelerate their pace of innovation, and position themselves to shape industry standards while continuously advancing enterprise-grade solutions.
Amit Kushwaha is the Director of AI Engineering at SambaNova Systems, leading the development and implementation of AI solutions that leverage SambaNova's differentiated hardware. Previously, as Principal Data Scientist at ExxonMobil, he led the organization’s digital transformation efforts, driving the strategy and execution of a multi-million dollar AI/ML portfolio that fostered innovation and enhanced operational efficiency at scale. Passionate about harnessing technology to solve complex challenges, Amit specializes in developing innovative, business-focused solutions at the intersection of artificial intelligence, high-performance computing, and computer simulations. He holds a Ph.D. in Engineering from Stanford University.
Air-gapped systems are seen as the pinnacle of security, but are they truly untouchable? This talk explores real-world breaches—from Stuxnet to electromagnetic attacks—highlighting modern threats like supply chain risks and social engineering. Attendees will learn practical strategies to strengthen air-gapped environments through physical security, procedural controls, and advanced detection methods.
Sean Behan is a Senior Offensive Security Engineer at Oracle with a decade of experience across top-tier organizations including Google, AWS, the NSA, and the U.S. Navy. He specializes in offensive security, red teaming, and cloud security, with deep technical expertise in exploit development and adversary simulation. Sean holds OSCP and CISSP certifications and is passionate about continuous learning, regularly participating in CTFs and advanced cybersecurity research.
DevSecOps success isn’t just about picking the right tools—it’s about driving change across people, process, and technology. In this talk, we’ll go beyond the buzzwords to explore what it truly takes to embed security into modern software delivery. We’ll start by demystifying the DevSecOps tooling landscape—covering the full acronym bingo (SAST, DAST, SCA, CSPM, etc.), highlighting major vendors, and sharing a practical framework for choosing the right tools for your environment. But that’s just the beginning. We’ll explore SDLC maturity models and how to create a DevSecOps roadmap that aligns with your organization's goals. You’ll also learn how to approach organizational change with principles borrowed from change management—because tooling alone won’t get you there. Finally, we’ll focus on the human side: the social skills you need to overcome resistance, influence stakeholders, and craft internal communications that get attention and drive adoption. Whether you're just getting started or trying to take your DevSecOps efforts to the next level, this session equips you with the full advocate toolkit—technical insight, strategic perspective, and the soft skills to make it all stick.
Seb Coles is a seasoned security leader and engineer with a passion for making DevSecOps work in the real world—not just in theory. He’s held senior security roles at Clarks, ClearBank, LRQA, and Seccl, and worked as a senior consultant at Veracode, helping organizations of all sizes embed security into their software delivery. At ClearBank, Seb built and led the security team that earned a Highly Commended recognition at the 2023 Computing Awards for Best DevSecOps Implementation. His approach goes beyond tools—focusing on strategy, organizational change, and the often-overlooked human side of DevSecOps. Originally a software engineer, Seb has learned the hard way how crucial influence, communication, and cultural alignment are to making security stick. He’s stepped on plenty of rakes so others don’t have to—and now shares those lessons through talks, consulting, and national conference appearances. Seb brings a candid, practical perspective to DevSecOps, grounded in hands-on experience and a deep understanding of both the tech and the people behind it.
In this hands-on session, I'll demonstrate how AI can revolutionize your Terraform testing workflow. Watch as I transform a complex terraform module with zero test coverage into a battle-hardened, production-ready component in under 30 minutes. You'll learn how to: Leverage AI to quickly generate comprehensive test cases across unit, mock, and integration tests Build test suites that validate the full spectrum of your Terraform modules from configuration to outputs Implement advanced testing patterns with mock providers to simulate AWS resources Create maintainable test coverage reports that highlight gaps and provide actionable insights Adopt a test-driven development approach that scales with your infrastructure We'll showcase real examples, demonstrating how this approach has reduced our testing time by 90% while increasing code quality and developer confidence. Whether you're new to Terraform testing or looking to enhance your existing test suites, you'll walk away with actionable techniques to implement immediately.
Dharani Sowndharya is a Lead DevOps Engineer at Equal Experts, with nearly a decade of experience in the tech industry. For the past eight years, she has specialized in Cloud and DevOps, working on large-scale cloud migrations, site reliability engineering (SRE), and designing self service data platforms using Kubernetes on AWS. Her strong foundation in cloud infrastructure and DevOps practices has made her a trusted expert in delivering scalable, reliable solutions. Dharani is also passionate about mentoring and sharing knowledge. She frequently speaks at tech conferences and actively supports the growth of emerging engineers in the community. Outside of work, she enjoys playing foosball and has an enviable collection of board games, which she loves to play with friends and family.
Many teams rely on staging environments to catch bugs before production — but what if that’s actually slowing you down and giving false confidence? In this talk, I’ll share how we eliminated our staging environment and built a pipeline that delivers faster, safer releases. We’ll cover the practical tooling we used, how feature flags and canary releases play a critical role, and the cultural mindset shift that made it work. You’ll walk away with concrete ideas for reducing deployment risk without relying on brittle pre-prod setups.
I got my first full-time developer job in 1999 and have been in software ever since. Over the years, I’ve worn many hats — developer, engineering manager, product manager — and I love building great products and teams. I’m currently CTO at Ivelum, where we create custom software for startups and enterprises, and I also work on Teamplify, a team management suite for engineering teams.
We’re in the midst of an AI revolution—and APIs are its unsung heroes. While LLMs and AI agents grab headlines, it's APIs that power their ability. Behind every AI-generated insight, recommendation, or automated task is an API call connecting the model to the tools, services, and data it needs to get the job done. As AI systems evolve from passive assistants to autonomous agents capable of decision-making and execution, APIs have become the essential infrastructure enabling this transformation. They are no longer just integration tools—they are the action layer of AI. In this talk, we’ll explore how APIs are shaping the future of intelligent automation. Using real-world examples from across industries, we’ll examine how companies are leveraging APIs to orchestrate multi-step workflows, access real-time data, and drive operational efficiency with AI. Organizations with robust, scalable, and discoverable API ecosystems will not only keep up—they’ll lead. If AI is the recipe, APIs are the ingredients. It's time we start treating them that way. What You’ll Learn: - The shift from human-first to machine-first consumption patterns in API design - Emerging standards that are streamlining AI-API interactions - Strategies to future-proof your API ecosystem for the intelligent systems of tomorrow
Pooja Mistry is a passionate Senior Developer Advocate at Postman, where she champions the power of APIs by creating technical content, leading workshops, and building developer communities around the world. With a strong background in cloud, automation, and open-source technologies, she brings over a decade of hands-on experience in software engineering and advocacy. Before joining Postman, Pooja held several impactful roles at IBM, including leading developer outreach for IBM Cloud and contributing to healthcare innovation at IBM Watson Health. She's also been an active community leader, having organized Startup Weekend events to empower entrepreneurs and technologists across Boston.
If you ask 10 DBAs in a conference about putting data on Kubernetes, most will say that’s a bad idea – Divine. Divine is an advocate for data on Kubernetes, and a Data on Kubernetes Ambassador. Shivadeep, on the other hand, as an Oracle DBA, traditionally believed that containers and Kubernetes weren't suitable for persistent workloads like databases. And that’s rightly so because it was originally designed to be stateless. But he didn’t know a lot had changed. With no prior experience in containers, Kubernetes, or NodeJS, Shivadeep embarked on a challenging project to build a scalable Pacman game. This journey not only helped him learn new technologies but also changed his perspective on using Kubernetes for database workloads. Shivadeep discovered why and how Kubernetes is increasingly becoming a popular choice for running databases as well. Shivadeep will start this talk sharing his story of learning and exploration, highlighting: - How he containerized and deployed the PACMAN Game application on Kubernetes - How he connected it to an Oracle DB on Container as well as on-prem Oracle database via CMAN To make the session even more engaging, Shivadeep will provide a hands-on opportunity for attendees to try out the Pacman game! Towards the end of the talk, Divine will share how DBAs can evolve to manage databases on Kubernetes.
Shivadeep Gundoju is a seasoned database administrator at ING Netherlands , who loves to automate everything around databases and its Infrastructure. He holds a Master's degree in Computing Systems & Infrastructure and excited to explore Oracle and other RDBMS on containers/Kubernetes.
With our traditional operations setup supporting monolithic systems – as the tech implementation scaled, there was a proportionate increase in costs to support and maintain the systems. This had been primarily due to segregation in organization setup to deliver business outcomes. While one part of the tech organization was focused on build, the other part was focused on maintain. With differing goal-post and motivation to deliver around collective business impact, the focus was divergent which led to undesirable customer experience and unsustainable costs. As the tech landscape became more distributed and decoupled, we saw this as an opportunity to introduce SRE adoption, where we managed to build an integrated setup with product teams and embedded SRE taking full accountability from build to maintain for the product even as the implementation scaled across multiple markets. While this setup has proved to be impactful over the last couple of months, it has been a continuous effort to tie Service Level Indicators and Objectives (SLI/SLO) to Business KPI's thereby showcasing direct impact to business. Will delve into ways organizations can successfully adopt SRE thereby building an integrated and impactful setup.
Mitul Jain Digital Transformation Leader | SRE & DevOps Expert | Global Technology Executive With over 23 years of global leadership experience, a recognized expert in Site Reliability Engineering (SRE), DevOps, and enterprise-scale automation. I have led transformative initiatives across industries, reshaping traditional IT operations into modern, business-aligned engineering practices. I have built and scaled global SRE organizations, integrated reliability engineering into digital value streams, and enabled Build-to-Operate models that drive measurable business outcomes—including zero P1 incidents during peak periods and significant cost optimizations. As a passionate advocate for innovation and community building, I founded SRE Community of Practice within the organization and has led strategic alliances with platform providers to enhance service capabilities and go-to-market strategies. I often speak on topics including digital transformation, reliability engineering, DevOps at scale, and the future of IT operations. His insights are shaped by hands-on experience across Consumer Goods, BFSI, Telecom, Manufacturing, and Travel & Hospitality.
In our world where everything is code, reliability extends beyond clean and reliable code running on the right infrastructure. It requires a robust sociotechnical system, the dynamic interplay between social and technical components. Our North Star is an engineering culture, built on shared beliefs, practices and behaviours that shape how we operate, solve problems, collaborate, innovate and continuously learn. But how do we achieve this engineering culture, where innovation and success are driven by collaboration, trust, autonomy, and passion? What is our code of conduct and how does it propel us forward? In this talk, we will take you through our 5-year journey in Reliability Advocacy. We started out as a group of 10 enthusiastic engineers from different platform and enablement teams who wanted to share knowledge of our reliability product offering and practices across the organization. We are now a household name, run our annual Reliability Event conference which are attended by 300+ engineers every year and our reliability trainings are part of the curriculum for all new joiners and we've guided hundreds of engineers on their path to defining SLIs and SLOs. In part, thanks to our reliability advocates, reliability is now one of our engineering pillars for years to come shaping our organization in the process, raising our general availability rating from 99.54% to 99,87% in the process. We will share what worked for us, what didn’t, and how we gradually embedded ourselves into our large, regulated, and risk-averse organization with over 15.000 engineers. We will reveal the code that helped us shape our SRE practices and make this transformation possible. In doing so, we will share a 5-step plan to start a reliability advocacy function within your organization.
Stephan Mousset is the Global Product Manager for the Performance & Resilience Engineering Platform at ING, where he also serves as Lead Reliability Advocate. With nearly two decades of experience at ING, Stephan has led major efforts in performance testing automation, SLI/SLO adoption, and reliability engineering at scale. He is an active voice in the DevOps and SRE communities — co-organizing DevOpsDays Amsterdam, Site Reliability Engineering NL, and ING’s internal Reliability Event. Stephan has shared his insights at multiple conferences, including: - SLOconf 2022: ING’s global SLO rollout and lessons learned - SRE NL Meetup at ING (2023): Building a platform for SLO-driven performance engineering - DevOpsDays Berlin 2024 (Ignite): "Unleashing DevOps Magic in Performance Testing & Analysis Automation" His talks focus on practical, platform-enabled approaches to making reliability and resilience part of everyday engineering — through automation, observability, and culture.
How do you build reliable systems where downtime is not just an inconvenience, but a threat to trust, income, and even safety? In this talk, Roosevelt Elias, founder of Payble, explores how to establish SRE principles in markets where infrastructure is unreliable, cloud access is intermittent, and talent pipelines are still emerging. Drawing from real-world experience building financial infrastructure in Africa, we’ll discuss culturally aware incident management, lightweight observability stacks, distributed troubleshooting with limited tooling, and what it means to bake resilience into the DNA of both product and process even when the odds are stacked against you. This talk is ideal for SREs, platform engineers, and product leaders building for emerging markets or aiming to design more resilient systems globally.
Roosevelt Elias is a visionary solutions architect, product strategist, technology entrepreneur, and founder of Payble, a next-generation product technology company focused on solving complex economic and digital inclusion challenges for micro and small businesses across Africa and globally. With over a decade of experience spanning product design, payments, and creative technology, Roosevelt has built companies at the intersection of design, technology, and social impact. including a successful exit in the print media space and a thriving live-streaming SaaS business that has powered events for global artists and thought leaders. A trained computer scientist with a background in IT Security, Roosevelt is passionately committed to creating tools that enable underserved businesses to thrive with the same capabilities as Fortune 500 enterprises, through intelligent systems, AI-driven insights, and user-first design. Under his leadership, Payble is redefining the role of financial technology, not just as a utility, but as an ecosystem that transforms local businesses into global players. He is a bold thinker, deeply rooted in service, faith, and sustainable impact, building Payble to be a company that will outlive generations.
Why Terragrunt Matters If you've worked with Terraform in production, you've likely encountered the pain of managing multiple environments, duplicate configuration files, and complex remote state setups. Terragrunt solves these common Infrastructure as Code challenges by providing a thin wrapper around Terraform that promotes DRY (Don't Repeat Yourself) principles and simplifies configuration management. What You'll Learn This talk will take you from Terragrunt zero to hero, covering: Foundation & Setup What Terragrunt is and why it exists - Key differences between Terraform and Terragrunt - Installation and initial configuration Pre-requisites: - Basic AWS knowledge - Basic Terraform knowledge
Ceyda serves as a Cloud and Platform Engineer at Sufle. She has previously worked as a Software Developer at different companies. She has experience particularly in technologies such as AWS, Kubernetes, and Python. She completed her Master's degree in Software Engineering at Boğaziçi University and graduated from Sabancı University with a Bachelor's degree in Computer Science and Engineering. She holds a HashiCorp Certified Terraform Associate certification and has been using Terraform and Terragrunt for big production workloads.
Sustainability is increasingly becoming a priority in the Information Technology sector, which has fueled the demand for energy-efficient solutions in all computing environments. Effective management of resources in Kubernetes environments requires proper monitoring and optimisation of power usage. Kepler (Kubernetes-based Efficient Power Level Exporter) solves this problem by offering a solid solution for energy monitoring at the pod level. Using software counters, tailored machine learning models, and the Cloud Native benchmark suite, Kepler provides accurate energy consumption estimations and detailed reports on power usage. Developers and system administrators can thus make informed choices towards environmentally friendly and energy-efficient Kubernetes operations.
Mayank Goyal is a Senior Site Reliability Engineer at Okta, specialising in maintaining Workflows systems' scalability and reliability. Before working at Okta, he honed his DevOps & Software engineering skills at Zoom and RedHat, where he grew deeply passionate about automation. While at RedHat, he was instrumental in automating and configuring the Release Pipeline in RHELWF. Apart from his professional obligations, Mayank is a passionate supporter of open-source technologies and likes being on the cutting edge. He loves connecting with the technology community, exchanging thoughts, and learning from others.
Platform engineers work hard to build great tooling and automations for developers, but often struggle to get feature teams to adopt the platform to its full potential. Meanwhile, SREs are buried in incident firefighting and can’t keep up with onboarding new services or proactive reliability initiatives. Turns out these two challenges could be solved by tackling them together. In this talk, we’ll share how we combined platform engineering and SRE into one hybrid responsibility that doesn’t just ship tooling, it helps teams actually adopt it. We’ll show how our Platform SREs make new services “reliable by default” with out-of-the-box observability, alerts, and SLOs. But for those older, messier services? We send someone in. Embedded SRE style, but for a limited time and scope. We’ll walk through how we structure these short-term embed missions, what’s worked (and what’s flopped), and how this helped adoption go way up without burning anyone out. If you’re tired of begging teams to migrate or your SREs are on the verge, this one’s for you.
Jorge is a Reliability Advocate at Rootly and the author of the Linux Foundation Introduction to Backstage (LFS142) course. He has a background in software engineering (ex-PayPal) and digital communication (UCLA). He's also a certified sommelier (CETT Barcelona).
The main focus here is to truly teach the audience the mindset of attack/defence, especially in an environment as underexplored as GPUs and HPC as a whole. We will present real-life examples and have a small ecosystem in the cloud to demonstrate and raise awareness about this crucial topic. We will show here why the offensive mindset is a differentiator and how it helps us guide and build new security paths in Kubernetes, focusing on tools of our cloud-native ecosystem and creating a new path for newcomers who will eat GPUs for Breakfast and the rest of us.
Mart is an Infrastructure Security Manager , where he enjoys managing various engineers who teach him every day how to break things and become a better manager and engineer. Mart began his journey in cybersecurity trying to understand why so many people liked prime numbers. From there, understanding how these numbers ended up in the clouds and even inside processors became a fascination. He enjoys playing with obscure technologies and trying to be a chef in his spare time.
Your AI might be the biggest insider threat you’ve ever deployed without even knowing it. As large language models (LLMs) become embedded in cloud-native apps and infrastructure, a new kind of risk is emerging: data leakage through inference. With the right prompt, an attacker can extract sensitive data, proprietary logic, or even credentials from your model’s responses bypassing traditional cloud and application security. In this talk, we’ll explore real-world examples of prompt injection, model inversion, and inference-time exfiltration attacks. You’ll learn why AI models running in Kubernetes, serverless, or SaaS environments introduce hidden exposure, and what to do about it. Whether you’re a security engineer, cloud architect, DevSecOps lead, or AI/ML practitioner, this session will equip you to recognize and defend against one of the fastest-evolving threats in modern cloud systems. Because in the age of AI, the model doesn’t need to be hacked to become the leak.
Meletius Igbokwe is a cybersecurity-focused Modern Workplace Engineer with a strong passion for protecting organizational data and identities in today’s cloud-driven world. With expertise spanning cloud security, infrastructure, and automation, he is dedicated to helping organizations adopt modern technologies while maintaining a strong security posture. A professional member of BCS, The Chartered Institute for IT, and a holder of a Master’s degree in Cloud and Network Security from the University of Bolton, Meletius is deeply invested in advancing cybersecurity through research, innovation, and real-world application. His interests include the intersection of AI and cybersecurity, digital forensics, identity security, and securing enterprise environments using industry best practices. Meletius has contributed to several cloud transformation initiatives, using DevOps methodologies to enhance secure deployment and operational efficiency. He prioritizes resilience, compliance, and scalability in every solution he delivers.
Kubernetes is easy, isn't it? Creating Kubernetes cluster in public cloud take a few minutes, deploying an application it's like same. But how can we make sure that our cluster scales in an efficient way? And what can we do with workloads which are not really a good fit for Kubernetes?
Kim is Regional Field CTO and Solution Architect at CAST AI, focused on guiding Kubernetes users to cost efficient and automated platforms. Before CAST, he was working as Cloud Architect and Consultant for Cloudical, noris network and Deutsche Telekom. He’s passionate about cloud infrastructure, automation and open source technologies. He’s an open source contributor and public speaker.
Have you ever wondered what observability really means and how it can be useful, even if you’re just starting out? In this session, we’ll explore the basics of observability through a fun, relatable project: monitoring the health of plants using simple IoT sensors and Grafana dashboards. Using a hands-on example, I’ll share how I teamed up with my 8-year-old daughter to build a soil moisture sensor project. We wired a sensor to an ESP32 board, collected moisture data, sent it to Prometheus, and visualized it using Grafana, a powerful but beginner-friendly tool for creating dashboards. Along the way, I’ll explain key observability concepts like metrics, monitoring, and dashboards, and break down technical jargon with gardening analogies that make it easier to grasp. This session requires no prior experience, just curiosity and a love of learning. By the end, you’ll see how observing your systems (or your plants!) through data can be both empowering and fun, and how Grafana makes it easy to bring data to life visually.
Marie Cruz is a Software Tester with over 10 years of experience and currently works as a Senior Developer Advocate for Grafana Labs. She is also the co-author of Contract Testing in Action, the first book dedicated solely to contract testing, and an international speaker. She has previously worked as an engineering manager responsible for driving continuous testing and quality improvements, and as a principal engineer, she focused on introducing recommended practices for testing and test automation frameworks.
Two Buzzwords Finally Meet! How should we manage AI in large organizations? What are the "services" developers need to add AI to enterprise apps, and what role do platform engineers take? Where do data scientists fit in? How does MCP, A2A, and whatever the latest AI API is fit in? There are so many questions because the field of managing AI in enterprises barely exists. We need to figure it out quickly however, before we see a repeat of a past patterns, like shadow AI and unmanageable apps in production. Based on real-world examples, this talk will go over what we currently know about using platform engineering to help developers use AI.
Michael Cote studies how large organizations get better at building software to run better and grow their business. His books Changing Mindsets, Monolithic Transformation, and The Business Bottleneck_ cover these topics. He's been an industry analyst at RedMonk and 451 Research, done corporate strategy and M&A, and was a programmer. He also co-hosts several podcasts, including Software Defined Talk. His daily-ish newsletter is at newsletter.cote.io.
When Spotify scaled from millions to hundreds of millions of users, we discovered that our carefully crafted abstractions—designed to simplify our systems—had become our biggest operational liability. New engineers could ship features but couldn't debug failures. Our beautiful, clean interfaces masked the very complexity that seasoned engineers needed to understand during critical incidents. This talk chronicles how internal fragmentation during hyper-growth exposed a fundamental paradox in software engineering: the tension between helpful abstraction and dangerous oversimplification. Through real examples from Spotify's infrastructure evolution, you'll discover why traditional approaches to hiding complexity often backfire at scale.
This session is valuable for: - Software engineers - Platform teams - Site Reliability Engineers (SREs) - Engineering leaders Especially those who struggle with: - Onboarding engineers to complex systems - Maintaining operational excellence during rapid growth - Balancing developer productivity with system transparency - Building abstractions that enhance rather than hinder incident response
You'll leave with actionable insights for designing systems and tooling that scale both technically and cognitively—ensuring your abstractions become force multipliers rather than barriers to understanding.
Stuart is a sought-after speaker, TEDx presenter, and champion for sustainable software. He's passionate about the impact of software and AI on the climate and empowers developers to build a more sustainable future. As a leading expert in programmability and DevOps, he frequently graces industry stages worldwide, sharing his knowledge and inspiring others. He lives in Lincoln, England, with his wife, Natalie, and their son, Maddox. He plays guitar and rocks an impressive two-foot beard while drinking coffee.
Healthcare systems can't afford downtime—lives literally depend on it. After seven years of implementing technology solutions at Deloitte across mission-critical healthcare environments, I've learned that traditional reliability approaches fall dangerously short when human lives are on the line. This talk shares hard-won lessons from real-world implementations where 99.9% uptime wasn't good enough. Drawing from extensive experience managing cross-functional teams through high-stakes healthcare technology deployments, I'll reveal the costly mistakes that taught us everything about building truly resilient systems. You'll discover why 68% of healthcare technology initiatives fail reliability tests in production, and how we developed a four-phase reliability framework that reduced critical incidents by 63% while achieving 96.7% system stability rates. This isn't just theory—I'll share specific war stories from surgical robotics implementations where system failures could impact patient outcomes, and enterprise-wide automation deployments serving thousands of healthcare workers. You'll learn how we transformed reliability engineering from an afterthought into a strategic advantage, achieving 41.7% faster recovery times and reducing post-incident remediation costs by millions. Key takeaways include practical strategies for implementing chaos engineering in regulated environments, building observability into legacy healthcare systems without disrupting patient care, and creating SRE cultures that balance innovation with the non-negotiable requirement for reliability. Whether you're working in healthcare or any other industry where failure isn't an option, these battle-tested approaches will help you build systems that truly stand the test of time—and scrutiny.
Aishwarya Pai is a highly accomplished Computer Science professional with over seven years of experience at Deloitte US, one of the Big 4 consulting firms. Currently serving as a Senior Consultant, she has demonstrated exceptional growth throughout her career, advancing from Associate Analyst to her current leadership role. Aishwarya holds a Master of Science in Computer Science from Stevens Institute of Technology, where she concentrated in algorithms, data structures, cloud computing, and software engineering. Her technical expertise spans Java development, JavaScript, database management, and full-stack web development. As a Certified ScrumMaster, Aishwarya excels in Agile project management and has successfully led cross-functional teams through software development initiatives. Her contributions have been recognized with multiple Applause Awards for outstanding performance in both technical delivery and Scrum Master roles. Beyond her technical achievements, Aishwarya is a dedicated leader who has mentored over 20 junior developers and actively contributes to firm initiatives. She has served as Team Lead for Deloitte Impact Day, led wellbeing initiatives for North Dakota, and participated in the C&M Talent Advisory Council, demonstrating her commitment to both professional excellence and community impact. Aishwarya's combination of technical expertise, leadership capabilities, and commitment to continuous learning makes her a valuable contributor in delivering high-quality solutions in today's technology landscape.
Your CEO wants AI deployed company-wide. Your CISO says absolutely not. Your compliance team is having nightmares about SOX audits. Sound familiar? While startups rapidly deploy AI, enterprises remain stuck in analysis paralysis - caught between business demands for AI-powered productivity and regulatory requirements that make traditional AI solutions impossible to deploy. This talk reveals the battle-tested framework StatusNeo developed to solve this exact problem for Fortune 500 companies. You'll discover how to deploy enterprise-grade AI that satisfies both your innovation teams and your compliance officers.
Nishkarsh is a DevSecOps expert and an International GitHub Star. Nishkarsh is an ardent supporter of open-source, GitHub, DevEx, and DevOps. Nishkarsh serves as StatusNeo Inc.'s Principal Evangelist & Consultant. Over the years, he has been actively GitHubbing and contributing to open-source. By giving talks at conferences, organizing meetups, and encouraging people to take on the #100DaysofCode challenge, he has encouraged many brilliant minds to embark on their journeys in open-source projects and preach the significance of collaboration to aspiring developers.
In high-scale environments, metrics cardinality isn’t just a resource concern, it’s an architectural challenge. Left unchecked, it can impact performance, query latency, and even system stability. This talk takes a deep technical dive into how VictoriaMetrics enables advanced observability practices with a strong focus on cardinality management. I’ll explore how to design efficient and scalable scrape configurations using Prometheus-compatible jobs and exporters, optimize your label strategies, and use built-in cardinality analysis tools within VictoriaMetrics to identify and mitigate high-cardinality patterns early. The session also covers integration with Grafana open source for visualizing metrics in a way that supports signal clarity and operational response, as well as setting up practical alerting strategies that minimize noise while ensuring fast issue detection. By the end of the talk, the audience will understand the engineering trade-offs of high-cardinality metrics and how to detect them, how VictoriaMetrics handles storage and querying at scale, how to build resilient and low-overhead scrape configs with Prometheus compatibility and how to use Grafana open source to highlight cardinality hot spots and improve alert signal quality.
Diana is a DX Engineer at VictoriaMetrics. She is passionate about Observability, machine learning. She is an active contributor to the OpenTelemetry CNCF open source project and supports women in tech.
Are your observability signals trapped in separate pillars? Logs in one place, metrics in another, both losing context? At ClickHouse, we faced this challenge at a massive scale. Our solution was to abandon the traditional model and embrace a new philosophy: store everything, aggregate nothing. This talk charts our journey to 100 PB and 500 trillion rows, centered on the concept of "wide events." Instead of shipping a simple log message and a separate, pre-aggregated metric, we store a single, context-rich event containing every possible dimension. This shift from three pillars to a single warehouse of high-cardinality telemetry was a game-changer. The key to this model is using ClickHouse itself as our observability backend. This unlocks unbounded query flexibility through full SQL. When an engineer asks "what's the p95 pod replacement time after termination?", we don't say "let me ship a new metric." We write a SQL query. This talk will cover: The Wide-Event Philosophy: Why logging rich, structured events is more powerful than juggling separate logs and metrics, and how it defeats cardinality fears. Unbounded SQL-based Querying: We’ll show real-world examples of complex diagnostic queries (like using ASOF JOIN to correlate disparate Kubernetes events) that are impossible in traditional log-search tools. The Enabler - SysEx: How our custom, high-performance exporter made this firehose of wide-event data technically and financially viable, handling 20x the volume of OTel with 90% less CPU. Beyond Dashboards: How this approach allows us to use data science tools like Plotly and Jupyter directly on our observability data for deeper, more flexible analysis. Join us to see how treating observability as a data warehouse problem, powered by ClickHouse, gives you the speed and flexibility to answer any question about your systems, past or present.
Rory is a senior engineer on the observability team at ClickHouse, responsible for building and managing a 100+ petabyte observability platform. With over a decade of SRE experience, he focuses on monitoring ClickHouse Cloud, optimizing metrics pipelines and working with Grafana for visualization.
Instrumenting legacy or closed sourced applications can be a pain. But it doesn't have to be! This talk offers an introduction to the open source Beyla tool and shows you how to utilize it in order to instrument different types of applications without touching a single line of code. Beyla hooks into the kernel using eBPF and does the hard work for you. It's not only useful for services where you can't influence the underlying instrumentation, but it can help you establish a baseline level of observability in large scale deployments. We’ll also go over limitations, so you’ll leave the talk knowing if Beyla is right for you (spoiler: it probably is).
Dominik started his journey in technology as an SRE, working on projects ranging from warehouse logistics and photobook designers to analyzing satellite imagery. During this time, he discovered his passion for developer tooling and making sure developers can focus on what they do best - build great software! Now he is working as a Developer Experience Engineer at Grafana Labs, building tools to see clearly in the ever-changing world of software.
Chaos Engineering is often misunderstood as simply “breaking things on purpose.” This talk challenges that perception and repositions Chaos Engineering as a critical pillar of reliability and resilience engineering. Rather than focusing on failure injection alone, we explore how to leverage existing knowledge, validate known truths, and foster confidence in complex systems.
In the first part, we deconstruct common myths around Chaos Engineering and reframe its core principles. Learn how aligning chaos practices with reliability goals can transform the way your organization perceives and applies these techniques—by emphasizing structured validation over blind experimentation.
The second part brings theory into practice with a hands-on framework that reimagines chaos experiments as self-feeding, iterative loops—mirroring the scientific method. We introduce the concept of continuous verification, drawing parallels with integration testing, and show how Chaos Engineering can seamlessly integrate into the Software Development Lifecycle (SDLC) through a shift-left approach.
The session wraps up with a visual framework for implementing a sustainable Chaos Engineering strategy, including how to evolve gamedays into repeatable, hypothesis-driven validations that scale with system changes.
Whether you're just exploring Chaos Engineering or looking to mature your reliability strategy, this talk will leave you with actionable insights, a modernized mindset, and a clear path to operational resilience.
I am a software engineer with an attrition for low-level and Linux internals who joined Datadog 7 years ago in the SRE group. After developing the now open-source Chaos Controller project meant to provide large-scale fault injection to engineering teams, I built the Security Chaos Engineering group now made of 3 teams and 20 people for which I act as the Engineering Manager. I am also part of the Core Incident Commanders group involved in all high severity outages to help with internal coordination and external communication.
Observability is the cornerstone of reliable systems. It lets teams identify and resolve issues before they impact a broader group of users. Yet building an ideal observability stack is far from easy. It demands time and effort, instrumenting every app, service, and component that emits telemetry. Many teams default to “Store’em All - just in case”, logs that no one reads, traces that no one queries, metrics that never inform action. The result? Costs escalate, operational clarity fades, and ROI on observability tends to plateau or even decline. So, shouldn’t we be asking ourselves: are we really investing in observability, or just paying for some distributed noise?
The issue isn’t lack of telemetry; it’s unchecked volume without purpose. This talk explores the telemetry pipeline as a strategy to take back control. At the OpenTelemetry Collector level, we can filter, transform, sample, redact sensitive data, and route telemetry with intent. The goal is to extract clear business value from every signal and every dollar spent. By aligning observability with outcomes, we get an adaptive, efficient, and cost-aware setup. Whether you’re just starting out or operating at scale, this talk will show how to turn observability into a strategic asset instead of a liability.
Yash is a software engineer and researcher with a deep interest in distributed systems. His focus is on observability and performance, areas where he constantly seeks new insights. As an active advocate of OpenTelemetry, Yash contributes to both the project and the wider community. Outside of tech, he’s an avid explorer, whether in the kitchen experimenting with new recipes or traveling the world to taste diverse cuisines.
SREs are the guardians of reliability — we build for failover, redundancy, and scale. But in today’s AI-native systems, there’s a quiet shift happening: automation scripts, AI-driven agents, and machine identities are running critical operations with increasing autonomy.
They auto-remediate incidents, make scaling decisions, deploy configurations, and even patch systems — often with elevated permissions and little human oversight.
But here’s the problem:
In this session, I will expose the blind spot most SRE teams don’t realize they have — the growing influence of non-human actors in reliability engineering. I will share real-world examples of automation gone wrong, highlight how over-trusted machine identities quietly amplify risk, and explain why SREs need a governance mindset toward automation.
I will discuss:
This is not a security talk — it’s a reality check for SREs operating in environments where automation does more than assist; it decides. If you are an SRE, platform engineer, or operations lead, this talk will challenge you to rethink your approach before your AI helpers become your biggest liability.
Cynthia Akiotu is a Cybersecurity Architect and Identity Specialist with deep expertise in Identity & Access Management (IAM), data governance, and Zero Trust implementation. She focuses on securing environments where AI agents, autonomous models, and machine identities are rapidly expanding the attack surface. With experience across diverse sectors, Cynthia has led projects on secure cloud adoption, privileged access management, insider risk mitigation, and identity governance for AI-driven environments. She holds several industry certifications, including Microsoft Certified: Identity & Access Administrator and Microsoft Certified: Information Security Administrator. Her thought leadership extends to academic and industry publications, contributing to works on AI-enabled regulatory compliance and digital skills development in security. Beyond her professional work, she volunteers as Cyberhero Coach Cynthia, promoting cybersecurity awareness and safe digital habits among children and communities. Driven by a commitment to secure innovation, she champions identity governance as a foundation for trust in AI-native ecosystems.
In this session, we'll explore how to implement flexible multitenancy in Kubernetes using vCluster. You'll learn how to design platforms that can adapt to different isolation requirements, resource sharing needs, and trust models within a single cluster.
We'll explore:
Trust boundaries: zero / partial / full
Flexibility in Practice
How to mix and match different multitenancy models for different teams and workloads.
vCluster Deep Dive
How virtual clusters enable this flexibility while maintaining strong isolation boundaries.
Real-World Scenarios
Live demo showing how to implement different multitenancy patterns based on actual use cases.
Key Takeaways
Why This Talk Matters
Most Kubernetes multitenancy discussions focus on a single approach — either namespace isolation or full cluster isolation. Real-world platforms need flexibility to accommodate different teams with varying security, resource, and trust requirements. This talk shows how to build platforms that can adapt to these diverse needs without compromising on security or operational efficiency.
An active contributor to OpenSource projects on GitHub, blogger and content creator, focusing on practical, scalable solutions in cloud-native environments. DevOps and Platform Engineering practitioner and advocate. Visit: cloudrumble.net
Managing alerts across hundreds of services can quickly become a challenge — fragmented workflow, inconsistent configurations and high operational costs. In this session, we will share how we streamlined alerting by adopting Kibana Alerts and building a bidirectional automation on top of it, using a GitOps-driven approach. Our solution balances UI flexibility with Infrastructure as Code (IaC) principles, simplifying configuration while maintaining control and simplifying complex features at scale. Discover how automation improved consistency, enabled version control, and reduced complexity.
In this session, we’ll explore how we automated Kibana alerting at scale—and tackled some of the gaps that come with it. You’ll see how we implemented advanced features like re-alerting and spike detection, even though they weren’t available out of the box. From GitOps-driven pipelines to creative workarounds, this talk is a practical deep dive into making alerting reliable, repeatable, and ready for scale.
Ayd is a DevOps Team Lead at AUTO1 Group, where he helps power one of Europe’s fastest-growing digital marketplaces by building scalable, resilient, and automated infrastructure. With over 12 years of experience across cloud, DevOps, and platform engineering, he’s led transformations from legacy chaos to modern observability at scale. Known for breaking things on purpose (and then automating the fix), Ayd is passionate about turning operational complexity into repeatable systems. Outside the terminal, he’s a TEDx organizer, TED Translator, and self-declared youth activist. He believes in sharing what hurts, building what works, and always pushing platforms—gently or otherwise—toward what they should become.
In large-scale distributed systems, failure is not a matter of if, but when. What sets a reliable platform apart is its ability to detect, isolate, and recover from these failures — without compromising performance or consistency.
In this session, we’ll dive into how OceanBase, a distributed relational database built for cloud-native environments, combines deep observability with real-world chaos testing to ensure high availability at scale. We’ll share how our team conducts chaos engineering drills using ChaosMeta, simulating node failures, network partitions, and resource exhaustion within Kubernetes clusters — and how OceanBase’s Paxos-based architecture enables automatic failover and strong consistency under these scenarios.
We’ll also explore how OceanBase’s built-in observability stack — including SQL-level diagnostics, slow query tracing, and real-time metrics — helps platform teams proactively monitor and troubleshoot issues before they escalate.
Whether you’re an SRE, platform engineer, or systems architect, this talk will equip you with practical insights into building resilient stateful systems — and the confidence to survive failure in production.
Peng Wang is the Global Technical Evangelist for OceanBase, a distributed relational database designed for cloud-native applications. He brings over a decade of experience in the database R&D, including his previous role as a team lead in Intel’s database R&D group. He is also a contributor to several top-level Apache open source projects, and is passionate about open collaboration in the developer community. At OceanBase, he leads global developer engagement efforts — including technical content creation, community building, and open source evangelism.
In today's digital space, downtime adversely impacts customer trust and can lead to lost revenue. This session will show you how to rethink your infrastructure monitoring with logs, metrics, and traces for complete visibility. By utilizing better practices, applying machine learning for anomaly detection, and using AI assistance in place of human interactions with data, organizations can prevent and remediate issues before they lead to downtime, and to continue operating in reliable digital ecosystems. This session uses OpenSearch to present these platform-agnostic concepts.
I am a Specialist Solution Architect for Analytics at Amazon Web Services, specializing in Amazon OpenSearch Service. My role is to guide customers through essential concepts and design principles for optimal cloud-based analytics, search and GenAI deployments. When not architecting solutions, I enjoy discovering new destinations and exploring diverse culinary experiences.