32 Rue Blanche
75009 Paris, France
In modern distributed systems, the volume and fragmentation of production data can quickly overwhelm human operators. This talk introduces a LangChain agent built to autonomously investigate production issues by orchestrating multiple tools across organizations’ stacks. We'll walk through how to build a modular, composable multi-tool agent, dig deeper into real-world reliability challenges, and share strategies and lessons learned in making these agents production-ready.
Tomaz is the Founding Engineer at Ewake.ai, working actively on advancing Ewake’s mission of building an AI teammate that brings peace of mind to engineering teams by investigating issues reactively and watching production proactively.
Prior to that, Tomaz worked as a Software Engineer for Doctolib, Benie Saúde, and Abbiamo in Brazil.
Observability is the cornerstone of reliable systems. It lets teams identify and resolve issues before they impact a broader group of users. Yet building an ideal observability stack is far from easy. It demands time and effort: instrumenting every app, service, and component that emits telemetry. Many teams default to “store ’em all, just in case”: logs that no one reads, traces that no one queries, metrics that never inform action. The result? Costs escalate, operational clarity fades, and the ROI on observability plateaus or even declines. So shouldn’t we be asking ourselves: are we really investing in observability, or just paying for distributed noise?
The issue isn’t lack of telemetry; it’s unchecked volume without purpose. This talk explores the telemetry pipeline as a strategy to take back control. At the OpenTelemetry Collector level, we can filter, transform, sample, redact sensitive data, and route telemetry with intent. The goal is to extract clear business value from every signal and every dollar spent. By aligning observability with outcomes, we get an adaptive, efficient, and cost-aware setup. Whether you’re just starting out or operating at scale, this talk will show how to turn observability into a strategic asset instead of a liability.
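To make the pipeline idea concrete, here is a minimal OpenTelemetry Collector configuration sketch. The processor names (`filter`, `redaction`, `probabilistic_sampler`) are real Collector components, but the specific rules, patterns, and the backend endpoint are illustrative assumptions, not recommendations from the talk:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  # Drop noisy, low-value logs before they cost anything downstream.
  filter/logs:
    logs:
      exclude:
        match_type: regexp
        bodies: [".*healthcheck.*"]
  # Redact sensitive values before telemetry leaves the cluster.
  redaction:
    allow_all_keys: true
    blocked_values: ["[0-9]{16}"]   # e.g. card-number-like strings
  # Keep a representative fraction of traces instead of all of them.
  probabilistic_sampler:
    sampling_percentage: 10

exporters:
  otlphttp:
    endpoint: https://backend.example.com   # hypothetical backend

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [filter/logs, redaction]
      exporters: [otlphttp]
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [otlphttp]
```

Each processor here maps to one of the levers the talk discusses: filtering, redaction, and sampling, all applied with intent at the Collector rather than paid for at the backend.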
Yash is a software engineer and researcher with a deep interest in distributed systems. His focus is on observability and performance, areas where he constantly seeks new insights. As an active advocate of OpenTelemetry, Yash contributes to both the project and the wider community. Outside of tech, he’s an avid explorer, whether in the kitchen experimenting with new recipes or traveling the world to taste diverse cuisines.
OpenTelemetry semantic conventions cover many layers of your stack but fall flat when it comes to business logic. It doesn’t have to be this way! The OpenTelemetry Weaver project gives you the tools to build your own semantic conventions. With auto-generated instrumentation libraries and documentation, developers no longer have to worry about whether an attribute is called customer_id, customerID, accountNumber, or something completely different: it’s all in the schema! Built-in support for policy validation also ensures your alerts never break because a metric has been renamed.
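As a flavor of what such a schema looks like, here is a small sketch of a custom attribute group in the registry YAML format that Weaver consumes. The group and attribute names (`checkout.*`) are invented for illustration; consult the Weaver documentation for the full schema:

```yaml
# Hypothetical business-level semantic conventions, Weaver registry style.
groups:
  - id: registry.checkout
    type: attribute_group
    brief: Attributes describing a checkout transaction.
    attributes:
      - id: checkout.customer.id
        type: string
        stability: stable
        brief: Unique identifier of the customer.
        examples: ["cust-1234"]
      - id: checkout.cart.value
        type: double
        stability: development
        brief: Total cart value in the account currency.
        examples: [99.95]
```

From a registry like this, Weaver can generate documentation and instrumentation helpers, so the attribute name is decided once, in the schema, instead of per codebase.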
Dominik started his journey in technology as an SRE, working on projects ranging from warehouse logistics and photobook designers to analyzing satellite imagery. During this time, he discovered his passion for developer tooling and making sure developers can focus on what they do best - build great software! Now he is working as a Developer Experience Engineer at Grafana Labs, building tools to see clearly in the ever-changing world of software.
AI is making its way into platform engineering—not just as a workload, but as a smart automation layer for how platforms are built, operated, and optimized. Promises of intelligent autoscaling, self-tuning systems, and AI-assisted remediation are everywhere. But how do these claims hold up in real-world Kubernetes environments?
In this deeply technical session, the latest generation of AI-powered features and patterns will be benchmarked and stress-tested in the context of platform operations. From scaling decisions to observability-driven automation and adaptive infrastructure behavior, the focus will be on how these systems perform under load, how they handle edge cases, and what the operational overhead truly looks like.
Attendees will walk away with a clear-eyed view of the strengths and limitations of AI-driven platforms, grounded in data—not just demos.
Because if your infrastructure says “I’ll be back,” it’s worth knowing what it’s planning.
Annie Talvasto is an award-winning international technology speaker and leader. She has spoken at more than 100 tech events worldwide, including KubeCon + CloudNativeCon and Microsoft Build & Ignite, and has been recognized as a CNCF Ambassador and an Azure & AI Platform MVP. She has co-organized the Kubernetes & CNCF Finland meetup since 2017. She has also served as a track chair for KubeCon + CloudNativeCon and Program Chair for the Secure AI Summit (powered by Cloud Native), and has hosted Cloud Native Live, a weekly livestream by CNCF, since 2021.
How many times have you been woken up in the middle of the night, either to spend more time than you'd like figuring out what exactly broke, or to bash your keyboard in frustration once you realize it was actually a false positive? What if there was a better way? AI is everywhere nowadays, so what if we used it to solve these problems? Let's talk about AIOps and, through real-world case studies and industry research, see why it enables companies to reduce their MTTR by 62%, cut their alert noise by 91%, and predict 87% of potential service degradations before they impact customers. We will take a journey through the evolution of AIOps: from early AI analytics, to the rise of GenAI, to the latest promised savior: agents. But how do agents help us when incidents come knocking? Will they speed us up? Are they even able to fix incidents without waking us up? And even if agents are here to change everything, where are the gaps? Can everything be "agentified"?
Daniel Afonso is a Senior Developer Advocate at PagerDuty, SolidJS DX team member, Instructor at Egghead.io, and Author of State Management with React Query. Daniel has a full-stack background, having worked with different languages and frameworks on various projects from IoT to Fraud Detection. He is passionate about learning and teaching and has spoken at multiple conferences around the world about topics he loves. In his free time, when he's not learning new technologies or writing about them, he's probably reading comics or watching superhero movies and shows.
SREs are on the frontlines of uptime, performance, cost efficiency, and incident response. Yet policies for security and compliance often live in stale docs, enforced inconsistently, if at all, until something breaks or an audit lands. Policy as Code (PaC) replaces that mess with real-time automation right where the work happens.
We will discuss how PaC bridges the gap between security and operations by making policies transparent, codified, versioned, and customizable so you can enforce external standards (CIS, SOC 2, HIPAA) and your own internal rules (like requiring HA in prod but not in dev). You'll see how to apply policy intent consistently from the pipeline directly to runtime, giving teams proactive control, not reactive fire drills or unnecessary "vibe killers".
We'll walk through a full-lifecycle live demo showing how PaC can:
- Prevent misconfigurations with real-time checks
- Enforce policies consistently across repos, infra, and runtime
- Customize and codify controls that reflect unique needs
- Control cloud costs through autoscaling, right-sizing, and cleanup
- Unify security, dev, and ops with shared policies
What will we accomplish? Well, no more stale docs or arguments around policies, faster deploys, tighter cost controls, and safer infrastructure without slowing anyone down.
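The talk is tool-agnostic, but to give policy as code a concrete shape, here is a tiny OPA-style Rego sketch of the "HA in prod but not in dev" rule mentioned above. The input document shape (`env`, `replicas`) is a hypothetical deployment descriptor, not any particular product's schema:

```rego
package deployment.policy

# Require high availability in production, but not in dev.
# `input.env` and `input.replicas` are assumed fields of a
# hypothetical deployment document fed to the policy engine.
deny[msg] {
    input.env == "prod"
    input.replicas < 2
    msg := sprintf("prod deployments need >= 2 replicas, got %d", [input.replicas])
}
```

Because the rule is code, it can be versioned, reviewed in a pull request, and evaluated identically in CI and at runtime, which is exactly the consistency the session argues for.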
Alayshia Knighten is a seasoned engineering leader and customer success strategist with a strong background in DevOps, cloud architecture, and technical enablement. Currently serving as the Founding Principal Customer Success Engineer at Mondoo, she brings over a decade of experience helping organizations build and scale secure, efficient, and reliable infrastructure.
Previously, Alayshia held key roles at Pulumi as a Senior Customer Architect and at Honeycomb.io, where she led implementation engineering and partnerships architecture. Her expertise spans across engineering consulting at Chef Software and hands-on DevOps engineering at Verisign. Passionate about empowering teams and driving technical excellence, Alayshia is known for bridging the gap between engineering and customer success.
Traditional syslog systems have long been opaque — exporting minimal, fixed-format metrics that rarely reflect what users actually care about. AxoSyslog, a high-performance fork of syslog-ng, has taken a different path: not only adopting native Prometheus metrics, but also enabling metric emission directly from the user’s log processing logic.
In this talk, I’ll share how we transitioned from CSV-style global and per-driver metrics to full Prometheus integration. But more importantly, I’ll explore a less common mindset: treating metrics not as static artifacts of a system, but as programmable, user-defined views of what matters. Users can emit fine-grained, label-rich metrics from log routing logic itself — for example, tracking per-tenant message volume, labeling metrics with custom log- or environment-related info, or observing how often certain fields are missing.
We’ll walk through:
* What syslog metrics looked like historically (and why they fell short)
* Our journey to integrating Prometheus natively
* How update_metric() works and why it's powerful
* Real-world use cases where dynamic metrics made debugging and policy enforcement dramatically easier
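As a rough illustration of the idea, an AxoSyslog-style configuration sketch that emits a per-tenant counter straight from routing logic might look like the following. This is illustrative pseudo-configuration: the metric name, label, and exact update_metric() signature should be checked against the AxoSyslog documentation:

```
log {
    source(s_network);
    filterx {
        # Count messages per tenant directly in the processing path.
        # Metric name and label key are invented for this sketch.
        update_metric("tenant_messages_total",
                      labels={"tenant": $tenant_id});
    };
    destination(d_store);
};
```

The point is less the syntax than the shift it represents: the metric is defined by the user, inside the pipeline, rather than hard-coded by the syslog daemon.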
If you want to make observability accessible in deeply traditional parts of the stack — or want to let users code their own metrics — this talk is for you.
Founding Engineer at Axoflow, leading the dataplane team, specializing in scalable log ingestion and processing pipelines for enterprise and cloud-native environments.
Longtime syslog-ng and AxoSyslog developer with a strong passion for open source and community-driven innovation. Focused on building reliable, high-performance log management and observability systems that address real-world operational challenges.
Session Overview
Kubernetes offers many ways to share GPUs, but a single, cluster-wide scheduler often forces trade-offs between utilization, stability, and team autonomy. This talk shows how vCluster makes the NVIDIA Kubernetes AI Scheduler (KAI) run as an opt-in service for each tenant—so platform teams can raise GPU density while keeping operations predictable.
What We’ll Cover
- Problem statement – why mixed workloads leave GPUs under-used and complicate on-call
- vCluster fundamentals – lightweight control planes that isolate scheduling logic, not hardware
- KAI at a glance – fractional GPU allocation, gang queues, topology awareness
- Live demonstration – two vClusters on one host
Key Takeaways
- A reproducible pattern for running different schedulers side-by-side
- Practical steps to increase GPU utilisation without adding more clusters
- An isolation model that lets teams experiment safely
Why It Matters
As GPU demand grows, platform engineers must balance cost efficiency with reliability. Combining vCluster and KAI delivers both—turning idle accelerators into productive capacity while preserving operational control.
An active contributor to OpenSource projects on GitHub, blogger and content creator, focusing on practical, scalable solutions in cloud-native environments. DevOps and Platform Engineering practitioner and advocate. Visit: cloudrumble.net
Today’s observability stacks are rich in telemetry but poor in semantic alignment. This talk introduces the Intent Graph—a new visualization paradigm that traces the propagation of design decisions across system layers, from infrastructure to application logic to business outcomes. The Intent Graph makes dependencies between architectural choices, test coverage gaps, and runtime anomalies explicit and navigable. We walk through real-world scenarios where misaligned design intent caused failures—even when metrics looked green. Using techniques from MIT's STAMP/STPA, causal inference, and Generative AI, we show how to move from metric dashboards to intent-aware observability fabric. This talk will resonate with those seeking to build truly self-explaining systems.
It maps how design decisions, quality trade-offs, and business goals flow across the system stack, from infrastructure to app to outcomes. When this flow is broken, the system may still "work"—but it works wrong.
What you'll learn:
- How to capture intent traces at design time
- How to correlate runtime telemetry to intent deviation
- How to prevent false positives/negatives in alerting through purpose-aware thresholds
- How to respond to incidents based on drift from intent, not just error rates
This is observability with semantics. And it is what SRE needs next.
Short description: Go beyond dashboards—this talk introduces Intent Graphs to trace how design decisions shape runtime behavior and business outcomes.
Mahesh Venkataraman leads innovation in applying artificial intelligence, data mining, and machine learning to software engineering. He has led successful implementations of natural-language-processing-driven test automation, usage and failure modeling using log analytics, empirical analysis of technical debt, and the application of knowledge graphs to discover patterns and relationships for optimizing test suites and improving decision making in system integration projects. His passion is bridging the gap between theory and practice, and between academia and industry, and bringing creative thinking to software. He is a regular keynote speaker at many conferences. He is currently working on addressing uncertainty in fault prognosis and diagnosis.
Koushik Vijayaraghavan is a Senior Managing Director at Accenture, where he has spent over 20 years driving product innovation, engineering, and digital transformation for global clients. He began his career at Cognizant and has completed Harvard Business School’s Disruptive Strategy program, strengthening his expertise in guiding organizations through change.
In our presentation, we'll explore a cutting-edge architectural solution for real-time SMS and email notifications, particularly geared towards responding to earthquake events. The system is designed for rapid data transmission, listening for event changes every second, making it ideal for real-time critical-alert scenarios. Central to our discussion will be the integration of Lambda functions and Confluent Kafka, coupled with advanced multithreading techniques and DynamoDB lock strategies. A focal point will be the challenges, and our solutions, involved in integrating Confluent Kafka with Lambda functions so that both producers and consumers run serverlessly; this is key to distributing notifications quickly and in parallel. We will also delve into an automated scaling mechanism, which is vital for optimising the performance of the serverless notification ecosystem. Our aim is to show how these technologies can be combined into a robust and efficient system capable of delivering critical real-time alerts for events like earthquakes, ultimately playing a crucial role in saving human lives.
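The DynamoDB lock strategy mentioned above boils down to a conditional write: a consumer acquires the lock only if no unexpired holder exists. Here is a minimal Python sketch of that pattern against an in-memory stand-in for the table; with boto3, the same idea would use put_item with a ConditionExpression, and the lock/owner names below are invented for illustration:

```python
import time


class LockTable:
    """In-memory stand-in for a DynamoDB table keyed by lock name."""

    def __init__(self):
        self.items = {}

    def conditional_put(self, key, item, now):
        # Mirrors a DynamoDB ConditionExpression along the lines of:
        #   attribute_not_exists(lock_id) OR expires_at < :now
        existing = self.items.get(key)
        if existing is None or existing["expires_at"] < now:
            self.items[key] = item
            return True
        return False  # ConditionalCheckFailedException in real DynamoDB


def acquire_lock(table, lock_id, owner, ttl_seconds, now=None):
    """Try to take the lock; returns True when this owner now holds it."""
    now = time.time() if now is None else now
    item = {"owner": owner, "expires_at": now + ttl_seconds}
    return table.conditional_put(lock_id, item, now)


table = LockTable()
assert acquire_lock(table, "earthquake-batch", "consumer-a", 30, now=100)
# A second consumer is rejected while the lock is live...
assert not acquire_lock(table, "earthquake-batch", "consumer-b", 30, now=110)
# ...but can take over once the TTL has expired.
assert acquire_lock(table, "earthquake-batch", "consumer-b", 30, now=200)
```

The TTL makes the lock self-healing: if a Lambda invocation dies mid-work, another consumer can claim the item once the lease expires, without any coordinator process.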
Vlad Onetiu, a DevSecOps and Software Automation Engineer from Cluj-Napoca, Romania, is renowned for his expertise in cloud technology, cybersecurity, and software automation. Since embarking on his career in 2018, he has been instrumental in conducting security research for Romania's major banks, significantly bolstering their cybersecurity measures. Vlad has also contributed to the field through his research papers on malware and phishing, shedding light on these critical cyber threats. His proficiency in employing cloud-based solutions for system automation, combined with his skillful handling of CI/CD processes and cloud architecture, reflects his commitment to fostering secure and resilient digital environments. Known for his passion for technology and relentless innovation, Vlad stands out as a leading figure in cybersecurity, continuously exploring and implementing cutting-edge strategies to address the challenges of evolving cyber threats.
This talk presents the real-world story behind countX, a B2B fintech company that grew from first commit to successful private equity exit in under four years, without VC funding and with a lean, empowered team. From day one, we built on a fully serverless AWS-native architecture: Lambda, SNS/SQS, API Gateway, CloudFront, Cognito, and CDK. On top of that, we implemented a Continuous Deployment pipeline using GitHub, CodeBuild, and CodePipeline, enabling us to deploy to production dozens of times per week with high confidence and zero manual gates.
But the talk isn’t just about architecture or tooling. What made this setup truly powerful was how we paired CD with Continuous Discovery - customer interviews, fake doors, lightweight A/B tests, and KPIs tied to actual product outcomes. This combination created a feedback-driven loop that allowed us to ship fast, iterate with purpose, and align engineering with business goals.
The main takeaway: how a pure DevOps practice like Continuous Deployment, when paired with the right product mindset, can significantly increase not just delivery velocity, but team performance, product-market fit, and ultimately revenue and ROI. This is not a theoretical or aspirational talk, it’s a practical case study showing how modern SRE practices can become strategic drivers of business success.
I’m a Ukrainian entrepreneur, software engineer by background, with over 20 years of experience. I started my career in Kyiv, spent several years working in Moscow, and have spent the last decade in Berlin. Most recently, I was the co-founder and CTO of countX, a B2B fintech company that went from first commit to a successful exit in under four years. I’m also pursuing an Executive MBA at London Business School, where I’m deepening my focus on fintech, financial systems, and venture strategy.
Machine Learning (ML) solutions often start on a simple platform like a virtual machine, which is great for initial research. However, as the system scales and enters production, automation becomes crucial. Cloud suites such as Google Vertex AI, Azure Machine Learning, and AWS Sagemaker, can streamline this process.
For example, model training is more efficient with a managed service that automatically scales compute resources to your training needs, eliminating the cost of idle resources, as happens when you train on a Jupyter notebook or a VM alone.
We’ll cover all parts of the ML process, including development, hypertuning, deployment for inference, experiments, model management, monitoring, performance, and operating the entire pipeline.
Joshua Fox has 20 years experience as a software architect in software product companies, and now advises tech companies on their gnarliest cloud challenges as a senior cloud architect at DoiT International. See more at joshuafox.com/publications
Setting up continuous integration is now a common practice in the industry. However, there are still only a few effective solutions for doing so across hundreds of repositories encompassing thousands of projects. How do we manage dependencies between projects? How do we assess the quality of each one? How do we automate the validation of a project's clients even before merging a pull request? In this session, we’ll quickly revisit the fundamentals of the problem. Then, we’ll share Criteo's own journey on this topic, along with the pros and cons of the different approaches we explored.
Emmanuel Guérin is a Staff Site Reliability Engineer at Criteo. Over the past 25 years, he has been a strong advocate for automation, working with a number of small French startups. Frustrated by the lack of progress as an individual contributor, he continued to champion better practices wherever he could. At Criteo, he has helped scale the primary build system used by the R&D organization. He now focuses on the company’s main scheduler for data jobs.
This session explores how Happening completely revamped their edge Kubernetes infrastructure by implementing EKS Hybrid to centrally manage all their on-premise clusters across different markets. Faced with regulatory requirements to store data locally at the edge while maintaining operational efficiency, we designed a sophisticated hybrid architecture with a centralized AWS control plane managing edge data planes. We'll dive deep into our technical implementation, including our mixed-mode CNI setup with VPC-CNI and Cilium, multi-pool IPAM to handle cross-cloud networking, and how we leverage Kyverno policies to ensure workload placement across markets. Learn how we established seamless connectivity between AWS and on-premise environments through Wireguard VPN tunnels, coupled with BGP routing policies to efficiently route edge workloads. By centralizing management of geographically distributed Kubernetes clusters, we've reduced our on-premise management burden by 40%, delivering substantial cost savings and dramatically decreasing the team's operational toil. Cluster upgrades, which were always tedious, are now done in just a few hours across all environments simultaneously, significantly improving our maintenance windows and capacity management.
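To illustrate the workload-placement piece, here is a sketch of the kind of Kyverno policy involved. The structure follows Kyverno's ClusterPolicy API, but the market label keys and values are invented for this example, not Happening's actual configuration:

```yaml
# Illustrative Kyverno policy: keep market-scoped workloads on that
# market's edge nodes. Label keys and values are hypothetical.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: pin-market-workloads
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-market-node-selector
      match:
        any:
          - resources:
              kinds: [Deployment]
              namespaceSelector:
                matchLabels:
                  market: de
      validate:
        message: "Workloads in market 'de' must run on de edge nodes."
        pattern:
          spec:
            template:
              spec:
                nodeSelector:
                  edge.example.com/market: de
```

A policy like this turns a regulatory requirement ("this market's data stays on this market's nodes") into an admission-time guarantee rather than a convention someone has to remember.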
Laurent Godet is a seasoned Site Reliability Engineer with nearly a decade of experience building and scaling cloud-native infrastructure for high-growth companies. Currently at Happening, he focuses on reliability, automation, and scalable systems design.
Previously, Laurent led SRE initiatives at LiveRamp, where he managed the company’s largest Kubernetes cluster on GKE, introduced GitOps best practices, and built the award-winning Tenant Factory service that reduced onboarding time from a full day to just one hour.
His career spans pivotal DevOps and SRE roles at Earnd (Greensill), iov42, and Babylon Health, where he drove cloud migrations, built high-availability platforms, and implemented modern observability and CI/CD pipelines. Laurent’s expertise lies in Kubernetes, AWS, serverless architectures, and data-intensive systems, always with a focus on reliability, scalability, and developer productivity.
This talk introduces **telemetry as code**: bringing the same declarative principles that transformed infrastructure to your observability stack.
Using OpenTelemetry Collector Custom Resources and the Telemetry Controller, we'll demonstrate how to eliminate configuration drift, enable true multi-tenancy, and make observability as reliable and repeatable as your deployments.
What You'll Learn
Transform Your Telemetry Pipeline
- Replace brittle YAML with declarative Kubernetes CRDs that abstract complexity while maintaining flexibility
- Build tenant-aware routing that scales from single teams to enterprise-wide deployments
Master Production Patterns
- Design secure, multi-tenant Prometheus integration using Remote Write protocols
- Leverage automated configuration validation and testing strategies that catch issues before production
- Navigate the hidden complexities of cross-namespace telemetry routing and security
Avoid Costly Mistakes
- Learn battle-tested approaches for managing collector configurations across diverse environments
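To give a sense of the declarative model, here is a sketch of two of the custom resources involved. The API group follows the Telemetry Controller project, but field names vary between versions, so treat this as a shape rather than a reference; the tenant and output names are invented:

```yaml
# Illustrative sketch: a tenant scoped to its own namespaces...
apiVersion: telemetry.kube-logging.dev/v1alpha1
kind: Tenant
metadata:
  name: team-a
spec:
  logSourceNamespaceSelectors:
    - matchLabels:
        tenant: team-a
---
# ...and a subscription routing that tenant's telemetry to an output.
apiVersion: telemetry.kube-logging.dev/v1alpha1
kind: Subscription
metadata:
  name: team-a-logs
  namespace: team-a
spec:
  condition: "true"
  outputs:
    - name: team-a-otlp
      namespace: team-a
```

Because routing lives in version-controlled CRDs instead of hand-edited collector YAML, tenants can be added, reviewed, and rolled back like any other Kubernetes deployment.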
Passionate Software Engineer who loves building reliable systems and actively engages with the cloud-native community to advance Kubernetes security and observability. Maintainer of multiple CNCF sandbox projects, including Bank-Vaults, a project dedicated to simplifying the complex world of secret management, and the Logging operator, which solves logging-related problems in Kubernetes environments.
Currently, working on the Logging operator and the new Telemetry controller at Axoflow.
Coming soon...
Matthieu Blumberg is Senior Vice President of Engineering at Criteo, where he leads Infrastructure, Security, and Internal IT initiatives to drive business transformation and empower teams with a world-class digital workplace. With over 14 years at Criteo and a strong background in engineering leadership and cybersecurity, Matthieu plays a key role in scaling technology platforms and ensuring the integrity and efficiency of global operations.
Production today is messy. There’s noise, complexity, and a constant stream of change. And while we’ve come a long way with observability, it still leans heavily on human foresight. Logs, metrics, alerts, they’re all things we had to think of ahead of time. But when we don’t? That’s where blind spots are born.
Ambient agents try to shift that model. These are always-on, proactive teammates who don’t wait for a prompt. They listen to everything happening in production. They surface things we’d likely miss.
In this talk, we’ll dive into what it takes to bring an ambient agent into your stack, how it listens, learns, and acts, and why this might just be the layer of intelligence your system’s been missing.
Pooné Mokari is the CEO and co-founder of Ewake.ai, an AI Reliability Teammate on a mission to bring real peace of mind to engineering teams. Drawing on her experience as an SRE at Criteo, she founded Ewake to offer engineers their dream teammate, which investigates issues reactively and watches production proactively. Throughout her career, she was active as a speaker in different tech conferences, such as Devoxx Belgium and Devoxx France. She’s also been engaged in mentoring women in tech.