Why can't you just use your old monitoring tools for your new AI systems? This session answers that question by introducing AI Observability. You'll learn how to use AIOps to move beyond guesswork and effectively manage your AI agents and infrastructure.
Meiyappan Kannappa is a Technical Director at Ford Motor Company with around 20 years of experience in designing and developing architectural frameworks and cloud applications. He specializes in modernizing applications to cloud-native technologies and has extensive experience building B2C and B2B systems within the e-Commerce, Connected Vehicles, and mobility sectors. As a technical enthusiast, he actively writes about innovative software architecture for digital transformation.
Every Kubernetes deployment starts with good intentions, but the path to production is littered with configuration landmines that can destroy performance, compromise security, and create operational nightmares. This talk exposes the most common—and costly—mistakes that even experienced teams make when working with Kubernetes.
What You'll Learn
Through real-world case studies and live demonstrations, we'll explore:
Security Disasters
- Why default RBAC configurations are a security nightmare waiting to happen
- The hidden dangers of running containers as root and how privilege escalation attacks unfold
- Container image vulnerabilities that slip through CI/CD pipelines
- Network policy misconfigurations that create unintended attack vectors
Configuration Catastrophes
- Resource limits and requests: the difference between "it works on my machine" and production stability
- How improper health checks can cascade into cluster-wide failures
- Storage configuration mistakes that lead to data loss
- The subtle namespace and labeling errors that break everything
Observability Blind Spots
- Why basic CPU/memory metrics aren't enough for Kubernetes troubleshooting
- Missing runtime security monitoring that could have prevented breaches
- Log aggregation anti-patterns that hide critical failure signals
- How to detect anomalous behavior before it impacts users
Scaling and Performance Traps
- HPA configurations that create resource thrashing instead of smooth scaling
- Node scheduling mistakes that lead to resource waste and outages
- Network bottlenecks that aren't obvious until it's too late
Beyond the Problems: Practical Solutions
This isn't just a catalog of disasters—you'll walk away with:
- Actionable checklists for security hardening
- Tool recommendations for continuous monitoring and assessment
- Automation strategies to prevent configuration drift
- Proven patterns for reliable observability
Whether you're just starting your Kubernetes journey or managing enterprise clusters, this session will help you identify potential issues before they become production incidents. We'll cover everything from CIS Benchmark compliance to modern runtime security approaches, ensuring your clusters are both performant and secure.
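To make the hardening guidance concrete, here is a minimal sketch of the non-root and resource-limits advice above, using the Kubernetes Python client; the image, names, and values are illustrative assumptions, not the talk's recommendations:

```python
# pip install kubernetes
# A minimal sketch, assuming a reachable cluster and an image that can
# run unprivileged; values are illustrative, not a production baseline.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod

container = client.V1Container(
    name="web",
    image="nginxinc/nginx-unprivileged:1.27",  # assumption: a non-root-capable image
    security_context=client.V1SecurityContext(
        run_as_non_root=True,              # refuse to start as UID 0
        allow_privilege_escalation=False,  # block setuid-style escalation
    ),
    resources=client.V1ResourceRequirements(
        requests={"cpu": "250m", "memory": "256Mi"},  # what the scheduler reserves
        limits={"cpu": "500m", "memory": "512Mi"},    # ceiling before throttling/OOM
    ),
)
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="hardened-web", labels={"app": "web"}),
    spec=client.V1PodSpec(containers=[container]),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```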
This is a story from the trenches of running one of India’s largest ECS fleets—serving millions of requests entirely on infrastructure that can disappear with just two minutes’ notice. We began with the “easy path” of a third-party managed solution, but as we scaled, it quickly became our biggest bottleneck and a massive cost center.
This session is our journey of taking back control. We’ll share the hard-earned lessons from hitting undocumented AWS limits, battling against opaque “black box” algorithms, and enduring production outages that forced us to innovate.
You’ll learn how we:
- Built custom controllers and predictive scaling
- Slashed instance boot times by 75% through EBS optimization
- Developed a “25-second miracle” shutdown process
All of this allowed us not just to survive—but to thrive—in the chaos of a 100% Spot environment, achieving 99.99% uptime while cutting compute costs by 60%.
This is how we transformed ECS from a simple orchestrator into a battle-hardened, intelligent platform.
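One illustration of what surviving on Spot requires: a watcher for the two-minute interruption notice. Here is a minimal sketch (not Zomato's actual code) that polls the EC2 instance metadata endpoint; `drain_and_shutdown()` is a hypothetical hook for your own teardown logic, and production code should use IMDSv2 session tokens:

```python
import time
import urllib.error
import urllib.request

# EC2 exposes a pending Spot interruption at this metadata path; it
# returns 404 until an interruption is actually scheduled.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def drain_and_shutdown() -> None:
    # Hypothetical hook: deregister from load balancers, drain tasks,
    # flush state, then let the instance terminate.
    print("Interruption notice received; starting graceful shutdown")

def watch(poll_seconds: float = 5.0) -> str:
    while True:
        try:
            with urllib.request.urlopen(SPOT_ACTION_URL, timeout=2) as resp:
                notice = resp.read().decode()  # e.g. {"action": "terminate", "time": "..."}
                drain_and_shutdown()
                return notice
        except urllib.error.HTTPError as err:
            if err.code != 404:  # 404 just means "no interruption scheduled"
                raise
        time.sleep(poll_seconds)

if __name__ == "__main__":
    watch()
```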
Gaurav Chauhan is a Software Development Engineer at Zomato, with prior experience at Oracle, CRED, and Capital 2B. An alumnus of IIT Delhi, he has worked across backend systems, data science, and cloud infrastructure, and has held leadership roles driving large-scale initiatives.
Sudip Chakraborty is a Software Development Engineer at Zomato, working on the platform team with a focus on reliability, benchmarking, and large-scale microservice optimizations. Over the past three years at Zomato, he has contributed to ML Ops, backend systems, and platform engineering, leveraging technologies like Golang, Kubernetes, Terraform, and Kafka.
Incidents are inevitable. At scale, they’re not just costly, they’re chaotic, emotionally draining, and often followed by postmortems that feel more like interrogations than opportunities to learn. What if the very process of incident response could be reimagined? Not as human vs. system, but as humans and AI working side by side?
This talk explores how the rise of AI-driven SRE agents is transforming both incident response and the culture of postmortems. These systems don’t just accelerate recovery; they act as impartial witnesses, documenting events in real time, surfacing context, and stripping away the bias that fuels blame. The result? A shift from “who missed what” to “how do we evolve our systems and practices?”
We’ll walk through:
- Why traditional postmortems often fail in high-velocity, AI-heavy environments.
- How AI-powered incident response changes the narrative from firefighting to foresight.
- What happens when AI becomes the first responder, and the cultural ripple effects it creates.
- Practical steps to harness AI for faster resolution, deeper insights, and blameless learning loops.
- How to quantify impact through metrics like mean time to resolution (MTTR), recovery consistency, and learning velocity, and why measurement is critical to long-term success (a back-of-the-envelope MTTR example follows this list).
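The measurement point in the last item is easy to start on; as a back-of-the-envelope sketch, MTTR is just the mean of detection-to-resolution durations (the incident records below are invented for the example):

```python
# A minimal MTTR calculation over hypothetical incident records.
from datetime import datetime, timedelta

# (detected_at, resolved_at) pairs; in practice these come from your
# incident tracker or paging tool.
incidents = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 42)),
    (datetime(2024, 5, 9, 23, 15), datetime(2024, 5, 10, 0, 5)),
    (datetime(2024, 6, 2, 14, 30), datetime(2024, 6, 2, 14, 48)),
]

durations = [resolved - detected for detected, resolved in incidents]
mttr = sum(durations, timedelta()) / len(durations)
print(f"MTTR across {len(incidents)} incidents: {mttr}")
```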
Attendees will leave with a fresh perspective: AI not as a replacement for SREs, but as a catalyst for resilience, psychological safety, and collective intelligence. In the age of AI, postmortems aren’t just about the past; they’re about future-proofing reliability.
I’m a Senior Engineering Manager with nearly two decades of experience building and leading platform and storage engineering teams. Alongside that, I mentor with Women in Cloud Native, helping grow the next generation of engineers. I’m also a mom of two, balancing high-scale infrastructure with the chaos of homework and school runs, often with some help from AI (always with healthy boundaries). I’m passionate about speaking and writing technical blogs that make complex topics accessible, and I’m known for sneaking in the occasional dad joke, because reliability and resilience aren’t just for systems; they’re for people too.
Ram Iyengar is an engineer by practice and an educator at heart. He was (cf) pushed into technology evangelism along his journey as a developer and hasn’t looked back since! He enjoys helping engineering teams around the world discover new and creative ways to work. He is a proponent of product development and engineering teams that put the community first.
traceloop — a flight recorder for syscalls
I am a senior SRE at Sematext, where I am responsible for the entire AWS/Kubernetes infrastructure used by hundreds of customers worldwide. We self-host almost everything we run, including Kubernetes. I love solving challenging infrastructure problems with a focus on automation and reliability. I have spoken at several conferences, such as DevOpsDays and cfgmgmtcamp, as well as at local meetups.
Conversational AI has captured widespread attention, but the true potential of AI lies well beyond chatbots and dialogue systems. As enterprises look to build intelligent, adaptive systems, the ability to provide real-time, contextual understanding becomes essential. This is where the Model Context Protocol (MCP) comes in — a standard for building servers that dynamically supply AI models with the contextual signals they need to make relevant, timely, and intelligent decisions.
In this talk, we’ll explore the evolving role of MCP in modern AI architectures and how it can be used not just to improve inference quality, but also to intelligently manage and orchestrate Kubernetes environments. By integrating context-aware models with cloud-native infrastructure, organizations can unlock powerful capabilities — from self-tuning systems and adaptive resource allocation to AI-driven automation across the stack.
We’ll dive into:
- What MCP is and why it matters beyond chat interfaces
- How to build an MCP server (a minimal sketch follows this list)
- How MCP can be used to manage a Kubernetes cluster
- How to integrate MCP with tools like VS Code, Claude, etc.
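As a taste of the "build an MCP server" item, here is a minimal sketch assuming the official `mcp` Python SDK and a local `kubectl`; the server name and tool are hypothetical examples rather than the talk's actual demo:

```python
# pip install mcp  (assumption: the Python SDK for the Model Context Protocol)
import subprocess

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("k8s-helper")  # hypothetical server name

@mcp.tool()
def list_pods(namespace: str = "default") -> str:
    """List pods in a namespace by shelling out to kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "wide"],
        capture_output=True, text=True,
    )
    return result.stdout or result.stderr

if __name__ == "__main__":
    mcp.run()  # stdio transport, so clients like Claude Desktop can attach
```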
Aditya is a Senior Software Engineer at Walmart. As a proud CNCF Kubestronaut, he holds multiple Kubernetes certifications that showcase his deep expertise in the ecosystem. Beyond his work, Aditya actively shares his insights through his YouTube channel, creating tutorials on cloud technologies and software engineering, and writes technical blog posts to help others navigate and master these domains.
Kubernetes promises portability and scalability, but in reality, most production outages happen due to avoidable mistakes. Security gaps, misconfigured health checks, poor scaling strategies—all can derail even experienced teams.
In this session, we’ll uncover:
Security Disasters → The risk of running containers as root, overlooked image vulnerabilities, and RBAC pitfalls.
Configuration Catastrophes → Why “works on my machine” never works, and how resource mismanagement wrecks clusters.
Observability Blind Spots → Missing runtime security monitoring, misleading CPU/memory metrics, and logging anti-patterns.
Scaling Traps → HPA-induced thrashing, node scheduling inefficiencies, and bottlenecks hidden until too late.
But it’s not just about problems—we’ll explore solutions:
Actionable hardening checklists
Tools for continuous monitoring & runtime security
Automation strategies to prevent config drift
Proven observability practices for anomaly detection
Whether you’re new to Kubernetes or running enterprise-scale clusters, you’ll leave this session with practical, battle-tested strategies to keep your systems safe, stable, and observable.
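One concrete counter to HPA-induced thrashing is a scale-down stabilization window. Here is a hedged sketch using the Kubernetes Python client; the target Deployment, thresholds, and windows are illustrative assumptions, not tuned recommendations:

```python
# pip install kubernetes
from kubernetes import client, config

config.load_kube_config()

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="web-hpa", namespace="default"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="web"),
        min_replicas=2,
        max_replicas=20,
        metrics=[client.V2MetricSpec(
            type="Resource",
            resource=client.V2ResourceMetricSource(
                name="cpu",
                target=client.V2MetricTarget(
                    type="Utilization", average_utilization=70)))],
        behavior=client.V2HorizontalPodAutoscalerBehavior(
            scale_down=client.V2HPAScalingRules(
                stabilization_window_seconds=300,  # wait 5 min before shrinking
                policies=[client.V2HPAScalingPolicy(
                    type="Percent", value=10, period_seconds=60)],  # <=10%/min
                select_policy="Min")),
    ),
)
client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa)
```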
Kaustubha Shravan is a Cloud Architect who designs and operates resilient, measurable, and cost-efficient platforms across Azure, AWS, and GCP. She blends reliability engineering with data-driven practices—SLOs, error budgets, and ML-assisted incident response—to make outages rare and recovery fast. With 46+ cloud certifications, she has led initiatives such as Benchmarking-as-a-Service and production-grade ML inference pipelines that improved performance while cutting spend. Kaustubha is a Women Techmakers Ambassador and frequent community mentor; her work has been showcased at NeurIPS workshops. She speaks about pragmatic reliability patterns, observability that drives action, and culture—how to turn postmortems into durable engineering improvements. When she’s not shipping guardrails, she’s helping teams adopt sustainable, privacy-aware AI practices and sharing playbooks that teams can put to work immediately.
This session is for SREs, DevOps engineers, and platform teams who want to strengthen Kubernetes security at the network layer. While cloud providers offer firewalls and service meshes, the last line of defense inside the cluster is Network Policies.
The talk balances concepts and live demos, and provides a clear journey: starting from the “default open cluster,” then applying Network Policies step by step to enforce strict communication.
I will use Calico on GKE to illustrate examples, but the learnings apply to any Kubernetes distribution. The session ensures attendees leave with concrete policies they can apply to their workloads.
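For a flavor of that first step, here is a minimal sketch applying a default-deny policy with the Kubernetes Python client; the `demo` namespace is an assumption, and the equivalent manifest works on any distribution whose CNI enforces NetworkPolicies:

```python
# pip install kubernetes
from kubernetes import client, config

config.load_kube_config()

# Empty pod selector = every pod in the namespace; listing both policy
# types with no allow rules denies all ingress and egress by default.
deny_all = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="default-deny-all", namespace="demo"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(),
        policy_types=["Ingress", "Egress"],
    ),
)
client.NetworkingV1Api().create_namespaced_network_policy("demo", deny_all)
```

Subsequent policies then allow only the flows each workload actually needs, which is the step-by-step tightening the session walks through.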
Sanket works as a Sr. DevOps Engineer at Lloyds Banking Group and is a Google Developer Expert in GCP. He previously worked as a cloud engineer at Searce (a GCP Premium Partner). He is a Google Cloud Champion Innovator in the Modern Architecture category, a Certified Kubernetes Administrator (CKA), 3x certified in GCP, and 1x certified in Azure. He has been helping small and mid-sized startups adopt and implement best practices in cloud and DevOps culture to speed up their software delivery. He loves to integrate various GCP services, like Vertex AI, Agent Development Kit, MCP servers, networking, compute, storage, containers, and GKE, to implement various use cases. He also loves to explore and deep-dive into GCP services, and helps the community by creating content and writing Medium blogs.
In modern applications, observability is essential. OpenTelemetry (OTel) has emerged as the standard for telemetry data, providing a unified way to collect, process, and export logs, metrics, and traces. Beyond data collection, organizations need effective ways to store, analyze, and visualize this telemetry to drive actionable insights.
This session is for developers looking to deepen their understanding of OpenTelemetry's architecture and its three key pillars: logs, metrics, and traces. We will walk through how OTel fits into an observability stack, the benefits it brings, and common integration patterns. A small demonstration will showcase how telemetry data can be captured and visualized in practice.
Audience Takeaway: A clear understanding of OpenTelemetry’s core components, how it supports observability across modern applications, and hands-on insights through a live demo.
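As a preview of what such a demonstration might look like, here is a minimal OpenTelemetry Python sketch that emits a trace span to the console; the service and span names are illustrative:

```python
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire a tracer provider that batches spans and prints them to stdout;
# in production the exporter would typically be OTLP to a collector.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")  # example attribute
```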
Advisory Software Engineer at IBM Labs with a passion for Observability, APM, IoT, and Automotive solutions. A 3x Patent Holder and innovator in telemetry and distributed systems, specializing in Microservices, Kubernetes, and Cloud technologies. Expert in Java, Kafka, MongoDB, ElasticSearch, and Cassandra. Committed to building scalable, high-performance solutions that drive real-time insights and enhance system reliability.
What We’ll Cover
- Build-intensive repositories consuming 60–70 CPU cores per run, showcasing CI/CD performance at extraordinary scale and speed.
- Solving Critical Edge Cases: Fortifying orchestration against cascading failures caused by unpredictable GitHub service outages.
- Cost Optimization: Implementing intelligent workflow filters to eliminate unnecessary runs and maximize efficiency (see the sketch after this list).
- Performance Enhancements: Leveraging caching of packages and builds to speed up performance and reduce redundant work across workflows.
- Operational Standards
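As referenced in the Cost Optimization item, here is a hedged sketch (not Zomato's implementation) of one such filter: cancelling queued workflow runs that a newer commit on the same branch has superseded, via the public GitHub REST API. Owner, repo, and token are placeholders:

```python
# pip install requests
import requests

OWNER, REPO, TOKEN = "example-org", "example-repo", "ghp_..."  # placeholders
HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Accept": "application/vnd.github+json",
}
BASE = f"https://api.github.com/repos/{OWNER}/{REPO}/actions/runs"

def cancel_superseded(branch: str) -> None:
    """Keep only the newest queued run on a branch; cancel the rest."""
    runs = requests.get(
        BASE, headers=HEADERS,
        params={"branch": branch, "status": "queued"}, timeout=10,
    ).json()["workflow_runs"]
    runs.sort(key=lambda r: r["created_at"], reverse=True)
    for stale in runs[1:]:
        requests.post(f"{BASE}/{stale['id']}/cancel", headers=HEADERS, timeout=10)

cancel_superseded("main")
```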
Key Takeaways
For Engineers and Operators:
- Scalable Infrastructure: Design ephemeral ECS-based self-hosted runners that handle massive CI/CD workloads effortlessly.
- Cost & Performance Optimization: Unlock strategies to reduce costs and maximize efficiency with self-hosted runners.
- Accelerated Job Runtimes: Improve provisioning speed and leverage caching for faster workflows.
- Automation & Standardization: Enforce linters and automate standards across repositories.
- End-to-End Deployment Cycle: Build deployment workflows with safeguards, visibility, and approval gates.
- Advanced Reliability Features: Implement automation to enhance reliability during deployments while reducing human intervention.
GitHub Actions as a Unified Platform
The talk emphasizes GitHub Actions as a single platform for CI/CD, streamlining workflows, improving developer experience, and eliminating the need for multiple tools.
By sharing Zomato’s journey, we aim to inspire practitioners to rethink and enhance their own CI/CD setups—making them more resilient, efficient, and developer-focused.
Akshat Goel is a Software Engineer II at Zomato with over three years of experience in building scalable systems. He holds a Bachelor's degree in Computer Science from IIT Ropar (2018–2022) and has a strong background in backend development and distributed systems.
Nishant Sarraff is an SDE-2 at Zomato with over three years of experience in developing scalable software solutions. He earned his B.Tech in Computer Science from IIT Jodhpur (2018–2022) and has expertise in backend systems and performance optimization.
Why should SREs care about systems thinking applied in aviation safety engineering?
Modern distributed systems face the same challenges as complex safety-critical systems: emergent failures, cascading outages, and the gap between system design and runtime behavior. While traditional monitoring focuses on known failure modes, aviation safety engineering provides systematic approaches to identify unknown risks.
In this talk, you will learn how we applied MIT's System-Theoretic Process Analysis (STPA) — a methodology from aviation safety — to analyze reliability risks in a large-scale eCommerce platform processing millions of transactions daily.
What you will learn:
- How to map STPA concepts (hazards, constraints, control actions) to distributed systems components
- A systematic framework for identifying cascading failure scenarios before they occur
- Practical techniques for analyzing interactions between auto-scaling, load balancing, and circuit breakers
- How cascading failures were reduced using insights from this analysis
- Actionable methods you can apply to your own systems
Real examples covered:
- Circuit breaker coordination failures that created retry storms
- Auto-scaling feedback loops that amplified rather than dampened failures
- Security policy interactions that blocked legitimate traffic during incidents
- Configuration drift detection that prevented silent reliability degradation
This is not theoretical — we'll show concrete code examples, architecture diagrams, and actual incident data. You'll leave with a practical toolkit for systematic reliability analysis that goes beyond traditional SRE approaches.
Whether you are dealing with microservices, serverless architectures, or hybrid cloud deployments, this methodology will help you build and maintain more resilient systems by thinking systematically about failure modes and control structures.
Perfect for: SREs, Platform Engineers, and Engineering Managers who want to move beyond reactive incident response to proactive reliability engineering.
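To ground the retry-storm example above, here is a minimal, illustrative sketch (not the talk's actual code) of a circuit breaker paired with jittered backoff, exactly the kind of control action and feedback loop an STPA analysis examines:

```python
import random
import time

class CircuitBreaker:
    """Opens after N consecutive failures, then probes again after a
    cool-down instead of letting every caller hammer the dependency."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a probe once the cool-down has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    # Full jitter desynchronizes clients, preventing the coordinated
    # retry waves ("retry storms") described above.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```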
Mahesh leads innovation in the application of artificial intelligence, data mining, and machine learning to software engineering. He has led successful implementations of natural-language-processing-driven test automation, usage and failure modeling using log analytics, empirical analysis of technical debt, and the application of knowledge graphs to discovering patterns and relationships for optimizing test suites and improving decision making in system integration projects. His passion is bridging the gap between theory and practice, and between academia and industry, along with creative thinking in software. He is a regular keynote speaker at many conferences. He is currently working on addressing uncertainty in fault prognosis and diagnosis.
Ever wonder how top teams keep complex systems running smoothly? This keynote will explore the current landscape of observability, moving beyond traditional pillars to embrace advanced techniques and holistic insights. We'll discuss how SRE teams are leveraging emerging technologies to navigate increasingly complex systems, transforming reactive firefighting into proactive engineering.