55 Stockton St, San Francisco,
CA 94108, United States
This talk centers on how SREs can leverage ML, using Adaptive Reliability to predict, analyze, and alert on the symptoms and root cause of an issue in order to prevent downtime and system outages.
Jennifer Rahmani is the Co-Founder/COO of Thoras.ai, which leverages advanced, adaptive reasoning (ML-based) technology to detect symptoms for downtime prevention, rapidly uncover root causes, and discover optimization opportunities. Previously, she spent a decade as a DevOps Engineer in defense tech, building and architecting resilient, scalable cloud workloads and monitoring solutions for large-scale systems.
In this talk, we will share lessons learned from taking a new, first-principles approach to metrics observability that achieves order-of-magnitude gains in cost-efficiency and simplicity. By eliminating disks and egress costs entirely with an AZ-aware architecture, you can achieve virtually limitless scale with zero management overhead. And by making an infrastructure primitive 10x cheaper, we can unlock new use cases for metrics observability. A columnar storage engine lets you send labels of any cardinality (customer-id, request-id, user-id, pod-id, camera-id, etc.). Most observability tools downsample metrics (to 1-minute, 5-minute, or 1-hour resolution) depending on how old the data is. In this talk, we will cover how to retain high-fidelity data at all times, with no downsampling at all, while remaining cost-efficient.
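The columnar idea above can be sketched in a few lines. The toy store below is purely illustrative (it is not Oodle's actual engine): it treats every label, however high-cardinality, as just another column, and keeps every raw sample with no downsampling.

```python
# Toy columnar metric store: each label is simply another column,
# so cardinality of customer_id, pod_id, etc. imposes no schema limit.
from collections import defaultdict

class ColumnarMetricStore:
    def __init__(self):
        self.columns = defaultdict(list)  # column name -> list of values
        self.rows = 0

    def append(self, timestamp, value, **labels):
        """Store one raw sample; any label key becomes a column."""
        self.columns["timestamp"].append(timestamp)
        self.columns["value"].append(value)
        for name, val in labels.items():
            self.columns[name].append(val)
        self.rows += 1

    def query(self, **filters):
        """Scan only the filtered columns, then materialize matching samples."""
        matches = [
            i for i in range(self.rows)
            if all(self.columns[k][i] == v for k, v in filters.items())
        ]
        return [(self.columns["timestamp"][i], self.columns["value"][i])
                for i in matches]

store = ColumnarMetricStore()
store.append(1, 0.9, customer_id="c-42", pod_id="p-1")
store.append(2, 1.3, customer_id="c-42", pod_id="p-2")
store.append(3, 0.7, customer_id="c-99", pod_id="p-1")
print(store.query(customer_id="c-42"))  # -> [(1, 0.9), (2, 1.3)]
```

Because queries scan columns rather than per-series indexes, adding a new high-cardinality label only adds a column, not millions of distinct time series.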
Kiran Gollu is the Co-founder/CEO of Oodle.ai. Previously, he was head of cloud platform engineering at Rubrik (NYSE: RBRK), where he was responsible for observability, distributed databases, reliability/cost engineering, and developer experience as the company scaled from under $5M to $600M in ARR. He was also co-founder of a YC-backed observability startup and an early engineer on AWS S3 and DynamoDB.
In this talk we will explore how to embed security into every stage of the software delivery process. Learn key security checkpoints and discover effective strategies to protect them for a seamless deployment from development to production.
As developers strive to move fast, security cannot be an afterthought. This talk explores how to embed DevSecOps practices throughout the software delivery process, from writing code on the developer's machine to deployment in production. We’ll discuss the critical security checkpoints along that path and effective strategies to ensure a seamless, secure deployment. Join us to learn how to protect your application delivery pipeline!
Siri Varma Vegiraju is a seasoned professional in healthcare, cloud computing, and security. Currently, he focuses on securing Azure Cloud workloads, leveraging his extensive experience in distributed systems and real-time streaming solutions. Prior to his current role, Siri contributed significantly to cloud observability platforms and multi-cloud environments. He has demonstrated his expertise through notable achievements in various competitive events and as a judge and technical reviewer for leading publications. Siri frequently speaks at industry conferences on topics related to cloud and security, and holds a master's degree from the University of Texas at Arlington with a specialization in Computer Science.
Every new request coming into your infrastructure may not necessarily bring additional revenue for the company, but it will almost certainly cost money to observe, given that most vendors today charge by data ingested. Each new request turns into N more logs, M more metrics, and K more spans. Observability costs have already risen to 20-30% of total cloud spend, and unless they are actively managed and evaluated for ROI, they will keep rising year over year. This talk will focus on what you should measure when it comes to ROI, why, and how. We will cover best practices and present a broader instrumentation philosophy, based on open standards like OpenTelemetry, that is targeted at ROI measurement. We will also focus on how to productize usage visibility and build a culture of consuming it across the broader engineering organization.
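As a hedged illustration of what usage visibility might look like, here is a toy cost-attribution sketch. The price table, team labels, and ingest records are all made up; only the idea of keying spend to OpenTelemetry-style resource attributes (such as `service.name`) comes from the talk's premise.

```python
# Hypothetical sketch: attribute observability ingest cost back to teams
# via OpenTelemetry-style resource attributes. All numbers are invented.
PRICE_PER_GB = {"logs": 0.50, "metrics": 0.25, "traces": 0.75}  # assumed rates

ingest = [  # (resource attributes, signal type, GB ingested this month)
    ({"service.name": "checkout", "team": "payments"}, "logs", 1200),
    ({"service.name": "checkout", "team": "payments"}, "traces", 300),
    ({"service.name": "search", "team": "discovery"}, "metrics", 900),
]

def cost_by_team(records):
    """Roll up ingest cost per owning team so each team can see its spend."""
    totals = {}
    for attrs, signal, gb in records:
        team = attrs["team"]
        totals[team] = totals.get(team, 0.0) + gb * PRICE_PER_GB[signal]
    return totals

print(cost_by_team(ingest))  # -> {'payments': 825.0, 'discovery': 225.0}
```

A report like this, refreshed regularly and surfaced to each team, is one way to start the usage-visibility culture the abstract describes.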
Ruchir is the co-founder and CEO of Cardinal, which helps engineers view machine data through a business lens. In his previous life, he spent 7 years as a Lead Engineer on Netflix's Observability team, where he built petabyte-scale Observability products that are used daily by thousands of Netflix engineers.
In today’s rapidly evolving threat landscape, security is not just an option; it’s a necessity. With high-profile incidents like Log4j, SolarWinds, and major data leaks making headlines, organizations have turned their focus to DevSecOps and the adoption of robust security practices to protect their software supply chains. Responding to the U.S. Executive Order 14028, which mandates transparency around software components, Broadcom recognized the need for an automated SBOM (Software Bill of Materials) solution that aligns with zero-trust principles and is both accessible and actionable.
Our team of Backstage.io experts pivoted from Developer Experience to Security, creating a custom SBOM plugin tailored to Broadcom's environment. This solution integrates seamlessly with Jenkins, GitHub Enterprise, and BlackDuck, generating comprehensive SBOMs in both SPDX and CycloneDX formats. The automated reports are published to secure repositories and presented via the Backstage interface, making them easily accessible to stakeholders with the right permissions.
This talk will delve into our journey of developing and deploying the SBOM plugin, emphasizing how it enhances supply chain security while streamlining compliance with cybersecurity standards. Attendees will gain insights into how Backstage.io can serve as a robust platform for security initiatives and will leave with practical steps for implementing a similar solution in their organizations.
Nishkarsh is a DevSecOps expert and an International GitHub Star. Nishkarsh is an ardent supporter of open-source, GitHub, DevEx, and DevOps. Nishkarsh serves as StatusNeo Inc.'s Principal Evangelist & Consultant. Over the years, he has been actively GitHubbing and contributing to open-source. By giving talks at conferences, organizing meetups, and encouraging people to take on the #100DaysofCode challenge, he has encouraged many brilliant minds to embark on their journeys in open-source projects and preach the significance of collaboration to aspiring developers.
Effectively learning from past incidents is crucial to improving MTTR. Despite implementing blameless postmortems, runbooks, collaborative incident response, and on-call handoff meetings, organizations still struggle to share and leverage collective knowledge effectively.
In this talk, I will explore why these traditional methods fall short and how misaligned incentives (e.g., no one’s promoted for writing runbooks) contribute to locking critical expertise away in the minds of individual experts.
I’ve built an open-source CLI that uses modern LLMs to meet developers where they are and automate the manual processes of knowledge sharing. The presentation will conclude with a live demonstration of our open-source product to show what’s possible. Here’s a 1 minute demo of Savvy in action: https://getsavvy.so/demo
Unlock the secrets to securing your Kubernetes clusters in a high-stakes digital world! Join us to explore cutting-edge strategies for protecting your control plane, nodes, and workloads. Learn practical tips on RBAC, encryption, and runtime protections to stay ahead of cyber threats. Don’t miss it!
As Kubernetes becomes a cornerstone of modern application deployment and scaling, its rapid adoption also heightens its attractiveness as a target for cyberattacks. With 63% of organizations projected to run Kubernetes in production environments by 2025, ensuring robust security for Kubernetes clusters is critical. In this session, we will explore a multi-layered approach to Kubernetes security designed to safeguard the control plane, nodes, and workloads against evolving threats.
We’ll dive into best practices for securing the control plane, such as implementing Role-Based Access Control (RBAC) for the API server, encrypting etcd data, and using network policies to protect intra-cluster communication. Node security will be addressed with a focus on runtime protections like SELinux and AppArmor to mitigate the risks of privilege escalation and lateral movement. Additionally, we'll examine how integrating secrets management tools like HashiCorp Vault can enhance the protection of sensitive data.
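As a concrete, purely illustrative example of one of these controls, a Kubernetes NetworkPolicy like the following restricts intra-cluster traffic so that only frontend pods may reach the API pods; all names and the namespace here are hypothetical.

```yaml
# Illustrative only: allow ingress to app=api pods solely from
# app=frontend pods on TCP 8080; all other ingress is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend-only
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
```

Note that NetworkPolicies are only enforced when the cluster's CNI plugin supports them.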
By adopting these security measures, organizations can fortify their Kubernetes deployments, ensuring not only operational efficiency but also resilience against potential disruptions. Join us to gain practical insights into securing Kubernetes clusters, learn from real-world challenges, and discover strategies to maintain the integrity of critical infrastructure in the face of escalating cyber threats.
Manpreet Singh Sachdeva is an accomplished DevSecOps, Infrastructure, and AI Specialist with extensive expertise in software engineering practices such as MLOps and Site Reliability Engineering. He holds a B.Tech in Electronics and Communication Engineering and a Post Graduate Diploma in Business Management, with a proven track record in cloud infrastructure security, automation, and AI/ML deployments across a wide range of applications including web, mobile, IoT, and client-server systems. Currently serving as a Technical Duty Officer at Walmart Global Tech, Manpreet is responsible for incident management, service restoration, and ensuring the availability of cloud infrastructure during critical business outages. His previous roles include MLOps Cloud Solution Architect at Domino Data Lab and DevSecOps SRE Lead at Verizon, where he built and maintained scalable machine learning pipelines, optimized cloud resources, and integrated AI models for real-time analytics. Manpreet has been instrumental in projects related to security, compliance, and incident management, with a focus on enhancing operational excellence through automation and cloud-native solutions. He holds several certifications, including AWS DevOps Engineer, Kubernetes Security Specialist, and Prometheus Certified Associate.
Unlock the future of Site Reliability Engineering with AI-driven assessments powered by Reinforcement Learning! Discover how personalized, real-time learning paths can boost your skills, enhance certification success, and keep you ahead in the fast-paced SRE world. Ready to revolutionize your growth?
As Site Reliability Engineering (SRE) continues to evolve, the need for continuous learning and skill development is paramount. Traditional assessment methods often fail to adapt to the fast-paced, dynamic nature of SRE roles. Reinforcement Learning (RL), a branch of AI, presents an innovative solution by enabling adaptive, real-time learning experiences tailored to the specific needs of engineers. Studies show that AI-driven learning powered by RL can boost retention and performance by up to 30%, providing a significant advantage in skill acquisition and mastery.
This presentation explores how RL can revolutionize learning for SREs by transforming assessments into personalized, ongoing learning paths. Through AI-driven platforms, SREs receive immediate, adaptive feedback that adjusts in complexity based on performance, leading to a more engaged and efficient learning process. RL-powered systems enhance cognitive retention, ensure mastery of critical SRE skills, and support certification preparation with a 15% higher success rate compared to traditional methods.
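To make the adaptive idea concrete, here is a deliberately simplified sketch. It uses a staircase rule (step difficulty up on success, down on failure) rather than true reinforcement learning, and it is not any specific platform's algorithm.

```python
# Toy adaptive assessment: question difficulty adjusts to the learner's
# performance, a simplified stand-in for an RL-driven learning path.
def run_session(answers, levels=("easy", "medium", "hard")):
    """answers: booleans, whether the learner answered each question correctly.
    Returns the difficulty level served for each question."""
    idx = 0  # start at the easiest level
    history = []
    for correct in answers:
        history.append(levels[idx])
        if correct and idx < len(levels) - 1:
            idx += 1   # mastered this level: step up
        elif not correct and idx > 0:
            idx -= 1   # struggled: step back down
    return history

print(run_session([True, True, False, True]))
# -> ['easy', 'medium', 'hard', 'medium']
```

A real RL system would replace the fixed staircase with a learned policy that weighs the learner's full history, but the feedback loop (observe performance, adjust complexity) is the same.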
In this session, we will showcase real-world applications of RL in SRE training environments, focusing on how it can optimize skill development, enhance certification success, and prepare SREs for the challenges of modern infrastructure management. This approach ensures that learning becomes a continuous, dynamic process, empowering SRE professionals to stay ahead in a rapidly changing technological landscape. Join us to explore how AI and RL can redefine the future of skill development in Site Reliability Engineering.
Vijay Valaboju is a seasoned Software Engineer with over 19 years of experience in designing, developing, and testing web and Windows applications using Microsoft technologies. He has a proven track record in cloud solutions architecture, having designed and developed numerous applications across various domains, including hybrid and entirely cloud-based systems. Currently serving as a Senior Software Engineer at Microsoft since 2015, Vijay has demonstrated exceptional expertise in project management, technical leadership, and client engagement. In his role, Vijay has been pivotal in developing strategic plans that enhance system performance and reduce incidents during complex project migrations. His leadership in end-to-end engineering responsibilities has ensured streamlined functionality delivery, minimal service disruptions, and operational improvements. Prior to his tenure at Microsoft, Vijay held significant roles at HCL America, Singularity, and Accenture, where he led cross-functional teams across multiple time zones, driving system modernization and delivering high-impact solutions. Vijay holds a Master of Business Administration and a Bachelor of Applications from Kakatiya University. His skill set encompasses project management, technical expertise, critical thinking, and quality assurance. Fluent in English, Hindi, Telugu, and Kannada, Vijay is known for his problem-solving abilities and technical communication skills. He continues to excel in cloud-based solution development and client-focused application delivery.
Unlock the future of software reliability with AI-powered code review and QA! Discover how AI can slash code review time by 75%, boost defect detection by 50%, and cut deployment failures in half. Learn actionable strategies to elevate your SRE practices, ensuring faster, more reliable software delivery.
The incorporation of artificial intelligence (AI) and machine learning (ML) in software development is transforming Site Reliability Engineering (SRE) practices, especially in the areas of code review and quality assurance (QA). This session will delve into the tangible benefits of AI-driven tools for SREs, focusing on how they enhance code quality, streamline development workflows, and fortify system reliability. AI-enabled automation in bug detection, test generation, and real-time feedback has proven to cut code review time by up to 75% and improve defect detection rates by 50%, significantly reducing production incidents. For instance, AI tools like Amazon’s CodeGuru can identify up to 90% of critical code issues, resulting in a 30% reduction in bugs in live environments.
In the realm of quality assurance, AI is revolutionizing testing strategies through adaptive, intelligent automation, achieving 30% increased test coverage and a 40% decrease in manual testing efforts. By analyzing historical data, AI-powered QA solutions can predict and prioritize bug-prone areas with 75% accuracy, ensuring a more focused and efficient testing process. Integrating AI into CI/CD pipelines has not only accelerated deployment cycles by 40% but also decreased deployment failures by up to 60%, contributing to higher system reliability and uptime.
Attendees will gain insights into how AI is redefining SRE practices by facilitating early issue detection, reducing technical debt by 25%, and enhancing project predictability. This session will outline actionable strategies for integrating AI into the SRE toolkit, enabling the delivery of more reliable software, faster deployment cycles, and greater resilience in complex system environments.
Prakash R. Ojha is a passionate and committed technology leader with over a decade of experience in designing and developing cutting-edge software solutions across desktop, web, and mobile platforms. Currently, he serves as a Software Architect at Wipro Ltd., working on projects for BNY Mellon that manage $50 trillion in assets, providing solutions that identify billions of dollars in tax relief. His expertise spans the entire Software Development Life Cycle (SDLC), with mastery in cloud-native applications, microservices architecture, CI/CD pipelines, and secure Spring Boot applications. Prakash holds a Master’s degree in Computer Science from the Georgia Institute of Technology and a Bachelor’s in Computer Science from Hope College. He is an Oracle Certified Developer in various Java technologies, including web services and enterprise beans. His technical acumen includes proficiency in front-end technologies like Angular and JavaScript and in back-end microservices, leveraging cloud platforms such as AWS and Azure. Throughout his career, Prakash has led teams of engineers to deliver complex solutions, modernize legacy systems, enhance performance, and drive digital transformation. His notable work includes implementing real-time data processing, reducing market event processing times from 96 to 72 hours, and creating secure microservices integrated with Kafka and Spring Security for authentication. He has consistently demonstrated a deep understanding of agile methodologies, event-driven architecture, and cloud-native ecosystems. In addition to his development work, Prakash is an advocate for best practices in secure coding and scalable architecture, mentoring engineers, and driving projects to successful completion. His experience across industries, from finance to marketing, demonstrates his versatility and commitment to delivering innovative solutions that meet business needs.
Many SREs view AI with a sense of uncertainty or even fear, worried about what it might mean for their role. But automation has always been a core principle of SRE, and AI should be seen as a leap forward in that mission—the ultimate automation frontier. For decades, we’ve sought to automate repetitive tasks and streamline operations, only to encounter fragile, incomplete solutions. AI offers the potential to overcome those limitations, tackling complex tasks that have long resisted automation. Importantly, AI doesn’t replace the human element in SRE; it enhances it. By eliminating the toil of manual, repetitive work, AI allows SREs to focus on higher-level strategic thinking, creative problem-solving, and driving innovation. In essence, AI helps bring humanity back into the role, empowering SREs to do what they do best—keeping systems reliable, resilient, and ready for the future.
Tina Huang is a VP of Product and Engineering at Harness. Previously, she was the founder and CTO of Transposit, an AI-powered incident management platform. Tina advocates for a human-centric approach to solving complex engineering challenges, focusing on leveraging AI to augment human capabilities rather than replace them. Tina began her career at Apple, where she designed and built APIs for the company’s application framework. As one of Google’s early engineers, she worked on the Blogger team and was instrumental in re-architecting the Google News frontend. At Twitter, she led the architecture, scaling, and operation of the company’s notification platform and also contributed to the developer productivity and build tools team. These experiences fueled her passion for enhancing DevOps and optimizing engineering workflows. Tina holds a degree in electrical engineering and computer science from the Massachusetts Institute of Technology. She also studied humanities at the University of Chicago, shaping her perspective on human-technology interaction.
How much hype is there around Platform Engineering? A lot.
Overall, the idea of Platform Engineering isn’t new, but what's new is the path it's taking.
Because of that, we need a production-ready approach to get Platform Engineering right without creating more tech debt.
In this session, you’ll go from theory to hands-on practice and learn how to create a proper platform. It will cover:
First, we'll talk about the various details that all engineers will need to know to create and configure a proper, production-ready Kubernetes cluster that works with Platform Engineering.
Next, we'll dive into the capabilities that will exist on the Kubernetes cluster in the Platform Engineering environment. These capabilities can be anything from GitOps to monitoring to cost and resource optimization; it all depends on what the engineers using the platform need.
Lastly, you'll learn how an end-user (the engineer using a Platform Engineering environment) can interact with it. Is it an IDP? A CLI? Another automated solution?
You’ll see everything from Kubernetes to Crossplane to Backstage and everything in between. You’ll also learn how to think about Platform Engineering as a whole when it comes to using Kubernetes as your underlying platform.
Michael Levan is a Distinguished Engineer in the Kubernetes and Security space who spends his time working with startups and enterprises around the globe on Kubernetes consulting, training, and content creation. He is a trainer, 4x published author, podcast host, international public speaker, CNCF Ambassador, and was part of the Kubernetes v1.28 Release Team. Want to see what he is up to? https://www.michaellevan.net/