How can Site Reliability Engineers (SREs) effectively develop a robust Vision Language Model (VLM) platform leveraging Kubernetes, OpenTelemetry (OTel), PyTorch, and Grafana? Such a platform enhances scalability and efficiency, empowering teams to innovate rapidly in artificial intelligence (AI) and machine learning (ML) while maintaining the observability and reliability of their systems.
SREs have worked with AI for over a decade, using AIOps tools to make sense of large volumes of observability data and becoming familiar with AI/ML along the way. That makes them a natural choice to tackle VLM platform hurdles.
This session applies Site Reliability Engineering principles, patterns, and practices to AI/ML platforms through the ModelOps operating model, and proposes a few ideas in answer to the following questions:
- How do SREs introduce service level indicators (SLIs) and service level objectives (SLOs) for VLM-powered applications through the ModelOps pipeline?
- What does observability mean in the new AI world?
- How do SREs identify and eliminate toil in such environments by streamlining AI experiments?
Benefits to the ecosystem
- Share practices and patterns for introducing SLIs and SLOs for AI/ML models and their serving apps through the ModelOps pipeline
- Explain the concept of observability in the new AI world
- Show how SREs identify and eliminate toil in such environments by streamlining AI experiments at the edge
- Show how SREs can be the missing piece in the AI arena for a more efficient AI/ML platform that aligns with responsible AI policies
Rod Anami is a seasoned engineer who works with cloud infrastructure and software engineering technologies. As one of the Site Reliability Engineers from the SRE@Kyndryl CoE, he coaches other SREs on running IT modernization, transformation, and automation projects for clients worldwide. Rod leads the global SRE guild inside Kyndryl, where he helps plant and grow SRE chapters in many countries. Rod is certified as an SRE, Technical Specialist, and DevOps Engineer professional at the highest levels. He holds AWS, HashiCorp, Azure, and Kubernetes certifications, among others. He is passionate about contributing to open-source software at large through Node.js libraries. Rod is also the author of the book "Becoming a Rockstar SRE".