All services emit telemetry data, but ensuring it is useful can be challenging. Too much, and you have a lot of noise; too little, and you can’t properly identify and troubleshoot issues. In 2023, it became clear to us at Mezmo that we had complex, high-cardinality metrics and needed to improve our observability practices and consolidate our multiple Prometheus instances. We wanted that elusive “single pane of glass” experience.
This session will detail our transformation from a Sysdig-centric approach to a more flexible, centralized observability strategy. We'll explore how we:
Jon Duarte is a Site Reliability Engineer II at Mezmo with expertise in Terraform, Kubernetes, and GitHub. His career journey includes roles as a Linux application support engineer and Microsoft SQL DevOps Database Administrator at iHeartMedia. Jon holds a BBA in Infrastructure Assurance with a minor in Information Systems from The University of Texas at San Antonio. Known for his curiosity and problem-solving skills, Jon is dedicated to continuous learning and improvement.