Monitoring and observability are the backbone of SRE, as they are of all engineering. When designing and implementing platforms, understanding how monitoring and observability telemetry data impacts your systems is critical to operating at scale.
At modern cloud-fleet scale, and as systems grow more complex, the volume of telemetry data (logs, metrics, and traces) sent to observability (o11y) systems can quickly exceed manageable limits. This talk dives into tried-and-true methods for defining and enforcing telemetry data quotas for OpenTelemetry (OTel) Collectors, using the CAP theorem as a framework for better control over telemetry volume and distribution. Drawing on our research and its application, we'll share strategies for efficiently measuring data from multiple OTel agents, ensuring rule-based data distribution, and addressing the practical implications of the CAP theorem within telemetry pipelines for cloud native systems. By reframing CAP principles in the context of telemetry data, we'll consider the trade-offs between consistency, availability, and partition tolerance when scaling and managing quotas across distributed systems.
Shubhanshu Surana is a Software Engineer specializing in Observability, currently contributing to innovative solutions at Sawmills. Previously at Adobe, he played a key role in scaling the company’s metrics and tracing infrastructure using CNCF tools like Prometheus, OpenTelemetry Collector, Jaeger, and Grafana. Shubhanshu has presented at industry conferences like the Linux Foundation OSS NA Summit, showcasing his expertise in distributed tracing and telemetry data management. His career also includes impactful roles at VF Corporation and Accenture, where he worked on AWS cloud solutions, API integrations, and automation.