SREDAY

Site Reliability, DevOps and Cloud

November 19-20, 2025 | Criteo, 32 Rue Blanche, 75009 Paris, France

2 Days | 20+ Speakers | 2 Tracks | 90 Attendees

Crash-Proofing Your OpenTelemetry Collector

Juliano Costa & Yuri Oliveira Sa
Datadog & OllyGarden

When planning observability for a distributed system, it's common to avoid having each microservice send telemetry data directly to the backend. Instead, a Collector is typically deployed per host or node to receive, process, and forward telemetry data.

This approach improves bandwidth usage and centralizes control over telemetry flow. However, it also introduces a critical point of failure: what happens if the Collector crashes after receiving data but before it can forward it to the backend?

In this session, we’ll walk through the most commonly used reliability mechanisms in the OpenTelemetry Collector and their limitations, such as potential data loss and limited control in fan-out scenarios. Then we’ll introduce a newly added OTLP exporter batching option (distinct from the batch processor), explain why it was needed and how it works, and demonstrate its behavior in a crash-and-recovery scenario together with the exporter helper and persistent queue.
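To make the crash-and-recovery idea concrete, here is a minimal Python sketch of a file-backed queue drained in batches. It is a toy model under stated assumptions, not the Collector's actual implementation: `PersistentQueue`, `export_batches`, `demo`, and the JSON-lines file layout are all hypothetical names and choices made for illustration. It shows only the core property that a batch is removed from durable storage after the send succeeds, so data still queued at the moment of a crash can be resent after a restart.

```python
import json
import os
import tempfile


class PersistentQueue:
    """Toy file-backed queue (hypothetical): items stay on disk until acknowledged."""

    def __init__(self, path):
        self.path = path
        self.items = []
        # On startup, reload whatever a previous process left behind.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.items = [json.loads(line) for line in f if line.strip()]

    def _flush(self):
        # Write to a temp file, then rename, so a crash cannot leave a half-written queue.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            for item in self.items:
                f.write(json.dumps(item) + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def put(self, item):
        self.items.append(item)
        self._flush()

    def peek_batch(self, max_size):
        # Return up to max_size items without removing them.
        return list(self.items[:max_size])

    def ack(self, count):
        # Drop the first `count` items once they have been delivered.
        del self.items[:count]
        self._flush()


def export_batches(queue, send, max_batch_size):
    """Drain the queue in batches; a batch is removed only after send() returns."""
    sent = 0
    while True:
        batch = queue.peek_batch(max_batch_size)
        if not batch:
            return sent
        send(batch)  # may raise; the batch then remains on disk
        queue.ack(len(batch))
        sent += len(batch)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "queue.jsonl")
    queue = PersistentQueue(path)
    for i in range(5):
        queue.put({"span_id": i})

    delivered = []
    calls = {"n": 0}

    def flaky_send(batch):
        # Simulate a crash on the second export attempt.
        calls["n"] += 1
        if calls["n"] == 2:
            raise RuntimeError("simulated crash")
        delivered.extend(batch)

    try:
        export_batches(queue, flaky_send, max_batch_size=2)
    except RuntimeError:
        pass  # the "process" dies here with three items still queued

    # "Restart": a fresh queue object reloads the unsent items from disk.
    recovered = PersistentQueue(path)
    export_batches(recovered, delivered.extend, max_batch_size=2)
    return [d["span_id"] for d in delivered]
```

Calling `demo()` delivers all five items despite the simulated crash. Note that this sketch gives at-least-once delivery: a crash between a successful send and the acknowledgement would cause the same batch to be sent again after restart.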

Juliano Costa is a Developer Advocate at Datadog with a focus on OpenTelemetry. He is a CNCF Ambassador passionate about fostering the Cloud Native community and spreading the word about OTel everywhere he goes. Juliano is an active contributor to the OpenTelemetry project and a maintainer of the OpenTelemetry Demo, as well as a member of the Developer Experience SIG.

When he's not talking about OTel, you'll find him chasing after his energetic 2-year-old.
