We think of containers as providing isolation for our applications, but a major source of performance interference remains unaddressed. Contention for CPU caches and memory bandwidth has been shown to increase tail response times by 4-13x and reduce compute efficiency by over 25%, even with per-application CPU and memory limits in place. With current telemetry, affected applications simply show high CPU utilization, leading operators to "throw more hardware at the problem", which is expensive and does little to bring response times back down.
In this talk, we'll cover three key areas:
1. Real-world triggers of interference, such as garbage collection and container image decompression.
2. How modern CPU features allow detecting interference and identifying noisy neighbors.
3. Practical approaches to mitigating these effects, including findings from Google's and Alibaba's production environments.
Finally, we'll provide a status report on our open-source effort to measure memory interference and discuss future directions.
Jonathan Perry is a Texas-based maintainer of the OpenTelemetry eBPF-based network collector. He researched performance isolation in datacenter and cloud networks at MIT, then founded Flowmill, which developed an eBPF-based Network Performance Monitoring collector. Flowmill was acquired by Splunk, and the collector is now part of OpenTelemetry.