Defining Reliability through User Objectives

Tomasz Czajka, Ciaran Gaffney & Pascal Schlumpf

Tenable

In this talk, we’ll explore how we revolutionized our SLO practices by introducing User Objectives—customer-experience-focused metrics that transcend individual services. This approach transformed our SRE function from a traditional embedded model to a centralized Application SRE team, fostering collaboration with product and engineering teams across the organization.

We'll share how we:
- Collaboratively defined User Objectives with PMs and Engineering Leaders.
- Mapped dependencies across services and datastores to create a robust dependency graph.
- Built SLOs centered on User Objectives, with secondary metrics for individual services.
- Established effective processes like weekly SLO reviews for product teams and monthly Production Reviews with senior leaders.
- Introduced meaningful alerting using error budgets and burn rates.
- Developed an SLO framework that automates dashboards, monitors, and metrics.
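
To make the error budget and burn rate items above concrete, here is a minimal sketch of how a burn-rate alert can be computed. It is not the speakers' framework: the `Window` type, the function names, and the 14.4 threshold (a commonly cited value for a one-hour window against a 30-day SLO) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window (hypothetical type)."""
    total: int
    failed: int


def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error rate divided by the budget; 1.0 means exactly on budget."""
    if window.total == 0:
        return 0.0  # no traffic, so nothing is being burned
    return (window.failed / window.total) / error_budget(slo_target)


def should_page(long_w: Window, short_w: Window, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window exceed the threshold.

    Requiring the short window as well stops the alert from firing long
    after the problem has already recovered. The threshold is an assumed
    default, not a value taken from the talk.
    """
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)
```

With a 99.9% objective, a window of 10,000 requests with 200 failures burns budget at roughly 20 times the sustainable rate, which would page under the assumed threshold if the short window agrees.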

This evolution redefined SRE’s role in our company, establishing a true partnership with product teams. By creating feedback loops that balance features and stability, we’ve elevated our understanding of product reliability and improved the customer experience.

Tomasz Czajka is a seasoned professional with extensive experience in DevOps, DataOps, and Site Reliability Engineering (SRE). His career spans leadership roles, automation projects, and performance optimization across high-profile organizations like Tenable, Deutsche Bank, and Citrix. Tomasz has led significant projects, including the development of scalable analytics workflows, robust CI/CD pipelines, and performance-enhancing solutions on large-scale infrastructure. His technical expertise includes Python, Bash, Docker, Kubernetes, and tools for hypervisor platforms. Tomasz is also adept at driving collaboration, implementing cutting-edge tools, and improving process efficiencies. With a strong academic foundation in IT from Wrocław University of Science and Technology and Politechnika Opolska, he embodies continuous learning and excellence in engineering practices.

Ciaran Gaffney brings over a decade of diverse experience in Site Reliability Engineering (SRE) and software development. His journey spans key roles at Tenable and Hosted Graphite, where he demonstrated expertise in designing and maintaining distributed systems, enhancing reliability, and implementing innovative solutions such as dynamic load balancing and a gRPC-based aggregation layer. At Hosted Graphite, he was integral in scaling a system that processed 160 billion data points daily while ensuring SLAs were consistently met. Notable achievements include developing Hosted Graphite's alerting feature, creating customer-facing APIs, and leading automation and hardware provisioning efforts. With a strong foundation in Python, DevOps tools, and infrastructure management, Ciaran is a skilled problem solver passionate about system scalability, performance optimization, and customer satisfaction.

Pascal Schlumpf brings extensive expertise in site reliability engineering, software development, and monitoring systems to our event. With experience spanning over a decade in leading organizations such as Tenable and AT&T, Pascal has honed his skills in automation, monitoring integration, and reducing MTTR through innovative solutions. From building exporters for Prometheus and Grafana to integrating systems with AppDynamics and Splunk, Pascal has demonstrated a commitment to advancing observability and system reliability. We're excited to have him share his insights and practical approaches at the conference.

SREDAY

Site Reliability, DevOps and Cloud

March 27-28, 2025, London, UK

