In today's fast-paced development landscape, meeting SLAs is crucial for any SaaS organisation to maintain a competitive edge. Join us to learn how to build and manage SLOs that keep customer SLAs from being breached. As an SRE, it's crucial to break down barriers between teams and communicate effectively.
The ability to meet Service Level Agreements (SLAs) is critical in today's fast-paced business world. For our organisation, the challenge lies in managing millions of conversations between employees and banks, each with its own unique ID. An internal process takes each ID through a sequence of steps, each with its own processing state, until the final state is reached. The ID is then marked as complete and made visible to customers through our user interface. Each step is owned by a dedicated team, such as the Capture, Indexing, Archive Insights, and Ingestion teams. Our SLA with customers mandates that every ID reach the final state within 3 days of being received. While 99% of IDs move through the process smoothly, about 1% encounter issues such as indexing failures, fatal errors, or XML parsing errors, which prevent them from reaching the final state.

Our SRE team has developed an internal app, built with Python, FastAPI, Celery, and MongoDB, that automates the verification of ID states and ensures that IDs are moved to the final state within the SLA timeframe. The app bundles four scripts that run as scheduled jobs on Celery workers. First, a daily script verifies the states of the IDs received in the t-3 day window and moves those that are not in the final state into a static collection in MongoDB. Second, another script periodically verifies and updates the latest states of these IDs, including those moved into the collection on previous days, and removes any that have reached the final state. Third, an auto-remediation script runs at regular intervals and retries the failed IDs by invoking the respective team's internal API. Finally, a daily script creates an incident for the remaining failed IDs using the Atlassian Jira API.
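The four scripts described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: it uses plain in-memory dictionaries in place of MongoDB, omits Celery scheduling and the real Jira call, and reads "t-3 days" as "received within the last three days". The state names, the `TrackedId` class, and every function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Iterable

FINAL_STATE = "complete"        # assumed name of the final state
SLA_WINDOW = timedelta(days=3)  # from the text: final state within 3 days


@dataclass
class TrackedId:
    id: str
    state: str
    received_at: datetime
    retries: int = 0


def collect_pending(records: Iterable[TrackedId], now: datetime) -> Dict[str, TrackedId]:
    """Daily check: keep IDs received in the window that are not yet final."""
    cutoff = now - SLA_WINDOW
    return {
        r.id: r
        for r in records
        if r.received_at >= cutoff and r.state != FINAL_STATE
    }


def refresh(tracked: Dict[str, TrackedId], lookup_state: Callable[[str], str]) -> None:
    """Periodic check: update each state and drop IDs that became final."""
    for id_ in list(tracked):  # copy the keys so we can delete while looping
        tracked[id_].state = lookup_state(id_)
        if tracked[id_].state == FINAL_STATE:
            del tracked[id_]


def remediate(tracked: Dict[str, TrackedId], retry: Callable[[TrackedId], None]) -> None:
    """Auto-remediation: call the owning team's retry hook for each ID."""
    for rec in tracked.values():
        retry(rec)  # stands in for a call to a team's internal API
        rec.retries += 1


def build_incident(tracked: Dict[str, TrackedId]) -> dict:
    """Daily report: build a summary of the IDs that are still failing."""
    ids = sorted(tracked)
    return {
        "summary": f"{len(ids)} ID(s) not in final state within the SLA window",
        "description": "\n".join(f"{i}: {tracked[i].state}" for i in ids),
    }


# Example run with made-up data.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
records = [
    TrackedId("a1", "complete", now - timedelta(days=1)),
    TrackedId("b2", "indexing_error", now - timedelta(days=2)),
    TrackedId("c3", "xml_parse_error", now - timedelta(days=2)),
]
tracked = collect_pending(records, now)
latest = {"b2": "complete", "c3": "xml_parse_error"}
refresh(tracked, latest.__getitem__)
remediate(tracked, lambda rec: None)  # no-op retry for the example
incident = build_incident(tracked)
```

In a real deployment, each function would typically be a separately scheduled task, and the dictionary returned by `build_incident` would be mapped onto whatever fields the incident tracker expects.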
As soon as an incident is created, alerts are sent to the respective teams with the status of the failed IDs. The teams are notified through Slack, Teams, and email, enabling them to act promptly and address the issues before the SLAs are breached. All of these scripts are designed so that no manual toil is involved: they automatically trigger the respective application functionality through internal APIs and create Jira incidents only for the failures that remain after the retries. This lets the teams act on failed IDs well before the SLOs are broken, delivering a seamless experience to our customers. Visibility for the customer as we reconcile every ID is also crucial for transparency, customer confidence, and progress tracking.

With this tooling, the SRE team has built an efficient and reliable system for managing the processing of millions of IDs, ensuring that customer SLAs are met consistently. The system's ability to automatically detect failed IDs and alert the owning teams has helped them address issues proactively and prevent potential SLA breaches, ultimately improving customer satisfaction and retention. This approach not only allows us to meet customer expectations but also enables us to continuously improve our processes so that we consistently deliver quality service. Join us at "Slaying the SLAs: Mastering Effective Communication for Seamless Customer Experience" to learn more about how to handle SLAs and achieve operational excellence in your organisation.