In today's fast-paced development landscape, meeting SLAs is crucial for any SaaS organisation that wants to maintain a competitive edge. Join us to learn how to build and manage SLOs that prevent customer SLAs from being breached. As an SRE, it's crucial to break down barriers between teams and communicate effectively.
The ability to handle Service Level Agreements (SLAs) is critical in today's fast-paced business world. For our organisation, the challenge lies in managing millions of conversations between employees and banks, each with its own unique ID. An internal process takes these IDs through a sequence of steps, each with its own processing state, until the final state is reached. Once this happens, the ID is marked as complete and made visible to customers through our user interface. Each step is owned by a dedicated team, such as the Capture team, Indexing team, Archive Insights team, and Ingestion team.
Our SLA with customers mandates that IDs must reach the final state within 3 days of being received. While 99% of the IDs move through the process smoothly, about 1% encounter issues such as indexing failures, fatal errors, or XML parsing errors, which prevent them from reaching the final state. Our SRE team has developed an internal app based on Python FastAPI, Celery, and MongoDB that automates the verification of ID states and ensures that IDs are moved to the final state within the SLA timeframe. The app bundles the following scripts, which run in Celery workers as scheduled jobs:
A daily script verifies the states of the IDs received for t-3 days and moves the IDs that are not in the final state into a static collection in MongoDB.
Another script periodically verifies and updates the latest states of these IDs and removes them once they have reached the final state. It also updates the latest states of the IDs that were moved into the collection on previous days.
An auto-remediation script runs at regular intervals and retries the failed IDs by invoking the respective team's internal API.
Finally, another daily script creates an incident for the remaining failed IDs using the Atlassian Jira API. As soon as the incidents are created, alerts are sent to the respective teams with the status of the failed IDs. The teams are notified about the incident via Slack teams and email, enabling them to act promptly and address the issues before the SLAs are breached.
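The four scripts above can be sketched as plain Python functions. This is only an illustrative sketch under stated assumptions, not the team's actual implementation: an in-memory dict stands in for the MongoDB collection, the callables passed in stand in for the internal team APIs and the Jira API, and the state names, the Conversation class, and the grouping of incidents per team are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

FINAL_STATE = "COMPLETE"  # hypothetical name for the final state


@dataclass
class Conversation:
    """Hypothetical record for one conversation ID moving through the pipeline."""
    id: str
    state: str
    owner_team: str  # team that owns the step the ID is currently in


def snapshot_pending(records: Iterable[Conversation],
                     tracked: Dict[str, Conversation]) -> None:
    """Script 1: track every ID that has not reached the final state."""
    for rec in records:
        if rec.state != FINAL_STATE:
            tracked[rec.id] = rec


def refresh(tracked: Dict[str, Conversation],
            lookup_state: Callable[[str], str]) -> None:
    """Script 2: update each tracked ID's state and drop the completed ones."""
    for cid in list(tracked):  # copy the keys because entries are deleted in the loop
        tracked[cid].state = lookup_state(cid)
        if tracked[cid].state == FINAL_STATE:
            del tracked[cid]


def remediate(tracked: Dict[str, Conversation],
              retry: Callable[[str, str], None]) -> None:
    """Script 3: ask the owning team's (assumed) retry hook to reprocess each ID."""
    for rec in tracked.values():
        retry(rec.owner_team, rec.id)


def open_incidents(tracked: Dict[str, Conversation],
                   create_incident: Callable[[str, List[str]], str]) -> List[str]:
    """Script 4: raise one incident per team (an assumption) for the IDs still failing."""
    by_team: Dict[str, List[str]] = {}
    for rec in tracked.values():
        by_team.setdefault(rec.owner_team, []).append(rec.id)
    return [create_incident(team, sorted(ids))
            for team, ids in sorted(by_team.items())]


if __name__ == "__main__":
    tracked: Dict[str, Conversation] = {}
    snapshot_pending(
        [Conversation("id-1", "COMPLETE", "capture"),
         Conversation("id-2", "INDEX_ERROR", "indexing")],
        tracked,
    )
    remediate(tracked, lambda team, cid: print(f"retry {cid} via {team}"))
    refresh(tracked, lambda cid: "INDEX_ERROR")  # pretend the retry did not help
    print(open_incidents(tracked, lambda team, ids: f"{team}: {ids}"))
```

Passing the state lookup, retry, and incident-creation steps in as callables keeps the sketch independent of any particular client library. In a real deployment each function would presumably be a Celery task running on its own schedule.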
Furthermore, all of these scripts are designed to eliminate manual toil: they automatically trigger the respective application functionality through internal APIs and create Jira incidents that inform the teams about the failures remaining after the retries. This process ensures that the teams can act on the failed IDs promptly, well before the SLOs are broken, thereby delivering a seamless experience to our customers.
Additionally, giving customers visibility as we reconcile every ID is crucial for transparency, customer confidence, and progress tracking.
By leveraging this tooling, the SRE team has developed an efficient and reliable system for managing the processing of millions of IDs, ensuring that customer SLAs are met consistently. The system's ability to automatically detect failed IDs and alert teams has helped them address issues proactively and prevent potential SLA breaches, ultimately leading to improved customer satisfaction and retention. This method not only allows us to meet our customers' expectations, but also enables us to continuously improve our processes so that we consistently deliver quality service.
Join us at "Slaying the SLAs: Mastering Effective Communication for Seamless Customer Experience" to learn more about how to handle SLAs and achieve operational excellence in your organisation.
As a Lead SRE Engineer with 15+ years of industry experience, I have designed, built, and maintained robust CI/CD pipelines, implemented DevOps principles, and led SRE teams. With a strong background in containerisation technologies like Docker and Kubernetes, as well as expertise in CI/CD tools like Azure DevOps and Jenkins, I am well-equipped to lead and mentor teams on SRE best practices.
In my current role at Smarsh, I designed and developed a new framework for SRE apps using Python's FastAPI, Celery, and Jinja2 modules, along with Redis, JavaScript, and HTML, bundling all SRE scripts into a single web app. This reduced manual toil and kept the SLAs intact, and I received the 'Lead by Example Award' for spearheading the design and build of this app. It is the same app that was discussed in the CFP.
In my previous role at Siemens, I played a major role in designing and building CI/CD pipelines, collaborated with architects and stakeholders, and contributed significantly to bringing infrastructure costs down. For instance, I reduced cloud infrastructure costs by $5,000 per month and brought down third-party vendor licences for cross-browser and cross-platform testing by automating the required setup using Selenium Grid and PowerShell scripts. Another achievement I am proud of is a fully auto-scaling test infrastructure, built with Jenkins and PowerShell scripts, which seamlessly simulates the hospital environment, installs software using DSC, executes a test automation suite of 2000 tests, captures the test results in Jenkins, and leaves no footprint. I presented a paper on this topic at the Software Testing Conference, 2016, and it was among the top 250 white papers.
My objective is to continue coaching, mentoring, and empowering team members and people across teams on SRE best practices while delivering high-quality results that exceed expectations.