Customer-first: Moving from Hero Engineering to Reliability Engineering
Slack has always had a strong focus on the customer experience, and customer love is one of our core values. As Slack has grown from a small team to thousands of employees, this customer love has always included a focus on service reliability.
In a small startup, a reactive approach to reliability is manageable. For example, a single engineer can troubleshoot and solve a systemic issue; we know these people as Hero Engineers. You may also know this model as an operations team, or as a small team of Site Reliability Engineers who are always on call. As the company grows, those tried and practiced measures fail to scale, and you’re left with pockets of tribal knowledge and burned-out engineers as the system becomes too complex to be managed by only a few folks.
With any rapidly growing, complex product, it is hard to move away from a reactive focus on user-impacting issues. Reliability practitioners at Slack have developed effective ways to respond to, mitigate, and learn from these issues through Incident Management and Response processes and by fostering Service Ownership. Together, these contribute to a reliability-first culture. One of the key components of both the Incident Management program and the Service Ownership program is the Service Delivery Index.
If you’re driving a reliability culture in a service-oriented company, you must measure your service reliability before all else. This metric is essential for driving decision-making and setting customer expectations, and a single shared measurement lets teams speak the same language of reliability.
Introducing the Service Delivery Index
The Service Delivery Index – Reliability (SDI-R for short) is a composite metric that combines the success of jobs-to-be-done by Slack’s users with Slack’s uptime as reported on our Slack System Status site. It measures successful API calls and content delivery (as measured at the edge), along with important user workflows (e.g. sending a message, loading a channel, using a huddle).
This is a company-wide metric with visibility up to the executive level, and in practice it is computed quite simply:
API Availability
$$\text{availability}_{\text{api}} = \frac{\text{successful requests}}{\text{total requests}}$$
Overall Availability
$$\text{availability}_{\text{overall}} = \text{uptime}_{\text{status site}} \times \text{availability}_{\text{api}}$$
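As a rough illustration, the two formulas above can be sketched in a few lines of Python. The function names and the request counts are made up for this example and do not reflect Slack's actual implementation; treating a period with no traffic as fully available is also an assumption.

```python
def api_availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of API requests that succeeded."""
    if total_requests == 0:
        # Assumption: with no traffic nothing failed, so report full availability.
        return 1.0
    return successful_requests / total_requests


def overall_availability(status_site_uptime: float, api_avail: float) -> float:
    """Overall availability: status-site uptime multiplied by API availability."""
    return status_site_uptime * api_avail


# Illustrative numbers only.
api = api_availability(successful_requests=999_950, total_requests=1_000_000)
overall = overall_availability(status_site_uptime=0.9999, api_avail=api)
print(f"API availability:     {api:.6f}")
print(f"Overall availability: {overall:.6f}")
```

Because the two factors are multiplied, a dip in either uptime or API success rate pulls the overall number down.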
You may be asking why uptime and availability are different. Uptime is determined by monitoring key workflows that are critical to Slack’s usability. If the availability of any of those critical user interactions drops below a predetermined threshold, we count the minutes that the service spends below that threshold as downtime.
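The minute-counting idea can be sketched as follows. This is a minimal example under assumptions: the workflow names, the per-minute sample format, and the 0.95 threshold are all hypothetical, since the actual thresholds and data sources are not described here.

```python
def downtime_minutes(per_minute_availability, threshold):
    """Count minutes in which any critical workflow fell below the threshold.

    per_minute_availability: list of dicts, one per minute, mapping a
    workflow name to its availability (0.0 to 1.0) during that minute.
    """
    return sum(
        1
        for minute in per_minute_availability
        if any(value < threshold for value in minute.values())
    )


def uptime(per_minute_availability, threshold):
    """Fraction of minutes in which every critical workflow met the threshold."""
    total = len(per_minute_availability)
    if total == 0:
        return 1.0  # Assumption: no samples means no observed downtime.
    return 1 - downtime_minutes(per_minute_availability, threshold) / total


# Hypothetical samples: four minutes, two workflows.
samples = [
    {"send_message": 0.9999, "load_channel": 0.9998},
    {"send_message": 0.9400, "load_channel": 0.9999},  # send_message is below threshold
    {"send_message": 0.9999, "load_channel": 0.9999},
    {"send_message": 0.9999, "load_channel": 0.9999},
]
down = downtime_minutes(samples, threshold=0.95)
up = uptime(samples, threshold=0.95)
print(down, up)
```

A minute counts as downtime if any one workflow is below the threshold, which matches the "any of those critical user interactions" wording above.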
Since small changes in availability (~0.0001) can have a drastic impact on the customer experience, we convert availability to a 9s representation, where 99% availability is 2 9s, 99.9% availability is 3 9s, and 99.99% availability is 4 9s, and so on.
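One common way to express this conversion is the negative base-10 logarithm of the unavailability. The sketch below assumes that convention; the function name is illustrative.

```python
import math


def nines(availability: float) -> float:
    """Convert an availability fraction (0.0 to 1.0) to a 'number of 9s' value."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if availability == 1.0:
        return math.inf  # No failures at all: unbounded number of 9s.
    return -math.log10(1.0 - availability)


for value in (0.99, 0.999, 0.9999):
    print(f"{value} -> {nines(value):.2f} nines")
```

On this scale each additional 9 corresponds to a tenfold reduction in failures, which makes small differences in availability much easier to see on a chart.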
We track daily and hourly aggregates of availability over time so that we can spot trends and identify regressions and improvements.
We maintain company-wide goals on this metric in terms of the number of days in a quarter that we meet availability targets.
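A goal of this shape can be checked with a simple count. In the sketch below, the daily values, the 0.9999 daily target, and the required number of days are placeholders chosen for illustration, not Slack's actual goals.

```python
def days_meeting_target(daily_availability, target):
    """Count the days whose availability is at or above the target."""
    return sum(1 for value in daily_availability if value >= target)


# Hypothetical daily availability for five days of a quarter.
daily = [0.99995, 0.99991, 0.99980, 0.99999, 0.99990]
met = days_meeting_target(daily, target=0.9999)

required_days = 4  # Hypothetical goal for this short sample window.
goal_met = met >= required_days
print(f"{met} of {len(daily)} days met the target; goal met: {goal_met}")
```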
The Reliability Engineering team is fundamentally responsible for responding to and triaging regressions in availability that cause or can potentially cause us to miss those targets, but like any important effort we are far from alone in meeting our goals:
- Engineering Leadership: Decide prioritization and unblock needed solutions to regressions systemically and tactically
- Service Owners: Debug, understand, and mitigate the root cause of regressions, improving the services they own over time
- Reliability Engineering: Aid service owners, develop tooling, and identify threats that must be resolved to maintain availability
All parties combine SDI-R regressions with incident and customer impact data to align on the most important issues and drive them to conclusion.
We’ve found that by treating SDI-R as a “canary in the coal mine” instead of waiting for issues to become incidents, we’ve been able to solve reliability threats more proactively. Issues are:
- Easier to understand and debug, since the number of things breaking at once is reduced
- Identified earlier, giving more time to scope and implement the correct solutions
- Often solved before customers even notice, preventing outages entirely
Growing the Service Delivery Index from an idea to a program: Adoption
The SDI grew out of an idea by our Chief Architect Keith Adams, who set out to quantify the quality of a service with four measurements: Security, Performance, Quality, and Reliability.
- Security: How quickly are we addressing security vulnerabilities? Track ticket close rate.
- Performance: Is our service delivering responses to customers in a timely manner? Track API latency or client performance.
- Quality: How quickly are we addressing open software defects? Track ticket close rate.
- Reliability: Is our service reliably delivering requests to customers? Track error rates.
Over time, each of those four areas has evolved into its own separate program and is tracked as a key metric company-wide. We’ll talk about the Reliability program here and how we were able to establish a common language that teams understand and use to prioritize their work.
Slack—as a customer-first organization—established a high bar of quality and maintains a 99.99% availability SLA in customer agreements. This requires a program that ensures the metric is being tracked and that there is accountability.
The first aspect of the program is visibility — we must understand and see the signal of how well we are meeting the SLA.
Once we have visibility, we bring accountability. We publish this metric to a leadership group or a company-wide group of stakeholders, and establish Reliability as an objective in planning. Once the objective is published and the key result is monitored, we can establish a link between the SDI and teams. The SDI allows us to link regressions to services, which can in turn be mapped to a team. Once that connection is made, we can prioritize fixes or tradeoffs to correct the regression before it becomes an SLA breach.
Scaling action, learning, and prioritization
SDI-R is effectively an error budget that helps us decide how much time the company and individual teams should spend on launching new features, and when we must stop feature work to focus on availability. In this way, it helps us balance prioritization of investments across the company through a common view of user impact.
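To make the error-budget framing concrete, here is a minimal sketch. The 99.99% target mirrors the SLA mentioned earlier, but the request counts and the rule for pausing feature work are hypothetical; real prioritization involves far more context than a single threshold.

```python
def error_budget_remaining(target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = (1 - target) * total_requests
    if allowed_failures == 0:
        # Nothing may fail: the budget is intact only if there were no failures.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / allowed_failures


def should_pause_feature_work(remaining, floor=0.0):
    """Hypothetical policy: shift focus to availability once the budget is used up."""
    return remaining <= floor


# Illustrative numbers: 10M requests at a 99.99% target allow about 1,000 failures.
remaining = error_budget_remaining(
    target=0.9999, total_requests=10_000_000, failed_requests=400
)
print(f"Budget remaining: {remaining:.0%}")
print(f"Pause feature work: {should_pause_feature_work(remaining)}")
```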
Because of our strong belief in Service Ownership, we’ve invested in tools and processes that help scale understanding and resolution of SDI-R impacting issues.
We aim to get the Right People, in front of the Right Problem, at the Right Time
Monitoring, alerting, and observability tools are important for scaling the engineering response to customer-impacting issues. We observed several common use cases that were worth automating to make it easier for service owners to maintain service level objectives (SLOs) and respond to regressions. The first tool, the Webapp Ownership Tool, automates the setup of alerts, SLOs, and dashboards for Slack API endpoints using a common set of metrics and infrastructure. Service owners can often respond to and resolve an alert before it becomes an SDI-R regression, using a common set of logging, metrics, and tracing to feed knowledge of availability back into the Software Development Lifecycle. The second tool is Omni, Slack’s Service Catalog, which acts as the system of record for ownership and escalation. Omni includes SDI-R data alongside owned APIs and infrastructure components, which lets us escalate issues in dependencies and automatically route regressions to the appropriate team. These tools are very effective in ensuring the response to and resolution of acute issues.
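The routing idea can be sketched with a plain ownership lookup. This is not Omni's API: the ownership mapping, the team names, and the SLO value below are hypothetical stand-ins for what a service catalog would provide.

```python
def route_regressions(endpoint_availability, slo, owners):
    """Group endpoints that are below their SLO by the team that owns them."""
    routed = {}
    for endpoint, availability in endpoint_availability.items():
        if availability < slo:
            # Endpoints without a catalog entry are surfaced rather than dropped.
            team = owners.get(endpoint, "unowned")
            routed.setdefault(team, []).append(endpoint)
    return routed


# Hypothetical ownership records and measurements.
owners = {
    "chat.postMessage": "messaging-team",
    "conversations.history": "channels-team",
}
measured = {
    "chat.postMessage": 0.9990,
    "conversations.history": 0.99995,
    "emoji.list": 0.9950,
}
routed = route_regressions(measured, slo=0.9999, owners=owners)
print(routed)
```

Keeping an explicit "unowned" bucket makes gaps in ownership data visible instead of silently ignoring regressions that nobody is assigned to.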
We aim to do the things that best serve our customers
Organizationally, it is important that we establish the right forums and tools to understand ongoing regressions and to re-prioritize investments effectively, so that we strike the right balance between reliability and feature work. The first of these is the Engineering Monday Meeting, a regular forum where engineering leadership reviews ongoing customer issues and SDI-R regressions and re-prioritizes investments. The second is a set of group- and team-level aggregates of SDI-R, which allow breakdown by organizational responsibility and tracking of success over time. Both help make sure that our organization-wide goal can scale and that all teams are aligned on the customer experience. We’ve often found that teams use these reports on a self-service basis to find chronic issues that slowly degrade the customer experience but are otherwise not caught by incidents or alerting.
Not every system is perfect; there were many lessons
As we’ve worked with SDI-R over many years, it has evolved to make sure that it brings maximum value to our customers.
Not all API requests are the same
One of the things we realized is that not all API requests are the same. We would encounter issues that were significant for specific users but did not move the overall metric. This led us to establish a breakdown of SDI-R for only our largest organizations, and to weight different APIs by importance so that the metric properly represents the customer impact of regressions in them. We often found that regressions affected our largest customers first, as they pushed the limits of our products and infrastructure. With this breakdown, we were able to resolve those regressions proactively, in the same way as with the overall SDI-R score.
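One simple way to weight APIs by importance is to scale each API's request counts by a weight before computing availability. The exact weighting scheme is not described here, so the sketch below is only an assumption, and the weights and request counts are invented.

```python
def weighted_availability(per_api, weights):
    """Availability where each API's requests are scaled by an importance weight.

    per_api: dict mapping API name to (successful_requests, total_requests).
    weights: dict mapping API name to a weight; missing APIs default to 1.0.
    """
    weighted_success = 0.0
    weighted_total = 0.0
    for name, (successful, total) in per_api.items():
        weight = weights.get(name, 1.0)
        weighted_success += weight * successful
        weighted_total += weight * total
    if weighted_total == 0:
        return 1.0  # Assumption: no traffic means nothing failed.
    return weighted_success / weighted_total


# Hypothetical traffic: a critical API is failing 1% of the time,
# while a high-volume, less critical API is healthy.
per_api = {
    "chat.postMessage": (9_900, 10_000),
    "emoji.list": (90_000, 90_000),
}
unweighted = weighted_availability(per_api, weights={})
weighted = weighted_availability(per_api, weights={"chat.postMessage": 10.0})
print(f"Unweighted: {unweighted:.4f}")
print(f"Weighted:   {weighted:.4f}")
```

With the weight applied, the regression in the critical API lowers the score more than it does in the unweighted average, where the healthy high-volume API masks it.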
Long-term maintainability
The delayed nature of SDI-R reporting sometimes led to a disconnect between the time that an issue happened and when it impacted SDI-R. However, we’ve found that as we’ve scaled SDI-R through service-specific alerting this has mattered less, since by the time an issue was impacting SDI-R it would have already been captured by an alert.
It has become increasingly valuable to invest in maintaining availability headroom by proactively fixing issues before our availability goals are at risk of being violated. This proactive approach not only reduces operational toil, but also provides regular practice in debugging and the other skills necessary to triage and understand regressions.
SDI-R has been so successful as an approach that we’ve adopted it to ensure the availability of new Slack products and infrastructure as we scale, in particular for our GovSlack environment.
Our approach must continuously evolve
Over time, with new product launches, changing customer needs, and changes to our infrastructure, it is important that we continuously iterate on our metrics and processes so that we keep finding the best way to measure our own success. No business is static, and we must not be afraid to learn from failures and iterate to improve our reliability over time.
Conclusion
As organizations grow rapidly, it is often difficult to stay proactive while prioritizing both availability and product work. By focusing on our customers, we’ve found SDI-R useful in striking this delicate balance. For both product and infrastructure, the customer is the most important thing, and data-driven approaches combined with the right processes are critical to keeping our customers happy and productive.
Acknowledgments
We wanted to give a shout out to all the people that have contributed to this journey:
Adam Fuchs, Ajay Patel, John Suarez, Bipul Pramanick, Justin Jeon, Nandini Tata, Shivam Shukla and all of those at Slack who have put our customers first.
Interested in taking on interesting projects, making people’s work lives easier, or improving our reliability? We’re hiring! 💼