January 18, 2024

15 min read

The Scary Thing About Automating Deploys

Sean McIlroy, Sr. Software Engineer

Scared Robot

Most of Slack runs on a monolithic service simply called “The Webapp”. It’s big – hundreds of developers create hundreds of changes every week.

Deploying at this scale is a unique challenge. When people talk about continuous deployment, they’re often thinking about deploying to systems as soon as changes are ready. They talk about microservices and 2-pizza teams (~8 people). But what does continuous deployment mean when you’re looking at 150 changes on a normal day? That’s a lot of pizzas…

Graph showing changes opened, merged, and deployed per day, from October 16th to October 20th. Changes deployed is between 150 and 190. — Changes per day.

Continuous deployments are preferable to large, one-off deployments for two reasons:

We want our customers to see the work of our developers as fast as possible so that we can iterate quickly. This allows us to respond quickly to customer feedback, whether that feedback is a feature request or bug reports.
We don’t want to release a ton of changes at once. There’s a higher likelihood of errors and those errors are more difficult to debug within a sea of changes.

So we need to move fast – and we do move fast. We deploy from our Webapp repository 30-40 times a day to our production fleet, with a median deploy size of 3 PRs. We manage a reasonable PR-to-deploy ratio despite the scale of our system’s inputs.

A graph showing deploys per day, from October 16th to October 20th. The number bounces between 32 and 37.

We manage these deployment speeds and sizes using our ReleaseBot. It runs 24/7, continually deploying new builds. But it wasn’t always like this. We used to schedule Deploy Commanders (DCs), recruiting them from our Webapp developers. DCs would work a 2 hour shift where they’d walk Webapp through its deployment steps, watching dashboards and executing manual tests along the way.

The Release Engineering team managed the deployment tooling, dashboards, and the DC schedule. The strongest, most frequent, feedback Release Engineering heard from DCs was that they weren’t confident making decisions. It’s difficult to monitor the deployment of a system this large. DCs were on a rotation with hundreds of other developers. How do you get comfortable with a system that you may only interact with every few months? What’s normal? What do you do if something goes wrong? We had training and documentation, but it’s impossible to cover every edge case.

So Release Engineering started thinking about how we could give DCs better signals. Fully automating deployments wasn’t on the radar at this point. We just wanted to give DCs higher-level, clearer “go/no-go” signals.

We worked on the ReleaseBot for a quarter and let it run alongside DCs for a quarter before realizing that ReleaseBot could be trusted to handle deployments by itself. It caught issues faster and more consistently than humans, so why not put it in the driver’s seat?

The heart of ReleaseBot is its anomaly detection and monitoring. This is both the scariest and most important piece in any automated deployment system. Bots move faster than humans, meaning you’re one bug and a very short period of time away from bringing down production.

The risks that come with automation are worth it for 2 reasons:

It’s safer if you can get the monitoring right. Computers are both faster and more vigilant than humans.
Human time is our most valuable, constrained resource. How many hours do your company’s engineers spend staring at dashboards?

Screenshot of Slack Message from Release Bot saying "ReleaseBot started for webapp"

Monitoring never feels “done”

Any engineer who’s been on-call will know this cycle:

1. You monitor everything with tight thresholds.
2. These tight thresholds, combined with a noisy service, lead to frequent pages.
3. Frustrated and tired, you delete a few alerts and increase some thresholds.
4. You finally get some sleep.
5. An incident occurs because that noisy service actually broke something but you didn’t get paged.
6. Someone in an incident review asks why you weren’t monitoring something.
7. Go to step 1.

This cycle stops a lot of teams from implementing automated deployments. I’ve been in meetings like this multiple times throughout my career:

Person 1: “Why don’t we just automate deployments?”
Everyone: *Nods*
Person 2: “What if something breaks?”
Everyone: *Looks sad*

The conversation doesn’t make it past this point. Everyone is convinced it won’t work because it feels like we don’t have a solid hold on our alarms as-is – and that’s with humans in the loop!

Even if you have solid alerting and a reasonable on-call burden, you probably find yourself making small tweaks to alerts every few months. Complex systems experience a low hum of background errors and everything from performance characteristics, to dependencies, to the systems themselves change over time. Defining a particular number as “bad” for a complex system is open to subjective interpretation. It’s a judgment call. Is 100 errors bad? What about a 200 millisecond average latency? Is one bad data point enough to page someone or should we wait a few minutes? Will your answers be the same in a month?

Given these constraints, writing a program we trust to handle deployments can seem insurmountable but, in some ways, it’s easier than monitoring in general.

How deployments are different

The number of errors a system experiences in a steady-state isn’t necessarily relevant to a deployment. If both version 1 and version 2 of an application emit 100 errors per second, then version 2 didn’t introduce any new, breaking changes. By comparing the state of version 1 and version 2 and determining that the state of the system did not change, we can be confident that version 2 is a “good” deployment.

You are mostly concerned with anomalies in the system when deploying. This necessitates a different approach.

This is intuitive if you think about how you watch a dashboard during a deployment. Imagine you just deployed some new code. You’re looking at a dashboard. Which of these two graphs catches your attention?

Two graphs with a line on each denoting a deployment. The left graph is at 1, then spikes to 10 and 15 immediately after the deployment. The right graph is a flat line at 100 before and after the deployment.

Clearly, the graph with a spike is concerning. We don’t even know what this metric represents. Maybe it’s a good spike! Either way, you know to look for those spikes. They’re an indication something is tangibly different. And you’re good at it. You can just scan the dashboard, ignoring specific numbers, looking for anomalies. It’s easier and faster than watching for thresholds on every individual graph.

So how do we teach a computer to do this?

Picture of a robot emoji with a robot cat in a thought bubble. They are in front of a graph in the rough shape of a cat. The text reads "It's easy for humans to spot anomalies in data. For example, this PHP Errors chart resembles my cat".

Luckily for us, defining “anomalous” is mathematically simple. If a normal alert threshold is a judgment call involving tradeoffs between under and over alerting, a deployment threshold is a statistical question. We don’t need to define “bad” in absolute terms. If we can see that the new version of the code has an anomalous error rate, we can assume that’s bad – even if we don’t know anything else about the system.

In short, you probably have all the metrics you need to start automating your deployments today. You just need to look at them a little differently.

Our focus on “anomalous” is, of course, a little overfit. Monitoring hard thresholds during a deployment is reasonable. That information is available, and a simple threshold provides us the signal that we’re looking for most of the time, so why wouldn’t we use it? However, you can get signals on-par with a human scanning a dashboard if you can implement anomaly detection.

The nitty-gritty

Let’s get into the details of anomaly detection. We have 2 ways of detecting anomalous behavior: z scores and dynamic thresholds.

Your new best friend, the z score

The simplest mathematical way to find an anomaly is a z score. A z score represents the number of standard deviations from the mean for a particular data point (if that all sounds too math-y, I promise it gets better). The larger the number, the larger the outlier.

A picture of a robot emoji with sunglasses on the cover of Kenny Loggins Danger Zone, in front of a graph showing a normal distribution with standard deviations. The text reads "A z-score tells us how far a value is from the mean, measured in terms of standard deviation. For example, a z-score of 2.5 or -2.5 means that the value is between 2 and 3 standard deviations from the mean."

Basically, we’re mathematically detecting a spike in a graph.

This can be a little intimidating if you’re not familiar with statistics or z scores, but that’s why we’re here! Read on to find out how we do it, how you might implement it, and a few lessons we learned along the way.

First, what is a z score? The actual equation for determining the z score for a particular data point is ((data point – mean) / standard deviation).

Using the above equation, we can calculate the z scores for every data point in a particular time interval.
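To make the equation concrete, here is a minimal sketch that applies it to every data point using only Python's standard library. The function name `zscores` and the sample numbers are illustrative and not part of ReleaseBot. It uses the population standard deviation, which is also the default in scipy's `stats.zscore`, and it assumes a flat series should score 0.

```python
from statistics import fmean, pstdev


def zscores(values: list[float]) -> list[float]:
    """Return (data point - mean) / standard deviation for each data point."""
    mean = fmean(values)
    # Population standard deviation (ddof=0).
    stddev = pstdev(values)
    if stddev == 0:
        # Assumption: a flat line has no variation, so every z score is 0.
        return [0.0 for _ in values]
    return [(value - mean) / stddev for value in values]


# Mean is 5 and standard deviation is 2 for this sample.
print(zscores([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The last value, 9, sits two standard deviations above the mean, so its z score is 2.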

Thankfully, calculating a z score is computationally simple. ReleaseBot is a Python application. Here’s our implementation of z scores in Python, using scipy’s stats library:

from scipy import stats

def calculate_zscores(self) -> list[float]:
	# Grab our data points
	values = ChartHelper.all_values_in_automation_metrics(
		self.automation_metrics
	)
	# Calculate zscores
	return list(stats.zscore(values))

You can do the same thing in Prometheus, Graphite, and in most other monitoring tools. These tools usually have built-in functions for calculating the mean and the standard deviation of datapoints. Here’s a z score calculation for the last 5 minutes of data points in PromQL:

abs(
	avg_over_time(metric[5m])
	- 
	avg_over_time(metric[3h])
)
/ stddev_over_time(metric[3h])

Now that ReleaseBot has the z scores, we check for z score threshold breaches and send a signal to our automation. ReleaseBot will automatically stop deployments and notify a Slack channel.
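The PromQL expression above compares the average of a short, recent window against a longer baseline. A rough Python equivalent might look like the sketch below. The function name and the one-sample-per-minute assumption are hypothetical, and unlike the `abs()` version it keeps the sign so that drops can be told apart from spikes.

```python
from statistics import fmean, pstdev


def recent_window_zscore(series: list[float], recent_points: int = 5) -> float:
    """Z score of the mean of the newest points, measured against the whole series.

    Assumes `series` holds one sample per minute, oldest first, so the
    default of 5 points corresponds to the last 5 minutes.
    """
    baseline_mean = fmean(series)
    baseline_stddev = pstdev(series)
    if baseline_stddev == 0:
        return 0.0  # flat series: nothing anomalous to report
    recent_mean = fmean(series[-recent_points:])
    return (recent_mean - baseline_mean) / baseline_stddev


# 175 minutes bouncing between 1 and 3, then 5 minutes stuck at 6.
series = [1.0 if i % 2 == 0 else 3.0 for i in range(175)] + [6.0] * 5
z = recent_window_zscore(series)
print(z >= 3)  # True: the recent window is a statistical outlier
```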

Almost all of our z score thresholds are 3 and/or -3 (-3 detects a drop in the graph). A z score of 3 generally represents a datapoint above the 99th percentile. I say “generally” because this really depends on the shape of your data. For normally distributed data, roughly 99.7% of datapoints fall within 3 standard deviations of the mean, so a z score of 3 sits at about the 99.9th percentile.

So a z score of 3 is a large outlier, but it doesn’t need to be a large difference in absolute terms. Here’s an example in Python:

>>> from scipy import stats
# List representing a metric that alternates between 
# 1 and 3 for 3 hours (180 minutes)
>>> x = [1 if i % 2 == 0 else 3 for i in range(180)]
# Our most recent datapoint jumps to 5.5
>>> x.append(5.5)
# Calculate our zscores and grab the score for the 5.5 datapoint
>>> score = stats.zscore(x)[-1]
>>> score
3.377882555133357

The same situation, in graph form:

A graph that bounces between 1 and 3 continually, then jumps to 5.5 at the last datapoint. A red arrow points to 5.5 with "z score = 3.37".

So if we have a graph that’s been hanging out between 1 and 3 for 3 hours, a jump to 5.5 would have a z score of 3.37. This is a threshold breach. Our metric only increased by 2.5 in absolute numerical terms, but that jump was a huge statistical outlier. It wasn’t a big jump, but it was definitely an unusual jump.

This is exactly the type of pattern that’s obvious to a human scanning a dashboard, but could be missed by a static threshold because the actual change in value is so low.

It’s really that simple. You can use built-in functions in the tool of your choice to calculate the z score and now you can detect anomalies instead of wrestling with hard-coded thresholds.

Some extra tips:

We’ve found a z score threshold of 3 is a good starting point. We use 3 for the majority of our metrics.
Your standard deviation will be 0 if all of your numbers are the same. The z score equation requires dividing by the standard deviation. You can’t divide by 0. Make sure your system handles this. In our Python application, scipy.stats.zscore will return “nan” (not a number) in this scenario. So we just overwrite “nan” with 0. There was no variation in the metric – the line was flat – so we treat it like a z score of 0.
You might want to ignore either negative or positive z scores for some metrics. Do you care if errors or latency go down? Maybe! But give it some thought.
You may want to monitor things that don’t traditionally indicate issues with the system. We, for example, monitor total log volume for anomalies. You probably wouldn’t page an on-call because of increased informational log messages, but this could indicate some unexpected change in behavior during a deployment. (There’s more on this later.)
Snoozing z score metrics is a killer feature. Sometimes a change in a metric is an anomaly based on historical data, but you know it’s going to be the new “normal”. If that’s the case, you’ll want to snooze your z scores for whatever interval you use to calculate z scores. ReleaseBot looks at the last 3 hours of data, so the ReleaseBot UI has a “Snooze for 3 Hours” button next to each metric.
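The snooze tip can be sketched as a small registry that remembers when each snooze expires. This is only an illustration: `SnoozeRegistry` and its methods are hypothetical names, not ReleaseBot's actual code, and the 3-hour window is taken from the lookback interval described above.

```python
from datetime import datetime, timedelta

# Assumption: the snooze length matches the 3-hour z score lookback window.
SNOOZE_WINDOW = timedelta(hours=3)


class SnoozeRegistry:
    """Tracks which metrics are temporarily excluded from z score checks."""

    def __init__(self) -> None:
        self._snoozed_until: dict[str, datetime] = {}

    def snooze(self, metric: str, now: datetime) -> None:
        """Ignore z score breaches for `metric` until the window has passed."""
        self._snoozed_until[metric] = now + SNOOZE_WINDOW

    def is_snoozed(self, metric: str, now: datetime) -> bool:
        expires = self._snoozed_until.get(metric)
        return expires is not None and now < expires


registry = SnoozeRegistry()
clicked_at = datetime(2024, 1, 17, 15, 0)
registry.snooze("LogVolume", clicked_at)
print(registry.is_snoozed("LogVolume", clicked_at + timedelta(hours=1)))  # True
print(registry.is_snoozed("LogVolume", clicked_at + timedelta(hours=3)))  # False
```

Once the snooze expires, the anomalous data points make up the whole lookback window, so the new level has become the baseline the z score is computed against.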

How Slack uses z scores

We consider z scores “high confidence” signals. We know something has definitely changed and someone needs to take a look.

At Slack, we have a standard system of using white, blue, or red circle emojis within Slack messages to denote the urgency of a request, with white being the lowest urgency and red the highest.

A screenshot of a Slack message from Release Bot. The message is a blue circle emoji with text, "Webapp event #2528 opened for chart Five Hundred Errors, in tier dogfood and az use1-az2".

A single z score threshold breach is a blue circle. Imagine you saw one graph spike on the dashboard. That’s not good but you might do some investigation before raising any alarms.

Multiple z score threshold breaches are a red circle. You know something bad just happened if you see multiple graphs jump at the same time. It’s reasonable to take remediation actions before digging into a root cause.
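As a sketch, the blue/red rule above is a count of how many distinct metrics are currently breaching. The `Urgency` enum and `classify` function are hypothetical names used for illustration only.

```python
from enum import Enum


class Urgency(Enum):
    NONE = "no alert"
    BLUE = "blue circle"  # a single z score breach: investigate
    RED = "red circle"    # multiple simultaneous breaches: remediate first


def classify(breached_metrics: list[str]) -> Urgency:
    """Map the metrics currently breaching a z score threshold to an urgency."""
    distinct = set(breached_metrics)
    if len(distinct) >= 2:
        return Urgency.RED
    if len(distinct) == 1:
        return Urgency.BLUE
    return Urgency.NONE


print(classify(["PHPErrors"]))               # Urgency.BLUE
print(classify(["PHPErrors", "LogVolume"]))  # Urgency.RED
```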

We monitor the typical metrics you’d expect (errors, 500’s, latency, etc – see Google’s The Four Golden Signals), but here are some potentially interesting ones:

Metric	High z score	Low z score	Notes
PHPErrors	1.5	–	We choose to be especially sensitive to error logs.
StatusSlackCom	3	-3	This is the number of requests to https://status.slack.com – the site users access to check if Slack is having problems. A lot of people suddenly curious about the status of Slack is a good indication that something is broken.
WebsocketEventsVolume	–	-3	A high number of client connections doesn’t necessarily mean that we’re overloaded. But an unexpected drop in client connections could mean we’ve released something especially bad on the backend.
LogVolume	3	–	Separate from error logs. Are we creating many more logs than usual? Why? Can our logging system handle the volume?
EnvoyPanicRouting	3	–	Envoy routes traffic to the Webapp hosts. It starts “panic routing” when it can’t locate enough hosts. Are hosts stopping but not restarting during the deployment? Are we deploying too quickly – taking down too many hosts at once?
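One way to express the table above in code is a per-metric pair of optional thresholds, where a missing value means that direction is ignored. The threshold values come from the table. The `ZScoreThresholds` class and the inclusive comparison are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ZScoreThresholds:
    high: Optional[float] = None  # None means spikes are ignored
    low: Optional[float] = None   # None means drops are ignored

    def is_breached(self, zscore: float) -> bool:
        # Assumption: reaching the threshold exactly counts as a breach.
        if self.high is not None and zscore >= self.high:
            return True
        return self.low is not None and zscore <= self.low


THRESHOLDS = {
    "PHPErrors": ZScoreThresholds(high=1.5),
    "StatusSlackCom": ZScoreThresholds(high=3, low=-3),
    "WebsocketEventsVolume": ZScoreThresholds(low=-3),
    "LogVolume": ZScoreThresholds(high=3),
    "EnvoyPanicRouting": ZScoreThresholds(high=3),
}

# A surge in websocket events is ignored, but a drop is not.
print(THRESHOLDS["WebsocketEventsVolume"].is_breached(4.0))   # False
print(THRESHOLDS["WebsocketEventsVolume"].is_breached(-3.2))  # True
```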

Beyond the z score, dynamic thresholds

We still monitor static thresholds but we consider them “low confidence” alarms (they’re a white circle). We set static thresholds for some key metrics but ReleaseBot also calculates its own dynamic threshold, using the higher of the two.

Imagine the database team deploys some component every Wednesday at 3pm. When this deployment happens, database errors temporarily spike above your alert threshold, but your application handles it gracefully. Since the application handles it gracefully, users don’t see the errors and thus we obviously don’t need to stop deployments in this situation.

So how can we monitor a metric using a static threshold while filtering out otherwise “normal” behavior? We use an average derived from historical data.

“Historical data” deserves some explanation here. Slack is used by enterprises. Our product is mostly used during the typical workday, 9am to 5pm, Monday through Friday. So we don’t just grab a larger, continuous window of data when we’re thinking about historical relevance. We sample data from similar time periods.

Let’s say we’re running this calculation at 6pm on Wednesday. We’ll pull data from:

12pm-6pm Wednesday (today).
12pm-6pm Tuesday.
12pm-6pm last Wednesday.

We pool all of these windows together and calculate a simple average. Here’s how you could achieve the same result with PromQL:

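# Average each 6-hour window, then average the three windows.
# (sum() is an aggregation over instant vectors and does not accept
# a range vector like metric[6h]; avg_over_time() does.)
(
	avg_over_time(metric[6h])
	+ avg_over_time(metric[6h] offset 1d)
	+ avg_over_time(metric[6h] offset 1w)
) / 3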
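If you are pulling raw data points in application code instead, the three sampling windows can be derived from the current time. The sketch below is a hypothetical helper, not ReleaseBot's code; the 6-hour length and the day and week offsets mirror the example above.

```python
from datetime import datetime, timedelta


def historical_windows(now: datetime, hours: int = 6) -> list[tuple[datetime, datetime]]:
    """Return (start, end) pairs for today, yesterday, and one week ago."""
    window = timedelta(hours=hours)
    offsets = [timedelta(0), timedelta(days=1), timedelta(weeks=1)]
    return [(now - offset - window, now - offset) for offset in offsets]


# 6pm on a Wednesday.
for start, end in historical_windows(datetime(2024, 1, 17, 18, 0)):
    print(start.strftime("%a %H:%M"), "to", end.strftime("%a %H:%M"))
# Wed 12:00 to Wed 18:00
# Tue 12:00 to Tue 18:00
# Wed 12:00 to Wed 18:00
```

The third window is the previous Wednesday, which is why it prints the same weekday as the first.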

Again, this is a fairly simple algorithm:

Gather historical data and calculate the average.
Take the larger of “the average historical data” and “hard-coded threshold”.
Stop deployments and alarm if the last 5 data points breach the chosen threshold.

In simple terms: We watch thresholds but we’re willing to ignore a breach if historical data indicates it’s normal.
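The three steps can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes that "the last 5 data points breach" means all five are above the chosen threshold.

```python
from statistics import fmean


def effective_threshold(historical_values: list[float], static_threshold: float) -> float:
    """Use the larger of the historical average and the hard-coded threshold."""
    return max(fmean(historical_values), static_threshold)


def should_stop_deploy(
    recent_values: list[float],
    historical_values: list[float],
    static_threshold: float,
    points_required: int = 5,
) -> bool:
    """True when the newest `points_required` points all exceed the threshold."""
    threshold = effective_threshold(historical_values, static_threshold)
    latest = recent_values[-points_required:]
    # Assumption: with fewer than 5 points there is not enough data to alarm.
    return len(latest) == points_required and all(v > threshold for v in latest)


# Historically this metric averages 150, so 120 is normal despite a static threshold of 100.
print(should_stop_deploy([120.0] * 5, [150.0] * 10, static_threshold=100.0))  # False
print(should_stop_deploy([160.0] * 5, [150.0] * 10, static_threshold=100.0))  # True
```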

Dynamic thresholds are a nice-to-have, but not strictly required, feature of ReleaseBot. Static thresholds may be a bit more noisy, but don’t carry any additional risks to your production systems.

Embrace the fear

Fear of breaking production holds many teams back from automating their deployments, but understanding how deployment monitoring differs from normal monitoring opens the door to simple, effective tools.

It’ll still be scary. We took a careful, iterative approach to ease our fears. We added z score monitoring to our ReleaseBot platform and compared its results to the humans running deployments and watching graphs. The results of ReleaseBot were far better than we expected; to the point where it seemed irresponsible to not put ReleaseBot in the driver’s seat for deployments.

So throw some z scores on a dashboard and see how they work. You might just accidentally help your coworkers avoid staring at dashboards all day.

A screenshot of a message from ReleaseBot with the text "Release Bot has called 'all clear' on that deploy!"

