Databases. They are the single source of truth for our most critical business data, yet as engineers we tend to overlook building tooling with that criticality in mind.
An entire ecosystem of monitoring and administrative tools exists for operating our databases, making sure they replicate, scale, and generally perform well. Similarly, a number of tools accompany the databases' query languages, from linters and beautifiers to query builders and object mappers. But after our application has written data, there is very little tooling to verify that the data is what we expect and stays that way.
This is not entirely different from, say, putting a cat in a box and spending all your effort on making sure the air holes stay unobstructed and that the food delivery system is operational, but never opening the box to see whether the cat is dead or alive.
Ensuring Data Integrity
As we develop software, the business logic we apply in the application layer is directly represented in how our data is defined in the storage layer. That zipcode table you just added for the new address system? What is enforcing that a one-to-many relationship exists between zip codes and addresses? Are any zip codes outside of the 0–99999 range of integers? What if the code is within range but invalid according to the USPS? Does the cat still have four legs or has the plutonium you used in your food delivery system caused a mutation?
In a perfect world, we ensure these relationships by writing bug-free software, but even then, outages happen and it's not practical for every piece of code that deals with data to check everything about that data each time it runs. A network blip could mean that an UPDATE succeeds on one database but not another. A software bug might cause 1 out of every 10k rows to be inserted with a text field set to an empty string. Even well-tested code with good error handling cannot stop inconsistencies from creeping in, and at scale the problem compounds.
Some databases can enforce relationships at the system level, using foreign keys, constraints, and other similar concepts, but these built-in checks represent only a small set of the characteristics that may be modeled in any given system. Likewise, ORMs like ActiveRecord can check data at write time using validations, but these often need to be disabled in performance-sensitive code, and can themselves contain bugs.
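To make the zip code example above concrete: the kind of assumption we want to verify after the fact, independent of any write path, is easy to state in code. A sketch, with made-up field names:

// Illustrative only: is this address's zip code a value in the 0-99999 range?
// Checking validity against the USPS would require an external data set.
function address_zipcode_in_range($address){
    $zip = $address['zipcode'];
    return is_numeric($zip) && (int)$zip >= 0 && (int)$zip <= 99999;
}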
So why, when we write unit tests, smoke tests, integration tests, all sorts of tests against our code, do we not do the same for our business data?
We’ve been thinking hard about alternative approaches to this problem, and we’ve codified this into what we’ve called a consistency check framework.
Applications
Building a responsive and customizable framework that can analyze and report on your data in meaningful ways has a number of proactive and reactive benefits. Consider the following use cases:
- Recovery from an outage or bug — “Oh no, the customer team just shipped some code that broke user signup mid-flow! We’ve reverted the deploy but how badly were we affected? How many orphan sign up rows do we have and can they be repaired?”
- Data migration, import/export — “We’ve migrated all the data for team billing into the new schema. We’d like to check the data is all accurate before we start reading from it. Can you write some scripts to do that?”
- Proactive bug finding — “I’ve run the consistency checks against the entire database and it shows that we have 80 users without a large profile icon. There has to be a bug in the new avatar code!”
The status of your data reflects the health of the overall system. A database that remains consistent over time says something about the quality and stability of your software. In the same way that good unit test coverage ensures the integrity of your business logic, consistency checks should look at the shape of your data and validate important assumptions about your business objects. But how many consistency checks does one need to write in order to be effective? What sorts of relationships are worth investing time into?
The checks you need depend on your stack, the nature of your data, and the types of bugs you’ve found in the past. They belong with the rest of your business logic in your application codebase. Fortunately, building a foundation for these checks is relatively easy to do.
Origins
At Slack, it’s common for engineers to write small one-off scripts to diagnose and resolve issues they’re triaging. These scripts are usually removed once the problem has been fixed, but occasionally they’re kept around in case the problem comes back.
One early example of this was the "no-channel backfill", a script built around the assumption that a team must have at least one public channel in order to operate correctly. Team metadata is stored on one set of database servers and channel metadata on another, so a network partition occasionally meant we created the team successfully, but not its first channel.
This ad-hoc approach worked, but over time, problems arose. Sometimes an engineer would not be aware of an existing script and would rewrite it from scratch; this particular script was written three times before we noticed the duplication. Furthermore, each implementation varied slightly, so even if you knew the scripts existed, it was unclear which version was canonical.
As patterns emerged, these scripts were turned into reusable code that could scan a number of teams instead of just one. Before long, a framework emerged that could scan groups of teams and detect a multitude of problems. Standardizing this behavior and encapsulating that knowledge not only prevented duplication, but gave developers an efficient and proven way to diagnose teams and, by proxy, the underlying data. Engineers began proactively adding checks during the development process and the practice quickly worked its way into our engineering culture.
Consistency Check Pattern
So what did this process look like? We noticed that all of the ad hoc scripts followed the same pattern: some code to scan some subset of teams, a function that checked a single team for a specific issue, and code to report on the results. So, we made generic versions of the scanning and reporting code, and created an interface for the checking code.
Here’s a simplified example of what a consistency check looks like today:
function team_consistency_domain_length($team){
    // Collect warnings about this team; an empty list means no problems found.
    $msgs = array();
    if (strlen($team['domain']) > 21){
        $msgs[] = array(
            'type' => 'domain_too_long',
            'message' => "team domain is too long",
        );
    }
    // 'ok' means the check itself ran to completion, not that the team is clean.
    return array(
        'ok' => true,
        'msgs' => $msgs,
    );
}
The function takes a team object as an argument, and returns a data structure indicating whether the check ran to completion (`ok`) and an array of warning messages found (`msgs`).
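For illustration, here's how calling the check might look against a made-up team record (the field values below are invented for this example):

// Hypothetical team record; only 'domain' matters to this check.
$team = array('id' => 'T12345678', 'domain' => 'a-domain-well-over-twenty-one-chars');
$result = team_consistency_domain_length($team);
// $result['ok'] is true (the check ran to completion), and $result['msgs']
// holds a single 'domain_too_long' warning, since the domain exceeds 21 characters.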
Once this function is written, it's added to a map of checks, along with a name for the check and a description of what it verifies. And then the consistency framework takes over.
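A minimal sketch of what such a map might look like, purely for illustration (the function name consistency_check_map and the metadata keys here are assumptions, not the exact production interface):

function consistency_check_map(){
    // Illustrative registry: each entry names a check, points at its callback,
    // and describes what it verifies. The framework walks this map when
    // scanning a team.
    return array(
        'domain_length' => array(
            'func' => 'team_consistency_domain_length',
            'desc' => 'Team domains must be 21 characters or fewer',
        ),
        // ... more checks ...
    );
}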
Running Checks
Initially, you could run consistency checks from the command line, either against a single team or as a scan of the whole system. This made it easy to check all of a team's data for known problems, meaning many customer issues that used to involve lots of head scratching now just involved running a quick check. We then added options the developer could pass as arguments to make runs faster, like checking only paid teams or teams created since a given date.
## Check consistency for a single team
$ php check_consistency.php -t T12345678
## Run team pref checks on all teams created since a given date
$ php check_consistency.php -c team_prefs --since=1467590400
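Conceptually, the driver behind these commands just selects teams, runs the requested checks, and reports what they find. A simplified sketch, reusing the illustrative consistency_check_map() from above (the real script also handles option parsing and team selection):

// Run a set of named checks against one team and print anything they find.
function run_checks_for_team($team, $check_names){
    $check_map = consistency_check_map();
    foreach ($check_names as $name){
        $result = call_user_func($check_map[$name]['func'], $team);
        if (!$result['ok']){
            echo "{$team['id']}: check '$name' did not run to completion\n";
            continue;
        }
        foreach ($result['msgs'] as $msg){
            echo "{$team['id']}: {$msg['type']}: {$msg['message']}\n";
        }
    }
}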
The command line framework was a huge win, but only engineers had access to it, and we wanted the tool to be operable by anyone. So we designed a GUI and hooked it up to our internal administrative system. This allowed our customer experience team to ask questions about teams and their underlying data — giving them the tools to triage complex situations and report the underlying problem to engineers.
We’re now also experimenting with routinely running checks against a random sample of teams as a proactive way of spotting problems before our customers do.
Fixing Problems
Of course, once we had tools to automatically detect a problem, the next obvious step was to add code to automatically fix it. We've done this, but only for some consistency checks, because many issues require human intervention before we can make any changes. For example, if a single-channel guest is somehow in two channels, we want to explain the options to a team admin instead of just picking a channel to remove them from. And in other cases we'd rather look at a problem and make sure we understand the root cause of each specific inconsistency, instead of blindly letting a script fix it.
The fixer framework works in much the same way as the check framework. First, we defined a mapping of keys to callbacks, where the keys are the error types returned from the checking process:
function consistency_fix_map(){
    // Maps the 'type' of each error found by a check to the callback
    // that knows how to repair it.
    return array(
        'domain_too_long' => 'consistency_fix_domain_length',
        'invalid_domain' => 'consistency_fix_domain_chars',
        ...
    );
}
If a matching error is detected, the corresponding callback is dispatched; it is responsible for actually fixing the problem and reporting the results.
function consistency_fix_domain_length($team){
    // Truncate the domain to the 21-character limit and apply the change.
    $new_domain = substr($team['domain'], 0, 21);
    return teams_change_url($team, $new_domain);
}
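Putting the two halves together, the dispatch step might look roughly like this; run_consistency_fixes is a made-up name for illustration, and it assumes a check has already run and produced its msgs array:

// Given the warnings a check produced, invoke any registered fixer for each error type.
function run_consistency_fixes($team, $msgs){
    $fix_map = consistency_fix_map();
    $results = array();
    foreach ($msgs as $msg){
        if (isset($fix_map[$msg['type']])){
            $results[$msg['type']] = call_user_func($fix_map[$msg['type']], $team);
        }
    }
    return $results;
}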
Conclusion
Most of what we do as software engineers has been done before. It’s easy to assume that if there isn’t an off-the-shelf solution to a problem then the problem doesn’t really exist, or that you’re thinking about the problem in the wrong way.
But, we all work in different domains, with different constraints. Sometimes the right solution is to build some customized internal infrastructure based on your specific business needs. In this case it’s a tool to check the data for a specific team. In other systems it might be the data for a user’s profile or a category of products in a store. Or maybe you need to write validation code for external data sets you’re importing.
Whatever your business, building tools to routinely check your data is a good practice to make part of your everyday operations. A service like Slack is a combination of running code, the servers it runs on, and the data stored on those servers. A problem in any one of those will cause visible problems for our customers. Our consistency check framework has been hugely valuable in decreasing these kinds of issues and in maintaining confidence in our product, both internally and externally.
Want to help Slack solve tough problems and join our growing team? Check out all our open positions and apply today.