“A complex system can fail in an infinite number of ways.”
– John Gall, Systemantics
Incidents are stressful but inevitable. Even services designed for availability will eventually encounter a failure. Engineers naturally find it daunting to defend their systems against the “infinite number of ways” things can go wrong.
Our team found ourselves in this position when a service we use internally for dashboards went down, recovery failed, and we lost our teammates’ configurations. However, with creativity and a dash of mischievousness, we developed an exercise that addressed the cause of the problem, energized our teammates, and brought excitement and fun to the dry job of system maintenance. Come along as we share our journey from incident panic to peace of mind.
The incident
Slack engineers use Kibana with Elasticsearch to save custom dashboards and visualizations of important application performance data. On January 29th, 2024, our Kibana cluster—and subsequently, the dashboards—started to fail due to a lack of disk space. We began investigating and realized this was the unfortunate downstream effect of an earlier architectural decision. You can configure Elasticsearch as a stand-alone cluster for Kibana to use, which decouples the object storage from the Kibana application itself. However, our Kibana cluster was configured to use an Elasticsearch instance on the same hosts as the Kibana application. This tied the storage and the application together on the same nodes, and those nodes were now failing. Slack engineers couldn’t load the data they needed to ensure their applications were healthy.
Eventually, the cluster got into such a bad state that it couldn’t be saved, and we had to rebuild it from a clean slate. We thought we could stand up a new cluster by cycling in new hosts and restoring the Kibana objects from a backup. However, we were shocked and disappointed to discover that our most recent backup was almost two years old. The backup and restore method hadn’t gotten much love after its initial configuration, and it had no alerts to tell us when it wasn’t running correctly. On top of that, our runbook was out of date, and the old backup failed when we tried to restore from it. We lost our internal employees’ links and visualizations, and we were forced to recreate indexes and index patterns by hand.
Explaining to our teammates that our recovery procedure had failed and their data was lost was not fun. We didn’t notice our backups were failing until it was too late.
No one is immune to situations like these. Unless you actively exercise your processes, procedures, and runbooks, they will become obsolete and fail when you need them the most. Incident response is about restoring service as quickly as possible, but what you do after the dust settles determines whether an incident ultimately becomes a benefit or a liability.
Breaking stuff is fun
We were determined to turn this incident into tangible benefits. Our post-incident tasks included making sure that our Elasticsearch clusters in every environment were backed up with a scheduled backup script, fixing our runbooks based on the experience, and checking that the Amazon S3 retention policies were set correctly.
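For reference, the core of a scheduled backup like this can be quite small. The sketch below uses the standard Elasticsearch snapshot API to write Kibana’s saved-object indices to an S3 repository; the endpoint, repository, and bucket names are illustrative assumptions, not our actual configuration, and the alerting layer is only hinted at.

```python
#!/usr/bin/env python3
"""Minimal sketch of a scheduled Elasticsearch snapshot job (hypothetical names).

Assumes the repository-s3 plugin is available and the cluster can reach the
bucket; the endpoint, repo, and bucket below are illustrative, not Slack's setup.
"""
import datetime
import sys

import requests

ES_URL = "http://localhost:9200"        # assumed Elasticsearch endpoint
REPO = "kibana-backups"                 # hypothetical snapshot repository name
BUCKET = "example-kibana-snapshots"     # hypothetical S3 bucket


def ensure_repository() -> None:
    """Register (or re-register) the S3 snapshot repository; idempotent."""
    body = {"type": "s3", "settings": {"bucket": BUCKET, "base_path": "kibana"}}
    requests.put(f"{ES_URL}/_snapshot/{REPO}", json=body, timeout=30).raise_for_status()


def take_snapshot() -> str:
    """Snapshot the Kibana saved-object indices and wait for completion."""
    name = "kibana-" + datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    body = {"indices": ".kibana*", "include_global_state": False}
    resp = requests.put(
        f"{ES_URL}/_snapshot/{REPO}/{name}",
        params={"wait_for_completion": "true"},
        json=body,
        timeout=600,
    )
    resp.raise_for_status()
    return name


if __name__ == "__main__":
    ensure_repository()
    snapshot = take_snapshot()
    # A non-zero exit (from raise_for_status) is what the scheduler/alerting
    # layer watches for -- the piece that was missing in our original setup.
    print(f"snapshot {snapshot} completed", file=sys.stderr)
```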
We wanted to test our improvements to make sure they worked. Our team came up with an unconventional but exciting idea: we would break one of our development Kibana clusters and try the new backup and restore process. The development cluster is configured similarly to production clusters, and it would provide a realistic environment for testing. To ensure success, we carefully planned which cluster we would break, how we would break it, and how we would restore service.
Running the exercise
We planned the testing event for a quiet Thursday morning and invited the whole team. Folks showed up energized and delighted at the opportunity to break something at work on purpose. We filled the disk on our Kibana nodes, watched them fail in real time, and successfully triggered our alerts. We worked through the new runbook steps and cycled the entire cluster into a fresh rebuild. Our system recovered successfully from our staged incident.
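Exhausting disk space is easy to stage. Purely as an illustration (our exercise used our own tooling and paths), something like the following writes junk to the data volume until the filesystem reports ENOSPC, which is enough to trip disk-usage alerts:

```python
#!/usr/bin/env python3
"""Illustrative disk-filler for a staged failure (not our actual tooling).

Writes 64 MiB chunks to a scratch file on the target volume until the
filesystem returns ENOSPC. Delete the scratch file afterwards to recover.
"""
import errno

SCRATCH = "/var/lib/elasticsearch/fill-disk.tmp"   # assumed data-volume path
CHUNK = b"\0" * (64 * 1024 * 1024)

with open(SCRATCH, "wb") as f:
    try:
        while True:
            f.write(CHUNK)
            f.flush()
    except OSError as e:
        if e.errno != errno.ENOSPC:
            raise
        print("disk full -- alerts should fire now")
```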
Although the recovery was successful, we fell short of our goal of restoring service in under an hour. Many of the runbook commands were poorly explained and hard to grok under the pressure of a (simulated) incident, and even copying and pasting from the runbook was a challenge due to formatting issues. Despite these rough edges, the backups restored the cluster state completely. As a bonus, the exercise surfaced some firewall rules that were missing from our infrastructure as code; we did not expect to find firewall issues, but fixing them saved us future headaches.
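The restore side of the runbook is conceptually just as small once the cluster is healthy again. A sketch, reusing the same hypothetical endpoint and repository as the backup example above (a real runbook also has to deal with closing or deleting the live .kibana* indices first):

```python
#!/usr/bin/env python3
"""Sketch of restoring Kibana saved objects from the newest snapshot.

Same hypothetical ES_URL/REPO as the backup sketch; assumes the existing
.kibana* indices have already been closed or deleted before restoring.
"""
import requests

ES_URL = "http://localhost:9200"
REPO = "kibana-backups"

# Find the most recent snapshot in the repository.
snapshots = requests.get(f"{ES_URL}/_snapshot/{REPO}/_all", timeout=30).json()["snapshots"]
latest = max(snapshots, key=lambda s: s["start_time_in_millis"])["snapshot"]

# Restore only the Kibana indices and wait for the operation to finish.
resp = requests.post(
    f"{ES_URL}/_snapshot/{REPO}/{latest}/_restore",
    params={"wait_for_completion": "true"},
    json={"indices": ".kibana*", "include_global_state": False},
    timeout=600,
)
resp.raise_for_status()
print(f"restored snapshot {latest}")
```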
In a final test of our new recovery process, we migrated the general development Kibana instance and Elasticsearch cluster to run on Kubernetes. This was an excellent opportunity to test our improved backup script on a high-use Kibana cluster. Thanks to our improved understanding of the process, and the updated provisioning scripts, we successfully completed the migration with about 30 minutes of downtime.
During both exercises, we ran into minor issues with our new runbooks and restoration process. We spent time figuring out where the runbook was lacking and improved it. Inspired by the exercise, we took it upon ourselves to automate the entire process, turning the scheduled backup script into a full-featured CLI backup-and-restore tool. Now we can completely restore a Kibana backup from cloud storage with a single command. “Breaking stuff” wasn’t just fun; it was a valuable investment of our time that will save us from future stress.
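To give a sense of the shape of such a tool (the name, flags, and structure below are assumptions for illustration, not our actual program), a thin command-line wrapper around the backup and restore logic sketched earlier is all the “single command” really needs:

```python
#!/usr/bin/env python3
"""Illustrative skeleton of a one-command backup/restore CLI (hypothetical tool).

`backup` and `restore` stand in for the snapshot and restore logic shown in
the earlier sketches.
"""
import argparse


def backup(repo: str) -> None:
    ...  # register the repository and take a snapshot (see backup sketch)


def restore(repo: str) -> None:
    ...  # restore .kibana* indices from the newest snapshot (see restore sketch)


def main() -> None:
    parser = argparse.ArgumentParser(prog="kibana-backup")
    parser.add_argument("command", choices=["backup", "restore"])
    parser.add_argument("--repo", default="kibana-backups")
    args = parser.parse_args()
    {"backup": backup, "restore": restore}[args.command](args.repo)


if __name__ == "__main__":
    main()
```

With that shape, restoring is a single invocation such as `kibana-backup restore --repo kibana-backups`, which is the property that matters most during an incident.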
Chaos is everywhere—might as well use it
“Complex systems usually operate in failure mode.”
– John Gall
Every production system is broken in a way that hasn’t been uncovered yet. Yes, even yours. Take the time and effort to find those issues and plan how to recover from them before it’s critical. Generate a lot of traffic and load test services before customers do. Turn services off to simulate unexpected outages. Upgrade dependencies often. Routine maintenance in software is often neglected because it can be dry and boring, but we pay for it when an incident inevitably hits.
We discovered that we can keep system testing and maintenance exciting and fresh with strategic chaos: planned opportunities to break things. Not only is it exciting to diverge from the usual job of fixing things, it also puts us in realistic situations we would never have discovered by approaching maintenance the traditional way.
We encourage you to take the time to break your own systems. Restore them and then do it again. Each iteration will make the process and tooling better for when you inevitably have to use it in a stressful situation.
Finally, remember to celebrate World Backup Day every March 31st. I know we will!
Acknowledgments
Kyle Sammons – for pairing with me on the planning and execution of the recovery exercise
Mark Carey and Renning Bruns – for getting the tooling functioning properly and automating the process
Emma Montross, Shelly Wu, and Bryan Burkholder – for incident response and support during the recovery exercise
George Luong and Ryan Katkov – for giving us the autonomy to make things better