Incident Management takes time
Incidents need responders that are trained and experienced. At Slack, training is a foundation of our incident management program.
Self-service training and live courses based mainly on prepared content are one piece of the puzzle, but many organizations are missing another piece: how can staff get practical experience with incident response before joining a real incident?
Our first experience with the Incident Lunch
Our first experience with what we now call the Incident Lunch exercise was in a training session with the team from Blackrock 3 Partners in March 2018. They ran a two-day training for a team at Slack centered on the Incident Management System (IMS) and how it can be used to build an incident response program. During our sessions, they ran an exercise they call the Lunch Break exercise. They assign some roles to the group and set a limited time box for the group to get lunch to the training room. The focus was on having some constraints in place, teaching some of the incident roles through role play, and putting time pressure on the group. It was a lot of fun, and we took that exercise and turned it into a regular part of our training options at Slack. It also gave the engineer who led the program an opportunity to get Slack to buy them lunch once a week.
To reiterate, the key elements of this exercise were:
- Time pressure,
- Role playing,
- Constraints (they called them considerations, but imagine simple rules like no pizza or fast food),
- And it was fun to boot!
Bringing the Incident Lunch to the rest of Slack
The team who took our initial incident training sessions wanted to bring things back to a wider group of engineers across Slack. The Incident Lunch we started is easily accessible for anyone in the company — there is no setup or expertise required for the participants. It turns out everyone is a subject matter expert at ordering and eating lunch.
Folks are invited to a two-hour incident training exercise at lunchtime, and told that lunch will be provided. When they arrive, everybody gets a 15-minute refresher on our incident process, and then we drop the bomb: the lunch order fell through, so their exercise now is to obtain lunch for everybody in the room, subject to a few constraints. They need to do this using our incident response practices: somebody needs to be the incident commander, we need to communicate in an incident Slack channel, we need to post periodic status reports in the channel, etc.
The trainer acts as referee and coach for the exercise. So far, no team has failed to get lunch, though there have been some close calls!
From the perspective of our Incident Management Program, minimal resources are needed to repeat the exercise on a regular cadence. The framework for running the lunch is:
- An outline for the setup and kickoff of the in-person exercise
  - We use a GitHub repository with a Markdown file that we set up to run as a GitHub Pages slide deck. The outline covers the 15-minute introduction: a refresher on what IMS is and a brief overview of how we respond to incidents at Slack.
- A conference room or two
- A workflow to announce and invite people to the exercise
- One or two staff to facilitate the exercise (did we mention they get free lunch every time they run the exercise‽)
- A small budget to pay for lunch, probably around $300-500 per session, an order of magnitude less than any third-party firm might charge for an hour of on-site training.
A former lead engineer on a team called App Ops created a key element for our Slack version that sprinkled some extra fun and more time pressure into the exercise. We’ve come to call what they added the “Chaos Cards”. The cards bring two new key elements to the exercise: variability and unpredictability. Each card describes an action or event that can change the course of the exercise. One card might be something like the Laptop Trouble card, where you pick an SME at random who can no longer use their laptop for the rest of the exercise, or it might be a card that says Eerily Quiet (pick a new card in two minutes!). We play these cards throughout the exercise at a timed cadence, usually starting at five minutes. The unpredictability often makes participants more uncomfortable, which is also what happens in real incidents.
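As a rough illustration of how a timed deck like this could be modeled, here is a minimal Python sketch. It is not the tooling we use (the cards are physical and a facilitator runs the timer). The `ChaosCard` type, the `draw_schedule` function, and the assumption that both the first draw and the regular interval are five minutes are hypothetical, and only the two card names come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChaosCard:
    name: str
    effect: str
    # When set, the next card is drawn after this many minutes
    # instead of the regular interval (hypothetical field).
    redraw_after_min: Optional[int] = None


# The two cards named in the text; the effect wording is paraphrased.
LAPTOP_TROUBLE = ChaosCard(
    "Laptop Trouble",
    "A randomly chosen SME can no longer use their laptop.",
)
EERILY_QUIET = ChaosCard(
    "Eerily Quiet",
    "Nothing happens; pick a new card in two minutes.",
    redraw_after_min=2,
)


def draw_schedule(deck, first_draw_min=5, interval_min=5):
    """Return (minute, card name) pairs for playing the deck top to bottom."""
    schedule = []
    minute = first_draw_min
    for card in deck:
        schedule.append((minute, card.name))
        # A card such as Eerily Quiet shortens the wait before the next draw.
        if card.redraw_after_min is not None:
            minute += card.redraw_after_min
        else:
            minute += interval_min
    return schedule


if __name__ == "__main__":
    for minute, name in draw_schedule([EERILY_QUIET, LAPTOP_TROUBLE]):
        print(f"t+{minute:>2} min: {name}")
```

Keeping the redraw delay on the card itself means the schedule logic never needs to know about individual card names.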
Why the Incident Lunch has been a success at Slack
There are a few key reasons why this particular exercise has been a success. First I’ll share why it works really well for the team organizing and running it.
“I’m sure you all have spare capacity sitting around for engineering folks to create games from scratch and play them with other engineers,” said no one in most companies. Even if you were lucky enough to have some engineers create something like this, were they able to keep it up? I’ve heard tales of companies running a Dungeons and Dragons-like incident game. There is a blog post by Paul Kirk with a great discussion of teaching incident response with games, centered on Keep Talking and Nobody Explodes. Games with a detailed story and a choose-your-own-adventure setup can be really engaging for engineers, but they take a huge investment to set up and to keep running and up to date. A game centered on something like Keep Talking and Nobody Explodes can be limited to a small number of players and takes some setup by each participant, so scaling it up can be a growing investment of time. When we started running the Incident Lunch exercise regularly, we found some great benefits as a small team trying to keep time in our schedule for real incidents.
We created a Slack channel for coordinating the folks who facilitate our lunch exercise, and we have things set up so that it can be run in remote offices wherever there are willing facilitators.
It also works really well for your staff. Any staff can attend. We include our Customer Experience teams and we’ve included folks from Customer Success, Sales, and other non-technical teams in the organization. There is nothing they need to do to prepare for the training, though they should probably bring a laptop and they’ll need to block out two hours of their day. We don’t have any required preparation listed and we only tell them that we’re having an “exercise”. We fib a little bit and part of the setup is that we tell them lunch will be provided so they have an incentive to attend.
How does it work in practice?
When we run the lunch exercise there are a few simple steps for the facilitator.
Exercise set up
- Invite folks, make a calendar invite
- Reserve a conference room
- Find a volunteer who can be in the Incident Commander role. This is often someone who has taken the exercise before and wants to level up their skills while keeping the exercise a surprise
- Check that our slides are up to date with any incident process updates
- If you are in a new location you may add a map slide with the lunch exclusion zone
Facilitating the exercise
- Run through the training introduction
  - Give some background on IMS and/or Incident Response: the basics of why we are here
  - Walk through some of the common roles in your Incident Response process
  - Talk through how an incident starts at your company
  - What is the main goal of your incident response process?
    - At most places it is to restore service as quickly as possible; say this part out loud
  - Give some tips and tricks commonly used in incident response
    - Be clear and concise
    - Develop multiple plans
    - Use time boxing to keep things moving
    - Focus on roles not individuals
- Once you’ve introduced those basics it’s time to spring the surprise on them.
  - Lunch has fallen through. The front desk notified us that the bike messenger delivering lunch ended up on the other side of town.
  - Now they must come together as an incident response team and get lunch delivered as quickly as possible, hopefully before the two hours are up
- Set the ground rules / constraints
  - Order must be made outside the lunch exclusion zone, to avoid making things too easy (no running to the Subway across the street)
  - They can pick up or order in (delivery timing is often unpredictable)
  - Set a per-person budget limit (approx $25 USD/person is a good spot)
  - Lunch is expensed, so keep receipts (someone has to submit an expense report)
  - They can use whatever resources are at hand: laptops, phones, Slack, Zoom, etc.
  - Anyone with real dietary restrictions must be accommodated (and the chaos cards simulate some dietary restrictions as well)
- Have the group pick an Incident Commander (we recommend planting this person if you can); for groups made up entirely of people new to the process, the facilitator often acts as the Incident Commander.
- Hand it over and start your timer for Chaos Cards
  - Have someone from the group (or the facilitator) pick a chaos card every five minutes: if they’re doing a great job you can speed them up, and if the group is struggling you can slow them down a little bit (remember this should still be fun!)
- Once lunch is (hopefully) delivered, run a quick retrospective while everyone eats. The facilitator should share some things they noticed but lean on the group to hear what insights they had during the exercise.
- Clean up the conference room and you’re all done!
What have we learned as we run these exercises?
Adding the Chaos Cards into the exercise really helped compound the time pressure with a level of unpredictability — that feels more realistic as that is how incidents often unfold in complex systems. The Chaos Cards give you some levers to make things easier or harder as you can slow or speed up the picking of them (or skip them if it would really ruin the day of everyone playing). If you have return participants you can make sure some of the harder chaos cards show up at the top of the stack — playing a card like Network Outage early in the exercise and making everyone figure out how to tether via their cell phones can get pretty spicy.
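The two levers described above, ordering the stack and adjusting the draw cadence, can be sketched in a few lines of Python. This is only an illustration of the facilitator's judgment calls, not a tool we run. The function names, the group-state labels, and the specific minute adjustments are all assumptions made for the example; only the card names come from this post.

```python
def stack_deck(cards, hard_cards, returning_participants):
    """Put the harder cards on top of the stack for returning participants."""
    if not returning_participants:
        return list(cards)
    hard = [card for card in cards if card in hard_cards]
    rest = [card for card in cards if card not in hard_cards]
    return hard + rest


def next_interval_min(base_min, group_state):
    """Choose the wait before the next card; None means stop drawing cards.

    The state labels and minute offsets are hypothetical placeholders.
    """
    if group_state == "thriving":
        return max(1, base_min - 2)  # speed things up, but never below a minute
    if group_state == "struggling":
        return base_min + 3          # give the group some breathing room
    if group_state == "overwhelmed":
        return None                  # skip cards rather than ruin everyone's day
    return base_min


if __name__ == "__main__":
    deck = ["Eerily Quiet", "Laptop Trouble", "Network Outage"]
    print(stack_deck(deck, {"Network Outage"}, returning_participants=True))
    print(next_interval_min(5, "struggling"))
```

In practice the facilitator makes these calls by reading the room; the sketch only makes the decision points explicit.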
After a few runs, we discovered that it can be good to find someone who is willing to be the IC up front. That isn’t always possible, but the worst experiences we had were when no one really was ready to role play the IC and it turned into a struggle. We often didn’t let the person who agreed to start as the IC know too much about the exercise so it was often still a surprise for them. That leads into another great facet of this exercise — you’ll discover folks who have a predisposed skill set for facilitating an incident that you couldn’t uncover in a normal training session. This became an internal recruiting tool for us. It also can be a tool to build up ICs who want more experience and are willing to play along. If you end up in the situation where someone in the IC role is struggling, you can coach them, ask questions to prompt them, and slow down the cadence of drawing additional chaos cards.
People will quickly forget their primary goal or become sidetracked from it. In a real incident response, the goal is to mitigate the issue and restore service as quickly as possible. If you run these, you’ll find that teams that finish quickly can get their order in within 15-20 minutes of starting the exercise. When a team decides to take a poll across the assembled responders about where to eat (not too different from asking everyone in an incident to weigh in on the best solution), you’d better hope you’ll get lunch within the two-hour window. Incident Response isn’t an exercise in democracy. It’s about making decisions quickly and efficiently, and often making trade-offs that you wouldn’t make without time pressure, like choosing the fastest food option instead of your favorite food of the day. If they do take a poll, that’s a great opportunity to talk about gaining fast consensus in an incident context by using the “Are there any strong objections?” tactic.
We also see that choosing delivery instead of picking up in person is more likely to slow things down. You’re adding more complexity to your response by adding new dependencies. We had one lunch where the online ordering system never passed the order on to the restaurant; we ended up with a pretty late lunch that day.
What improvements are still on the table?
A big caveat for this exercise is that it works best in person. During the pandemic we never found a virtual substitute that was as easy to run as the in-person version. Finding alternate versions of the exercise that are as simple but work in a hybrid or remote environment would be a great upgrade. As we experimented, we would often include one or two remote employees. We didn’t require the group to deliver lunch to them, and we’d let them know beforehand in case they needed to arrange their own lunch. Even so, it meant the group had to include someone remote, which added a nice realism that matches how our work environment is day-to-day.
We don’t use our daily incident tooling during The Incident Lunch. Setting up a large group of users in our staging environment would add too much overhead. Having an ability to use our internal tooling in dry-run or demo mode would add a nice touch to the exercise by giving participants hands-on experience with the tooling they’ll use in a real incident.
Keep a log of your retrospective insights and notes about what happened in each lunch. They can be helpful as you look back to help you evolve your program.
We hope that you’ll share with us if you implement The Incident Lunch exercise for your teams and let us know what you learn and how the exercise evolves in your environment. Thanks for reading!
References
- Sev0 Conference 2024: There is no such thing as a free lunch. How Slack runs their incident lunch exercise talk by Scott Nelson Windels
- PagerDuty Summit 2021: It all Starts with a Page – How CE and Engineering respond to Incidents together at Slack talk by Niamh Tighe and Scott Nelson Windels
- SRECon 2021: Evolution of Incident Management at Slack talk by Brent Chapman
- Slack Engineering Blog All Hands on Deck by Ryan Katkov
- Medium blog post Teach Incident Response with Games by Paul Kirk
- Incident Lunch example public repository
Acknowledgments
This work could not have happened without the original training and ideas from Blackrock 3 Partners. The dedication of the staff at Slack who made sure this got off the ground (Tricia Bogen, Joe Smith, and Brent Chapman) was key to making this a success.