How does a company know when it has been hacked? Let’s list some ways, in order of best case to worst case:

  • The company’s employees notice something strange
  • A 3rd party contacts the company because they notice something strange
  • Hacker(s) contact the company because they want them to notice something strange
  • They don’t

At first glance, this list seems to indicate “who notices” is the important bit, but that is not quite right. “When” plays an even bigger part here. Time is important because the longer someone has unauthorized access to your systems, the more damage they can do. When something bad happens, and assuming you’ve enabled logging, the first indication is probably sitting right there in a log.

We have at our disposal software to archive and search all of the data we can possibly imagine. We collect them and tag them and store them and :heart: them, because they help us do things like improve performance and verify that a new feature makes users happy. At Slack, system logs are some of our favorite things.

A screenshot of Slack’s searchable logs today.

Logs are also extremely important to our security team. Collecting authentication, system and network logs helps us hunt for anomalies and suspicious activity. Having mountains of data is nice, but to make it useful, we need to be looking at it all of the time.

So we have this data and we know that time is super important. How do we make sure the time between $bad_thing_happens and $we_notice_bad_thing is as short as possible? A few ideas:

You can hire a great security team and give them unlimited coffee and have them each `tail` a log. This may work, but it doesn’t scale. Eventually the volume of data will overwhelm the team. Your team may also experience alert fatigue, where they become desensitized and lose their ability to respond meaningfully to events. A great security team should probably be building things instead of staring at logs. :thumbsdown:

Maybe you employ security operations folks to watch your systems 24/7? You can automate much of the process and equip the team with tools they need for investigation. This is a pretty good strategy, but there are a couple of problems. First, it is expensive. You’d need to staff this operations center with people who can make informed decisions. Second, no matter how well these folks understand the systems they are protecting, they can’t know with certainty that something they detect is a problem without reaching out to others. :neutral_face:

Okay, one more idea. What if you send alerts directly to the person(s) involved and ask them whether the event was good or bad? What if you also do it the moment you detect something strange? This means you will be asking exactly the right person whether they did something within moments of seeing the activity. Now we’re onto something! :thumbsup::tada:

So what does this look like in practice? Let’s walk through the lifecycle of an alert using this idea:

  1. Your monitoring system notices something suspicious
  2. A bot sends the employee a message in Slack asking if they did $thing
  3. If the employee confirms it was them, the alert is resolved and we’re done here. If the employee tells the bot it was _not_ them OR they don’t respond within a few minutes, we escalate the alert to a security team
  4. If the alert requires action, the security team contacts the employee and begins investigating
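
In code, that lifecycle might look something like the minimal sketch below. The helper names (`message_employee`, `employee_confirms`, `close_alert`, `escalate_to_security`) are hypothetical placeholders for whatever chat, 2-factor, and paging tooling you have; this illustrates the flow, not our actual implementation.

```python
# Minimal sketch of the alert lifecycle above; every helper here is a
# hypothetical placeholder, not part of our real tooling.
def handle_alert(employee, activity):
    """Ask the person who triggered the activity whether it was really them."""
    message_employee(employee, f"Did you just {activity}?")

    if employee_confirms(employee, timeout_seconds=300):
        close_alert(employee, activity)          # confirmed by the employee: done
    else:
        # Denied, or no answer within a few minutes: a human investigates
        escalate_to_security(employee, activity)
```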

But we have a problem! What if some baddie has taken over your employee’s Slack account? If a bot says “did you do this” and the baddie replies “yes”, this system won’t really work. To solve this problem, out of band communication can be used to confirm the alert.

The above idea isn’t hypothetical; it is exactly how we handle security alerting at Slack.

When a Slack employee receives an alert, they reply to the bot with `ack`, and the bot then sends a 2-factor push to their mobile device for final confirmation. If the baddie types `ack`, the evildoer still won’t have access to the employee’s mobile device to confirm the ack. Additionally, if the baddie just deletes the message, it will automatically escalate to the security team within moments. There are some actions that will escalate regardless of the employee’s response, but most suspicious activity can be handled directly by the employee without involving the security team.
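
Continuing the earlier sketch, the `employee_confirms` step might look roughly like this once the 2-factor push is added. `wait_for_reply` and `two_factor_push` are again hypothetical stand-ins; the latter should only return true when the employee approves the push on their phone.

```python
# Hypothetical confirmation step: require both an 'ack' in Slack and an
# approved 2-factor push before treating the activity as legitimate.
def employee_confirms(employee, timeout_seconds=300):
    reply = wait_for_reply(employee, timeout_seconds)   # deleted or ignored message -> None

    if reply != "ack":
        return False        # silence or anything other than 'ack' means "escalate"

    # The 'ack' alone proves nothing: a hijacked Slack account could type it.
    # Only the out-of-band push to the employee's phone closes the loop.
    return two_factor_push(employee)
```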

To illustrate these ideas, imagine you are monitoring for any users who run the `flurb` command on your servers. In this example, `flurb` is something an operator may run occasionally, but it is rarely needed legitimately and is commonly used by hackers.
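
Before walking through the examples, here is one very rough way the detection side could feed such a bot. The audit-log path, the line format, and the `watch_for_sensitive_commands` helper are all hypothetical, and it reuses the `handle_alert` sketch from earlier; a real pipeline would more likely hook into your existing log collection than tail files on each host.

```python
# A toy detector: follow a (hypothetical) per-host command audit log and raise
# an alert whenever a sensitive command like `flurb` shows up.
import time

SENSITIVE_COMMANDS = {"flurb"}
COMMAND_LOG = "/var/log/command-audit.log"       # hypothetical path and format

def watch_for_sensitive_commands():
    with open(COMMAND_LOG) as log:
        log.seek(0, 2)                           # start at the end, like `tail -f`
        while True:
            line = log.readline()
            if not line:
                time.sleep(1)
                continue
            # Assume each line looks like: "<timestamp> <host> <user> <command ...>"
            parts = line.rstrip("\n").split(" ", 3)
            if len(parts) < 4:
                continue
            _, host, user, command = parts
            if command.split()[0] in SENSITIVE_COMMANDS:
                handle_alert(user, f"run `{command}` on {host}")
```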

Example 1 (the good case):

  1. Ryan logs into accountingserver01
  2. He runs `flurb -export` to show detailed information about the server
  3. The monitoring system sees this and a bot named Securitybot sends Ryan a Slack message saying “Hi there, I see you have run `flurb -export` on accountingserver01. This is a sensitive command, so please acknowledge this activity by typing ‘acknowledge’.”
  4. Ryan replies to the bot with “acknowledge”
  5. Ryan’s phone buzzes with a 2-factor push. He taps “Confirm” to verify the action and the incident is closed.
An example interaction with Securitybot.

Example 2 (the bad case):

  1. An attacker has stolen Alice’s ssh key.
  2. The attacker, posing as Alice, logs into supportserver01 and runs `flurb -export`.
  3. The monitoring system sees this and a bot named Securitybot sends Alice a Slack message saying “Hi there, I see you have run `flurb -export` on supportserver01. This is a sensitive command, so please acknowledge this activity by typing ‘acknowledge’.”
  4. The attacker cannot complete the 2-factor confirmation without Alice’s phone, so the alert escalates to the security team.
  5. The security team takes action to disable Alice’s accounts globally.

Most importantly, this system allows us to watch more things than previously possible by deputizing every person at Slack as part of our security team, meaning we have hundreds of people constantly on watch.

The tools we have used to create some of the bots used by our team are freely available on Slack’s GitHub page: python-rtmbot and python-slackclient. We also have node-slack-client, an excellent open source library for Node if you’d like to create your own interesting integrations.
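
As a rough illustration, asking the question from the bot’s side with python-slackclient could look like the snippet below. The token, user ID, and message text are placeholders, and this is a simplified sketch rather than Securitybot itself.

```python
from slackclient import SlackClient

slack = SlackClient("xoxb-your-bot-token")     # placeholder bot token

# Open (or fetch) a direct-message channel with the employee...
im = slack.api_call("im.open", user="U12345678")
channel_id = im["channel"]["id"]

# ...and ask them about the activity we just observed.
slack.api_call(
    "chat.postMessage",
    channel=channel_id,
    text="Hi there, I see you have run `flurb -export` on accountingserver01. "
         "This is a sensitive command, so please acknowledge this activity "
         "by typing 'acknowledge'.",
    as_user=True,
)
```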

This investment in security alerting and monitoring is especially important as we grow, and our solution should help us keep pace with the rapid growth of Slack the company. Still, this is just one example of how we handle security at Slack. Stay tuned for posts from the security team explaining the tools and techniques we use to keep Slack safer, more secure, and more productive.

Want to help Slack solve tough problems and join our growing team? Check out all our engineering jobs and apply today.