May 24, 2021


Role Management at Slack

Giving Slack users granular controls with roles

Jake Byman

Aish Raj Dahal

Jose M. Medina

Controlling which users can take which actions is no simple task, and building this into Slack has always been an interesting challenge. In large enterprise organizations, the standard roles we offered to customers were too broad, and delegating a generic admin role can grant someone too much power. What if you only want a specific user to be able to manage specific channels? When you make them an admin, they can perform a wide variety of actions beyond that intended scope, and can view dashboards and information unrelated to managing channels. We needed a system that was more flexible and allowed for granular permissions. We’d like to share the problems we were facing with roles, the solution we implemented, and our plans for the future.

Historical context

To date, we’ve had limited roles for what users are able to do. These roles were:

Guest
- This type of user is limited in their ability to use Slack, and is only permitted to see one or more designated channels.
Member
- This is the base type of user that does not have any particular administrative abilities, but has basic access to the organization’s Slack workspaces. When an administrative change needs to be made, these users need the support of admins and owners to make the changes.
Admin
- This type of user is the basic administrator of any organization, and can make a wide variety of administrative changes across Slack, such as renaming channels, archiving channels, setting up preferences and policies, inviting new users, and installing applications. Users with this role perform the majority of administrative tasks across a team.
Owner
- This type of user has the ability to perform the administrative actions above, as well as additional compliance abilities, such as the ability to set up Data Loss Prevention (DLP) and retention settings.
Primary Owner
- This is the head administrator of the organization. This type of user is able to take any administrative action.

New requirements for roles

We needed a granular roles system to break down the core abilities of the generic admin users. Additionally, we needed to make sure the existing roles worked alongside this new system.

We opted to create a Role-Based Access Control (RBAC) system, in which users can be granted one or more roles and receive the permissions associated with those roles. We needed the ability to delegate these roles at the organization level (for our Enterprise Grid tier of customers) or at the workspace level.

A walk-through of our data model

Before diving into how we solved this problem, let’s get some terminology clear first.

Permissions

A permission is defined as the ability to perform some action in Slack. For example, the ability to invite a user to your workspace, to archive a channel, or to view users across your organization.

Roles

A role is defined as a set of permissions. A role can be assigned to users. For example, a role could be a Channels Admin, someone who is authorized to make administrative changes to channels, such as the ability to create, rename, and archive.

Entities

An entity is an object in Slack for which we assign a user a given role. A role is assigned with respect to an entity. For example: in an Enterprise organization, someone can be a Channels Admin across an entire company (where the entity is the Enterprise), or on a single workspace on that enterprise (where the entity is the workspace).

Role-Based Access Controls (RBAC)

A system in which users are granted access to certain resources based on their role in that system. When a user takes an action, the ability to take that action is checked based on whether the user has a role that encompasses the needed permissions.
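To make the terminology concrete, here is a minimal Python sketch of how permissions, roles, entities, and role assignments might relate. Every name in it (`Permission`, `Role`, `RoleAssignment`, `has_permission`), the permission names other than `ARCHIVE_CHANNEL`, and the IDs are illustrative assumptions, not Slack’s actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Permission(Enum):
    ARCHIVE_CHANNEL = "ARCHIVE_CHANNEL"
    RENAME_CHANNEL = "RENAME_CHANNEL"  # hypothetical name
    INVITE_USER = "INVITE_USER"        # hypothetical name


@dataclass(frozen=True)
class Role:
    """A role is a named set of permissions."""
    name: str
    permissions: frozenset


@dataclass(frozen=True)
class RoleAssignment:
    """A role is always assigned with respect to an entity."""
    user_id: str
    role: Role
    entity_id: str  # an enterprise ID or a workspace ID


def has_permission(assignments, user_id, entity_ids, permission):
    """Return True if one of the user's roles on any covering entity grants it.

    entity_ids holds the entities that cover the action, for example the
    workspace where it happens and that workspace's parent enterprise.
    """
    return any(
        a.user_id == user_id
        and a.entity_id in entity_ids
        and permission in a.role.permissions
        for a in assignments
    )


CHANNELS_ADMIN = Role(
    "Channels Admin",
    frozenset({Permission.ARCHIVE_CHANNEL, Permission.RENAME_CHANNEL}),
)

# Channels Admin across the whole enterprise "E1" (hypothetical ID).
assignments = [RoleAssignment("U12345", CHANNELS_ADMIN, "E1")]
```

In this sketch, an action in workspace `T1` of enterprise `E1` would be checked with `has_permission(assignments, "U12345", {"T1", "E1"}, Permission.ARCHIVE_CHANNEL)`, so an organization-wide assignment covers every workspace while a workspace-level assignment covers only that workspace.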

How this works in practice

When a user performs an action, we check the permissions needed for that action. If the user has been delegated those permissions through their assigned roles, then they are permitted to perform that action. If they do not explicitly have those permissions, we fall back to our legacy model (our predefined roles) to determine if they have the ability to perform that action.

When a user takes an action, we always make an authoritative check to ensure that the user can in fact perform that action. On each Slack client, we also keep a non-authoritative view of the UI components that we believe the user can see at a given time. We wanted to optimize for low latency, so the non-authoritative checks are made against our Flannel edge cache. Because a cache can briefly lag behind the source of truth, the client-side permissions cannot be authoritative, and they are updated in near real time rather than instantly.
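The difference between the two kinds of check can be illustrated with a small sketch. The `EdgeCache` class below is a hypothetical stand-in for an edge cache such as Flannel, and the dictionary named `source` stands in for the authoritative store; neither reflects Slack’s real interfaces.

```python
class EdgeCache:
    """Stand-in for an edge cache that holds a possibly stale snapshot."""

    def __init__(self):
        self._snapshot = {}

    def refresh(self, source_of_truth):
        # Copy the data so later changes to the source are not visible here.
        self._snapshot = {
            user: set(perms) for user, perms in source_of_truth.items()
        }

    def probably_can(self, user_id, permission):
        """Non-authoritative: good enough to decide which UI to render."""
        return permission in self._snapshot.get(user_id, set())


def perform_action(source_of_truth, user_id, permission):
    """Authoritative: always consult the source of truth before acting."""
    if permission not in source_of_truth.get(user_id, set()):
        raise PermissionError(f"{user_id} lacks {permission}")
    return "ok"


source = {"U12345": {"ARCHIVE_CHANNEL"}}
cache = EdgeCache()
cache.refresh(source)

# The role is revoked, but the cache has not been refreshed yet.
source["U12345"].discard("ARCHIVE_CHANNEL")
stale_answer = cache.probably_can("U12345", "ARCHIVE_CHANNEL")  # still True

try:
    perform_action(source, "U12345", "ARCHIVE_CHANNEL")
    outcome = "allowed"
except PermissionError:
    outcome = "rejected"  # the authoritative check catches the revocation
```

The stale cache can at worst show a button that no longer works; the action itself is still rejected by the authoritative check.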

To modularize this feature, we opted to build it as a containerized Go service, separate from our Hack webapp monolith, and have the webapp communicate with this service over gRPC.

We created three new system roles to start:

Channels Admin

This type of user has the permission to archive channels, rename channels, create private channels, and convert public channels into private channels.

Users Admin

This type of user has the ability to add and remove users from workspaces, as well as view the user groups for the organization.

Roles Admin

This type of user has the ability to administer roles and assign roles to users.

Design of the backend

The role information is persisted in our Vitess data store. Given the distributed nature of this service, we took advantage of Vitess’s sharding architecture and sharded by user_id. We did this so that we can efficiently look up database rows by the relevant user ID, as opposed to scanning through all rows in the table. Rather than creating a new database storage system, we opted to have our permissions service read from and write to the same Vitess store used by our webapp monolith, in order to have a single centralized data store and avoid data drift.
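The benefit of sharding by user ID can be shown with a toy model. This is a deliberate simplification: Vitess routes queries through its own sharding configuration, not a modulo like the one below, and `ShardedRoleStore`, the shard count, and the role and entity values are all hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # illustrative only


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard deterministically using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedRoleStore:
    def __init__(self, num_shards: int = NUM_SHARDS):
        self.num_shards = num_shards
        # One mapping per shard: user_id -> list of (role, entity_id).
        self.shards = [defaultdict(list) for _ in range(num_shards)]

    def assign(self, user_id, role, entity_id):
        index = shard_for_user(user_id, self.num_shards)
        self.shards[index][user_id].append((role, entity_id))

    def roles_for(self, user_id):
        """Touch only the shard that owns this user, never the others."""
        shard = self.shards[shard_for_user(user_id, self.num_shards)]
        return list(shard.get(user_id, []))


store = ShardedRoleStore()
store.assign("U12345", "channels_admin", "E1")
```

Because every row for a given user lands on the same shard, `store.roles_for("U12345")` consults a single shard regardless of how many shards exist.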

Our communication architecture

Prior to this, only an admin user or the creator of a channel was able to archive a channel. This rule is expressed through a JSON preference blob attached to our team model. For example:

{"who_can_archive_channels": "admin"}

Now we want users with our new Channels Admin role to be able to archive channels.

If the user has the associated role, they are delegated the permissions for that role, in this case ARCHIVE_CHANNEL. If the user’s role is revoked, we want that revocation to take effect immediately in our authoritative checks, and as close to real time as possible in our non-authoritative checks.

Slack’s technical architecture is complex, with many moving pieces and services. We wanted a uniform communication protocol between them for permission checks. To accomplish this, we opted to use Protocol Buffers sent over gRPC to communicate between our webapp monolith and the permission service.

For example, after a successful marketing campaign, a channel #proj-marketing-campaign is ready to be archived. Let’s say Bob is a Channels Admin for the organization. Bob goes into the channel and attempts to archive the channel. This enqueues an API request, which goes to our webapp’s API tier, where we run into this check:

Channels\ChannelCanActorArchivePolicy::checkPolicy($permission_context, $channel)

We separate our checks into policies, where we have some context (in this case, a user calling an API) and an entity (in this case, a channel). In this policy, we have several rules:

DenyIfActorRestricted,
AllowIfUserHasArchiveChannelLegacyPermissions,
AllowIfUserHasArchiveChannelPermission

We apply a series of rules to decide whether this user can perform the action. The rules are executed in order. First, we immediately deny the request if the user is a restricted user, as guest users cannot archive channels. If the user is denied, we immediately stop processing this policy and return. If, however, we get past that rule, we then check whether the user has the legacy permission to perform the action. In this case, we check whether they are an admin user. If they are not an admin user, we move on to the last rule, AllowIfUserHasArchiveChannelPermission. This is the rule that sends out the request to our permission service. In this case, we send out the following request:

EligibilityRequest(
  'team_id' => T12345, // Bob's team
  'user_id' => U12345, // Bob's user ID
  'entity_id' => C12345, // #proj-marketing-campaign
  'permission' => ['ARCHIVE_CHANNEL'] // What Bob intends to do with this channel
)

This request is sent over gRPC to our Go permissions service. When the service receives the request, it parses out the contextual information and queries our Vitess data store to see whether a row exists that grants Bob this permission. If a row exists, the service sends back an ALLOW response, and the webapp allows the action to proceed.
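The flow from policy rules to the permission lookup can be sketched end to end in a few lines of Python. This is only an illustration under several assumptions: the real policy lives in Hack and the real service in Go behind gRPC, an in-memory SQLite table stands in for Vitess, and the `user_permissions` table layout, the `DENY` response, the default-deny fallback, and all function names are hypothetical.

```python
import sqlite3
from dataclasses import dataclass

ALLOW, DENY = "ALLOW", "DENY"


@dataclass
class Actor:
    team_id: str
    user_id: str
    is_restricted: bool = False  # guest user
    is_admin: bool = False       # legacy admin role


def make_store():
    """In-memory stand-in for the role store; the table layout is hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_permissions (team_id TEXT, user_id TEXT, permission TEXT)"
    )
    return conn


def check_eligibility(conn, team_id, user_id, entity_id, permissions):
    """Service side: ALLOW only if a row exists for every requested permission.

    entity_id is part of the request but unused here, because this sketch
    scopes grants to the team rather than to individual channels.
    """
    if not permissions:
        return DENY
    for permission in permissions:
        row = conn.execute(
            "SELECT 1 FROM user_permissions"
            " WHERE team_id = ? AND user_id = ? AND permission = ? LIMIT 1",
            (team_id, user_id, permission),
        ).fetchone()
        if row is None:
            return DENY
    return ALLOW


# Each rule returns ALLOW, DENY, or None to abstain and defer to the next rule.
def deny_if_actor_restricted(actor, channel_id, conn):
    return DENY if actor.is_restricted else None


def allow_if_legacy_permission(actor, channel_id, conn):
    return ALLOW if actor.is_admin else None


def allow_if_archive_channel_permission(actor, channel_id, conn):
    result = check_eligibility(
        conn, actor.team_id, actor.user_id, channel_id, ["ARCHIVE_CHANNEL"]
    )
    return ALLOW if result == ALLOW else None


RULES = [
    deny_if_actor_restricted,
    allow_if_legacy_permission,
    allow_if_archive_channel_permission,
]


def can_archive(actor, channel_id, conn):
    """Run the rules in order; the first decision wins (default DENY is assumed)."""
    for rule in RULES:
        decision = rule(actor, channel_id, conn)
        if decision is not None:
            return decision
    return DENY


store = make_store()
store.execute(
    "INSERT INTO user_permissions VALUES (?, ?, ?)",
    ("T12345", "U12345", "ARCHIVE_CHANNEL"),
)
bob = Actor(team_id="T12345", user_id="U12345")
guest = Actor(team_id="T12345", user_id="U99999", is_restricted=True)
print(can_archive(bob, "C12345", store))    # ALLOW
print(can_archive(guest, "C12345", store))  # DENY
```

Ordering the rules this way means the lookup against the store is only reached when the cheaper checks have not already produced a decision.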

Our client architecture

In the example above, Bob is a Channels Admin for the organization and wants to archive a channel.

How exactly does the client know Bob is a Channels Admin? It doesn’t! The client only cares whether Bob is permitted to archive the specified channel.

On boot, the client performs an initial fetch of the essential permissions. This initial set of permissions is stored in Redux and cached for a period of time. When an admin assigns the user a new role, the client receives a real-time message with the additional permissions associated with that role. An update action is then dispatched, and any associated UI components update accordingly.
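The client-side behavior described above follows the usual reducer pattern, which can be sketched in Python even though the real client uses Redux. The action type names (`permissions/loaded`, `permissions/granted`) and the `INVITE_USER` permission are invented for illustration.

```python
def permissions_reducer(state, action):
    """Pure reducer: returns the next permission set without mutating the old one."""
    if action["type"] == "permissions/loaded":
        # Initial fetch on boot replaces whatever was there.
        return frozenset(action["permissions"])
    if action["type"] == "permissions/granted":
        # Real-time message after a role assignment adds to the set.
        return state | frozenset(action["permissions"])
    return state


def can(state, permission):
    """The UI asks only whether an action is permitted, never which role grants it."""
    return permission in state


state = frozenset()
state = permissions_reducer(
    state, {"type": "permissions/loaded", "permissions": ["INVITE_USER"]}
)
show_archive_before = can(state, "ARCHIVE_CHANNEL")

state = permissions_reducer(
    state, {"type": "permissions/granted", "permissions": ["ARCHIVE_CHANNEL"]}
)
show_archive_after = can(state, "ARCHIVE_CHANNEL")
```

Here `show_archive_before` is false and `show_archive_after` is true, which mirrors a UI component appearing once the real-time message arrives.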

Planning for a smooth rollout

To make sure we rolled out this permission system safely, without disrupting Slack usage, we opted to separate the release into the following steps:

1. Create a loopback gRPC service that lived inside the Slack webapp
2. Roll out to our internal workspace
3. Roll out to our pilot customers
4. Start reading from the external service in dark mode
5. Read from the service in light mode

In dark mode, we read from both sources of truth, the permission service and the webapp, but still relied fully on the webapp. During this process, we compared the permission check results to ensure that they matched. If they did not match, we recorded the mismatch in our Prometheus monitoring system and alerted on the difference. Once we were confident that the results were consistent, we switched to light mode, in which we read solely from the permission service as the source of truth.

[Figure: graph showing when we made the switch]
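The dark-mode comparison can be sketched as a thin wrapper around the two checks. The mode names `dark` and `light` come from the rollout steps; the `off` mode, the `check_permission` function, the metric name, and the use of a `Counter` in place of a real metrics client are assumptions made for this example.

```python
import logging
from collections import Counter

logger = logging.getLogger("permissions.rollout")
metrics = Counter()  # stand-in for a real metrics client


def check_permission(mode, request, webapp_check, service_check):
    """Return the permission decision according to the rollout mode.

    off:   webapp only
    dark:  ask both, compare, but trust the webapp
    light: permission service only
    """
    if mode == "light":
        return service_check(request)

    webapp_result = webapp_check(request)
    if mode == "dark":
        service_result = service_check(request)
        if service_result != webapp_result:
            metrics["permission_check_mismatch"] += 1
            logger.warning(
                "mismatch for %s: webapp=%s service=%s",
                request, webapp_result, service_result,
            )
    return webapp_result
```

In dark mode a disagreement never changes what the user experiences; it only increments the mismatch counter, which is what makes it safe to run against production traffic before switching over.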

We then rolled this change out to all of our customers.

[Figure: graph of the rollout to all customers]

Additional requirements for building roles for Slack

Backwards compatibility

We needed to make sure everything stayed backwards compatible with our current permissions system, and didn’t disrupt the flow of Slack customers or engineering teams. To do this, we needed to check the existing preferences, as well as call out to our permissions service, to make sure everything worked harmoniously together.

Tooling

We had to make sure other engineering teams at Slack could easily understand this system and migrate their features onto roles. We built a role CLI tool that leverages HHAST’s codegen library to generate new roles and permissions. This abstracts the roles framework away from other development teams’ business logic, so that they can add new roles and permissions without worrying about whether they have made every change in the correct place.

Adding a new role or permission involves changes to JSON files, protocol buffers, TypeScript files, and the generated Hack code. Instead of requiring engineers on other teams to learn the underlying code that powers the roles framework, we let them think about roles at a higher level and have the tooling take care of the lower-level changes.
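The idea of fanning one high-level definition out into several artifacts can be illustrated with a toy generator. This is not the actual CLI tool, which generates Hack code; the `generate_artifacts` function, the output shapes, and the `RENAME_CHANNEL` name are hypothetical.

```python
import json


def generate_artifacts(role_name, permissions):
    """Derive JSON, protobuf, and TypeScript snippets from one role definition."""
    json_doc = json.dumps(
        {"role": role_name, "permissions": permissions}, indent=2
    )

    # Protobuf enums conventionally reserve 0 for an unspecified value.
    proto_lines = ["enum Permission {", "  PERMISSION_UNSPECIFIED = 0;"]
    proto_lines += [
        f"  {name} = {number};"
        for number, name in enumerate(permissions, start=1)
    ]
    proto_lines.append("}")

    ts_union = " | ".join(f'"{name}"' for name in permissions)

    return {
        "json": json_doc,
        "proto": "\n".join(proto_lines),
        "typescript": f"export type Permission = {ts_union};",
    }


artifacts = generate_artifacts(
    "channels_admin", ["ARCHIVE_CHANNEL", "RENAME_CHANNEL"]
)
```

Generating every artifact from a single definition keeps the files in agreement, which is the property the tooling is meant to guarantee for teams adding roles.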

Let’s role

We’ve shipped an initial set of roles for all users on our Enterprise Grid product, with more to come in the future. We’ve already seen happy customers using the feature and getting value from it. Customers now have granular control over what their users are able to do, and we have a sustainable way to add more controls in the future.

Special thanks to the Enterprise Admin and Flannel teams, without whom this would not have been possible.

Are you also excited about building scalable, efficient, role-based access controls? If so, come work with us!
