March 9, 2022

Applying Product Thinking to Slack’s Internal Compute Platform

Javier Turegano, Director, Software Engineering

According to a recent Thoughtworks radar, “the industry is increasingly gaining experience with platform engineering product teams that create and support internal platforms.” They caveated this with a piece of advice: “When creating a platform, it’s critical to have clearly defined customers and products that will benefit from it rather than building in a vacuum.”

In addition, in their book Team Topologies, Matthew Skelton and Manuel Pais define a platform team as “a grouping of other team types that provide a compelling internal product to accelerate delivery by stream-aligned teams.”

But where do we start? It seems to me that we can reuse many of the techniques developed over the years for building external products to build this compelling internal product. We can simply call this applying product thinking to our internal platform.

At a personal level, one of the things that attracted me to Slack was the potential to work and learn in a company that is in the process of building a platform where thousands of developers globally can develop and expand the core product. I interpreted these two facts as really positive signs during my interview:

Slack also adopted similar product principles for their internal infrastructure platform, and
Teams like Cloud Engineering and Internal Tools employed their own Product Managers

We’ve come a long way on applying product thinking to our internal platform at Slack in the last couple of years, in particular to our internal compute orchestration platform, code-named “Bedrock.” The intent of this article is to share some of the initiatives that we have started to bring a product-thinking approach to building our internal platforms.

Introducing Bedrock

The Bedrock platform enables our developers to build their code, package it into a Docker container, and allocate computing resources to run it, all configured via a bedrock.yaml file.

In the words of Tricia Bogen, who led the technical design and implementation within Cloud Engineering, Bedrock leverages a curated selection of Kubernetes features alongside guardrails and automation that aim to make launching production-grade services simpler, more pleasant, and more productive. Bedrock abstracts Slack’s infrastructure ecosystem and makes it easy to navigate, including CD pipelines, building and deploying containers, service discovery, secret management, and an encrypted mesh network. As a result, developers do not have to be experts in a myriad of technologies like Kubernetes, Jenkins, Consul, Vault, Envoy, and Nebula, and the amount of configuration required for a service drops from hundreds of lines to tens of lines. The benefits include faster feedback loops for our developers and fewer configuration errors (thanks to the guardrails).

How we’ve stayed close to our internal customers

Since the inception of Bedrock as a platform we’ve treated it as an internal product and have worked closely with our “customers,” aka the developers building services on top of Bedrock. In order to be close to their needs we’ve tried a few things:

User experience interviews

An important thing to take into account when you are developing your product is to differentiate the personas of the people using your platform. We’ve been developing and evolving Bedrock over a couple of years, and our user base—and the way they use the platform—has evolved too. For that reason, we conduct user experience interviews to learn from our users. I have to admit that I learned a lot when I first attended these with one of our product managers. A lot of use cases and ways of using the platform differed from the way we were using it within the platform team, or how we intended them to work.

The methodology can vary a lot but in general you want to see how the users are engaging with your product. For that first round, the question was very simple:

“Show us how you add a line of code to your service and release it all the way to production.”

All our preconceptions about how developers worked with our platform went out the window as we interviewed one team after another. We found that teams use our product in different ways, depending on their level of experience with the platform, with building applications in containers, or with the language or technology chosen to build their service. As an example, there was quite a bit of difference between a team building a Java app and a team deploying a third-party application. The most important takeaway for me was the effect that the dependencies of your app had on your development process. Some of those dependencies could be on other services (internal and external) or on where your data was stored.

Some teams have an efficient setup that allows them to add a line of code and test it in a matter of minutes, while others spend many hours every week standing up all the services required for their testing. It can vary from “And now I’ll open an SSH tunnel to our integration database from my laptop to test the code” and “I have a special configuration for the desktop client that allows me to plug in an IP address for testing” to “I’ll fire up a docker-compose that brings up a database that takes 35 mins to boot and load a test data set from S3.”

After this round of interviews, we identified a set of features that would make life much easier for developers. One of them included adding a new flag to our command line tool that will allow you to bring up any number of copies of a dev environment using your local code to quickly test a change. This would save hours for that initial test, and also allow you to build quick prototypes.

User surveys

Surveys are also a great mechanism to get input from your users. Using Polly, we sent a Bedrock survey to all the users in the #announce-bedrock and #devel-bedrock channels.

Getting this information on a regular basis helps you set your roadmap. For example, as we were trying to gain adoption from users of previous generations of deployment and compute systems, we used the survey to identify the most common blockers for adoption. Then we prioritized adding features or changes to our platform so those were no longer a problem. Another interesting category of changes we were able to identify was “low hanging fruit” that our engineers could tackle right away, providing a sense of quick response after the user gave us feedback through the survey. In one case we were able to fix deficiencies with the logging process that were making the service bootstrap process fail silently. We also added hints on how to fix your configuration files when there was a problem. Wouldn’t you love your tool to indicate that what you were trying to do failed, and provide a pointer to a dashboard or to log aggregation and a hint on how to fix it? Our users did too!
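To make the idea of configuration hints concrete, here is a minimal sketch of a validator that pairs every problem it finds with a suggestion for fixing it. It is an illustration only, not Bedrock's actual checks: the field names (`service_name`, `image`, `replicas`) and the hint wording are assumptions, and the configuration is represented as an already-parsed dictionary rather than a real bedrock.yaml file.

```python
from typing import Dict, List, NamedTuple


class ConfigProblem(NamedTuple):
    """One detected problem plus a hint on how to fix it."""
    message: str
    hint: str


# Hypothetical required fields, each mapped to the hint shown when it is missing.
REQUIRED_FIELDS: Dict[str, str] = {
    "service_name": "add a line such as `service_name: my-service`",
    "image": "add a line such as `image: my-service:latest`",
}


def check_config(config: dict) -> List[ConfigProblem]:
    """Return every problem found, instead of failing silently or on the first one."""
    problems: List[ConfigProblem] = []
    for field, hint in REQUIRED_FIELDS.items():
        if field not in config:
            problems.append(
                ConfigProblem(f"missing required field '{field}'", hint)
            )
    # `replicas` is optional here; when present it must be a positive integer.
    replicas = config.get("replicas", 1)
    is_int = isinstance(replicas, int) and not isinstance(replicas, bool)
    if not is_int or replicas < 1:
        problems.append(
            ConfigProblem(
                f"'replicas' must be a positive integer, got {replicas!r}",
                "use a whole number such as `replicas: 2`",
            )
        )
    return problems


if __name__ == "__main__":
    for problem in check_config({"image": "my-service:latest", "replicas": 0}):
        print(f"error: {problem.message}\n  hint: {problem.hint}")
```

Collecting all problems before reporting them means a user sees every fix they need in one pass, which fits the goal of giving quick, actionable feedback.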

NPS (Net promoter score)

Our journey with using NPS has been interesting. As with everything, the devil is in the details. One of the mistakes that we made early on was to survey everyone in the organization. Some people responded more based on their perception of the team or other services that were offered or even their lack of interest in trialing a container-based platform. That was useful in itself as we heard a lot about what it would take for them to try our platform. But later on we started to focus our efforts on actual users of the platform (people who have tried to build or deploy something using it), and that changed the nature of the responses.
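For readers less familiar with the metric, NPS is conventionally computed from 0 to 10 "how likely are you to recommend this" answers as the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (scores of 0 to 6). The sketch below is illustrative only: the respondent names, scores, and the `platform_users` set are made up, and it is not Slack's actual survey tooling. It shows how restricting the calculation to people who have actually used the platform can change the result.

```python
from typing import Dict, Iterable, Set


def net_promoter_score(scores: Iterable[int]) -> float:
    """Percentage of promoters (9-10) minus percentage of detractors (0-6)."""
    values = list(scores)
    if not values:
        raise ValueError("at least one response is required")
    if any(s < 0 or s > 10 for s in values):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(1 for s in values if s >= 9)
    detractors = sum(1 for s in values if s <= 6)
    return 100.0 * (promoters - detractors) / len(values)


def nps_for_users(responses: Dict[str, int], users: Set[str]) -> float:
    """NPS computed only over respondents who are in `users`."""
    return net_promoter_score(
        score for person, score in responses.items() if person in users
    )


if __name__ == "__main__":
    # Hypothetical survey answers from the whole organization.
    responses = {"ana": 10, "ben": 3, "cy": 9, "dee": 2, "eli": 7}
    # Hypothetical set of people who have built or deployed on the platform.
    platform_users = {"ana", "cy", "eli"}

    print(net_promoter_score(responses.values()))      # 0.0
    print(nps_for_users(responses, platform_users))    # about 66.7
```

The two numbers answer different questions: the first mixes in perception from people who have never tried the platform, while the second reflects the experience of those who have.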

Dogfooding

This may sound obvious, but if you want to understand the highs and lows of your product, you should be using it constantly. This is something that we do a lot with Slack-the-product, where new features are rolled out to an internal-only alpha release of Slack, called “Dogfood,” that every Slack employee uses. This mechanism allows product teams to catch errors and get feedback early, without having to expose the feature to external users. This lets them try multiple approaches and prototype features that aren’t polished yet.

In the case of Bedrock, we started migrating our own internal services to it, in order to have firsthand experience using the platform. Even if we don’t have a polished dogfooding platform, as we do with Slack’s main product, we have a development tier where we can roll out changes to our dev or pre-production clusters before going to production.

User advisory group

It is very common for external products to have a “Customer Advisory Board” that meets or works together at regular intervals to shape the future of the product. To avoid a collision with the other type of CAB (Change Advisory Board), we decided to call ours the Bedrock Advisory Group.

To establish the group, we asked different teams to nominate a representative to attend the monthly meetings, act as the voice of that team, and relay what was discussed back to their team. We also dedicated a few spots to people from particular groups:

Someone very new to the platform
Someone who has shared concerns, or is not very amenable to your platform
Someone in our senior technical leadership group (principals or architects depending on the naming used in your company), and
Our executive sponsors/stakeholders (our SVP of engineering and Senior Director of Product)

We try to cover these core topics in every meeting, and also allow members to propose their own agenda items.

News (e.g. what is the platform team working on, what new features are available, or how is adoption or NPS tracking)
Roadmap (e.g. what’s in the pipeline and is our prioritization correct)
User stories (we invite one or two groups to share how they are using our platform, and describe what was easy or difficult and what could have made their lives easier)
Feedback and questions (e.g. debating topics like the need to slow down deployments to ensure consistency in our Service Discovery tier to increase safety)

We then share the recording of the session for anyone to consume on their own time and to raise further questions or topics via their representatives.

Self-paced training and video materials

Given the distributed nature of Slack’s engineering workforce, it was difficult to replicate the classroom experience that I have personally used with previous teams. Instead, we have tried a couple of approaches:

A self-guided “hello world” example that teaches you how to build your first service and exercise the most common tools in our platform
A set of videos that introduce developers to the different concepts of the platform and walk them through some use cases

One area where we know we can improve is in our technical documentation and marketing materials. Our current documentation is great for people with some domain expertise, but it’s sometimes difficult to navigate and hard to discover. To bring people quickly up to speed with the benefits and possibilities of the platform, we could create material that emulates the marketing pages you find in existing commercial offerings. We are currently exploring Backstage as a way to centralize our developer tooling and documentation.

Promote your new product features within your own product

One of the most important areas of a product like Bedrock is to offer visibility and ease of use to all the engineers. To fulfil the need for a visual interface (where you can check your services, see where they are deployed, view their status, etc.) we developed a web portal that we call “Gaz.” When we develop new functionality in Bedrock there is an opportunity to reflect that in our Gaz interface.

With that in mind one of the engineers in the team created a prompt for users when they are working with a particular pod to start using one of the features we developed: Debug Actions. The idea behind Debug Actions is to create a set of commands that are approved to be run in your pod, defined by configuration in your bedrock.yaml, and able to be triggered from the Gaz interface.
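The core idea of an approved command set can be sketched as an allowlist lookup: the interface sends only the name of an action, and anything not declared in configuration is refused. This is a minimal illustration under stated assumptions, not the real Debug Actions implementation: the `debug_actions` key, the `name` and `command` fields, and the example commands are all hypothetical, and the configuration is shown as an already-parsed dictionary rather than an actual bedrock.yaml.

```python
import shlex
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DebugAction:
    """A named, pre-approved command that may be run in a pod."""
    name: str
    command: Tuple[str, ...]


def load_debug_actions(config: dict) -> Dict[str, DebugAction]:
    """Build the allowlist from the (hypothetical) `debug_actions` section."""
    actions: Dict[str, DebugAction] = {}
    for entry in config.get("debug_actions", []):
        # Split once at load time so no shell ever interprets the string.
        argv = tuple(shlex.split(entry["command"]))
        actions[entry["name"]] = DebugAction(entry["name"], argv)
    return actions


def resolve_action(actions: Dict[str, DebugAction], requested: str) -> List[str]:
    """Return the argv for an approved action, or refuse anything else."""
    try:
        return list(actions[requested].command)
    except KeyError:
        raise PermissionError(
            f"'{requested}' is not an approved debug action"
        ) from None


# Hypothetical parsed configuration for one service.
EXAMPLE_CONFIG = {
    "debug_actions": [
        {"name": "thread-dump", "command": "jstack 1"},
        {"name": "disk-usage", "command": "df -h /data"},
    ]
}

if __name__ == "__main__":
    approved = load_debug_actions(EXAMPLE_CONFIG)
    print(resolve_action(approved, "disk-usage"))
```

Because callers pass an action name rather than a command string, a web interface built this way never has to accept arbitrary input to execute, which is what makes exposing it as a button reasonable.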

Check out the prompt at the bottom of this picture to try out Debug Actions:

We have also tried this with prompts from the command line interface, which have proven effective.

Conclusion

In order to be successful in building your internal platform, you can benefit from applying product thinking and developer experience techniques to your processes. At Slack, we obtained feedback from our developers, prioritized our investments, and made our offering more appealing.

In order to better identify what our users needed, we employed user experience interviews and user surveys, which included tracking NPS (net promoter score). On top of that, a user advisory group has served as a great forum to present new features, discuss roadmaps, and for our customers to provide feedback and raise issues.

To validate our new features, we found dogfooding to be a crucial technique. It has helped us to identify bugs without having to bother users beyond our team. Once the feature has been developed, an interesting technique to gain adoption is to promote it from other parts of your product.

Good documentation and marketing materials will attract more users to your platform. In this space we have explored self-paced training and video materials, which work very well for a globally distributed engineering organization.

Our journey is still in its infancy and we are looking to expand the number of internal offerings that apply product thinking in the future. I hope that some of these techniques help you get started in this space, and make the lives of your engineers better.

¡Buena suerte! (Good luck)
