“What is the easiest way to securely connect tens of thousands of computers, hosted at multiple cloud service providers in dozens of locations around the globe?” If you want our answer, it’s Nebula, but I recommend that you read the rest of this short post before clicking that shiny link.
At Slack, we asked ourselves this very question a few years ago. We tried a number of approaches to this problem, but each came with trade-offs in performance, security, features, or ease of use. We will gladly share those experiences in future presentations and writing, but for now, just know that we did not set out to write software to solve this problem. Slack is in the business of connecting people, not computers.
What is Nebula?
Nebula is a scalable overlay networking tool with a focus on performance, simplicity, and security. It lets you seamlessly connect computers anywhere in the world. Nebula is portable, and runs on Linux, OSX, and Windows. (Also: keep this quiet, but we have an early prototype running on iOS.)
It is important to note that Nebula incorporates a number of existing concepts like encryption, security groups, certificates, and tunneling, and each of those individual pieces existed before Nebula in various forms. What makes Nebula different from existing offerings is that it brings all of these ideas together, resulting in a whole that is greater than the sum of its parts.
Today Nebula runs on every server at Slack, providing a global overlay network that helps us operate our service. While this is the first time most people have heard of Nebula, it has been in use at Slack for over two years!
How Nebula came to be
A few years ago, Slack was using IPsec to provide encrypted connectivity between regions. This approach worked well in the beginning, but managing it quickly became an operational burden as our network grew. It also came with a small but measurable performance impact, because every packet destined for another region had to be routed through an IPsec tunnel host, adding a hop in the network route. We searched for an IPsec replacement, and even tried a few possible solutions, but none of them met our needs.
More importantly, as our software stack and service grew in complexity, network segmentation became increasingly difficult. One of our core problems was related to segmentation when crossing various network boundaries. Most cloud providers offer some form of user-defined network host grouping, often called “security groups”, which allow you to filter network traffic based on group membership, as opposed to individually by IP address or range. Unfortunately, as of this writing, security groups are siloed to each individual region of a hosting provider. Additionally, there is no interoperable version of security groups between different hosting providers. This means that as you expand to multiple regions or providers, your only useful option becomes network segmentation by IP address or IP network range, which becomes complex to manage.
Given our requirements, and the lack of off-the-shelf options that could meet our encryption, segmentation, and operational requirements, we decided to create our own solution.
What are our goals?
- Enable encrypted connections between hosts. This is probably the least interesting but most important single goal. Encryption of traffic, both locally and across the Internet, is table stakes.
- Be service provider agnostic. We want our solution to work on any computer we may use, be it a cloud-based host, a server in a datacenter, an individual laptop, or a cluster of machines sitting in a basement closet. This requirement almost certainly limits us to a software-based solution.
- Allow for high-level traffic filtering. The solution should enable individual nodes on the network to allow or deny traffic based on the identity of a connecting host, not just its IP address. You should not be required to think about the IP a box may have, especially when dealing with ephemeral hosts.
- Provide strong identity. Hosts should identify themselves via certificates, issued by a certificate authority, that encode user-defined attributes (datacenter, role, environment, etc.) when connecting to peers.
- Be fast. There cannot be a performance penalty that greatly increases latency or reduces available bandwidth between hosts.
- Enable testing. The system should allow changes to be tested first in isolation, the way modern teams release software. Most current network management requires all-or-nothing changes to an entire set of hosts, which makes it scary as hell to change filtering rules. By pushing rule evaluation down to individual hosts, you can test changes to filtering rules the same way you test new software releases before distributing them to 100% of your hosts.
- Give everyone a pony. Just kidding, but we are asking for quite a lot, so I guess we’ll call this a stretch goal.
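To make the filtering and identity goals a little more concrete, here is a minimal sketch in Go of what "allow or deny based on identity, not IP address" can look like when each host evaluates its own rules. This is an illustration only, not Nebula's actual code or configuration format: the HostIdentity and Rule types, the Allowed function, and the group names are all hypothetical, and the sketch assumes the peer's groups have already been verified against a certificate signed by a trusted certificate authority.

```go
package main

import "fmt"

// HostIdentity is a hypothetical stand-in for the user-defined attributes
// a certificate authority might sign into a host's certificate.
type HostIdentity struct {
	Name   string
	Groups []string
}

// Rule is a hypothetical inbound rule: allow traffic to Port from any peer
// that holds every group listed in Groups. No IP addresses are involved.
type Rule struct {
	Port   int
	Groups []string
}

// hasAllGroups reports whether the peer holds every required group.
// An empty required list matches any peer.
func hasAllGroups(peer HostIdentity, required []string) bool {
	held := make(map[string]bool, len(peer.Groups))
	for _, g := range peer.Groups {
		held[g] = true
	}
	for _, g := range required {
		if !held[g] {
			return false
		}
	}
	return true
}

// Allowed evaluates the local rule set for one inbound connection.
// Traffic is denied unless some rule explicitly matches.
func Allowed(rules []Rule, peer HostIdentity, port int) bool {
	for _, r := range rules {
		if r.Port == port && hasAllGroups(peer, r.Groups) {
			return true
		}
	}
	return false
}

func main() {
	// Rules held by a single (hypothetical) database host.
	rules := []Rule{{Port: 5432, Groups: []string{"web", "production"}}}

	web := HostIdentity{Name: "web-1", Groups: []string{"web", "production"}}
	dev := HostIdentity{Name: "dev-1", Groups: []string{"web", "development"}}

	fmt.Println(Allowed(rules, web, 5432)) // true: holds both required groups
	fmt.Println(Allowed(rules, dev, 5432)) // false: missing "production"
	fmt.Println(Allowed(rules, web, 22))   // false: no rule covers port 22
}
```

Because the rule set lives on the individual host in this sketch, a changed rule could be tried on one machine first and rolled out gradually, which is the kind of testing the goals above ask for.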
Let’s write some software
We did a LOT of experimentation when creating Nebula, and probably discarded more code than exists in the final product. This experimentation was valuable because it allowed us to challenge our assumptions and come to more informed conclusions. Not all projects have the luxury of time, but in this case it was an enormous advantage, because we had the chance to try so many things.
The very first thing we did was research modern best-of-breed encryption strategies. In our research, we learned about the Noise Protocol Framework, created by Trevor Perrin, co-author of the Signal Protocol, which is the basis of Signal Messenger. We decided early in the project that Noise would become our basis for key exchange and symmetric encryption. Most importantly, we did not roll our own crypto.
As we looked at Software Defined Networking (SDN) and mesh networking software, Tinc, a project some of us have used personally, came up. Some of the strategies Tinc uses to establish tunnels between hard-to-reach nodes informed our design goals for Nebula.
Sharing Nebula with everyone
Instead of trying to cover the diverse set of technical design decisions behind Nebula today, this post is purposefully high-level. We are ready to share Nebula publicly, so others can kick the tires and let us know what they think, and future posts will dig into the nuts and bolts.
We have shared Nebula with a small community of engineers prior to this release, and received positive feedback on the simplicity and power of the system. Some of them are using it to connect systems in their organization and have provided extremely useful feedback. Nebula is useful for connecting thousands of computers, but equally useful for connecting two or three.
Nebula has undergone a paid security vulnerability assessment, along with numerous internal security reviews. We are adding Nebula to our official bug bounty program, where we welcome submissions related to security bugs found in our software. (Note: while we may look at suggestions related to best practices, unless they constitute a vulnerability, these will likely not qualify for a bounty payment.)
At Slack, we appreciate that we could not have built our service without open source software, and we hope this small contribution to open source can help others by providing software they need so they can focus on building software they want.
We hope you enjoy trying Nebula, and if you’re interested in helping us solve engineering problems large and small, check out our job listings.
Nebula was created by Nate Brown and Ryan Huber, with contributions from Oliver Fross, Alan Lam, Wade Simmons, and Lining Wang.