Okay. Our next talk is Sneaking in Network Security. Our speaker Max is going to tell us how to scale up defense for computer networks and in particular how to integrate that into existing networks. Max here is a former pen tester and now a blue team member. Please welcome him with a huge round of applause. Thank you. Hi, everyone. My name is Max Burkhart. I'm here to tell you today about sneaking in network security: how I and a small team of security engineers managed to implement a strong network segmentation model in an already running, large-scale network. I'm a security engineer at Airbnb, and so the practical experience of this project occurred in that network. However, I think that the techniques we'll go over today will apply to many other networks, and so I'll spend some time talking about the technical theory behind this approach, as well as what happened when we rolled it out, in an attempt to give you some good evidence and experience to run this in your own environment. So let's talk about network security in 2018. Segmentation continues to be a really good idea because we all know that compromises are going to happen. Those boxes are going to be popped, whether it's a zero day or something less fancy like somebody forgot to patch a server. And network segmentation gives you the controls to keep those compromises contained, to make sure that low security systems can't pivot into higher security zones, and to help your incident response teams keep incidents localized. However, if you've ever been involved in network pen testing, you'll know that a well segmented network is a rare thing to see. And I think we know why this happens. As networks grow quickly, small security teams, especially ones at something like a startup, as Airbnb was, find themselves having to prioritize their work where it is the most impactful. And that usually ends up being the perimeter, the internet-facing hosts. 
And so as a network grows quickly, you end up with a large network that has this sort of hard shell, soft center architecture, where the external perimeter may be hardened, but once attackers are able to compromise that, they may have relatively free rein inside the rest of it. And this obviously isn't something that we want. Ask any blue team member and they'll know that this is a bad place to be. But change is hard, especially with a pretty large network. So to give you an idea of the scale that this project dealt with: earlier this year, when we were implementing this, Airbnb's production network had about 2,500 services and about 20,000 nodes, where I define a node to be something that's sort of like a host, whether it's an instance running in EC2 or a Kubernetes pod. And over 1,000 engineers were doing hundreds of production deploys per day. So things are moving really fast, and it's hard to go in and build in large architectural changes, like adding segmentation. Furthermore, because of this highly service-ified architecture, there was a lot of complex interconnectivity between these things, so determining where the zones should be was difficult in itself. Finally, developer productivity is a really big concern for us, and especially to my managers and their managers. If you have over 1,000 engineers writing code every day and you slow them all down by 5% or 10%, that's actually a really expensive thing to do, and it's not something that's going to fly. So the question became: how do we go from a soft-center network to something that has good segmentation and the security properties we want, when we're not allowed to stop development and we can't start over? We've got to be able to build the security in as the network is running. We hear a lot, especially in the pen testing community, about trying to be like a ninja: get into the network, do stuff without anyone noticing. 
I'll argue that it's also just as important in defensive security. We need to be defensive security ninjas and be able to sneak in, put in the defenses, and have nobody know we were there. So what's the theory that we're going to be applying in this approach? We need to stop thinking about security as this layer around development, as another step in the waterfall model. This is maybe how we were thinking about it 30 years ago: you'd build an application, then you'd do security testing, and then you'd ship it to production. But that just hasn't really held up anymore. There have been a lot of smart people talking about the new way to do things: agile security, DevSecOps, SecDevOps if you can't decide. This whole concept is really about unifying security operations and software engineering so that you're building a secure thing all the way through. And this certainly isn't something that we invented; many people have been working on it. But I've found that most of the time people think about this concept in terms of application development, and I think it's time that we integrate it with network security as well. I think the important thing here is scale, right? We need to build a security solution that scales with development. There's this saying that it's good to hire lazy engineers and developers, because they're going to build things that scale up and don't require a lot of manual work. That's even more important for security engineers. You're never going to outwork the attackers, and so you need to build something that's going to scale along with your engineering group. So, like good project managers, we're going to lay out the requirements for this solution before we jump into how it actually works. Whatever we build needs to stay out of the way of engineers. It may be something they're aware of, but the farther we can keep it out of their scope, the better. 
So they can just keep writing applications that make the company money or accomplish your organization's goals, and their stuff ends up being secure. Security by default is of course something that we have been chasing for a long time. But I think that we can go further than that and say that beyond being secure by default, it should actually be hard to have an insecure configuration with this system. So we'll try to design things in that manner. And finally, we want to build something that is as flexible as possible to whatever sort of network or protocols you are using. You don't really ever know what's going to be coming six months down the road. When this was being worked on, Airbnb was mostly a Linux-on-Amazon shop. But I don't know what's going to happen in the next six months. We might acquire a Haskell-on-Azure company and try to integrate that, or we'll start going to on-prem data centers. I have no idea what's going to be in the future. So we want to build a solution that's going to be as agnostic as possible to those sorts of decisions. My next slide is basically the whole solution, condensed into two sentences. We're going to use mutual TLS, built into the service discovery system, for authentication and confidentiality across all service communications. And we're going to discover those access lists totally automatically, giving us security with zero to almost-zero configuration. This is a lot of jargon on a single slide, so I don't expect you to visualize it all yet. We'll dive into each of these parts, and I'll show you how they fuse together to build a system that is invisible and secure. So to start off, I've isolated three pillars of this approach. The first is TLS in service discovery. We love TLS. It's one of the really powerful protocols that the security industry has managed to build, and it gives us great security properties if we can use it everywhere. So the first pillar is: get everything to be using TLS. 
And by building it into service discovery, we make sure that it runs everywhere without a lot of per-app configuration. Pillar two is binding identity to nodes. In a more traditional network segmentation model, you might define subnets or restrict things by IP address. We're going to be a little more flexible with how we refer to individual nodes in this network, because we're using TLS as an authenticator and can therefore define our own concept of identity. I'll get into that soon. Finally, we're going to generate an authorization map. By automatically determining what services need to talk to what, and figuring out how data flows through this network, we can update ACLs automatically to stay out of engineers' way while still ensuring that the connections between services are trusted and can be verified. So this is a diagram that we'll be diving into individual pieces of, but basically it's a very simplified view of a network. We've got three nodes. Those nodes each have a certificate defining who they are, and they can use those certificates to communicate with each other through TLS tunnels. They have authorization logic that runs on them, fed by this centralized map of which nodes in the network should talk to each other. Let's jump into the first pillar here, which is the implementation of TLS. Specifically, we're looking at these tunnels. Before I start, though, it's important to cover some basic concepts just to get us all on the same page. We're using mutual TLS here. You've certainly heard of traditional TLS; that's what your web browser uses all the time, where you have a client that is verifying the identity of a server. Normally it will get the cert, make sure that the subject alternative name or the CN matches the domain name, and if so, tell you it's verified. But TLS is really awesome, and it actually supports, out of the box, verification in both directions. 
So you can have the client also present a certificate in that initial handshake, and the server can check who is talking to it using an equally strong authenticator. This is pretty hard to deploy on the public web, because users can't really manage certificates, but in your own production network this works really well, because you can distribute certs to everyone. So this is really great, because it means we can have two-way strong authentication with key material that security engineers understand. We know how to deal with these sorts of systems. So not only can we make sure that clients of services know they're talking to a legitimate service, but that service can look at who's talking to it and make sure that it's a caller that seems appropriate. So that's mutual TLS. Service discovery. This was a term that I hadn't heard a lot before I started working at companies that used a lot of cloud environments and SOA. But at its core, service discovery is this concept that you have some node in a network, and it needs to find other nodes to provide services to it. If you think about it, DNS is a very old, basic service discovery system. You want to perform a Google search, so you go to www.google.com, and DNS finds you a server that can provide Google services to you. These systems have gotten a lot more complex and varied as people move to environments where hosts are very flexible and stuff moves around a lot, and they're pretty ubiquitous in modern service-oriented architectures. And service discovery can actually be kind of problematic for security if you do it wrong, because fundamentally it's trying to be a map of the network and be really helpful: hey, find the service here, find the service here. But I'll argue that we can actually use it to great effect in achieving security. 
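The two-way verification described above can be sketched with the Python standard library. This is a minimal illustration, not the talk's actual implementation (which lives in the discovery proxies, not app code), and the certificate file names are hypothetical.

```python
import ssl

# Minimal sketch of the two sides of a mutual-TLS handshake.
# In ordinary TLS only the client verifies the server; the single
# verify_mode line below is what makes the server demand and verify
# a client certificate as well.

def mtls_server_context(ca_bundle=None, cert_chain=None):
    """Server side: present our own cert AND require one from the client."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    if cert_chain:
        ctx.load_cert_chain(*cert_chain)      # e.g. ("server.pem", "server.key")
    if ca_bundle:
        ctx.load_verify_locations(ca_bundle)  # internal CA that signs node certs
    ctx.verify_mode = ssl.CERT_REQUIRED       # this turns TLS into mutual TLS
    return ctx

def mtls_client_context(ca_bundle=None, cert_chain=None):
    """Client side: verify the server and present our own certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)  # verifies the server by default
    if cert_chain:
        ctx.load_cert_chain(*cert_chain)      # sent to the server in the handshake
    if ca_bundle:
        ctx.load_verify_locations(ca_bundle)
    return ctx
```

In a real deployment you would load the node's cert, key, and the internal CA bundle before serving; the point here is just how little separates one-way from two-way TLS.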
So Airbnb uses a framework called SmartStack, and that was what was there when we started this project; we built this security extension on top of the SmartStack framework. So that's what I'll be talking about, but I believe these concepts can be applied to most service discovery systems. As a brief aside into how SmartStack works: this is an open-source system that Airbnb created and open-sourced a few years ago. The basic idea is that it uses two other publicly available projects, ZooKeeper and HAProxy, in order to make it easy for services to talk to each other. If you look at this example above, Node2 is hosting a service, ServiceB, and so ServiceB is going to report into a ZooKeeper cluster: hello, I'm a ServiceB instance and you can find me at Node2. Node1 wants to talk to ServiceB, and so it will load the relevant addresses for ServiceB from ZooKeeper and put them into its local HAProxy instance. HAProxy is a reverse proxy that runs on the node, so a service that wants to call ServiceB just sends a request to localhost and leaves it to HAProxy to find a suitable host to fulfill that request. An important thing to note here is that this system was not designed for security. Anything can write into ZooKeeper; it is about the most impersonation-prone thing possible, because you just ask for a list of nodes and you get them, and it's not really authenticated. But I'll show you in the next few slides how we can build security into this system. So the old way that we connected to services, before any security upgrades, was that ServiceA wants to talk to ServiceB, it sends a request to its local outbound proxy, and that sends it along. It makes an HTTP request to localhost, that gets sent through the reverse proxy, and it goes across the network to ServiceB. Not a lot of security going on here. What we added is a secure shim. 
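Before looking at the shim, the baseline discovery flow just described can be sketched: the client side of SmartStack takes the node list a service reported into ZooKeeper and renders it into a local HAProxy backend, so the app only ever talks to localhost. The config shape and names below are simplified illustrations, not SmartStack's real output.

```python
# Toy sketch of the SmartStack client side: turn discovered
# (host, port) pairs for a service into an HAProxy backend stanza.

def render_haproxy_backend(service, nodes):
    """nodes: list of (host, port) tuples discovered from ZooKeeper."""
    lines = [f"backend {service}"]
    for i, (host, port) in enumerate(nodes):
        # 'check' asks HAProxy to health-check each discovered server
        lines.append(f"    server {service}_{i} {host}:{port} check")
    return "\n".join(lines)

cfg = render_haproxy_backend("service_b", [("10.0.1.12", 8080), ("10.0.2.7", 8080)])
print(cfg)
```

When the node list in ZooKeeper changes, the local config is simply re-rendered and HAProxy reloaded, which is why the app itself never needs to know where ServiceB lives.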
So we added a new reverse proxy that runs on the receiving node in front of ServiceB, and we reconfigured the proxies to communicate with each other over mutual TLS. So now all of the traffic that's going over the network is in a TLS tunnel. But crucially, ServiceA and ServiceB did not change at all. ServiceA is still sending HTTP traffic, ServiceB is still receiving HTTP traffic. So we were able to pretty radically change the security model of this cross-host communication without touching a single line of an engineer's code. This is where we're getting our invisibility from. There are some other really big benefits to this. Because these two service discovery proxies are doing the TLS setup, and they are the things that can do authentication and verification of this TLS tunnel, security was able to build these controls once and distribute them across basically the entire fleet. The same proxies can run no matter what language the service is written in or what sort of protocol that service uses. And so instead of having to verify authentication and authorization code in dozens of different frameworks and languages, we were able to do it just about once. The other thing that ended up being really helpful is that having these proxies on either side of your service communications is actually really useful for non-security reasons: things like consistent metrics, better tracing, and a better ability to do load testing. We got all those for free by adding in these proxies, and thus we got the support of other infrastructure teams at the company who maybe didn't have direct security goals, but wanted to help us do this. So basically, what we've done with this whole proxy thing is sort of the opposite of what the NSA wants. You may remember this slide from a leaked NSA presentation, where they were discovering with glee that inside Google's cloud network at the time there was a lot of plaintext HTTP going on, and SSL was added and removed at the edge. 
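The invisibility property described above — swapping the transport underneath an unchanged application — can be shown in miniature. Here the "network" is simulated with plain function calls; all names are illustrative, and the only thing that changes between the old and new worlds is which transport the two proxies agree on.

```python
# The app always speaks plain HTTP to its local outbound proxy;
# it has no idea what the proxies do with the bytes afterward.

def app_request(transport, payload):
    return transport(f"GET /api payload={payload}")

def plaintext_transport(msg):
    # old world: the proxies ship the bytes across the network in the clear
    return f"[http] {msg}"

def mtls_transport(msg):
    # new world: the outbound proxy wraps the same bytes in a mutual-TLS
    # tunnel to the inbound proxy, which unwraps them for the service
    return f"[tls] {msg}"

# The caller's code is identical in both worlds:
old = app_request(plaintext_transport, "x")
new = app_request(mtls_transport, "x")
print(old)
print(new)
```

That is the whole trick: the upgrade lives entirely in the proxy layer, so it rolls out fleet-wide once rather than per framework or language.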
We are just adding SSL and keeping it there. All of the arrows on the right need to be TLS in the modern age. One important caveat about this particular approach is this concept of proxy exclusivity: we are relying on this inbound proxy to provide the security benefits of TLS, confidentiality and authenticity, and thus it is crucial that going through the inbound proxy is the only way to talk to a given service. If a service is reachable by going around the inbound proxy, you would still be able to talk plaintext HTTP to it and possibly evade authentication mechanisms. And so it's important that this is impossible. I'll talk a little bit about how we solved this particular issue; it's just something that's important to be thinking about if you're going to implement this approach. So that's TLS. By implementing a new proxy into an existing service discovery framework, we can switch all the traffic over to TLS without radically changing the code of the services running. Next up, though: what we really wanted out of all of this is segmentation, right? We want to make sure that only legitimate things can connect to a given service, and so we need to build a sense of identity that can be used to do this verification. So in this next pillar, I'm going to be talking about how we put those certificates there and, more importantly, how we decide what each certificate is going to say. So, segmentation. We're trying to make sure that a node in the network can only talk to the things that it should be allowed to talk to. If a node needs to talk to the payments back-end service, it's going to do that for business reasons. But we can make sure that only nodes that have to talk to a given service can. A lot of previous thought about segmentation tends to happen at the subnet level. 
You make a zone of hosts, things in that zone can talk to each other, and then maybe they can get out to other zones via certain predefined channels. But in a microservice network, or anything that has a lot of dynamic communication going on, it may make more sense to think about this at a service level as opposed to a host level. So we'll say things like: we want the payment config page service to be able to talk to the payment back-end service. That seems like a reasonable thing to do. But in our network, we've also got a Slack bot running that makes memes for engineers, and that thing should definitely not be able to talk to the payment back-end service. So we can start representing these sorts of decisions: instead of static tiers of hosts, we have a bunch of services, and each service keeps a list of identities that it's going to allow to connect to it. And we just did all this work to build up these proxies on either side of a service communication that understand and use TLS. And TLS is fantastic at verifying identities. So we can now start to build the segmentation by saying, for a given service listener, here are the identities which are allowed to connect to it. And thus you can end up in a state where only the right things can talk to a given service, based on business need. We do have to identify all the nodes in our network, though, and this is something that's going to vary a bit depending on how your network is set up. You need to find a concept of identity that fulfills a few key attributes. First, the identity you choose for a node needs to be pretty varied: if you have one identity for everything, you're back at the soft-center network again, because you won't be able to do any distinguishing. Second, you need an identity that a node can't change about itself. 
Otherwise an attacker would be able to compromise a particular host, change its identity, and then move into zones of the network it shouldn't be allowed into. It should also be something that you can detect automatically, so that you can automate the distribution of these certificates. If you end up having to go through an Excel spreadsheet, figure out what each host is, and then hand out certs yourself, it's not really going to work. And finally, we do need to represent this concept of identity in a TLS certificate; in our case, we wanted something that could fit into a subject alternative name. Most modern networks have some concept of a role that works pretty well for this. When you have a config management system or a cloud permission system, you almost always are giving things identities based on their function, and this tends to work well. So in our network we used Amazon IAM roles, which is a sort of designation given to an instance that gives it some level of permissions in AWS. And this worked really well, because most services already had them, they can't be changed unless you have very high-level administrative permissions in AWS, and a role can be represented as a string, so it fits well in the certificate. So to look at what we're going to do here: we need to give everything an identity, and we need to make certificates that allow nodes to prove their identity in these TLS communications. We can then build this map of what identities should be allowed to access what services. This is what is going to give us our segmentation, because we're going to be able to distribute that map, saying: for the payments back-end service, you allow the following identities and no others. And thus you can get to a place where only a very select set of nodes in your network can access the sensitive stuff. But how do we make that map? That's pillar three, which is the final segment of this diagram. 
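The per-service check described above might look like the sketch below: after the mutual-TLS handshake succeeds, the inbound proxy pulls the node's identity out of the verified certificate's subject alternative names and compares it against that service's allow list. The SAN layout mirrors what Python's `ssl.getpeercert()` returns; encoding the IAM role as an ARN in a URI SAN is an assumption for illustration (the talk only says the role is a string that fits in the SAN).

```python
# Sketch of identity extraction + allow-list enforcement.
# san_entries: (type, value) pairs as returned in a parsed certificate.

def peer_identity(san_entries, prefix="arn:aws:iam::"):
    """Return the first SAN URI that encodes an IAM role, else None."""
    for kind, value in san_entries:
        if kind == "URI" and value.startswith(prefix):
            return value
    return None

def authorized(san_entries, allow_list):
    """The segmentation decision: is this verified identity on the list?"""
    ident = peer_identity(san_entries)
    return ident is not None and ident in allow_list

san = [("DNS", "node42.internal"),
       ("URI", "arn:aws:iam::123456789012:role/payments-frontend")]
allow = {"arn:aws:iam::123456789012:role/payments-frontend"}
print(authorized(san, allow))
```

The key property is that the identity comes out of a certificate the node cannot forge, not out of anything the caller claims about itself.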
So how do we figure out what needs to talk to what, and distribute that? A big question here is really all about trust. How do you figure out what needs to talk to what, and do it with a minimum of human-involved computation? A lot of what I was talking about at the very beginning of this presentation was the human cost of segmentation. If you have people who are spending all day trying to make firewall configurations, that's going to be rather expensive, difficult to keep safe, et cetera. We want to get away from the configurable-list style of security engineering, where you hire a ton of security engineers to try to figure out what is supposed to talk to what. So we wondered: could we just infer this from existing code? Can we look at how the network currently works and at how our configurations are defined, and use that to build a sense of how communication should happen? This is getting to an interesting point, because the decisions you make here really depend on how you think about threats at your organization. We decided that if you are somebody who can merge peer-reviewed, CI-passed code into our config management system, that means you're reasonably authorized to make changes. This is something that may vary based on your organization's setup, and I'll dig more into those questions in a bit. But in our case, we realized we have this Chef repo. Chef is a config management system that can distribute information to all of the nodes running in our network, and the repo already stated, in a nice machine-parsable way, what the dependencies of every service were. So in this hypothetical example we have Service1. Service1 has dependencies on the production database, a cache, and a monitoring service. And this is already set up in a repository that is rather heavily controlled: you have to be an engineer, and get peer reviewed, to commit to it. 
So what we can do is take this, determine that Service1 is an authorized caller of these services, and then build that into this map: for the production DB, Service1 is authorized. To do this we built a service called Arachne. Arachne, named for the weaver of Greek myth who became the spider, is continuously computing the web of services and nodes in the Airbnb network. Basically, it's continuously pulling our Chef repository and deployed Kubernetes artifacts to figure out what connections have been defined by trusted people, and building a sort of reverse map of, for a given service, what identities should be allowed to connect. It can then push these into S3 (I'll talk about why we did that in a little bit), and then those can be sent to all of the nodes that are actually enforcing this allowance. The barriers that you're going to put in place around how this map is generated really depend on how you think about insider threats at your company. In our case, we made the conscious decision to trust our engineers and rely on things like CI checks and peer review to make sure that legitimate things are committed. But depending on how you approach this, you may want to have more controls in place, and this system is rather flexible to do that. All you need is something that can automatically discover as much as it can and then, under some conditions, publish a new authorization map to some location. So you could certainly imagine, if you wanted more controls than this, making it so that when a new connection is discovered, it prompts the security team for a quick manual review and an acknowledgement before it actually gets distributed. So this does give security a single point of control, where they can do any sort of monitoring or additional approvals if they wish, while still taking away a lot of that boilerplate work of trying to figure out what actually connects to what. 
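What Arachne computes can be shown in miniature: invert "service → things it depends on" (as declared in the config repo) into "service → identities allowed to call it". The service names below are the hypothetical ones from the example above.

```python
# Invert declared dependencies into per-service allow lists,
# the core of the "web file" computation described in the talk.

def invert_dependencies(deps):
    """deps: {caller: set of services it depends on}."""
    allow = {}
    for caller, targets in deps.items():
        for target in targets:
            allow.setdefault(target, set()).add(caller)
    return allow

deps = {
    "service1": {"production_db", "cache", "monitoring"},
    "meme_bot": {"slack_api"},
}
web = invert_dependencies(deps)
print(web["production_db"])
```

Because the input is peer-reviewed config rather than a hand-maintained firewall rule set, a new dependency merged by an engineer automatically becomes a new allow-list entry, with no security-team ticket in the loop.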
We can actually go further with authorization, instead of just telling all of these service discovery proxies to allow these identities and ban those others. Because we're using vanilla TLS, we can rely on the heavy support for these sorts of protocols in many things. The reverse proxy that we use as the inbound proxy has a feature to inject information about the client certificate into the HTTP streams that pass through it. Most of our APIs are HTTP-based, so this applies to most things, and it means that whenever a service gets a call over TLS, it can just parse this very simple header and know exactly what sort of identity is calling it, making it trivial to implement various permission levels depending on the service caller. This sort of authorization control would have been really tough to implement before this system, because you'd have had to set up maybe your own TLS scheme, or a system of tokens or keys or passwords. But this lets us leave all of the tricky crypto stuff to the security-owned components and let app developers just parse a very simple header and make decisions based on it. So those are the three pillars of this solution. We set up TLS between everything to give us the security properties in communication that we need. We give everything an identity in order to make sure that nodes can authenticate to each other, and we enforce segmentation by having specific allow lists for every service. And then we automatically discover this map by parsing configurations that are already there. But I'm not here just to sell you on this solution because I like it. There are some downsides, and to be perfectly honest, I want you to know about them before you consider implementing something like this. So these are just some of the things that we thought about and decided to accept. First, you are going to need to constantly synchronize out this map of allow lists, or some subset of those allow lists. 
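Before moving on to the downsides, the header-based authorization idea described above can be sketched as follows. Envoy's `x-forwarded-client-cert` header is one real example of this mechanism; the exact field names and value below are simplified assumptions, not the production format.

```python
# App-side half of header-based authorization: the inbound proxy has
# already verified the client cert and injected a header describing it,
# so the app just parses key=value pairs and branches on the identity.

def parse_client_cert_header(value):
    """Parse 'Key=Value;Key=Value' pairs into a dict."""
    fields = {}
    for part in value.split(";"):
        if "=" in part:
            k, v = part.split("=", 1)
            fields[k.strip()] = v.strip().strip('"')
    return fields

hdr = 'Hash=0d3f;URI=arn:aws:iam::123456789012:role/payments-frontend'
who = parse_client_cert_header(hdr)
print(who["URI"])
```

Note the trust boundary this relies on: the header is only meaningful because the proxy strips any copy a caller tried to smuggle in and sets it itself after verifying the certificate.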
Instead of having centralized allowance of network communications, like you might have with a central firewall, you're doing it in a distributed way. Every node is determining whether or not a connection is allowed, and so you have a reasonably strong need for a lot of bandwidth to synchronize this out. You can use caching, which will make some things a lot easier, and I'll talk about why we did that, but it is going to cost you some in terms of update latency. If the web changes and you need to allow a new identity for a service, that may be slower if you're using a cache. Second, if TLS has a problem, you have way more problems than you used to, because you're now relying on it as one of the core security elements of your system. This is something that we knew, but the reasoning here is basically: if Heartbleed happens again, if we find some sort of major core issue in TLS, security is already going to be working nights until we can get that patched on our front-end web servers, and if we're going to be massively deploying new OpenSSL versions as quickly as we possibly can, that's going to end up patching all of these as well. So basically we are relying on the fact that major SSL issues are going to get a quick community response and be something that we can move quickly on. Third, adding more reverse proxies to your traffic flow turns out to be kind of complicated. This introduced a lot of interesting behavior in some services, and I'll talk a little bit more about the specific things we ran into, but it's worth noting that the actual addition of TLS broke very little, while the additional hop in the network had surprising effects. Fourth, you do need to be able to run software wherever you're receiving traffic through this system, because you need to install that secure listener and something that can download the allow lists. 
If you manage all of your own infrastructure, this is relatively easy, but if you have things like vendor devices or hosted services where you cannot install arbitrary software, that gets a little harder. Where we have services in this state, we basically put proxy boxes in front of them and use those to handle the authentication. Finally, you are going to want some sort of certificate revocation, because if a node does get compromised, you'll need to kick out its permissions, and this, I'd say, is usually tricky. There are certainly ways to do it, but it's something to be thinking about and scoping as you consider a deployment like this. So, rolling it out. I've described this solution, but it's not just theoretical; this is something we did, and so I hope I can share as much as I can about what we learned throughout the process. To start with the technical details: we built this mostly out of components that are available and open source. For the inbound proxy we used Envoy, a project open-sourced out of Lyft that is really growing in popularity in the service mesh world, and for good reason: it's really designed for this sort of thing. It's modern, it's fast, it has great support for TLS, and it has a ton of metrics, which are really useful; it generally served us very well. The one thing we ran into with Envoy is that it is quite the stickler about the HTTP/1.1 standard, and that led to some funny behavior with certain other applications that were not so strict about it. But overall Envoy was a great choice, and we're actually migrating to use it on our outbound side as well. As I alluded to earlier, we gave every node an identity based on its AWS IAM role, and this was just a natural choice for us because it was already how we were thinking about permissions for nodes. Nodes got their permissions from their IAM role, and now that also controls what services they are allowed to talk to. 
The Arachne service I mentioned is basically just a continually running Ruby script that loads the Chef repo and some Kubernetes artifacts and parses them. The authorization maps, the quote-unquote web files, are uploaded to and downloaded from S3, so we're using S3 as the source this all actually gets pulled from. All in all, it takes about four minutes to fully compute the web of services and generate one of these web files, meaning that there's about a four-minute delay between a change in topology (that is, a service adding or removing a dependency) and when that gets reflected in allow lists. In our experience, this is far shorter than the time it takes to actually deploy such a change to production, so we haven't really run into race conditions where a new dependency gets used before it's allowed. We had some pretty specific availability considerations, mainly around caching the output of Arachne, these web files. We wanted to make sure that if Arachne went down and we stopped being able to generate these authorization maps, all the traffic kept working. We didn't want to be owners of a service that, if it went down, would block all traffic. And really, if you think about it, by decentralizing all of the authentication of service calls, you want to be able to rely on the benefits of decentralization. So by putting everything in S3 and letting nodes download it from there, we can make sure that if Arachne has some sort of critical problem, if it stops running, the worst thing that happens is that new topology stops being reflected. This means that traffic keeps flowing even if S3 goes down, as it famously did last year; I think that was a fun day. Things basically still work: nodes won't be able to download new topology changes, but they'll still have locally cached ones on disk, and all the traffic will keep flowing as normal. 
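That fail-open-on-stale-data behavior can be sketched in a few lines: every node keeps the last good web file on disk, so if the fetch fails it keeps enforcing the cached map instead of failing closed on all traffic. Here `fetch` is a stand-in for the S3 download; file names are illustrative.

```python
import json
import os
import tempfile

# Download the authorization map, falling back to the last good
# on-disk copy if the fetch fails (e.g. during an S3 outage).

def load_web(fetch, cache_path):
    try:
        data = fetch()
        with open(cache_path, "w") as f:
            f.write(data)              # refresh the local cache on success
    except Exception:
        with open(cache_path) as f:    # fall back to the last good copy
            data = f.read()
    return json.loads(data)

cache = os.path.join(tempfile.mkdtemp(), "web.json")
load_web(lambda: '{"production_db": ["service1"]}', cache)  # primes the cache

def s3_down():
    raise IOError("S3 outage")

web = load_web(s3_down, cache)  # still serves the cached map
print(web)
```

The trade-off, as noted above, is update latency: during an outage the map is frozen, so new topology stops being reflected, but existing traffic keeps flowing.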
So this was a choice we made early on, and it has served us very well, because when new and interesting things happen with Arachne, no one really notices; generally security is able to fix it before someone changes the topology. The plan for the rollout was basically these six steps. We started by computing this authorization map. Since this all works on offline data, we were able to spend some time writing the software to do this and getting it to work nicely before we had to actually touch any production services. So we could build that map and verify its correctness. Next, we wanted to give everything an identifying certificate. The idea of doing this first is that it's a pretty small change and something we could pretty safely roll out: we're simply dropping a certificate on a bunch of nodes, and it's also relatively easy to verify that this worked before moving on to the next step. We can check for the existence of these files in a large-scale way and make sure they look good. Third, we installed this receiving proxy everywhere, started listening, and set up the traffic routing. At this point no traffic is actually flowing through these TLS tunnels; we're simply putting the path in place. This also lets us verify the step before moving on. Next, we can start actually doing the testing and building confidence in this system. So we can start routing some traffic through these new secure listeners, and we set our configuration up in a way that we could turn it on or off per service. We picked a bunch of services that seemed representative: high-QPS ones, low-QPS ones, ones that used HTTP, ones that were plain TCP, just a great variety of things that seemed like they would stress-test the system, and we turned these on one by one and built confidence that this was going to work. Step five is sort of the radical one, and that is switching everything over at once.
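The per-service switch used in step four could look something like this. A toy sketch: the flag names and port numbers are made up, and a real system would read the flags from configuration management rather than a constant.

```ruby
# Illustrative per-service rollout flags: true routes a service's traffic
# through the local TLS listener, false (or absent) keeps it on plaintext.
TLS_ENABLED = {
  "search"         => true,
  "legacy-reports" => false,
}.freeze

# Decide which local listener outbound traffic for a service should hit.
# Unknown services default to plaintext until explicitly opted in.
def upstream_port(service, tls_port: 9443, plain_port: 9080)
  TLS_ENABLED.fetch(service, false) ? tls_port : plain_port
end
```

The key design property is the safe default plus per-service override: any one service can be flipped back without disturbing the rest of the fleet, which is what made the later "switch everything at once" step recoverable.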
This is not always how you want to run operations, but we chose it for a very good reason, which is that there were two people working on this project and there were, you know, 1,098 other engineers building services as fast as they could, and we were reasonably confident that if we tried to go one by one, we would never catch up. We had to build a system that we could switch on all at once and confidently move into a post-plaintext future. Our final step was rebinding services to localhost so that these security guarantees were enforced. We did this last, and it was sort of painful from a security perspective, because you really have to wait until step six is complete before you get the security benefits here, but we had to give ourselves the ability to roll back if things turned out to have problems. We wanted to make sure that if switching to TLS for a service caused some unintended effect, we could roll back, fix it, and then roll forward again once that was dealt with. So, to visualize this: we start with the nodes; we've always had those. We built the authorization map and made sure that was available first. We moved on to adding the certificates to everything. We installed the reverse proxies with their authorization logic. We turned on TLS for some things to make sure that it worked, and then, on plaintext deprecation day, everything went to TLS. We did this in April of this year, and there were a lot of things that went well. We went from about 15% internal TLS usage to 70% in one evening, which was really awesome and something that I don't think would have been possible with any other scheme. We made sure there were a lot of non-security benefits to this system, and that let us get wider organizational support for such a change. These sorts of massive, sweeping infrastructure changes, because they affect everything, can make other engineers nervous, especially people who are primarily concerned about uptime.
And so we wanted to make sure there was plenty in there for them too. Some of the chief benefits we provided included much easier configuration, because we were automatically assigning identities to everything and pre-configuring certificates; engineers no longer had to think about setting up a custom mTLS connection if they needed security benefits. Performance, and I will talk about that in a sec, but the numbers are good. And then there were a ton more metrics available, so people could have greater observability into their services and see what was going on, which was operationally very helpful. The other thing we did that was a really good choice was making sure we had the right configuration controls. We could disable TLS routing for individual services on a one-off basis, so that if we determined a certain service was having a problem, we didn't have to roll the whole thing back. We could keep the wins we'd gotten, roll certain services back to fix them, and then move forward again. Of course, I'm here to be honest with you: there were some hiccups during the whole process. As I mentioned earlier, running everything through an inbound proxy sounds good on paper but leads to some weird stuff in practice. Of the 2,500 services, most of them took this fine; there was just a small percentage that did weird things. There are some things that change if you're using a reverse proxy, like all of your traffic suddenly coming from localhost. Even small things like changing the case of HTTP headers, which is fully allowed by the spec, can lead to weird behavior in some applications. Reverse proxies can also mess with particularly stateful things like WebSockets. We didn't think about the WebSockets case and did not have support for that on day zero. That was a quick day-one patch to teach our reverse proxy that WebSocket connections are special and need to be handled specially.
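One concrete defense against the header-case problem: match header names case-insensitively on the application side, as the HTTP spec requires. A tiny sketch (the header hash shape is illustrative; a real app would do this inside its framework's request object):

```ruby
# HTTP header field names are case-insensitive per the spec, so an app
# behind a reverse proxy must never match them with plain string equality:
# a proxy is free to rewrite "X-Request-ID" as "x-request-id".
def fetch_header(headers, name)
  headers.each { |field, value| return value if field.casecmp?(name) }
  nil
end
```

Applications that broke behind the proxy were effectively doing `headers["X-Request-ID"]`, which returns nothing once the proxy lowercases the field name.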
So all of these things are generally surmountable, but you are going to run into some weird behavior. The thing I thought was funny about all of this is that the biggest problems we had really had nothing to do with the security properties; even if we had had a plain HTTP reverse proxy, we would have had the same problems. Our testing process, because of how we turned this on, was very good at testing the case where suddenly all your traffic starts coming in over this TLS channel. So you enable TLS for service B, and suddenly all the service B nodes get all their traffic over TLS. We tested that well. What we didn't have great testing coverage on was what happens when all of the services that your box depends on suddenly start requiring TLS, and so we ran into some interesting issues there. Most particularly, HAProxy, which we were using for the outbound proxy, was a bit of an older version. It handled TLS certificates very poorly, and so for certain roles that had thousands of dependencies, it would load the same certificate into memory over and over again for every connection it was making, and that caused some pretty crazy memory issues. That was something we could have tested a little better. The final thing to mention is that binding these services to localhost, again, took longer than expected. We expected to be able to use the service config templates that were built into our configuration management to say, okay, everything that used to bind to 0.0.0.0 is now binding to localhost. This ended up taking a few weeks longer than we expected, because there was more drift in how we did configuration than we anticipated. This is just one of those things that I wish we could have allocated a little more time to at the beginning.
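The general fix for that kind of repeated-loading bug is to memoize the certificate by path, so every outbound connection shares one in-memory copy. Sketched here with an injectable reader so the caching behavior is observable; in a real proxy the cached value would be the parsed certificate object, not a string.

```ruby
# Cache of certificate material keyed by file path. Thousands of outbound
# connections to different dependencies then share a single in-memory copy
# instead of re-reading (and re-allocating) the cert per connection.
CERT_CACHE = {}

def load_cert(path, reader: File.method(:read))
  CERT_CACHE[path] ||= reader.call(path)
end
```

With the buggy behavior described above, a role with thousands of dependencies paid the load cost per connection; with memoization it pays it once per path.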
I mentioned I'd talk about performance, because this always comes up: whenever you introduce a TLS project, someone says, but what if it's really slow? Fortunately, I can sort of confirm the security industry's assertion about this, which is that things often actually got faster. I didn't expect it, and whenever somebody says this, I at least have this sort of disbelief, like, did it really? But yes: for a number of our services, we improved 95th-percentile latency by as much as 80 percent. What was happening here is that we had a bunch of services that had hand-implemented mutual TLS for security reasons. Particularly high-sensitivity things like password services did implement mTLS because they wanted to be secure, but they were implementing it entirely at the app layer, so application to application was communicating with mutual TLS. These applications tended to restart reasonably frequently, whenever they deployed, new boxes spun up, etc., and so they were unable to take particularly good advantage of TLS session caching and session resumption, meaning they had to do the full TLS handshake all the time, making them quite slow. Service discovery proxies restart very, very infrequently; they come up when a box comes up and often last for weeks or months, and thus their TLS session caches are very well warmed. So we were able to keep a session resumption rate of near 100%, meaning we're basically just paying the AES encryption cost, which was happening in hardware and added very little. That was a really great benefit, and we were able to pretty much squash the concern that this would be too slow for our network. So, doing this in your own infrastructure: I imagine some of you may be involved with networks that are not as segmented as you'd like, and I think this provides a good approach to implementing segmentation at a large scale in a way that is actually shippable. There are some questions you
should ask yourself when thinking about this that might help you assess whether or not it is a good solution for you. First, how easily can you distribute these proxies into your service communications? We had a lot of advantages: we had a configuration management system that could deploy software and configuration, and we already had these outbound proxies in place because of the service discovery system we used, so this came pretty naturally for us, but it's something to think about in your own environment. Second, how would you assign identities? This is really important, because an identity is a zone, a segment, in our network, and so if you have a highly specific way to refer to things that you can turn into TLS certificates, this may work really well for you; if you don't, you may need to do some work to get there. In our case, IAM roles were what we went with, but at the beginning of the project not every instance had an IAM role, and we had to do a little legwork at the start to get that enforced across our entire infrastructure. Third, will you need to manually configure these access control lists, or will you be able to automatically generate them? If you can automatically generate them, that's where you're going to get these huge efficiency wins, so that's something you really want to push for if you can. The other good news is that there are some available options on the market right now that can help do this for you. We hand-implemented the whole thing, but Istio and Consul, which are both solutions being pushed as sort of the new way to do service mesh, especially in Kubernetes, implement this sort of security system already. So, to be clear, this is not something we totally invented; it's an idea that's been going around for a while, and Istio and Consul implement it for you in an easily packageable way. They do less on the automatic generation side, but you could easily build this sort of
system using these tools. But if you don't want to make such a huge leap and switch to a whole new service mesh system, you can certainly implement the security benefits here with your existing service discovery stack, as we did. So, to sum up: I'm here to tell you that you can switch to a deeply authenticated network, and the reason you can do that is because you can make the changes here invisible, and you can make the system fast, thanks to these generated authorization maps and the automatic TLS. The engineer working on a microservice before the system and after the system has basically the exact same experience: they still make the same HTTP calls they always did, they still add a new dependency, get that change approved and merged to master, and then their service talks to it, no problem. Their flow remains the same as it always has been, but now, when an attacker compromises that Slack meme bot with some sort of, you know, meme injection or whatever it is, they find themselves in a network zone where they can talk to basically nothing, and all of the services that were wide open to them beforehand simply reject their connections out of hand whenever they try to go past a layer-four connection. So this is something that I believe is possible; we've done it now, and I think it's a great strategy as you try to build in the security you weren't able to when your network first started. Thank you very much for listening. If you want to stay connected or ask me more questions about the details, something that's not as easy to do in the Q&A section, you can hit me up at maxb on Twitter, max.brickard.everingb.com, or if you just want to see what we're up to, Airbnb engineering at airbnb.io. Thank you very much. Thank you, Max. If you do have a question, please line up at the microphones, and try to limit your question to a single sentence. If you'd like to leave at this point, please do that as quietly as possible. Signal Angel, your
first question from the internet, please. Hello, a question from the internet: why OpenSSL and not LibreSSL? So, I guess I said OpenSSL just as sort of a random example; you can use whatever SSL stack works best for you. I believe the way our packages were being built used OpenSSL, but switching to something like BoringSSL or LibreSSL would probably be a good idea for further hardening. Thank you. Mic number two, your question. Hi, great talk. What are you currently doing to mitigate the increased risk of localhost-bound SSRF? Yeah, SSRF is something that is deeply troubling to me as somebody working on appsec in a company that works almost exclusively over HTTP API calls. Our approach, honestly, is very dedicated static analysis: we are watching engineer-written code very vigilantly for anything that might make HTTP calls outbound, and trying to ensure it doesn't hit internal stuff. That's an area my team is trying to do a lot of work to improve, and perhaps that could be a future talk. Cool, thanks. Mic number one, please. Very interesting idea; are you going to include workstations too? Our workstations, that's an interesting thought. At the moment we don't have our workstations plugged into the same service discovery, so they don't have the sort of core proxies that could work for this, but I think if that's an architecture you wanted to go to, this would actually lend itself pretty well, because if you're managing your workstations, you could hand them identities just as well. You'd need to think of a slightly different approach to identities, because you can't give a physical machine an AWS IAM role, at least for us, but if that's something your network has, then I think it's a very reasonable way to go. Thank you. Signal Angel, your next question. Did you audit the proxy code before placing it in front of all your services? Yes, we took a close look at it. All right, mic number two. What are the cost implications of the implementation and on your operations in
general? Costs are pretty low, because the reverse proxy is pretty efficient; it doesn't use a ton of extra compute, so we didn't have to scale anything up in order to support this. Verification, again, being able to just do AES, is pretty cheap. The generation of the map is very cheap; it's running on a single Kubernetes pod, just running a Ruby script, so that's fine. Probably the greatest cost is simply the S3 transfer of that authorization map, and that's something we think we're going to be able to keep reducing by being a little smarter about how often we check: certain areas of the network evolve very infrequently and don't have a lot of topology changes, so we'd be able to sync those a lot less often. That's something I think we can improve on, but overall the cost is pretty low. All right, Signal Angel, more questions from the net? Okay, then mic number two, please. In terms of the certificate authority, how are you managing the lifetime of the certificates, and what considerations did you make on that side, like certificate expiration and renewal, and also OCSP, whether it's already implemented or not? Yeah, so we want to get to a point where certificates expire a lot faster. There are some companies that have done a really great job with certs that, you know, only last about a month, or even two weeks, something like that, and unfortunately I think our infrastructure isn't at a point where we can reliably reduce it that much. Our current approach is that we can ban things by basically introducing them into a deny list at the allow-list generation stage, and that will result in something being banned within about four minutes. That's how we deal with active compromises, but there's a longer-running effort to increase our infrastructure refresh rates so that we can have really short-lived certificates to deal with those sorts of stolen-cert attacks. Thanks. Your question: since you're
all of it on a flat layer-3 network, and you already mentioned payment information, what does this mean for your PCI DSS scope, and how does it affect certification if you handle payment data and the systems are connected to other systems in your network and not separated by firewalls or something? So, our PCI network is a little interesting: it actually is a totally separate thing from most of the Airbnb production network, so that specific certification didn't affect us, but I think we've also had a pretty effective time convincing auditors that this is an effective way to do access control, even though it's happening at a layer that is not as standard. For PCI DSS specifically, the cardholder environment actually is just a web page that syncs to Braintree, so we don't have to deal with that one specifically, but this has been received pretty favorably by our compliance folks. Thank you. Signal Angel: could you elaborate on how you got management and application engineers to buy in to the changes described in your talk, what objections they raised, and how you addressed them? Thank you for asking; this is something I like talking about. I think a lot of security is actually being a good salesman for your solutions. Whenever you are presenting something like this that has such a wide scope, it's crucial to make sure there's something in it for the stakeholders beyond just security's goals, and a lot of those things for us were around developer ease and productivity: reducing the pain engineers were feeling in trying to set up their own TLS implementations or their own authentication stacks, plus better performance benefits, like I discussed. These were all things that other infrastructure and product teams heard about and wanted, and so they were very open to our original request. From there on, it was all about being a good steward of an operation, you know, having really good operational plans, showing that we'd done
our homework in terms of testing, and really thinking like an infrastructure engineer or an SRE instead of just a security engineer. You know, security is our ultimate goal, but we need to make sure we are not burning our credibility with the rest of the organization when going for it, so there was a lot of time spent thinking, you know, forgetting about all the security benefits right now: how am I going to make sure this isn't going to take everything down? Thank you. Mic number one, your question. Do all nodes have the whole web files, and what technology stack do you use to apply them to Envoy? So yeah, everything gets a web file. The technology stack is JSON: basically there's a very small shim that downloads this file from S3 and then puts the relevant list of allowed identities into an Envoy configuration file, and then Envoy uses its sort of automatically updating SDS configuration to load that every few seconds. That's how that synchronization works. Okay, thanks. Mic number two, last question. Have you considered using a pub/sub push of just the relevant metadata, based on the X.509 identity of the clients, so that you're not also giving them all the information about the entire map for the entire network? Yeah, so you can rather easily segment what information you're providing, and it's really just a matter of engineering time. At the moment we have pretty wide availability of that through other service discovery mechanisms, so it wasn't a priority for us, but it would be relatively easy to customize our web file availability. In particular, since everything is an IAM role and everything has its own IAM role, you can simply make IAM-role-specific web files and set up the permissions to allow just those roles to access them, so that actually wouldn't be that hard to implement. All right, thank you. Thank you for answering all the questions. Please give a round of applause for his patience.
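The download shim described in that answer, combined with the cache-fallback behavior covered earlier in the talk (keep serving the on-disk web file when S3 is unreachable), might look roughly like this. The cache path, JSON shape, and `fetch` callable are all illustrative assumptions; a real shim would do an S3 GET and write into Envoy's SDS config.

```ruby
require "json"
require "tmpdir"

# Illustrative cache location; the real path would be configuration-managed.
CACHE_PATH = File.join(Dir.tmpdir, "webfile_cache.json")

# Return the current allow list: prefer a fresh download, refresh the local
# cache on success, and fall back to the cached copy if the fetch fails.
def current_allow_list(fetch:)
  body = begin
    fresh = fetch.call
    File.write(CACHE_PATH, fresh)   # success: refresh the on-disk cache
    fresh
  rescue StandardError
    File.read(CACHE_PATH)           # S3 unreachable: serve cached topology
  end
  JSON.parse(body)["allowed_identities"]
end
```

This is the availability property from the talk in miniature: if the fetch fails, the worst case is that new topology stops being reflected, while existing traffic keeps flowing off the cached allow list.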