Okay. Our next talk is "Sneaking In Network Security". Our speaker, Max, is going to tell us how to scale up defense for computer networks and, in particular, how to integrate that into existing networks. Max is an ex pen tester and now a member of the blue team. Please give him a round of applause.
Thank you. Hello. My name is Max Burkhardt, and I'm here to talk about introducing security into networks. Together with a small team, I managed to implement a network segmentation model in a sensible way. I'm a security engineer at Airbnb, so my practical experience is with that network, but I think the techniques we're going to cover today can be applied to a lot of networks. I'm going to talk a little about the technical theory and about what happened when we implemented it, and I want to give you some guidance from our experience on how to implement it yourself.
So let's talk about network security in 2018. Segmentation is still a very good idea, because we know that compromises will happen on our networks, for example whenever someone forgets to patch a server. Network security can keep that vulnerability from spreading to higher-security areas. However, I think I've never been involved in a network pentest where the whole network was properly segmented. And we know why: when networks grow quickly, small teams, especially in a startup like Airbnb, have to prioritize their work where it has the most impact, and that's usually the perimeter. So when a network grows quickly, you end up with a hard exterior and a soft interior. If an attacker finds a way in, they have a lot of freedom to move around inside the network. As network security people, we know that this is a problem.
So, to give you an idea of the scale this project dealt with at the time we implemented it:
The Airbnb network had roughly 2,300 services and roughly 20,000 nodes, where a node may be an EC2 instance or a Kubernetes pod. And we did more than 1,100 deploys per day. It's difficult to modify an architecture this large, and because of its size it's difficult to find where the problem areas are. Finally, developer productivity is a very important concern. If engineers are writing code every day and you cost them 5% or 10% of their productivity, that's very expensive.
The question was: how do we give the interior of the network appropriate security? We can't start from scratch and stop all development. We hear a lot about sneaking in, especially in the pen testing community, and we wanted to do the same thing for defensive security: put in the defenses and have nobody know we were there. In theory, we have to stop thinking about security as something bolted on around the development process. That was the way it worked 30 years ago: first you build your application, test it, and then add security. But that doesn't work anymore. So what we're talking about is a newer way of doing things, one that mixes security with development, like agile security and so on. This is not something we invented; a lot of people are working on it. But I think most of the time people think about this in terms of application security, and it's time to bring it to network security as well.
I think the most important thing is to build a security solution that works for development, one that also serves the people doing the developing, because that is something that can scale easily. We can't bring work to a halt just to keep attackers out, so we had to find a solution. And as good project managers, we laid out the requirements for all of this. Nothing we build can get in the way of engineers.
Maybe it's something they don't even notice, but they shouldn't have to think about it. It should be secure by default. That's something we've wanted for a long time, and in this case our goal is to make it difficult to configure the network insecurely. And finally, we want to build something flexible, able to support the different protocols we're going to be using. We don't know what we'll be using six months from now. It might be Linux, it might be something else, maybe Haskell on Azure; we don't know. So the implementation should be agnostic of those decisions.
So this slide is basically the solution in two sentences. We're going to use mutual TLS in a service discovery system for authentication and security, with zero configuration. We're going to dive into each of these parts, and I'll show you how they all fit together to make security invisible.
To start off, I've isolated three pillars. The first is TLS deployed across every service. TLS is one of the most important protocols in the security industry, and it gives us properties that we can use everywhere. So the first pillar is to set everything up so that we can use TLS, and to make sure it works everywhere. The second pillar is to tie access control to identity rather than to network location. In traditional network segmentation everything is keyed on IP addresses, but we want to refer to nodes by something other than where they sit on our network, and because we are using TLS we have identities to work with. I'll talk about this later. Finally, we're going to generate an authorization map that automatically determines which services are allowed to talk to which. We try to update it automatically so that legitimate connections don't break.
So this is a diagram, and we're going to look at its different parts.
It's a very simple diagram of a network with three nodes. They use authentication to prove who they are, and they use authorization logic driven by a sort of centralized map, which shows what nodes are on the network and decides which of them may talk to which.
Let's jump into the first pillar, which is the TLS deployment. Before I start, though, it's important to cover some basic concepts. We're using mutual TLS. You've certainly heard of traditional TLS, where a client verifies the identity of a server: you get the certificate and make sure it matches the domain name, and so on. But TLS is more general than that, and it can verify in both directions. The client can also have a certificate, so that the server can check who is connecting with equally strong cryptography. This is pretty hard to deploy on the public web, which is why it's rarely used there, but inside your own network it works really well. This is great because it means you get strong authentication, and we know how to deal with these sorts of systems. Not only can the client make sure it knows which server it's talking to, but the server can see who's talking to it and respond in whatever way is appropriate.
The next term is service discovery. This is a term I hadn't heard much in companies before joining Airbnb, where we use a lot of cloud services. Basically, it means that you have a node on the network and it has to find other nodes on the network to communicate with. A basic version is implemented with DNS: for those of you who don't know, when you want to do a Google search, you go to google.es and DNS finds a server that can serve you Google.
And as we know, a network has to be very flexible, and things move around a lot, so service discovery has to be very flexible too. It basically tries to be a map of the network: it says, look, this service is here, that other service is there. But that can also have an effect on security. Airbnb uses a framework called SmartStack, and that's what was in place when we started this project, so that's what we built our architecture on. What I'm going to describe applies to what we did, but I'm sure you can use it in other setups too. SmartStack is something Airbnb open-sourced a few years ago. Basically it uses ZooKeeper and HAProxy so that services can communicate with each other. If you look at the example above me, you have service A and service B, and both communicate with a ZooKeeper cluster. Service A wants to talk to service B, so it finds the location of service B through HAProxy: service A sends its request to localhost, and HAProxy finds a node that can answer it. The important thing is that this system was not designed for security. You just ask for a list of nodes, you get a list of nodes, and you try one. But I'm going to show you how to add security to this system.
Before our security modifications, when service A wants to talk to service B, it sends a request to HAProxy, which forwards it over the network to a node running service B. There's not much security here. What we added is a security shim: another proxy, a reverse proxy, which sits on the node that receives the requests, in front of service B, configured so that the two proxies communicate only over mutual TLS. So now all the packets crossing the network are inside a TLS tunnel.
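To make the shim idea concrete, here is a minimal Python sketch of the two TLS configurations involved. It is only an illustration under stated assumptions, not the configuration from the talk: the function names and the optional file-path parameters are hypothetical, and the talk's actual proxies are HAProxy and a reverse proxy rather than Python code. The one point it shows is the difference between ordinary TLS and mutual TLS: the receiving side also demands and verifies a certificate from the caller.

```python
import ssl
from typing import Optional


def make_shim_server_context(cert_file: Optional[str] = None,
                             key_file: Optional[str] = None,
                             ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Context for the receiving-side shim (hypothetical helper).

    CERT_REQUIRED on a server context is what makes the TLS mutual:
    the handshake fails unless the client presents a certificate that
    chains to a trusted CA.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        # The shim's own certificate and private key.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        # The internal CA that signs node certificates (assumed to exist).
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


def make_shim_client_context(cert_file: Optional[str] = None,
                             key_file: Optional[str] = None,
                             ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Context for the calling-side proxy (hypothetical helper).

    PROTOCOL_TLS_CLIENT already verifies the server certificate and
    hostname; loading a certificate chain lets the caller prove its
    own identity as well.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

In a real deployment both contexts would be given certificate, key, and CA paths, and the contexts would wrap the sockets between the two proxies; the services behind them would keep speaking plaintext to localhost.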
But the important thing is that service A and service B haven't been modified at all. We radically changed the security model without touching any of the engineers' code. This is where the invisibility comes in.
There are several benefits. We now have proxies on both sides that speak TLS, and they can do authentication and verification on the tunnel. Security could build these controls once and distribute them across the network. The same proxies work regardless of what languages or protocols the services use. So instead of writing authentication and authorization code in every service and having to verify each implementation, we did it just once. The other thing is that having these proxies on both sides is really helpful for non-security reasons, for things like a better ability to do load testing. That got us the support of other infrastructure teams at the company, who don't have to build that on their own. So basically we're doing the opposite of what the NSA would want. Here's the slide from their presentation showing the inside of Google's cloud. So there are a lot of things that need to use TLS.
There's one important point about this approach, which is the exclusivity of the proxies. It's very important that the only way to talk to a service is through its proxy, and that you can't go around the proxy, because otherwise you could evade the authentication. So we have to make sure that bypassing it is impossible. I'll talk later about how we solved this problem, but you have to think about it if you're trying to implement something like this: inserting a new proxy in front of an existing service without changing the service's code.
The first point is that what we want to do is segmentation.
We have to make sure that only the things that are allowed to communicate can communicate. So we have to build a sense of identity that allows us to do this verification. We're going to talk about how to put that into the certificates and how we decide how this is going to function.
So, segmentation. We're trying to make sure that a node can only talk to the things it is supposed to talk to. If a node needs to talk to a service for business reasons, it should be able to. But we need to make sure that only the nodes that have to do it can do it. Traditionally you do this at a level where groups of nodes can talk to each other and then maybe reach others over certain predefined channels. But in a microservice network, or a network with a lot of dynamic communication, we have to think about this at a service level and not at a node level. For example, we say that we want the payment configuration service to communicate with the service's back end. So we can start to express these decisions at the service level instead of the static node level. That is, we have a lot of services, and each service has a set of other services it is allowed to reach.
We have these proxies that understand TLS, and TLS is great at conveying identities. So we can start to move the network into a state in which each service knows only certain identities and can only communicate with those specific identities. This is something that can change a little depending on your network's topology. You have to find a concept of identity that meets certain attributes. The identity you choose has to be different for each service, because otherwise you cannot distinguish between the services.
It has to be an identity that the node cannot change by itself, because otherwise an attacker could modify the node's identity and move to other parts of the network. It should be automatic, so that nobody has to configure it manually and there is no need for an Excel spreadsheet to find the certificate for a specific node. And finally, since we decided to use TLS certificates, we wanted something that could be embedded in the certificate. If you have a configuration management system or a cloud permission system, you usually already have something like a role that identifies your instances. That worked really well for us, because the roles follow the different services, you already manage them yourself, and they can be represented in certificates as strings.
So what we need now is to give each node an identity, and a certificate that lets the node prove that identity in the TLS communication, so that we can then build a map of which identities can access which services. And we're going to distribute that map: for the payment service, say, you can talk to these identities but not to other nodes. So only some nodes can access some things.
So how do we build this map? This is the third pillar, the final part of this section. We have to work out who can do what and distribute it. It's a matter of trust: how do you figure out what needs to talk to what with the least possible human effort? We wanted to lower the human cost. If you have people spending the whole day writing firewall configurations, it's going to be very expensive and difficult to keep safe. And we wanted to get away from a hand-maintained configuration list where a lot of security engineers decide who can do what with whom. So we go to the existing code. There you can see how the network works and how things are configured, and use that to infer how communication is supposed to work.
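The identity-plus-map check described above can be sketched in a few lines of Python. This is a hedged illustration, not the logic from the talk: the talk only says that identities are strings carried in certificates, so reading the identity from the certificate's commonName is an assumption, and the function names and the shape of `auth_map` are hypothetical. The certificate dictionary follows the format that Python's `ssl.SSLSocket.getpeercert()` returns.

```python
from typing import Dict, Iterable, Optional


def identity_from_peercert(cert: dict) -> Optional[str]:
    """Return the identity string from a verified peer certificate.

    `cert` has the shape returned by ssl.SSLSocket.getpeercert().
    Using the subject commonName as the identity is an assumption
    made for this sketch.
    """
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def is_allowed(auth_map: Dict[str, Iterable[str]],
               service: str,
               client_identity: Optional[str]) -> bool:
    """Deny by default: a caller is allowed only if its identity is
    listed for the target service in the authorization map."""
    if client_identity is None:
        return False
    return client_identity in auth_map.get(service, ())


# Hypothetical example data: which identities may call which service.
example_map = {"payments": ["checkout"], "monitoring": ["checkout", "search"]}
example_cert = {"subject": ((("commonName", "checkout"),),)}
caller = identity_from_peercert(example_cert)
print(is_allowed(example_map, "payments", caller))
```

Because the identity comes from a certificate the TLS layer has already verified, the node cannot change it on its own, which is the property the talk asks for.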
This is getting to an interesting point, because the decisions you make here depend on how you think about your organization. We decided that if you can commit code, you can make these changes. This may depend on how your organization is set up. I'll talk about that in a moment, but for us, we realized that we have a Chef repository that distributes information to all the nodes running in our network. And it already stated, in a parsable way, which services each service depends on. So service one has a dependency on our monitoring, and so on. And this is a repository with strong controls: you have to be an engineer, with peer review, to be able to land code there. So if it says service one depends on something, we treat that connection as authorized for production.
To do that we built a service called Arachne, like the Arachne of Greek myth. Basically it continuously polls our Chef repository to find out what connections have been defined by people who can be trusted, and builds the reverse map that defines which services may communicate with which services. That map is then sent to S3, which I'll talk about later, and from there it reaches the services for this kind of authorization.
How you generate this map depends on how you think about insider threats in your company. In our case we made a conscious decision to trust our engineers, but it depends on how you see this problem and how much control you want. The system is quite flexible: you can have something that discovers almost everything it can automatically and publishes a new authorization map.
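The "reverse map" step can be illustrated with a short Python sketch. It is not Arachne itself, which the talk describes as a Ruby script reading a Chef repository; the input format and the function name here are assumptions. The idea is only the inversion: engineers declare what each service depends on, and the generator turns that into a list, per target service, of which callers are allowed in.

```python
from collections import defaultdict
from typing import Dict, List


def build_authorization_map(dependencies: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert 'caller -> services it depends on' into
    'service -> callers allowed to reach it'.

    `dependencies` stands in for the declarations parsed from the
    configuration repository (hypothetical format).
    """
    allowed = defaultdict(set)
    for caller, targets in dependencies.items():
        for target in targets:
            allowed[target].add(caller)
    # Sorted lists make the published map stable and easy to diff.
    return {target: sorted(callers) for target, callers in allowed.items()}


declared = {
    "service_one": ["monitoring", "payments"],
    "service_two": ["monitoring"],
}
print(build_authorization_map(declared))
# {'monitoring': ['service_one', 'service_two'], 'payments': ['service_one']}
```

A service that nobody declares as a dependency gets no entry at all, so with a deny-by-default check on the receiving side nothing can call it until someone lands a reviewed change that declares the dependency.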
If you want more controls than this, for example if you want someone to confirm each new connection, you can add that while still taking away a lot of the boilerplate work.
We can actually go further than just telling the service discovery proxies to allow these identities and ban those others. We can lean on protocol support: the reverse proxy we use can inject information about the client into the HTTP stream. Almost all of our services use HTTP, so this covers almost everything we need. Whenever a request comes in over TLS, the service can read this header to learn who the client is, which makes it trivial to configure permission levels within the network. This would have been very complicated to implement before we had this system, because we would have had to configure all kinds of passwords, codes and secrets. This lets us reduce all the complicated cryptography to the TLS protocol so we can focus on other things.
So these are the three pillars of our solution. We implement TLS everywhere to get the security properties we need. Then we give everything an identity so that services can communicate with one another, and enforce segmentation with lists of allowed connections. And then we generate this map by reading the Chef configuration.
But I'm not here just to sell this solution because I like it. There are disadvantages, and I'll be honest about them before you consider implementing something like this. These are the things we thought about and decided to accept. First, you need to synchronize this map, these permission lists, constantly. Instead of having a centralized configuration of permissions, like in a firewall, you are distributing it across many nodes, so there is a lot of network communication to keep this configuration in sync. You can use a bit of caching to make things easier, but you can still
add overhead to the configuration. Second, if TLS has a problem, like Heartbleed for example, you have a lot more trouble than you're used to, because you are relying on it for security at the center of your system. On the other hand, if there is a security problem in TLS, the security team will be working many nights anyway, so even though we would have more systems to reconfigure and repair, we would be repairing everything else too. Also, adding more reverse proxies into the traffic path is quite complicated. It introduced a lot of interesting behavior in the services, and it's interesting that adding TLS broke almost nothing, while the additional hop in the network had surprising effects. In addition, you have to be able to run software everywhere that communicates, because you need a place where the permission lists can be downloaded. With servers that's usually very easy, but with IoT devices or services from other vendors it's difficult. Finally, you need a way to revoke certificates, because if one of these services is attacked and compromised, you have to be able to remove all the permissions it has. That's a bit difficult. There are ways to solve the problem, but you have to consider it.
So, rollout and implementation. I hope to describe this in a practical way. This is not purely theoretical; it's something we have implemented, and I hope I can pass on what I learned during the process. To start with the technical details: we built this mainly with open source. For the proxy we use Envoy. It's an open source project that is growing a lot, and with good reason: it's very good, it's modern, it's fast, it has extensive TLS functionality, and in general it helps us a lot. One problem we had with Envoy is that it is very strict about HTTP/1.1, and that caused trouble for some applications that were not very careful in their use of
this protocol. As I said before, we gave each entity an identity because that was natural for us; it reflected the way we already think about security and permissions. To control which services can talk to which, we have a Ruby script that runs all the time; that's Arachne. The authorization maps are files stored in S3, where the nodes fetch them. It takes about four minutes for a map to propagate across the service network, meaning there is a delay of at least four minutes between a change being made and it taking effect, and a change that adds a new dependency has to be deployed to production anyway.
We also had specific availability considerations. We wanted to make sure that if Arachne went down and we couldn't generate those maps, traffic would keep working. We didn't want to stop all the traffic that was flowing. Since we decentralized all the authorization, we wanted to get the benefits of decentralization, so if Arachne had a technical problem, the only effect would be that changes stopped being refreshed. Everything still works with the last map: nodes can't change the topology, but existing calls still work. This was a choice we made early on. And when interesting new things happen in Arachne, security can notice them before other people do.
The plan was to roll this out in six steps. First, build the authorization map, verify it and make sure it works, before touching any services. Then issue identity certificates, which is a very small change that can be applied easily, just putting a certificate on the nodes, and it's easy to verify that it works before starting the next step. Once we've checked that and it's all good, we install the reverse proxy; we install proxies everywhere
and then we start checking how traffic would be handled. No traffic is flowing through them yet, because nothing routes to them, and this lets us configure and test this step before continuing. Next we start sending traffic over TLS, with configuration that lets us turn it on or off per service. We picked a set of services that seemed representative of the different types, a wide variety of things that we thought might stress our system, and checked whether it worked. The next step was the radical one: we turned everything on at once. That's not usually the way you want to do things, but we chose it for a good reason. We had two people working on this project, and thousands of other engineers building other services as quickly as possible, so we knew that if we went little by little we would never finish; we had to do it all at the same time. The last step was to lock the services down to localhost so that we actually get the security. This ordering was the hardest from a security point of view, because you have to wait until step 6 to see the benefits. But it meant we could take a step back if something didn't work: if we noticed a service doing something unexpected, we could roll back a step and fix it.
To visualize this: we started with the nodes, built the authorization map, added the certificates, installed the reverse proxies with the authorization logic, turned on TLS for a few things to be sure that it worked, and then on launch day turned on TLS for everything else. We did this in April of this year, and several things worked well. We went from 15% internal TLS to 70% in just one night, which was very good, and it wouldn't have worked any other way. We made sure there were several benefits, some to do with security and some not,
and this gave us more support from the organization for other changes. These benefits mattered to engineers who were afraid of losing their time, so we had to make sure there was something in it for them. We also made configuration easier, because identity was handled for them and they didn't have to think about setting up a TLS connection themselves. Performance numbers are good, and I'll talk about that later. And there were several metrics that showed things working and improving. The other thing we did was make sure we had the right configuration switches: we rolled out TLS per service, so when we found a service with a problem we didn't have to roll the whole thing back everywhere; we could work on just a few services.
And since I'm here to tell you the truth, there are several things that didn't work. As I said before, putting a proxy in front of everything sounds good on paper, but it has problems in practice. Across 20,000 to 25,000 services almost all worked, but some of them started to misbehave. Some things change, even small things like the case of headers, and that can cause weird behavior in some applications. It also caused problems with things like WebSockets, which didn't work at first. There was a quick fix for how the servers had to be set up, and after that it worked pretty well. We also hadn't realized there was weird behavior in some services, and those behaviors caused problems that had nothing to do with security.
Our testing process was good at finding out what would happen when a given service started using TLS. What we didn't test well was what happens when all the services that depend on your box start using TLS at once. There we had some problems. In particular, the HAProxy we had was an older version that didn't handle TLS well: it would load the same certificates for every connection, which behaved a little strangely, and that's something we could have
tested better beforehand. The last thing I wanted to mention is binding the services to localhost. It took longer than we thought. We had created templates showing how services bind to localhost, but it took a couple of weeks more than we expected because it was harder than we expected. I wish we had planned more time for it.
I promised to talk about performance, because someone always asks. When you talk about TLS, people ask whether it isn't slow. Luckily I can confirm that things sometimes even get faster, which I didn't expect. A number of our services improved by as much as 80%. What happened is that we had a bunch of services that had already implemented TLS, because we wanted them to be very secure, but the TLS was handled in the application, and that costs time every time a connection has to be set up again. So it helped to take the TLS handshake out of all those connections. When a box comes up, the connection between the proxies can last weeks or months, so those caches are very hot, to the point that the cache hit rate is very close to 100%, and you only pay the cost of the encryption itself, which is done in hardware. That was a huge advantage. TLS didn't slow things down.
Now, implementing this in your own infrastructure. I imagine some of you have networks that aren't as segmented as you'd like, and this is a nice way to implement segmentation broadly and easily. I have a few questions that can help you decide whether it's a good idea for you. First, how can you distribute these proxies across your infrastructure? We had the advantage of a configuration management system that could install software and configure it, so this was quite natural for us. Next, how can you assign identities to nodes? This is important
because identities are very specific to your environment, and if you already have an identity system, then it's quite easy to turn that into certificates. Otherwise it's much more work. Will you need to configure these access control lists manually, or can you generate them automatically? If you can generate them automatically, that's where you get the efficiency gains, so if you're in that situation, that's what you want to push for. If you don't have this, there are solutions on the market that can handle it: Istio and Consul are two systems that are promoted as solutions for these kinds of problems and already implement this security model. To be clear, this is not something we built on them; Istio and Consul implement it for you in a way that comes nicely packaged. They don't do as much on the configuration side, but what they do, they do quite well. And of course you can implement it on top of your own service discovery system, as we did, if you don't use those systems.
So, to summarize: I'm here to tell you that you can move to an authenticated network, and the reason is that you can make the changes invisibly and keep things fast by automating the authorization maps. An engineer developing a microservice does basically the same thing after the rollout as before: they make their changes and declare which other service they want to talk to. But now, if an attacker manages to compromise one of those systems, then where before they could reach any service on the network, now they can't get beyond what that system is allowed to reach. I think this is very good: where it used to be hard to bring security to the network, now it's possible.
Thank you very much for listening. If you want to stay in touch or ask me questions outside of the Q&A session, you can reach me on Twitter or by email, or if you just want to see more, go to airbnb.io. Thank you
very much. [Applause]
Thank you, Max. If you have a question, please line up at the microphones, and try to limit your question to a single sentence. If you'd like to leave at this point, please do so as quietly as possible. Signal angel, the first question from the internet, please.
Hello. Why do you use OpenSSL and not another implementation?
I think I mentioned it only as an example. Use the stack that works best for you, but switching to something like LibreSSL is probably a good idea.
Number two, your question.
Hi, great talk. What are you doing to mitigate the risk of SSRF on the host?
Well, that's something I worry about, as someone who works on application security at a company where almost everything speaks HTTP. It's a delicate thing. We look very carefully at the code that makes requests out to the internet to avoid problems, but my job is to try to make that better.
Number one.
Very interesting idea. Basically it includes [unclear].
It's a smart idea, but right now we don't have the pieces working together, so there are no proxies that work with this. If you manage to get it working, though, it could work well. We can't make it an identity, at least not for us, but if you have a network that supports it, it could work.
Did you audit the proxy code before putting it in front of all the services?
Yes.
Number two.
What are the cost implications of this for your operations?
The costs are pretty low, because the reverse proxy is very efficient. We didn't have to do anything beyond looking at the configuration, so it's pretty cheap. Generating the map is very cheap because it's just written in Ruby. The most expensive part is transferring the authorization maps. I think that's something we can reduce by looking at how often we fetch, and if the topology isn't changing much we can lower the frequency at which we sync. But in general the costs are
pretty low.
In terms of certificates, are you limiting the certificate lifetime, or what kind of control do you have there? We want to get to a point where certificates rotate more quickly. There are communities that are doing a very good job of making them last a couple of weeks or a couple of days. For now, our approach is that we do things like putting certificates on a deny list, which takes effect in about four minutes, so we can handle active compromises. But we are improving our infrastructure so that we can rotate more quickly. Thanks.
Your question. Since you do all of this on a flat layer 3 network: do you handle payment information, and how is your authentication affected if these systems are connected to other systems on the network and not separated by firewalls and similar things? Our network is interesting. The payment systems are separated, so the rest of the network doesn't affect them. But this is an effective approach, and it's [unclear], so we don't have to do this specifically for that.
Can you elaborate on how people reacted to the changes you described in your talk? Thanks for asking this; it's something I like to talk about. I think security teams have to sell their solutions well. When you're planning something like this, it's crucial to make sure there's something in it that's important beyond the security goal. For us, a lot of those things were about developer ease and productivity: for example, reducing friction for developers, or improving performance, as I talked about before. These are things that other teams like, so they were open to our proposal. So it was important to have a good plan, to do the homework, to do tests, and to think about the infrastructure, not only about security. Security is the end goal, but we have to make sure that we're not burning our credibility with the rest of the organization. So I thought a lot about setting aside the benefits of
security and how I was going to sell it. Thank you.
Microphone one, your question. Do all nodes [unclear], and what technology do you use to render the Envoy configuration? Almost everything is JSON, [unclear]. It puts the list of identities into the Envoy configuration, Envoy picks that up as an automatic live configuration update, and the changes come down in about two seconds.
Have you considered using a push mechanism to send clients just the relevant metadata, instead of giving them the information for the entire map of the network? Yes, you can easily segment what information you are providing. It's a matter of development time. We have two other methods of discovery, but that would be much easier. In particular, if it's like any other IAM role, you can provide IAM-specific data. So that wouldn't be very difficult to implement.
Thank you for all the answers. The talk is over. (Applause from the audience.)
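The last exchange, about giving each client only the relevant part of the map instead of the whole thing, can be illustrated with a small sketch. This is not Airbnb's implementation (the talk mentions a Ruby generator and Envoy configuration); the service names, the `DEPENDENCIES` table, and both function names are hypothetical, chosen only to show how "which services I call" declarations could be inverted into per-service caller lists and then sliced per service.

```python
# Hypothetical declarations: each service lists the services it wants to call.
DEPENDENCIES = {
    "web": ["search", "payments"],
    "search": ["listings"],
    "payments": [],
    "listings": [],
}


def build_authorization_map(dependencies):
    """Invert 'caller -> callees' into 'callee -> sorted list of allowed callers'."""
    allowed = {}
    for caller, callees in dependencies.items():
        # Every declared service gets an entry, even if nobody calls it.
        allowed.setdefault(caller, set())
        for callee in callees:
            allowed.setdefault(callee, set()).add(caller)
    return {service: sorted(callers) for service, callers in allowed.items()}


def slice_for(service, authorization_map):
    """Return only the part of the map that one service needs to enforce."""
    return {service: authorization_map.get(service, [])}


if __name__ == "__main__":
    full_map = build_authorization_map(DEPENDENCIES)
    print(full_map)
    print(slice_for("listings", full_map))
```

In this sketch a service that nobody declares as a dependency ends up with an empty caller list, so a proxy enforcing the slice would reject every inbound caller by default.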