My name is Navin Joy and I'm a Neutron hacker. I contribute code to the networking-vpp project. For those of you who do not know what that is, it is a Neutron ML2 mechanism driver project that works with the VPP switching platform. I've been working on that for several months, close to a year now. I've worked on security groups, role-based access control, and the VXLAN implementation over there. So today I'll be presenting on securing OpenStack networking: some strategies and, based on my experience, what I've learned.
So here's the agenda. We will review some basic security strategies, and we will look at the various services that interact with Neutron and its components and try to understand those, because understanding those interactions will really help us secure them. Then network security domains: we will examine a typical deployment architecture for Neutron and understand the various segmentation mechanisms and the security zones. Then we will look at some considerations for designing networks; what RBAC, role-based access control, is and how it is implemented in Neutron; then security groups and some spoofing mitigation strategies. Then the role of quotas: as a cloud or network administrator, how can you leverage them? Finally, we'll go through a checklist of the items we have seen to summarize, and conclude our talk.
So when you review security strategies, the overarching one is defense in depth. That is going to be our primary strategy for securing the network. The idea is that you use a series of defensive mechanisms and protect your network in such a way that if a single mechanism were to fail, you still have other mechanisms available to thwart an attack. So you protect each and every service component, communication channel, and the API access.
You design your network security in layers, not just looking at the perimeter alone. The perimeter is one of the best defenses, but it is not the only one, because there are so many attackers with such a wide variety of attack methods that no single mechanism can successfully protect all the components of your network.
Leverage role-based access control and domains. We will look at what role-based access control is, and at Keystone domains, for those of you in the networking area who are not quite familiar with the Keystone domain concept. We will restrict access to project resources based on roles, and we will look at how to do that in Neutron.
Then we will look at strategies to harden Neutron service components, which starts with patching any reported vulnerabilities. One place where you can look at all the reported vulnerabilities is at this URL. Proper service configuration, such as service owner, file permissions, et cetera, is also critical. If you don't really have to run your agent or distributed service as root, think about it. I see that happening all the time: people just start up an agent as root, even though root privilege is not really necessary.
Then evaluate the security impact of using a certain tenant network type. In Neutron, as you know, you have the concept of a tenant network type that you as an administrator can allocate to tenants. You have various options: flat, VLAN, VXLAN, or GRE. But if you do flat, which I've seen for very simple deployments, you have just one broadcast domain, and all your tenant virtual machines or projects are in that broadcast domain. So you have to be aware that there could be ARP spoofing going on in that one large broadcast domain, and you'll need a strategy for mitigating that.
Next, limiting the resources available to projects.
Especially when you're looking at self-service networks, your users now have the ability to create, update, and destroy, basically CRUD, your network resources. So if you don't enforce quotas as a cloud administrator, you may be surprised by a sudden, huge uptick in usage, and that could cause outages in your network. So ensure that quotas are set to limit network resources; we will look at the capabilities Neutron provides for doing that.
Finally, isolating network access. There are two ways to isolate. One is basic network segmentation into the various networks or VLANs: management, API, external, and data. Besides that, you'll have to understand the security domains concept to control interaction with the networking services. Also use Neutron security groups and anti-spoofing filters to protect project instances. So this is the high-level strategy; we will go more in depth in the rest of this presentation.
This diagram shows the various components and the communication channels within Neutron. At the center you have the Neutron server and its APIs. Then you have the clients talking to the Neutron server, which are typically Horizon, the CLI, and the API. Then you have the ML2 plugin in Neutron. That plugin interacts with the various distributed agents in your network, typically through RabbitMQ or another AMQP mechanism. You have the L3 agent, responsible for routing and floating IPs; this will typically be running on your network node, and it is also attached to your AMQP service. Then you have a DHCP agent that is responsible for DHCP, which you will also typically run on your network node. For redundancy, you can run multiple copies of the DHCP agent. Then you have the ML2 plugin agent that is responsible for actually working with the mechanism itself, the switch. The way it works is that the ML2 plugin sends a message via RPC through the AMQP layer.
The ML2 plugin agent on the compute node will then pick up that message and start binding the port for Nova. So that is one flow. If you look at the interaction with other services, the Neutron server talks via REST to Keystone and Nova.
From a security standpoint, a domain is a way of applying the same security controls to all the components within it. In this case you have two domains, the management domain and the external domain. You identify all the components that communicate within and between them, and then restrict communication to those components only. For instance, there is no real need for a tenant to connect to the management domain. I'll talk about the network architecture in the next slide, where I will show you the exact network topology, the various network segments, and how to classify those into domains.
One more key piece here is that we use the etcd key-value store for messaging from the plugin to the distributed agents on the compute nodes. All of these components have to be secured. The basic idea is that if the service offers it, ensure that the components connect via SSL to the queue, to the etcd layer, and so on.
This is a standard architecture for any Neutron deployment. It includes a cloud controller host, a network node, optionally an SDN node if you're using SDN in your network, a number of compute nodes, and a Horizon dashboard node. You can see that all of these nodes are connected to a management network, which forms the management security domain. The primary purpose of this management network is to enable the controller-to-network-to-compute communication and all the flows that we saw in the previous slide. So when Neutron, or rather the ML2 plugin driver, makes a call to notify the plugin agent on the compute node, that traffic goes through the management network.
Similarly, when Nova makes a call to Neutron to bind a port for an instance on a certain network, that call goes through this management network. So it is important to identify all the nodes that are in this management network and isolate that communication to one security level, which means that only these components should be allowed to communicate, plus maybe a few authorized segments in your network that are responsible for management communication, such as a network administrator or cloud administrator segment.
Then you have the data network, which is the tenant security domain. The topology of the data network shown here will vary depending on the tenant network type that you have chosen. If you choose a flat network, it's going to be just one broadcast domain. If you choose VLAN, it's going to be a VLAN-segmented network. The basic idea is that all the tenant networks are connected to this data network and the tenant instances are attached to it. When those instances communicate with the external network, that traffic goes from the compute node, across the data network to the network node, gets routed by the qrouter namespace, and is then NATed out to the external network. There is also VXLAN, which is basically an overlay network. This data network has to be isolated for tenants to communicate, and you have to have the necessary security mechanisms in place so that those networks don't talk into the management network.
Then you have the external network, which is used for floating IPs, external internet access, and so on, and which is directly accessible from the internet.
So this is essentially in the public security domain. And you have, of course, the API network, which is for making API calls; you will most likely want to expose it to the internet as well. So typically both of those are in the same security domain. By creating these security domains and analyzing your network, you are able to enforce controls appropriately and restrict who talks to the devices and nodes in each security domain.
So what are some of the networking considerations? One is to evaluate your network design for providing self-service tenant networks, because users now have the ability to create, update, and destroy virtual network resources, and if you're a cloud architect or an operator it's important that you evaluate the various design use cases in giving users that ability. For instance, you have to consider the pros and cons of supporting the various tenant network types. As I discussed earlier, a flat network is one single layer 2 network without any segmentation and may not be suitable for any production deployment. If you're using VLANs, you are limited to 4,094 networks, and the vSwitch on each compute node is attached to a VLAN trunk port. So you'll have to look at the number of VLANs that you want to support and allocate to tenant networks, and plan your quota allocation appropriately. Otherwise you may run out of VLANs, causing denial-of-service issues for your tenants.
So for all practical purposes, if you are planning to implement self-service tenant networks, your options would primarily be VXLAN, with GRE being another option. The idea is that you take a layer 2 packet, encapsulate it inside layer 3, and deliver it to the target node, where it is decapsulated and sent to the local bridge domain. It's very stable because the layer 2 packets are encapsulated.
When two machines communicate across two compute nodes, you are encapsulating the packets in IP and using IP routing to deliver them. However, it is difficult to monitor and support. Theoretically it is almost infinitely scalable, with support for around 16 million segments, but you can't really get anywhere close to that; your switches will run out of IDs for allocation.
Another thing to consider is the maturity and security features of the pluggable virtual network services that you plan to use. Neutron offers the ML2 plugin architecture, which is a blessing, but at the same time it could be a problem if you don't plan ahead and think about all the ML2 drivers and the services that you plan to use: are they really stable, are they mature, are they secure, and things like that. So you'll have to carefully consider the various pluggable components that you plan to use.
If possible, leverage layer 3 routing and NAT for tenant VMs, which provides per-tenant routing and floating IPs. Essentially the layer 3 router provides NAT, network address translation, hiding tenant private IPs behind the router. You can also allocate floating IPs to tenants and then use security groups to control the inbound traffic to those tenant instances. This is a much more secure way of providing tenant networks.
Next, as some of you may know, there is a Tap-as-a-Service Neutron extension available. I've played with it a bit. It allows you to mirror traffic from several ports into one port, and that mirror session is also capable of spanning compute and network nodes. So if you are a network admin or a cloud admin, this is a very nice tool for debugging problems. And if you are a security admin, you may want to leverage these APIs to gain visibility into network traffic for monitoring and security analytics.
Messaging layer: we spoke briefly about this.
By messaging layer I mean RabbitMQ, and the etcd key-value store if you use that for your messaging, which we do in our project. The message queue should only accept incoming connections from the management network. There is no real requirement to keep it wide open and allow tenants to connect to it. Encrypt all client connections to the messaging layer using SSL/TLS. Basically you configure your Neutron services, which includes the ML2 driver, the distributed L3 agents, the ML2 plugin agent, and so on, to talk to the messaging layer over an encrypted channel. Both RabbitMQ and etcd support that; with etcd, for example, you can speak HTTPS. It's not very complex to implement. If you decide to use self-service certificate generation, RabbitMQ and etcd can themselves generate certificates for your clients, so it's much easier to roll out certificates and secure your client communications that way. You don't really have to run an independent TLS deployment in that case, which would be a much more complex endeavor. You also have to secure peer-to-peer communication within the cluster: if you have multiple RabbitMQ nodes or multiple etcd nodes, you want to secure that traffic as well.
There are additional layers of security you can implement. Password authentication is supported by both RabbitMQ and etcd, and so is client certificate authentication. If you already have a certificate authority in place, you can distribute certificates to your clients and have those certificates authenticated, which gives you very high security but is more complex to implement.
Next is a basic concept: Keystone domains. I wanted to put it in here because network and cloud admins who are focused on the networking area may not know some of the capabilities of Keystone domains. The domain is a new resource offered in the Keystone v3 API. What the domain does is set up an administrative boundary.
We know that a project owns its networking, compute, and storage resources. Projects and users can now be grouped into a container called a domain. So you can think of a domain as a grouping of projects and a project as a grouping of your resources; it's a higher-level container. The advantage is this: earlier, without domains, when you created an admin user, he was given complete privileges. He was considered a cloud admin, which is not what you intend in a real use case. You just want the admin that you create to be a project admin, or an admin for a collection of projects. Domains make that happen. With domains, an admin user becomes a domain administrator who manages the collection of projects that he's responsible for, without being a cloud admin. This ties into the principle of least privilege in security, which means that a person shouldn't have any more privilege than he really needs to get his job done. That is a foundational principle; prior to domains, it was as if you had just one administrator for all the projects.
Networking also allows you to secure API access using RBAC. What is RBAC, for the beginners? RBAC is basically defining roles and granting access to individuals in an organization based on those roles. A role does not need a password to be set; it's just an entity that you create in Keystone, and all it does is define your access rights. Neutron has a policy engine, and that policy engine has a configuration file called policy.json. Within this policy file, you specify, based on the role, what API access a particular user has. So that's the basic idea.
I have another slide that goes into the details of this, but these role-based access control policy definitions are necessary to maintain your network security, availability, and overall cloud security, tying back into the principle of least privilege that we discussed earlier.
How does RBAC work? Keystone allows you to create these role entities. Creating one is just a one-line operation in Keystone: create a role, no password, and it creates the entity. Once you have created it, you can assign the role to user-project or group-project pairs, as the basic case. Keystone allows other assignments as well, which I'm not discussing here. Neutron uses this role information from Keystone to authorize user requests by reading entries from the policy.json file that we discussed earlier. It is a file that specifies policy-to-rule mappings. The idea is that when a user makes an API request, it triggers a policy. For example, when somebody sends a POST request to /v2.0/routers, it is interpreted as a create router request, right? So you have a create_router policy, and you specify a rule for that action, as we'll see in the next slide. The API request is only permitted if the rule permits that particular action.
So here is an example of a Neutron RBAC policy. We have an API action, or policy, called delete_subnet, and we have defined a role called network_admin; you can also refer to a rule called admin_or_owner. What it means is that if the API user's authenticated token has the role network_admin, or if the request is permitted by the rule admin_or_owner, he's allowed to delete that subnet. For creating a router, you're saying that the API user has to be authenticated to Keystone and should have the role admin.
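To make the two policy entries just described a little more concrete, here is a minimal Python sketch of how such a role check could be evaluated. It is only an illustration under assumptions: it does not use Neutron's real policy engine or the policy.json rule syntax, and the helper names (`has_role`, `any_of`, `is_allowed`) and the shape of the credentials dictionary are hypothetical.

```python
# Toy model of the two policy entries described in the talk.
# Assumption: credentials carry a list of roles and a project ID;
# this is NOT Neutron's actual policy engine.

def admin_or_owner(creds, target):
    """Stand-in for the admin_or_owner rule: admin role, or same project."""
    return "admin" in creds["roles"] or creds["project_id"] == target["project_id"]

def has_role(role):
    """Build a check that passes when the token carries the given role."""
    return lambda creds, target: role in creds["roles"]

def any_of(*checks):
    """Build a check that passes when at least one sub-check passes."""
    return lambda creds, target: any(check(creds, target) for check in checks)

# Policy name -> rule, mirroring the spoken example.
POLICY = {
    "delete_subnet": any_of(has_role("network_admin"), admin_or_owner),
    "create_router": has_role("admin"),
}

def is_allowed(action, creds, target):
    """Permit the request only if a rule exists and it passes.
    Unknown actions are denied in this sketch."""
    check = POLICY.get(action)
    return bool(check and check(creds, target))

if __name__ == "__main__":
    member = {"roles": ["member"], "project_id": "p1"}
    print(is_allowed("delete_subnet", member, {"project_id": "p1"}))  # True: owner
    print(is_allowed("delete_subnet", member, {"project_id": "p2"}))  # False: not owner
```

The point of the sketch is only that a request is matched to a named policy, and the policy's rule is evaluated against the roles and project in the caller's token.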
The admin_or_owner rule, again, maps to a role or to project IDs. What it means is that you can only perform this action in the context of your own project.
etcd is a key-value store that is becoming very popular right now, because it allows you to create keys and values and exchange messages between the various components of Neutron using etcd watch mechanisms. RBAC support is also available in etcd, so if you're using it, try to leverage RBAC. You can define a role in etcd using the role add command. Once a role is defined, it can be granted access to various parts of the key space. It's the same concept: you define a role and grant it to the areas that you want a particular API user to have access to. The command below grants read access to all keys prefixed with networking-vpp. So you grant the role that you created to a path in the key-value store, and you can use a prefix; this is what we use for our networking-vpp project. That gives read access to anything that is prefixed with networking-vpp. Then you can grant roles to users using the etcdctl user grant command.
Security groups. A security group in Neutron functions like a virtual firewall at the Neutron port level; that's an important thing to understand, and it's better to visualize it that way. It is not at the instance level, whereas Nova security groups were at the instance level, where all of an instance's ports had the same security group. So if you're using Nova, disable Nova security groups and configure Nova to proxy all security group calls to Neutron. You can do this by setting the firewall driver in Nova to nova.virt.firewall.NoopFirewallDriver and the security group API to neutron. Another advantage of using Neutron security groups is that they enable traffic filtering for both ingress and egress traffic.
In Nova, by contrast, you could filter ingress traffic only. The way to visualize a security group is as a container of permit rules. There are no deny rules in there; it's just permits. If you're a firewall admin, it requires a slightly different way of thinking. You can only put in permit rules, and these rules are stateful, which means that if you permit traffic outbound, you don't need an explicit rule to permit the response traffic, and the other way around. The rules are pretty flexible and pretty rich. You can filter on IPv4 or IPv6, on TCP, UDP, or any other protocol, and on port; you can also filter on remote security group ID, remote IP prefix, and so on. Since these are all permit rules, the order of the rules in a security group doesn't matter. Anything that you don't permit is simply denied. So you can remove or add any rule; it's very flexible in that way. Unlike a firewall ruleset, where you have to think about the order, you don't have to worry about order at all with security groups. Another feature is that the rules can be updated at instance runtime, and Neutron will apply them.
Mitigating spoofing. Fortunately, Neutron implements anti-spoofing on all ports by default, and it gives you an extension, called the port security extension, for administrators to control that. So make sure that port security is enabled on all Neutron ports and networks. By default, regardless of whether you program security groups or not, when you create a Neutron port and bind it to an instance, Neutron will implement several anti-spoofing rules that prevent spoofing activities by project instances. Nobody can spoof a DHCP server, act as a router for traffic, or send rogue ICMPv6 traffic. There are a lot of anti-spoofing rules, so you don't have to worry about that.
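The permit-only, order-independent behavior of security group rules described above can be sketched in a few lines of Python. This is a simplified model under assumptions, not Neutron's implementation: the `Rule` fields and the `permits` function are hypothetical names, remote security group matching is left out, and statefulness (automatic handling of return traffic) is not modeled.

```python
import ipaddress
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    """One permit rule. There is deliberately no 'deny' action."""
    direction: str         # "ingress" or "egress"
    protocol: str          # e.g. "tcp", "udp", "icmp"
    port_min: int
    port_max: int
    remote_ip_prefix: str  # CIDR, e.g. "10.0.0.0/24"

def permits(rules, direction, protocol, port, remote_ip):
    """Traffic is allowed if ANY rule matches; otherwise it is denied.
    Because every rule is a permit, the order of the rules is irrelevant."""
    addr = ipaddress.ip_address(remote_ip)
    return any(
        rule.direction == direction
        and rule.protocol == protocol
        and rule.port_min <= port <= rule.port_max
        and addr in ipaddress.ip_network(rule.remote_ip_prefix)
        for rule in rules
    )

if __name__ == "__main__":
    group = [
        Rule("ingress", "tcp", 22, 22, "10.0.0.0/24"),  # SSH from one subnet
        Rule("ingress", "tcp", 80, 443, "0.0.0.0/0"),   # web ports from anywhere
    ]
    print(permits(group, "ingress", "tcp", 22, "10.0.0.5"))     # True
    print(permits(group, "ingress", "tcp", 22, "192.168.1.5"))  # False: no rule matches
```

Reversing the list of rules gives the same answers, which is the practical difference from an ordered firewall ruleset.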
Just make sure that you don't turn port security off unless you really need to. There are situations in which you will have to turn it off, for example if you're running a VNF, a router, a load balancer, or something else that is going to forward packets. Neutron also provides APIs to update the port security setting of a port. That should only be available to admins, which is taken care of in the default implementation.
Quotas. As an administrator, you can limit the number of networking resources available to projects. It's a very basic thing, but a lot of people forget to do it. Quotas protect networking services. As a cloud administrator, if you don't implement quotas, you may be surprised by unforeseen spikes in resource usage at nighttime. So just make sure that you have set your quotas properly. When a project consumes all of its allocated resources, the resource becomes unavailable to it. Say you have a project that you have allocated 10 networks, and they're limited to that. Especially if you're using VLAN-based networking, this is an important thing to set and manage: how many projects do I have, what quotas have I set, my limit is 4,000 or so VLANs, so how far can I go before I need to start a fresh deployment? You're limited by that, so you have to do some planning, and quotas help you do that planning. In Neutron, quotas can be set for networks, ports, subnets, security groups, and security group rules, and also for routers and floating IPs. Those are precious resources too, especially floating IPs, since each one consumes a routable IP.
Neutron provides two options for setting quotas. One option is to set the same default quota across all projects in neutron.conf, set it and forget about it. Or you can set per-project quota limits using an API extension.
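The VLAN planning question raised above (how many projects fit before the VLAN space runs out) is simple arithmetic, and a rough Python sketch may help. The function names and the dictionary of per-project quotas are hypothetical, and the sketch assumes the worst case in which every project uses its full network quota and each tenant network consumes one VLAN ID. It treats a quota of -1 as unlimited.

```python
USABLE_VLANS = 4094  # the VLAN limit quoted in the talk
UNLIMITED = -1       # quota value meaning "no limit"

def projects_supported(network_quota, vlan_pool=USABLE_VLANS):
    """How many projects fit if each may create `network_quota` VLAN networks."""
    if network_quota <= 0:
        raise ValueError("network_quota must be a positive number")
    return vlan_pool // network_quota

def worst_case_networks(project_quotas):
    """Sum of per-project network quotas, or None if any project is unlimited."""
    if any(quota == UNLIMITED for quota in project_quotas.values()):
        return None
    return sum(project_quotas.values())

def fits_vlan_pool(project_quotas, vlan_pool=USABLE_VLANS):
    """True only if the worst case is bounded and stays within the pool."""
    total = worst_case_networks(project_quotas)
    return total is not None and total <= vlan_pool

if __name__ == "__main__":
    print(projects_supported(10))                           # 409
    print(fits_vlan_pool({"proj-a": 10, "proj-b": 20}))     # True
    print(fits_vlan_pool({"proj-a": 10, "proj-b": UNLIMITED}))  # False
```

With a quota of 10 networks per project, a pool of 4,094 VLANs covers 409 projects in the worst case, and a single unlimited project makes the worst case unbounded.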
Neutron enables that too; there is a quota set command, which you can use to update quotas on a project-by-project basis.
So we have seen several security mechanisms, and I wanted to refresh your memory and put together a security checklist. First, have all interactions with the networking service been isolated into security domains? A security domain means the same level of security controls for all the nodes in that domain. If you're using an ML2 mechanism driver, does it mitigate ARP spoofing? This is something that you may want to investigate. Have you considered the pros and cons of supporting the various tenant network types? Are you using flat? If so, have you mitigated ARP spoofing? Are you using VLANs? If so, have you planned your quotas properly? With VXLAN, have you taken care of troubleshooting and things like that? Have you hardened all Neutron service components? We saw several: the API server, the messaging layer, and the communication to that messaging layer, which has to use TLS. Are you using Neutron security groups, and have you enabled port security? Are all communications using SSL encryption? Have you implemented RBAC using the principle of least privilege, identifying the API actions and who can do what? If you're using any new service or component, have you investigated the maturity and security features of that pluggable component? And finally, are you leveraging quotas to limit project resources? If you specify minus one, it means unlimited resources, so you have to be very careful with your quotas.
So in conclusion, in this session we reviewed several approaches to securing OpenStack networking, and I've tried my best to simplify and elucidate a lot of complex concepts. I hope you found them useful.
You can follow me on Twitter, and that pretty much concludes our talk. Thank you so much for coming. Yes, questions?
So we recently built OpenStack and tried to follow some of the networking in it, and ran into all the bits and pieces. I have to say, it was pretty arcane to follow. We eventually deconstructed it; there's some information out there on the web and so on. But trying to follow a packet as it goes into the Linux bridge, first of all, for iptables, for the policy deployment, then into OVS and the OVS switching, then VXLAN as the transport, and then out to a router, perhaps, and then to another tenant. Are you aware of any tools that help visualize all of this, so that you can pick up traffic at specific points? Because you could have a denial-of-service attack occurring on one OVS and you wouldn't even know it.
Right, right. So what I use is Tap-as-a-Service, if you have used it. Tap-as-a-Service, TAP. It is a Neutron extension. In the physical world, without Neutron, the way you go about this is that you create a mirror port, right? You funnel all the traffic to it and then use some sort of sniffer, or tcpdump, to analyze it. That's exactly what Tap-as-a-Service allows you to do. So you can mirror whatever ports you want and monitor the traffic. If you feel that there is a denial-of-service attack going on in some compute nodes, you can mirror traffic across compute nodes and have a single view of what's going on. Okay, thank you. And if you have additional network analyzers that can capture and display these packets, it'll be much more helpful than tcpdump. Yeah, of course, like Wireshark. Yeah, exactly, you can use that; it's a very nice tool. Yeah, okay.
Thanks for sharing. I have a question: you mentioned that a flat network may have some security concerns, but I don't quite understand. If you have security groups and you have port security, anti-spoofing is already covered, so what are the risks?
Right, so it depends on the actual mechanism driver that you're using. The Open vSwitch mechanism driver, and a lot of the newer mechanism drivers, provide that capability by default; there is a setting there for preventing ARP spoofing, and they do that now. So ensure that it is turned on. If you're using Open vSwitch, you're protected, but a lot of pluggable mechanism drivers don't have such features, so you may want to investigate whether that is supported. Okay, thanks. Any other questions? Thank you.