Good afternoon, everyone. My name is David Lapsley. I'm an engineering manager at Cisco OpenStack Private Cloud, formerly Metacloud. This afternoon, I'd like to talk to you about some of the work that we've been doing on VXLAN and on a distributed VXLAN service node. This is joint work with Chet Burgess, who's our chief architect — he's over there — and Kahale, one of our senior software engineers. Before I start, I'd just like to get an idea of how many folks in the room actually know about VXLAN. Could you show me with a raise of hands? Oh, wow. That's impressive. Last time I gave this talk, I think there were only a few people, so this is great. And how many folks here are actually engineers working on OpenStack? Okay, wow. This is very good.

So I'll go through the introduction slides very quickly and get to the interesting stuff, but I'd like to start briefly by motivating VXLAN and how it and other technologies like it have come about. We all know that over the last decade or so, virtualization in the data center has dramatically changed the network requirements. With virtualization on hypervisors, instead of single physical servers we now have 10, 20, maybe 100 virtual servers that look like physical servers to the switches. So it's an order of magnitude or more increase in the number of end hosts. Similarly, the number of networks has increased, and bandwidth requirements continue to go up.

This is a bit of a problem for traditional data center networks, because they were designed with L2 access aggregating into an L3 IP core, and there are a couple of inefficiencies with this. First of all, you have wasted capacity: STP, which you need at L2 to prevent loops in your network topology, blocks a lot of ports in the network. You also have VLAN exhaustion. At L2, we use 802.1Q tags to segregate different networks and different tenants, and a single 802.1Q tag only gives you roughly 4,000 VLAN IDs (the VLAN ID is 12 bits, so 4,096 values). That's a problem when you have more than 4,000 tenants in a single data center, and each one of those tenants might have a lot of networks. There's top-of-rack scalability as well: with the increase in the number of L2 endpoints, you also need bigger hardware tables in your top-of-rack switches.

So pushing L3 all the way to the edge is something that can help with this. L3 is scalable; it's well known and supported; it's been around for a long time and people are very familiar with it. It has equal-cost multi-path routing, which means you can have all of your links active and maximize your link utilization. There's a challenge, though: we still need to be able to scope our tenants and projects. How do we actually scope networks within this L3 network? And that's where VXLAN and other IP overlay networks come into play.

So VXLAN is basically a MAC-over-UDP/IP overlay network. It takes Layer 2 frames on a hypervisor or at an endpoint and encapsulates them in a UDP-over-IP packet. It reuses the existing IP core, so you get L3 ECMP. It reduces the pressure on your ToR L2 tables, because now the switches just see a single physical endpoint for the tunnels going across them, but within those tunnels multiple virtual endpoints are encapsulated. It also supports over 16 million virtual network IDs — not VLANs, but virtual network IDs (the VNI is 24 bits, so about 16.7 million values).
And it maintains your existing Layer 2 bridging semantics, so from the networking perspective at the endpoints, it looks exactly like a Layer 2 domain.

Here's what VXLAN encapsulation looks like. You take an L2 frame from, say, your VM on your hypervisor; it's got its destination MAC, source MAC, 802.1Q tag, EtherType, payload, and so on. You take all of that and put it into an encapsulating packet. So our frame goes in here, and we put a VXLAN header on top of it. Really the only field that's important to us is the VXLAN network identifier, or VNI, and this is what gives us 16 million unique virtual networks. Then in the encapsulating packet we have the UDP header, the outer IP header, and then the outer MAC header, which of course carries the outer source and destination MACs.

So there are three basic components to VXLAN. You have your virtual network identifier: 16 million unique virtual networks. You have your VXLAN tunnel endpoint, or VTEP, which is responsible for doing the encapsulation and decapsulation of traffic as it's leaving or entering an endpoint. The VTEPs listen on one of two ports: the IANA standard is port 4789, but by default the Linux kernel actually listens on port 8472. And then one of the very important components of VXLAN is how you maintain the VNI-to-VTEP-IP mapping, and I'll talk about that in detail as we progress.

Here's an example of a VXLAN deployment. You have two hypervisors connected over an L3 network. Each of the hypervisors has virtual machines from two different tenants: a blue tenant and a red tenant. On hypervisor 1 here we have a VTEP, which is basically a tunnel endpoint, and we also have a VTEP on the second hypervisor there. On top of the VTEP we would normally build an L2 bridge, so that multiple VMs from that tenant can connect into the bridge and send their packets over the bridge, over the VTEP, then over this tunnel here, which I've labeled VXLAN 100, to the far-end VTEP, where the traffic is decapsulated, sent out onto the tenant bridge, and then makes its way to the other virtual machines. So VMs 1, 2, 3, and 4, whether in the blue tenant or the red tenant, look like they're all on the same L2 domain. And over here you can see the various stages of encapsulation. I forgot to mention at the beginning: I'll have a link at the end to all of these slides so you can download them if you like.

One of the challenges, though, is that what we've described so far is a point-to-point tunneling mechanism. So how does VXLAN actually deal with broadcast, unknown-unicast, or multicast packets — packets that either have to go to everybody or have an unknown destination? There are basically two mechanisms. There are actually quite a few more, but they fall into two basic categories. One mechanism is to use multicast. For any VNI, you have an associated multicast group. When the first VM behind a particular VTEP comes up, that VTEP wants to listen for broadcast traffic, so it sends an IGMP join to the group so that it can receive multicast traffic sent to that particular group.
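(As a concrete reference point, this multicast flavor is what the stock Linux VXLAN driver does if you create the device in multicast mode. A hedged sketch with illustrative values — the group address, VNI, and device names are not from the talk:)

    # Multicast-mode VXLAN device: BUM traffic for VNI 100 goes to the multicast
    # group, and the kernel issues the IGMP join/leave for you.
    # dstport 4789 is the IANA-assigned port; the Linux default is 8472.
    ip link add vxlan100 type vxlan id 100 group 239.1.1.100 dev eth0 dstport 4789
    ip link set vxlan100 up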
Whenever the last VM behind a particular VTEP is spun down and there are no longer any VMs behind that VTEP, the VTEP doesn't need to receive any broadcast traffic anymore because it has no destination for it, so it sends an IGMP leave and removes itself from the multicast group. And whenever a VTEP receives a broadcast, unknown-unicast, or multicast packet, it sends it to that multicast group, and the multicast group delivers it to all of the other VTEPs that are associated with that virtual network. So it's essentially a multicast implementation of broadcast.

The second approach is to use a service node, and this is what we're going to be talking about. The idea here is that you have a central node that just sits there and maintains the mapping from VNIs to the IP addresses of all the VTEPs that belong to those virtual networks. Whenever a VTEP needs to send a broadcast, unknown-unicast, or multicast packet, it sends it to that central node. That node does a lookup in its table, figures out the IP addresses of all of the corresponding VTEPs, and floods the packet to each one of those VTEPs.

The multicast mechanism has a number of challenges. First of all, you need support from the underlying network, and then you also need to have the network configured appropriately, and this can be a bit problematic. So the service node is a much more convenient option.

Here's an example, at a high level, of how the service node works; we'll go into a lot more detail shortly. Imagine we have traffic going from VM2 to VM4. None of the red VMs on hypervisor 1 or hypervisor 2 know about each other yet. VM2 sends its packet out onto the bridge, and the packet goes to the VTEP, vxlan101. That VTEP won't have the destination in its forwarding database; in fact, what happens initially is that an ARP request comes in, which is a broadcast packet. So vxlan101 sees that it's a broadcast packet and sends it to the service node. The service node then does a lookup and sees that for VNI 101 it has two VTEPs, 3.3.3.3 and 4.4.4.4. It knows the packet just came from 3.3.3.3, so it sends the packet on to 4.4.4.4. When the vxlan101 VTEP on the other side receives the packet, it decapsulates it, puts it on the bridge, and then VM4 picks it up and responds.

Here's a little bit more detail; this is a sequence diagram. On hypervisor 1 we have VM1 and VTEP1, and then we have the corresponding VTEP and VM on hypervisor 2. The ARP request goes from VM1 to VTEP1, which encapsulates the packet and sends it to the service node. The service node then looks up the address and floods the packet to all of those VTEP IPs. In this case it was just one other IP, but in practice there could be hundreds. VTEP2 picks it up and learns: when it receives that packet, it knows the source IP of the sender, which is VTEP1, and it also knows the source MAC, the VTEP1 MAC, plus VM1's MAC and IP address. So it learns that VM1 is reachable via VTEP1. When the ARP request goes through to VM3 — not VM2 as I put here, but VM3 — VM3 responds and sends the response back. The response hits VTEP2, which encapsulates the packet, and because it has already learned where VM1 is, it sends that packet directly back to VTEP1.
So it doesn't need to go to the service node. VTEP1, in turn, learns from that packet that VM3 is reachable via VTEP2. So it's only the first packet that needs to be sent to the service node; by the time the two ends have done their exchange, the state is already set up on both sides so that they know how to reach each other. Then the ARP response from VM3 reaches VM1, VM1 presumably sends a data packet that gets encapsulated and makes its way through, and they continue their exchange. So that's basically how the service node works.

The simplest implementation of the service node is to have a single central service node. Here we have a large number of hypervisors, with only two tenants, red and blue; they're all connected to an L3 network, and all of their VTEPs point to this single service node (or it could be a cluster for HA). The problem is that if that service node goes down, you basically can't do any more learning. It's no longer possible to send broadcast, unknown-unicast, or multicast packets, so your network is down.

The solution — or one solution — is to distribute the service node functionality. Instead of having a single service node, you can have local service nodes on all of the hypervisors. You can use a distributed cache with replication, so you have redundancy in where you store your distributed data. All of those distributed service nodes can pull data from the distributed cache and write data to it, and that way they're all sharing state via the distributed cache. The advantage is that now we don't have a single point of failure, and we also don't have a single bottleneck from a performance perspective. We could lose one node and everything would keep working; we could lose another node and still manage, maybe even a third. The number of nodes you'd need to lose before you lost connectivity depends on the replication factor and on how the data is distributed across the nodes.

So that's what we did, and here's the basic design. First of all, in our deployments we have three controller nodes in quorum for HA purposes. We have, in this case, 500 hypervisors that are all serviced by those three controllers. One of the things I forgot to mention is that with the distributed service node, you have flexibility in how you deploy things. You do need a distributed service node on each one of the hypervisors, but your distributed cache could be on all of your nodes or on a subset of nodes; it's totally up to you. In this case, we have three controllers and our hypervisors, and we add a distributed service node to each of the hypervisors. The VTEPs point to those distributed service nodes. We have a memcached cluster that's fronted by mcrouter. mcrouter is software that came out of Facebook: it's a memcache protocol router, and it's what they put in front of their memcached clusters. It has a wide variety of functionality, and one of the things it can do is replication and failover, so you get that basically for free. Facebook uses it, and I think at peak traffic time they have something like 5 billion queries per second going through their mcrouter tier. So it's pretty impressive.
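(To make the shared-state idea concrete: the distributed cache speaks the ordinary memcached protocol, so conceptually each service node writes and reads VNI-to-VTEP mappings through its local mcrouter endpoint. A hedged sketch using the plain memcached text protocol — the key naming, port, and value layout here are purely illustrative, not the format the actual service node uses:)

    # Write the VTEP list for VNI 101 (17 is the byte length of the value);
    # mcrouter replicates the set across the memcached instances behind it.
    printf 'set vni:101 0 0 17\r\n10.0.0.3,10.0.0.4\r\n' | nc -w1 127.0.0.1 5000
    # Any other distributed service node can read the same mapping back.
    printf 'get vni:101\r\n' | nc -w1 127.0.0.1 5000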
So then each of the distributed service nodes points to a pool of these mcrouter endpoints. Any time this node here writes into this instance here, its data is replicated across all of these memcached instances, and similarly for the other two. Each distributed service node has a pool of these, so if the primary goes down it can fail over to the next, and so on.

As far as the implementation goes, it was actually pretty straightforward; it really didn't take very long. It's a Python program that uses multiprocessing rather than threading, so it looks multi-threaded but is actually multi-process, which means it can scale with the number of CPUs or cores you have on your boxes. It runs on every hypervisor, and it has a distributed cache, as I mentioned. It does basically two things. First, it listens on localhost for any new registrations — any new VTEPs that come up. When they do, it makes a note of that and sends the mapping to the distributed cache, so that all of the other distributed service nodes know about it. The other thing it does is listen for broadcast, unknown-unicast, and multicast packets in the virtual network.

Before, this was what the sequence diagram looked like: we had the hypervisor, the service node, and the second hypervisor. Now, on the transmission side, we have a local distributed service node instead, and that's really it. The only difference is that we're sharing state via a distributed cache rather than a central server.

So I'd like to talk a little bit about how we configure VXLAN, and then I'll give you a demo of this in action. To create a VXLAN interface, you basically just use your ip link / ip route tools. We add an interface called vxlan1, the type is vxlan, we give it an ID, and then we point it at a remote; the remote is where we send broadcast, unknown-unicast, and multicast packets. Then in this case we add an IP address. In an OpenStack deployment, you might instead add a bridge, and your virtual machines would connect into that bridge. We set an MTU size so that we have enough room for the VXLAN header, and then we bring it up. And when we have a look at it with ip addr show, you'll see the interface name and the IP address associated with it.

Now, there's one really crucial rule — credit to our chief architect, because he came up with it, and it's actually really, really neat. This won't work the way VXLAN is currently implemented in the Linux kernel without this particular workaround, and I'll explain why. What the VXLAN module in the Linux kernel does is bind to all of the IP addresses: it listens on all interfaces on port 8472, or whatever port you configure. Then, when VXLAN packets come in, it decapsulates them, looks at the VNI, and sends those packets to the appropriate VTEP. Ideally, we would want any unknown packets from those VTEPs to be forwarded to our distributed service node, so naively we would bind to localhost port 8472, the idea being that the VTEPs would forward their unknown packets there. Unfortunately, that doesn't work, and the reason is that the kernel is already bound to that port on all interfaces.
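(Concretely, the interface setup just described — plus the DNAT workaround explained in the next paragraph — looks roughly like the following. This is a hedged reconstruction: the VNI, addresses, MTU, and the "imaginary" service-node address, 192.0.2.1 here, are illustrative.)

    # Create the VXLAN interface; 'remote' is where BUM (broadcast/unknown-unicast/
    # multicast) traffic is sent -- the imaginary address the service node sits behind.
    ip link add vxlan1 type vxlan id 1 dev eth0 remote 192.0.2.1
    # In an OpenStack deployment you would typically attach a bridge here instead
    # and plug the VMs into that bridge.
    ip addr add 172.16.1.4/24 dev vxlan1
    # Leave headroom for the outer Ethernet/IP/UDP/VXLAN encapsulation.
    ip link set vxlan1 mtu 1450
    ip link set vxlan1 up
    ip addr show vxlan1

    # The workaround explained below: the kernel's VXLAN module already owns port
    # 8472 on every interface, so DNAT traffic aimed at the imaginary service-node
    # address over to localhost:8473, where the distributed service node can bind.
    # (Locally generated VTEP traffic traverses the nat OUTPUT chain; a PREROUTING
    # rule would be the equivalent for traffic arriving from the wire.)
    iptables -t nat -A OUTPUT -d 192.0.2.1 -p udp --dport 8472 \
             -j DNAT --to-destination 127.0.0.1:8473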
So you simply can't bind to that port, because the kernel is already intercepting all of the packets. That's where this rule comes in. What the rule does is a destination NAT: it matches UDP packets coming in to this IP address — which is an imaginary address; nothing on the system is actually assigned it — destined for port 8472, the VXLAN port in the Linux kernel, and it translates them to localhost on another port, 8473 in this case, which is where we can actually bind. So those broadcast and unknown packets come in, they get intercepted in the kernel by iptables, and iptables forwards them to the VXLAN distributed service node. And that's how we're able to do this.

All right, let's have a look at a demo. This is the setup: we have three virtual machines running on this laptop that are basically pretending to be controllers; they have memcached running on them, and they have mcrouter running on them, clustering the memcached instances. We have two hypervisor virtual machines, MHV1 and MHV2. Each of them has three VTEPs for different components, and they're in three different networks: 172.16.1.4 (and this should read 2.4 and 3.4), and then 172.16.1.5, 2.5, and 3.5. And then we have distributed service nodes running on those. I should also mention that at the end I'll have a link to a GitHub repo that has Ansible playbooks, so if you want to set this up yourself, you can do that by running the playbooks.

Okay, so now let's see if we can bring up my terminal. Okay, that's good. And see if I can find my other terminal. Great. This terminal here with all the writing on it is just a simple monitoring program that I wrote: every second it logs into both of the MHVs and pulls all of the IP interfaces. On MHV1, for example, it shows us the two physical Ethernet interfaces and their MAC addresses; it shows us the VXLAN interfaces, which are both bound to eth0; soon we'll see ARP entries for the VXLAN interfaces; and then here we have the forwarding database. At the moment the only thing in the forwarding database is the default forwarding entry pointing at that special address, which is where the distributed service node is listening.

So here's what I'm going to do, and this is all going to happen pretty quickly. First, on MHV1, I'm going to ping its local address, which is 172.16.1.4. Okay, great, that worked.
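(For reference, the state that monitoring output is showing can be inspected directly on the hypervisors with standard iproute2/bridge commands, roughly like this — a sketch, with device names following the demo; this is not the actual monitoring script:)

    # VTEP forwarding entries, including the all-zeros default entry that points
    # BUM traffic at the special service-node address:
    bridge fdb show dev vxlan1
    # ARP/neighbour entries learned on the VXLAN interface:
    ip neigh show dev vxlan1
    # The interfaces and their addresses:
    ip -d addr show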
Now I'm going to ping .5, which is on the other MHV, shown at the bottom there. That didn't work — destination unreachable — which is exactly what we want. Now I'm going to start up the two distributed service nodes. There's one, and here comes the second one. You can see from its forwarding list that it already knows the IP addresses of the two members of this particular VNI, and it's sending a packet from this MHV to the other MHV on VNI 1. If we look over here, we'll see that the pings are in fact going through now. And if we go back and have a look at the monitoring tool, you can see that we now have an entry: MHV1 has learned about 172.16.1.5, which is on the other MHV, and it has the MAC address for it; similarly, MHV2 has learned about the .4 address, which is on MHV1. In addition to that, if you have a look down here at the forwarding database, you can see that this MAC address is associated with the other MHV, and both of them have entries for that.

If we look one level of detail down, you'll also notice that the ARP entry for vxlan1 here begins with f6:61, and that corresponds to vxlan1 on the top right there, which is what we would expect. So that's just a regular ARP entry. What's interesting is that if you look in the forwarding database, you can see we also have that f6:61 address, but there it's associated with the actual physical interface, eth0. That's because when the host wants to send a packet, it does an ARP request, looks at its ARP table, and sees that the .5 address is reachable via the f6:61 MAC address, so it fills that into its L2 frame. Then, when the VTEP goes to forward the frame, it sees the MAC address f6:61 and knows it has to forward it to the other VTEP, so it encapsulates the frame, forwards it to that VTEP, and it gets through to MHV2.

So now if I go and kill this, clear all of the entries out of there, and do the same thing on the other side — oops, my computer is going to hang; this was not supposed to be part of the demo. Oh boy. Anyway, hopefully that'll come back. The point I was trying to illustrate is that if you do that, you can see here that the entry has been removed on the MHV1 side. I can do the same thing on the MHV2 side, and then you'll see that the ping actually stops — but it looks like my computer is not too happy about that. There we go. Do we?
Yes. All right, so here, I'll clean this out and go back, and now the host is unreachable. So that just proves that if we kill the distributed service node and clean out those forwarding database and ARP entries, the two sides can't reach each other anymore. The interesting thing is that if those entries are still there, the two hosts can still communicate with each other even if the distributed service node goes down. So that's it as far as the demo goes. I captured a couple of slides with a before shot and an after shot, if anybody's interested.

As far as future work: I'd hoped to be able to open source the code and give you a link to it, but I just ran out of time, so I'll be doing that very soon — probably in the next couple of weeks I'll have the code up, if anybody's interested in playing with it. For what it does, it's actually remarkably simple; it's really not that complex. If there's any interest, one of the nice things about this is that it would integrate with Neutron very easily. All you need to do is add a single configuration option for the VXLAN interface creation command that I showed you — basically that remote option. You add that in, and you can hook Neutron's VXLAN implementation into this, which is pretty neat. If anybody's interested, I'd be happy to work on that. And then performance and scalability testing is something we're going to be looking at in the future.

So here are the references. The slides you can download from here. The source code is not available yet, but the Ansible playbooks are there — it actually took longer to get the Ansible playbooks to work than the actual source code — so feel free to download those. If you have any problems accessing them, or if you run into any issues or have any questions, feel free to contact me; my Twitter handle is there and my IRC handle is there as well. We currently run VXLAN in production, and our production implementation is something you can actually download now; it's a multicast-based VXLAN implementation and it's highly optimized. We had some unique constraints in production that meant we really had to optimize the implementation, but it does require some expertise to configure and troubleshoot, so if you're interested in that, you should certainly take a look. Here's a presentation that our chief architect, Chet Burgess, gave with Nolan Leake at the Atlanta summit. Here's the RFC, which is actually really informative. There's a very useful book on data center architectures. The mcrouter code is there — probably the most useful thing you'll get out of the Ansible playbooks is how to actually compile and build mcrouter; there are quite a few steps, but that's all automated in those playbooks. And then there's the source code for mcrouter and some other tools that we've used.

So, any questions? "I just wanted to make sure I follow this: this is nothing but Linux bridge? It should work with the Linux bridge agent, just with that? I mean, obviously you've gone to a lot of work to make sure that you're doing that distributed cache write, and there's a lot of processing there. Any thoughts about that versus the scalability or speed of OVSDB VTEP schema updates?"
I don't know — the last time I worked on OVSDB was a couple of years ago, actually, so I can't do a comparison. That's certainly something we'll look at in the future, and if anybody has any insight, I'd love to chat with them about it. One of the things that we really liked about this architecture is that it's just really simple. There's really not that much to it, so from a maintainability perspective in production it's really easy to configure, it's really easy to stand up, and it's a horizontally scalable architecture. Any other questions?

"You had a setup with a single service node, and if it went down then the entire thing would go down. What if you just had a cluster of service nodes?" You could definitely do that. The thing with a cluster of service nodes, though, is that you then need to add a whole layer of something like Corosync or Pacemaker to keep the cluster in quorum, plus floating IPs or VIPs so that you can track that, and there's a fair amount of work there. You could actually build that cluster using this approach as well, so yeah, that's a good point.

Great — I haven't talked to anybody about it, so I'm not sure; I don't know. Yep — access control? So you mean for connections that are coming in, like authentication? I believe mcrouter has support for SSL or certificates or something similar; I'm not sure, though, I haven't looked into that yet. Oh, you mean for the VMs that are spinning up? Okay, so this is really just focused on the service node component, on distributing that service node functionality. How you manage VMs is independent of this; it's kind of an orthogonal dimension, so you could do it however you normally do it — you could use OpenStack to manage that if you wanted to.

Yes, you can, and I think there is some work going on there. Basically, what you want is a feature in the kernel module so that you can pick which interface you want it to bind to, and I think there's been work on that, but I don't know that it's been completed yet. Yep. Great. All right, great — well, thank you, everybody.