And I guess, welcome to the afternoon, the last session of the afternoon. Hi, I'm Nolan Leake. I'm the co-founder and CTO of Cumulus Networks. And I'm Mark McClain, co-founder and CTO of Akanda.

So let's talk a little bit about where we're headed today. We're going to talk about Neutron with VXLAN, a little bit about the problems and challenges, and some of the alternatives and solutions available. Then we're going to dive into what a real-world deployment looks like and what we've been working on the last couple of months, and finally take a look at the conclusions from deploying the solution in the real world.

So to kick it off, we're going to dive in and level set with a few of the basics of Neutron's story with VXLAN. But before we do that, we have to take a real quick primer on ML2, just to make sure we've got all the primitives down. We've probably all heard of the Modular Layer 2 plugin by now. It's a full implementation of the core plugin, and it has two types of drivers. One is the type driver, which is a common interface for all segment types; it manages the type-specific resources. If it's VXLAN, it handles VNI selection; if it's VLAN, it handles the VLAN ID. Type drivers are extensible, so you have type drivers for local networking (which is really for testing), flat networking so you can tie in with existing infrastructure in a data center, VLAN if you have pre-provisioned VLANs, GRE, VXLAN, and now Geneve is available. The other driver is the mechanism driver, and this is where a lot of ML2's power comes from, because it's what enables you to connect back ends in. It's a common interface, it manages the back-end-specific elements of communication, and it supports multiple drivers at once, so you can have a multi-vendor strategy. And in Liberty it's extensible, so you can implement extensions and expose them both from the ML2 perspective and also extensions that may be unique to a particular solution.

OK, so I'm going to start with a little bit about VXLAN. VXLAN is a protocol and technology used to encapsulate layer 2 packets in layer 3 packets. The reason you want to do this is that it allows you to create layer 2 networks on top of a layer 3 fabric, and that's important because layer 3, or IP, fabrics scale. The internet is larger than any data center anyone will ever build, so using the same technologies that make the internet work, you can build very scalable and very reliable data center networks as well. The key one for these purposes is ECMP, Equal Cost Multi-Pathing. It's important because now all of those links in a densely interconnected network are active at the same time; traffic can go over all of them. That gives you better, more predictable latency, higher bandwidth obviously, and faster error handling, because you already have connectivity: if one of the links fails, it's just withdrawn from service, as opposed to an L2 technology like STP, where you have to do a lot more calculation to figure out the new path the packets are going to take through the network.

There's another solution people sometimes use for building layer 2 networks, which is called MLAG. It has its own disadvantages. It's proprietary; there's no standard for MLAG. It's a very complex thing to do.
You're essentially intentionally creating loops in your L2 network and then doing a lot of dancing to keep that from melting your network down. And typically you can only have two switches in an MLAG pair with most implementations, so that limits the total amount of bandwidth; it creates a bottleneck.

So I talked about these point-to-point tunnels. And if you're experienced with layer 2 networks, that probably triggered something, which is: well, how do I handle traffic that's not point-to-point? We call that BUM traffic, and it stands for broadcast, unknown unicast, and multicast. Broadcast is traffic that is destined for everyone on a given network. Unknown unicast is traffic where you don't know where the destination is on the network; the way layer 2 works is you flood it to all possible destinations, which is essentially the only way to guarantee it gets to its real destination. And multicast is similar to broadcast, but instead of going to everyone, it goes only to the endpoints that have expressed interest.

There are a couple of different ways we can handle this, and we'll get into which one we chose later. The VXLAN standard talks about multicast: basically, on the physical network, you create a multicast group for each logical L2 network, and anyone who's participating in that L2 network can register as interested in that multicast group, so they'll only get the BUM traffic for their layer 2 network. The other way, which Mark and I have talked about at previous summits, is a software replicator. The idea there is that everyone who's encapsulating VXLAN traffic, when it sees one of these BUM packets, instead just sends it off to a special replicator, which can be software running on an x86 server or anywhere else, and the replicator then unicast-replicates the packet to all the different listeners that are interested in it. And the final one is what we call head-end replication. In this model, everything that is encapsulating VXLAN traffic can also replicate to all the interested listeners, and it can do that in hardware; so if it's a switch, it can do it in hardware.

So there are a couple of different ways you can deploy VXLAN with Neutron. The default one is doing VXLAN in the hypervisor. We talked about the L2 packets being encapsulated in these VXLAN packets; in this case, that actually happens in the vSwitch, and it happens in software. This can create some performance issues, because now software has to append this header, basically touch each and every packet that goes by. And the big thing about networking is that to get performance, you want to make sure that software doesn't have to look at each and every packet; you want something else handling each and every packet, with software just setting the rules for how to handle the packets. The other problem is that you end up with a huge number of VTEPs. Now, a VTEP is a VXLAN tunnel endpoint, a virtual tunnel endpoint; thank you, acronyms everywhere. And you end up with one per hypervisor node. So if I have 32 racks, each with 32 servers in them, you end up with 32 times 32 VTEPs, so the numbers scale up pretty quickly. And since it's happening in the vSwitches, there's nothing you can do easily to bring in devices that don't have a vSwitch.
So bare metal servers, hardware appliances like load balancers and routers, and things like that cannot be connected to the logical networks; only VMs and containers can.

The other way to do this is to do the VXLAN encapsulation and decapsulation in the top-of-rack switch. And this is interesting because it gets you hardware acceleration, right? If your top-of-rack switch supports the feature, which most new ones do now, you can have it, in hardware, take the L2 packet, put that VXLAN header on it, and send it along to its destination. And having hardware do something on each and every packet is absolutely fine, because it can do it all at line rate. Another cool thing is that now the number of VTEPs scales with the number of racks. If I have those same 32 racks with 32 servers each, instead of having 32 times 32 VTEPs, I just have 32, one for each of the top-of-rack switches in those racks. And we can now bring all of those non-virtualized devices into the logical network: your bare metal servers managed by Ironic, hardware appliances, and so on.

The feature that allows us to do this is one that showed up in Neutron in Kilo, and it's called hierarchical port binding, which is kind of a mouthful. But basically what it means is that you can have multiple different segment types, in this case a VLAN segment type and VXLAN segment types, and they will be merged using an ML2 mechanism driver. The ML2 mechanism driver knows how to talk to the top-of-rack switch, to tell it to connect this VLAN, coming in on this front panel port from a hypervisor, to this VXLAN, which is then going to traverse the IP fabric.

I feel like I've talked a lot, so maybe a diagram will make this clear. Here you can see the leaf-spine layers; we've elided most of the leaves just to make the connectivity a little less insane with all the lines. And you can see, oh, we've got a build here, that the first hypervisor has chosen VLAN 37. So the traffic for a specific VM coming out of it will be tagged with VLAN 37. It will then hit the top-of-rack switch, which will have the VLAN tag removed and a VXLAN header added; in this case, it's chosen 2003 as the VNI. These numbers can actually be much larger than VLANs, which are limited to 4,096 different tags; VNIs can go up to about 16.7 million. Then the packet traverses the spine layer, which is all layer 3, and lands at the other top-of-rack switch. In this case, the same virtual network has been allocated VLAN 55 on that switch. Now, if you're a traditional networking person, this probably seems really weird, because VLAN numbers are supposed to be global; it's a virtual LAN. But in this case, since the layer 2 boundary never leaves a single rack, you can reuse the VLAN numbers. In fact, in our case, you can reuse VLAN numbers between different front panel ports, so you have the full 4,096 VLAN numbers for each and every hypervisor, which should be enough for decades.

So now we're going to talk a little bit about a real-world deployment. In this case, this was done with DreamHost and their new DreamCompute cluster. They built the network on white box, or bare metal, switches running Cumulus Linux. For those of you who don't know, Cumulus Linux is a Linux distribution based on Debian that runs, instead of on x86 boxes like your laptop or your server or your desktop, on switches.
And so a switch is basically just a fairly standard-looking server with one very special part, which is a forwarding ASIC. That's what allows it to do 32 ports of 40 gigabits per second of bandwidth at line rate. And Cumulus Linux, unlike most traditional network operating systems, does not have a proprietary CLI. The way you configure it is the same way you configure Linux switching on a hypervisor: you use the standard Linux tools. iproute2, so you can do "ip route add" and then type out your little command. You can use brctl to create a bridge and to put ports into the bridge. So on a hypervisor, you would create a bridge device, usually called br0 or something like that, create the tap device for the VM, put that into the bridge, and then also bridge in eth0, for example, to connect that VM to eth0 (there's a rough sketch of those commands below). Those exact same tools work on Cumulus.

But behind the scenes, what we're doing is hardware accelerating all of that. In a hypervisor vSwitch, all the packet forwarding is being done in software; all those forwarding decisions are made by software. In our case, we program the hardware, that forwarding ASIC I talked about, to do in hardware what Linux would have done in software. This is similar to, if you're old like me and you played video games in the 90s, you used to have software rendering drawing the little game world you're playing in 3D. Then along came hardware accelerators, and you dropped one in: everything looked pretty much the same, maybe the graphics got sharper, but everything was way faster because it was being done in hardware. So basically, you can think of a switch running Cumulus Linux as a Linux box that just happens to have a huge number of 40 gig or 10 gig or 100 gig interfaces. A relatively common switch today would have 32 40-gig interfaces, and those just show up as 32 Ethernet interfaces.

So on top of the layer 2 provided by Cumulus in DreamCompute, Akanda is being used to implement layer 3 and above. Akanda provides a layer 3 plugin. It provides routing via service VMs, it also provides additional services such as DHCP and metadata, and, coming soon, implementations for load balancing as well. It's all based on the open source Astara project, which is the newest entrant into the OpenStack big tent. But Akanda, while being the newest project, was actually developed at DreamHost in 2012 and has been in production for a while and hardened. And it enables changes like this: rolling out a new architecture as well as transitioning off the old one that was used in early deployments of DreamCompute.

The OpenStack environment is based on the stable Kilo release. It's using ML2 drivers designed for hierarchical port binding. Essentially, it's what Nolan touched on a little bit earlier: with hierarchical port binding you have VLAN between the server and the top-of-rack switch, and then you have VXLAN between the top-of-rack switches. So it's essentially the exact sample case for hierarchical port binding.

So I talked a little bit earlier about the concept of IP fabrics and the advantages there, so I figured we could have a diagram that might make it a little clearer. This is a leaf-spine, or fat tree, in the terminology. In the DreamCompute case, they have chosen 32-port 40-gig switches for the spine layer.
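Here is that rough sketch of the hypervisor-style bridge setup using plain brctl and iproute2 commands. The interface names (br0, tap0, eth0) and the example route are placeholders for illustration, not anything from the DreamHost deployment:

    # Create a bridge and connect a VM's tap device and the physical NIC to it
    brctl addbr br0
    brctl addif br0 tap0      # tap device backing the VM's NIC
    brctl addif br0 eth0      # physical uplink
    ip link set dev br0 up

    # iproute2 works the same way on a Cumulus switch as on any Linux host
    ip route add 203.0.113.0/24 via 198.51.100.1

On a Cumulus switch the same commands apply; the interfaces are just front panel ports like swp1, and the resulting bridge and routing state is offloaded into the forwarding ASIC.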
And so what that means is that since there are 32 front panel ports on those spine switches, there can be up to 32 racks, because each spine switch needs to connect to each rack. And this scales pretty big, right? 32 racks full of one RU servers gets you a pretty high number of servers, and with all the VMs on those, you end up with a fairly large number of endpoints. The other cool thing about the leaf-spine, or fat tree, architecture is that it's very scalable. What we've shown here is a zero overcommit network, and what that means is that any server can talk to any other server at full speed. In an overcommitted network, two servers in the same rack can talk to each other at full speed because they're connected to the same switch, but if you then have to talk to a server in another rack, you're going to be going through a congested core switch, and that's going to slow you down, depending on the overcommit factor. Common ones can be five to one, even ten to one, which I wouldn't recommend. You can look at it as: if I have 10 gig connectivity coming out of my server and I'm going through a 10-to-1 overcommitted core, that means I can talk to other VMs in my rack at 10 gig, but to the rack next to me I can only talk at one gig. So I have to be very careful about where I place my workloads so that they can communicate with the other collaborating servers they need to talk to.

So going back to the BUM packets: I talked about the different options, but now I'm going to drill down on how we did it in the DreamHost deployment, the DreamCompute deployment. We decided to use head-end replication. We wanted the hardware acceleration of the packet replication; we didn't want that done by software. And that means we're doing the replication in the top-of-rack switches, because in this deployment, that's where the VTEPs are. So the VLAN-tagged packet goes up to the top-of-rack switch, and if it's not a unicast packet, it needs to go to multiple destinations, so the hardware looks up the list of destinations and sends one unicast copy to each of them. So now the IP fabric side only ever has to see unicast packets.

And I kind of glossed over something there. I don't know if anyone's going to call me out on it, but I said it replicates the packet to all the other endpoints that need to know about it. How did it know? The answer is a pair of very simple Python programs that we wrote. They're open source. One is called VXRD, which again is kind of a mouthful; we like these mouthful acronyms for some reason. That's the VXLAN registration daemon, which is probably clear as mud. What it does is, as a VTEP, it keeps track of which virtual networks (which VNIs, in the terminology) this particular VTEP is interested in. In our case, that means which virtual networks are connected to VMs that live in that rack. And what it does with that information is send it to the VXSND, which stands for, well, it doesn't really stand for anything; it's the VXLAN service node daemon. That's a simple program that runs, well, it can run anywhere; in this case, I'll get into that. It collects all that information from all of the VXRDs, so now it has a global database across all the VTEPs of which VTEPs are interested in which VNIs. And then it pushes that information back down to the VXRDs.
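For a sense of what that replication list ultimately looks like on a Linux VTEP: the flood list for a VXLAN interface is expressed as all-zeros MAC entries in its forwarding database, one per remote VTEP. This is an illustrative sketch with placeholder interface names and addresses, not necessarily the exact mechanism the DreamHost daemons use to drive the switch hardware:

    # BUM traffic on vxlan2003 gets one unicast copy per remote VTEP listed here
    bridge fdb append 00:00:00:00:00:00 dev vxlan2003 dst 192.0.2.12
    bridge fdb append 00:00:00:00:00:00 dev vxlan2003 dst 192.0.2.13
    bridge fdb append 00:00:00:00:00:00 dev vxlan2003 dst 192.0.2.14

On a switch, entries like these are what get offloaded into the ASIC so the replication happens at line rate.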
And so now the VXRDs know: hey, for all the VNIs I care about, all the ones that are associated with VMs running in my rack, here's the list of other endpoints that are interested in them. And that's all the information a VTEP needs to program the hardware to do that replication at line rate.

I kind of glossed over where they run. In this deployment, it's actually extremely straightforward. The VXRDs run on the VTEPs, which are the leaf switches, the top-of-rack switches. The VXSNDs, to reduce complexity, we just figured, hey, run them on the spine switches. Cumulus Linux switches are just Linux boxes, so any software that runs on a Linux server, you can run on the switch. In this case, I think we drew it as every spine switch running one. Typically, in a normal deployment, you'd run it on only two or three spines. You could run it on more, but they're not particularly resource-intensive; you only need more than one for high availability, and if you lose more than three spines, you probably have other problems.

So one of the questions that comes up: hierarchical port binding is really a mouthful; is it really complicated to deploy? If you take a look at the config file, that's actually the entire config file necessary for ML2 and Neutron. It includes both the plugin and the agent file, all collapsed into one. So if we walk through it: basically, at the top, we're just saying our default tenant network type is VXLAN. Then we say which type drivers we want available, and then the mechanism drivers. And if you notice, we're running both cumulus and Linux bridge. The reason we have both of those mechanism drivers enabled is that with hierarchical port binding, you have to bind each of the segments. The cumulus driver takes care of binding the VXLAN in the switches and notifying the switches; the Linux bridge driver takes care of notifying the hypervisors which VLAN ID has been selected and passing that information along to the agent. Within that, you just have your standard configuration for the physical network name within the rack, which we just called racknet, and which VLAN ranges are available to be used. Then, going towards the bottom, there's a little bit of extra information for the agent. Basically, you've told the agent that VXLAN is not enabled. One of the challenges with Neutron is that it's so configurable that sometimes when you turn a couple of things on, you have to selectively turn other things off, which is why you'll see enable_vxlan set to false even though VXLAN is actually in use, handled by the Cumulus side. Then you'll see the VLAN information, which is a bit more specific, as well as, for the bridge itself, the physical interface mapping: within the actual hypervisor, it maps the physical network name onto the actual interface. And lastly, there's the configuration for when it's time for the cumulus mechanism driver to notify the switches; it's going to call out to their APIs, and that's the last bit of config. But really, that's it. It's amazing, for something complex, that the configuration is really this simple.

Yeah, and almost all of this configuration is boilerplate. Things like enabling the firewall driver, which any ML2 configuration is going to have. Similarly, the agent rootwrap and things like that. That's pretty much everything.
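The slide with the actual config isn't reproduced in this transcript, but a rough reconstruction of an ML2 configuration along these lines looks something like the following. The section and option names in the [ml2], [ml2_type_vlan], [ml2_type_vxlan], [securitygroup], [linux_bridge], and [vxlan] sections follow the standard Kilo-era ML2 and Linux bridge agent layout; the values, and the option names in the [ml2_cumulus] section, are illustrative guesses rather than DreamHost's actual settings:

    [ml2]
    tenant_network_types = vxlan
    type_drivers = vlan,vxlan
    mechanism_drivers = cumulus,linuxbridge

    [ml2_type_vlan]
    # Per-rack VLAN range used on the link between hypervisor and top-of-rack switch
    network_vlan_ranges = racknet:100:2000

    [ml2_type_vxlan]
    # VNIs used across the IP fabric between the top-of-rack switches
    vni_ranges = 1000:2000

    [securitygroup]
    firewall_driver = neutron.agent.linux.iptables_firewall.IptablesFirewallDriver

    [linux_bridge]
    # Maps the physical network name onto the hypervisor's fabric-facing interface
    physical_interface_mappings = racknet:eth1

    [vxlan]
    # The hypervisor agent does no VXLAN encapsulation; the top-of-rack switch handles it
    enable_vxlan = False

    [ml2_cumulus]
    # Hypothetical option name: the top-of-rack switches the driver notifies over their REST API
    switches = 192.0.2.11,192.0.2.12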
The only things you have to customize: if you want to use a different VLAN range between the top-of-rack switch and the hypervisors, you can change that, though there's no real need to. And the other thing you have to change is that you have to put the IP addresses of all the top-of-rack switches in that last part, the ML2 cumulus part, just so that the Neutron server knows how to talk to all the switches. But pretty much all the rest of this you can cut and paste right in.

That's just one part of the solution, though, of course. We still have the VXSND, which, given its great name, you may have forgotten is the database server. Its configuration is very simple. It basically needs to know its own IP address so it knows what interface to bind on. We could just have it bind on all interfaces, but for safety, or cleanliness I guess, we'd like you to specify a specific IP to bind on. And then there's the option that enables listening mode; that's for a different mode: if you're using this thing as a software replicator, you set it to true. In this case, we're not using software replication, so you set it to false.

Then the agent that runs on the switch, which takes these commands from ML2, from the Neutron server, also has a pretty simple configuration. In this case, we've thrown cleanliness to the wind and bind to all interfaces. And we've specified port 8140; I believe that might actually be the default, so we may not have even needed to specify it.

Then we tell it what local interface, sorry, what local IP address to use for originating VXLAN packets. So for those unicast packets that get encapsulated, this will be the source IP, and similarly for the replicated packets, this will be the source IP. And the final piece is that you have to tell it which interfaces to ignore. In this case, eth0 is the management interface on the switch, so obviously we're not going to be sending VXLAN packets out of that; we're only using it to manage the switch. The other is swp1; in this config fragment, we only have one port going up to the fabric, but in a real deployment you would have a few more interfaces there. We're going to clean that config up a little bit; I think we can eventually get it down to just the local bind address.

And the last piece of the Neutron-specific part is the VXRD. This is the registration daemon, the one that runs on the VTEPs and tells the database which VNIs it's interested in. In this case, we need the service node IP, which is the server running the database, and then we need our own source IP to bind to and to use when sending the packets up. And the last part is head_rep equals true; that's turning on head-end replication. If you're using software replication, you'd set that to false.

And then the last piece of the puzzle: I talked about an IP fabric, and sometimes people get a little freaked out about routing protocols. This is the entire routing protocol config. You may not recognize some of these commands; that's actually okay, because you can just cut and paste this whole thing. You probably want to change the password, and you might need to fiddle with the interface names if you plug things in differently. And you'll have to pick your own ASN, which is the autonomous system number. The good news is that you don't need to know what that means; you just need to pick a number there and use it everywhere.
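For illustration, here is a minimal sketch in the spirit of the Quagga BGP unnumbered configuration used on Cumulus Linux. The ASN, router-id, password, and interface names are placeholders, the exact syntax varies by Cumulus and Quagga version, and real fabrics often use eBGP with a distinct ASN per switch rather than one shared number; this is not the actual DreamHost config:

    router bgp 64512
     bgp router-id 192.0.2.11
     neighbor fabric peer-group
     neighbor fabric remote-as 64512
     neighbor fabric password CHANGE-ME
     neighbor swp1 interface
     neighbor swp1 peer-group fabric
     neighbor swp2 interface
     neighbor swp2 peer-group fabric
     redistribute connected

In a sketch like this, the router-id is the only per-switch value, which lines up with the next point about templating.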
And then the last piece is that IP address: that's the only piece that needs to change for each and every switch. So if you have some templating engine or some automation tool, it is extremely easy to blow out this template, changing that IP address, across all of the spine switches and all of the leaf switches in your network.

And so, results. We did all this work; what did it buy us? We've moved the VTEP from the hypervisor to the top of rack, so we've greatly reduced the number of VTEPs in our system. And you may be wondering, okay, that's kind of cool, but what did that buy us? Well, it increased bandwidth a little bit. Before, with the software VTEP, you could get about two gigabits per second using VXLAN. The reason for that, without getting too deep into the gory details, is a NIC feature called TCP Segmentation Offload. The big problem is that most NICs today don't know anything about VXLAN, but they do know about VLANs, so this optimization has to be turned off when you're using VXLAN but can be left enabled when you're using VLANs. And it's apparently a pretty significant optimization, because it gets you a good four to five times speed increase. And it's proven at scale: you can build fairly large networks out of this, pretty much as large as you'd want a single OpenStack cluster to be. And it allows you to fully utilize your capacity: by building an L3 fabric, an IP fabric, you can use ECMP to enable all of your links, the ones that something like Spanning Tree Protocol would disable in a layer 2 network.

So that's it. Any questions? Do we have the microphone? We'll repeat it. Yeah, we'll repeat it.

Yes. How do we do L3? Where?

So in this case, the question was, where do we do L3? I'll let Mark field that one.

So with L3, Akanda is actually providing the L3 service. The way Akanda works is that it will distribute the routers; it will actually instantiate the logical routers within the fabric. In the case of DreamCompute, they're based on service VMs, so those VMs are actually running alongside other workloads already in the Nova compute cluster. In terms of upgrades, it's actually pretty easy, because you can update the control plane, and with DreamHost itself, you can take the image for the appliance that's used for routing and do a rolling upgrade over time, replacing those in a nice seamless way while keeping traffic up.

Microphone, microphone.

Oh, are they running as a service VM on each compute node, or?

They're running on some of the compute nodes, not on every compute node. Within the configuration, the routers can be paired with VRRP, so you can get failover, and you can roll through and upgrade them one at a time. But it really depends on how they're scheduled within the compute cluster. So likely, yes, you do have them, but a tenant's traffic may not necessarily be on the same node as its router.

Okay, so I'm trying to understand the packet flow. And the other question I had was, how do you allocate the local VLANs on each rack? What do you use for that? Because I think the example was VLAN 55 or something of that sort.

Yes, so the local allocation is basically: when it comes time to bind the port onto the host, the driver will go through and look at the available VLANs for that particular hypervisor, select one within the allowed range, and then mark it as in use.
And then it will repeat that process, so if you spin up a VM, when the port's bound it's going to allocate and bind it all the way up to the switch, and when you spin up another endpoint somewhere else, it's going to do the same thing. But because the pools of values are different between each host, you may get the same one, you may not; it's kind of unpredictable that way.

So, but then would each compute node do the same allocation in the same rack, or would it be different on each compute node? So let's say I have a VM on network one on compute one, and then on compute 10 another VM on the same subnet comes up.

Yeah, so that's actually one of the more interesting features; I can talk to that a little bit. The way the Linux bridging model works is you essentially have interfaces, right? Normally eth0, eth1, eth2, whatever; we call them swp1 and so on for switch ports. But if you create a VLAN, basically you're creating a sub-interface. So VLAN 100 on eth0 would be eth0.100, and that is a separate interface. So if I create a bridge, I can very easily bridge eth0.100 with eth1.200, and now I can choose whichever VLAN numbers I want, as long as I keep track of them and make sure that all of the VLAN numbers in that rack that I want to be on the same network are connected to the same bridge. Then, to connect it to other racks, we connect a VXLAN interface into that same bridge, and that has the VNI for that virtual network. That VNI is global, but of course that space is 24 bits, so you have a lot more of those. (There's a rough sketch of this bridging a little further down.)

Interesting, I didn't get that, but I'll... So the other question I had was, you guys are using Linux bridge, so what happens to security policies and ACLs and all that stuff?

I mean, security groups, anti-spoofing, all the standard features that you find now work as expected within Neutron.

But on Linux bridge, how do you apply the OVS flows and all that stuff? Sorry, what was the last part? I meant the Neutron security rules and so on; without OVS, how do you enforce those with the Linux bridge?

So the interesting thing is that even if you're running OVS today, the reference implementation bridges through a Linux bridge to do the same thing; it's using netfilter to enforce all the security group policies.

And the other question I had was, you talked about the VXSND, right? For those services, how do you replicate state in case of failure, or, for example, in the case of a VM migrating from one rack to another rack, how do you determine the topology there?

Yeah, so OpenStack knows when it migrates a VM from one rack to another, and it tells Neutron. So it'll see the VM disappear in one and pop up in the other, and then the new VTEP will register its interest with the database and show up, and the old one will eventually age out. For a brief period of time, the packets will be unnecessarily replicated to the old destination, and they'll just be dropped there by the hardware.

And the mesh you create within your switches, does it happen automatically, or how do you direct it? Let's say I bring up a new rack, for example; how do you connect it?

So all the racks are connected to the spine layer, and it's just a routed mesh, so you can use OSPF, you can use BGP. We're using BGP unnumbered in the DreamHost case, and I forgot to mention that.
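Here is the rough sketch referred to in the bridging answer above: a per-rack VLAN sub-interface and a VXLAN interface joined in one Linux bridge, using standard iproute2 and brctl commands. The interface names, VLAN ID, VNI, and local address are hypothetical placeholders:

    # VLAN 100 on switch port swp1 becomes its own sub-interface
    ip link add link swp1 name swp1.100 type vlan id 100

    # A VXLAN interface carrying VNI 2003 across the IP fabric
    ip link add vxlan2003 type vxlan id 2003 local 192.0.2.11 dstport 4789

    # Bridge them together: the rack-local VLAN and the fabric-wide VNI become one L2 network
    brctl addbr br-tenant1
    brctl addif br-tenant1 swp1.100
    brctl addif br-tenant1 vxlan2003
    ip link set dev swp1.100 up
    ip link set dev vxlan2003 up
    ip link set dev br-tenant1 up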
The reason that config file for BGP was so simple is that, instead of having to assign a subnet to each pair of interfaces between routers, you just give each router switch a single IP address, and it uses that to communicate with all of the other routers.

Okay, and the last question I had was, in the ML2 config I saw a bunch of switches. So when I make a Neutron API call, do you talk with each of the switches to provision? Is that right, or is my understanding right?

So you found one of our hacks. Yes, today we would send it to all of them. We're working on a change to learn back from the individual switches which hypervisors are attached to them, so then we can prune it and only send it to the ones that need it. But for relatively small deployments, having to replicate a short message 32 times is not a big deal.

But yeah, the rollback is a problem, though, in case things fail. Sorry, what was that? In case things fail: with 32 switches, if two of them fail for whatever reason, then you'll have to start rolling back on all the rest of them, right? I'm not sure I understand. I mean, when a Neutron API call comes in, if you're talking to, let's say, 32 switches in your fabric, you need to take care of transaction rollback and atomicity and so on. So, we're going to clean that up a bit. Cool, thanks.

Thank you, right there. I'll repeat the question while the mic gets passed back. Thank you.

So the question is, in the slides you show a cumulus mechanism driver, which, if I understood correctly, is the software element that makes the configuration on the Cumulus switch so that the VLAN is mapped to the VNI. Yes. Okay, now what is not clear to me is whether this mechanism driver is already upstream in OpenStack, or whether you get it when you buy the Cumulus Linux license, or whether it is yet another product.

So it's not upstream yet; it's up on our GitHub. You can download it today. We're working on upstreaming it. Right now you download it from our GitHub and then copy it into your plugins directory, or the ML2 mechanism drivers directory, but it'll be upstream soon.

Okay, so it's so new that it's not been published yet. Well, it's published on our GitHub; it's just not been upstreamed yet. Ah, okay, okay. And so will the documentation be integrated into the Cumulus documentation, or? Well, it'll probably be in the Cumulus documentation, and also, when it's upstreamed, it'll be in a different repo, because the ML2 mechanism drivers have been split into different repos, so core Neutron no longer has any vendor drivers in it. Okay, thank you.

So from Neutron's perspective, what is the network type of the tenant network? Is it still a VXLAN segmentation ID, or is it a VLAN segmentation ID?

The network type will report back as VXLAN; from the tenant's perspective, they'll see it as a VXLAN network. And that makes sense, because the VLAN numbers can be reused in different racks, so the only number that uniquely identifies a logical network is the VXLAN VNI.

Okay, so at that point you're eliminating the whole VXLAN tunneling between virtual switches, and it's basically going to do all that stuff directly through the top-of-rack switches, right? Yeah, the virtual switches only work in terms of VLANs. The bridging between VLANs and VXLANs happens in the top-of-rack switch.

Is there a question in the back? Sorry, could you repeat the question on the microphone? I couldn't hear it.
So when you have two different logical networks within the same tenant connected to a Neutron router, which in this case is going to be an Akanda router, the packets will have to traverse the entire fabric, reach the Akanda router, get routed, and come back in. Right, there's no distributed routing?

So in this particular case, if you're routing between two network segments of the same tenant attached to the router, yes, it will traverse the router, or the pair of routers; it's not DVR. In other modes you could run DVR on it, but that wasn't chosen for this particular case because of some other operational concerns. Akanda makes it easier in terms of rolling out new appliances; that was kind of the choice there.

Just one more question. In this current deployment, how many physical interfaces does each server have? Does it have, like, two 10 gig interfaces connected to two Cumulus switches, or?

So this particular deployment is single-attached on the data plane side. There's one connection from the top-of-rack switch going to the server for carrying tenant traffic; it's more of a public cloud use case. We can actually support MLAG, where you end up with two top-of-rack switches configured as an MLAG pair. They appear to be a single switch to the hypervisor, which just sees a bond, and then the same L2-to-L3 handoff happens in the pair of top-of-rack switches. The rest of the network is still L3, so the MLAG is limited entirely to a single rack; you don't have the looping problem you get with large MLAG networks.

So with MLAG, you would be able to support this entire architecture that he just mentioned? Yes; a new feature in Cumulus Linux 2.5.4 is the ability to run VXLAN and MLAG at the same time. Thanks.

How does the north-south traffic work from the service VM? Does it go over a VXLAN tunnel and just pop out on the WAN somewhere, or?

Yeah, so there's an L2 gateway currently in use, and that connects into the rest of the routing infrastructure inside of DreamHost. One of the things with Akanda is that each of the individual service VMs is actually speaking BGP northbound to announce reachability information for the tenant segments. So that information is learned, and you get routing both ways. Yep, yep.

Close to the mic, you're next. Sorry. Is there a roadmap to have distributed L3 at the top-of-rack switch? So instead of just doing bridging from VLAN to VXLAN, can you support routing as well?

Well, the current common hardware platforms, the forwarding ASICs, can't route once the packet has been decapsulated. So today, no. New chips are coming that will support routing after decapsulation, and then we could, in some configurations, have that tenant-to-tenant traffic be hardware routed as well. But today, you have to bounce through the VM, or through an L3 agent if you don't use Akanda.

So coming back to what you referred to as your little hack, addressing all the switches with the port information for the hypervisor: as far as I can remember, in the configuration you specified on the hypervisor a particular port where you are connected to the upstream, and all the switches. So if two hypervisors are connected to the same port on different switches, how do you differentiate? For something like P6?

So the question, if I understand it right, is: if two hypervisors are attached to the same front panel ports on different switches, how do we differentiate them?
So within that, the driver and the way it talks to the REST API is that the switch learns the hosts attached to it. And then when it comes time to bind, when we flood the switches with the binding information, we're able to say: if you have this host attached, bind it, and the switch will know which front panel port that host is on. So that way you can actually have different...

Are you asking how the switch knows which hypervisors are on which port? You're using LLDP or something like that? Today we use LLDP. You can also configure it statically with a config file if you don't want to use LLDP for any reason.

Okay, I understand. I was just a little bit confused because there was an actual port mentioned in the configuration. Yeah, so today, if you don't specify the list of hypervisors, it assumes you're using LLDP and tries to consult LLDP. You know, it does make sense to me to have the L2 agent that's running on the hypervisor report up to the switch it's attached to who it is, but that's not there today.

I have a question over here. So how do we handle IP multicast traffic?

So the routed fabric, the physical network, does not handle multicast. So all of the problems that are associated with routed multicast just go away, right? You don't have to run PIM-SM or any of the multicast routing protocols. The way that gets handled is that the head-end replication takes care of it. If a VM sends a multicast packet, it goes up to the top-of-rack switch, which then knows, hey, the following other six racks have other VMs that are in this logical network; I need to send that multicast packet to all of those other top-of-rack switches so they can send it down. And the replication actually works fairly similarly to multicast, and actually uses the same hardware inside the chip that multicast uses; that's how we do the head-end replication in hardware. So it's very similar in terms of how it happens, but it means you don't have to run another routing protocol.

But the VNI only tracks layer two. If you have, you know... We don't do IGMP within the VNI. Across different VNIs, I mean... So different VNIs are totally independent. A multicast packet on one VNI will only go to the racks that are interested in that VNI; it will not go to other racks.

Yeah, but if you have multiple VNIs belonging to the same tenant and they're in the same IP domain, they could have layer three IP multicast groups, right? Well, you'd have to have a multicast router between the different VNIs associated with a single tenant. Okay, so you're going to run... They're different L2 networks. Or are you going to run PIM or those kinds of protocols? You could; I'm not sure. I'm not sure I understand the use case. It seems, you know, if you have a bunch of VMs that want to multicast with each other, the easy thing to do is to put them in the same VNI, in the same logical network.

Okay, all right. Well, thank you very much.