All right. Thank you all for joining us today. Welcome to our session. We're going to discuss hierarchical port binding: what it is and why you should deploy it. Real quick, just want to introduce ourselves. My name is Mark McClain. I'm the co-founder and CTO of Akanda. I'm Nolan Leake. I'm the co-founder and CTO of Cumulus Networks.

So where are we headed today? We're going to recap some Neutron basics for those who are not familiar with the details of Neutron, talk a little bit about the design considerations when you're building and constructing a network, and then the real meat: what is hierarchical port binding, why do you want to run it, what's the benefit? And then lastly, we're going to do a demo of hierarchical binding with fully open components, full open source.

So, Neutron in two minutes or less. Again, it's an OpenStack talk, so we have to show this logo. Remember: compute, networking, and storage; networking binds the pieces together. Neutron's design goals, in a real nutshell: be able to create rich topologies. Be technology agnostic; this is what allows us to have multiple L2 backends and multiple implementations for Layer 3 through 7 services. Make sure it's extensible, so that you can support new technologies and new features; hierarchical port binding is an extension. Support advanced services: load balancing, VPN, firewall. And remember, the APIs are generic. So the cool thing about hierarchical port binding is that from the user perspective, they still see the compute API, the network API, and the storage API. So with Neutron we can create the basic topologies and really create rich topologies within OpenStack. So that's the recap of Neutron. So, in terms of talking about building networks.

Yeah, so, raise your hand if you've seen me rant about L2 and L3 in the past. All right, so I won't be boring too many people then, hopefully. So when you're building a network, in general, you have some decisions to make up front. Crucially, whether you're going to build an L2 network or an L3 network. And then more specifically, if we're building an OpenStack network, we need to decide how we're going to perform tenant isolation, to isolate traffic between the different tenants that might be running on the cloud. And there are some common ways to do this. The traditional approach is VLANs, and I'll talk a bit about that. But then there are more modern approaches like GRE and VXLAN, and VXLAN in particular we're going to talk about quite a bit here.

So I'm sure everyone's seen a slide that looks roughly similar to this before. What we're showing here is you've got top-of-rack switches, labeled access here, and those are actually just L2 switches. They don't have any L3 routing functionality; all they're doing is bridging frames. And on top of it, you have the aggregation tier. This is traditionally where you would run your L2/L3 boundary. So this is going to route between different VLANs and is also going to provide connectivity to these individual top-of-rack switches. Now what you'll see here is we're running in a configuration that's typically called MLAG, which stands for multi-chassis link aggregation. I'm not sure where the C went. And so this actually causes some problems, right? Cumulus has an implementation of this, and I'll even admit that sometimes it causes problems with ours.
And the biggest problem is that this is proprietary; every vendor has a different implementation of MLAG. The other problem is you're still using VLANs, which means you're limited in scale to the 4,096, actually a little less, VLANs that you can use. And finally, and the biggest one, that pair of switches in the aggregation tier is absolutely crucial. So you have to have two of them. In fact, you have to have exactly two of them. And the issue with having exactly two is that you probably don't have enough capacity there; you probably oversubscribe through this. And typical oversubscription ratios can be as high as 10 to 1, right? So if I'm talking to another VM or another host in the same rack, I can talk at full speed: 10 gigabits, 40 gigabits, whatever I've got provisioned. If I have to talk to someone in a different rack, going through one of these things, I get one gigabit or two gigabits. And so that creates a huge east-west bottleneck. So if you're deploying something like a network virtualization overlay or a distributed storage system, this is going to bottleneck that traffic pretty badly.

So what do we do? We build an L3 network. You'll notice now that there are a lot more of those switches in the second tier. Because this is an ECMP L3 network, we can have as many as we want, which means we can choose what oversubscription ratio we want. In this case, we've actually drawn it with no oversubscription. So in this topology, a VM in any rack can talk to a VM in any other rack at the same speed as it can talk to VMs in its own rack. So now you don't have to know or care where the individual VMs are scheduled or located. But we still wanted that isolation, right? We still want isolation between different tenants, and many applications expect Layer 2 connectivity between their VMs. And so what we do there is we use a technology called VXLAN. The way VXLAN works is it encapsulates those tenant L2 packets inside L3 tunnels, so they can traverse the Layer 3 network as if they were just IP packets, because they are. And the advantage this brings is that now we can scale that IP network very easily, right, because it's an ECMP network and all of those links are active. You don't have something like spanning tree protocol behind the scenes shutting off some of your links to prevent loops. And because you have much richer connectivity through the spine, you have much better failure scenarios. If you lose one of those spine switches, it's much less of a big deal than if you lose one of two, or God help you both, of your aggregation switches.

So given those design considerations for building a network, how can we leverage that with hierarchical port binding? What we want to take a look at first of all is a quick primer, again, on Modular Layer 2, ML2. We're all used to seeing the Neutron reference implementation: you have multiple agents, you have Layer 3 agents, DHCP agents, and a Neutron server. Inside of the Neutron server is where the ML2 plugin runs. Networks are basically broken down into segments, and there's a type manager and a mechanism driver manager. Type drivers, primarily, are a common interface for all segment types. A type driver basically manages type-specific resources: so if it's VXLAN, that's VNI selection, or VLAN ID selection. Type drivers are extensible, so you can create new types. If there's a new format, you can add those in. You can configure those as a deployer.
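To make that concrete, here's a rough sketch of what configuring the type drivers and mechanism drivers can look like in ml2_conf.ini. The ranges and the mechanism driver names here are illustrative, not the exact values from the demo (the demo's actual file comes up during the Q&A later):

    # /etc/neutron/plugins/ml2/ml2_conf.ini (illustrative values)
    [ml2]
    type_drivers = local,flat,vlan,gre,vxlan
    tenant_network_types = vxlan
    # driver names are deployment-specific; the demo uses linuxbridge plus a Cumulus driver
    mechanism_drivers = linuxbridge,cumulus

    [ml2_type_vlan]
    # pool the VLAN type driver allocates dynamic segments from (range is made up)
    network_vlan_ranges = physnet1:100:2999

    [ml2_type_vxlan]
    # pool the VXLAN type driver allocates VNIs from (range is made up)
    vni_ranges = 1001:2000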
So these are the built-in types that you find within ML2. Local is mainly for testing; traffic is basically never going to leave your test environment, because you just get a bridge and it plumbs all the VMs together. Flat gives you access to the underlying network within the data center. VLAN gives you VLAN isolation, and then there's GRE and VXLAN.

In terms of mechanism drivers, it's basically a common interface to all the L2 backends, so that when you're orchestrating the segments, it manages the specific backend communication. Commonly what you'll see is the drivers have calls like create network, update network, create port, update port; that's kind of the interface. You can have multiple drivers; Neutron supports multiple mechanism drivers at once. And it's extensible: you can create new types of mechanism drivers, so as new hardware or new solutions come out, you can plug them in. We were talking yesterday, and Neutron now has 50-plus drivers if you count plugins and ML2 drivers, so it's kind of crazy how extensible it can be. And you have both open source and closed source.

So the real meat: when we talk about hierarchical port binding, what it allows you to do is have Neutron bind multiple network segments together. Previously, if you did this, you had to manually configure it, or you had to pick your tenant isolation type and that was basically it; you couldn't mix multiple kinds. With this you can actually build a hierarchy, so you can have a mix of segment types. So maybe you could have VLAN within the rack, and you could have VXLAN up top, between the ToRs. Hierarchical port binding is currently only supported with ML2 drivers; the monolithic plugins obviously own the whole network, and so they work a little bit differently. It's available now: this feature was released in Kilo, and the team really did a great job of getting it out the door.

So for example, let's say we have two VMs. For the left VM, what happens with ML2 is that when the VM spins up, we create a port, and Nova requests that the port be bound and wired in. And so in hierarchical port binding, the first step of the binding is a VLAN selection; in this case we pick VLAN 37, which will then be selected. The next step in the binding is to pick a VXLAN VNI, so we pick a VNI. If you notice here, because it's L3, we have full reach here. And then coming back down, we can have VLAN 55. The cool thing about hierarchical port binding is that because it's dynamic, you can actually have your physical networks limited per rack. And you can go even further and actually have VLAN segments dynamically allocated per port, if your underlying hardware supports it.

So Mark did a great job of explaining what hierarchical port binding is. But if you're used to traditional VLANs, or if you're a little bit more forward-looking and you've configured VXLAN in your hypervisor, you may be asking: well, why don't I just have the VXLAN encapsulation happen in the hypervisor vSwitch? And there are a couple of reasons. The biggest one, and what originally motivated this work, at least my interest in it, was we were having problems. We were having customers deploy with VXLAN, and they were having performance issues. They would have a fancy dual-socket Xeon server with two 10 gig NICs in it, and they would only be able to push two gigabits of traffic across the network.
And so we of course immediately started investigating, and what we found was that the NIC supports a lot of very cool offloads. The crucial ones in this case are TSO and LRO, and the important one here is TSO, which is TCP segmentation offload. The idea is that the application, and then the guest, can give the NIC maybe a multi-megabyte buffer, which the NIC then breaks up into individual packets. This obviously reduces the overhead quite a bit, because you have a lot fewer interrupts coming down and a lot fewer descriptor ring entries being consumed by the NIC. But the NIC only knows the things that it knows, and historically it knows VLANs. So it knows how to put a VLAN tag on the front of each one of those little packets it generates, but it doesn't know how to put a VXLAN header on there. Now there are NICs coming onto the market that do support TSO in the presence of VXLAN, and so if you are careful and buy specifically those NICs, you don't have this issue; but if you just buy a 10 gig NIC today, you are very likely to run into this.

So since we can use these dynamic VLANs coming out of the hypervisor, all those offloads are preserved. But we don't want to use VLANs everywhere, because of the issues we talked about earlier around large L2 networks. So then, as Mark explained, we do the VXLAN encapsulation in the top-of-rack switches. Now we can traverse that core, that densely interconnected core, as IP packets, so we can take advantage of ECMP and routing protocols and all the other great things you can do at L3.

And the last thing is Ironic. It's still relatively new, but it's a cool OpenStack technology for bringing physical servers into your virtual networks and having them collaborate with your VMs. The problem is, if you're doing isolation and security inside the hypervisor vSwitch, with a physical server there is no hypervisor vSwitch. So this allows you to terminate the VXLAN at the top of rack and, instead of trunking a VLAN down, actually just have the untagged packets come down to the server, which as far as it's concerned is just plugged into a normal L2 network. And finally, we can avoid the explosion of L2 table sizes, having to learn every single MAC of every single VM all over this network in the physical hardware, because all it cares about is the L3 addresses, which, since they are hierarchical in nature, are much fewer in number. And you'll be able to deploy this today, as we'll get into right here. And the last point, although most people don't run into this because it only matters at very high scale: the number of VTEPs, which are the VXLAN encapsulation points, now scales with the number of racks, not the number of hypervisors. So if you have 42 or 40 or whatever servers per rack, you now have a 40x reduction in that scale, so you're going to hit whatever bottlenecks there are in terms of scalability of the VTEPs way later, at a much higher scale.

So now we'll get into the demo. So yes, live demos are slightly cursed, so now the fun. What's inside our demo? Essentially our topology is two hypervisors connected through a pair of switches, and it's VLAN between the server and the switch, and then between the switches it's VXLAN. Our OpenStack release is the latest stable OpenStack release. We have ML2 drivers designed for hierarchical port binding; these are the Cumulus drivers. Like I said, it's VLAN between the server and the switch and VXLAN between the switches.
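One practical note on the offload point from a minute ago: assuming a Linux hypervisor, you can check whether a given NIC exposes the VXLAN-aware segmentation offload with ethtool. The interface name eth0 here is just an example:

    # check segmentation offloads on a hypervisor NIC (eth0 is an example name)
    ethtool -k eth0 | grep -E 'tcp-segmentation-offload|tx-udp_tnl-segmentation'
    #   tcp-segmentation-offload: on            <- plain TSO, works fine with VLAN tags
    #   tx-udp_tnl-segmentation: off [fixed]    <- VXLAN-aware TSO; "off [fixed]" means the NIC can't do it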
For Layer 3 services we have Akanda running. Akanda, if you're not familiar, is a Neutron L3 plugin implementation that, in addition to routing, provides DHCP, metadata, and load balancing. It's fully open source and available; it's a Stackforge project, so it basically functions like the rest of the OpenStack ecosystem. And on the switches we have Cumulus Linux.

So I was trying not to get too close, so that you don't pick up all the fan noise here, since these are designed to be running in a data center. On the bottom we've got a Quanta LY9, which is an interesting switch that we just recently got support for. It's interesting because it's actually 10GBASE-T, so you can connect your servers at 10G using just Cat6 twisted-pair cabling instead of having to use SFP+ cables or optics or more complicated physical layers. And then on top, showing the flexibility, we also have an LY8, which is a more traditional SFP+ switch: 48 10 gig ports plus 6 40 gig uplinks. Crucially, these both have a Trident II chip in them, which means they can do VXLAN in hardware, and we'll be leveraging that. And the cable here, this is essentially the L3 network. It's just a single 40 gig cable in this deployment, but normally you would have six or twelve 40 gig ports going up to six or twelve spine switches. But just carrying two of these heavy things around the world was complicated enough. Should we dive in?

Yeah, so Cumulus Linux is a Linux distribution that, if you're familiar with Debian, will look very similar, because it is derived from Debian and we try not to deviate too much. All of your standard tools are going to work out of the box: if you want to add a route you can use ip route add; if you want to configure ACLs, use iptables with the rules that you're already comfortable with. Similarly, to configure a bridge you can use brctl, and I'll show you that on the switch in just a second. Behind the scenes we're actually using the Trident II ASIC in this switch to hardware-accelerate the forwarding that Linux would have done in software. So much like if you used to play video games in the 90s: you would play your 3D game and it was very slow, and then you got your Voodoo accelerator or whatever, and suddenly it looks the same but it's way, way faster. We're doing the same thing with the Linux software forwarding, but with the Trident II ASIC here. So if you look at it, it looks like a server, just one that happens to have an awful lot of NICs.

So, escape? Yeah, okay. So I can show you. I have here SSH'd into the LY9, so we can... ifconfig is officially deprecated, but I still use it. How do you scroll up on a... two finger? Two finger, okay, two finger on the mouse. So you can see all of the front panel ports show up as swp plus a number. You're going to see all of those there, and there are lots of them. And then you can see the VXLAN interfaces that the ML2 plugin created. We have three right here: one is for the tenant network, one is for the internet, and the third one is the management network. Also, you can see the server is plugged into swp1, which is the first switch port. So you can see we've created two VLANs, three VLANs, on that. And those VLANs are then bridged, using standard Linux bridging, into the VXLAN. So you can see here VXLAN 1002 is bridged with VLAN 2001 on the front panel port. And that's how they're all communicating here. So we've talked about how they're all connected at Layer 2; just as a reminder, from the Neutron side the logical topology is still the same.
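Before we get to the Neutron view, to make that Layer 2 wiring concrete: roughly speaking, the plumbing the driver sets up on the switch could be built by hand with the same standard Linux tools mentioned above. The names mirror the demo (swp1, VLAN 2001, VNI 1002), but the VTEP address is made up:

    # VLAN 2001 arriving from the server on front-panel port swp1
    ip link add link swp1 name swp1.2001 type vlan id 2001

    # VXLAN device for VNI 1002 (the local VTEP address 10.0.0.11 is illustrative)
    ip link add vxlan1002 type vxlan id 1002 local 10.0.0.11

    # bridge the per-port VLAN into the VXLAN segment with a standard Linux bridge
    brctl addbr br-tenant
    brctl addif br-tenant swp1.2001
    brctl addif br-tenant vxlan1002
    ip link set swp1.2001 up && ip link set vxlan1002 up && ip link set br-tenant up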
And so you see we have the router, we have two VMs, and just to prove that they can talk to each other, standard ping demo, packets are going back and forth. And then, if we were to click in underneath, oops, timeout, as you see, the network type is VXLAN and the segmentation ID is 1001. Also, if we take a look at the admin side, just to show you that the instances are running on different hypervisors, that way you see that it's going across and connected. So one's on hypervisor one and the other's on hypervisor three. So we turn back to this. And so there we go.

So there's one piece I kind of glossed over here. You know, if you're familiar with VXLAN, you're probably familiar with the BUM packet problem. BUM stands for broadcast, unknown unicast, and multicast packets. The fundamental issue is that it's very simple in the unicast case to just learn MAC addresses: I learn that this destination MAC address lives on the VTEP that has this IP address, so if I see that MAC, I just send it there. It works very similarly to L2 bridging. The problem is when I get a broadcast packet, or an unknown unicast packet that I need to flood, or a multicast packet, which doesn't have just one destination; it could have many destinations. The solution we're using here today is something called vxrd, which is a VXLAN registration daemon, and vxsnd, which is the service node, or replicator. The service node is running on one of these NUCs, the same one that's running the Neutron server and Nova and all the other OpenStack components. And the registration daemon is running on the two switches, because they're the VTEPs. The registration daemons are telling the replicator, basically: I am the VTEP at this IP address and I'm interested in the following three VNIs. VNIs are the numbers that identify a VXLAN segment. So when a packet needs to be sent to everyone associated with a certain VNI, it's sent to that replicator, which will then replicate it to each VTEP that needs it. And so now, as far as the users are concerned, that packet did get broadcast, just as it would on a normal L2 network. And this is open source; you can go to our GitHub and download it and play with it. We've only got one replicator running right here, but it supports running multiple, both for high availability and for scaling the packet processing capability, as well as hierarchical replication, where you have basically one replicator per rack so that you reduce the number of copies you have to ship across the core of the network.

So just to summarize hierarchical port binding: basically, it's available today. You can deploy a fully open solution; you don't have to leverage or wait for proprietary solutions to be available. In the setup we showed here, the open solution is OpenStack, Cumulus Linux, Akanda, and then an ONIE-enabled switch. It works at scale, and it actually allows you to go further: you can maximize the capacity of your hardware and your investments. And so lastly, thank you for your time. Any questions? If you could, because of the recording, please go to the mic. Thanks.

Hello. The question I have is, as far as the L2 population, for let's say ARP, the L2-POP mechanism driver today does something like an ARP proxy. Is that the same behavior here? The way we have it configured right now, we're not actually using L2-POP.
And so, from the perspective of a VM trying to find an IP address, it'll send an ARP broadcast on the virtual Layer 2 network we've created inside the VXLAN tunnels, and then whoever's got that IP address will respond with a reply. You could in theory pre-provision these things, essentially turn on proxy ARP, and avoid that broadcast traffic if you wanted.

Okay, thanks, thanks. Follow-up: was there a reason you didn't go with L2-POP updates over the RPC channels, to make it compliant with the VTEPs that exist today, and what happened to the OVSDB VTEP schema? So as you might have seen when I was showing the Cumulus Linux switch, the switch is not using OVSDB. I know, but you can implement an OVSDB instance and communicate with everybody else that's using the OVSDB schema. We're doing that on our devices today; I was just curious about why another proprietary L2-POP service? We haven't actually implemented a proprietary L2-POP service here. Okay, well, let's do it. If you were to try to do that, you would probably do it in the hypervisors, in the vSwitch there that's configured for VLANs, as opposed to trying to do it at the VTEP layer in the switches, just because you want to short-circuit those things as close as possible. Oh, I understand. I was just curious about why you did it that way, because we're starting to see interoperability with the OVSDB VTEP schema between vendors, even those that aren't Open vSwitch; they just implement an instance of OVSDB so they can do interoperability. Oh, we implement the OVSDB schema; we don't use it in this demo. We actually did that for integration with some proprietary network overlay virtualization solutions. But in this case, since we could actually just run the Neutron code right on the switch, since it's just a Linux box, there was no need to have that extra layer in there.

Is there any help in the port hierarchy to make sure that the VLAN that gets chosen on both of the integration bridges on either side is the same? The VLANs don't need to be the same. That's actually how we eliminate the scalability bottleneck of the VLANs; the VLANs are essentially port-local. So for service VMs, where we're starting to talk about things like logical VLANs, to allow VMs with more than one VIF to actually have VLAN tagging to the VMs, any help there? I mean, that's essentially Q-in-Q at that point on the way out, and we haven't looked into that. It's something we could do, definitely, but it's not something we've tried yet. And then also, from the driver side, you could have an ML2 type driver that actually understood VLAN selection and made sure it's uniform across. Okay. Yeah, that's an option as well, but for this particular one, we decided to maximize and go for VLAN ID selection per port. Okay, thank you.

So this question is maybe a little related to the latter part of that question. It's about scale, really two related questions. What does it look like when you run out of VLANs? And is there any intelligence in the Nova scheduler, or do you have any plans to coordinate, so that VMs that would be in the same network end up in the same place, so you don't end up in a worst-case scenario where every VM on every server in the rack is in a different Neutron network and therefore you're burning thousands of VLANs per rack? Because the VLANs are actually port-local, to run out of all 4,000-ish VLANs, you'd have to use 4,000 VLANs per hypervisor.
Oh, so you are doing Q-in-Q on the switch then? No, no. How many VLANs does the switch support? Is it 4,000, or is it 4,000 times 4,000? It is around 4,000. But just because I'm using VLAN 100 on swp1 for one tenant doesn't mean I can't use VLAN 100 on swp2 for a totally different tenant, because the VLAN interface is a sub-interface off the port, which is then being bridged into the VXLAN by a separate bridge. I'm just thinking, if you have on the order of 40 or more hypervisors in a rack, and on the order of 100 VMs, you could be approaching or exceeding 4,096, no? No. If you have 40... you said 100 VMs per hypervisor? Yeah, I mean, just to make the numbers work out, it seems like you could get in the ballpark in a rack or a pair of racks. So in that case you'd only be using 100 VLANs. You'd be using 100 VLANs on each port, and they could be the same VLANs or different VLANs; it doesn't matter. But if they're the same VLANs, don't they get bridged in the switch? No. So if you come from a traditional switching background, the assumption that a VLAN ID is global is really baked into the model. But with the traditional Linux bridge, which is what we're using right here, a VLAN is actually just a sub-interface on a port. So if I have swp1, which is the first port there, and I create an swp1.100 interface, that's now VLAN 100 on that one port only. If I create swp2.100, that's a totally different VLAN on a different port, and I can bridge them into different VXLANs. I could bridge them to each other if I wanted, or I could bridge them to another swpN.200; then I'd have VLAN 100 bridged to VLAN 200. So it's extremely flexible. Okay. All right, but that's a hardware capability then, right? That would depend on what switches I have deployed in the network. Correct, but every switch we support can do that. Okay. Most of ours are Cisco, which is...

Hi, two questions. Firstly, you've talked about working with the Cumulus drivers. I'm just interested to know which other drivers are known to work, for example OVS. And second would be, could we have some links to configuration detail so we can replicate this? Okay, so the second question was you wanted the configuration detail. Yeah, and the first question is which other ML2 drivers, other than the Cumulus ones, can we try this with? Are they ML2 drivers? Yeah. I'll just pull up the config. Oh, yeah. So see, at the top, essentially we're running, yeah, tenant network type, the type drivers, local, flat, and so on, and then the mechanism drivers are the Cumulus driver and just the standard Linux bridge. So nothing about bridge mappings or anything nasty down at the bottom then? Nothing about bridge mappings or anything nasty down at the bottom that you've changed? Oh, yeah, the devil's always in the bottom of those files, not the top. So essentially, yeah, just for demo purposes we made the dynamic VLAN range we could assign kind of restrictive, just for a little bit of testing. And then the VXLAN section with the local IP, it's turned off, so you can ignore that stanza. And then, just because we're using a Linux bridge agent, we're telling it that em1 is our physical NIC. Unbelievably simple. Yes, that's the goal. Unbelievably simple.
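To make that port-local VLAN answer concrete, here is a rough sketch with the same standard Linux tools; the VNIs and the VTEP address are made up:

    # VLAN 100 on swp1 and VLAN 100 on swp2 are independent sub-interfaces,
    # so the same VLAN ID can serve two different tenants on two different ports
    ip link add link swp1 name swp1.100 type vlan id 100
    ip link add link swp2 name swp2.100 type vlan id 100

    # two separate VXLAN segments (VNIs and VTEP address are illustrative)
    ip link add vxlan3001 type vxlan id 3001 local 10.0.0.11
    ip link add vxlan3002 type vxlan id 3002 local 10.0.0.11

    # each port-local VLAN goes into its own bridge and VXLAN; they never get bridged to each other
    brctl addbr br-a && brctl addif br-a swp1.100 && brctl addif br-a vxlan3001
    brctl addbr br-b && brctl addif br-b swp2.100 && brctl addif br-b vxlan3002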
The other piece is that the Linux bridge agent and the switch are using LLDP to discover each other and relay the information back to Neutron, so that the topology is built dynamically. So the first question was about drivers other than the Cumulus drivers. Other than the Cumulus drivers, yeah. So can we do this, for example, with just straight OVS? Hierarchical port binding can, because it's essentially down to the drivers: if you write drivers, are they aware of it? You can connect any number of segments together as long as the driver understands how to support and provision them. In some cases, if you want to make it fully orchestrated, the driver might have to set the bridges up and bridge different segment types together. Yeah. So the specific question is, which other drivers are you aware of that would do that? Oh. Currently, right now in-tree, I'm not aware of any that have merged yet. There might be some in the ecosystem floating around, but when I was writing this, I basically took the original specs and test code and implemented it based on that. So your sense is that this is going to be followed by other vendors as well? I would assume so; this was actually contributed by other people in the community outside of us, who were really driving it. The ML2 team really did a good job of adding this feature in the last cycle. The Cisco drivers are supporting it? Yeah. Sorry, I didn't know they'd merged yet. Didn't catch who that was. Who else is supporting it? That was the Cisco people. Cisco. Cisco; they said their driver supports it. Thanks, Bob.

My question is around hierarchical port binding. Can we use it with the L3 DVR facility? We have a requirement where we need to connect multiple routers to external ports. In theory, yes. I haven't tried it, so there will probably be some minor bumps in the road, but if you're interested in trying it, definitely reach out and I'd love to work with you on that. All right. All righty.

Just wondering if this works with MLAG on a dual-homed server? So today, no. It's being worked on actively as we speak; I'm sure if I checked my email, I would see some comments on that. It needs a capability that we internally call VXLAN active-active, because what you would need is two top-of-rack switches that have an inter-switch link between them and then an MLAG pair down to the servers. And then you need to terminate the VXLAN and bridge it into the MLAG pair, and that last piece doesn't work right now in Cumulus Linux 2.5.2. It is coming in the next release, though. Okay, thanks. All righty. Thank you very much. Thank you.