Good afternoon, everyone. My name's Scott Drennan. The first session after lunch is always challenging, so hopefully I'll keep at least some of you awake. I'm going to be talking about how to use Neutron to combine SRIOV, Ironic, and DPDK workloads and provide seamless high-performance networking across them, which you'd think would be trivial, but it turns out it isn't. So getting started, what I'm going to talk about, first of all, is why this is an issue in the first place, and then the topology that we're working through and a couple of different approaches that we considered. Then I'll demo what we did, and finally, once the demo is over, talk about what the next steps are, both on our side and within the broader OpenStack project.

So first of all, what are we trying to do? We've got some OpenStack workloads which have networking requirements but don't really care about high-performance networking. They don't really care about high rates of small packets, they don't really care about 40 gig throughput, any of this sort of thing. But there are some that do, and there are a bunch of different ways to achieve this, and depending on the workload, each workload may choose a different way. Some workloads may still need to be bare metal, but you want to manage them through OpenStack. Some workloads may mandate using an SRIOV virtual function, in which case they need a direct network attachment. And some workloads are using DPDK so that they can use virtual networking: maybe using DPDK in the workload itself with a huge-page connection into DPDK on the hypervisor, or maybe just using DPDK as the hypervisor networking stack.

And like I said, you'd think this would be easy, but it turns out that it is actually fairly difficult to get all of these working and playing nicely together in this heterogeneous environment and let them talk to each other. It's easy enough if you have your bare metal workloads over here, your SRIOV workloads over here, and your DPDK workloads over there, and you go out to the edge of your data center in order to connect between them. And for some use cases, that's OK. But if you actually want all of these things operating behind the same router, or better yet in the same network, within Neutron, that's kind of tricky.

So what are we talking about here? We've got your standard Clos fabric using VXLAN, with some top of rack switches and some core switches. We've got SRIOV and/or DPDK workloads that are using VLANs or VXLAN to get out. We've got Ironic workloads that are connected into the top of rack using ports and VLANs. And then we've got traditional VMs that are using VXLAN. So what that means in Neutron is you've got multiple segmentation types in a common network topology.

So just to provide a bit of background, SRIOV is fairly common for high-performance workloads that aren't yet DPDK-enabled but where you want to have flexible networking for them. For those who aren't familiar with SRIOV, it's a mechanism where you're basically sharing the physical NIC. You're not providing a virtual NIC; what you pass into the VM is called a virtual function, but to the VM it looks like an actual physical NIC with a PCI address and all that sort of good stuff, rather than your standard virtio or vhost. SRIOV has been part of OpenStack for a long time, but it assumes a provider network model, which assumes consistent VLAN-to-network mappings, and it assumes that those mappings are pre-provisioned.
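To make that pre-provisioned provider network model concrete, here's a minimal sketch, assuming openstacksdk and placeholder names like "mycloud", "physnet_sriov", and VLAN 1234 (none of these come from the demo environment), of creating a provider VLAN network and an SRIOV "direct" port on it.

```python
# Minimal, hypothetical sketch of the pre-provisioned provider network
# model that SRIOV assumes. "mycloud", "physnet_sriov", and VLAN 1234 are
# placeholders, not values from the demo environment.
import openstack

conn = openstack.connect(cloud="mycloud")  # assumes a clouds.yaml entry

# The VLAN-to-network mapping is fixed up front by the operator; this is
# the pre-provisioning constraint mentioned above.
net = conn.network.create_network(
    name="sriov-net",
    provider_network_type="vlan",
    provider_physical_network="physnet_sriov",
    provider_segmentation_id=1234,
)
subnet = conn.network.create_subnet(
    network_id=net.id, ip_version=4, cidr="192.0.2.0/24")

# A "direct" vNIC type asks Nova/Neutron for an SRIOV virtual function
# instead of a virtio/vhost interface when a server boots with this port.
port = conn.network.create_port(
    network_id=net.id,
    binding_vnic_type="direct",
)
print(port.id)  # hand this port to Nova at boot time
```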
We'll talk more about those pre-provisioned mappings in a moment. Ironic is used for bare metal workloads in OpenStack. As of Mitaka, all bare metal workloads in the standard release must be on the same network, and that's been a pain point for a couple of releases now. We showed something proprietary back in Kilo in Vancouver demonstrating how to do this, and we've been watching and trying to help this get in upstream. In Newton it now looks like it's really going to happen for one VLAN per bare metal node, and if the VLAN-aware VMs spec also gets merged, then we may even be able to do multiple VLANs, which would be really nice. But again, with bare metal you want to have nodes on multiple networks, and in some instances you want them on the same networks as your SRIOV workloads.

For DPDK, in most cases, if you're using DPDK you're looking to do more sophisticated networking in the hypervisor itself, whereas SRIOV and Ironic present, in most cases, a very basic networking model. That means things like NSH and security rules in the hypervisor, managed by the DPDK driver. And in this case, we're looking at VXLAN between the compute host and the ToR. So what that means is yet another binding model.

So what do we need to do? From a configuration perspective, we need to be able to map VLANs to overlay VXLANs in every rack, locally and dynamically. We need to align the SRIOV and Ironic workflows, because those are very, very similar. We need to be able to connect DPDK compute workloads into the same network. And from a performance perspective, you care about performance, so you don't want a network node sitting there performing your routing between networks for you. You need something that will actually do routing between networks at wire speed, so that you're not losing any of the performance you've gained by doing all this work with SRIOV and DPDK.

So what does that mean? On the top of rack switch, you need some way to tell the switch that this port belongs to this hypervisor, so that you're provisioning the right things on the right switches. And you need, obviously, the VLANs to match between the top of rack and the hypervisor; if they don't, then things don't work. And you want to be able to permit a mix of VLAN and VXLAN. So that's sort of setting the stage for what we need to do.

There's an obvious, OK, maybe not obvious, a fairly obvious approach to this that has existed for a couple of releases now in the form of ML2 hierarchical port binding, which is, at least apparently, designed to do precisely this: to have a VLAN mapping between your compute node and your top of rack, and a VXLAN mapping between your top of rack and the rest of your data center fabric. Unfortunately, there are a few limitations with that. There are limitations around static segment creation, and with the way that SRIOV does its binding, there's no way to pass the top of rack port around the same way that you do in Ironic. And you can't do dynamic segment creation. So that's probably an approach for the future.

The approach that we've taken today is to align the SRIOV configuration with the way that Ironic is doing things. Ironic provides local link information on the port binding, and I'll show you that a little bit later, whereas SRIOV doesn't: if you try to provide local link information in the binding profile, the SRIOV driver actually overwrites it with all of its PCI details. So that's not helpful.
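Here's a rough, illustrative sketch of that difference, not tied to any one driver; the field names follow the usual Neutron binding conventions for Ironic and SRIOV ports, and all the values are made up.

```python
# Illustrative comparison of the binding:profile contents in the two cases
# discussed above. Field names follow the usual Neutron conventions for
# Ironic (baremetal) and SRIOV ports; all values here are made up.

# Ironic-style port: local_link_information tells Neutron exactly which
# ToR switch and physical switch port the bare metal NIC is cabled to.
ironic_binding_profile = {
    "local_link_information": [{
        "switch_id": "aa:bb:cc:dd:ee:ff",  # ToR switch identifier (MAC)
        "port_id": "Ethernet1/12",         # physical port on the switch
        "switch_info": "tor-rack-3",       # free-form switch name
    }]
}

# SRIOV-style port: the SRIOV mechanism driver fills binding:profile with
# the PCI details of the allocated virtual function instead, so any local
# link information placed there gets overwritten.
sriov_binding_profile = {
    "pci_vendor_info": "8086:10ed",
    "pci_slot": "0000:05:10.1",
    "physical_network": "physnet_sriov",
}
```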
So two approaches. One is to update Nova's SRIOV handling to allow local link information, which would require patching Nova. The cheaper and easier approach is, within your plugin, to do the port-to-hypervisor mapping in Neutron, so that you can get that information back even though it's not being passed around between Nova and Neutron. So that's what we've got today. In future, with the work that's going on around VLAN-aware VMs, with luck this will become simpler. VLAN-aware VMs, for those who aren't familiar with it, is a way to expose multiple VLANs to a workload. Originally, per the name, that was aimed at VM workloads, but some of our participation in that was to make sure it also applied to non-VM cases like Ironic, and to SRIOV, which I guess is VMs, but non-traditional VM cases anyway. So we're going to be doing some work over the next cycle to turn that into code, and hopefully it'll land in Neutron.

So enough of me talking and showing slides; what does this actually look like? This will take a moment or two, because I suppose it would be good if I actually started the beginning of the demo, and it takes a moment or two for the screen to recalibrate. Excellent. Rather than excite you all with watching an Ironic instance boot for five minutes, I've recorded this so that we don't actually have to watch paint dry.

So, a standard Horizon dashboard. We've got three hosts: one for SRIOV and traditional VMs, one for DPDK, and then we've got an Ironic bare metal node. This is sort of the minimum viable demo. We're starting out with no instances, and we're starting out with two networks, a tenant network and an Ironic provisioning network. For those not that familiar with it, the Ironic provisioning network is a mandatory component that lets the nodes get their images over a secure channel before the bare metal workload is handed off to a tenant. You don't really want your tenants able to go and download images from an image server or go and harass other people's instances, so that separation is good.

So here we can see that we've got an Ironic node that is powered off and available, and we'll shortly be able to see that we've got a grand total of one port with a MAC address. Then I'm doing a port show, and what you can see here is, well, first of all, that I'm specifying an Ironic API microversion because the client defaults to an older one, but more importantly that we've got a local link connection with switch information, port IDs, and switch IDs, so that Neutron can figure out what attachments it needs to make in order to correctly configure that bare metal instance. And here's me highlighting it. That's very exciting.

So switching back over here, again, rather than configuring a whole bunch of stuff manually, we're going to use Heat to provision all of these instances, and we'll launch a creatively named demo stack here. That is a pretty standard stack. All we're doing is launching four instances into one network. We've got a bare metal node, we've got a VM, we're going to use a CentOS image, and we're going to put it on our tenant network. And just before we look at that, this is probably fairly small on screen, but you can see in here the Heat stack itself. We're launching a bare metal instance with the bare metal flavor, we're launching a normal VM, we're creating a direct port in order to use it for SRIOV, and then we're launching a DPDK VM. And we're setting different availability zones so that they go onto the different servers that are configured differently. So that's that excitement.
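For those who prefer API calls to Heat templates, here is a rough, hypothetical equivalent of what that stack does, sketched with openstacksdk; the cloud, flavor, image, network, and availability zone names are placeholders, not the values from the demo.

```python
# Rough, hypothetical equivalent of the demo Heat stack using openstacksdk.
# Cloud, flavor, image, network, and availability zone names below are
# placeholders for whatever exists in your environment.
import openstack

conn = openstack.connect(cloud="demo")

net = conn.network.find_network("tenant-net")
image = conn.compute.find_image("centos")
bm_flavor = conn.compute.find_flavor("baremetal")
vm_flavor = conn.compute.find_flavor("m1.small")

# SRIOV needs a pre-created port with vNIC type "direct"; Nova allocates
# a virtual function for it when the server is scheduled.
sriov_port = conn.network.create_port(
    network_id=net.id, binding_vnic_type="direct")

servers = [
    # (name, flavor, availability zone, networks argument)
    ("bm-1",    bm_flavor, "az-ironic", [{"uuid": net.id}]),
    ("vm-1",    vm_flavor, "az-virtio", [{"uuid": net.id}]),
    ("sriov-1", vm_flavor, "az-sriov",  [{"port": sriov_port.id}]),
    ("dpdk-1",  vm_flavor, "az-dpdk",   [{"uuid": net.id}]),
]

for name, flavor, az, nics in servers:
    conn.compute.create_server(
        name=name,
        flavor_id=flavor.id,
        image_id=image.id,
        availability_zone=az,
        networks=nics,
    )
```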
And if we can now go over and take a look at our compute instances, we'll see that we've got a bunch of stuff spinning up. We've got our normal and DPDK VMs that come up pretty much instantaneously, SRIOV takes just a little bit longer to provision all of the bits, and bare metal, of course, takes longer. So switching over here to the Nuage VSP Architect, you can see that you've got your two VMs, you've got your host interface, and you've got another sort of blank port sitting here, and that's just waiting for the Ironic instance to appear. And we can see that the Ironic provisioning network over here has the Ironic conductor attached and is basically provisioning the node, PXE-booting it, and launching things.

So while we're waiting, we can do a neutron port list, and we can see we've got actually five interfaces, including two with the same MAC address. That's because you've got your Ironic provisioning network as well as the tenant network. And you can see that here's our unbound port at the moment; there's a small sketch below of how you'd pull that same port list from the API. And we can see that bare metal, I didn't speed it up, I didn't speed it up enough, I think. But anyway, here we have the bare metal. It launched into the Ironic Python Agent, and so it should be just about ready to come up. And there it is. It's gone away from the Ironic conductor, and it should exist over here now. And there we are. So now we have all of our interfaces plumbed into the same network.

And here we're back on our system, me checking IP addresses. We can see that all of them are up, all of them are on the same network, and they've got nice consecutive IP addresses. So this is certainly what we were attempting to accomplish. And now, for the exciting high-performance validation of connectivity between these four instances, I'm going to use ping, because that is the traditional way to validate things in OpenStack. Perhaps I should have brought up some of the performance testing tools, but unfortunately, when I was recording this, I didn't have anything readily available, so ping it is. So we are successfully pinging our SRIOV instance, our traditional VM, and our DPDK VM. They are all plumbed and all connected.

So I'm switching back over to slides. Where does this get us to? We've got, oh, I forgot something. Oops, my bad. Yes, so going back over here, the key point with all of this is that you can actually put multiple different heterogeneous workload types all in the same network, all in the same broadcast domain, which allows you to do networking that OpenStack doesn't necessarily have ways of managing, and lets you do things like protocols that are neither IPv4 nor IPv6, and other things like that. So that's one of the key use cases here, and for NFV applications, that's important. Switching back again.

So what does this do for us? It gets us our 10 gig per core with DPDK, or better as DPDK continues to evolve and improve. It gets us whatever our hardware and associated NICs are capable of with SRIOV, and the same thing with bare metal, except more so. So that gives you the flexibility of moving all of this around. And because we're doing this with top of rack switches which are VXLAN and VLAN capable, and which can also do distributed routing as well as distributed switching, that means that you can also do direct routing without having to go out to a big router at the edge of your data center. Much more efficient, no tromboning. We support this with our Nuage 7850s, and we're in the process of making this work with a number of our third-party top of rack partners.
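Coming back to the port list step in the demo for a moment: here's a minimal sketch, again assuming openstacksdk and nothing from the demo environment, of how you could pull the same port list from the API and spot the pair of ports sharing a MAC address (the bare metal node's provisioning port and its tenant port), along with any port that is still unbound.

```python
# Hypothetical sketch: list Neutron ports and group them by MAC address.
# The Ironic provisioning/tenant pair for the bare metal node shows up as
# two ports with the same MAC on different networks; an unbound port
# reports binding:vif_type "unbound".
from collections import defaultdict

import openstack

conn = openstack.connect(cloud="demo")

by_mac = defaultdict(list)
for port in conn.network.ports():
    by_mac[port.mac_address].append(port)

for mac, ports in by_mac.items():
    print(mac, [(p.network_id, p.binding_vif_type) for p in ports])
```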
What all of that means is that you can freely combine hardware-attached and software-attached workloads. So that's where we are today. What do we need to do to move this further forward? The key points are that we need to align with Ironic and Neutron as that work evolves, and for those of you who were in the Ironic and Neutron session yesterday, there's some good progress on that; it looks like we'll be in good shape in Newton to actually make this commonly available without any patches. As for the SRIOV changes to Nova to clean up the way it's using port binding, we're planning on submitting those shortly. And the VLAN-aware VMs work is really critical to bring this to the next level, which is not just one VLAN or an untagged interface going to each of the workloads, but multiple VLANs associated with multiple networks in Neutron. So that's where we are, and that's where we're going. So thank you very much, and I would be happy to take questions at this point. If you have a question, the folks who are going to be watching this on YouTube would appreciate it if you actually came up to the mic to ask your question, rather than asking from the floor.

You mentioned that using VXLAN, you're removing the network node from the path and achieving higher data throughput for SRIOV or bare metal provisioning. How are you overcoming the VXLAN hairpinning that gets introduced, because devices in a rack which doesn't have the default gateway on it need to go back to the default gateway's rack and then egress?

So the default gateway, at least in our implementation, actually lives everywhere; the default gateway is resolved at the top of rack for each rack. So obviously, if you're doing a bare metal or an SRIOV workload, you need to go out of the host, but you only need to go as far as the top of rack. And then if you have another workload that is on another hypervisor in the same rack, it can get turned around immediately. So routing occurs in each top of rack.

Do you mind sharing the details on how you are doing that? Is that something proprietary to the implementation, or is that an industry standard?

So with the Nuage 7850, we're using a Broadcom Trident 2, which inherently doesn't support this, but there is some trickery that you can do in order to make it work, and we've had that working for a couple of years now. I wasn't actually planning on talking about that because we've talked about it elsewhere, but if you're interested in more details, certainly swing by the Nokia booth or come talk to me afterwards. And we're also working with some of our third-party switch vendors to help them do the same thing. Thank you.

Hi. How did you end up with the 10 gigabit per second per core in an environment where you have multi-socket, multi-core, and hyper-threading enabled? Can you elaborate on that number?

How did we end up at that number? For DPDK, that is somewhat of an arbitrary number. You can make that number bigger or you can make it smaller depending on, yes, multiple sockets, NUMA, packet sizes, and what you're actually doing with the traffic. But at least in our testing, we see that 10 gigabits per core allocated to DPDK on the hypervisor is a reasonable number.

What ML2 plugins do you have enabled in this demo? And if you do upstream your changes and they get accepted, what ML2 plugins do you expect to have or need enabled when the code gets upstream?
So as for the plugins that are enabled in this demo: first of all, this doesn't necessarily require ML2. We did, in fact, use ML2 in order to use the ML2 SRIOV driver, and we are using the Nuage ML2 driver; when I was alluding to adding things to Neutron, that was within the Nuage mechanism driver. So in order to do this upstream, you would need a mechanism driver that is also able to manage the top of rack, and a mechanism driver that is able to handle your DPDK and traditional VM workloads. The parts of this that are generic will make it easier for anybody to do this, but you do still need a mechanism driver that supports it, and you need a mechanism driver that can drive top of racks that have the capabilities that ours have, or something similar. But there's nothing super magical about that; we just make it much easier. One last question.

You mentioned that you have the SRIOV workloads and then the Ironic workloads. Maybe I'm not so familiar with DPDK, but what aspect of SRIOV adds to the complexity? Because it just maps the NIC directly into the virtual machine to provide better throughput and those kinds of things. Are there other complexities there?

So for these purposes, Ironic and SRIOV are very similar, but the way the Ironic driver is written makes this easier to do than the way the SRIOV driver is written. So if we make some changes to the SRIOV driver, then we can make it equally easy. But no, there's nothing inherent to SRIOV itself that's complex. Anything else? Oh, one more question.

You said an Ironic instance has two ports, one on the tenant network and one on the Ironic provisioning network. So do these two ports actually connect to the same port on the ToR switch, or do they connect to different ones?

So there are two logical ports in Neutron, one on each network. But physically, there is one port on the bare metal instance, there is one port on the top of rack, and there is one wire connecting them. Logically, you are reconfiguring to move that port from the provisioning network to the tenant network; at any given moment, one port is active and the other is blocked. Yeah, as Vlad said, and he's the one who did most of the work on this, there is only one port active at any given time. OK, OK, thank you.

All right. So thank you, everybody; on your way out, if you don't rush past too quickly, I'm told there are t-shirts available. Thank you for listening to me rather than falling asleep after lunch. Thank you.