Hi, everybody. My name is Stephen Hemminger. I work for Microsoft out of Portland, Oregon, and I'm one of the engineers working on Linux infrastructure on Azure. Today I'm going to talk about accelerated networking in Azure, which is something that started out very simple, got horribly complex, and then we tried to make it simple again. I'll also use this as an example of why, as a kernel developer, sometimes user space generally sucks.

To give you the simplest view, this is what the Azure portal looks like. This is the GUI way to create and manage virtual machines and resources, and there are lots of flavors of Linux. We have something like 20 different distros that you can buy at the store. Buy means click, and whether it's Ubuntu, which is free, or other ones that charge you, that's all handled. More importantly, there's just one image. We try not to have different flavors for all the different machine types. You choose a machine type, and you get a really small one, or you get what we call the beast VMs with 128 CPUs, and we don't want to have SKUs for every different machine type. One of the options when you create a VM is something called Accelerated Networking. So you can choose to have accelerated networking or not. If you choose a machine type with eight or more CPUs, and in this case you're in developer mode, you get to choose it; it'll be publicly available soon.

So that looks really simple. On the back end, without accelerated networking, if you have two VMs talking to each other, the traffic goes through a virtual switch on the host side, through a physical switch in the rack, and on to your other VM, which may be on another rack or on the other side of the world; basically all traffic goes through the virtual switches. If you choose accelerated networking, we use SR-IOV and virtual function devices to pass a virtual PCI device through into your VM, so you get a slice of the network card.

Why would we do this? It's very simple: with accelerated networking, you get much better performance. In this graph, the blue line is what the unaccelerated synthetic path looks like. With one connection you're getting about five gigabits a second through, out to about 10K connections. With the orange, accelerated, line you get much better performance. You might wonder why everything seems to level out at 25 gig. Because you're in a virtual environment on Azure, there's a host-side limit programmed into the network card that says we won't give any VF more than 25 gigabits. For testing, we can go in the back door and bump up the cap, just like your ISP can set the values. The reason that's done is that this is a 40 gig network behind the scenes, and if we're hosting two VMs on a machine, we don't want one VM to totally saturate the network. But the point is that with accelerated networking it's fairly easy to get very close to saturating the network, even with two connections, whereas with the synthetic path it takes about 16 connections and the roll-off is much faster. There shouldn't be a roll-off at a very high number of connections, by the way; that's probably a firmware issue on the 40 gig card, which we're working on.

So what does that look like inside Linux? Well, the first version we did, about three or four years ago, predates accelerated networking.
If you do SR-IOV on Windows Server and you run a Linux VM, this is what you get: a synthetic device and a virtual function device, both presented by the hypervisor. What the hypervisor does is deliver the first packet of any connection, and all multicast traffic, on the synthetic device. Then, once it discovers the flow, it moves the flow over to the virtual function, the accelerated path. You're probably familiar with Open vSwitch; its default mode is that until a connection is handled, it sends the first packet to user space. We do this in the hypervisor for the same reason: we need to validate that the traffic matches the rules that have been defined there. When you create the virtual network, you say I want to let port 80 in, or not, and all those rules are enforced in the hypervisor. So basically the first packet gets the rules applied by the hypervisor, and once it knows the flow, it goes through the faster path.

To make these look like one interface in Linux, we used the bonding driver in active-backup mode to join the two devices together and present them up to user space. This has existed for a while, we supported it, and we actually started using it for the accelerated networking testing on Azure. But the problem became this: when you look at configuring networking in user space, we have so many different components, and I'm not going to go through every one of them, that would drive people crazy. But basically we have configuration information coming down from Azure, we have an agent to configure the server, we have cloud-init, we have NetworkManager, or not NetworkManager, on some distros, each distribution has a different set of up-down scripts, and you have udev getting involved. So what we had before was a bonding script, and the shell script would go, oh, I know you're Red Hat, I should go over here and change these files. And of course that kind of worked, with a developer doing it, but then things would get totally upset, because cloud-init on this distro would run and it expected eth0 to be there, so I'd have to play games, and then maybe accelerated networking was there, maybe it wasn't. It comes back to: you buy RHEL in the store and it should just work, whether you turn accelerated networking on or off.

So after sitting through many meetings, with people trying hard and running tests, and hearing that it doesn't work on RHEL 6 and all this, it kept coming down to the people on the Azure side asking the same question: why doesn't it work just like Windows? It just works on Windows. Why doesn't it work on Linux? One can only justify yourself for so long. So at some point I said, look, can't we just make it work like Windows? What would that mean? What it would mean is basically: get rid of the bonding driver completely. Just have the synthetic NIC device, have it talk up the stack, and let it manage the VF device completely by itself. And I made up a name, because we needed a name to describe this, which was transparent VF mode. The initial version of this internally had it as a module parameter and everything, and then we gradually came around to: no, this should just be how it is. It should just work. So what did that mean? Well, it meant that we needed to implement a few more state transitions during the discovery of a network device.
And, in the case of Windows Server, during its removal. Basically, what happens in Azure is that we start out with just the synthetic network device, and then the hypervisor comes along and hot-plugs a PCI device. It says, essentially, I'm giving you a virtual function device and I'm making it appear to be on a PCI bus; here it is. The kernel processes that, picks up the driver it needs to run that device, which in the case of Azure is the Mellanox driver, starts it, and that network device driver registers. When that driver registers, the netvsc device has a callback from the kernel that says, I want to see new devices getting registered. And when the VF gets registered, netvsc has to remember that there's a virtual function device associated with that synthetic device. Right now that's done by MAC address; they both show up with the same MAC address. The yellow ones are the transitions that always existed when we used bonding; the green ones are the new ones we had to add for transparent VF.

So then what netvsc does is say, I want to take all the packets that the VF ever receives; I want it to be my slave. There's a kernel function to do that. It was there for bridging, it was there for bonding; we didn't have to invent anything, and luckily it's been around for a long time, so we can backport to all those old distros. After it has claimed the VF, it looks at the state of the synthetic network device and says, if that was up, I should now bring up this VF device. So it goes and sets the MTU of the VF device and brings it up. When that comes up, the VF device responds with, hey, I'm up, and we then have to tell the hypervisor: okay, we know what this VF device is, we've configured it, you can now switch all the received packets over to it. We still needed to do those steps even with bonding; we're just introducing this piece in the middle.

Now, do you notice there's a gap here? There's a gap just to make udev happy, or to make users of udev happy, which is that if we grab the network device right away and bring it up, you can't rename it, because the Linux kernel right now will not let you rename a device that's up. So we actually have to put a small 100 millisecond delay between when we discover the device and when we actually bring it up, just to let user space do its little happy dance.

So what does this result in? First of all, the really simple thing is that there's always an eth0. Whether you click accelerated networking or not, it's there, and cloud-init doesn't have to change. You don't have to do 16 patches to cloud-init and everything else. The other one is that by putting the delay in, we let all the udev rules that some distros want to run actually run; on other distros there's nothing there. And the point is that sometimes the kernel has to change a little to help out user space.
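To make that mechanism concrete, here is a rough sketch of the shape of it: a netdevice notifier to see new devices register, a MAC-address comparison to pair the VF with the synthetic device, and the existing rx-handler hook to claim the VF's receive path. This is illustrative only, not the actual netvsc code; the kernel APIs used (register_netdevice_notifier, ether_addr_equal, netdev_rx_handler_register) are real, but the names xvf_*, syn_dev and vf_dev are invented for the example.

/*
 * Illustrative sketch only, not the actual netvsc code.
 */
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/etherdevice.h>

static struct net_device *syn_dev;	/* the synthetic (netvsc) device */
static struct net_device *vf_dev;	/* the paired VF, once discovered */

/* Claim every packet the VF receives: make it look like it arrived on
 * the synthetic device and re-run receive processing there. */
static rx_handler_result_t xvf_rx_handler(struct sk_buff **pskb)
{
	(*pskb)->dev = syn_dev;
	return RX_HANDLER_ANOTHER;
}

static int xvf_netdev_event(struct notifier_block *nb,
			    unsigned long event, void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);

	switch (event) {
	case NETDEV_REGISTER:
		/* Pair by MAC address: the VF shows up with the same MAC
		 * as the synthetic device. */
		if (syn_dev && !vf_dev && dev != syn_dev &&
		    ether_addr_equal(dev->dev_addr, syn_dev->dev_addr) &&
		    !netdev_rx_handler_register(dev, xvf_rx_handler, NULL))
			vf_dev = dev;
		break;
	case NETDEV_UNREGISTER:
		/* VF rescinded (hot unplug, migration): fall back to the
		 * synthetic receive path. */
		if (dev == vf_dev) {
			netdev_rx_handler_unregister(dev);
			vf_dev = NULL;
		}
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block xvf_notifier = {
	.notifier_call = xvf_netdev_event,
};

/* In module init: register_netdevice_notifier(&xvf_notifier); */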
So this was a great idea, but we still have some things we'd like to do to make it better. The first one is that delay, waiting just to allow renaming to happen. I want to evaluate, and take back upstream, whether the kernel should allow you to rename devices that are up. It's a restriction that's been around for a long time, and I think the same issue shows up in other places; even Docker containers may want to be able to rename devices after they've brought them up.

The other one is, let me zip back here. You still have two layers of queuing disciplines. When you transmit, you get a queuing discipline on the synthetic device, and then when the packet goes over the VF path it goes through another set of queuing disciplines again. You aren't getting a lot of value out of that, and it's causing a measurable performance penalty compared to native 40 gig networking. The next one is that right now the number of virtual queues the synthetic device presents to the operating system matches the number of CPUs, up to 64. So if you have an eight-CPU system, we give eight transmit queues to the upper layers. The VF device has a richer set of transmit queues because they're in hardware, typically 128, and you can even play games with things like Data Center Bridging where there are different priorities on them. I'd like to pass that through so that the same presentation is made to the kernel whether a VF is there or not. And the same sort of thing applies to offloads. All this high-end networking hardware has a lot of offload features for flow matching and tunneling, and when it gets fronted by the synthetic device, we basically get the intersection of what's available, so much less functionality. But those are all stage-two things, and everything works perfectly fine without them, because both the synthetic device and the VF device already have checksum offload, TSO, and all the common offload features. We're only talking about the more esoteric ones.

This solution was completed in time to make it into the 4.14 kernel. But just as you've seen earlier, the distros, the rest of the world, run at a much slower pace. And in fact, coming back to that button and why I said it's only in developer mode: we're not taking it out of developer mode until we're sure that the distros people are most likely to choose on Azure work right out of the box.

Yes? No, it's actually, yeah, it's there, but it's only in the driver. The driver basically has a delayed work queue that says, oh, I got this event, I know I need to do something later, I'll do it later.

Microsoft also worked with Ubuntu; you've probably seen the releases, but they took a bunch of patches and created what was a 4.11, and is now going to be a 4.13, kernel that they're calling the linux-azure package. It's basically a backport of this and a few other things, so that was the first distro rolling it out. All the other distros, Red Hat and SUSE, are rolling it out as a bug fix in their kernels. And when all those major distributions have it in, by the end of the year, it will be publicly available as a checkbox. Microsoft also has something called Linux Integration Services, which is a set of driver packages taken from upstream and backported to the enterprise distros, because the enterprise distros say, no, we won't take new things, and our customers say, I want the highest-performance and least buggy version of the drivers I can have. So we work hard, test, and backport all that; it's in there as well.

Oh, the other thing I wanted to mention is the developer-mode point. It's not something we magically hide. If you just search for Linux accelerated networking, you'll see the blog post that tells you: do these things to opt in. It's not something that only Microsoft developers can do. By the end of the year it will be available to everyone, and it shouldn't be that hard.
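Since the delayed work queue just came up: the 100 millisecond wait can be done with an ordinary delayed work item, roughly like the sketch below. This is a hedged example, not the driver's real code; the names (xvf_setup_work, VF_TAKEOVER_DELAY_MS and so on) are invented, and the message that tells the host to switch the data path is only indicated by a comment.

/*
 * Sketch of the "wait ~100 ms so user space can rename the VF first"
 * step, using an ordinary delayed work item.  Not the real driver code.
 */
#include <linux/workqueue.h>
#include <linux/rtnetlink.h>
#include <linux/netdevice.h>
#include <linux/jiffies.h>

#define VF_TAKEOVER_DELAY_MS	100

static struct net_device *syn_dev;	/* paired as in the earlier sketch */
static struct net_device *vf_dev;

static struct delayed_work xvf_setup_work;

static void xvf_setup_fn(struct work_struct *work)
{
	rtnl_lock();
	if (syn_dev && vf_dev) {
		/* Mirror the synthetic device's state onto the VF. */
		dev_set_mtu(vf_dev, syn_dev->mtu);
		if (syn_dev->flags & IFF_UP)
			dev_open(vf_dev, NULL);	/* older kernels: dev_open(vf_dev) */
		/* Once the VF reports up, tell the host it may switch the
		 * received packets over to the VF path. */
	}
	rtnl_unlock();
}

/* Called at VF discovery instead of bringing the VF up immediately,
 * which gives udev a window to rename the device while it is down. */
static void xvf_schedule_setup(void)
{
	INIT_DELAYED_WORK(&xvf_setup_work, xvf_setup_fn);
	schedule_delayed_work(&xvf_setup_work,
			      msecs_to_jiffies(VF_TAKEOVER_DELAY_MS));
}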
And with that, I tried to make this short because I knew I was talking right before lunch. Do you have questions? Yes.

Is there a way to turn off the transparent VF mode, so that I can, you know, bypass all of the multiple queuing disciplines? There isn't a way to do it, because you don't get the first packets of a flow there. Even if you had just the raw VF device, you wouldn't actually be able to get it to work; you'd have to go do your own teaming or bonding on top of it.

You've mentioned that this will be enabled with a button probably by the end of the year, and you said you're making sure the distros are going to have this feature integrated before that. But how would you detect if some customer is using some kind of esoteric distro that doesn't have this capability? How would you detect that, or do you just circumvent that with... Well, actually, there is a good answer there. If you don't run the bonding script and you don't have this, you basically don't get accelerated networking, because you never get the switch-data-path message sent from the driver back to the hypervisor saying, yes, I know about it. I'm guessing, because it was way before my time, but I'm sure they had things like old Windows versions where there was no way this was ever going to work. So it's all kind of a handshake.

Over here. Why is there a checkbox for accelerated networking? Why would you want to turn it off? That's actually a very good question, and the answer is changing. His question was, why is it just a checkbox, why is it not always on? It's really because it's a feature they weren't sure everybody wanted to opt into right away. It also requires that the region you're in supports it. In some regions the hardware hasn't been updated, so it's not really available yet. I'll just use an example: if I say I want to put this in China and the Chinese servers haven't been updated yet, it won't show me the box. But on the other side, the plan is to make it just always on by default for every VM, and that is as much because the security is actually higher with the VF than without it: you can't DoS the hypervisor as much, you basically end up DoSing the VM, and we're very sensitive to DoS.

Two questions from a user perspective. Is this setting something that you can change after the machine has been created? That's a good question: can you change this after the VM is created? The answer is that you cannot change it in Azure now; it's part of the properties of the VM when it's created. I suspect no answer like that ever stays fixed; the plan is eventually to let it be turned on. The other part is whether you can get the same effect on Windows Server. On Windows Server, it shows up as an accelerated networking, SR-IOV, checkbox on the VM, and you can actually turn it on and off live, and the VF device appears and disappears as a hot-plug device on the fly. So for testing, it's actually better for me to do it on Windows Server than on Azure. On Azure it's implemented slightly differently: the actual VF support is integrated with some hardware FPGA stuff, so the generation of the flow rules and everything is done by the FPGA. It's very high performance, at much bigger scale than you would see on regular Windows Server hardware.
And the second question related to that is, once this becomes mutable at runtime, who takes care of reconfiguring and switching from one mode to the other? The Azure agent, or? The switching in and out of synthetic mode? Well, actually, the host gives information to the VM about what it wants to happen, or what has happened, and it's up to the guest to do the right thing in reaction. So all actions are really initiated by the host side. For example, if the host wants to migrate a guest, the VF disappears: the host says it went away, it's rescinded. So the guest switches over to the synthetic path, the host migrates you, and then it gives you a new VF once you've come up somewhere else. There are other scenarios, like backup and suspending VMs, where similar kinds of things happen.

I think you already answered this, but is live migration easy with this transparent VF? It's actually good for that, because there's no other way to get VF pass-through and still handle live migration. With live migration, the problem is that if you give a virtual function device, direct access to hardware, to a guest and then you decide to migrate it, there's a period where that device doesn't exist, or exists in a different form, on the machine you migrated to. Because of the way this works (I didn't do a similar kind of chart for it), the host says this device has disappeared, we go oh, okay, an unregister shows up, and we switch back over to the synthetic path. We run slower on the synthetic path during the transition period and then obviously kick back up once the VF comes up on the other side.

The VF hot-plug device, when it shows up on the VM's PCI bus, does it show up as a networking device, or is it something different? It shows up just as if it were a Mellanox card plugged into a hardware slot on the board. Okay. So is there potential for, say, NetworkManager or some other networking setup tool to pick it up and mess with it while that's happening? Yes and no. Yes, it could, but as soon as we see it registered, we take it as a slave device, and basically only one party gets to claim a device as its slave. Right. Is there any potential for a race there? There is a potential, but the kernel callback runs before the user space callback, so we get to win. In fact, I tested this. One of the early things I tested was: if we're in transparent mode and we run the old bond script, who wins and what happens? What happens is the kernel wins and makes it transparent, and the bonding goes, oh well, I don't see this other device so I can't use it, and makes a one-legged bond. Got it. So it still runs, it's just obviously less than optimal. Okay, thanks.

Lennart? Just a quick question. You said the first packet for each flow that comes in goes through the synthetic device, because the host needs to apply whatever security rules you have, or something like that. So it's actually two questions. First, is that also the case for UDP, or do you have some sort of state? It's UDP, everything. In fact, ARP packets always show up on the synthetic path. Okay, interesting. And then what about the first packet of a transmitted flow going out of the VM? You basically get to transmit out of either path, and what we do is, as soon as we've switched over, we always send over the VF path. Gotcha. Now, that may change, because they're talking about supporting full wildcard matches on the host side, so we don't want to build any infrastructure that depends on packets from certain flows arriving on a certain side.
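The transmit side of that is simple enough to sketch. Once a VF is bound and running, the synthetic device's transmit path just retargets the packet onto the VF; otherwise it falls back to the synthetic path. Again, this is a rough sketch rather than the actual netvsc transmit function; xvf_start_xmit, xvf_synthetic_xmit and vf_dev are invented names, and the synthetic-path send is stubbed out.

/*
 * Rough sketch of steering transmits onto a bound VF.  Not the real
 * netvsc transmit path.
 */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static struct net_device *vf_dev;	/* set when a VF is paired */

/* Stub: a real driver would queue the skb to the host over vmbus here. */
static netdev_tx_t xvf_synthetic_xmit(struct sk_buff *skb,
				      struct net_device *ndev)
{
	dev_kfree_skb_any(skb);
	return NETDEV_TX_OK;
}

static netdev_tx_t xvf_start_xmit(struct sk_buff *skb, struct net_device *ndev)
{
	struct net_device *vf = READ_ONCE(vf_dev);

	if (vf && netif_running(vf)) {
		/* Retarget the packet; it then goes through the VF's own
		 * queues and qdisc (the second layer of queuing mentioned
		 * earlier).  dev_queue_xmit() consumes the skb either way. */
		skb->dev = vf;
		dev_queue_xmit(skb);
		return NETDEV_TX_OK;
	}

	/* No VF bound (or it was rescinded): use the synthetic path. */
	return xvf_synthetic_xmit(skb, ndev);
}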
So if I run ethtool now, would I see the hardware transmit queues, or would I still see the virtual number of transmit queues? Basically you see two devices, the synthetic device and the VF device, and you can query each one. I put ethtool stats on the synthetic device saying: these are the numbers of packets I've sent and received over the VF path, and the others are the combined totals. Gotcha. So should one not do, like, transmit-side steering or anything like that? Yeah. Okay.

Let me say this before I forget. The only thing I wanted to say is, you did the 100 millisecond wait thing in the kernel, and you did that just for the renaming of the interfaces. We don't actually want to do that in user space, right? You could just have it name the devices properly in the kernel anyway, based on some policy setting. Yeah, unfortunately these are basically PCI devices, so people want the PCI names. The naming actually has two questions. One is, do we want to name the synthetic devices something besides the eth0-style device name? That's the classic question of what's the good choice versus how much you want to upset user space tools that already have names buried in their brain; I know cloud-init has eth0 buried in its brain. The other one is the PCI devices: do we want to name them the same as if they were showing up on a real system, or do something like a udev rule? But I mean, just in general, we have no interest in carrying this stuff in udev, and you could take the naming of the interface to the kernel side. All you need is some knob, and we can easily put policy in the kernel to do that, because, like I said, we know the pairing between the synthetic device and the VFs; they both know about each other. I mean, even completely without any kind of VF or anything. Yeah. I mean, if the kernel names everything the right way anyway, then we have to do less in user space, and there's no reason to play ping-pong like that. Yeah, that probably makes sense. You may have it. Okay, thank you.
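As a footnote to the ethtool stats point above: exposing per-path counters on the synthetic device is ordinary ethtool_ops plumbing. The sketch below is illustrative only; the struct, field and string names are invented, not the driver's actual counters.

/*
 * Sketch of exposing per-path counters via ethtool_ops.  Invented names.
 */
#include <linux/kernel.h>
#include <linux/ethtool.h>
#include <linux/netdevice.h>
#include <linux/string.h>

struct xvf_stats {
	u64 tx_packets;		/* combined totals, both paths */
	u64 rx_packets;
	u64 vf_tx_packets;	/* portion that went over the VF path */
	u64 vf_rx_packets;
};

static const char xvf_stat_strings[][ETH_GSTRING_LEN] = {
	"tx_packets", "rx_packets", "vf_tx_packets", "vf_rx_packets",
};

static int xvf_get_sset_count(struct net_device *dev, int sset)
{
	return sset == ETH_SS_STATS ? ARRAY_SIZE(xvf_stat_strings) : -EOPNOTSUPP;
}

static void xvf_get_strings(struct net_device *dev, u32 sset, u8 *data)
{
	if (sset == ETH_SS_STATS)
		memcpy(data, xvf_stat_strings, sizeof(xvf_stat_strings));
}

static void xvf_get_ethtool_stats(struct net_device *dev,
				  struct ethtool_stats *stats, u64 *data)
{
	/* Assumes the device's private data is a struct xvf_stats. */
	const struct xvf_stats *s = netdev_priv(dev);

	data[0] = s->tx_packets;
	data[1] = s->rx_packets;
	data[2] = s->vf_tx_packets;
	data[3] = s->vf_rx_packets;
}

static const struct ethtool_ops xvf_ethtool_ops = {
	.get_sset_count    = xvf_get_sset_count,
	.get_strings       = xvf_get_strings,
	.get_ethtool_stats = xvf_get_ethtool_stats,
};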