My name is Koko Shadal. I work for Red Hat, and this talk is about the addition of AF_XDP as a datapath for Open vSwitch. I'm not really going into the details of what AF_XDP is and does, but in short it's an enhancement that allows packets to be delivered directly to user space, similar to DPDK. The advantage is that it's a little more flexible: you can decide that packets should be redirected straight to user space, or that they should stay in the kernel and get normal processing. The main advantage from an OVS perspective is that it's a purely user-space solution, so you don't have to worry about updating a kernel module, which makes it easier to do a quick change or to try experimental things you otherwise wouldn't.

About a year ago, at the OVS conference, there was a presentation about AF_XDP and how it was actually integrated into OVS. If you would like to see the technical details of the implementation, that's a good presentation to watch; I put the reference there. I'll skip this slide.

But the main question always is: is AF_XDP faster than DPDK, or faster than the kernel, or vice versa? A lot of people ask that question, and there's not much data out there that shows the differences. If you talk to the AF_XDP people, they say AF_XDP is as fast or faster; if you talk to the DPDK people, they say they are always faster. So what we tried to do is an apples-to-apples comparison between the three different datapaths and see what the differences are.

(Can you speak up a little bit? The people in the back can probably barely hear you.) Okay, I'll try to talk a little bit louder, not faster.

So what we did is use the ovs_perf test suite. There was a presentation from me about it last year, I think, also here at the OVS conference. It's a PVP test: we send packets to a virtual machine through a physical port, back out on the same physical port, and measure the performance of that. We also test physical-to-physical, just to make it easier, because then you don't have the overhead of the virtual machine, so it might be easier to spot the deltas. We use an OpenFlow rule with the NORMAL action, which means just bridging between the ingress and egress ports of the virtual machine. We try a couple of different packet sizes and a couple of different numbers of flows, where the flows are based on five-tuple variations (a short sketch of what such variations look like follows below). There are no latency tests in this setup, so there is no latency comparison between DPDK, AF_XDP and the normal kernel datapath.

What will we compare? There are a couple of different flavors. There is the pure kernel datapath, the kernel module for OVS that does all the datapath processing. There is AF_XDP with a tap interface into the virtual machine, because that's the default case from a kernel perspective. Then we also try AF_XDP with the vhost implementation from DPDK, just to see where the bottleneck is. Then, of course, what most people would like to see: AF_XDP versus the DPDK implementation of the datapath. And there is also an AF_XDP PMD in DPDK, so we compare what the difference is if we use that PMD versus the native AF_XDP implementation, and surprisingly there are some differences there as well.
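To make the flow setup mentioned above a bit more concrete, here is a minimal Python sketch of what five-tuple variations look like. It is not part of the ovs_perf suite, and the address ranges and ports are purely illustrative assumptions, not the values used in the actual tests.

```python
# Minimal sketch: generate n distinct UDP five-tuples (src IP, dst IP,
# src port, dst port, protocol) of the kind a traffic generator replays
# in such a test. The networks and base port are assumptions made purely
# for illustration.
import ipaddress

def five_tuples(n, src_net="10.0.0.0/16", dst_net="10.1.0.0/16", base_port=1024):
    src_hosts = list(ipaddress.ip_network(src_net).hosts())
    dst_hosts = list(ipaddress.ip_network(dst_net).hosts())
    flows = []
    for i in range(n):
        flows.append((
            str(src_hosts[i % len(src_hosts)]),   # source IP
            str(dst_hosts[i % len(dst_hosts)]),   # destination IP
            base_port + (i % 60000),              # source port varies per flow
            base_port,                            # fixed destination port
            17,                                   # protocol number: UDP
        ))
    return flows

# For example, the 100-flow case used later in the comparison:
print(five_tuples(100)[:3])
```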
So, some of the results. The numbers are probably a bit small for the people in the back to read, but I'll walk through them.

This is the kernel datapath. The different colors are the number of flows, and you can see that the more flows we have, the more performance we get: from about one million packets per second for a single flow up to eight million packets per second on the reference setup I used for the test. But look at the CPU utilization: the CPU usage differs because the kernel's queuing mechanism is used. With one flow you use roughly one core of your system, and as you increase the number of flows you start to use more CPU power, because the load gets distributed over multiple cores in your system, and the performance picks up. So it looks nice, you can do eight million packets per second on the kernel datapath, but you have to take into account that it's using about eight to nine cores in total just for packet processing. That's what I tried to show here. This is the physical port in and the same physical port out, so just looping it back through the system.

This is the same thing with the virtual machine in the picture, so packets go to the virtual machine and back out of the physical port. Here you see that a single flow is a bit faster, and I think that has to do with caching, but I haven't figured out the details yet; that's still something I would like to understand. You get about 0.6 million packets per second, but you see the same CPU usage, it even gets higher, because the virtual machine also needs some extra CPU power: you basically have two interfaces to service in this scenario.

This is AF_XDP, the same test but with AF_XDP as your datapath, physical port to physical port. Here you see that a single flow is faster, roughly around 3.5 million packets per second. We do a direct comparison between the two in the next slides, but you can see that for all flow counts the CPU usage is fairly stable. You can see the PMD thread at roughly 100%, which in the user-space datapath is always polling the ports, the same as on the DPDK side, but you use additional CPU because the kernel also needs to poll the XDP side, so the driver, and your total CPU utilization is roughly double in this scenario. But at least it does not depend on the number of flows.

For the PVP test, with AF_XDP and the tap interface to the virtual machine, you see the same pattern. You get about 0.8 million packets per second, and the flow counts behave roughly the same. The CPU utilization is higher: the PMD thread is at about 100%, then you get 100% for the virtual machine polling its queue, 100% for polling the hardware NIC as well, and then some overhead from QEMU, which is doing the interrupt handling.
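Because the interesting number here is really throughput per core rather than raw throughput, a small calculation helps. The figures below are the approximate values quoted above, read off the graphs, so treat them as ballpark figures only.

```python
# Rough throughput-per-core comparison using the approximate numbers quoted
# in this talk. Values are read off the graphs and are ballpark figures only.
results = {
    # name: (throughput in Mpps, cores busy with packet processing)
    "kernel, phy-to-phy, 100 flows": (8.0, 8.5),  # "eight to nine cores" taken as 8.5
    "AF_XDP, phy-to-phy, 1 flow":    (3.5, 2.0),  # PMD thread + kernel XDP polling, roughly double
}

for name, (mpps, cores) in results.items():
    print(f"{name}: {mpps / cores:.2f} Mpps per core "
          f"({mpps} Mpps on ~{cores} cores)")
```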
Out of all these tests, with the different frame sizes, I picked one so we can compare the results, rather than trying to compare all possible frame sizes and all possible flow counts. We took the middle case, 100 streams with 64-byte packets, because small 64-byte packets are what most people like to use as a reference. Also, in that scenario we're not filling the pipe; if we took a bigger packet we would fill the pipe and we wouldn't really see where the bottleneck is when comparing them. And we use the PVP test because it's a more realistic scenario, where you actually go through a virtual machine and back out.

So, doing that for AF_XDP versus the kernel. I have to apologize that the scales are not the same on both graphs, so you cannot compare them one to one, but from a performance perspective you get about 0.8 million packets per second on AF_XDP versus roughly 0.6 million packets per second on the kernel, so the kernel is a bit slower. But you also have to take the CPU utilization into account. You can see here what is used for the PMD, which is always 100% on the AF_XDP side because it's constantly polling for packets. The virtual machine, the guest, of course also uses 100%, and then you can see the remaining CPU used in this scenario. I can't really read it there, but I can see it here: the system uses about 200% of additional CPU power just to poll the two XDP drivers, plus a little bit of overhead from QEMU doing the interrupt handling.

So the conclusion is that AF_XDP uses a lot less CPU power; the kernel graph is not on the same scale, but the kernel is still using almost 10 additional cores just to process the packets. So AF_XDP uses way less CPU, there is more throughput, and there is no kernel module dependency. The bad parts are that it's not feature complete: compared to DPDK, the user-space datapath and the kernel, some features are missing, including egress QoS policing. And there is a design limitation, which also exists for DPDK: if you send traffic in from a kernel interface into the DPDK or AF_XDP datapath, it needs an extra hop into user space, so those packets take longer to process.

This is the same data for DPDK, and I'm going to go through it a bit quickly. You get roughly 10 million packets per second on the physical port; for the PVP test it's roughly 2.5 to 2.8 million packets per second.
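To put these numbers next to the theoretical maximum, and to back up the earlier point that 64-byte packets don't fill the pipe, here is a quick line-rate calculation. It assumes a 10 Gbps link, which the talk does not actually state, so treat it purely as an illustration of the reasoning.

```python
# Theoretical Ethernet line rate for a given frame size, assuming a 10 Gbps
# link (assumption: the NIC speed is not stated in the talk).
def line_rate_mpps(frame_bytes, link_gbps=10):
    # 7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap
    # add 20 bytes of per-frame overhead on the wire.
    wire_bits = (frame_bytes + 20) * 8
    return link_gbps * 1e9 / wire_bits / 1e6

print(f"64-byte frames:   {line_rate_mpps(64):.2f} Mpps line rate")
print(f"1518-byte frames: {line_rate_mpps(1518):.2f} Mpps line rate")
# Even the ~10 Mpps DPDK reaches on the physical port stays below the ~14.88 Mpps
# 64-byte line rate, so the measured differences reflect CPU limits, not the link.
```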
DPDK uses vhost-user as the interface to the virtual machine, and because DPDK vhost-user and AF_XDP both live in the OVS user-space datapath, they can be mixed. So in this scenario the DPDK vhost-user interface is used to loop the packets back, and you get better performance: roughly 1.6 million packets per second on average for 100 flows. So if you compare them, and this is not DPDK versus AF_XDP yet, this is AF_XDP with a tap interface to the virtual machine versus AF_XDP with a vhost-user interface to the virtual machine, you can see the difference: about 0.8 versus 1.6 million packets per second, so roughly double the speed just from using a different interface into the virtual machine. From a CPU usage perspective it's roughly the same, except that you now only have one XDP interface; the CPU usage for the vhost part is completely contained in the PMD thread's CPU utilization, so you only have the additional XDP polling and no separate vhost cost.

So the conclusion: vhost-user uses less CPU power, the throughput roughly doubles, and CPU usage stays constant if you increase the number of virtual machines in that scenario. If you add more XDP physical interfaces your CPU utilization will of course increase, but not by that much. The downside is that you need to set up the DPDK infrastructure, so you need to configure huge pages and the DPDK memory assignment as well.

I think this is the one people would like to see most: AF_XDP versus DPDK. What is the performance difference there? As you can see, it's about 1.6 million packets per second for AF_XDP through a virtual machine versus about 2.75 million for DPDK, so DPDK is roughly 1.7 times faster in that scenario. So DPDK is still faster in a normal use case. Its main advantage on the CPU side is that you use less CPU power and you know exactly what you get: whatever you reserve for the PMD threads is what you use. If you reserve one PMD thread, your CPU utilization will be at most 100%, independent of how many ports you add, whereas with AF_XDP you add the extra CPU usage for polling the XDP driver, so you use roughly double the CPU power there. The downsides of DPDK, as I said, are that you need to set up DPDK, you depend on PMD drivers that might not be as up to date as your kernel drivers, and you cannot use XDP steering, so you cannot load an XDP program that says: some of these packets should be handled by the kernel and the rest should go to OVS. I'm going a bit faster because we only have five minutes left.

This is the same thing for AF_XDP, but using the DPDK AF_XDP PMD instead of the native implementation. I'm just going to click through it so people can look at it. The interesting thing is that the AF_XDP PMD is a little bit faster, where you would expect it the other way around. The reason is that the AF_XDP PMD reuses the buffer: when you forward something from one port to another, the same mbuf is used for transmission, so it isn't copied. In the current native implementation, when you forward something to a different egress interface, the entire buffer is copied, so there is a buffer copy for every packet transmitted.
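As a rough analogy for why that per-packet copy matters, here is a small Python micro-benchmark comparing handing the same buffer onwards versus allocating and copying a new one for every packet. It is obviously not the actual OVS or DPDK code, which manages mbufs and UMEM frames in C; only the relative gap is the point.

```python
# Toy illustration of buffer reuse versus a copy per forwarded packet.
# This is an analogy only; the real datapath does this with mbufs/UMEM in C.
import timeit

pkt = bytes(64)  # a 64-byte packet payload

def forward_by_reference(p):
    return p                 # hand the same buffer to the egress port

def forward_with_copy(p):
    return bytearray(p)      # allocate a new buffer and copy the payload

n = 1_000_000
t_reuse = timeit.timeit(lambda: forward_by_reference(pkt), number=n)
t_copy = timeit.timeit(lambda: forward_with_copy(pkt), number=n)
print(f"reuse buffer: {t_reuse:.2f}s, copy per packet: {t_copy:.2f}s for {n} packets")
```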
That buffer copy is one of my future work items, one of the things still being worked on: having a shared UMEM between the different interfaces. And there is some kernel work going on to get native zero-copy support there as well, so that we can bypass the extra copies too. Some people were also thinking about having our own vhost library for AF_XDP rather than using the one from DPDK; I'm not really a fan of that, but I think someone is researching it. There is currently no egress QoS support; we could probably reuse the DPDK library to support that, but we'll have to see, because that would require linking DPDK into the project, which you might not want. Then there is some other stuff; I think the lower two items are already in the master branch at the moment. The other thing we need to look at is CI testing of the whole integration.

So, the conclusion. We didn't compare latency, and we also didn't do any multi-queue in these examples to see if that speeds things up even more, so that's still something that needs to be done. Where does the speed comparison end up? AF_XDP sits somewhere in the middle, between the kernel and DPDK. It has the advantage that you can keep the kernel driver, so it may be a little bit easier to set up, but if you still need the additional throughput you might want to go and set up DPDK. Also, your kernel needs to support AF_XDP, and some distribution kernels don't have it yet, so you might still have to fall back to DPDK.

Okay, that's it. Any questions? I think we have one minute left. Okay, perfect, any questions? We have extra time. All right, thank you.