Hello. Welcome. My name is Brenden Blanco. Hi, this is Yunsong Lu from Huawei. Today we're here to talk about how we've been leveraging XDP, and how you can do so as well, to build programmable, high-performance data paths for OpenStack.

We just want to give a short thanks to our sponsoring members. We're both part of the IOVisor project, a Linux Foundation collaborative project for enabling people to do IO. Today we're going to give a little bit of background on IOVisor, and then we're going to focus on one application called XDP and give a use case.

The IOVisor project is a community, and a set of tools developed by that community, for doing a variety of high-performance IO work, from tracing and security to networking. The goal is to use these tools to develop new infrastructure applications. We started by building networking applications; that's one of the original use cases for software-defined networking. We wanted an SDK for extending this low-level infrastructure, for things like processing packets, but we didn't want everyone to have to become a kernel developer to do that, which is kind of the status quo within Linux. Even outside of Linux, you have to work at a pretty low level to develop new networking or IO applications in general.

So we looked around at different application frameworks that were successful, for guidance on how to build a new one. Take, for instance, Node.js. I'm actually not a Node.js expert, so it might be a horrible example, but it seems to have some popularity. Why? Well, it starts by recognizing that writing multi-threaded applications is hard and confusing. Node.js takes a different approach: instead of matching the CPU architecture, it looks at the way events happen in those applications and enables a syntax that is more suitable. And it does that without sacrificing performance: you write in an expressive language, but V8, the compiler, translates it nicely into native code. On top of that, there's a community that publishes its applications, which are easy to install, so there's a lot of sharing and a rapid cycle of development.

Let's apply that analogy to infrastructure applications. It's a little different; you have a different set of restrictions and goals. You want high-performance access to data; that's why a lot of these things live in the kernel or in user-space frameworks. It has to be reliable: you're building applications for high uptime, so they can't crash and they can't work only some of the time. Beyond just not crashing, even your development model needs to avoid rebooting the server. We want something that doesn't require mucking around with your kernel, that you can write as if it were a normal application, but that still gives you those first couple of properties. And the programming language abstraction that goes with it should be easy to write; we've learned a lot over the past couple of decades about how we write these applications.

So here's a tool that's been around for a little while and that, as part of the community effort, we were able to extend. BPF programs are, well, not really programs; we'll get to that in a moment.
BPF is kind of an instruction set, initially for low-level packet parsing, but it's actually a little more generic than that. Within IOVisor, you can take these BPF programs and attach them to events within the kernel: kernel events, packet events, IO events that you match up with your BPF program. You also get a few data structures, called maps, that the programs can access, plus a user-space library and a set of system calls to interact with that low-level primitive.

BPF has been inside Linux for a couple of decades. Again, it's not really a program; we call them BPF programs, but there's no process ID. It runs off to the side in the kernel, or even in other hardware that supports BPF. Most importantly, and this is what makes it a killer application, you can take these programs and run them in the kernel in the native instructions of that machine, even though they look like a high-level language. It's not a kernel module, and it's not arbitrary binary that you're loading into the kernel, but it does run very quickly, in the native instruction set, because there's a JIT inside the kernel that translates these programs into native instructions.

So that's a little bit of background on what IOVisor comprises. Now I'll ask Yunsong to talk a little about what XDP is.

First, a little background on XDP. Networking has been the major use case of IOVisor from day one. PLUMgrid has its SDN solution built on IOVisor technology, Cisco's Cilium is built on IOVisor technology, and Huawei has its micro-dataplane container technology built on IOVisor. All of these use IOVisor technologies and tools to build networking.

XDP was proposed early this year at the Netdev 1.1 conference, where people were talking about how to improve networking in Linux. The XDP design had several goals in mind. First, it must be high performance, because networking on Linux, for OpenStack and other environments, has been criticized for poor performance; you may have suffered from slow networking solutions yourself. Also, as Brenden talked about, programmability is very important: you want to write new applications very easily, and you don't want to change your kernel that often. XDP doesn't replace the current TCP/IP stack; it's fully integrated, and it augments the current stack and the kernel to do better work. XDP is designed to run on general-purpose hardware: x86, ARM, and of course some NPUs. We also mentioned acceleration: other stacks and technologies were designed purely as software and only later tried to add offloading and acceleration, but with XDP, offload and acceleration were taken into consideration from day one. That's the design purpose.

Looking at this slide, there are simply three places to run the XDP data plane. From the bottom up, there is hardware, the kernel, and user space. XDP is natively implemented in the kernel; I will show a demo and a more detailed packet flow in the kernel shortly. At the top of the diagram there are some use cases: virtual router, firewall, load balancer, those kinds of things.
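To make the model just described concrete, the programs attached to kernel events, the shared maps, and the user-space library, here is a minimal sketch using the BCC Python bindings from the IOVisor project. It is only an illustration: the kprobe target (the clone syscall) and the table name are chosen for this example, not taken from the talk.

```python
# A minimal BCC sketch: a BPF program attached to a kernel event, sharing a
# map with user space. Illustrative only; not code shown in this talk.
from bcc import BPF
from time import sleep

prog = r"""
BPF_HASH(counts, u32, u64);            // map shared with user space: pid -> count

int count_clone(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);             // bump this process's event count
    return 0;
}
"""

b = BPF(text=prog)                      # compiled, verified, and JITed by the kernel
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="count_clone")

sleep(5)                                # let some events accumulate
for pid, count in b["counts"].items():  # read the map through the user-space library
    print("pid %d called clone() %d times" % (pid.value, count.value))
```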
You can write those applications, those functions, in C, P4, or whatever other language you prefer. It's fairly easy; in the demo we will also show how quickly you can update your program and load it.

First, we want to clarify the terms IOVisor and XDP. IOVisor is the framework, the platform, and the community; it includes the BPF tools and many other things, with many use cases. XDP is specifically for networking on Linux, and you can build network functions based on XDP.

Again, looking at the use cases: you can build a virtual switch easily with XDP technology, and of course, with the IOVisor tools, a virtual router. In fact, there is a router deployed at Facebook today; they implemented their new ILA router with XDP for their data center. Load balancing as well: if you need to do application load balancing, XDP is really a perfect technology for that. And security: people use firewalls for OpenStack and a lot of other applications. Of course, we know a famous thing happened recently; we'll talk about it a little later.

With XDP we are not only building small building blocks, the virtual switch, virtual router, and so on; in the community we are also building solutions. One thing that will be announced soon is an accelerated OVN solution. When it gets announced there will be a talk, so please follow the OVN development and the XDP development on iovisor.org and other websites. There will also be solutions designed specifically for container networking. There are already some names there, and you will see more.

Okay, this is the internal data path. We said XDP can run in the kernel, can be pushed to hardware, and can also run in user space. This slide shows how it runs in the kernel. I'll hand it back to Brenden to talk about the details, and then it will be followed by a demo. Thank you.

So here we see a rough layout of the existing Linux kernel stack. You have drivers managing the device state at the bottom, getting data in and out of the devices and passing it up, usually to the TCP stack, then sockets and your applications. The place XDP fits is at the very bottom layer, before the drivers have even handed packets off to the stack for application processing. The reason we do this is that if you want to run fast, you have to put the fast processing before any extra work has been done. As the drivers receive packets from the network, they pass them to these customizable, programmable BPF programs, which return an action. One action is to drop the packet; the story ends there. Or you can pass the packet upwards and influence some of the Linux behavior, such as packet steering (hashing packets to different applications, cores, or NUMA nodes) or GRO for packet coalescing. Or you can forward, if you want the XDP application to be the entirety of your application: if you're building something like a vRouter, vSwitch, or load balancer, the packet might immediately go back out on the network, maybe with some modification or some bookkeeping that you do.
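As a rough illustration of the hook point and actions just described, here is a minimal sketch, using the BCC Python bindings, that loads an XDP program at the driver's receive path. The interface name eth0 is a placeholder, and the program body only annotates the possible return actions rather than implementing a real network function.

```python
# A minimal sketch of the XDP hook and its actions; illustrative only.
import signal
from bcc import BPF

prog = r"""
#include <uapi/linux/bpf.h>

int xdp_hello(struct xdp_md *ctx) {
    // ctx->data .. ctx->data_end bounds the raw frame, before any sk_buff
    // has been allocated. A real program would parse headers here and then
    // pick one of the actions below:
    //   return XDP_DROP;   // drop: the story ends there
    //   return XDP_TX;     // send straight back out the same NIC
    //   return XDP_PASS;   // hand the packet to the normal Linux stack
    return XDP_PASS;
}
"""

b = BPF(text=prog)
b.attach_xdp("eth0", b.load_func("xdp_hello", BPF.XDP))   # hook at driver RX
print("XDP program loaded on eth0; Ctrl-C to detach")
try:
    signal.pause()
except KeyboardInterrupt:
    pass
finally:
    b.remove_xdp("eth0")
```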
And if you remember back to a previous slide where we had APIs between user space and kernel space, those are also available here. The BPF event model applies here as well, and you get system calls and the ability to access the map data that you keep in your program, so you can build a functional model out of this.

Now, getting to a use case: does anyone know what this graph signifies? There are some hints already, but can anyone see the date on it? Yeah, this happened Friday. This is a graph of packets received over time, the packets per second on the level-3 network that was front-ending the traffic to Dyn, the DNS service. With all of the compromised devices out on the Internet, the cameras and routers and so on, they were flooding the Dyn service with DNS traffic, SYN packets, and various other attacks. It brought Twitter to a halt, for instance.

So a DDoS attack is one example of a use case where you need fast networking. Consider this attack: it had a fluctuating mix. You saw the graph had a couple of waves, and within those waves, analysis has shown there were different types of attacks coming and going: SYN packets, a spam of DNS requests. To defend against that high rate, many gigabits spread across multiple servers, you have to filter traffic as fast as you can. You can't afford any cycles to go off and do some slow analysis; otherwise your good traffic is completely lost. It's going to sit in some buffer somewhere, it's going to be tail-dropped, and you're never going to see it, let alone pass it along to the next host. And as we're defending against this, we need to be able to adapt just as quickly, so we need a very flexible model for that, and tools that are very nimble.

We were actually working on this before the attack; that's just coincidental. To defend against a DDoS, we wanted to see, just to start with, how fast we can drop packets. That's really what DDoS mitigation is: dropping as fast as you can. So we have a simple setup, which we'll show in a second, with two x86 boxes. The machine on the right is our sender; it's just going to send UDP on a 40-gig NIC, small frame size, as fast as it can. And we're going to start the setup with OpenStack on the left-hand side: just a simple DevStack, an OVS bridge, a router, and a couple of VMs with floating IPs. We're going to send traffic to that and see what happens.

So here we have our DevStack setup. We have a couple of instances there; the IPs really aren't important, so let's keep the screen small. Our network topology has, like we said, a private network, a router, and a public network. The public network here is what's connected to the Mellanox card on the receiver side, so it's just doing a simple floating IP translation. Let's start by seeing how we'll measure our good traffic. Very, very simple: we'll just send 50 pings over a short period of time and expect, there we go, zero packets lost. Good. And here we're going to be monitoring our receiver to see the packets that are being received; we get a kind of meter here on the right, plus the CPU load. So we'll start that off and give it a few seconds to ramp up. Our CPU is already at 100%, and we can see the rate that OVS is able to handle.
Keep in mind, here we're still sending to the VMs. Our VMs are able to handle about 430,000 packets per second. I haven't really tuned OVS here, so you may get different numbers in your own test bed; that's not the point, because it's definitely not going to keep up with that line rate at a small packet size no matter what we do.

So let me stop this, and let's do something simple: we'll check our security group rules. Yeah, so we were allowing UDP traffic, which is why the packets were reaching the VM. That's definitely not going to perform well, so let's at least make sure the host drops those before they reach the VMs, and try again. I forgot to check whether we could actually ping; we wouldn't have been able to. Let's see what we can do. It doesn't look good: 86% packet loss. We were dropping about 440K packets per second before, and now we're doing about 480K, with our CPU at 100%, not leaving a lot of room for the good traffic. So we've done most of what we can using the standard OpenStack tools.

Now we're going to use XDP to add a filter for UDP traffic. Let's take a quick look at the script. It's using the IOVisor tools: we have a Python script here, which I can make a little bigger, that takes command-line arguments, and a simple parser that parses the IP headers and returns one to the caller whenever it sees a UDP packet. Our program just calls that parse function, returns the drop action whenever it gets a positive value, and keeps a statistic. So we'll run that, start up our traffic, and watch it ramp. We'll notice right away that our CPU utilization is no longer 100%; we're now sitting at around 20% CPU, and we're dropping 20 million packets per second. That's actually pretty good. There are some NICs out there, some hardware, that can drop at rates like this without doing much processing, but to actually have programmability, get these statistics up into Linux, and then take action in Linux is pretty novel.

Let's go one more step and update this on the fly, to show how easy it is to adapt the script to the attack. What I'm going to do is take this script and add a hash table where I'll keep a blacklist of IP addresses. We'll key it on a u32 for the IP address, call the table blacklist, and give it 1,000 entries; I can tune that as I want. Then we'll write the parsing code: if the destination is one of those four floating IPs we have, we're going to drop the packet. First we extract the IP address from the packet, and if we don't find it in the blacklist, we're not going to bother dropping it; we'll let it pass. So if this IP address is not in our blacklist, we return zero. Now we add a little bit of Python code to inject this: we get a handle for our table, take out the old code, and let the user give the list of IP addresses that we feed into the blacklist. If you're curious about the syntax of this, these are all on our GitHub. We'll just update the table whenever I type in here. So we'll start that up and start up our traffic. I haven't put any IP addresses into the blacklist yet, so I expect we'll still be passing everything until we get the list populated.
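The actual demo scripts are on the IOVisor GitHub; what follows is only a minimal sketch of the blacklist approach described above, written against the BCC Python bindings: a hash table keyed by a u32 destination address, an XDP program that drops blacklisted UDP traffic while keeping a drop counter, and a bit of Python that feeds addresses given on the command line into the table. The interface name and table size are illustrative assumptions.

```python
# A minimal sketch of a UDP blacklist in XDP; illustrative, not the demo script.
import sys
import signal
import socket
import struct
from bcc import BPF

prog = r"""
#define KBUILD_MODNAME "xdp_blacklist_sketch"
#include <uapi/linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>

BPF_HASH(blacklist, u32, u64, 1000);      // dest IP -> dropped-packet counter

int xdp_blacklist(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    u32 daddr = ip->daddr;                 // destination IP, network byte order
    u64 *dropped = blacklist.lookup(&daddr);
    if (!dropped)
        return XDP_PASS;                   // not blacklisted: let it through
    (*dropped)++;                          // keep a per-address drop statistic
    return XDP_DROP;
}
"""

device = "eth0"                            # placeholder interface name
b = BPF(text=prog)
b.attach_xdp(device, b.load_func("xdp_blacklist", BPF.XDP))

table = b["blacklist"]                     # user-space handle to the map
for ip_str in sys.argv[1:]:                # e.g. the four floating IPs
    key = struct.unpack("=I", socket.inet_aton(ip_str))[0]
    table[table.Key(key)] = table.Leaf(0)  # same byte layout as ip->daddr

print("Dropping UDP to %d blacklisted address(es); Ctrl-C to detach" % len(sys.argv[1:]))
try:
    signal.pause()
except KeyboardInterrupt:
    pass
finally:
    b.remove_xdp(device)
```

Note that the addresses are kept in network byte order on both sides, so the key written from Python has the same in-memory layout as ip->daddr in the BPF program.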
So we'll add one and see what happens. Now we're dropping traffic to one of the four IP addresses. Not there yet; I have to add all four of them. There we are again. And that's our DDoS tool. Very simple, and I'm sure it's not really going to protect against a botnet out there, but it's a start, something you can use. Go to the GitHub and play around. And there we are. We have a microphone here, so I'm going to ask Fouad to carry it around. Does anyone have any questions? Any questions? Yeah, we have one. Good.

Just to be sure, the packets are still handled with interrupts? This is not the case in DPDK. But it's impressive, because we still get very high performance.

Right. So this hooks into however the drivers that support XDP are written. For instance, the Mellanox one can run in polling mode or in NAPI interrupt mode, and depending on how the application is using those drivers, it can adapt. But the XDP hook is supported for both.

And again, just to make sure, because it's so impressive: is the drop done in hardware, in the NIC, or in the kernel?

In the kernel. One caveat is that this setup has DDIO from the Mellanox NIC to the CPU, but the packets are making it all the way to the CPU, for sure. And this program has, let's see, I think three packet accesses. It was doing at least three reads from the packet, which should have been warm in the cache, as well as keeping one statistic and looking up in a hash table.

A different question. You showed that the packet is first handed to the XDP component and then sent to the stack. Is there double parsing in this case?

There could be, for the packets that go up. The overhead of that parsing, though, is small. Think of it this way: if you want to handle line-rate traffic on a 40-gig NIC, you have about 16 nanoseconds per packet. On a 3-gigahertz CPU that's about 50 cycles, if I remember correctly. So you have about 50 cycles per packet if you want to keep up with line rate. Allocating an sk_buff and just getting into the stack at all blows that budget. So in the good case you might be adding a little bit of overhead, but it's not going to be more than 50 cycles or so; that's how fast some of these programs are going. And you saw we have cycles to burn as well, and that's just one core.

Hello. You mentioned that Facebook has built a switch using this technology. I was just wondering, what's the advantage of building the switch using this, compared to the normal OVS?

Actually, what we mentioned is a router. It's an IPv6 router built by Facebook, and they are using it in their data center today.

It's a different router than what's available in Linux, say. It uses different primitives, so they're able to program a completely new use case on top of IOVisor, in BPF, and then run it in XDP so it's fast. If they had to do it within the kernel itself, and I think they have been doing some of this, but with the BPF-based one they can move much faster. You can just code up the parsing logic in these files and not have to submit kernel patches, because it is new functionality; in the ILA router case, it doesn't exist in the kernel yet.

Also, back to the vSwitch part. Currently the performance is burned in the kernel by the driver, by packet processing, by a lot of things. With this, things can be done much earlier, so you can get a much higher packet rate from a lower level, at the driver. You saw the 20 million packets per second being processed.
So if you can process packets from there for virtual switching, you can get much better performance. We actually saw very impressive virtual switch performance numbers, but we are not going to go too deep into that topic today. The answer there is: join the community and come code with us.

Can you elaborate on the last bullet, the Docker/Kubernetes container one? On this slide, yeah, the last bullet. So what's the state of container networking? First of all, what are you doing there, and what kind of startups are involved?

Actually, there are several things going on today, but some of the work hasn't been fully announced. You can see there is the Cilium project done by Cisco, which is using the technology we are talking about here. There are also some other solutions out there that people are working on, but you can go to the community to look at what's going on; I don't have all the details. I know about Cilium, I don't know about the others, and some solutions haven't been fully announced. It's still growing, so it's not solved yet. Also coming soon. And that's just the open-source part; by the way, many vendors and many companies are working on their own solutions. This is talking about the open-source, community part.

Are you aware of vendors wanting to support eBPF in hardware?

I believe there are some in the room. Yeah, actually. So can you tell us the status and your roadmap? Yeah, there you go.

So, Netronome has an offload for BPF. It's not hooked up to XDP yet, but it will be soon; currently you can use it via TC. Basically, the way it works is we JIT the eBPF code in the kernel and then push that down to the hardware we have, which then runs the JITted code. The major feature we're missing, apart from XDP support at this time, is support for eBPF maps, which is the facility used in the demo to dynamically update which packets are being dropped, but we're also working on that, and we expect to make progress very soon. And the basic offload for TC is already in the mainline kernel.

Yeah, there are also some other names. You will see Barefoot working on putting XDP on their switch hardware. Also FPGAs, Netronome, Broadcom; all those things can be done. Some work has been done, some is still in progress. But to be clear, BPF is a virtual machine, right? On a dumb NIC, a traditional very simple NIC, you cannot run BPF in the hardware.

A virtual machine in the sense of a Java virtual machine?

Yeah, like the Java virtual machine, not like the virtual machine for OpenStack. It's an instruction-level virtual machine. So you need some new hardware, call it an accelerator, to run that. Of course, with FPGAs and multi-core processors, it's much easier to do such work.

Okay, thank you very much. Thank you.