Okay, so I think we're out of time. Thank you, Quintana, thank you very much. Hello. Okay folks, can we bunch up from the sides to let more people sit down? Okay folks, we're starting the next presentation. So let me introduce my good friends Magnus and Björn, and they're going to be talking about AF_XDP. Can we have some quiet from the room, please? All right, let's get going.

So my name is Björn Töpel, and this is my partner in crime, Magnus Karlsson. We're from Intel, as you can see from the nice blue. And we're here to tell you a bit about what AF_XDP is. What's that? Is this better? Better? Okay. So you missed the word from our sponsor, but we'll skip that, right.

Okay, so why are we doing this? If you ask most network application developers today to pick a platform and an API, most people say Linux and BSD sockets. Why is that? Well, it supports a lot of features, it works, and people are familiar with it. So that's the pick. Now say this developer wants to develop a DPI application, that is, a deep packet inspection application. He goes to what's provided by the kernel, which is the AF_PACKET socket layer. After a while he realizes: okay, I'm not getting the performance I'd like. So what can I do about it? Throw more cores at it, which we at Intel would love, but it's not for everyone; it might not even be possible to throw more hardware at the problem.

The typical solution then is to go to a hardware vendor, for example someone that provides a specialized NIC or some other kind of specialized hardware. The problem is that with this hardware you usually get a proprietary SDK, so you have to change your application. And if you want to try another piece of hardware, you have to change the application once again. These SDKs are usually hard to use, and they usually lack a lot of the features that Linux has. Then there are software solutions as well, for example DPDK, PF_RING, and netmap. They're really fast, but they're not really integrated into Linux, so again it requires you to rewrite your application. And the biggest problem is exactly that lack of integration: if you want to use a feature from Linux, you have to reimplement it yourself in your software, and doing things twice is not optimal.

So the problem statement is: can we take the ease of use and features of Linux and the software solutions, and combine them with the performance of the specialized ones? Let's see. What we're proposing is a new address family called AF_XDP. It's a socket layer, so you can view it as the user space part of XDP, and we just had a great introduction to XDP from our previous speaker, so I won't dwell on that. The solution is free from system calls: the transmit path and the receive path have zero system calls. We also provide a new kind of allocator; if you modify the device driver to use this allocator, you get true zero-copy semantics all the way out to user land. That means the NIC will DMA the frame straight into the user space application's memory. If you use an unmodified driver without this allocator, you still get a copy, but with pretty good performance. The last thing is that we do not expose any hardware details to the user application. Instead we expose virtual hardware descriptors that are translated in the kernel to the real hardware descriptors.
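For illustration, this is roughly what such a virtual descriptor looks like in the version that was eventually merged upstream (linux/if_xdp.h); the exact layout in the RFC discussed here may differ:

    #include <stdint.h>

    /* The hardware agnostic descriptor user space sees: an offset into
     * the registered packet buffer area rather than a raw DMA address,
     * plus a length.  The kernel translates these to and from the NIC's
     * real descriptor format. */
    struct xdp_desc {
        uint64_t addr;    /* offset into the packet buffer area */
        uint32_t len;     /* frame length in bytes */
        uint32_t options; /* flags, currently unused */
    };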
All right, one thing to note: if you're using the zero-copy mode, your hardware has to support flow steering. Say we have two sockets receiving two separate flows; in order to get zero copy out to user land, the NIC must be able to steer the flows. Magnus will tell you more about this later.

Okay, so what else? Our goal is to hit 40 gigabit per second for large packets, and for 64-byte packets we're hoping for 25 gigabit per second.

All right, so from a Linux perspective, why is this good? Well, we get a new socket layer that's fast; we're hoping to be close to DPDK, just some percent off. You get out-of-the-box support for all the network devices that are already supported in Linux, and Linux supports a whole lot of devices. What else? The slide states that XDP is required, which is not entirely true. To get the best performance we require XDP, but we can fall back to a mode based on SKBs instead of XDP, so you can use the same application interface for all device drivers; some drivers will just have better performance, obviously. I'll save the future work for later slides.

Okay, so from a DPDK perspective, why should DPDK care about this? Well, say you implement a poll mode driver based on AF_XDP. Then you can still use your DPDK applications without any changes, but instead of having the device driver in user space, you let the kernel handle the hardware, and honestly, the kernel is pretty good at that. There will of course be a performance hit; we're aiming for within 10 percent, and for some applications that might be good enough. As a follow-up, another good thing is that if we leave all the user space drivers aside, we can actually use DPDK inside containers, which is not possible today, at least not in an easy way. Another thing is that you don't have to reimplement things like scheduling and backoff in your user space application, because the kernel can do this. We support the poll() syscall, for example, so you can busy poll first, and when you don't want to busy poll anymore, you back off using poll(). So if DPDK uses this driver, DPDK can be seen as a really good packet processing library, which it already is. We think AF_XDP could work really well in conjunction with DPDK.

All right, enough talk. Here's the code. You start off by creating a socket. You allocate some buffers using your favorite allocator; the buffers are where the frame data will be DMA'ed to. You pass this buffer pointer to the kernel, so you register the memory with the kernel. You create the virtual descriptor rings, RX and TX, for ingress and egress. Then you bind the socket to a certain interface: you state that you want to bind to this ifindex and a specific queue in the NIC. In this example I'm using busy polling: I read messages from the ring, process them, and send them out. Again, I can put the poll() syscall in there if I want to back off.

Right. So where does XDP fit within AF_XDP? My plan was to do some introduction to XDP here, but again, I'll skip that.
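For reference, here is a minimal sketch in C of the setup sequence just described. It follows the interface that was eventually merged upstream (linux/if_xdp.h); the setsockopt and struct names in the RFC from the talk differ slightly, and error handling is elided:

    #include <linux/if_xdp.h>
    #include <sys/socket.h>
    #include <poll.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    #ifndef AF_XDP           /* not yet defined by older libc headers */
    #define AF_XDP 44
    #define SOL_XDP 283
    #endif

    #define NUM_FRAMES 4096
    #define FRAME_SIZE 2048

    /* Set up an AF_XDP socket on one queue of one interface. */
    static int xsk_setup(int ifindex, uint32_t queue_id)
    {
        /* 1. Create the socket. */
        int fd = socket(AF_XDP, SOCK_RAW, 0);

        /* 2. Allocate the packet buffer area with your favorite allocator
         *    and register it with the kernel; frames are DMA'ed (zero-copy
         *    mode) or copied into this area. */
        void *bufs = NULL;
        posix_memalign(&bufs, getpagesize(), NUM_FRAMES * FRAME_SIZE);
        struct xdp_umem_reg mr = {
            .addr       = (uintptr_t)bufs,
            .len        = NUM_FRAMES * FRAME_SIZE,
            .chunk_size = FRAME_SIZE,
            .headroom   = 0,
        };
        setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));

        /* 3. Create the virtual descriptor rings: RX/TX for ingress and
         *    egress, plus the fill and completion rings for the buffer
         *    area.  (The rings are then mmap()ed; elided here.) */
        int ndescs = 1024;
        setsockopt(fd, SOL_XDP, XDP_RX_RING, &ndescs, sizeof(ndescs));
        setsockopt(fd, SOL_XDP, XDP_TX_RING, &ndescs, sizeof(ndescs));
        setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &ndescs, sizeof(ndescs));
        setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &ndescs, sizeof(ndescs));

        /* 4. Bind to a specific ifindex and a specific queue in the NIC. */
        struct sockaddr_xdp sxdp = {
            .sxdp_family   = AF_XDP,
            .sxdp_ifindex  = ifindex,
            .sxdp_queue_id = queue_id,
        };
        bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));

        /* 5. Fast path: busy poll the mmap()ed rings, no syscalls needed;
         *    poll() can be used to back off when the rings are empty. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        poll(&pfd, 1, -1);

        return fd;
    }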
So what we've done is take the redirect action from XDP and say that, on top of redirecting your frames to other netdevices, you can redirect to an AF_XDP socket as well. It's just another target for redirect. And a somewhat crazy idea we had is that maybe we could do descriptor rewriting from XDP too: we could support another descriptor format from XDP, for example virtio-net, and then do virtual machine networking using the AF_XDP socket layer.

What else? Right now the patch set only consumes frames: a received frame that's redirected to a socket is consumed. But we'll need to be able to copy the frame to a socket and also pass it on to the stack, because then tcpdump could use this, and we'd have a much faster tcpdump than the current AF_PACKET based one. I've also been discussing with Jesper down here about doing load balancing from XDP programs, so we can receive a frame and load balance it over multiple sockets.

Right. A few notes on the operation modes. If a driver is unmodified and doesn't have XDP support at all, we support something called XDP_SKB, which is a mouthful, but it means you can still run your XDP programs. It'll be slow, obviously, or as slow as the SKB path is, but it will work, and you can use it with any netdevice in the kernel. If the device driver has support for XDP, then you use the mode in the middle; that's still a copy, but it's much faster than the first one. And finally there's the modified driver that uses the zero-copy allocation scheme, and that's the fastest one. All right, I think it's time to pass over to Magnus.

So, how do we do this zero-copy stuff? If you look at the picture on the left-hand side, that's the classical non-zero-copy case. There, the TX and RX descriptors of the hardware rings are mapped only into the kernel; they're not visible to user space. Same thing with the packet buffer: only visible to the kernel. What Linux does is copy the packets out into user space, and translate the TX and RX descriptors into some hardware agnostic format in user space. And that's a good thing. Operating systems are about hardware abstraction, security, robustness, isolation, and you get that here. So we want to keep that in our zero-copy solution.

In the zero-copy solution, we still do not expose the TX and RX hardware descriptor rings to user space. They're still translated by the Linux kernel into a format that looks different from other sockets, but is still hardware agnostic, and it's still going to be secure and isolated. The key difference is that we take the packet buffer and map it straight up into user space, so packets from the NIC are DMA'ed straight into user space. If you have two applications here, two processes, note that the RX and TX descriptors are never, ever shared between these two processes. The packet buffers, on the other hand, can be shared. By default they're not, but if you like shared memory, there's nothing that hinders you from sharing the packet buffers. Of course, you create a huge fault domain: the other process might pollute your data and so on. But you chose it, so that's what you get. It's possible, but never with the descriptors. Okay, so security and isolation, that's very important.
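The XDP program needed on the kernel side of this redirect is tiny. Here is a sketch using the XSKMAP map type that redirect-to-socket eventually got upstream (the RFC from the talk may wire this up differently): user space inserts its socket into the map at the index of the queue it bound to, and the program redirects each frame to the socket registered for the queue it arrived on.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* One slot per NIC queue; user space stores its AF_XDP socket fd at
     * the index of the queue that socket is bound to. */
    struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, __u32);
    } xsks SEC(".maps");

    SEC("xdp")
    int redirect_to_xsk(struct xdp_md *ctx)
    {
        /* Redirect to the socket bound to the queue this frame arrived
         * on; if no socket is registered there, the redirect fails and
         * the frame is dropped. */
        return bpf_redirect_map(&xsks, ctx->rx_queue_index, 0);
    }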
So what are the requirements here? We make sure that user space can never crash the kernel or another process, that it cannot read or write any kernel data, and that it cannot read or write packets belonging to any other process.

So what do you need in order to do this? Let's say you have two processes, A and B, and all your traffic comes in through a single interface. You have to split the stream of traffic between process A and process B, and if they have, say, IP address X and IP address Y, you actually have to use hardware steering to steer the packets. That's a requirement when you have multiple untrusted applications whose flows come in on the same port. Otherwise you'd have to push the whole stream up into a single untrusted application and have it spread the packets out, and then it would be able to see and modify every packet. That's not a good thing. But fortunately, since about 10 years back, most NICs actually support classification, and NICs today are becoming more and more advanced and can do more and more; you can even download XDP programs into some NICs. So they're becoming more and more flexible, and I don't think this is a problem. But there will always be cases where you can't perform the classification in hardware, and then you have to use the XDP_SKB mode or the XDP driver mode, which copy the data into user space. What we want there is to have the classification done in XDP: you download an XDP program and it does the classification. It's going to be slightly slower, of course, but still a lot better than before.

Okay, let's look at some numbers. The experimental setup: it's the latest kernel RC, the one that came out Wednesday, I believe. No, Tuesday even. It's just some Broadwell server. We use only two cores for these simple microbenchmarks: the app runs on one core, and, I know this is completely wrong, TX and RX run on the same core. Of course that's not good, but we're going to fix that; it was just a faster way to get to this point. TX should of course run on its own core, but it doesn't at this point, so TX and RX will compete, as you'll see in the benchmarks that use both. We use one queue pair. It's a 40 gigabit Fortville NIC, and we use a traffic generator blasting 40 gigabits per second of traffic at it all the time.

We have three microbenchmarks. RX drop just receives the packet, doesn't look at the data, and drops it. TX push is kind of the opposite: it has precomputed TX packets and just tries to send them out as quickly as possible. So one is RX only, the other TX only. And L2 forward receives the packet, swaps the MAC addresses, and sends it out again (a sketch of that swap follows below), so this one actually does touch the data.

We have four different columns. There's AF_PACKET V3, which already exists in Linux, and then the three modes we've introduced. The first thing you can see is that even the XDP_SKB mode, which works with any network device, is two to five times as fast as the previously fastest one. We compared V2 too, and it's similar, so there's nothing special about V3. So that's pretty good: it works on anything and gives you pretty much the same raw packet interface in both cases. I think that's good.
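As an aside, the per-packet work in that L2 forward benchmark is just the MAC swap. A minimal sketch of it (the helper name here is ours, not from the RFC):

    #include <stdint.h>
    #include <string.h>

    /* Swap the Ethernet destination and source MAC addresses in place,
     * so the frame can be sent straight back out. */
    static void swap_mac(uint8_t *frame)
    {
        uint8_t tmp[6];
        memcpy(tmp, frame, 6);        /* save dst MAC (bytes 0-5)      */
        memcpy(frame, frame + 6, 6);  /* src MAC into dst position     */
        memcpy(frame + 6, tmp, 6);    /* old dst MAC into src position */
    }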
And then we have the XDP driver case, which works on any driver with XDP support. With RX drop, you can see it's about 15 times faster than V3; I think that's also really good. You can see there are dashes for TX push and L2 forward. That's one of the challenges we had with this RFC: we didn't reuse the XDP TX NDOs in the driver. I think that's going to be worked out; hopefully we'll get those NDOs changed so we can have a mode for TX here too. So we'll get numbers here, but the code isn't ready.

But even better performance comes with the zero-copy one. Starting with TX push, it's about 22 times what you have with V3. RX drop is about 17 times, and L2 forward also around 20 times. Something to note, though: we had a previous version of this, which we called AF_PACKET V4, the new version of AF_PACKET, and it actually had about 40 times the performance. The reason is that we have done absolutely no performance optimization on this code; it's just there for functionality, while on that earlier one we actually did some optimization. So my guess is that there's lots of low hanging fruit to be had here, and this will definitely increase. The goal is to get to at least about 30 million packets per second for these 64-byte packets. Clearly we have work to do in the optimization area, but to me all these numbers look promising.

There's lots of future work we can do. I'm told we have nine minutes. Of course, we have to do the performance optimization work. And, really important, this is a call to you guys: if you have real workloads, please try them out. I only have toy microbenchmarks here, and real workloads are very different from toy microbenchmarks. If you want to try it out, the RFC is on the mailing list; just download it and let us know what works and what doesn't, please. We also really want to get rid of the syscall that's currently needed on the TX side to get transmission going, and to get TX off the RX core, because you saw the L2 forward performance: it's limited by RX and TX competing for that core. Of course that's not good, but we'll fix it. Then there's packet steering using XDP, which we talked about. Also, XDP has metadata support now; it would be nice to tie that into this, so you can get metadata up there, maybe hardware offload stuff, or things the NIC has already precomputed for you.

Also, say you start an application, bind to a certain NIC and a certain queue id, say queue id 100, and then you run it on another NIC that doesn't have a queue id 100. You still want that program to work, so you should be able to emulate that queue id by some copying or whatever. Today that's not supported, but you really want the same program to work on every single NIC, independent of how many queues it has. We would also like to see an XDP redirect to other netdevices' RX path: today you redirect to the TX path and send the frame out again, but in certain cases it makes a lot of sense to redirect into the RX path and up the stack again. And today, at least in the kernels I've looked at, and correct me if I'm wrong here,
you have one XDP program for your whole NIC; you don't have one per queue pair. But since you can open an AF_XDP socket per queue id, it would make sense to have an XDP program per queue pair as well. You can sort of do this today: you can have one XDP program and filter on queue id inside that program. But from a management point of view it's still easier to have these as separate programs instead of one monolithic entity.

Somebody also asked about XDP support on TX. eBPF has TX support, but XDP doesn't, and it makes a lot of sense to have it, especially in conjunction with these AF_XDP sockets: to actually be able to execute an XDP program on the TX path too. That's not in place today.

And the rings are strictly single-producer, single-consumer, just to be as fast as possible. But there are use cases where you really want multi-producer, single-consumer, or even single-producer, multi-consumer, so it would be nice to be able to plug such a ring in. We've tried to write the code so that the ring structures are completely abstracted away from the rest of the code, so you can actually plug in different ring structures. It would be nice if somebody tried that out. And packet cloning, just to be compatible with AF_PACKET: AF_PACKET today clones the packet, sends one copy up to user space, and the other goes through the stack. We want to be able to do the same thing, to support tcpdump, Wireshark, and those things. In the previous version we did a tcpdump implementation; it only took three or four hours to convert tcpdump, and we got a 20x performance improvement on it. I think that's decent for four hours of work. So we want to do something similar here.

Right, we've got five minutes, so let's conclude. We introduced AF_XDP, formerly known as AF_PACKET V4. It's integrated with XDP; it's basically the user space interface of an XDP program. And the zero copy: we have it up to 20x now. We had it up to 40x before, but the current patch set is at 20x, so we hope it can get even better than this. There's an RFC on the netdev mailing list; please check it out. It has all the details, this is just an overview. There's still lots of performance optimization work to do: the RFC is there for you to look at, and for getting the design out for you to comment on, not because it has the greatest performance on Earth yet. We shouldn't start optimizing before the design is ready, as we all know, but it's very tempting. And we also think there are lots of exciting XDP extensions to be had in conjunction with this. I'm not going to go through the acknowledgements slide, but there are a couple of people in the audience we want to thank, and more people here. The RFC, you can find here. Right? Questions? Go ahead.

"If I have to write an application with this, why would I choose one or the other?" Yeah, okay. So you're starting from scratch and you have to make the choice: what should I do? It depends. If you have very, very tight performance targets, or you're building an embedded system in an integrated box, then you could probably go with DPDK. But if you have different deployment scenarios, where you want your code to run on an embedded system, in a box in the cloud, and on somebody's server, then go with AF_XDP.
You could deploy it anywhere, for example. There are many other things too; DPDK supports offloading, for instance, and we don't do that at this point. And it doesn't have to be an either-or choice. That's a good question.

"Why not RDMA?" Repeat the question? Okay. The question was: why not RDMA? Yeah, it's a good point. We actually started out looking at that, and I think the main thing is that it's so different from what's in the networking stack. But hey, it might be a good fit, and lots of things here are inspired by RDMA, as you know. And, this is my subjective opinion, but it's a bit too much, I'd say, for most applications. RDMA is for all the storage back ends and things like that; it's too much here, but it might be right. And RDMA has been around for a long time, so it's really mature as well. But there are like two people waving. Yeah, one more. No. No. Yeah. Yeah. The question is about how you designed it; let's take that offline. Sorry. Thank you.