All right, now it's my great pleasure to introduce Paul Emmerich, who is going to talk about demystifying network cards. Paul is a PhD student at the Technical University of Munich. He's doing all kinds of network-related stuff, and hopefully today he's going to help us make network cards a bit less of a black box. So please give a warm welcome to Paul.

Well, thank you. As the introduction already said, I'm a PhD student, and I'm researching the performance of software packet processing and forwarding systems. That means I spend a lot of time doing low-level optimizations and looking into what makes a system fast, what makes it slow, and what can be done to improve it. I'm mostly working on my packet generator, MoonGen. Some cross-promotion: I have a lightning talk about that on Saturday. But here I have this longer slot, and I brought a lot of content, so I have to talk really fast. Sorry for the translators, and I hope you can follow along.

So this is about network cards, real physical network cards. You have all seen these: this is a usual 10G network card with an SFP+ port, and this is a faster network card with a QSFP+ port, for 40 or 100 Gbit/s. Now you have bought this fancy network card, you plug it into your server or your MacBook or whatever, and you start your web server that serves cat pictures and cat videos. You all know that there is a whole stack of protocols that your cat picture has to go through until it arrives at the network card at the bottom. The only thing I care about are the lower layers. I don't care about TCP. I have no idea how TCP works; well, I have some idea how it works, but it is not my research, and I don't care about it. I just want to look at individual packets, and the highest thing I look at is maybe an IP address, or maybe a port number to identify flows.

Now you might wonder whether there is anything even interesting in these lower layers, because people nowadays think that everything runs on top of HTTP. But you might be surprised that not all applications run on top of HTTP. There is a lot of software that needs to run at these lower layers, and in recent years there has been a trend of moving network infrastructure from specialized hardware black boxes to open software boxes. Examples of such software that used to be hardware are routers, switches, firewalls, middleboxes, and so on. If you want to look up the relevant buzzword, it's called network function virtualization.

So now let's say we want to build our own fancy application on that low-level thing. We want to build our firewall router packet forwarder modifier thing that does whatever is useful at that lower layer for network infrastructure. I will use this application as the demo application for this talk; everything will be about this hypothetical router firewall packet forwarder modifier thing. What it does: it receives packets on one or multiple network interfaces, it does stuff with the packets, filters them, modifies them, routes them, and sends them out on some other port, or maybe the same port, or maybe multiple ports, whatever these low-level applications do. This means the application operates on individual packets, not on a stream of TCP or UDP data. And it has to cope with small packets, because that's just the worst case: you get a lot of small packets. Now you want to build that application.
You go to the internet and look up how to build a packet forwarding application. The internet tells you there is the socket API. The socket API is great, and it allows you to get packets into your program. So you build your application on top of the socket API: it runs in user space, you use your socket, the socket talks to the operating system, the operating system talks to the driver, the driver talks to the network card, and everything is fine. Except that it isn't, because what it really looks like when you build this application is that there is a huge, scary gap between user space and kernel space, and you somehow need your packets to get across it without being eaten.

You might wonder why I say this gap is a big deal, because your web server serving cat pictures is doing just fine on a fast connection. Well, it is, because it is serving large packets, or even large chunks of files that it hands to the kernel in one go. You can take your whole cat video, give it to the kernel, and the kernel will handle everything from packetizing to TCP. But what we want to build is an application that has to cope with the worst case of lots of small packets coming in, and the overhead you get from this gap is mostly on a per-packet basis, not a per-byte basis. So lots of small packets are a problem for this interface. And when I say problem, I'm always talking about performance, because I'm mostly about performance.

A few figures to get started. How many packets can you fit over your usual 10G link? That's around 15 million per second. But 10G, that's last year's news; this year we have multiple 100G connections even to this location here, and a 100G link can carry up to 150 million packets per second. How much time does that give us on a CPU? Say we have a 3 GHz CPU in our MacBook running the router; that means we have around 200 cycles per packet if we want to handle one 10G link with one CPU core. Okay, we have multiple cores, of course, but you also have multiple links, and links faster than 10G. So the typical performance target you would aim for when building such an application is 5 to 10 million packets per second per CPU core, per thread that you start. And that is just for forwarding, just receiving the packet and sending it back out; all the remaining cycles can be used by your application, so we don't want any big overhead just for receiving and sending packets without doing any useful work. These figures translate to around 300 to 600 cycles per packet on a 3 GHz CPU core.

Now, how long does it take to cross that user space boundary? Very, very, very long for an individual packet. In some performance measurements, if you do single-core packet forwarding with a raw socket, you can maybe achieve 300,000 packets per second. If you use libpcap, you can achieve a million packets per second. These figures can be tuned; you can maybe get a factor of two out of that with some tuning. But there are more problems, like multi-core scaling being unnecessarily hard, and so on. So this doesn't really seem to work: the boundary is the problem.
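To make the per-packet cost concrete, a minimal sketch of such a socket-based forwarder might look like this. It is not code from the talk; the interface name eth1 is a placeholder and error handling is reduced to the bare minimum. The point is that every single packet costs at least one receive and one send system call.

```c
/* Minimal sketch of the socket-based forwarder described above.
 * Needs root (CAP_NET_RAW); "eth1" is a placeholder interface.
 * A real application would also bind() the socket to a specific
 * input interface instead of receiving from all of them. */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    /* Raw socket that delivers whole Ethernet frames to user space. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    /* Output interface for sendto(). */
    struct sockaddr_ll out;
    memset(&out, 0, sizeof(out));
    out.sll_family = AF_PACKET;
    out.sll_ifindex = if_nametoindex("eth1");
    out.sll_halen = ETH_ALEN;

    uint8_t buf[2048];
    for (;;) {
        /* One system call per received packet ... */
        ssize_t len = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (len <= 0) continue;
        /* ... (per-packet processing would go here) ... */
        /* ... and one system call to send it back out. These two
         * syscalls per packet are what limits a single core to a few
         * hundred thousand packets per second. */
        sendto(fd, buf, (size_t)len, 0, (struct sockaddr *)&out, sizeof(out));
    }
    close(fd);
}
```

Even before doing any useful work, those two syscalls per packet eat most of the 300 to 600 cycle budget mentioned above.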
So let's get rid of the boundary by just moving the application into the kernel: we rewrite our application as a kernel module and use that directly. You might think: what an incredibly stupid idea, writing kernel code for something that clearly should be user space. Well, it's not that unreasonable; there are lots of examples of applications doing this. A certain web server by Microsoft runs as a kernel module. The latest Linux kernel has TLS offloading to speed that up. Another interesting case is Open vSwitch, which has a fast in-kernel cache that just caches stuff and does the complex processing in a user space process. So it's not completely unreasonable, but it comes with a lot of drawbacks: it's very cumbersome to develop, most of your usual tools don't work or don't work as expected, you have to follow the usual kernel restrictions, like using C as the programming language, which you maybe don't want to, and your application can and will crash the kernel, which can be quite bad.

But let's not care about the restrictions; we wanted to fix performance. Same figures again: we have 300 to 600 cycles to receive and send a packet. So what I did was profile how long it takes to receive a packet until I can do some useful work on it. This is an average cost over a longer profiling run: on average it takes around 500 cycles just to receive the packet. That's bad. Sending it out is slightly faster, but again we are already over our budget. Now you might think, what else do I need to do besides receiving and sending the packet? There is some more overhead: you need some time to manage the sk_buff, the data structure used in the kernel for all packet buffers. That is a quite bloated, old, big data structure that grows bigger and bigger with each release, and it costs another 400 cycles. And if you measure a real-world application, single-core packet forwarding with Open vSwitch with the minimum processing possible, one OpenFlow rule that matches on physical ports, the processing itself profiles at around 200 cycles per packet, while the overhead of the kernel is another 1,000-something cycles. In the end you achieve 2 million packets per second. That is faster than our user space stuff, but still kind of slow, and we want to be faster. The currently hottest topic in the Linux kernel, which I'm not talking about, is XDP; it fixes some of these problems but comes with new restrictions. I cut it from my talk for time reasons, so let's not talk about XDP.

So the problem was that we moved the application into kernel space, and it didn't work. Can we instead move stuff from the kernel into user space? Yes, we can. There are libraries called user space packet processing frameworks. They come in two parts: a library that you link your program against in user space, and a kernel module. These two parts communicate and set up shared, mapped memory, and that shared memory is used to communicate directly from your application to the driver: you directly fill the packet buffers that the driver then sends out. This is way faster. You might have noticed that the operating system box here is not connected to anything; that means the operating system in most cases doesn't even know that the network card is there, which can be quite annoying. There are quite a few such frameworks; the biggest examples are netmap, PF_RING, and PFQ. And they come with restrictions, like a non-standard API.
You can't port between one framework and another, or between a framework and the kernel or sockets. A custom kernel module is required, and most of these frameworks need small patches to the drivers; it's just a mess to maintain. And of course they need exclusive access to the network card, because this one application is talking directly to it. The next thing is that you lose access to the usual kernel features, which can be quite annoying. Then there's often poor support for the hardware offloading features of the network cards, because those live in different parts of the kernel that we no longer have reasonable access to. And since these frameworks talk directly to the network card, they need support for each network card individually; usually they just support one, two, or maybe three NIC families, which can be quite restricting if you don't happen to have one of those specific NICs.

Can we do an even more radical approach, given all these problems with kernel dependencies and so on? It turns out we can get rid of the kernel entirely and move everything into one application. We take our driver and put it in the application; the driver directly accesses the network card and sets up its memory in user space, because the network card doesn't care where it copies the packets from, we just have to set up the pointers in the right way. We can build a framework like this where everything runs in the application: we remove the driver from the kernel, and there is no kernel driver running at all. This is super fast, and we can also use it to implement crazy and obscure hardware features of network cards that are not supported by the standard driver.

Now, I'm not the first one to do this. There are two big frameworks that do that. One is DPDK, which is quite big; it's a Linux Foundation project and it has support from basically all NIC vendors, meaning everyone who builds a high-speed NIC writes a driver that works with DPDK. The second framework is Snabb, which I think is quite interesting because it doesn't write the drivers in C; it is entirely written in Lua, the scripting language. It's kind of nice to see a driver written in a scripting language.

Okay, what problems did we solve, and what problems did we gain? One problem is that we still have a non-standard API. We still need exclusive access to the network card from one application, because the driver runs inside it; there are some hardware tricks to work around that, but mainly it's one application that owns the card. Then the framework needs explicit support for all the NIC models out there; that's not a big problem with DPDK, because it's such a big project that basically every NIC out there has a DPDK driver. And there's limited support for interrupts, but it turns out interrupts are not useful when you are building something that processes more than a few hundred thousand packets per second, because the overhead of an interrupt is just too large. Interrupts are mainly a power-saving thing for when you run into low load, but I don't care about the low-load scenario and power saving, so for me it's polling all the way, burning all the CPU. And you lose, of course, all access to the usual kernel features. Well, time to ask: what has the kernel ever done for us? Well, the kernel has lots of mature drivers. Okay, but what has the kernel ever done for us, except for all these nice mature drivers?
There are very nice protocol implementations that actually work. The kernel TCP stack is a work of art; it actually works in real-world scenarios, unlike all these other TCP stacks that fail in some cases or don't support the features we want. So there is quite some nice stuff. But what has the kernel ever done for us, except for these mature drivers and these nice protocol stack implementations? Okay, quite a few things, and we are throwing them all out. One thing to note is that we mostly don't care about these features when building our packet-forward-modify-router-firewall thing, because these are mostly high-level features. Mostly, I'm saying. But it's still a lot of features that we are losing; building a TCP stack on top of these frameworks, for example, is kind of an unsolved problem. There are TCP stacks, but they all suck in different ways.

Okay, we lost features, but we didn't care about the features in the first place; we wanted performance. Back to our performance figures: we have 300 to 600 cycles per packet available. How long does it take, for example in DPDK, to receive and send a packet? It's around 100 cycles to get a packet through the whole stack, from receiving it, not processing it, but getting it to the application and back to the driver to send it out. 100 cycles. The other frameworks typically play in the same league; DPDK is somewhat faster than the others because it's full of magic SSE and AVX intrinsics and the driver is kind of black magic, but it's super fast. Now, for a more real-world scenario, take Open vSwitch again as the example; earlier, the kernel version did 2 million packets per second. Open vSwitch can be compiled with an optional DPDK backend: you set some magic flags when compiling, it links against DPDK, uses the network cards directly, and runs completely in user space. And now it's a factor of around 6 or 7 faster: we can achieve 13 million packets per second with the same processing on a single CPU core. Great.

Where do these performance gains come from? Mainly two things, and this is compared to the kernel, not compared to sockets. What people often say is that this is zero-copy, which is a stupid term, because the kernel doesn't copy packets either; it wasn't copying packets that was slow, it was other things. Mainly, it's batching, meaning it's very efficient to process a relatively large number of packets at once, and that really helps. The other thing is reduced memory overhead: the sk_buff data structure is really big, and if you cut that down, you save a lot of cycles. These DPDK figures already include memory management, which DPDK, unlike some other frameworks, brings along; that is already part of those cycles.

Okay, now we know that these frameworks exist. The next obvious question is: can we build our own? Can we build our own driver? Well, why? First, for fun, obviously, and then to understand how that stuff works, how these drivers work, how these packet processing frameworks work. In my work in academia I've seen a lot of people using these frameworks. That's nice, because they are fast and they enable a few things that just weren't possible before. But people often treat them as magic black boxes: you put your packets in, and then it magically is faster.
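From the application's point of view a DPDK forwarding loop really does look almost magically simple, and it also shows the batching mentioned above. The sketch below is a rough illustration, not code from the talk; it assumes the EAL, memory pools, and ports have already been initialized elsewhere, and the port and queue numbers are placeholders.

```c
/* Sketch of a DPDK-style forwarding loop, illustrating batching:
 * up to 32 packets are received and transmitted per call, so the
 * fixed per-call costs are amortized over the whole burst.
 * EAL initialization, mempool and port/queue setup are assumed to
 * have been done elsewhere; port and queue numbers are placeholders. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void forward_loop(uint16_t rx_port, uint16_t tx_port) {
    struct rte_mbuf *bufs[BURST_SIZE];
    for (;;) {
        /* Receive a whole burst of packets with a single call. */
        uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;
        /* (per-packet processing would go here) */
        /* Send the whole burst back out with a single call. */
        uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, bufs, nb_rx);
        /* Free any packets the TX queue did not accept. */
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}
```

Receiving and sending up to 32 packets per call is what amortizes the fixed costs; all the complexity lives inside the driver, which is exactly what the rest of the talk looks into.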
And sometimes I don't blame them: if you look at the DPDK source code, there are more than 20,000 lines of code for each driver. Just as an example, the receive and transmit functions of the ixgbe driver in DPDK are one file with around 3,000 lines of code, and they do a lot of magic just to receive and send packets. No one wants to read through that.

So the question is: how hard can it be to write your own driver? Turns out it's quite easy. This was basically a weekend project. I have written a driver, called ixy, in less than 1,000 lines of C code. That is the full driver for these 10G network cards, plus the full framework to get applications running, plus two simple example applications. It took me less than two days to write it, and then two more days to debug it and fix performance.

I'm building this driver for the Intel ixgbe family. This is a family of network cards that you know if you have ever had a server with 10G connections, because almost all servers with 10G ports have these Intel cards; they are also embedded in some Xeon CPUs and sit as onboard chips on many mainboards. The nice thing about them is that they have a publicly available datasheet, meaning Intel publishes about 1,000 pages of PDF that describe everything you ever wanted to know when writing a driver for them. The other nice thing is that there is almost no logic hidden behind black-box magic firmware. Many newer network cards, especially the newer Mellanox ones, hide a lot of functionality behind firmware, and the driver mostly just exchanges messages with the firmware, which is kind of boring. With this family that is not the case, which I think is very nice.

So how can we build a driver for this, in four very simple steps? One, we remove the driver that is currently loaded, because we don't want it to interfere with our stuff. Okay, easy so far. Two, we memory-map the PCIe memory-mapped I/O address space; this allows us to access the PCI Express device. Three, we figure out the physical addresses of the memory regions of our process and use them for DMA. And step four, slightly more complicated than the first three: write the driver.

First thing to do, we figure out where our network card is. Say we have a server and we plugged in our network card; it gets assigned an address on the PCI bus. We can figure that out with lspci. We need this address in a slightly different format, the fully qualified ID, and then we can remove the current driver by telling the currently bound driver to unbind that specific device. Now the operating system doesn't know that this is a network card; it doesn't know anything, it just knows that there is some PCI device with no driver.

Then we write our application. This is written in C, and we just open this magic file in sysfs and mmap it. There is no magic there, it's just a normal mmap, but what we get back is a kind of special memory region: the memory-mapped I/O region of the PCI device, where all the registers are available. I will show you what that means in just a second. If we go through the datasheet, there are hundreds of pages of tables like this, and these tables tell us which registers exist on that network card, the offsets they have, and a link to a more detailed description, which looks like this. For example, the LED control register is at this offset, and within the LED control register, all these registers are 32 bits, there are bit offsets; bit 7 is called LED0 blink.
If we set that bit in that register, one of the LEDs will start to blink, and we can do that simply through our magic memory region, because all the reads and writes we do to that memory region go directly over the PCI Express bus to the network card, and the network card does whatever it wants with them. It doesn't even have to be a real register; it's basically just a command sent to the network card, and mapping that into memory is just a nice and convenient interface. This is a very common technique that you will also find when you do some microprocessor programming or similar. One thing to note: since this is not real memory, it is also not cached, there is no cache in between. Each of these accesses triggers a PCI Express transaction and takes quite some time, speaking of lots of cycles, where lots means a hundred or a few hundred cycles, which is a lot for me.

So how do we handle packets? We now have access to these registers, we can read the datasheet and write the driver, but we need some way to get packets through this thing. Of course it would be possible to build a network card that transfers packets via this memory-mapped I/O region, but that would be kind of annoying. The second way a PCI Express device communicates with your server or MacBook is DMA, direct memory access. A DMA transfer, unlike the memory-mapped I/O stuff, is initiated by the network card, which means the network card can just write to arbitrary addresses in main memory. The network card offers so-called rings, which are queue interfaces, for receiving packets and for sending packets. There are multiple of these queues because this is how you do multi-core scaling: if you want to transmit from multiple cores, you allocate multiple queues, each core sends to one queue, and the network card merges these queues in hardware onto the link. On the receive side, the network card can either hash over the incoming packet, that is, hash over the protocol headers, or you can set explicit filters. This is not specific to network cards; most PCI Express devices work like this. GPUs have command queues, NVMe PCI Express disks have queues, and so on.

So let's look at queues using the example of the ixgbe family, but you will find that most NICs work in a very similar way; there are sometimes small differences, but mainly they work like this. These rings are just circular buffers filled with so-called DMA descriptors. A DMA descriptor is a 16-byte struct: eight bytes are a physical pointer to some location where the packet data is, and eight bytes are metadata, information like 'I fetched this', 'this packet needs VLAN tag offloading', or 'this packet had a VLAN tag that I removed'. What we then need to do is translate virtual addresses from our address space to physical addresses, because the PCI Express device of course needs physical addresses. We can do that using procfs, via /proc/self/pagemap. The next thing is that this queue of DMA descriptors in memory is itself also accessed via DMA, and it is controlled the way you would expect a circular ring to work: it has a head and a tail, and the head and tail pointers are available via registers in the memory-mapped I/O address space. As a picture, it looks kind of like this.
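Here is a small sketch of the two mechanisms just described: mapping the device registers via sysfs and toggling the LED blink bit, plus translating a virtual address to a physical one via /proc/self/pagemap. This is an illustration, not the actual ixy code; the PCI address 0000:03:00.0 is a placeholder, and the LEDCTL offset (0x00200) and LED0 blink bit (bit 7) are my reading of the 82599 datasheet, so double-check them before relying on this.

```c
/* Sketch only: maps the registers of a NIC at the placeholder PCI
 * address 0000:03:00.0 (driver already unbound) and makes an LED
 * blink, then shows the pagemap-based address translation.
 * Register offset/bit are my reading of the 82599 datasheet
 * (LEDCTL at 0x00200, LED0_BLINK = bit 7); verify before use.
 * Both parts need root. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the device's register space (BAR0) via its sysfs resource0 file. */
static volatile uint32_t *map_registers(const char *path) {
    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return NULL; }
    struct stat st;
    fstat(fd, &st);  /* the file size is the size of the BAR */
    void *regs = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return regs == MAP_FAILED ? NULL : (volatile uint32_t *)regs;
}

/* Translate a virtual address of this process to a physical address via
 * /proc/self/pagemap, as needed for DMA descriptors. The page must be
 * faulted in, and the PFN is only visible to privileged processes. */
static uintptr_t virt_to_phys(void *virt) {
    long pagesize = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    uint64_t entry = 0;
    /* One 8-byte entry per page; bits 0-54 hold the physical frame number. */
    pread(fd, &entry, sizeof(entry),
          (uintptr_t)virt / pagesize * sizeof(entry));
    close(fd);
    return (entry & ((1ULL << 55) - 1)) * pagesize + (uintptr_t)virt % pagesize;
}

int main(void) {
    volatile uint32_t *regs =
        map_registers("/sys/bus/pci/devices/0000:03:00.0/resource0");
    if (!regs) return 1;
    /* Every access here is a PCIe transaction, not a cached memory access. */
    regs[0x00200 / 4] |= 1 << 7;  /* LEDCTL: set the LED0_BLINK bit */

    /* Demo of the address translation on a stack buffer. */
    static uint8_t packet_buf[4096];
    packet_buf[0] = 0;  /* touch it so the page is actually mapped */
    printf("physical address: %#lx\n", (unsigned long)virt_to_phys(packet_buf));
    return 0;
}
```

Note that this only illustrates the interface; a real driver would additionally use huge pages for the DMA memory, as explained next.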
We have this descriptor ring in our physical memory on the left, full of pointers, and then we have, somewhere else, the packets in some memory pool. One thing to note when allocating this kind of memory: there is a small trick you have to do, because the descriptor ring needs to be in contiguous physical memory. If you just assume that everything that is contiguous in your process is also physically contiguous, well, no, it isn't, and if you have a bug there and the card writes to somewhere else, then your file system dies, as I figured out, which was not a good thing. So what I'm doing is using huge pages, 2 MB pages; that's enough contiguous memory, and it's guaranteed not to have weird gaps.

Now we want to receive packets, so we need to set up this ring. We tell the network card via memory-mapped I/O the location and the size of the ring, then we fill up the ring with pointers to freshly allocated buffers that are just empty, and we set the head and tail pointers to tell the network card that the queue is full. It is full at the moment, full of packet buffers that just haven't been filled with anything yet. Now what the NIC does is fetch one of the DMA descriptors, and as soon as it receives a packet, it writes the packet via DMA to the location specified in the descriptor and increments the head pointer of the queue. It also sets a status flag in the DMA descriptor once it's done writing the packet to memory. This step is important, because reading back the head pointer via MMIO would be way too slow; instead we check the status flag, because the status flag lives in normal memory that is kept in the cache, so we can check it really fast.

Next, we periodically poll the status flag. This is the point where interrupts might come in useful. There is a misconception: people sometimes believe that when you receive a packet you get an interrupt, and the interrupt somehow magically contains the packet. No, it doesn't; the interrupt just contains the information that there is a new packet, and after the interrupt you would have to poll the status flag anyway. So we now have the packet, we process it or do whatever, then we reset the DMA descriptor. We can either recycle the old packet buffer or allocate a new one, we set the ready flag in the descriptor, and we adjust the tail pointer register to tell the network card that we are done with this. We don't have to do that every time, because we don't have to keep the queue 100% utilized; we can update the tail pointer only every 100 packets or so, and then it's not a performance problem.
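A simplified sketch of that receive loop is shown below. The descriptor layout and the queue struct are deliberately simplified and partly hypothetical (the real ixgbe descriptors are unions with more fields; see the datasheet or the ixy sources for the real layout); what matters is the flow: poll the status flag, hand the packet to the application, reset the descriptor, and only occasionally touch the tail register.

```c
/* Simplified sketch of the receive path described above. The descriptor
 * layout is a hypothetical simplification; the flow is the point: poll
 * the DD status bit, hand the packet to the application, reset the
 * descriptor, and only occasionally write the tail register. */
#include <stdint.h>

#define RING_SIZE 512
#define DD_BIT 1ULL  /* "descriptor done": NIC finished writing this packet */

struct rx_desc {          /* 16 bytes, as described above */
    uint64_t buf_addr;    /* physical address of the packet buffer */
    uint64_t status_len;  /* status flags and length after write-back (simplified) */
};

struct rx_queue {
    volatile struct rx_desc *ring;  /* descriptor ring in DMA-able memory */
    void **buffers;                 /* virtual addresses of the packet buffers */
    volatile uint32_t *rdt_reg;     /* tail register, mapped via MMIO */
    uint16_t next;                  /* index where we expect the next packet */
};

static void rx_poll(struct rx_queue *q, void (*handle)(void *pkt, uint16_t len)) {
    uint16_t since_tail_update = 0;
    /* The status flag lives in normal, cached memory, so polling is cheap. */
    while (q->ring[q->next].status_len & DD_BIT) {
        uint16_t len = (uint16_t)(q->ring[q->next].status_len >> 32); /* made-up layout */
        handle(q->buffers[q->next], len);
        /* Recycle the buffer: clear the status so the slot can be reused.
         * (With real ixgbe descriptors you would also rewrite the buffer
         * address, since the NIC overwrites the descriptor on write-back.) */
        q->ring[q->next].status_len = 0;
        q->next = (q->next + 1) % RING_SIZE;
        /* Writing the tail register is an expensive MMIO access, so only
         * do it every 32 packets instead of once per packet. */
        if (++since_tail_update == 32) {
            *q->rdt_reg = (q->next + RING_SIZE - 1) % RING_SIZE;
            since_tail_update = 0;
        }
    }
}
```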
Now we have a driver that can receive packets. Next step: transmit packets. It basically works the same, so I won't bore you with the details. Then there is of course a lot of boring initialization code, which is just following the datasheet: set this register, set that register, do that. I just coded it down from the datasheet and it works, big surprise.

So now you know how to write a driver like this. A few ideas of what I want to do, or what maybe you want to do, with a driver like this. One, I of course want to look at performance, at what makes this faster than the kernel. Then I want to look at obscure hardware offloading features. In the past I've looked at IPsec offloading, which is quite interesting, because the Intel network cards have hardware support for IPsec offloading, but none of the Intel drivers support it, and it seems to work just fine, so I'm not sure what's going on there. Then security is interesting: there are obviously some security implications of having the whole driver in a user space process, and I'm wondering about how we can use the IOMMU. It turns out that once we have set up the memory mapping, we can drop all privileges, we don't need them anymore, and if we set up the IOMMU beforehand to restrict what the network card can access, then we could have a safe driver in user space that can't do anything wrong, because it has no privileges and the network card has no access beyond what the IOMMU allows. There are the performance implications of the IOMMU to look at, and so on. Of course, support for other NICs; I want to support virtual NICs; and other programming languages for the driver would also be interesting. It's just written in C because C is the lowest common denominator of programming languages.

To conclude: check out ixy, it's BSD-licensed, on GitHub. The main thing to take with you is that drivers are really simple. Don't be afraid of drivers, don't be afraid of writing your own drivers. You can do that in any language, and you don't even need to write kernel code: just map the stuff into your process, write the driver, and do whatever you want. Okay, thanks for your attention.

We have very few minutes left for questions, so if you have a question in the room, please go quickly to one of the eight microphones. Does the signal angel already have a question ready? I don't see anybody lining up at any microphones. All right, number six, please. As you're not actually using any of the Linux drivers, is there an advantage to using Linux here? Could you use any open source operating system?
I don't know about other operating systems, but the only thing I'm using from Linux here is the ability to easily map that stuff into my process. For some other operating systems you might need a small stub driver that does the mapping; you can check out the DPDK FreeBSD port, which has a small stub driver that just handles the memory mapping. Okay. All right, number two. Hi, slightly disconnected from the talk, but I'd just like to hear your opinion on SmartNICs, where they're considering putting CPUs on the NIC itself, so you could imagine running Open vSwitch on the CPU on the NIC. Yeah, I have some SmartNICs somewhere in some lab, and I've also done work with the NetFPGA. I think that's very interesting, but it's a complicated trade-off, because these SmartNICs come with new restrictions and they are not magically super fast. So it's interesting from a performance perspective to see when it's worth it and when it's not worth it, and personally I think it's probably better to do everything with raw CPU power. Thanks.

All right, before we take the next question, just for the people who don't want to stick around for the Q&A: if you really do have to leave the room early, please do so quietly so we can continue the Q&A. Number six, please. So how does the performance of the user space drivers compare to the XDP solution? It's slightly faster, but one important thing about XDP is that this is still new work and there are a few important restrictions. With the user space frameworks you can write your user space thing in whatever programming language you want; as mentioned, Snabb has a driver entirely written in Lua. With XDP you are restricted to eBPF, meaning usually a restricted subset of C, and then there's the bytecode verifier; you can disable the verifier, and everyone disables it, meaning you again have weird restrictions that you maybe don't want. Also, XDP requires patched drivers, or rather a new memory model for the drivers, so at the moment DPDK supports more NICs than XDP in the kernel, which is kind of weird, and it's still lacking many features, like sending back out via a different NIC. One very, very good use case for XDP is firewalling for applications on the same host, because you can pass a packet on to the TCP stack; that is a very good use case for XDP. But overall I think the two things are very, very different, and XDP is slightly slower, though not slower in a way that would be relevant. So, to answer the question: it's fast. All right, unfortunately we are out of time, so that was the last question. Thanks again, Paul.