that means I spend a lot of time doing low-level optimization and looking at what makes the system fast and what makes it slow. I mostly work on this in my research, and I have a lot of material here, so I hope you can follow along. So, let's talk about network cards.

You are listening to "Demystifying Network Cards" from the 34th Chaos Communication Congress, in the live translation. We have a few technical problems, which means a delay of several seconds, because the audio in our translation booth isn't working properly; we apologize for that. We'll try to get the audio stream sorted out as we go.

I'm only interested in the lower layers; the application itself doesn't matter to me. I just want to look at individual packets, for example at an IP address or a port; whether the content is an audio stream or a cat video is irrelevant to me. Normally the kernel does everything for you, from the driver up to TCP, and the application just uses sockets. The worst case for this interface is an application that deals with lots of small packets, because the overhead comes with each packet, not with each byte. So many small packets are a problem for this interface. And when I say problem, I mean performance, because this talk is only about performance.

Here are a few numbers, so you can see how many packets go over a 10 gigabit link: that's about 15 million packets per second with minimum-size packets. But 10 gigabit was state of the art around 2016; today we already have 100 gigabit links, which means about 150 million packets per second. And how much time does that leave us? If you have a CPU clocked at, say, 3 gigahertz, you have about 200 cycles per packet if you want to handle a 10 gigabit link. Of course there are faster CPUs, and you can use several cores.

So a reasonable performance goal is about 5 to 10 million packets per second per CPU core for simple forwarding, that is, receiving a packet and passing it on. And you want many cycles left over, because the application needs them too; the CPU shouldn't spend all its time just forwarding. All in all, that's a budget of about 300 to 600 cycles per packet on a 3 gigahertz CPU.

So how expensive is it to go through the kernel from user space? Very expensive for a single packet. In performance measurements, single-core packet forwarding through the kernel reaches maybe 300,000 packets per second, or perhaps 1 million per second; you can gain a factor of 2 with a little tuning, but that still isn't where we need to be. The interface itself is the problem.

So the next idea: we just move the application into the kernel and write it as a kernel module. That sounds like a pretty stupid idea, and you'd think it is, but it's not that unusual: there are web servers at Microsoft that are written as kernel modules, the latest Linux kernel can do TLS in the kernel, and there are in-kernel caches. So it's not that unusual. But it comes with a lot of disadvantages. It's very hard to develop, most user-space tools don't work, you have to use C as the programming language, and a bug can take down the whole kernel. You also have to use the kernel's sk_buff data structure, which is used everywhere in the kernel and is really big; you need about 400 cycles just to handle it. With single-core forwarding in the kernel, the overhead of the kernel and its data structures adds up, and you get about 2 million packets per second. Faster than the normal user-space path, but still too slow.
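To make the cycle budget concrete, here is a quick back-of-the-envelope check of the numbers just mentioned; a plain C sketch of the arithmetic, not something from the talk:

```c
#include <stdio.h>

int main(void) {
    double cpu_hz   = 3e9;      // 3 GHz core, as in the talk's example
    double pps_10g  = 14.88e6;  // 10 Gbit/s line rate with 64-byte frames
    double pps_100g = 148.8e6;  // 100 Gbit/s line rate with 64-byte frames

    printf("cycles/packet at 10G:  %.0f\n", cpu_hz / pps_10g);   // ~200
    printf("cycles/packet at 100G: %.0f\n", cpu_hz / pps_100g);  // ~20

    // The talk's target: 5-10 million packets per second per core,
    // which corresponds to a budget of roughly 300-600 cycles per packet.
    printf("budget at 10 Mpps: %.0f cycles\n", cpu_hz / 10e6);
    printf("budget at 5 Mpps:  %.0f cycles\n", cpu_hz / 5e6);
    return 0;
}
```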
We want to be faster than that. One of the hot new things at the moment is XDP, which fixes some of these problems, but I'll come back to that later, so let's set XDP aside for now.

The problem was that our application has to go through the kernel, and that's too slow. So instead, you can do almost everything in user space. There are libraries for user-space packet processing: you link your application against such a library, one part of it sits in the kernel, and the two share a mapped memory region. Through this shared memory the application communicates directly with the driver, and that's much faster. The operating system is largely bypassed here; it often no longer really knows the network card is there, which causes its own problems. Examples of this approach are netmap, PF_RING, and a few others.

The problems: there is no standard API, so you can't simply port an application from one framework to another. You need a custom kernel module and patches for the network drivers, which are difficult to maintain. They need exclusive access to the network card, for one single application. You lose the usual features the kernel offers, which can be difficult. The hardware offload features of the network card are usually unavailable, because those normally go through the kernel. And since the application talks directly to the card, it usually needs card-specific adjustments; each framework only supports one or two NIC families. So you are quite limited.

Can we be even more radical, given all these problems? Yes, that's possible. You can remove the kernel completely and put everything into a single application. That means the driver itself talks directly to the network card, the driver lives entirely in user-space memory, and you just have to initialize the card correctly. Then it looks like this: you push the driver into user space and nothing is running in the kernel anymore. That's super fast, and you can also use it to implement obscure hardware features.

I'm not the first to do this. There are already two big frameworks that work this way. One is DPDK, which is pretty big: it's a Linux Foundation project and has support from practically all NIC vendors, who write drivers for it themselves. The second is Snabb, which is very exciting because the drivers are written not in C but in Lua, a scripting language. It's interesting to see a driver written in a scripting language.

So which problems have we solved, and what's new? We still don't have a standard API. You still need exclusive access to the NIC, because the driver runs in user space; there are some tricks to work around that, but fundamentally it stays that way. NIC support is mostly solved: with DPDK it's not a problem, because all the manufacturers support it. Interrupts are gone, but if you want high performance you don't want them anyway, because interrupts come with a lot of overhead; they help when you want to save energy at low load, but that doesn't matter for our use case. And of course the usual kernel features are missing.

So let's look at what the kernel actually does for us, because the kernel has a lot of mature drivers and nice protocol implementations that are proven to work in the real world. There's a lot of nice functionality you'd like to have, and a lot more beyond that. These frameworks throw all of it away, but for our use cases it doesn't play a major role.
For packet forwarding and firewall applications, those are high-level features you don't need. But a lot of functionality does fall away: if you want to build a TCP stack on top of these user-space frameworks, it's very difficult and there are problems. So we lose a lot of features, but for our purposes that doesn't matter; what we want is performance.

Our budget was 300 to 600 cycles per packet. So how long does this take? With DPDK, for example, it takes about 100 cycles to receive a packet and send it back out; that's not processing, just getting the packet in and back out again. The other frameworks are in the same ballpark, though DPDK is usually a bit faster. So it's super fast. There is also a real-world scenario with Open vSwitch and DPDK: you can compile Open vSwitch with a DPDK backend, so it links against DPDK and uses the network card directly from user space, and then it's 6 or 7 times faster. That means on a single CPU core you can forward about 13 million packets per second.

People often say the difference compared to the kernel and sockets is "zero copy", but that's a silly term, because the kernel doesn't copy packets either when it forwards them; copying is not the point. The real point is that these frameworks process packets very efficiently in large batches. For comparison, a single cache miss or memory allocation alone can already cost on the order of 50 cycles.

Okay, now we know these frameworks exist. The next obvious question is: can we build our own driver? And why would we? First of all, it's fun, and then of course you learn why this is fast, how these drivers work, and how these frameworks work. In my academic work I've seen that many people use these frameworks because they are simple and fast, but they treat them as a black box: take it, it makes things faster, done. And these frameworks come with a lot of code, thousands of lines, and they do a lot of magic to be fast, which nobody really wants to dig into.

So the question is: how difficult is it to do it yourself? It turned out to be surprisingly easy; it was essentially a weekend project. I wrote ixy, a user-space driver for ixgbe cards. It's very little code, it comes as a small framework with a few example applications, and it took a few more days to debug it and tweak the performance.

I built this driver for the Intel ixgbe family, a family of network cards you may know if you have a server, because practically all servers with 10 gigabit Ethernet have these cards; they also exist as on-board chips on mainboards. The nice thing about these cards is that the datasheet is public, and there is no hidden logic: many newer network cards hide a lot of functionality in the firmware, and the driver mostly just talks to that firmware, which is very boring. This family is not like that, which is very nice.

So how do you build a driver for it? There are a few very simple steps. First, we take the kernel driver that is currently loaded and simply remove it. Second, we memory-map the PCI resource file, which gives us the ability to talk to the PCI Express device. And then we set up DMA, which is a bit more complicated than the first two steps.

Concretely, you first have to find the network card: it has an address on the PCI bus, which you can look up with lspci. Then you unbind the kernel driver for that specific device, and from then on the kernel no longer knows it's a network card; it knows nothing about it anymore. A sketch of this step follows below.
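As a concrete illustration, here is a minimal sketch of the unbind step; my code, not the talk's, and the PCI address 0000:03:00.0 is a made-up example. Unbinding works by writing the device's PCI address into the driver's unbind file in sysfs:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Unbind the kernel driver from a PCI device; equivalent to:
//   echo "0000:03:00.0" > /sys/bus/pci/devices/0000:03:00.0/driver/unbind
int main(void) {
    const char *pci_addr = "0000:03:00.0";  // example address from lspci
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/%s/driver/unbind", pci_addr);

    int fd = open(path, O_WRONLY);
    if (fd == -1) {
        perror("open");  // no driver bound, or not running as root
        return 1;
    }
    if (write(fd, pci_addr, strlen(pci_addr)) == -1) {
        perror("write");
        close(fd);
        return 1;
    }
    close(fd);
    puts("kernel driver unbound");
    return 0;
}
```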
Then, in our application, here in C for example, we open this magical file in sysfs, and in the end it's a simple mmap. What we map is a special memory area of the PCI device, and that's where all the registers are. Let me show you what that means. If you look into the datasheet, you find hundreds of pages of tables like this one: the registers of the card, each with an offset and a detailed description. Take the LED controller as an example. These registers are all 32 bits wide, the register sits at some offset, and there is a bit like "LED 0 Blink"; set that bit in this magical memory area and one of the LEDs on the card starts to blink. All read and write accesses to this area go directly over the PCI Express bus, and the card then does whatever it wants with them. So it's not really a register in the classical sense; writing to it simply sends a command to the card. And that's the standard interface for this kind of thing; microcontrollers work the same way, it's very common. These memory areas are also not cached, so every write access goes out on the PCI Express bus, and it can take many cycles, hundreds of cycles, until it actually takes effect. The first sketch below shows what this looks like in code.

So how do you handle packets? We now have access to this register area and can write the driver, but how do we get packets? You could build a network card that transfers packets through this memory-mapped I/O region, but that would be pretty tedious. The other way to communicate with PCI Express devices is DMA, direct memory access, and this is initiated by the network card: the card itself can write to arbitrary memory addresses if it's set up correctly. Packets are exchanged through so-called rings, the queue interfaces. You usually have multiple queues, because if you want to work with multiple CPU cores you need one queue per core, and the network card then distributes packets across the queues; as a rule it computes a hash over the packet headers to assign each packet to a queue. This is not specific to network cards, by the way: many PCI Express devices have such queues, for example NVMe PCI Express flash storage.

So let's take a look at these queues for the ixgbe family; other cards work very similarly. These rings are circular buffers filled with DMA descriptors. A descriptor is 16 bytes long: 8 bytes for a location in memory and 8 bytes of metadata ("I received something", "this packet has such-and-such a VLAN tag", that kind of information). We also need to translate our virtual addresses into physical addresses, because the PCI Express device of course works with physical addresses. You can look those up in the procfs pagemap file; the second sketch below shows the descriptor layout and this translation.

The card then fetches these DMA descriptors, again via DMA, and the whole thing works as a ring buffer: there is a head pointer and a tail pointer, and both are registers on the card. So the picture looks like this: here is the descriptor ring, and somewhere else in memory are the packet buffers it points to. There is one thing you have to pay attention to when allocating this memory, a small trick: the descriptor memory has to be physically contiguous. If you just assume that memory which is contiguous in your process is also contiguous in the hardware, that is simply not true, and if you get it wrong the card will DMA over some other data, and then, say, your file system dies, which is of course not so good. What I do is use 2 megabyte huge pages; within a huge page there are no strange gaps, and that's enough for our purposes. So now we have to set up this ring.
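To make the register interface just described concrete, here is a minimal sketch; my illustration, not code from the talk. The PCI address is again a placeholder, and the LEDCTL offset (0x00200) and the LED0 blink bit (bit 7) are taken from the Intel 82599 datasheet as I read it, so check the datasheet before relying on them:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define LEDCTL     0x00200    // LED control register offset (82599 datasheet)
#define LED0_BLINK (1u << 7)  // blink bit for LED 0

int main(void) {
    // BAR0 of the device, exposed by Linux as a plain file in sysfs
    int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource0", O_RDWR);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    // Map the register area. Accesses go straight over the PCIe bus,
    // hence volatile: the compiler must not cache or reorder them.
    volatile uint32_t *regs = mmap(NULL, st.st_size,
                                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    // Setting the bit sends a command to the card, not to RAM.
    regs[LEDCTL / 4] |= LED0_BLINK;
    sleep(5);                          // watch the LED blink
    regs[LEDCTL / 4] &= ~LED0_BLINK;   // and turn it off again
    return 0;
}
```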
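And here is a sketch of the two building blocks just described, again my illustration rather than the talk's code: the simplified 16-byte descriptor layout and the virtual-to-physical translation via the pagemap file. The pagemap entry format (physical frame number in bits 0 to 54) is documented by the kernel; the page must be resident when you look it up, and reading pagemap requires privileges:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

// Simplified 16-byte receive descriptor as described above: the driver
// fills in a physical buffer address; on receive, the card writes back
// metadata (packet length, VLAN tag, status bits such as "descriptor done").
struct rx_descriptor {
    uint64_t buffer_addr;  // physical address of the packet buffer
    uint64_t metadata;     // writeback: length, status, VLAN tag, ...
};

// Translate a virtual address to a physical one via /proc/self/pagemap.
// Each 8-byte entry holds the physical page frame number in bits 0-54.
// The page must already be resident (touch it first), and this needs root.
uintptr_t virt_to_phys(void *virt) {
    long page_size = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd == -1) { perror("open pagemap"); exit(1); }

    uint64_t entry;
    off_t offset = (uintptr_t)virt / page_size * sizeof(entry);
    if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
        perror("pread"); exit(1);
    }
    close(fd);

    uint64_t pfn = entry & 0x7fffffffffffffULL;  // bits 0-54: frame number
    return pfn * page_size + (uintptr_t)virt % page_size;
}
```

In a real driver you would run this translation once per buffer at setup time, on memory allocated from the 2 MB huge pages mentioned above, so the physical addresses stay valid and contiguous.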
We configure a few things and fill the ring with pointers to freshly allocated packet buffers; initially the queue is empty, and then we set the head and tail pointers so that the queue is as full as possible. It's never completely full, but here is what the network card does: when a packet arrives, it writes the packet via DMA to the address given in the descriptor, and then it also writes back into the descriptor to say that it's done, by setting a status flag. This step is important, because going and reading the head pointer register every time would be far too slow. Instead we just check that status flag in the descriptor, and that check is served by the cache: as long as nothing changes, the flag sits in the cache and can be read quickly, and the card's DMA write invalidates the cache line when something does change.

So the first step in the receive path is to poll this status flag periodically. This is where interrupts could come in. People sometimes have the feeling that the interrupt contains the packet; that is not the case. The interrupt only says that there is a packet somewhere, and you still have to read out the status flags and descriptors. So we see the flag, we process the packet, we can recycle the old buffer or allocate a new one, and then we set the tail pointer so that the network card knows we are done with this descriptor. Two details: we can never use the ring buffer at 100 percent capacity, and we update the tail pointer only in batches rather than for every single packet, because every register write is expensive. A minimal receive loop along these lines is sketched below. So now we have a driver that can receive packets.
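A minimal receive loop along these lines might look as follows; a sketch under the assumptions above, where rx_ring, the RDT register index, the DD bit position, and the helper functions are illustrative names of mine, not the talk's code:

```c
#include <stdint.h>

#define RX_RING_SIZE 512
#define DD_BIT (1ull << 0)    // "descriptor done" status flag (illustrative)
#define RDT    (0x1018 / 4)   // receive tail register index (illustrative)

// Same 16-byte layout as in the sketch above.
struct rx_descriptor { uint64_t buffer_addr; uint64_t metadata; };

extern volatile uint32_t *regs;  // mmap'ed register area from earlier
extern volatile struct rx_descriptor rx_ring[RX_RING_SIZE];
extern void handle_packet(uint16_t index);      // process or forward
extern void reset_descriptor(uint16_t index);   // install a fresh buffer

// Poll the ring and process up to batch_size packets; returns packets seen.
// The tail register is written once per batch, not per packet, because
// every register write is a costly PCIe transaction.
static uint16_t poll_rx(uint16_t *next_index, uint16_t batch_size) {
    uint16_t done = 0;
    while (done < batch_size) {
        uint16_t i = *next_index;
        // Check the writeback status flag instead of reading the head
        // register: a cheap cached read until the NIC's DMA write
        // invalidates the cache line.
        if (!(rx_ring[i].metadata & DD_BIT)) {
            break;                  // no more packets for now
        }
        handle_packet(i);
        reset_descriptor(i);        // hand a fresh buffer back to the ring
        *next_index = (i + 1) % RX_RING_SIZE;
        done++;
    }
    if (done > 0) {
        // One tail update for the whole batch tells the NIC that these
        // descriptors are ready for reuse.
        regs[RDT] = (*next_index + RX_RING_SIZE - 1) % RX_RING_SIZE;
    }
    return done;
}
```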
What's next? I'm skipping a lot of boring details here. There is a lot of initialization code, and for that you just follow the datasheet: it says do this, then do that, and so on, and you do exactly that until it works.

So what's still to do? A few ideas I want to pursue. One is to look more closely at the performance: where exactly does the speedup over the kernel come from? Then hardware offloading features, for example IPsec: there is hardware support on certain cards, but none of the usual drivers implement it. Then security: running the whole driver in user space has certain security implications, and I want to look at how we can use the IOMMU. We could, for example, drop the privileges we don't need, and then you could have a secure user-space driver that can't do anything wrong, because everything goes through the IOMMU. And of course I want to support virtual NICs (virtio) and more features in the driver.

I wrote it in C, because that's the usual language for drivers, and it's under a BSD license; you can look at the code on GitHub. So don't be afraid of drivers, and don't be afraid to write drivers yourself. And you could also write them in any other language. Thank you for your attention.

We have a few minutes left for questions, so please go to the microphones if you have a question. Signal Angel, do you have a question? I don't see anything. Nobody at the microphones yet, but over here. Okay.

Question: If you don't need anything from the Linux kernel, could you do this on another open-source operating system? Answer: I don't know much about other systems, but the only thing I use from Linux is the ability to simply map the device memory like this. You just need a small stub in the OS that does that mapping; there was already a driver that did it, and it was pretty easy.

Question: A somewhat loose question, but I'd like to hear your opinion on SmartNICs, as opposed to doing this on the CPU. Answer: I haven't worked with SmartNICs in the lab, but I think they're very interesting. It's a complicated problem, because they come with new restrictions of their own, and they're not all that fast. It's an interesting performance question whether they're worth it, or whether you're better off just adding more CPU power. An interesting related thing is XDP.

When you look at XDP, it's all very new, and there are a lot of important restrictions. In user space you can write your application in any language; with XDP you have to write eBPF, which is a restricted subset of C, and that alone is a restriction. There are some strange limitations you might not expect, XDP needs drivers that were adapted for it, and some features are still missing, like sending packets back out to the network. The interesting thing is that you can build a firewall on the same host and still hand packets on to the normal TCP stack. But most other use cases are quite different, and XDP is...
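For illustration, here is roughly what such a restricted-C eBPF program looks like; a generic XDP sketch of mine, not something from the talk. It drops UDP packets and passes everything else on to the normal kernel stack, the firewall-style use case just mentioned:

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

// Minimal XDP program: drop all UDP packets, pass everything else on to
// the kernel's normal network stack. Compiled with clang -target bpf.
SEC("xdp")
int drop_udp(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)     // bounds check: the verifier
        return XDP_PASS;                  // rejects the program without it
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->protocol == IPPROTO_UDP)
        return XDP_DROP;                  // e.g. a simple firewall rule

    return XDP_PASS;                      // hand the packet to the TCP/IP stack
}

char _license[] SEC("license") = "GPL";
```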