Welcome to our talk. I'm Paul, and with me is Simon, and, well, we have a lot of people on these slides because we just added everyone who contributed to this project by implementing a driver or some part of a driver. As you'll notice, we are going to speak quite fast, because we have 31 slides and only 20 minutes or so. So if you're watching the stream, you can reduce the playback speed somewhere.

I'm a PhD student at the Technical University of Munich, where I research the performance of things like DPDK and other network drivers. With me is Simon, who wrote a Rust driver as his bachelor's thesis and is now a research assistant working for us. Everyone else mentioned on the slide is also a student of mine who implemented some kind of driver.

Now, user space network drivers are the area I work in. By a show of hands: who has used DPDK or Snabb or something like this? Yeah, not as many people as expected, actually. How many of you have actually written a driver? About the same. Okay, that's unusual for this kind of talk.

What I did about two years ago: I looked at a few drivers, and most of them were stupidly complicated, so I tried to write a driver that is as simple as I could make it. The result is the ixy network driver, which is a super fast user space network driver that is also super simple, because it's only a thousand lines of code and targets the Intel 82599 network card. You can check it out on GitHub. And of course I wrote it in C back then, because you obviously write drivers in C, for historical reasons or something.

Then the question is: is C really the best language to write a driver in? Well, I didn't know if it was the best language or not, so I thought: let's write drivers in all the languages. It quickly turned out that I don't speak all the languages, and it's also a lot of work to write them in all the languages. But good thing I'm working at a university, so I can just offer theses: I'm looking for a student to write a driver in Rust, in Go, in Java, in C#, in Haskell, and so on. I talked to about 30 students who were interested in doing this as a bachelor's or master's thesis. In the end I told them all it was going to be really complicated, maybe go look for something simpler, and ten of them stayed with me anyway for some reason and did it.

And this is basically how I told them to write it. These are the absolute basics: a short session with all the students where I said, you're going to do this, this and that, and it should work in your language, and if it doesn't, well, we'll figure something out. So, a few quick basics about writing a user space driver. There are basically three ways you have to talk to a PCI Express device. The first is memory-mapped I/O: you do one magic mmap call and it does the right thing, and the device registers are mapped into your user space process. You can access that from any language, obviously, because every language has some way to access memory. Sometimes you need a small wrapper of a few lines of code or so, but it usually works in any language.
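To make this concrete, here is a minimal sketch of that mmap step in Rust, using the libc crate, assuming an ixy-style setup where the kernel driver has already been unbound from the device. The PCI address and the register offset are made-up values for illustration, not real 82599 registers.

```rust
// Minimal sketch of memory-mapped I/O from user space.
use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;

fn main() {
    // resource0 is BAR0 of the device; the PCI address here is a placeholder.
    let path = "/sys/bus/pci/devices/0000:03:00.0/resource0";
    let file = OpenOptions::new().read(true).write(true).open(path).unwrap();
    let len = file.metadata().unwrap().len() as usize;

    // One "magic" mmap call and the device registers appear in our address space.
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED,
            file.as_raw_fd(),
            0,
        )
    };
    assert_ne!(ptr, libc::MAP_FAILED, "mmap failed");

    // Reading a 32-bit register is now just a volatile load at an offset.
    // 0x0008 is an illustrative offset, not a real register.
    let reg = unsafe { std::ptr::read_volatile((ptr as *mut u8).add(0x0008) as *mut u32) };
    println!("register value: {:#x}", reg);
}
```

After this one call, talking to the device is just ordinary memory access, which every language can express one way or another.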
The second way is direct memory access, or rather, the device does it: it's just the way the device talks back to you, by transferring data into memory. You have to figure out where your stuff is located, and then the device can write into whatever memory is there. That is accessible from your language even if it doesn't have pointers or anything, as long as you know where the memory is located. And then there are interrupts, which we are not using here, so I'll skip over that.

Once you know all that, there are four simple steps to actually write the driver: you remove the current kernel driver, you do the right mmap call with the right parameters, you figure out the physical addresses, which Linux tells you, and then you just write the driver, which is simpler than you'd think.

Then I told them: what do I want from your driver? Well, the same features as my C reference driver. The drivers should look the same as the C driver and have the same structure; they should have a similar architecture and the same features. At the same time, they should use all the safety features the language offers, because we wanted safer drivers with fewer bugs. And in the end, we wanted to compare all the languages against each other: how fast are they, which safety features could be used, how much performance does enabling a safety feature cost, and what's the impact of having a garbage collector in a driver? We now have the implementation fully up and working in C#, Swift, OCaml, Haskell, Go, Rust, and Python. We will show our performance graphs, but before that, Simon will talk about the Rust driver and why Rust is, in our opinion, the best language of choice for a new driver nowadays.

Thanks, Paul. So let's talk about Rust. What is Rust? Well, it claims to be a safe, concurrent and practical systems language. Sounds great for writing network drivers, doesn't it? It has no garbage collector, so there's less overhead for memory handling. It has a unique ownership system, with some rules for borrowing ownership and moving values, that enables Rust to accomplish its goal of memory safety; and it has an unsafe mode, like Go and quite a few other languages.

What is the ownership system? Well, it is the core feature of Rust, and actually it's just a simple set of rules that restrict the way memory is handled. If you take the three rules you can see on the slides, combined with the rules for borrowing ownership and moving values, you get what makes Rust memory safe. And the interesting thing is that these rules are enforced at compile time, so we have no performance penalty at runtime, and the compiled programs are not that different from C programs, but they are memory safe.

And what does that look like in our driver? Well, we have packet structs for our network packets that own some DMA memory, and these packets are passed between the users of our driver and the driver itself, with ownership passed along with the packets. So if a packet is passed to the user, only the user can modify the content of the packet, and if the packet is passed to the driver, only the driver can modify the content of the packet. So we have safe packet handling, unlike in other languages. On the bottom of the slide you can see how you would use the driver through its interface: how to receive, modify and send packets. And there's no way to screw up; for example, you cannot forget to free packets, because packets are freed automatically when they go out of scope, and yeah, that's pretty cool. It looks kind of like C, but it's safer. So that was safe code on this slide.
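As a self-contained illustration of that idea, here is a toy version in Rust. These are not the actual ixy.rs types: the Vec merely stands in for DMA memory from a mempool, and the Drop implementation plays the role of returning the buffer to it.

```rust
// Toy model of ownership-based packet handling, not the real driver types.
struct Packet {
    data: Vec<u8>, // in the real driver: a slice of DMA-able memory
}

impl Drop for Packet {
    fn drop(&mut self) {
        // In the real driver the buffer returns to the mempool here, so
        // forgetting to free a packet is impossible by construction.
        println!("packet freed automatically");
    }
}

fn rx() -> Packet {
    Packet { data: vec![0u8; 60] }
}

fn tx(pkt: Packet) {
    // tx takes ownership; the driver is now the only one allowed to touch pkt.
    println!("sending {} bytes", pkt.data.len());
}

fn main() {
    let mut pkt = rx();   // we own the packet and may modify it
    pkt.data[0] = 42;
    tx(pkt);              // ownership moves to the driver
    // pkt.data[0] = 1;   // compile error: value moved; use-after-free is impossible
}
```

The compile-time move check is exactly what replaces manual packet lifetime management in C.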
But what about unsafe? Well, unfortunately not everything can be done in safe Rust. For example, calling foreign functions and dereferencing raw pointers is unsafe. But using unsafe code is nothing unusual. The idea is to restrict the potentially bad code to a few places that can be reviewed, and to add assertions that make the unsafe code safe again.

Where did we use unsafe code in our driver? For example, to set the registers of the device. We have a set-register method that takes a register and a value and uses ptr::write_volatile to write to the memory address of the register. But before we do that, we verify that the memory address of the register is indeed inside our memory region.
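Roughly what that looks like, with illustrative names rather than the exact ixy.rs code, and a plain buffer standing in for a real BAR0 mapping:

```rust
struct IxyDevice {
    addr: *mut u8, // start of the mmap'ed BAR0 region
    len: usize,    // length of that region
}

impl IxyDevice {
    fn set_reg32(&self, reg: usize, value: u32) {
        // The assertion that makes the unsafe block safe again:
        // the write must land inside our mapped region.
        assert!(reg + 4 <= self.len, "register {:#x} out of bounds", reg);
        unsafe {
            std::ptr::write_volatile(self.addr.add(reg) as *mut u32, value);
        }
    }
}

fn main() {
    // Stand-in for a real device mapping, just to exercise the check.
    let mut fake_bar = vec![0u8; 4096];
    let dev = IxyDevice { addr: fake_bar.as_mut_ptr(), len: fake_bar.len() };
    dev.set_reg32(0x0008, 0x4000_0000);
    assert_eq!(fake_bar[0x0008..0x000c], 0x4000_0000u32.to_le_bytes());
}
```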
Yeah, now we have some great code, but is it fast? To figure that out, we set up a test bed to benchmark all our drivers. As you can see, we have two servers: a packet generator and the device under test, connected with two 10 gigabit links, bidirectionally. We use MoonGen, written by Paul, as the packet generator, because it's obviously the best packet generator. And on the device under test, we run a simple bidirectional packet forwarder that we implemented on top of all our drivers, in all the languages.

So let's have a look at the results. What did we do? Well, we looked at the throughput of our drivers with different batch sizes, that is, how many packets are passed to the PCIe device at once. On the x-axis you can see the batch size, on the y-axis the packet rate in million packets per second, and the different plots in the graph are the different languages. As you may notice, C and Rust are on top of all the other drivers. The functional programming languages are quite slow, and Python is by far the slowest, with less than 200,000 packets per second. All in all, batching has a huge influence on performance, and it is one of the main reasons why user space drivers are faster than kernel drivers: kernel drivers usually operate on a batch size of one on transmission, and user space drivers don't. Another thing to note is that batch sizes higher than 64 are worse than batch sizes between 32 and 64, because we get more cache misses. But we did not only look at the throughput of our drivers, we also looked at the latency. So I'm going to hand back to Paul to tell you something about those results.

Thanks. There are obviously two things to performance: one is throughput, the other is latency. What you can see here is a graph of the latency. It's an HDR histogram, meaning the x-axis is the percentile and the y-axis is the latency. So, for example, if there's a value of 100 microseconds at the 99th percentile, then 99% of packets are handled faster than 100 microseconds and 1% are slower. And it can happen quite often that a packet is handled slower for whatever reason. One reason why the curve goes up a little bit even for C and Rust at the 99.999th percentile is that we periodically print statistics in the main thread, which is not necessarily a good design choice, but it's something that can be implemented the same way in every language, so we did it like this. And you can clearly see that the garbage-collected languages are significantly worse when it comes to tail latency. For Haskell, for example, there are some random packets that just take 200 microseconds or so, and that is after tuning it for low latency. Before tuning a few of these languages for low latency, we had milliseconds of garbage collection pauses in there; with tuning, we get to reasonable latency levels even in garbage-collected languages. You can see that Go is the fastest garbage-collected language here, because it has the fastest garbage collector, and the other ones are all significantly slower.

This was the forwarding latency at 1 million packets per second. If we switch over to 10 million packets per second, note that a few of the languages are missing from the graph. That's just because they can't handle forwarding 10 million packets per second on this specific system, and adding languages that can't cope with the load doesn't make much sense in a latency graph: the latency is then just the size of the buffer, a few milliseconds, which is too slow and boring. So here, again, C and Rust are on top of each other. Then there's Go with some more latency due to the garbage collector, and then there's C# with a little bit more latency, but still okay. This is also with tuned garbage collection settings; otherwise C# would be a little bit slower. An interesting trade-off of switching the C# garbage collector to low-latency mode is that it costs only about 1% of throughput but reduces latency from around 300 microseconds to 60 or 70. And going even further, all the way up to 20 million packets per second: for Go, that is basically the capacity this particular CPU can forward at, and so there is significantly higher latency. And for Rust and C, we finally see a small difference between Rust and C, in this case just because Rust completely loads the system while C has some spare resources, meaning it processes slightly smaller batches.

Then another thing for improved safety: we have all these language features, and we have now shown that they can be used with barely any performance impact. But the other thing is that user space drivers still usually run as root. Can we do it without root privileges? Well, why do you even need root for a user space driver? Usually it's because mapping the PCIe resources requires root, allocating the non-transparent huge pages that you need for technical reasons for the DMA memory from user space requires root, and locking the memory also requires root. Now the question is: can we write a small program that handles the setup, then drops all the privileges and passes the resources on to another, unprivileged program? Well, yes, we can do that, but then it's not secure anymore. To understand why, we have to look at what our system looks like. This is the overview of the system, with the CPU on top running the application. The application talks to a thing called the MMU, the memory management unit. If you want to access memory, the access goes through the MMU, the MMU checks whether we are allowed to do that, and everything's fine, great. If we access the device, that's also checked: can we access the device? Yes, great. Now the question is what happens if the device wants to talk to memory directly, via its DMA engine. Well, that completely bypasses the MMU, meaning even if we restrict our program to run without root privileges, it could just tell the device: okay, please read this memory for me. Great, thank you.
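For illustration, here is a sketch of that setup-then-drop approach in Rust, using setresgid/setresuid from the libc crate. The uid and gid values are placeholders, and as just explained, on its own this is not secure, because the device can still DMA anywhere.

```rust
// Sketch of "privileged setup, then drop root"; not sufficient on its own,
// because the device's DMA engine bypasses the MMU entirely.
fn drop_privileges(uid: libc::uid_t, gid: libc::gid_t) {
    unsafe {
        // A real program would also drop supplementary groups via setgroups().
        // Order matters: drop the group first; after dropping the user we
        // would no longer have permission to change the group.
        assert_eq!(libc::setresgid(gid, gid, gid), 0, "failed to drop group");
        assert_eq!(libc::setresuid(uid, uid, uid), 0, "failed to drop user");
    }
}

fn main() {
    // ... as root: unbind the kernel driver, mmap resource0, allocate huge pages ...
    drop_privileges(65534, 65534); // nobody/nogroup, illustrative values
    // ... run the actual driver unprivileged, keeping the already-mapped resources ...
}
```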
Now, what you obviously need is a thing called an IOMMU, which is available on all virtualization-enabled CPUs, meaning basically all server CPUs. This thing just sits in the path between the PCI device and the memory, and it can be configured from Linux. And this is really messy to implement. I had a student do it, and it was really painful for him, apparently. But nowadays, it's working.

Yeah, basically, to conclude: we have all the repositories, it's all free and open source, it's all available on GitHub. You can scan the QR code, but almost no one ever scans QR codes; you can also find the project on GitHub, where there is a meta repository containing links to all the implementations. It also contains the benchmark graphs that I have just shown. And somewhere on the internet there is a longer version of this talk available, with a slightly different focus. Things to take away: drivers are actually quite simple. You shouldn't be afraid of drivers. You should write them, and you can do it in any language. I mean, we even have a PCI Express driver in Python. It's slow, but it works, and it might be worth it for a different device that doesn't require high performance. Okay, great. Thanks for your attention. Do we have time for questions?

I think we do have time for questions. Yeah.

How does it compare to DPDK? It's completely unrelated, different code. Today it's about as fast as an older version of DPDK; it can't keep up with the new vectorized drivers and so on. And for the comparison to a kernel driver, just compare DPDK to the kernel driver. Yeah.

The packet generator, you mean? Yeah, it's the MoonGen packet generator. MoonGen, yes. There is a FOSDEM talk about it from a few years ago that you can look up. It's also available on GitHub, and it's the best packet generator ever, because you can script it with Lua, and then it's amazing.

Why is the Rust driver slower than the C version? There are a lot of bounds checks, because the goal was to write idiomatic code in all the languages. That means when I wrote the C code, I wrote idiomatic C code, which means not doing bounds checks, because I know my indexing is right. We have some profiling results that show that the Rust driver executes around 40 to 50% more instructions to do the same work, but only takes 10% more cycles to do so: it achieves an IPC of 2, whereas the C version has an IPC of only 1.3. Enabling integer overflow checks in Rust adds an additional eight instructions, seven of which are branches. At the same time, it only reduces the throughput by 1%, again showing that a modern out-of-order CPU can just speculate away almost all of these safety checks.

Thank you, Paul. Thank you, Simon. If you enjoyed Paul and Simon's talk, please leave feedback on the talk page in the schedule. Thank you so much.