Driver development has been dominated by C for many, many years, and it sadly still is. C is inherently flawed, and there are better alternatives, but those are always fighting against the stigma of being slow and cumbersome to use. So today our speakers are going to tell you that this is in fact not true: you can write drivers in high-level languages, and they perform very well. Here to talk about this are Paul, Simon and Sebastian. One round of applause, please.

Thank you. Yeah, I'm Paul; with me are Simon and Sebastian. Just a few quick notes before we get started. I usually speak quite fast, so sorry to the translators and sorry to everyone listening. If you are watching this as a stream recording on media.ccc.de, there is a button to your right where you can reduce the playback speed. If you are watching on YouTube, you can also reduce the playback speed somewhere; also, please like and subscribe on YouTube. You might have already seen that we have quite a lot of names up on these slides today. These are all the people who somehow contributed to this talk, and they are all my students, including Simon and Sebastian, who both did a Bachelor thesis with me. I'm a PhD student at the Technical University of Munich, and I'm researching the performance of software packet processing systems. Today we are specifically going to talk about network drivers. We look at network drivers as a case study; it's obviously our research area, so it's the natural thing to do. And user space network drivers are all the rage now, and user space drivers are where you can use all the fancy languages.

I already talked about user space network drivers here last year. A quick recap of what I presented: that was the ixy project. ixy is a project I started to show that you can write a user space driver that is readable, understandable, and fast at the same time; the goal is to be usable for educational purposes. It's about a thousand lines of C code, full of references to data sheets, specs and so on. If you want to know more about that, watch my talk from last year. Just a quick diff since then: we've added support for VirtIO NICs, and we now have a Vagrant setup, so you no longer need real hardware to play around with it. You can check it out on GitHub.

Now, I wrote it in C back then. Why would you write a driver in C? Seems like a kind of obvious question: why wouldn't you? Most drivers are written in C, and if you're going for educational use, you might as well use the language all the other drivers are in. It's also the lowest common denominator of the systems programming languages, meaning everyone should be able to read C. And I also think that C code can be quite beautiful in some cases. Can we get a quick show of hands: who here thinks they can read C code? That's way more than expected, basically everyone raising their hands. So let's have a look at some C code. This is actual code from our driver that was added by a student. When I initially got the pull request from my student, I was like: no, we can't add this macro; the goal was to have readable code. Then we discussed it a little, and in the end we added it to the code base, because it's really necessary; there's no better way to do this in C than this macro. Who here can immediately recognize this macro and knows what it does? I see one hand. A few hands. Okay, not bad. Does it help if I show you the actual name? Who can recognize it now? Some more hands. Yeah, okay. It's a little bit of fake inheritance, commonly used in drivers to abstract over the different drivers. I copied the macro from the Linux kernel, and I searched through the Linux kernel sources and found 15,000 uses of it. So it's not at all unusual to have C code like this in your driver. I then agreed to add it to the code, even though probably almost no one can read it. The actual code also has a comment linking to a blog post that explains what this macro does.
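For readers of the transcript: the macro in question is container_of, and the classic Linux kernel definition that was copied looks like this:

    #include <stddef.h>  // offsetof

    #define container_of(ptr, type, member) ({                      \
            const typeof(((type *)0)->member) *__mptr = (ptr);      \
            (type *)((char *)__mptr - offsetof(type, member)); })

Given a pointer to a struct member, it computes the address of the containing struct by subtracting the member's offset. Drivers use it to recover a device-specific struct from a pointer to a generic struct embedded inside it, which is the "fake inheritance" mentioned above.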
But it also shows the problem: a lot of people think they can read C, but then they encounter something like this and suddenly... yeah. The point is, C can be ugly, and not only in terms of how it looks or feels while programming; it can also be ugly when it comes to security. This is a screenshot from CVEDetails.com. I don't expect you to read all these figures, but basically it shows that there are security bugs in the Linux kernel; specifically, these are all the Linux kernel bugs found in the last 20 years or so. Now you could say: why is the language to blame? You can write bad code in any language. What we would have to do is go through all these bugs and check whether they could have been prevented by using a memory-safe or otherwise better language. That seems like a lot of work, but luckily someone else already did it. Last year there was a paper by Cutler et al., who developed an operating system in Go. They looked at all code execution bugs in the Linux kernel in, I think, 2017, went through all of them manually, and tried to figure out whether they could have been prevented by a different programming language. For 17% of them, they weren't sure, because they were weird bugs and they couldn't tell. 22% were clear logic bugs: you would probably have made those in any language, so another language wouldn't have helped. But over 50% of the bugs were memory-related, use-after-frees, out-of-bounds accesses and so on, and these can be prevented by a better programming language.

Now, that's 40 preventable bugs in this case study. We looked at them to figure out how many were in drivers versus other parts of the kernel, because we are specifically looking at drivers. Well, 39 of them were in drivers; the other one was in the Bluetooth stack. And of the driver bugs, 13 were in the Qualcomm Wi-Fi driver alone. Yes, I know, I was shocked: bugs in the Qualcomm Wi-Fi driver, who would have thought. Based on these results, should you be writing new code in C in 2019? Well, probably not, if you have a choice; but you often just don't have a choice. If you're writing kernel code for some reason, sure, you can write a kernel module in Rust, but good luck getting it upstreamed. And other languages? Good luck writing a kernel module in JavaScript. That's probably not going to work, and if it does work, it's probably a bad idea. This is why we are looking at user space drivers: they can be written in virtually any language, so we are not constrained by the environment here. The question now is: are all languages an equally good choice? Can I do it in any language? Should I do it in any language?
Which language should I use? Is a JIT compiler or a garbage collector a problem in a driver? I initially wanted to write one driver in one high-level language as a case study and then evaluate it extensively. But then I thought it might be a better idea to write drivers in all the languages. It turns out I don't speak all the languages, but luckily I could recruit the help of a few students. This is a screenshot from my website at the university where we announce thesis topics, and I just added announcements for writing network drivers in Rust, in Go, in Java, in C#, in Haskell, and so on. At first, my colleagues at the university looked at me in weird ways: are you serious? You realize it's the same announcement all over again? Some of them still can't tell if I'm serious about this. But yes, I am, and I got a lot of responses from students. I think I talked to a total of 30 students who wanted to do one of these theses, and these two were among the first to talk to me. I tried to scare them all away initially; I told them it's going to be really hard, you can get an easier thesis, it's probably not a good idea, you need to know a lot of low-level stuff. And so I scared away 20 of the 30 I talked to. In the end, we did 10 theses, with quite nice results so far. A few are still ongoing, but I hope we will have drivers in 10 different languages soonish; depending on how you count, six or seven are finished. It also turns out that giving a talk here is a really nice way to recruit students, because a lot of my students mentioned that they saw my talk and contacted me afterwards for a thesis.

Now, what did I tell my students? How do you go about writing a driver? I explained to them the very basics of how to write a user space driver: how to talk to a modern PCI Express device and what you need to do. Basically, there are three ways to talk to a modern PCI Express device; we are ignoring a few legacy things here, like the old I/O port instructions. The first and simplest way to talk to a device directly is memory-mapped I/O. Memory-mapped I/O is just a magic memory area that is mapped into your process and goes directly through to the device: if you read or write that memory, the device gets the request and can reply to it. That's usually used to expose device registers. In Linux, you can just map a magic file via the UIO framework, and then you have access to it from your user space program. The second way is how the device talks to you, or to the rest of the system: direct memory access, which is just a way for the device to read and write arbitrary memory locations. For user space drivers, we have to figure out where our memory is mapped physically; then we can tell the device to write something there, and it will just show up in our process without us having to do anything in the kernel. The third way is interrupts. We will not be using interrupts here, because we don't need them for a high-speed network driver. But they're on the slide because sometimes people say you can't use interrupts from a user space driver; that is incorrect. You can use the VFIO subsystem, which has full support for interrupts, but we won't be using them here. Now, what did I tell my students about how they should go about writing their drivers?
Well, basically: remove the kernel driver, do the magic mmap call on the right magic file, figure out the physical addresses, and then just write the driver. It's really easy. We have a lot of hardware at the university, and I gave them all access to servers with 10 gigabit network cards of the Intel ixgbe family. This is a really common network card that you will find in a lot of servers, the default go-to 10 gigabit card; you will often find it on board, and some CPUs even have it embedded. The nice part about it is that it has a very, very good datasheet, publicly available, that documents basically everything. Fun fact: we found it easier to program against this hardware black box with a good public datasheet than to implement the VirtIO spec, which is an open specification with several open source implementations. But yeah. This network card is a little bit old, ten years or so, and it has the nice property of still being very low-level compared to newer network cards. If you implement this on newer cards, you are usually just exchanging messages with some firmware, and that's just boring, because the firmware implements everything. Sure, the older cards also have some firmware, but you get a lot more low-level access to the card, and you don't feel like you're just talking to a firmware; you feel like you're implementing an actual driver yourself. Those were the basic things I told my students. I will now hand over to Sebastian, who will show a little bit of C code: how to write a driver in C, and whether that could be done in a high-level language.

Thank you very much, Paul. He just showed you what we have to do in general; I'm going to take a more detailed look. First of all, of course, we have to figure out our PCI address. You can do this via lspci: you get a list of all PCI devices, and you just look for something like "Intel Corporation 82599". At the front of that line there is an address that looks kind of similar to a MAC address, and that's the address we need. With this address, we can go and unbind the kernel driver; that's basically it, you just write the address to a magic file. Next, we have to mmap our PCI register address space: we open another magic file and call mmap on it. The challenge here is mainly that every high-level language you want to do this in needs some way to actually call mmap, either via a library or, in the worst case, through some C glue code. Next, this is an extract from the datasheet: you have the register names and register offsets, and you can go through the whole document and find all the registers you need to read and write for the driver. A quick example: network cards often have LEDs, and we can make them blink, turn them on and off. How do we do this? We take the base address of our mapped registers, add the offset of the LED control register that we found in the datasheet, flip a bit, and write it back; that turns the LED off and on. Also, this is one of the very few valid uses of volatile in C, because we really have to prevent compiler optimizations here: this memory is accessed from multiple sources.
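For the transcript, a minimal C sketch of these steps, modeled on what the slides describe (the PCI address is an example, error handling is elided, and the exact LEDCTL bit layout should be taken from the 82599 datasheet):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define LEDCTL 0x00200  // LED control register offset, from the datasheet

    int main(void) {
        // one-time setup, as root: unbind the kernel driver, e.g.
        //   echo 0000:03:00.0 > /sys/bus/pci/devices/0000:03:00.0/driver/unbind

        // BAR0 of the device is exposed by Linux as a plain file
        int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource0", O_RDWR);
        struct stat st;
        fstat(fd, &st);  // st.st_size is the size of the register area

        // map the registers into our address space; accesses go straight to the NIC
        volatile uint32_t* regs = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);

        // read-modify-write one LED control bit (see the datasheet for the layout);
        // volatile prevents the compiler from caching or reordering the access
        regs[LEDCTL / 4] ^= (1 << 7);

        munmap((void*)regs, st.st_size);
        close(fd);
        return 0;
    }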
Next step: how do we handle packets via DMA? Packets are transferred via queue interfaces: receive queues and transmit queues. These are often called rings, because they are organized in a ring-like structure. The rings are configured via MMIO, and once that's done, the device accesses them via DMA. The rings usually contain pointers to packets, and those packets are then also accessed via DMA. The details vary a bit between cards and devices, but this is not unique to NICs: the process is pretty much the same for all PCIe devices. They all work similarly.
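As a rough illustration in C, simplified from the 82599's "advanced" receive descriptor (field names are abbreviated here; see the datasheet for the real layout):

    #include <stdint.h>

    // The driver fills in the "read" format; once a packet arrives, the NIC
    // overwrites the same 16 bytes with the "wb" (write-back) format.
    union rx_desc {
        struct {
            uint64_t pkt_addr;   // physical address of the packet buffer
            uint64_t hdr_addr;   // unused in a simple configuration
        } read;
        struct {
            uint32_t pkt_info;       // packet type / RSS info
            uint32_t rss_hash;
            uint32_t status_error;   // contains the descriptor-done (DD) bit
            uint16_t length;         // byte count of the received packet
            uint16_t vlan;
        } wb;                        // also 16 bytes, same size as .read
    };

    // The receive ring is just a physically contiguous array of descriptors;
    // its base address and length are written to device registers
    // (RDBAL/RDBAH/RDLEN on this NIC) via MMIO during setup.
    #define RX_RING_SIZE 512
    // in the real driver this array lives in DMA-able memory, not in the data segment
    volatile union rx_desc rx_ring[RX_RING_SIZE];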
So, what are the challenges for high-level languages? We have seen some already. For example, the mmap call: we somehow need access to it with the proper flags. Another thing: we have to handle externally allocated memory and the layout this memory comes in. We've also seen that we need volatile to prevent compiler optimizations, so we need some kind of semantics to enforce this in other languages. And of course, especially for low-level stuff, you often have to use some kind of unsafe code in high-level languages, because many operations here are inherently unsafe; but we try to contain this unsafe code to as few places as possible. Okay, those were some basic challenges. I'm going to hand back to Paul, who will say something about the goals for our implementations.

Okay, this is basically what I told my students I expect from these implementations. I wanted the same feature set as my C driver, which served as the reference implementation. It was also supposed to have a similar structure, but at the same time we wanted code that looks like it was meant to be written in that language. It's always a difficult trade-off: use safety features wherever possible, but if they cost too much performance, do we really need them in all places? Where can we use the safety features? Where do they make sense? And then we wanted to quantify that. The idea is to end up with 10 driver implementations that we can quantitatively evaluate: look at the performance of all of them, and look at the safety properties, things like "memory safety is guaranteed for packet buffers: yes/no", and so on for the other properties. I'm now going to look at a few of these languages, basically one or two minutes per language for the students who are not here but have finished their theses, and then we'll take a deeper look at Go and Rust.

The first one is C#, which seems a little unusual, but we found a student for it, so why not? And no, we didn't develop a driver for Windows: Microsoft's CoreCLR is available on Linux and works really well. For those who don't know it: C# is a just-in-time compiled, garbage-collected, memory-safe language, and it has a relatively obscure, rarely used unsafe mode, and that unsafe mode features full support for pointers. You can basically write code that looks like C; you just have to give the compiler a special flag to tell it: hey, I'm going to use unsafe stuff. So how can we access external, foreign memory? There are a few nice wrappers in C#, for example P/Invoke or UnmanagedMemoryStream and so on, but those turned out to be too slow for our implementation, so we used the unsafe mode, which basically looks like this: you see the unsafe keyword here, and the rest just looks like C; it also feels like C when you're writing it. So it's a really nice language to write drivers in, and again, the unsafe code is contained to a few well-known places that can then be audited, unlike a C driver, where the unsafe code is all over the code base and you don't know where the bug is. Here we know: if there's a bug, it's probably in there, where we're not checking a buffer size properly or something. Okay, that's all we did for C#.

Another unusual language for drivers is Swift. The student asked: why not do it in Swift? And I was like, oh, I didn't even think of Swift; sounds like a good idea. And no, we didn't develop a macOS or iOS driver: Swift is also available on Linux. Swift is a compiled language, compiled via LLVM. Memory management is done via reference counting; there is no garbage collector, and it's mostly memory-safe. Again, you have to use some kind of pointers: there's UnsafeBufferPointer, there's UnsafeRawPointer, and more such classes, and we use these to make the packets stored in the DMA buffers available to the application using the driver. For example, here's a property that wraps some memory in an UnsafeBufferPointer; that wrapper forces you to specify how big the buffer is and then does the bounds check for you when running in debug mode. Pointers are a little more verbose compared to C#, but basically you can use them like pointers, and operator overloading helps: for some of the buffer pointer types there's already an overloaded operator, so it looks like an array access. So yes, it's totally possible to do this in Swift, even if you're writing a driver instead of a UI application.

Now, for the fans of functional programming, we have a fully working implementation in OCaml. OCaml is a compiled language with garbage collection for memory management; it's also memory-safe, and it's the first functional language we're looking at. One nice OCaml feature is the Cstruct library. Cstruct lets you specify a memory layout like this; it looks a bit like a C struct, and from it the library generates accessor code. This is really nice compared to the Swift example, where you would have to hard-code the offsets somewhere; it's much nicer to have a code generator that does the right thing for you and can also do automatic endianness swapping and so on. The OCaml code then looks quite different from what you might expect of driver code. This, for example, is just a function that counts how many packets have been received in the receive ring by checking the flags: we use the get_rx_wb_status getter generated from the Cstruct declaration shown before, check a flag in it, and count the packets. Then we know we have received, say, 10 packets since the last call to that function, and we can pass them on to the user of our driver.
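A rough C equivalent of that counting logic, reusing the hypothetical rx_desc sketch from earlier (the DD, descriptor-done, bit is set by the NIC in the write-back format):

    #include <stddef.h>

    #define RX_STAT_DD (1 << 0)  // descriptor-done bit in status_error

    // Count how many descriptors the NIC has completed since our last position
    // in the ring; the OCaml version does the same thing with generated getters.
    static size_t count_received(volatile union rx_desc* ring, size_t ring_size,
                                 size_t rx_index) {
        size_t count = 0;
        while (count < ring_size &&
               (ring[(rx_index + count) % ring_size].wb.status_error & RX_STAT_DD)) {
            count++;
        }
        return count;
    }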
More functional programming: we also have an implementation in Haskell. Haskell is, again, a compiled language, memory management via garbage collection, memory-safe, and functional. A few nice features of Haskell you might not know about: the POSIX memory packages have a lot of really helpful functions. Compare this to OCaml, where we had to write some C code to get mmap and mlock working with the right flags; here everything was available. And the Foreign modules have nice functions like peekByteOff, pokeByteOff and so on, where you can just do your raw memory accesses. Another thing we use a lot in Haskell are sum types, because a lot of things in drivers are what C would model as a union: you write some data into a DMA buffer in one format, and the device reads it; for the transmit packet descriptor, that's the transmit "read" format. Once the device has transmitted the packet, it goes back to the same memory location and overwrites it with something different, so we then need to read the same bytes as something else. That's basically a fancy C union, and sum types are a little nicer to work with than C unions. Those were the languages of the students who aren't here this year; you can check out all the implementations on GitHub, there will be a QR code on the last slide. I'm now going to hand over to Sebastian, who is going to do a deep dive into Go.

Thank you very much. So, now you've seen a few languages; next up is Go. What is Go? Go is a compiled programming language developed by Google. It's a general-purpose language, but as it's made by Google, it's mainly designed for distributed systems, because that's what Google does: distributed systems. A driver is not a distributed system, so why should we even use Go? Well, Go does offer a few things that are quite nice: it has a runtime with garbage collection that also enforces memory and type safety, and it has a very large standard library, so we don't need to use any code except standard library code. So how do we program drivers in Go? Actually, in most cases it's just like C. There are two main differences, though. On the one hand, we don't have pointer arithmetic: we have pointers, but no arithmetic on them, and that's what we'd need for managing our DMA memory. On the other hand, we don't have volatile for the memory barriers our register accesses need. So what do we do instead to compensate?
First of all, we can manage the DMA memory via slices; that's pretty easy. Second, we can use unsafe pointers for pretty much everything else. Unsafe pointers are arbitrary pointers, so that's good, but they circumvent the runtime, so we have to be careful with them; we use them mainly for physical address calculation and register access. There is also a rule set for unsafe pointers under which they remain valid, and of course we follow those rules. Two quick examples. First, mempools: here you can see how we manage the DMA memory. We allocate the DMA memory and initialize the mempool, and then mempool.buf, which is the whole memory-mapped area, is simply sub-sliced into packet buffers. You see, that's pretty easy. Second, physical address calculation: this is where we need unsafe pointers for the first time, because we have to translate our virtual addresses into the physical addresses that the network card uses to actually send and receive packets. Because the runtime checks many things and you have to convert types explicitly, you first convert your pointer to an unsafe.Pointer and then to an integer type, a uintptr; that's how you do it in Go. Also: no volatile, no problem. I said we need volatile because we share registers with the network card, and we need some kind of compiler memory barrier to prevent reordering. We don't have volatile in Go, but the sync/atomic functions do prevent reordering, among other things. They provide stricter guarantees than volatile, but it doesn't really cost us any performance here, so we just use atomic stores and loads on integer types to get our memory barriers. As a conclusion, I thought Go was actually quite nice to work with. The safety properties have improved: Cutler et al., as Paul said before, wrote a kernel in Go, and yes, it does get you safety guarantees. And, kind of a personal opinion, I think it looks like C code, just beautiful. But it also has downsides: in the best case it's approximately 10% slower than C, and in less optimal cases it's even worse; we have to live with that. Also, the register access can be a bit ugly, but well, as long as it works.
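The virtual-to-physical translation that Go does with unsafe pointers relies on the same kernel interface in every implementation: /proc/self/pagemap. A C sketch of it (this mirrors the reference C driver's approach; the memory must be locked so the mapping can't change behind our back, and error handling is elided):

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    static uintptr_t virt_to_phys(void* virt) {
        long page_size = sysconf(_SC_PAGESIZE);
        int fd = open("/proc/self/pagemap", O_RDONLY);
        // pagemap holds one 64-bit entry per virtual page of the process
        uint64_t entry;
        pread(fd, &entry, sizeof(entry),
              (uintptr_t)virt / page_size * sizeof(entry));
        close(fd);
        uintptr_t pfn = entry & 0x7fffffffffffffULL;  // bits 0-54: page frame number
        return pfn * page_size + (uintptr_t)virt % page_size;
    }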
So next, I'm going to hand over to Simon, who did this in Rust.

Thank you, Sebastian. So let's talk about Rust. What is Rust? Well, the Rust website says it is a safe, concurrent, and practical systems language. Sounds great; that's exactly what we need to write a user space network driver. Is there anything else we should know? Well, yes: Rust has no garbage collector, so we have less overhead for memory handling. It has a unique ownership system and rules for moving and borrowing values; with these rules, Rust accomplishes its goal of memory safety. And we have unsafe, like in the other languages presented before. What is the ownership system? It is the core feature of Rust, and it's basically just a set of three simple rules. Rule number one: each value has a variable that is its owner. Rule number two: there can only be one owner at a time. Rule number three: when the owner goes out of scope, the value is freed. These three rules, combined with the rules for borrowing values, keep us safe from memory bugs like double frees, and since they are enforced at compile time, we don't pay any performance penalty at run time. So our programs are similar to C programs, but we have the great advantage of memory safety.

What does that look like in our implementation? We have packet structs for our network packets that own DMA memory, and these packets are passed between the users of our driver and the driver itself; along with them, ownership is passed as well. And that's pretty cool, because when a packet is passed to the user, only the user can modify the packet and its contents, and when it's passed back to the driver, only the driver can modify it. So we have basically safe packet handling, unlike in the other languages. At the bottom of this slide, you can see how you would use the driver through its interfaces: how to receive, modify, and send packets. And there's no way to screw up: for example, you cannot forget to free packets, because packets are freed automatically when they go out of scope and are returned to the driver's memory pool. So this is safe code, and there's nothing you can do wrong here. Unfortunately, we also have unsafe code in our driver. What is unsafe code? Well, not everything can be done in safe Rust: for example, calling foreign functions and dereferencing raw pointers is unsafe. But this is nothing unusual; the idea is to reduce unsafe code to a few places and add checks to make the unsafe code safe. What does that look like in our driver? For example, we have the set_register method to set the registers of our device, and we use ptr::write_volatile to write to a register. Before we do that, we have an assertion in the code that the address we are about to write to is indeed inside the mapped memory region.

So we have some great code, but is it fast? To find out, we set up a testbed to benchmark our drivers. We have two servers, a packet generator and a device under test, connected bidirectionally with two 10 gigabit per second links. We use the MoonGen packet generator, written by Paul, because obviously it's the best packet generator, and on the device under test we run a simple bidirectional packet forwarder that we implemented on top of our drivers in all languages.
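A sketch of such a forwarder in C, modeled on the reference driver's API (the function and struct names here are assumptions based on the talk, not verbatim from the code; assume the driver's own headers provide them):

    #include <stdint.h>
    // assumes driver headers providing ixy_device, pkt_buf, ixy_rx_batch,
    // ixy_tx_batch and pkt_buf_free (names are illustrative)

    #define BATCH_SIZE 32  // the sweet spot discussed below

    static void forward(struct ixy_device* rx_dev, struct ixy_device* tx_dev) {
        struct pkt_buf* bufs[BATCH_SIZE];
        // fetch up to BATCH_SIZE received packets from queue 0
        uint32_t num_rx = ixy_rx_batch(rx_dev, 0, bufs, BATCH_SIZE);
        if (num_rx > 0) {
            // touch the packet data, as a realistic forwarder would
            for (uint32_t i = 0; i < num_rx; i++) {
                bufs[i]->data[48]++;
            }
            // send them back out; packets that don't fit into the transmit
            // ring must be freed explicitly, not leaked
            uint32_t num_tx = ixy_tx_batch(tx_dev, 0, bufs, num_rx);
            for (uint32_t i = num_tx; i < num_rx; i++) {
                pkt_buf_free(bufs[i]);
            }
        }
    }

    // the benchmark busy-polls this in both directions:
    //   while (1) { forward(a, b); forward(b, a); }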
So let's look at the results of our measurements. This is a graph showing the throughput of our forwarder: on the x-axis you see the CPU speed, on the y-axis the packets per second. We look at packets per second because the main overhead is per packet, not per byte, and the top of the y-axis is 30 million packets per second, because that's about 20 gigabit per second at minimum-size packets. You can see the plots for the different languages; throughput scales linearly with CPU speed. Rust is the fastest. Swift... yeah, well, it performs incredibly poorly. Now, usually you don't manually change your CPU speed, so we asked ourselves: is there anything else we can vary? Yes, there is: you can change the batch size, i.e., how many packets you hand to the PCIe device at once, because batching avoids some overhead. Kernel drivers usually use a batch size of one on transmission, and higher batch sizes are one of the main reasons why user space drivers are faster than kernel drivers. A batch size of 32 to 64 packets is a very good choice; even higher batch sizes get worse again, because we get more L1 cache misses. And we asked ourselves: why does Swift perform that badly? Paul is going to tell you why.

Yeah, so if you have a program that is not performing as expected, what do you do? You run some profiling on it, you get a lot of data, and then you need a way to visualize it. A common way to visualize profiling data is the flame graph: the x-axis is time spent in a function, and the y-axis is the depth of the call stack. I don't expect you to be able to read all of this; these are just the function names. If you look at the topmost functions here, those are the leaf functions where the time is actually spent, and we can categorize what they are doing. Well, we found out it's due to Swift's memory management: Swift adds calls to the internal retain and release functions for each object used in each function, just to keep the reference counts up to date. That's basically no problem if you're writing a UI app with some buttons, but if you're writing a driver that has to pass millions and millions of packets through a lot of functions all the time, then it turns out it spends 76% of its time in these retain/release calls. It could be four times faster if it had another way to manage memory. For comparison, in Go we spend less than half a percent of the time in the garbage collector, because it's quite a simple application. Now, the big advantage of Swift's reference counting semantics is of course that there are no unpredictable pause times, whereas the garbage collector in Go might just stop your driver for some time. So the question is: is it a good idea to have a garbage collector in something like a driver? The good thing is, we can measure that, because we have this forwarding application. And we did: we measured the latency of all the packets that we forwarded, at a rate of around 16 million packets per second. What you see here is the cumulative distribution function of the latency for Rust, and it's basically an almost perfect normal distribution centered at around 8 microseconds, which is a very nice result. For comparison, a hardware switch takes around one microsecond to forward a packet, so 8 microseconds for a software forwarder is really nice and fast. Now let's add the other languages to the same graph: Go is kind of similar, but C# is a little slower at the top. I realize this graph might look a little confusing, so let me quickly explain how to read it, if you're not familiar with CDFs: we pick a value on the y-axis, for example 0.5, so 50%; then we go over to a language's curve and down to the x-axis, and that just means that 50% of the packets take less than 8.9 microseconds to be processed with C#, and the other 50% take more.
When looking at any latency figures where a garbage collector or other unpredictable spikes are involved, it's always a good idea not to look at the median, but at something like the 99th percentile. For C#, 1% of the packets take longer than 30 microseconds, and 1% of the packets is a lot: that's 1 in 100 packets, and you are doing millions of them per second, so you are going to hit these worst-case latencies quite often. What we really want to know is not the 99th percentile but the 99.99999th percentile or something like that, so we need to zoom into that graph. If you zoom into a graph, you usually make the axes logarithmic, but in this case that would zoom into the wrong part of the graph, so we also have to subtract the CDF from one, yielding the complementary cumulative distribution function on a logarithmic axis. Now this is inverted and a little confusing; it's the kind of graph you would see in an academic publication about latency, but I think it's really confusing. Still, you can quickly read off the percentiles, and the bottom line is the latency of one packet in a million. What we can do to make this graph more approachable is basically rotate it and relabel the axes: the x-axis is now the percentile, and the y-axis is the latency. Now it's easier to read: we can, for example, look at 99.99 and check which latency each language has at this percentile. This is the kind of graph you will see in a lot of latency evaluations of anything that has latency spikes, if done properly. You unfortunately often see people doing latency evaluations and then reporting the average or median latency, which is in many cases a completely useless value. Most of them probably just don't know better, but if you want to evaluate latency, please end up with a graph like this. If you want something to google: there is a library called HdrHistogram which can generate these graphs from latency measurement data, and it's just a really nice way to characterize garbage collection latency, just-in-time compilation latency, or anything like that.
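A minimal sketch with the C port of HdrHistogram (assuming its hdr_init / hdr_record_value / hdr_value_at_percentile API; check the library's headers for the exact signatures):

    #include <hdr/hdr_histogram.h>
    #include <stdio.h>

    int main(void) {
        struct hdr_histogram* hist;
        // track values from 1 ns to 1 s with 3 significant digits
        hdr_init(1, 1000000000LL, 3, &hist);

        // in a real benchmark this runs once per forwarded packet
        hdr_record_value(hist, 8900);  // one latency sample, in nanoseconds

        // the tail percentiles are what actually matter for GC/JIT spikes
        printf("99.99th percentile: %lld ns\n",
               (long long)hdr_value_at_percentile(hist, 99.99));
        hdr_close(hist);
        return 0;
    }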
Now we have a driver that is nice and fast and has a relatively low latency in most languages, but we have not yet really looked at safety and security beyond what the language offers, because our driver still runs as root, like all, or virtually all, user space drivers do by default. Why is this the case? Well, I've shown this code before: there are a few operations in the initialization that simply require you to be root. Mapping the PCI Express resource requires root; for implementation details in the Linux kernel, we need non-transparent huge pages for the DMA buffers, and they require root to allocate; and locking memory requires root. These are clearly all functions related to initialization and setup, so the obvious idea is: write a small program that does that for us, keep it simple and auditable, and then drop all privileges. Sure, we can do that; it's relatively easy to drop privileges after setting up the memory. But that's still not really secure in any way, even though you're now running as an unprivileged user. To understand why this does not work the way you might want it to, we have to take a high-level look at how memory access works on a modern system.

At a very high-level view, we have the CPU at the top, with our application running on it; at the bottom left, the PCI Express device; and at the bottom right, some memory. If we do a memory access from our application, it goes through a thing called the MMU, the memory management unit, which translates the virtual address in your program to a physical address that the memory controller can use. The isolation between processes is enforced by this MMU, and only the kernel can reprogram the MMU to guarantee that isolation. If we access our device via PCI Express, for example with memory-mapped I/O, the access also goes through the MMU; the MMU knows this is not going to memory but to PCI Express, it talks to the device, and that's all fine too. Now you could argue: these are the two kinds of memory accesses we do, both are checked by the MMU, so what's the problem? The problem is what happens when we tell the device to do something with memory. As I mentioned before, we give the device physical addresses to use, and of course the device does not go through the MMU: it has no concept of being used by this or that process, it just has full access to all of your memory. So if someone owns the program running as an unprivileged user, it's a fairly trivial exercise to read any data from areas they shouldn't be allowed to access, or to write data there, simply by telling the device to do it for them. Meaning: any application with direct access to a PCI Express device is effectively root, even if you drop privileges.

The obvious solution is to make this access path go through the MMU as well, and there is a fancy hardware component for exactly this use case: the IOMMU. You will find it on any modern CPU with hardware virtualization features, because it's mainly used to pass PCI Express devices through to virtual machines in a safe and secure manner; but you can also use it for user space drivers. You just need a way to tell the kernel to configure the IOMMU with the proper restrictions, and then the access looks like this: we configure it so that the IOMMU grants the device the same permissions that the MMU grants our user space program. Then it's perfectly secure: if your program gets owned, the attacker doesn't get any privileges beyond what your process already has. This is also useful for safety: when I initially wrote the driver, I killed a server when I apparently misconfigured something and the card overwrote memory that was important for the file system; the machine was dead and had to be reinstalled. That wouldn't have happened if I had started with the IOMMU in the first place. How do you use the IOMMU? Linux has a nice subsystem for this called VFIO, and that's what we use. We just need to prepare the system as root; this is a one-time step: bind the device to the VFIO driver instead of to no driver at all, change the owner of the resulting magic device file, and pass it to an unprivileged user. Then we have to give the unprivileged user permission to lock memory for DMA, but we can also restrict that: you are allowed to lock, say, 512 megabytes of DMA memory, and nothing more.
All the remaining steps can be done as an unprivileged user. The unprivileged user can call mmap on the new device file, communicate with the kernel via ioctl commands on that magic device, and also ask the kernel to allocate DMA memory, which is actually the better way to allocate DMA memory, for technical reasons. Then you can just use the device as before: you only need to change the setup steps, and you have a user space driver that runs with no special privileges at all. The IOMMU will check all the accesses, and the kernel will make sure that you can't configure the IOMMU in a wrong way: you can only ask it to please configure the IOMMU such that the device can access your current address space, and nothing else. We have implemented this in our C driver; the student who implemented it is actually here today, but he was afraid to come up on stage. If you have any questions, you can talk to him afterwards.
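For reference, a hedged sketch of the unprivileged half of that VFIO setup in C (ioctl and struct names are from linux/vfio.h; the root-only binding step, the group-number lookup, and all error handling are elided):

    #include <fcntl.h>
    #include <linux/vfio.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    int main(void) {
        int container = open("/dev/vfio/vfio", O_RDWR);
        int group = open("/dev/vfio/42", O_RDWR);  // group number: example only

        // attach the group to the container and enable the type-1 IOMMU backend
        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

        // device fd: registers are now mapped from this fd via mmap,
        // instead of from the sysfs resource file
        int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:03:00.0");

        // allocate some memory and allow the device to DMA into it at I/O
        // virtual address 0; the IOMMU blocks every other access
        void* buf = mmap(NULL, 1 << 21, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        struct vfio_iommu_type1_dma_map dma_map = {
            .argsz = sizeof(dma_map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uintptr_t)buf,
            .iova  = 0,
            .size  = 1 << 21,
        };
        ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

        (void)device;  // registers and rings would be set up from here
        return 0;
    }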
Now we have this awesome driver that is safe, secure, and everything, and yet some people still argue that user space drivers are useless: I already have a driver in the kernel, why do I need another one? Why would you want to write a driver at all? The obvious answer: why wouldn't you? It can be fun; I had a lot of fun when I wrote the first driver. Maybe you just need a quick and dirty driver for some weird device that you found somewhere. Maybe you want quick development, without shooting down your kernel or rebooting all the time while developing against a weird device. Maybe you're developing a custom device, some FPGA board you want to talk to quickly without getting involved in kernel stuff. Or maybe you just need a feature that exists in the hardware but not yet in the driver. There's a lot of that: in the past we implemented IPsec offloading, which wasn't in the open source driver, and something we have done recently is hardware timestamping. The latency measurements I showed before required us to take timestamps of 15 million packets per second with nanosecond-level precision, and that is quite hard. People usually use special hardware for this; we ourselves have used NetFPGAs in the past, which can be a lot of fun to use, but they are prohibitive from a cost perspective, or from a user-experience perspective, for people who just need to take a few timestamps. So we wanted to do it on cheap, off-the-shelf commodity network cards, and it turns out some of the newer cards have a hardware feature that simply appends the timestamp to the receive buffer of every packet they receive. Sadly, none of the existing drivers supported this feature. But I've shown you how to access registers; you can even skip the driver-writing step and leave the original driver loaded: just poke the right register in the right way while the original driver is running, and tell the card to please do this. Now you get a timestamp at the end of every packet. Sure, it probably breaks your TCP stack or whatever if there's extra data, but pcap doesn't care, and in this case we did it with DPDK and got the raw packet buffers, and DPDK also doesn't care whether the device thinks there are four or eight extra bytes. For the setup we had here, the NIC under test, we used a fiber-optic splitter to sample timestamps of all the packets before and after the device under test, and that yielded this measurement. So that's just one simple use case for writing your own driver.

To conclude, I can really only say that I think drivers should be written in better languages: you shouldn't start writing a new user space driver in C nowadays. Sadly, if you look at the world of user space drivers, there is mainly DPDK, which is user space network drivers, all in C, mostly because it's largely copied from kernel drivers. Then there's SPDK, which has NVMe drivers, also all C. The big exception to this is Snabb, which has drivers in Lua, which is quite nice. And then there are our implementations in a lot of different languages. If you want to check out these languages, scan this QR code or just google for "ixy languages" on GitHub or wherever; there's a meta-repository that has links to all the implementations and will have a link to this talk. Check it out, write your own driver, no kernel code needed. Thanks for your attention.

Thank you very much, Paul, Simon and Sebastian. We do have time for questions; please line up at the microphones. And to get started, a question from our signal angel, from the internet.

So the IRC, first of all, was wondering: why was the Bash proposal only offered as a Bachelor thesis? Yeah, the Bash question, I kind of expected it. I added it as a joke language and thought, well, why couldn't you do it in Bash? The problem is, I tried to implement it in Bash, and it doesn't work the way I wanted it to; I would need to write more C code. The way I tried it was: I wrote a short C program that calls mmap and then just sleeps forever, and the idea was to access the address space of that program via procfs, where there's this magic file that gives you access to everything, and with dd you can read and write there. But it breaks at some point: reads go through, but if you write something via procfs, at some point that doesn't go through to PCI Express. So it didn't work. And I only offered it as a Bachelor thesis because a Master's thesis should maybe be a little more serious than a joke driver.

Another question from the internet. Yeah, maybe a more serious one: Gordon is asking how you suggest handling interrupt requests, or code with strict timing requirements. If I have strict timing requirements, I'm not going to use interrupts; interrupts are horrible for that. I just poll the device all the time. This is how basically all user space drivers work: they just ask the device "is there a new packet?" a million times a second. Interrupts are one of the slowest ways to communicate between CPU and device: just receiving the interrupt requires a context switch on the CPU, then you have to context-switch back because you don't want to do too much in the interrupt handler, and then you have to poll the device anyway, because the interrupt only tells you that something has changed. So if you really care about latency, you just poll the device all the time. And for user space interrupts, check out VFIO: with VFIO you can do an epoll on a file descriptor and get notified when there's an interrupt, if you really need that.

Okay, let's go to the hall questions. Please keep your questions to one sentence, and only ask questions, because there are many of them. Microphone number two, please.
So, when you compare the different user space drivers in different languages, why was Rust slower than C, given that the memory safety is enforced at compile time? Yeah, well, we have a few more memory operations because of the safety: we have to move the packet structs from inside the driver to the user and back, and you don't have that in C, so Rust was a bit slower because of that. I think you could optimize it a bit more; it was just a Bachelor's thesis, so I didn't have that much time. I guess it could be a bit faster, but I think it would still be a bit slower than C. The C driver doesn't do any bounds checks at all; it just trusts that you're doing the right thing.

Microphone number three, please. You mentioned Haskell in the beginning, but it wasn't in the comparison; can you talk about that? The Haskell driver is not yet optimized for performance; it would have been unfair to include an unfinished version. It's currently still quite slow, and I didn't want to put it in there.

Microphone number four, please. Awesome talk, thank you very much. Have you considered using programming languages like Idris or Coq, where the compiler can check the logic of your driver? I don't have a student to implement that, but yeah.

Microphone number one, please. Okay, I have seen several languages with a garbage collector included, including Go. So my question is: how often do the GC stop-the-world pauses happen, and what are the typical heap sizes? How often, I don't know; how long they take, I showed you: up to 40 microseconds, which, depending on your application, might or might not be a problem. There's some more data in the paper we referenced, from the people who implemented a whole operating system in Go; they mention they see up to 200 or 300 microseconds of worst-case pause times. Heap sizes, to be honest, I don't think we really measured. I'm not even sure how much work the collector actually does, because in the Go profiling it showed up with so little time compared to everything else that it's basically irrelevant, except probably for some latency.

Microphone number three, please. Yeah, I missed a couple of languages: OpenCL and CUDA. Not because they're particularly interesting languages, but it would be interesting to have the GPU talk directly to the network. There's a paper out there called PacketShader; yes, they do exactly that. There's also another paper called "Raising the bar for GPU packet processing" or something like that, which basically argues against it. The main problem is that transferring the packets between the network card and the GPU is slow, and you need gigantic batch sizes; the PacketShader people used batch sizes of 4,000 or 8,000, I think, which hurts latency, and they don't have a proper latency evaluation, wonder why. But if you're interested in GPU packet processing, check out that PacketShader paper; it's a few years old.

Microphone number one, please. How do you deal with I/O ordering? x86 guarantees you that the order in which the CPU posts I/O accesses is the same order in which the device receives them, but on other platforms this is not the case, and plain memory-mapped I/O doesn't give you any such guarantee there.
Well, memory ordering is highly specific to the device you are using: which ordering semantics you need. The weird thing is, with the Intel device we're using here, there is one location where I'm really quite sure that I need release memory ordering, because we write something and then clearly set a flag, and the device reads the other memory based on that flag; I'm 99% sure this should be release memory ordering there. But none of the driver implementations has any release memory barrier there, so apparently we get away without it here. Other than that, for Go and Rust we of course have low-level primitives to enforce hardware memory barriers; for other languages, if you for example check out the Snabb driver, which is written in Lua, they have a little C stub that calls the right mfence instruction at the right place.

Thanks. Number four, please. Thank you. The abstract of this talk mentioned that the user space C implementation was 6 to 10 times faster than the in-kernel implementation; was this just because of the batch sizes, or were there other reasons? This is mainly due to the batch size. The kernel implementation is also faster when using XDP, so that comparison was against the kernel without XDP; compared to the kernel with XDP, I think we are 30% faster, and that's also only if you have a kernel that can do XDP forwarding between different NICs, because most NICs can otherwise only send packets back out of the same NIC... different topic, but yeah.

Next question, again from the internet. So R is asking: have you considered writing user space drivers for things that are more inherently insecure due to specification complexity, like the Bluetooth stack? Um, yeah, that would be interesting; I guess that's a good topic for further research. We just went for networking because user space networking is really common anyway: look at your iPhone or whatever, there is a user space TCP stack running on it.

Number three, please. Hello, thanks for the talk, just a little question: does the IOMMU affect performance? We are not yet sure; you can ask him afterwards. As far as we have evaluated it, it does not affect performance. There is also a paper called "PCIe Bench", from SIGCOMM last year, I think, and you can read that; they have a performance evaluation of the IOMMU. So yes, there are some effects, because the TLB on the IOMMU is smaller, but we couldn't measure them in our toy setup yet.

Microphone number one, please. Hi. When you access ring buffers, you usually need something like an ACCESS_ONCE macro in C, because C is actually allowed to change the semantics of the implementation: code that reads memory into a variable and then uses that variable can be translated into something that reads the memory twice. This was a problem with XSA-155, I believe, where the netback/netfront communication actually broke in that way and led to an exploitable bug. How would you enforce a single memory access in all of those programming languages? "Enforce" is always a hard word. The thing is, most of these language implementations copy the descriptor when using it: the ring stores the descriptors, the descriptors are basically pointers to other buffers, and the critical part is reading the descriptor. I think all of the language implementations copy the whole descriptor, which is 16 bytes. So the memory is always copied, guaranteed? I hope so; please do check. In the Go code we have the atomic read, and I hope that only does one access; otherwise I'd consider it a bug in Go if my atomic read reads twice. And in C, I also copy it, I think.

I think we have time for one last question, from microphone number two, please. How can we convince other people, especially other developers and the business people, of the necessity of moving away from C, learning a new language, and investing the time to develop stuff in it?
Yeah, I honestly don't know. It's a mystery to me why people keep writing stuff in C. No idea, sorry.

Thank you very much to Paul, Simon and Sebastian.