Well, hello. Down at the back, I'll try to talk nice and loud, because I know people down the back have been having a hard time hearing. So, my name is Luke, and I'm here to talk about how to write a device driver for the Mellanox ConnectX family of Ethernet adapters. And I want to say from the outset that I don't know how to switch slides. Thank you very much. I don't work for Mellanox. I'm not affiliated with any vendor. I'm here in the collegial spirit, as somebody who writes software like everybody else in the room. I'm not here to sell you anything.

The project that I work on is Snabb. That is a production network data plane that's written in Lua from top to bottom. And the reason I'm here talking about the ConnectX is that I was very lucky a few years ago to get my hands on the very, very, very first non-NDA copy of the driver specification for the ConnectX. This is a document that had been kept secret for decades, presumably. And the context of that is that I was working on a project with Deutsche Telekom. We needed to run Snabb at 100 gig, and we had a requirement that the Ethernet adapter should not force us into any specific software ecosystem, and it should also not force us to sign NDAs and build our own software based on somebody else's trade secrets. And Mellanox were just the vendor that happened to step up and give us what we needed to do that project. I think that's a great credit to them, and that's why I'm here. So I'm able to talk about this because I didn't sign any NDA.

So my thing is software networking. And I mean this in the extreme sense of doing everything on the CPU, putting nothing on the hardware, not doing any offloads, any of that kind of stuff. I think hardware offload is yesterday; the CPU is today. And it's a really exciting time in software networking now, because we're out of the kernel. We're out of all of these constrained programming environments where we can only use very, very specific tools and where everything is different. So now networking software is just software, in a very important sense. We have a lot of freedom. And we're taking this freedom in a lot of directions, as we just saw in the ixy talk. You know, probably some great directions and probably some awful directions. We're doing a lot of exploration now, we're trying a lot of new ideas, and we're gradually seeing what works well and what doesn't. And I think that to do this effectively, it's really, really critical that we have control of all of the lowest levels of the stack, all the way down to the device drivers. If we don't have control of that, if we're building everything on large software frameworks provided by vendors, the risk is that we'll have pressure to do things the way they've always been done, and we'll get stuck on a local maximum and not see some new ideas. And of course, in an engineering project, very often a local maximum is exactly what you want, but not always. It would be a real pity if everybody was just sitting on the same local maximum and we never got any new ideas; we would never get any progress. So it's very important that some people are trying new things and that they're able to do that effectively.

So something I think about these days is what would be the ideal network card for pure software networking. And I think it's actually really easy to describe. The perfect network card would just be a high-speed serial port.
We take a stream of packets from memory and we put them onto the network; we take a stream of packets from the network and we put them into memory; and that's all it would do. It would not do a single other thing. Any feature that you would give it would be, by definition, a misfeature. Now, nobody makes this network card as far as I know, but it would be really wonderful if somebody would. So if you're a hardware person, please do. But I think for now, the practical thing is to go out and look at what's commercially available that you can kind of use in this way.

And for a really, really, really long time, the answer has been the Intel 82599, the Niantic. Like, raise your hand if you like the Niantic NIC. Everybody loves the Niantic NIC. Not every hand went up, but I know that in your hearts you all love it. Everybody does. This has been the hacker's favorite for the longest time. People like Luca Deri, for, I don't know if it's decades, have been doing kind of amazing stuff, all kinds of projects with this NIC, and it's just been a great experience. We've all loved it. The problem is that the Niantic is getting a bit long in the tooth. It only supports 10 gig. It only supports PCI Express version two. So if you deploy this today, you can't actually take advantage of all the bandwidth you'll find in a modern network or in a modern server. In a perfect world, Intel would just release a refresh of the Niantic that gives us faster port speeds and a faster PCI Express endpoint. But that has not been Intel's strategy. Rather, they're introducing a series of other NICs in parallel. These NICs are not compatible with the Niantic, they need different drivers, and there's not really any one of them that you can point to and say: this is the successor to the Niantic. The closest is probably the Fortville, but it's significantly more complex and it can't do 100 gig. So, actually, no. So if we want to find the new hacker's favorite NIC for all of us to do our next projects on in 2019, I think we have to look somewhere else, until Intel give us the right thing.

And that's what brings us to the Mellanox ConnectX. Mellanox, as far as I'm concerned, is the only contender, because if you go to the websites of every Ethernet card vendor, it's only Intel and Mellanox where you'll find the device driver specification. So everybody else has disqualified themselves from consideration. And then if you take a closer look at the ConnectX, from a bird's-eye view it's actually got some really, really, really nice properties. First of all, the ConnectX supports every port speed you care to name, and they're very, very fast at introducing new port speeds as they become standard. So 100 gig, 200 gig, they're on top of all of these things. They refresh the silicon very regularly: you get ConnectX-4, ConnectX-5, ConnectX-6. And they update this across the board. It's not that the 10 gig is getting older and older while only the 40 gig gets attention; they do across-the-board refreshes, which is very nice. And the best thing by far is that the same device driver works for every single one of these cards, every port speed, every generation. So if you were to take a 10 gig ConnectX-4 and write a device driver for that, that driver could then also be deployed on a 200 gig ConnectX-6. And this is fantastic. This is fantastic if you're a driver developer who values your own time and considers it to be a scarce resource.
And the magic that makes this work is that the driver is not actually talking to the silicon directly; there's a layer of firmware in the card that implements a standard protocol and hides the differences in the silicon. So if you're going to write the ConnectX device driver, you will have a control component, as I'm going to call it, on the host, that is talking to the firmware over PCI Express and sending a bunch of requests to initialize and configure the network card. And this is basically a CRUD interface that you get from the firmware. You have a bunch of objects, like a transmit queue and a receive queue and a flow table and so on. You can create, read, update, and delete these objects, and you just make a series of these requests to initialize the card and give it the configuration that's going to suit your application. And then, having done this, you can have some logically separate processes that do transmit and receive and multiplex them onto the card. There's actually quite a neat mechanism for delegating this, just with plain old memory mappings: you can delegate access to queues to specific processes.

Now, the last I checked, which admittedly is a couple of years ago, the software stack that Mellanox would recommend for building on this is quite large. The controller is a Linux kernel module, the mlx5 kernel module. On top of that, you have the OFED library, the Mellanox OFED library, which is a substantial piece of software. And then on top of that, you have DPDK. I don't think it's an exaggeration to say this is about a million lines of C code. So it's a considerable dependency to take on.

So when we did the driver in Snabb, we wanted to see if we could optimize for making it simpler and easier to maintain, something we could understand ourselves all the way through. And so we replaced the whole stack. We don't use DPDK, we don't use OFED, and we don't use a kernel driver; we just talk directly over PCI Express to the firmware. And our driver is 1,500 lines of code, which is actually a considerable reduction from a million lines. And that 1,500 lines of code is doing basically three things. One, it's implementing the client side of this CRUD protocol for initializing the firmware and giving it an appropriate configuration. Two, it's implementing the transmit and receive functionality: writing descriptors and poking the card to wake it up to process them. And three, it's implementing multi-process operation, so that you have the controller running in one Unix process and, if you want, transmit and receive functions running in other processes.

And I'm just going to give you a really, really quick flavor of how that driver works. What you see here, at the top, is an excerpt from the specification, the PRM, the Programmer's Reference Manual. When you look at the specification for the interface to the firmware, you see a lot of tables like this, saying that if you want to perform the operation "set flow table root", you need to create a fixed-size binary structure, and at specific offsets you need to put certain parameters that the firmware is expecting. It's a little bit like a C struct, more or less. And down below is what the Lua code looks like to implement that in the driver. We have a function with the same name. It takes the arguments as normal variables. It puts them into an array of bits at the suitable offsets, sends this out to the card over a command queue via PCI Express, and gets a result back saying whether the operation succeeded or failed. And something like half the driver code is just this, so it's 1,500 lines of code and half of it is basically just transliterated from the data sheet. So it's actually a pretty reasonable protocol.
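To make that concrete, here is a minimal sketch of what such a transliterated command function can look like. This is not the actual Snabb code: the opcode value, the field offsets, and the setfield/execute_command helpers are invented for illustration and stand in for the real bit-packing and command-queue machinery.

```lua
-- Hypothetical sketch of a PRM-style firmware command in LuaJIT.
-- The firmware expects a fixed-size input mailbox with each parameter
-- at a documented bit offset; the driver's job is just to fill it in.
local ffi = require("ffi")
local bit = require("bit")
local band, bor, bnot, lshift = bit.band, bit.bor, bit.bnot, bit.lshift

-- Write 'width' bits of 'value' at bit offset 'lo' of 32-bit word 'word'.
-- (Assumes width < 32, which is enough for this sketch.)
local function setfield (buf, word, lo, width, value)
   local mask = lshift(1, width) - 1
   buf[word] = bor(band(buf[word], bnot(lshift(mask, lo))),
                   lshift(band(value, mask), lo))
end

-- Stand-in for the real command-queue machinery: in the driver this would
-- copy the mailbox into the command queue, ring the doorbell register, and
-- poll for the firmware's completion status.
local function execute_command (cmdq, inbox)
   return 0 -- pretend the firmware reported success
end

-- A command function in the style the talk describes: same name as in the
-- PRM, arguments as normal variables, dropped into the expected slots.
-- (The opcode value and offsets are placeholders, not copied from the PRM.)
local SET_FLOW_TABLE_ROOT = 0x92F

local function set_flow_table_root (cmdq, table_type, table_id)
   local inbox = ffi.new("uint32_t[16]")            -- fixed-size binary structure
   setfield(inbox, 0, 16, 16, SET_FLOW_TABLE_ROOT)  -- opcode in bits 31:16 of word 0
   setfield(inbox, 2, 24,  8, table_type)           -- e.g. the NIC receive table type
   setfield(inbox, 3,  0, 24, table_id)             -- which flow table becomes the root
   local status = execute_command(cmdq, inbox)
   assert(status == 0, "SET_FLOW_TABLE_ROOT failed with status " .. status)
end

-- Usage during initialization might look like:
-- set_flow_table_root(cmdq, NIC_RX_TABLE_TYPE, root_table_id)
```

The controller half of the driver is essentially a long series of functions like this, one per firmware command, issued in order during initialization to create the queues, flow tables, and so on that the application needs.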
And then for multiprocessing, this is the way that we do it, which might be of interest to other people who are trying to find a simple solution. We have the controller, which comes up and sets up the firmware, and it knows about the whole set of workers that are supposed to exist for the application; for each worker, it just creates a configuration file. The worker then polls for the existence of that file, and inside that configuration file is all the relevant information, like the addresses in physical memory of the DMA descriptor queues and that kind of thing, that it needs in order to attach. In our implementation, these files are actually shared memory objects, basically C structs. They can also be used for synchronization, and we have a trick in Snabb that DMA memory is always mapped at a consistent address in every process, so any pointer into DMA memory in a collection of processes is globally accessible.
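As a rough illustration of that hand-off, here is a sketch of what the controller and worker sides can look like. The struct layout, the file naming, and the mapping helpers are invented for this example; in the real system the configuration objects would be file-backed shared memory mappings, not Lua tables.

```lua
-- Hypothetical sketch of the controller/worker hand-off via per-worker
-- configuration objects, as described above.
local ffi = require("ffi")

ffi.cdef[[
  struct worker_config {
    uint32_t ready;           /* set by the controller when the rest is valid */
    uint64_t sq_descriptors;  /* physical address of the send queue ring      */
    uint64_t rq_descriptors;  /* physical address of the receive queue ring   */
    uint64_t doorbell;        /* address used to "poke" the card              */
  };
]]

-- Stand-ins for the real shared-memory machinery: in practice these would
-- mmap a shared object, and DMA memory would be mapped at the same virtual
-- address in every process so that pointers are valid everywhere.
local objects = {}
local function create_shared (path)
   objects[path] = ffi.new("struct worker_config")
   return objects[path]
end
local function try_attach_shared (path)
   return objects[path]  -- nil until the controller has created it
end

-- Controller side: after creating the queues with firmware commands,
-- publish their details where the worker will look for them.
local function publish_worker_config (path, sq, rq, db)
   local cfg = create_shared(path)
   cfg.sq_descriptors, cfg.rq_descriptors, cfg.doorbell = sq, rq, db
   cfg.ready = 1
end

-- Worker side: poll until the configuration appears, then attach and do
-- transmit/receive without ever talking to the firmware itself.
local function attach_worker (path)
   local cfg
   repeat
      cfg = try_attach_shared(path)
   until cfg ~= nil and cfg.ready == 1
   return cfg
end

-- Usage, in a single process for illustration:
-- publish_worker_config("worker-1", 0x1000, 0x2000, 0x3000)
-- local cfg = attach_worker("worker-1")
```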
And that's about it. That's how the ConnectX driver works, and if you want to write one, I can recommend it. It's not as hard as you would think. So thank you very much for your attention.

Five minutes for questions for Luke.

Okay. So you don't want to consider any offloads in such a device, but I want to say that this kind of device is able to do some really nice offloading, like tunnel offloading or flow steering, like offloading a full virtual switch, and that's the kind of thing you can get when you use DPDK and the PMD.

Yeah, absolutely. So if your goal is to actually take advantage of all the offload features on the cards, then the value of the off-the-shelf drivers increases, because the amount of work that you would need to do to implement that in your own driver increases. So I think that this approach is very well optimized for a very, very software-oriented approach. If you're thinking of a hybrid approach, with some things in standard hardware and some things in standard software, then there's a lot of value there. So yes.

And, full disclosure, I'm working for Mellanox on DPDK. So if you want more details, or if you want to test this device, I'm available this evening.

The PRM? Sorry? The new PRM. The one Luke talks about is the old one; you published a new one.

Yeah, not really a new PRM, but there is work in progress to make things simpler, because you are right, there is a big stack, and it is going to be shrinking.

Do you want to fight here? Most of it is missing at the moment. Anyway, we can discuss the details this evening if you want. The revolution has started. You can also lose some of your pieces here.

Checksums. Don't you want anything from the hardware?

No, absolutely not. So checksums in hardware are a really, really old-fashioned idea. In Snabb, we don't even use scatter-gather from the hardware. We do everything in unified packet buffers, and if you want to use vector instructions to do a checksum on a CPU these days, it hardly takes any time at all. And, okay, some of the offloads are quite interesting if you're developing a Unix kernel, because a lot of them have been developed in that sense. So if you're a kernel developer, I don't begrudge you using all of those offloads. But I think for new applications, most of the time the offloads are not really going to be supporting what you want, or you'll find special cases that they don't support, and that kind of thing. So specifically for checksums, I think that this is a yesterday feature.

What about segmented transfers into memory? Where you don't get the whole frame at the same time, but get chunks or...

So our approach is very simple. A packet is an MTU-sized array. You get the whole packet in. You assume you're going to bring the whole packet into cache. Once you have it in cache, you can do things like checksums very, very efficiently. This is the basic programming model that we adopted, and it makes things very, very, very simple. I think it's a slippery slope: when you start trying to offload things to the NIC, then you start saying things like, well, hey, maybe we don't need to get the payload into L2 cache at all. And then you're spending the rest of your life in a balancing act, trying to keep it out of L2 cache, because that's going to change your performance characteristics. But if you just say, hey, let's load the packet into L2 cache, then everything is easy. So I think this is the way to do simple user-space networking, in my humble opinion.

Does writing in Lua improve the user experience at all? Like, there's been... typically, something like writing vector instructions for DPDK has a very high barrier to entry, for instance. Does Lua help with that at all?

I think that it does. I think that by having Snabb written in Lua, it appeals to a different set of people. I think that DPDK is very well designed for making people coming from, like, a kernel or VxWorks background comfortable, and keeping things like what they would expect. But there's another group of people who have not been doing networking development because they found it intimidating. They found that there have been a lot of barriers to entry. They don't want to write kernel modules. They don't want to program in C. And I see Snabb as catering to these people. So we're really giving an on-ramp to people who were kind of not a part of the networking world before, to give them an easy way to get involved.

So in Snabb we do support virtio, but personally I don't have a lot of nice things to say about virtio. That's partly because we have also done the server side of virtio. Virtio has a lot of options. And if you want to make all the virtual machines happy, if you want to provide virtio to DPDK and the Linux kernel and so on, which we do in Snabb, it's a huge pain, because they expect all of these magical offloads. And if you don't give it to them... well, you have to do a lot of work to give it to them. So virtio is not simple.

So how do you manage to do line rate while doing the checksums and so on?

So if you've got a modern server with a 100 gig network card and a big CPU, you have something like a one-to-one ratio of instructions you can retire compared with bits. So if your average packet size is about two kilobits, you can probably retire about 2,000 instructions per packet, assuming you're getting good IPC.

Small packets, though...?

So basically, CPUs are just really, really, really, really fast. That's the answer. They're just really fast. You don't even need vector instructions.

Thank you so much, Luke. Thank you. Thanks, we have...