Okay. Right. Last session for this afternoon; the first set of speakers is Tzafrir Cohen and Talat Batheesh, and it's "The State of RDMA in Debian". Thank you.

Okay. So, well, hi everybody. My name is Talat. First of all, thank you for being here, at DebConf and at this talk. I would also like to thank Mellanox for giving me the opportunity to attend DebConf. We are going to talk about RDMA in Debian. We will give a basic introduction: what is RDMA in general? We assume some basic previous knowledge about RDMA; we will not go deep into the protocol itself, and if you have any questions about the protocol, I will try to answer them afterwards. You can find me around and ask me questions. Then we will talk about RDMA in Debian on both sides, in user space and in kernel space.

Okay. Both of us are employees of Mellanox, which is one of the major players in this space. Of course, this talk represents what we know, and sadly it is a bit skewed towards Mellanox hardware. Of course, in this talk we represent ourselves and not Mellanox. And yeah, let's go into the talk.

Okay. So, what's RDMA? The name means Remote Direct Memory Access, or remote DMA: it gives us access from the memory of one computer to the memory of another, remote one, without involving the CPU or the OS. For this we rely on support in the hardware; we call it a channel adapter, or host channel adapter. RDMA uses a different programming interface than sockets; it's called verbs. As you know, the CPU is an expensive element in the data center, and we should maximize its utilization. Also, real-time applications require low latency and a consistent response time. For this, we need RDMA. RDMA is also asynchronous, which means threads are not blocked: while we are transferring data, we can do anything else.

Okay. In this diagram, we can see buffer one, which we would like to transfer from host A to host B, or host one to host two. In green we can see the flow of RDMA, and in orange we can see the flow of TCP/IP. The HCA and the NIC can be the same device, because there are devices that support both Ethernet and RDMA, for example RoCE or iWARP. In TCP, we copy the buffer from the application to the operating system, from the operating system to the NIC, onto the wire, and on the other side back up to the application. In RDMA, we give access from the memory of application one directly to host two, without copying: zero copy.

Now the RDMA layers diagram; I put all of them in one slide, the kernel space and the user space. At the hardware level we can see the HCA, the host channel adapter; it's a network device. On top of it there's the core driver, for example mlx5_core, which we use for the recent Mellanox devices, or cxgb4 for Chelsio. The IB provider driver is also a kernel module that implements the RDMA ABI that ib_core defines. ib_uverbs is a character device that communicates between user space and kernel space. On top of it we can see, for example, mlx5: the IB provider, the RDMA provider libraries in user space that implement the IB verbs. And on top of that are the applications that run RDMA between them. Just to clarify, those lower drivers may be drivers for any type of technology; they may support, for instance, Ethernet or theoretically any other technology. This layer adapts it specifically for IB, InfiniBand, for RDMA. This is specifically the RDMA stack. Okay.
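To make the layering a bit more concrete, here is a minimal sketch (my own illustration, not something shown in the talk) of an application sitting on top of libibverbs: it enumerates the RDMA devices, opens the first one, and queries its attributes. The calls go through the provider plug-in library and reach the kernel via the ib_uverbs character device. It assumes the rdma-core development packages are installed and builds with gcc and -libverbs.

```c
/* list_devices.c - minimal libibverbs sketch: enumerate RDMA devices
 * and query the first one.  Build with: gcc list_devices.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list || num == 0) {
        fprintf(stderr, "No RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(list[i]));

    /* Opening a device is where the user-space provider library starts
     * talking to its kernel counterpart through ib_uverbs. */
    struct ibv_context *ctx = ibv_open_device(list[0]);
    if (ctx) {
        struct ibv_device_attr attr;
        if (!ibv_query_device(ctx, &attr))
            printf("max QPs: %d, max CQ entries: %d\n",
                   attr.max_qp, attr.max_cqe);
        ibv_close_device(ctx);
    }

    ibv_free_device_list(list);
    return 0;
}
```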
RoCE is RDMA over Converged Ethernet. Usually, RDMA runs on top of the InfiniBand spec, but if we have Ethernet devices that support RoCE, we can run RDMA on top of them. There are two types of RoCE; on the slide we see just RoCE v2, but there is also RoCE v1. The common one is RoCE v2, and it is the newer one. The main difference between them is that RoCE v2 is UDP-based, which means it's routable, while RoCE v1 is not routable.

Here's the RoCE architecture. You can see how it fits the RDMA diagram that we saw earlier in the presentation. The same Ethernet NIC can use the TCP/IP sockets and can use the RDMA stack; the same NIC can run RDMA and Ethernet traffic at the same time.

Soft-RoCE. Well, RoCE, basically, still requires support at the network adapter level: we still require a specialized network adapter with support for the acceleration. Soft-RoCE replaces this with software emulation, both in the kernel and in user space in libibverbs. We still get some of the benefits of RDMA, so we may still get performance that is better than plain sockets, plain TCP or any socket-based program, but it's still not as good as using a hardware device. But, again, it's all you get; it's the only thing you can have if you don't have the hardware, and as we show later, it's much simpler to set up. This is how Soft-RoCE fits into the scheme. You can see that this is basically any network adapter; it doesn't have to have any specialized support. There is the support in user space in libibverbs, as a provider, a plug-in of libibverbs. Again, in the kernel, under ib_core, there is a driver for it, and this driver talks to the standard Ethernet driver. But it should support PFC, priority flow control, for...

Okay, the RDMA process flow. In RDMA, we don't call it a client or a server; there is no client and server, there is an initiator and a target, a requester and a responder. There is the memory region: it's memory that is registered and bound to some application. All RDMA work is done with queues. The QP is a queue pair, a send queue and a receive queue, which holds all the work requests that we would like the hardware to perform. And the completion queue holds all the completion elements for work that is already done: when the hardware finishes a work request, it puts a completion queue element there for us, and we get it in the application as a completion. The most common RDMA operations are send/receive, RDMA read and RDMA write, and there are also atomics; all of that is in the InfiniBand spec. I put a link to the InfiniBand spec in the slides if someone would like to read it, so you can download and read it.
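To connect these terms to actual code, here is a rough requester-side sketch with libibverbs (again my own illustration, not from the talk): it allocates a protection domain, registers a memory region, creates a completion queue and a reliable-connected queue pair, then posts one RDMA write and polls for its completion. The queue pair state transitions and the out-of-band exchange of the queue pair number, remote address and rkey with the responder are omitted, so treat it as a sketch of the flow rather than a complete program.

```c
/* Rough requester-side sketch with libibverbs (illustration only).
 * Error handling, QP state transitions (INIT -> RTR -> RTS) and the
 * out-of-band exchange of QPN/GID, remote address and rkey are omitted. */
#include <stdlib.h>
#include <stdint.h>
#include <infiniband/verbs.h>

#define BUF_SIZE 4096

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    struct ibv_context *ctx = ibv_open_device(list[0]);

    /* Protection domain: resources created under it are used together. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Memory region: register the buffer and obtain its local/remote keys. */
    void *buf = malloc(BUF_SIZE);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);

    /* Completion queue: the HCA reports finished work requests here. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    /* Queue pair: a send queue and a receive queue, here reliable-connected. */
    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);

    /* ... here the QP would be connected to the peer (ibv_modify_qp through
     * INIT/RTR/RTS), exchanging QPN, LID/GID, remote address and rkey
     * out of band, for example over a plain socket or with librdmacm ... */

    /* Post one RDMA write: the HCA moves the data, no remote CPU involved. */
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = BUF_SIZE,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = {
        .opcode = IBV_WR_RDMA_WRITE,
        .sg_list = &sge,
        .num_sge = 1,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = 0, /* peer's buffer address, from the exchange */
        .wr.rdma.rkey = 0,        /* peer's rkey, from the exchange */
    };
    struct ibv_send_wr *bad_wr;
    ibv_post_send(qp, &wr, &bad_wr);

    /* Poll the completion queue until the work request completes. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ; /* busy-poll; a real application may use completion channels */

    return 0;
}
```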
Now, we are going to see a demo of RDMA versus Ethernet. In the demo there are two servers, my servers, connected with 100 gigabits per second between them. The device is Ethernet, but it supports RDMA. The traffic uses a message size of 64 kilobytes. The traffic generators that I use are iperf for TCP and ib_write_bw for RDMA. Trust me, I know that you cannot see this from your seat. Okay, so this is iperf, again a simple TCP performance test, a traffic generator, and we run here a single thread. What we can see, and some of you can see, trust me: for a single thread, we get 22 gigabits per second. And the CPU utilization? Again, there is some CPU utilization; there is one CPU core that is fully utilized, on one thread. So, on a 100 gigabit per second Ethernet device, we get just 22 gigabits of Ethernet. In order to get more, we have to run more threads. So, we run eight threads in order to get the full bandwidth, and we get 94 gigabits per second, but eight cores are fully utilized because we are running eight threads.

Okay, now we are running a single thread of ib_write_bw. Can you make the letters bigger? No, it's a video, but we'll try to repeat this; trust me, I will show you the video again. If we have time, we can do a live demo, but not on those machines, as we don't have access to them. Okay, this is a single thread of ib_write_bw, running the same message size. I don't want to talk about the latency, because here too we get better latency, but we can see the bandwidth: it's 92 gigabits per second without utilizing the CPU even once. All the CPU cores are free.

Okay, now we move to RDMA in Debian. On the kernel side, the RDMA subsystem development is usually upstream first, so the Debian kernel is up to date, aligned to the kernel it is based on. All the RDMA drivers are in the directory drivers/infiniband, and the providers' code is in drivers/infiniband/hw, where you can find all the providers' kernel modules.

User space. Nowadays we have the rdma-core user space package, with the libraries and daemons. A few years ago, this package was reorganized: instead of around 20 small repositories for the different providers, we gathered them in one git tree named rdma-core. This package was introduced to ease the acceptance and the review of all the new features and code. There's a group of maintainers to speed up acceptance and review, via the same mailing list for user space and kernel space. This is the mailing list; if someone would like to report a bug or post new features, you are welcome. And the Debian packaging is done in the same upstream git repository. And just to stress, this is new in Buster. Before Buster, in Stretch, we still had the older version of those libraries; it was added in stretch-backports. So this is one new interesting feature of Buster. I'm not sure about bug reports; it may be interesting to report bugs directly upstream, but there are few bug reports on the Debian packaging. It also includes a Python library for developing RDMA; if someone would like to develop in Python, there is a Python library for RDMA (pyverbs).

Are you able to... Yeah, it's okay. So, inside Mellanox OFED, the tgz that you used before, there were some missing binary packages, like mlxconfig, in order to change the firmware, to enable virtual functions and things like that. So, are you including all of them right now? No, because this is vendor-specific. For example, in order to change some configuration for Mellanox devices, you need, for example, the mstflint package. It's not in the same git repository, but it's also in Debian. Where is it? It's under... Okay, just to clarify, mstflint, for instance, is included as a separate package. There is a large software distribution, which I happen to be, actually both of us are, part of its packaging inside Mellanox. It's called Mellanox OFED. It includes a whole lot of other tools.
Some of it is proprietary, and there is a bunch of drivers, some of which can't go upstream for various reasons. We try to reduce its size, and some of the tools there are not included in Debian and other distributions. But mstflint is included in Debian, and there are also new tools in mstflint itself for configuration, like mstconfig, and for Mellanox firmware. Thank you.

Other tools: here's mstflint. perftest is a set of performance tests that run RDMA; it can also be used as examples for developing RDMA. infiniband-diags is a set of tools to design, configure and debug an InfiniBand fabric. OpenSM: if you are running InfiniBand, you need a subnet manager, and opensm is the package that provides the subnet manager. And mstflint, for Mellanox devices for example, is for burning a new firmware or configuring the device, on VPI devices, to be InfiniBand or to be Ethernet, so you need a burning tool and configuration tools.

Okay, to set up RDMA on a machine running Debian, you just need to install rdma-core, the ibverbs-providers and ibverbs-utils. If you are running RoCE, no configuration is needed: just set an IP on the interface and you can get RoCE v2 and RoCE v1 working. If you are running InfiniBand, and there is no OpenSM running on the subnet, either in the switch or on one of the hosts connected to the fabric, then you need to install the opensm package and run opensm.

So what's missing in Debian? As far as I know, there is no relationship between the RDMA providers and Debian, so there are no certifications, no tests. It's a bit complicated to do something like that, because you need to learn the hardware and put it in physical machines, and this is a bit complicated. On the kernel side, there is no backporting of fixes to the stable, the LTS releases. So this is also missing, and we should do something there.

Now we are going to give a live demo of Soft-RoCE. It's an emulation of an RDMA device in software, so I think everyone can run RoCE on their laptop; it's a good utility for students or for someone who would like to start developing RDMA, so it's a good start. Basically, those two systems here are running QEMU, plain QEMU. Each has two network interfaces. I set it up really, really simply; I can post the configuration scripts later, but basically there is one network interface that is configured with the user netdev, so I can SSH in from the host, and one network interface which is configured with what QEMU calls socket, so I can connect between the two machines. All plain user space, nothing more. And this also means, as I use the socket backend and nothing more fancy, that I can't get great performance here, but it's simple and it gives a good demonstration.

So right now, the basic configuration of RoCE. You still need to configure something, and it's already configured here. Basically, what I needed to do: there is a tool called rxe_cfg. Nothing too complicated here. rxe_cfg start loads a bunch of modules, so if those modules are not already loaded, run start, or load those modules yourself. I got the thing loaded with some systemd script, but again, I could have just manually run rxe_cfg start and then rxe_cfg add with the name of the network device, eth1, and that's it. Okay, so this means that I have RoCE configured, a RoCE device attached to this Ethernet device.
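As a quick sanity check at this point (my own addition, not part of the recorded demo), a few lines of libibverbs code are enough to confirm that the rxe device is visible to applications and that its port reports an Ethernet link layer, as expected for RoCE:

```c
/* check_rxe.c - verify that an RDMA device (e.g. the rxe one) is visible
 * and that port 1 reports an Ethernet link layer, as expected for RoCE.
 * Build with: gcc check_rxe.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list)
        return 1;

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        struct ibv_port_attr port;

        /* Query port 1 of each device and report its link layer. */
        if (ctx && !ibv_query_port(ctx, 1, &port))
            printf("%s: port 1 state %d, link layer %s\n",
                   ibv_get_device_name(list[i]), port.state,
                   port.link_layer == IBV_LINK_LAYER_ETHERNET ?
                       "Ethernet (RoCE)" : "InfiniBand");
        if (ctx)
            ibv_close_device(ctx);
    }

    ibv_free_device_list(list);
    return 0;
}
```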
So let's look at the other system; and again, those are connected to each other directly, as if with a cable, directly. So, okay. ibv_rc_pingpong comes from the package. My battery is okay.

I have a question concerning security. Are there any concerns about having direct access to the memory of another machine? Were there problems in the past with security? Security, yeah, there are security issues all the time, but I am not aware of security issues in InfiniBand, because there is a protection domain that controls the pages that you are accessing, and there is a key: you get a virtual address, and with the key it is translated to a physical address. So you need the key in order to access this page or this address in the hardware. If you don't have the key, how would you get access to this page or this address? And if you develop your own application, you should take care of the security. From the InfiniBand spec side, it's well secured. But if I write an application that tries to access memory on the other machine that it should not, then the remote machine cannot say "you are not allowed to access this"? It depends on the RDMA operation that you are going to use. If you are doing RDMA read or RDMA write, you write directly on the remote host. If you are using, for example, post send, the other side should first post a receive request, and you also do a kind of three-way handshake in which you tell the other application where you would like this data to be written. So you have an address and a key.

Should we do more questions? That's, again, a very simple RDMA tool, ping pong. It works. Performance is not stellar, but it works. Yeah, and I didn't use any type of acceleration within KVM. Should we try the real servers? Should we see something, or more questions?

So, first question. Is this RoCE v2, based on UDP, and is that why you did it in software, and there wasn't one before? You are talking about the software version? Yeah, the software version. Is it UDP-based? You can check; it depends on the GID that we are running on. But it's RoCE v2, right? Yeah. Okay. If you ask about the application that we ran, yeah, it was RoCE v2. And the second question is: IBM created a wrapper library to use RDMA transparently, so you basically preload a library and use it as if it were a TCP/IP socket. Do you have anything similar to that, so you don't have to handle the RDMA protocol? We have something like that, but not directly, not on top of RDMA. There is something that works on top of Mellanox hardware, I believe. It's libvma, right? It accelerates the application; it's an acceleration library. I didn't run any performance optimization or any performance utility here, so it's native; it's installed from scratch. Thank you.

Thank you for the presentation. It was so interesting and inspiring. I want to know whether you compared the results of Soft-RoCE with the normal iperf in terms of packets per second or not? We are not from the marketing or the sales team, but we can do it together. It's okay: we have machines with Soft-RoCE and also machines with real RDMA, and we can compare them together. Or the comparison in terms of packets per second: do you have Soft-RoCE versus the normal kernel stack? Because you said that Soft-RoCE is a bit better than the kernel stack. I want to know whether you have the numbers ready or not. Sorry. Yeah, we can compare it.
We can have, for example, a traffic generator tool running 64-kilobyte data between two servers, and we can see how many packets we transfer with Ethernet and how many packets we transfer with Soft-RoCE or with RDMA. For now, the numbers are not ready; we can do it afterwards.

Another question: I want to know whether you have compared this with something like ntop's PF_RING or not. Because PF_RING also bypasses the kernel stack, and you can directly access the packets from the network adapter's memory. I want to know whether there is such a comparison or not. I don't have other vendors' equipment; sorry, I didn't hear, what was the provider that you mentioned? ntop's. I don't have any other provider's equipment, so I cannot say that I compared; I didn't compare. But if you are running on top of a NIC that supports 100 gigabits per second and you get 94 gigabits per second without utilizing the CPU, I think that this is enough. Without any CPU, without any performance optimization. Thank you very much. I don't care what the other side has, because my equipment is very good. But whenever you provide a new approach and we deal with the numbers, it is always interesting in the benchmarks to have the other vendors, or another similar approach, as a second number for comparison. I'm just asking from that point of view. I would like to avoid speaking on behalf of Mellanox, but I will. We, and when I say we, that's Mellanox, have all the results on the Mellanox website. You are more than welcome to visit the website, download them, review them and give me your feedback. Thank you very much. Just in regard to your question about optimization: yes, I do just one thing, which is nopti, no Page Table Isolation, just to be honest.

Let's try to repeat the demo on real servers. These are not the same servers; they are other servers that have an RDMA NIC. Okay, so let's see. I have just one interface that is up. The script ibdev2netdev is currently not included in Debian. We borrowed it from Mellanox OFED, but we find it quite useful, so we should at some point just push it upstream. It's a big shell script, a simple bash script that shows the mapping of the devices, from RDMA devices to the Ethernet devices. So first iperf, that's a single... is this readable enough? So that's a single thread. You can see the CPU utilization, and the results seem to vary from one run to another. Now I run it with, for instance, eight threads. He asked if we're using TSO, if we're offloading. Yeah, because the checksums and the frames would be constructed by the NIC, right? So it's unfair, because, I mean, if you are not using TSO, of course you would use lots of CPU, you know? That's what I... and I have a last question, I swear. Yeah. So, I'm really interested in the packages you're not putting upstream, and I'll explain why. I'm from Canonical, and let's say I don't want to use Mellanox OFED, because it's hard for me to support at Canonical. So let's say I want to support, you know, my kernel, and your Mellanox OFED has DKMS modules, for example. Sometimes you reimplement Netlink things that would break compatibility with my kernel, right? So I would like to use my stack, and I cannot right now, because, you know, as I already said, there are some tools that are really important for Mellanox cards that are not upstream.
And it's hard for me to tell someone, you know, like an end user, "go and install Mellanox OFED", which I don't disagree with, because it's your development model, you know? Yeah. I would like them to be upstream. Okay, I will answer the question differently. I will tell you what's the difference between Mellanox OFED and the kernel drivers that we have upstream. There are some verbs that are in OFED and not in upstream, and we are going to get rid of them in the next releases. And there is also backporting. For example, in order to make a feature that is already in upstream 5.2 today work on, for example, Ubuntu Xenial 16.04 with kernel 4.4: these features also require parts from the kernel, for example from net/sched, and we should backport all of these parts. We cannot do that in upstream; we cannot backport all the features in upstream, so we do it in our package. We build, for example, mlx_compat, which handles all the needed functions or functionality from the kernel for each operating system; it's generic. This is why we are developing Mellanox OFED: for customers that have an old operating system and also would like to develop with the newest features of RDMA. Yeah, it's hard to keep up with the several backports you do in Mellanox OFED to fix bugs, because there is no git versioning on them, for example, and there were also tools with missing source code. But apart from that, I totally see what you're saying, and it's really hard to keep up with all the features. Mellanox develops quickly in the kernel, so some things are only in recent versions, like flow steering, for example, which we had to backport. Yeah, I do understand that. Thank you, and a last question.

Is it question time now? Seems so. I have a question from IRC. He asks, or the person asks, does this have similar network firmware concerns or implications as in smartphones? In other words, can the network hardware read or write memory the CPU did not intend it to? Could you please repeat that? Can the network hardware read or write memory the CPU does not intend it to? Yes. Again, there is an initial setup of memory regions, right? Yeah, there's a memory region that is bound to the application, and it's registered memory. We are not doing anything that's illegal; it's memory bound to the application, and it's a memory region. I mean, think about support for DMA that also needs to work with virtual machines, through the IOMMU and such; this problem is not unique to RDMA.

All right, more questions? Do you have some closing comments from the presenters? Okay. Thank you very much. Thank you. And a final note: we hope to see you all next year in Haifa. No, next year we are in Haifa.