Thanks for being here. We're almost at the end of the session, so it has probably been a long day. I'm going to talk about what we're doing in user-space networking with RDMA. It's not really about RDMA per se, as you will see in the course of the talk. My name is Benoît Ganne, I work at Cisco, and I work on VPP.

For those who missed the excellent talk about QUIC from my colleagues, a few words about VPP. VPP is a user-space software networking platform, and it strives to be very, very fast and to have a lot of features: tunnelling, routing, switching, all that kind of stuff, and even layer 4 things like TCP, UDP and now QUIC. But at the very bottom of our stack we need to receive and transmit packets, and we need to do that efficiently; and from time to time we ask ourselves the question: do we need to write our own network driver to do that? And sometimes we have to.

So first, why do we do user-space networking? Mostly because of performance, but also because it allows us to update network functions in the field more easily: if you have a networking function that you need to update, you don't necessarily want your whole host to go down because you need to reboot into a new kernel. Those are the features that user-space networking brings us.

And then the next question is: OK, why would you need to write your own network driver? Especially as, you know, we now have DPDK, which has a lot of drivers already available. And we do use DPDK a lot in VPP; we rely on it for a lot of drivers. But for some specific drivers we want to go a little bit further in terms of performance and ease of use, and in that case we need to develop our own network driver. Here is an example where we did native development: it was for the Intel XL710, the 40-gigabit Intel NIC. By developing our own native driver we gained about a 25 percent performance boost on some workloads, like IPv4 forwarding with two million routes. So that's something which is important for us.

And one of the main reasons for that is that when you're using an external driver like DPDK, for example, you might have some translation happening between metadata. This is actually something that you even have with XDP, which was presented in an earlier talk. When you get a packet from another driver, you usually need to convert the bits describing your packet, such as the length of the packet, whether the checksum is good, this kind of stuff. In VPP we have our own representation, in DPDK they have their own representation, in XDP you have yet another one. So you need to translate between them, and that costs some performance.

That being said, it's not always an easy decision, because first, usually when you do user-space networking with your own user-space network drivers, you're losing integration with the kernel, and that's a pity because you're losing your favorite tools like ethtool and so on. It makes things harder to deploy, harder to monitor, etc.
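To come back to that metadata translation for a second and make the cost concrete, here is a rough sketch of the kind of per-packet conversion I mean. The rte_mbuf side is DPDK's; the buffer structure and flag on the other side are simplified placeholders made up for this example, not VPP's actual vlib_buffer_t, and the checksum flag name is the one used by older DPDK releases.

```c
#include <stdint.h>
#include <rte_mbuf.h>

/* Hypothetical, simplified stand-in for a VPP-style buffer;
 * the real vlib_buffer_t layout is different. */
typedef struct
{
  void *data;
  uint16_t len;
  uint32_t flags;
#define MYBUF_IP_CKSUM_GOOD (1 << 0)
} my_buf_t;

/* Per-packet metadata translation: this runs for every single
 * packet, which is where the performance tax comes from. */
static inline void
mbuf_to_mybuf (struct rte_mbuf *m, my_buf_t *b)
{
  b->data = rte_pktmbuf_mtod (m, void *);
  b->len = m->data_len;
  b->flags = 0;
  if (m->ol_flags & PKT_RX_IP_CKSUM_GOOD) /* flag name from older DPDK */
    b->flags |= MYBUF_IP_CKSUM_GOOD;
}
```

Each field copy is cheap, but it runs for every packet; at tens of millions of packets per second, this is exactly the kind of tax that shows up in profiles.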
So that, losing the kernel integration, is one of the problems. Another thing is that what we really care about, in our case, is receiving and transmitting packets efficiently. We don't care about initializing the device, configuring it, making it work; but we have to, and unfortunately it happens that, most of the time, this is the most complex part of the work when you write a driver. There is a lot of code, and a lot of datasheet reading, which goes into "OK, how should I initialize this thing to make it work?". And finally, something that we should not forget: hardware is hard. I will share a story about that later on. So that's basically our problem statement, let's say.

So, RDMA. RDMA stands for Remote Direct Memory Access. It was originally designed for message passing and data transfer, especially for HPC or storage applications, so it's a very efficient way to move data around in a network. And it has evolved to support Ethernet transports, so now we have RDMA protocols which run on Ethernet and IP, such as iWARP and RoCE. The key properties of RDMA are: you have some hardware offloads; it's kernel bypass, meaning that you get the data directly from the NIC without having to go through the kernel; it's also zero-copy; and it's usually very high performance. So it looks like a good thing to use for networking, especially for user-space networking.

On the right side of the slide, I've tried to picture in a very simplified way what the RDMA stack looks like. At the top you have the RDMA NIC, the hardware, in blue. In the middle you have the kernel, which basically does all the configuration and control. And at the bottom you have user space: there is a library in user space called libibverbs, which is used to talk to the kernel but also to the hardware to move data around, and below that you have the application; in our case it's VPP, but it could be anything. The nice thing with RDMA is that the data moves directly between the hardware and the user-space software: there is no data touching in the kernel.

That said, we are not that interested in RDMA itself in VPP, at least for now. We're more interested in Ethernet packets, and it happens that RDMA has been extended for Ethernet support, in a way: it introduces a new kind of queue that you can use to send and receive Ethernet packets, called queue pairs of type "raw packet". And the nice thing about them is that they rely on packet steering to steer packets between the different queues.
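Here is a minimal sketch of what creating such a raw packet queue pair looks like with libibverbs. Error handling is mostly omitted, the queue sizes are made up for the example, and note that raw packet QPs typically require the CAP_NET_RAW capability.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int
main (void)
{
  /* grab the first RDMA-capable device reported by the verbs library */
  struct ibv_device **devs = ibv_get_device_list (NULL);
  struct ibv_context *ctx = ibv_open_device (devs[0]);
  struct ibv_pd *pd = ibv_alloc_pd (ctx);

  /* completion queues: where the NIC reports finished work */
  struct ibv_cq *rx_cq = ibv_create_cq (ctx, 512, NULL, NULL, 0);
  struct ibv_cq *tx_cq = ibv_create_cq (ctx, 512, NULL, NULL, 0);

  /* the interesting part: a queue pair of type "raw packet", which
   * carries plain Ethernet frames instead of RDMA messages */
  struct ibv_qp_init_attr qpia = {
    .send_cq = tx_cq,
    .recv_cq = rx_cq,
    .cap = { .max_send_wr = 512, .max_recv_wr = 512,
             .max_send_sge = 1, .max_recv_sge = 1 },
    .qp_type = IBV_QPT_RAW_PACKET,
  };
  struct ibv_qp *qp = ibv_create_qp (pd, &qpia);
  if (!qp)
    perror ("ibv_create_qp"); /* raw packet QPs need CAP_NET_RAW */

  /* the QP must still be moved through its states with ibv_modify_qp()
   * before use; the full example mentioned below covers that part */
  ibv_free_device_list (devs);
  return 0;
}
```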
So this packet steering allows the kernel network device to keep its own set of queues for receiving and transmitting packets, while your application gets a different set of queues, and your application can also receive and transmit packets; the hardware selects which queues it needs to send packets to based on the flows you program. That allows your user-space application to have direct access to the network hardware while retaining your kernel net device: your net device does not suddenly disappear just because you started your application. And as the packet steering is based on what you choose to program there, you can implement things similar to macvlan or ipvlan: you can say "my application is interested in all the packets with this specific destination MAC", for example, or "my application is interested in all the packets with this specific IP", and you can go even further up the stack; it depends on the capabilities of the hardware, basically. So it's a nice model.

And this model is actually quite easy to use, at least if you use libibverbs. What I show here is simplified, but you can have a look at a simple yet fully working example on GitHub here; the full example is like 200 lines of code, so it's not very complex, right? And, depending on your hardware, your mileage may vary, but typically it allowed me to send like 20 million packets per second with one CPU. So that's not bad for a simple, stupid test.

The way it works is: you just get a handle on the device you're interested in, and this is all user space, right, with some help from the kernel here. Then you initialize your queues. RDMA has a concept of queue pairs: you basically get queues to push packets to, and queues to get the results of what you did. And the same is true for receiving packets: you get queues to push your descriptors to, to tell the NIC where it must put the packets you're interested in, and completion queues, again, to get notified when new packets have arrived. Then you just push your work queue elements, which are like iovecs, IO vectors, and you do that in a loop. And that's about it. So that's pretty cool.

You can go a little bit deeper with Direct Verbs. Again, it's an extension in ibverbs which allows you to do the same thing as previously, but you get direct access to the hardware rings: instead of going through the mediation of libibverbs to get your packets, you go directly to the DMA rings. And that's really great, because in that case you don't have any metadata translation anymore; you just get the raw packets given to you by the NIC. So that's pretty cool.

As an example, for us: we first developed an ibverbs version of the driver, and it gives us around 20 million packets per second for layer 2 cross-connect. Layer 2 cross-connect is just moving packets around between two ports, while checking that the Ethernet headers etc. are well formed; so you're still touching the packet headers, but you don't do much. So that gives us around 20 million packets per second. The next step is Direct Verbs, going for the hardware rings.
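Before we leave the libibverbs path, here is a sketch of the two remaining pieces: the steering rule that sends traffic to our queue pair, macvlan-style, and the receive loop. The MAC address and buffer size are invented for the example, the single-buffer handling is deliberately naive, and qp, pd and rx_cq are assumed to come from the setup sketched above, with the QP already moved to a receiving state.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

static void
rx_one_packet (struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *rx_cq)
{
  /* steer every frame with this destination MAC to our QP, macvlan-style */
  struct
  {
    struct ibv_flow_attr attr;
    struct ibv_flow_spec_eth eth;
  } __attribute__ ((packed)) flow = {
    .attr = { .type = IBV_FLOW_ATTR_NORMAL, .size = sizeof (flow),
              .num_of_specs = 1, .port = 1 },
    .eth = { .type = IBV_FLOW_SPEC_ETH, .size = sizeof (flow.eth),
             .val.dst_mac = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 },
             .mask.dst_mac = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff } },
  };
  struct ibv_flow *steer = ibv_create_flow (qp, &flow.attr);

  /* register a chunk of memory so the NIC can DMA packets into it */
  static char buf[2048];
  struct ibv_mr *mr = ibv_reg_mr (pd, buf, sizeof (buf),
                                  IBV_ACCESS_LOCAL_WRITE);

  /* post one receive work element: "put the next packet here" ... */
  struct ibv_sge sge = { .addr = (uintptr_t) buf, .length = sizeof (buf),
                         .lkey = mr->lkey };
  struct ibv_recv_wr wr = { .sg_list = &sge, .num_sge = 1 }, *bad;
  ibv_post_recv (qp, &wr, &bad);

  /* ... and busy-poll the completion queue until it shows up */
  struct ibv_wc wc;
  while (ibv_poll_cq (rx_cq, 1, &wc) == 0)
    ; /* wc.byte_len is the length of the received frame */

  ibv_destroy_flow (steer);
  ibv_dereg_mr (mr);
}
```

In a real driver you would of course post receive buffers in batches and poll for many completions at once, but the structure of the loop is the same.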
And yeah, while doing that, while I was trying to debug my driver, I saw that the NIC could give me some hardware traces. I tried to enable them, and I almost bricked the NIC. So that's why you need to be careful when you're doing very low-level development with hardware. And yes, as a next step we would like to add support for things like checksum offload and TSO.

In conclusion, what I would like to say is that as user-space networking developers we really like this model. We like the fact that we don't need to write any code to initialize the NIC: that's great. We like the fact that we are not stealing the NIC from the kernel: that's really cool too. And of course we really like the fact that it gives us great performance. But this has some limitations. First, it's RDMA, so it's restricted to RDMA-capable NICs; and more precisely, right now it's something that Mellanox came up with, and thanks to them for that great technology, it works really well, but there's only Mellanox for now. So if other vendors could also support this, we would be very happy about it. And the final thing, which is a little bit outside of my domain knowledge, is that I'm wondering: could we apply the same kind of features to other mechanisms? For example, maybe VFIO mdev could allow us to get the same thing, where we don't touch the NIC for initialization, because we don't care about that, but we get direct access to the rings. Or maybe AF_XDP could also help us there. Right now, my understanding of AF_XDP is that you still have some metadata conversion between what the kernel gets and what it gives you, so you get some decreased performance because of that; but maybe we could use it as a foundation to have something similar. I don't know, that's just food for thought. And that's about it. I don't know if there are any questions.

OK, so the question is: do we use this to send packets over the RDMA stack, so iWARP or RoCE, or do we use it to send raw, standard Ethernet packets? It's actually the second: we don't use the RDMA protocols per se at all. We're just using the RDMA infrastructure to send and receive raw Ethernet; there is no iWARP, no RoCE there. We're just using the infrastructure which is already there, but sending Ethernet packets instead.

The second question is why we wrote our own implementation, why we used direct verbs ourselves and did not use DPDK, which already exists. Right, right, yes. Well, the reason is mostly because of this: sure, you can do it if you can fit in the DPDK metadata format and so on. The thing is, the issue we have with VPP is that it predates DPDK by like ten years, so we have our own buffer metadata format and we can't change that easily; it would basically mean a complete rewrite of the whole code base, which is not doable. So it means that no matter what we do, we need to pay the tax of converting the rte_mbuf format to our own metadata format. You have an example here with the Intel XL710: the DPDK driver for the XL710 is, I'm sure, as fast as it can get, right? But when we integrate it into VPP, because of this metadata translation and so on, we're leaving performance on the table. So that's one of the points. The other point is ease of use: we found that DPDK can be quite picky about what it needs to run; you know, it really wants huge pages and this kind of stuff, which is not always easy to get in containerized environments, etc.
And so that's another thing we're trying to overcome with this approach. Thank you.