So we'll kick off. Hi everyone. My name is Kevin Laatz, and this is Ciara Loftus. We're network software engineers at Intel, out of Shannon, which is hidden away in the west of Ireland. Today we're going to mix some Kool-Aids and hopefully make things go a bit quicker on the internet.

Most of you probably know what DPDK and AF_XDP are. For those of you who don't, we'll do a quick introduction. DPDK is a set of user space libraries and drivers. They aim to accelerate packet processing workloads and they run on a variety of CPU architectures. Some important things to remember for this talk are that DPDK supports many different PMDs, which are usually device specific, and that DPDK has its own memory management system. AF_XDP is a kernel-based address family optimized for high performance packet processing. AF_XDP has its own sockets in order to move packets from kernel space to user space, and it uses the in-kernel fast path, so it bypasses the network stack in order to move those packets quickly.

If we take a closer look at a simplified diagram of the traditional DPDK model: down the bottom, in kernel space, we have DPDK-specific kernel modules. They interact with the NICs and expose them to user space. In user space, then, we have all of our DPDK PMDs and our applications, and they work together to do whatever wonderful things you want to do with your packets. The aim of this work was to introduce and use the new DPDK AF_XDP PMD, which talks directly to your NIC driver, so you can still use all of the usual kernel tools that you like using, like ifconfig and so on.

So the goal of this work was to have all DPDK applications working out of the box with the new AF_XDP PMD, and of course it should do so with good performance. The performance we were aiming for was close to, or on par with, the kernel sample app xdpsock. The challenge with this is that frameworks like DPDK have their own memory management, as I said, and these come with constraints and assumptions of their own. For DPDK specifically, we have a discrepancy between the DPDK and AF_XDP buffer alignment. This prevents us from mapping DPDK memory buffers directly into AF_XDP UMEM chunks, and in order to do that mapping we needed extra work and complexity, which negatively impacted performance.

Okay. So I'm going to talk about how both AF_XDP and DPDK lay out their memory for packet handling. I'll talk about the differences between the two and why those differences pose the integration challenges which Kevin just touched on there.

AF_XDP has this concept of a UMEM, or user memory. It's essentially an area of memory allocated by the user for packet data. The UMEM is split up into equal sized chunks, with each chunk being used to hold data from a particular packet. How it's used is, for instance, on the receive path the kernel will place packet data into a chunk for the user space process to retrieve, and in our case our user space process is DPDK. On the transmit path, the user space process places packet data into a chunk for the kernel NIC driver to transmit.

Prior to kernel 5.4, this UMEM, this area of memory that AF_XDP uses to hold packet data, had a number of restrictions on it in terms of its size and its alignment. The first being that the start address of the UMEM had to be page size aligned, so that's going to be 4K in most cases. The chunks within the UMEM had to be power-of-two sized, and as a side effect of that, the chunks could not cross page boundaries.
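To make those constraints concrete, here is a minimal sketch of registering a UMEM under the pre-5.4 rules. The struct xdp_umem_reg layout and the XDP_UMEM_REG socket option come from <linux/if_xdp.h>, assuming headers recent enough to define AF_XDP and SOL_XDP; the chunk count, chunk size and function name are just illustrative, and error handling is trimmed.

```c
#include <linux/if_xdp.h>   /* struct xdp_umem_reg, XDP_UMEM_REG */
#include <sys/socket.h>     /* AF_XDP, SOL_XDP on recent headers */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_CHUNKS 4096
#define CHUNK_SIZE 2048          /* must be a power of two, no bigger than a page */

static int create_aligned_umem(void)
{
	size_t len = (size_t)NUM_CHUNKS * CHUNK_SIZE;
	void *umem_area;

	/* Start address must be page-size aligned (typically 4 KB). */
	if (posix_memalign(&umem_area, (size_t)getpagesize(), len))
		return -1;

	int fd = socket(AF_XDP, SOCK_RAW, 0);
	if (fd < 0)
		return -1;

	struct xdp_umem_reg reg = {
		.addr = (uintptr_t)umem_area,
		.len = len,
		.chunk_size = CHUNK_SIZE,  /* power-of-two chunks never straddle a page */
		.headroom = 0,
	};

	/* Hand the area over to the kernel as this socket's UMEM. */
	if (setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &reg, sizeof(reg)) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```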
In a typical networking use case, those restrictions really leave you with only two potential chunk size options: either 2K or 4K. Anything bigger than 4K and you're going to cross a page boundary, and anything smaller than 2K isn't big enough for a networking packet, or a networking use case. So in this example here, we've got a chunk size of 2K, we have two 2K chunks per 4K page, and as you can see none of the chunks are crossing the page boundaries. Everything is nice and neat and tidy. The reason for these restrictions is essentially that they make calculations in the kernel a little bit easier; when everything is nicely aligned you can use things like masks, etc.

Okay, so let's see how DPDK lays out its memory for packet handling, and see if it satisfies the requirements of the AF_XDP UMEM. DPDK, as many of you know since we're in the SDN room, holds packet data inside structures known as memory buffers, or mbufs for short, and a group of those together is known as an mbuf pool. DPDK mbuf pools don't have restrictions as strict as the AF_XDP UMEM: for instance, mbufs can be of any size within reason, and they can have arbitrary alignment relative to the page size, so they can cross page boundaries. In this example here, we've got an mbuf size of maybe three and a half K, and our mbufs are crossing page boundaries all over the place.

And I suppose, why do we care whether or not the DPDK mbuf pool satisfies the requirements of the AF_XDP UMEM? The reason is that in order to get the highest performing integration of AF_XDP and DPDK, we need to map the mbuf pool directly into the UMEM to get a zero-copy data path, which is obviously going to be the most performant. But as you can see here, that's not possible at the moment. This is just one example of a DPDK mbuf pool; there are plenty more examples with different sized mbufs and different alignments, and most of them won't comply with the restrictions of the UMEM.

To get around this, the clever folks in the DPDK community have come up with a number of solutions to get the two to work together, each with a varying degree of success in terms of performance. The first solution that was considered was copy mode. In this mode we allocate memory for our UMEM, we also allocate our DPDK mbuf pool as normal, and we simply memcpy between the two locations in memory. This works really well, but in terms of performance it's not the most performant, just due to the cycle cost of the memcpy being pretty high. Nevertheless, it made it into a DPDK release, 19.05, as part of the series that initially introduced AF_XDP support.
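As a rough illustration of what copy mode costs per packet on the receive side, here is a short, simplified sketch; the function itself is hypothetical rather than the PMD's actual code, but the DPDK calls (rte_pktmbuf_alloc, rte_pktmbuf_mtod, rte_memcpy) and the AF_XDP descriptor layout are the real ones.

```c
#include <linux/if_xdp.h>   /* struct xdp_desc: { addr, len, options } */
#include <rte_mbuf.h>
#include <rte_memcpy.h>

/* Copy one received packet out of its UMEM chunk into a freshly
 * allocated mbuf from the application's mempool. */
static struct rte_mbuf *
copy_rx_packet(struct rte_mempool *mb_pool, const void *umem_area,
	       const struct xdp_desc *rx_desc)
{
	struct rte_mbuf *mbuf = rte_pktmbuf_alloc(mb_pool);

	if (mbuf == NULL)
		return NULL;

	/* rx_desc->addr is an offset into the UMEM; this memcpy is the
	 * per-packet cost that the later zero-copy approaches eliminate. */
	rte_memcpy(rte_pktmbuf_mtod(mbuf, void *),
		   (const char *)umem_area + rx_desc->addr, rx_desc->len);
	mbuf->pkt_len = rx_desc->len;
	mbuf->data_len = (uint16_t)rx_desc->len;

	return mbuf;
}
```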
The second approach that was looked into was an alignment API. It was proposed to introduce a new API in DPDK which allowed you to specify the type of alignment you wanted for your mbuf pool. Any application that wanted to work with AF_XDP could then use this new API and mold its mbuf pool to fit the UMEM requirements. Then you could do the one-to-one mapping and you could get your zero-copy performance. But even though this did give really, really good performance, it was deemed a bit too invasive, so it didn't make it into a DPDK release. It was invasive because you had to change your application to get it to work, which went against what Kevin said at the start about apps needing to work out of the box. So that didn't get into a DPDK release, but it generated a good discussion on the mailing list, which led to the third approach.

I think it was suggested by Olivier Matz and implemented by Xiaolong Ye. This approach uses DPDK's external mbuf feature, which allows a DPDK mbuf, instead of holding the packet data in the structure itself, to point to a different location in memory. In this case we point to our UMEM chunk, and then you can achieve your zero copy. However, there are still additional cycles with this solution: there's additional complexity involved in attaching and detaching that external piece of memory from your mbuf. But then again, it does give a really good improvement over copy mode, I think 29% for a certain use case, so it made it into DPDK 19.08 as a first-generation AF_XDP zero-copy solution.

At this point we felt that we, well, the community, had taken DPDK as far as it could in terms of performance with AF_XDP. But we still felt there was some performance left on the table, some cycles to save. So at that point we decided it would be a good idea to start looking at the kernel side of things, and maybe look at adapting the UMEM to make it a bit more flexible to work with the flexibility of DPDK, as opposed to trying to make DPDK fit the narrow restrictions of the UMEM.

So what did we do in the kernel when we finally took off our DPDK hats? We took a look at the original UMEM and its constraints. Being chunk size aligned, or page size aligned, sorry, was one major restriction that we had to lift, so we enabled arbitrary chunk alignment: you can now align your chunks anywhere you want within the UMEM. As part of this we allowed arbitrary chunk sizing as well, so now you can size and align however you want within the UMEM, much more flexible than the original. With this we also had to allow the crossing of page boundaries, so we now need to keep track of whether pages are physically contiguous in memory or not. If they aren't contiguous, like chunk 3 in this case, let's assume page 3 is not contiguous with page 2, then it would cross into a non-contiguous memory region, so we can't use that address. We discard it, get a new one, and we use the start of the next page. So we do have a gap in memory; that's just one of the side effects of this added flexibility, but a lot of the time you're going to be a lot better off with it.
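Continuing the earlier registration sketch, this relaxed mode is selected at UMEM registration time. The flag name XDP_UMEM_UNALIGNED_CHUNK_FLAG is the real one from kernel 5.4's <linux/if_xdp.h>; the function name, variable names and example chunk size are illustrative.

```c
#include <linux/if_xdp.h>
#include <sys/socket.h>
#include <stddef.h>
#include <stdint.h>

/* Register a UMEM in the relaxed (unaligned-chunk) mode added in 5.4.
 * 'base' and 'len' would describe the memory backing the DPDK mbuf pool. */
static int register_unaligned_umem(int xsk_fd, void *base, size_t len,
				   uint32_t chunk_size)
{
	struct xdp_umem_reg reg = {
		.addr = (uintptr_t)base,
		.len = len,
		.chunk_size = chunk_size,   /* e.g. 3584 bytes: no power-of-two rule */
		.headroom = 0,
		.flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG, /* chunks may sit anywhere and
							   may cross page boundaries */
	};

	return setsockopt(xsk_fd, SOL_XDP, XDP_UMEM_REG, &reg, sizeof(reg));
}
```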
With this we also needed to change the AF_XDP RX and TX descriptor. One of the fields within this descriptor is the address field; this is simply an offset into the UMEM of where your chunk is placed. As the packet travels through the data path, various offsets are added onto this. In the original design, the offsets were added directly to the address field, so the value would change as it made its way through the data path, and at the end, when we recycled the buffer, we could simply mask back to 2K, 4K, whatever your alignment was, because it was a power of two. That isn't possible anymore without incurring complex calculations, seeing as we have arbitrary sizing and alignment. So we moved to a model where we take the upper 16 bits of the address field and store the offset there, rather than adding it to the address, and we keep the lower 48 bits purely for the original base address, or the original offset as it was. That still gives us 256 terabytes of address space, so we have more than enough for now. What this enables us to do, when we're doing the buffer recycling, is simply mask off the upper 16 bits, and we have our original address and we're back to where we were (there's a short sketch of this split at the end of this part). All of this makes the UMEM a lot more flexible, so we can map directly into it, and it gives us a much more seamless integration with existing frameworks such as DPDK.

So as Kevin said, now that we've relaxed our UMEM alignment constraints, we can map our DPDK mbuf pools, no matter what size they are, directly into the UMEM. Using our example from earlier with the three-and-a-half-K mbuf, we can size our UMEM chunk to match that, or whatever the mbuf size is, and we get our seamless zero copy. And we don't need to modify our existing DPDK applications; they're going to work out of the box. Those were the two key goals that we outlined at the start of this work, so we've achieved those, and in achieving them we've got both a performant and a portable solution.

In terms of performance, this solution gives us a 60% improvement on copy mode, the first one that I showed earlier, and a further 24% on the first-generation zero copy which was in 19.08, which used the external mbuf feature. So it's a pretty significant performance improvement. The feature itself is available in DPDK 19.11, which is the most recent DPDK release. Provided you have kernel 5.4, this feature will be available; if you don't, DPDK will simply fall back to copy mode.
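Here is the small sketch of that descriptor address layout promised above: the lower 48 bits carry the chunk's base address within the UMEM and the upper 16 bits carry the packet's offset into the chunk. The constant and helper names below are mine, for illustration only; if memory serves, libbpf's xsk.h ships similar helpers for the unaligned mode.

```c
#include <stdint.h>

/* Illustrative constants: 48 bits of base address, 16 bits of offset. */
#define UMEM_ADDR_OFFSET_SHIFT 48
#define UMEM_ADDR_MASK ((1ULL << UMEM_ADDR_OFFSET_SHIFT) - 1)

/* Base address of the chunk: what the buffer-recycling path needs. */
static inline uint64_t umem_addr_base(uint64_t addr)
{
	return addr & UMEM_ADDR_MASK;
}

/* Offset of the packet data within the chunk. */
static inline uint64_t umem_addr_offset(uint64_t addr)
{
	return addr >> UMEM_ADDR_OFFSET_SHIFT;
}

/* Combine the two when handing a descriptor to the other side. */
static inline uint64_t umem_addr_with_offset(uint64_t base, uint64_t offset)
{
	return base | (offset << UMEM_ADDR_OFFSET_SHIFT);
}
```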
I think we're out of time. So, yeah. Okay, just a quick note before we end: a lot of people ask what the value is of integrating AF_XDP into DPDK. DPDK, as many of you know, provides an application developer with a wide variety of functionality: things like memory and power management, crypto, virtual networking, QoS; the list goes on. AF_XDP then provides unrivalled flexibility, and Magnus touched on this in his presentation first thing this morning. In contrast with the typical DPDK usage model, our NIC remains bound to the kernel driver, so we can avail of the kernel control paths and keep using our familiar tools like ifconfig, ethtool, etc. That has a huge impact on the usability of an application, and of a solution as a whole. So together, essentially, the best of both worlds can be enjoyed, and we can get applications that are high-performing, portable, fully featured, accelerated, insert buzzword here. So I think they're just a good combination together.

And then just to close, a couple of words of thanks to some people who helped myself and Kevin along the way with this work: Magnus and Björn on the kernel side, Bruce, Qi and Xiaolong on the DPDK side, and the DPDK and kernel communities as a whole. That's it from myself and Kevin.

[Audience question] Yeah, so it really depends on the workload. Off the top of my head I don't have a number; it really does depend on the workload. If it's a heavy workload, they could be pretty close; if it's something like testpmd or l2fwd, there's going to be a bigger delta. I'm trying to think, is there... I think we might have some data published soon. We're running some benchmarks at the moment, so that should be public soon enough. Yeah.