Hello, everyone. Welcome to DevConf; this talk is about accelerating TCP in user space virtual switching. I have shared the link and it's pinned, so in case Hopin breaks down you can go and watch the YouTube video, but be sure to keep this tab open so that you can participate in the live Q&A. I'll be starting the video now. My name is Flavio Leitner and I work for Red Hat.

With that said, let's go to slide number three, the agenda slide. Let me go over the agenda to give you an idea of how this talk will develop. First I will talk about why we use the user space data path and present two typical use cases that were the motivation for this work. Then we'll discuss packet sizes and the shift to large TCP payloads. After that, I'm going to present the proposed solution, which is TCP segmentation offload, followed by some key implementation details. I hope these implementation details help you understand the code if you want to check it out and navigate the sources. Then, of course, I will provide some performance numbers comparing the results with and without TSO, and wrap up talking about some limitations and future work.

With that said, let's go to the next slide, the user space data path. What is the user space data path? It allows processing networking packets without using the kernel stack. Basically, thanks to DPDK, we can receive packets from the network card directly in user space, then OVS can process the packets completely in user space and send them out to a virtual machine or to another physical device. Everything happens completely in user space, bypassing the kernel stack.

But why do we want to do that? Because processing small to medium packet sizes, and I'm talking about 64-byte, 128-byte or 256-byte packets, is faster in user space. And how can we do this? Using DPDK with Open vSwitch, also known as OVS-DPDK. DPDK provides the drivers to access physical devices completely in user space, and then Open vSwitch, as I explained, can do all the processing completely in user space and send the packet out to a virtual machine or to another device. And where do we do this? The most common use case is inside hypervisors, to provide connectivity to virtual machines.

Okay, that said, let's go to slide number five. This is the first typical use case I want to present to you. It's called physical to virtual to physical, or PVP; there is documentation and there are blog posts out there referring to it as PVP. It's basically a hypervisor running OVS-DPDK, connected to a virtual machine and also to a network card. Packets come in through the network card, go through OVS-DPDK, loop back in the virtual machine, and return to the network. This network could be other hypervisors forming a cloud, for instance, since you can have many hypervisors, or it could be a traffic generator pushing packets to measure throughput, latency or jitter. So this is an interesting scenario to measure the overall solution capacity.

Okay, moving to the next slide, the second typical use case is called virtual to virtual, or V2V. It's basically one hypervisor running two or more virtual machines and OVS-DPDK, and this OVS-DPDK, again completely in user space, provides network connectivity to those virtual machines.
So one virtual machine sends packets, they go through OVS-DPDK, and then on to the other virtual machine.

The next slide: the shift to large payloads. With user space data path processing in OVS-DPDK, in the beginning we focused a lot on these small to medium packets, for a few reasons. One of them is that the 64-byte packet is used for benchmark purposes: if you want to compare switches or hardware, it's usually done using the performance with 64-byte packets, because this is the worst situation for the network. You have very little data per packet, so you have a lot of headers to process per second and not much data. Another reason is that one of the big pushers for OVS-DPDK, and for fast packet processing in general, is the telcos, and the telcos rely on these small to medium packet sizes. But as we saw over the years, environments are not that specialized. You might have a virtual machine processing signaling, for instance, which uses small to medium packet sizes, but there might be other VMs requiring regular TCP connections. So other sizes matter now, and we want to expand the user space data path to other use cases as well. I'll give you an example here, and the number is terrible: if you send TCP traffic from the virtual machine to a bridge port, you get around 860 megabits per second. For today's standards, this is pretty bad.

So what's the solution? The proposed solution is already well known: it's called TSO, TCP segmentation offload. TSO basically means that instead of the networking stack spending CPU time and CPU resources splitting a big chunk of data, say an application wrote 32k of data to a buffer, into smaller pieces that fit into the network, it just sends one single packet with 32k of data and lets the hardware split it if it is going out through, for instance, the physical network card. To give an idea, 32k of data over a 1500-byte MTU means more than 20 segments, so more than 20 headers to build and checksum instead of just one. That saves a lot of CPU resources in the host and speeds things up a lot, because we don't have the CPU overhead of processing each and every header; we just have one header followed by a big chunk of data. That's where TSO improves the performance. It's a technique available in most commercial network cards nowadays; even the one-gigabit cards, most of them support TSO. This feature is already available in the kernel data path with great results, so we expect the same great results in the user space data path.

Okay, going to the next slide, let's talk about the TSO support in the network card, the physical card. Basically, OVS-DPDK relies on DPDK to provide the drivers, and that means we use the poll mode drivers, the PMDs. Most of these PMDs already support TSO. So not only does the network card, the hardware itself, support TSO, but also the PMDs. Still, we need to change OVS to enable this feature and also to prepare the packets. So again, here I give you two key points. One is in the DPDK Ethernet device port configuration, where we set three flags if TSO is enabled for this card: basically, we set the TCP TSO flag, the TCP checksum flag, and the IPv4 checksum flag. The checksums are required because when the hardware segments the packet it creates brand new packets, and then the checksums need to be updated. So if you want to have TSO enabled, you also need to have the checksums enabled.
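Just to make those three flags concrete, here is a minimal sketch of how a port could be configured through DPDK's ethdev API, using the DPDK 19.11 era flag names. This is not the actual OVS code; the function name configure_port_with_tso and the queue counts are made up for illustration.

    #include <errno.h>
    #include <rte_ethdev.h>

    /* Hypothetical helper: enable TSO plus the checksum offloads it needs. */
    static int
    configure_port_with_tso(uint16_t port_id, uint16_t n_rxq, uint16_t n_txq)
    {
        struct rte_eth_dev_info dev_info;
        struct rte_eth_conf conf = { 0 };

        rte_eth_dev_info_get(port_id, &dev_info);

        /* Segmented packets get brand new headers, so the NIC must also
         * recompute the IPv4 and TCP checksums. */
        uint64_t wanted = DEV_TX_OFFLOAD_TCP_TSO
                          | DEV_TX_OFFLOAD_TCP_CKSUM
                          | DEV_TX_OFFLOAD_IPV4_CKSUM;

        if ((dev_info.tx_offload_capa & wanted) != wanted) {
            return -ENOTSUP;   /* the NIC or its PMD cannot do TSO */
        }

        conf.txmode.offloads = wanted;
        return rte_eth_dev_configure(port_id, n_rxq, n_txq, &conf);
    }

The real code in OVS does more than this, of course, but the shape is the same: check the capabilities, then request the offloads in the port configuration.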
So we set these flags in the TX mode offload flags, pass that to the device, and then the PMD will automatically understand TSO packets from now on. But in order to pass these packets along, we also need to prepare them. There is a new helper in netdev-dpdk that prepares hardware offload packets, setting some header offsets in the packet in order to allow the hardware to do its job. Those were the key changes in OVS to allow TSO support in the network card.

Let's talk now about the support for vhost-user. The virtual machine connects using vhost-user; this is the interface that connects the virtual machine with OVS-DPDK. But differently from the physical devices, we don't use a PMD here: OVS uses the vhost library directly. That's because, in the past, the vhost-user PMD wasn't available, so we started using the vhost library directly, and maybe this will change in the future. But at this point, the support is based on using the vhost library directly. Still, it requires changes in this vhost library, which is part of the DPDK project, to work with external buffers.

So what are the external buffers? To support large packets we have pretty much two options: one of them is working with smaller buffers and chaining them one after another, the other is working with one bigger buffer. When we work with a bigger buffer, the advantage is that we don't need to worry about where one buffer ends and the next one starts. Let's say we want to prepend headers, shift the data a little, or parse headers: we don't need to worry about whether the buffer is ending or not. So the solution adopted here is to use external buffers, and this buffer will be allocated at the size required to hold the packet.

It also required changes in OVS to negotiate the feature, because vhost-user has two sides, and one of them might be an older version, for instance, or a newer version. So it needs to negotiate the features to establish a common set of features and then enable the feature. The interesting thing, though, is that once the feature is enabled and QEMU is configured to allow it, the feature is exposed inside the virtual machine and enabled by default if you are running Linux. So basically, if you run a hypervisor, you configure the virtual machine to allow TSO, and you enable it on the OVS side, then the Linux running inside the virtual machine doesn't need to be changed, because TSO gets enabled automatically. That's very good, because you don't need to worry about what's going on inside the virtual machine.

So here I give you another two key points. One of them, on the left side, is in the vhost-user client reconfiguration in netdev-dpdk. That's where we set one of the required flags, which is the linear buffer support; that pretty much tells the vhost library that OVS does not support chained buffers, just one linear buffer. And then, if user space TSO is enabled, we pass the flag called external buffer support. That tells the vhost library that if it attaches another, bigger buffer to the packet, OVS will know how to deal with it. And in the continuation of this function, if user space TSO is enabled, it's interesting here, because the vhost driver requires us to disable the features that we don't support: so instead of enabling the features we want to support, we need to disable the features that we don't support.
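To give a rough idea of what that looks like with the DPDK vhost library API, here is a minimal, hypothetical sketch; the names register_vhost_port, sock_path and userspace_tso are just for illustration, and the real logic lives in OVS's netdev-dpdk code and is more involved.

    #include <stdbool.h>
    #include <stdint.h>
    #include <linux/virtio_net.h>
    #include <rte_vhost.h>

    /* Hypothetical sketch: register a vhost-user client port and negotiate
     * the offload features by disabling what is not supported. */
    static int
    register_vhost_port(const char *sock_path, bool userspace_tso)
    {
        /* OVS handles one linear buffer per packet, not chained buffers. */
        uint64_t flags = RTE_VHOST_USER_CLIENT
                         | RTE_VHOST_USER_LINEARBUF_SUPPORT;

        if (userspace_tso) {
            /* Let the vhost library attach a bigger, external buffer when a
             * TSO-sized packet does not fit in a regular mbuf. */
            flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
        }

        if (rte_vhost_driver_register(sock_path, flags)) {
            return -1;
        }

        uint64_t unsupported;
        if (userspace_tso) {
            unsupported = (1ULL << VIRTIO_NET_F_HOST_ECN)
                          | (1ULL << VIRTIO_NET_F_HOST_UFO);
        } else {
            /* Disabling checksum offload also disables TSO, UFO and ECN,
             * which all depend on it. */
            unsupported = 1ULL << VIRTIO_NET_F_CSUM;
        }
        return rte_vhost_driver_disable_features(sock_path, unsupported);
    }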
So in this case, if user space TSO is enabled, we just disable ECN and also UFO. If TSO is not enabled, then we disable checksumming. If you recall, in one of the first slides I mentioned that checksumming is required for TSO and also for UFO, the UDP fragmentation offload, so disabling checksumming pretty much disables all the other offloading features. Okay, so it's not that complicated: you just need to understand that there is a negotiation going on, you need to enable it in QEMU, and you need to set the flags, both the flag that says we want external buffers and the ones that enable or disable TSO.

Okay, let's talk about the OVS-DPDK support now. There are some changes. One important change is that packets above the MTU are now allowed, but only if the egress device has TSO enabled. Before this change, if a packet was bigger than the MTU, which is a configuration common to all the ports attached to a bridge, the packet would just be dropped. But now, with TSO, if the egress device has the TSO flag enabled, this packet is allowed, because then the network card, or the egress device, will do the segmentation work. If you want to check that, there is an entry point, netdev_send, which calls netdev_send_prepare_batch; since OVS-DPDK works with batches of packets, it iterates over the batch packet by packet, checking whether the packet fits the MTU or whether TSO is enabled, and otherwise it drops the packet.

For non-DPDK devices, it enables the virtio_net_hdr data structure. That allows us to exchange checksum information and the TSO type with the kernel; so if we want to send packets to a veth pair or a TAP device, we can use that interface to pass the information to the kernel. But packets coming from a non-DPDK device need special handling to be copied: since we now support external buffers, these packets coming in using non-DPDK memory need some special handling. If you want the details, you can look at dpdk_do_tx_copy.

Okay, and for now there is no software fallback. What does that mean? If a device does not support TSO, we could have an option to do the segmentation in software. Let's say the VM sends a big packet, a TSO packet; this packet goes through Open vSwitch and is going to be sent out on another device, a physical device whose network card does not support TSO. Instead of just dropping the packet, we could do the segmentation in software and send out packets that the egress device supports. But that's not what happens today: there is no software fallback.

On the other hand, the good thing about using external buffers and not chained buffers is that there were no changes to the core packet handling functions. Functions that shift bytes to include headers, remove headers, parse headers and things like that were not changed, so there is no extra overhead when processing packets because of this. Basically, you have either a buffer of the standard size, which is pretty much the MTU plus some overhead, or you have a bigger buffer using the external buffer feature.
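Just to illustrate what "a bigger buffer using the external buffer feature" means at the mbuf level, here is a rough, hypothetical sketch using the DPDK external buffer helpers. The vhost library does something along these lines internally when external buffer support is negotiated; the names attach_big_buffer and ext_buf_free_cb are invented for this example.

    #include <stdint.h>
    #include <rte_malloc.h>
    #include <rte_mbuf.h>

    /* Called when the last reference to the external buffer is released. */
    static void
    ext_buf_free_cb(void *addr, void *opaque)
    {
        (void) opaque;
        rte_free(addr);
    }

    /* Hypothetical helper: back an mbuf with one big external buffer instead
     * of chaining several standard-size mbufs. */
    static int
    attach_big_buffer(struct rte_mbuf *m, uint32_t data_len)
    {
        uint32_t total = RTE_PKTMBUF_HEADROOM + data_len
                         + sizeof(struct rte_mbuf_ext_shared_info);
        if (total > UINT16_MAX) {
            return -1;    /* larger than a single mbuf segment can describe */
        }

        uint16_t buf_len = total;
        void *buf = rte_malloc("tso-extbuf", total, RTE_CACHE_LINE_SIZE);
        if (!buf) {
            return -1;
        }

        /* Place the shared info (refcount and free callback) at the tail of
         * the allocation; buf_len is shrunk accordingly. */
        struct rte_mbuf_ext_shared_info *shinfo =
            rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
                                               ext_buf_free_cb, NULL);
        if (!shinfo) {
            rte_free(buf);
            return -1;
        }

        rte_pktmbuf_attach_extbuf(m, buf, rte_malloc_virt2iova(buf),
                                  buf_len, shinfo);
        rte_pktmbuf_reset_headroom(m);
        return 0;
    }

Because the payload sits in one contiguous buffer, the usual header push, pop and parse helpers keep working unchanged, which is exactly the point made above.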
All right, let's go to the next slide, the TSO support performance overview. This table shows our performance sending TCP packets from the virtual machine. For instance, the second line is the VM sending to a local bridge; "default" is without TSO, the default configuration, and you can see I got 3 gigabits per second, while with TSO enabled it jumps to 23 gigabits per second. If you do the ratio, that's pretty much seven times faster. In the next slide we can see the VM sending to a network namespace using veth pairs: before, I got 3 gigabits per second, and it jumped to 22 gigabits per second, which is also about seven times faster. The interesting case is VM to VM on the same host, the V2V scenario: it jumped from 2.5 to 24 gigabits per second, so that's about nine times faster. And also PVP, the physical to virtual to physical case, communicating with an external host; in this case it's the VM sending to the external host, and the default performance is 10 gigabits, while with TSO enabled it goes to 35 gigabits, so it's about three times faster. With an external host using encapsulation we could not measure anything, because it's not supported; I will talk more about this in the limitations slide.

All right, here we have some performance numbers now using VLAN, so it's a VM sending to the local bridge over a VLAN. You can see that the numbers are as good as in the previous slide: the local bridge case is five to seven times faster, sending to another virtual machine on the same host is eight times faster, and sending to an external host is five times faster. Moving to the next performance slide, now using IPv6 instead of IPv4: sending to the network namespace we jumped eight times, sending to another virtual machine on the same host is eight times faster, and sending to an external host is four times faster. Just a highlight here: the external host tests go through the physical network card, and that was a 40 gigabit card, which means TSO pretty much reached line-rate speed. So the bottleneck here is the physical device and not really the software switch; most probably, if I had a faster card, I would get better performance. I could not prove that, because my card was limited to 40 gigabits per second.

All right, let's talk about some limitations and future work. First and most important, please check out the documentation: there are details there on how to configure things and how to set up the environment, we try to keep it up to date, and it also contains the limitations. There is a link in this slide; you can either check the OVS sources or go through that link to read the latest version. One interesting thing is that the feature needs to be enabled at initialization. It's a global configuration, so it affects the whole OVS in user space, and if you want to change it, that requires a restart. I put the reference command on the slide for you to enable the feature; it sets the userspace-tso-enable option to true in the OVS database, and you can change true to false in order to disable it. This command is also in the documentation, but it's here just in case you want to play with it. It supports flat and VLAN networks, as I showed you in the performance slides; unfortunately, there is no support for encapsulation yet. There are patches posted to the mailing list as we speak to add support for that, and it's under review, so if you are curious, or if you want to test those and provide feedback or anything, you are welcome to do so.
And one thing that I already mentioned, but want to emphasize here, is that there is no software fallback at the moment. So if a device does not support TSO and the switch is going to send a TSO packet to it, then this packet will be dropped. There will be a warning in the logs, of course, but still, it would be interesting to have this software fallback. There is also a proposal on the mailing list, so feel free to check it out and, again, test and provide feedback.

As future improvements, as I said: support for encapsulation, meaning VXLAN, Geneve and GRE. Those are also available in most commercial cards out there, so it should not be a problem to enable this in OVS. Also, to have software fallback support, GSO, the generic segmentation offload, available in OVS. There is one interesting thing here: the OVS user space data path is used with DPDK, so we could try to leverage the DPDK GSO infrastructure, but OVS is also used in BSD environments, and they don't need to use DPDK, so it would be interesting if the GSO solution were generic enough to work on BSD with or without DPDK, because then it would allow the user space data path to work independently of DPDK.

Also, supporting TSO per device instead of a global knob: basically, when we add a port, for instance a physical device, it would come up with TSO enabled, because it's interesting to have it enabled all the time; but if the card doesn't support it, we would do the fallback in software, and it would be transparent to the other ports attached to the switch. So we would not need this global knob, and we could have the option, if for some reason you don't want TSO enabled on a specific port, to just go and disable it there. This is something we can improve in the future.

Well, there are other related TCP improvements, like, for instance, the partial and the full hardware offloading. The partial hardware offloading works like this: instead of leaving it to the switch software to process the headers and then find the specific flow, and processing headers is an expensive operation, the hardware provides us a hash based on the network headers, and the only thing the software needs to do is take that hash and match it to a flow. So it cuts out part of the expensive operation of processing packets, and that gives you a significant performance improvement; I'm talking about 30%, and it could be even more. The partial hardware offloading is also being expanded to execute some of the actions in the hardware: let's say you want to do just some modifications to the packet; if the hardware supports that, then it could do these partial modifications in hardware. This is something that's being proposed on the mailing list.

And there is also the full hardware offloading, where the switch software configures the hardware to match on specific headers and then do some actions. For instance, packets coming from a specific IP address and going to another specific IP address need to be sent out to a particular VM: the software can just configure that in the hardware, and the hardware will receive the packet, do everything needed, and send it out to the virtual machine without any intervention from the hypervisor. So it saves a lot; it's pretty much a direct hardware connection from the physical network card to the virtual machine.
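To make the full offload idea a little more tangible, here is a minimal, hypothetical rte_flow sketch of the kind of rule a switch could program into the NIC: match a specific source and destination IPv4 pair and steer the matching packets to a given queue. The function name and the queue action are illustrative only; the real OVS hardware offload path is considerably more involved.

    #include <rte_byteorder.h>
    #include <rte_flow.h>

    /* Hypothetical sketch: offload "src IP -> dst IP goes to queue N". */
    static struct rte_flow *
    offload_ip_pair(uint16_t port_id, rte_be32_t src, rte_be32_t dst,
                    uint16_t queue_index, struct rte_flow_error *err)
    {
        struct rte_flow_attr attr = { .ingress = 1 };

        struct rte_flow_item_ipv4 ip_spec = {
            .hdr = { .src_addr = src, .dst_addr = dst },
        };
        struct rte_flow_item_ipv4 ip_mask = {
            .hdr = { .src_addr = RTE_BE32(0xffffffff),
                     .dst_addr = RTE_BE32(0xffffffff) },
        };

        struct rte_flow_item pattern[] = {
            { .type = RTE_FLOW_ITEM_TYPE_ETH },
            { .type = RTE_FLOW_ITEM_TYPE_IPV4,
              .spec = &ip_spec, .mask = &ip_mask },
            { .type = RTE_FLOW_ITEM_TYPE_END },
        };

        /* The action here is simply steering to a queue; in practice it could
         * be a mark for partial offload or a forward to the VM's function. */
        struct rte_flow_action_queue queue = { .index = queue_index };
        struct rte_flow_action actions[] = {
            { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
            { .type = RTE_FLOW_ACTION_TYPE_END },
        };

        if (rte_flow_validate(port_id, &attr, pattern, actions, err)) {
            return NULL;    /* this NIC cannot offload the rule */
        }
        return rte_flow_create(port_id, &attr, pattern, actions, err);
    }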
And also, there were some interesting improvements with batching, specifically in the interface to the non-DPDK devices, like virtual Ethernet (veth) pairs and TAP devices: things like using syscalls that allow multiple packets to be sent out per syscall, instead of doing one syscall per packet, which is slow. We batch the packets and send up to 32 packets in one syscall, and that greatly improved the performance over time.

Okay, that's what I had to present for this talk. Thanks a lot, thank you for attending.

Thank you so much for the talk, Flavio. If there are any questions for Flavio, please put them in the chat and we will field them to him. Okay, first question: how can we debug network segmentation issues when TSO is enabled, or even see which devices are using TSO? Are there any insights into the segmentation engine from the host side?

So, the segmentation is happening in the network card. When the packet comes in from the virtual machine, you will see the whole packet; if you use ovs-tcpdump you will be able to see the whole packet. The segmentation itself only happens in the hardware. If you are sending this packet to the kernel, for instance, then there is no segmentation at all; even if you are sending to another virtual machine there is no segmentation, and the same packet that came in will be sent out to the other device. Which devices are using TSO: basically, today it's all or nothing, so if you enable TSO we assume that every device supports it, but in the case of hardware that does not support TSO, there will be a warning in the vswitchd log. And I guess the last question I already answered, but maybe I missed something. Any more questions?

All right, so TSO is still experimental. As you can see, there are many things to improve; it was added in OVS 2.13. The partial hardware offload is a little bit older, it got added first in OVS 2.10, and it's still experimental too. But those are very interesting technologies, and if you want to report feedback, test, or help us, we appreciate that for sure.

Question: does it work transparently for packet capture? Yes, you should be able to see the packet as it is flowing in OVS.

Let's give it a few seconds in case people are typing other questions. Okay, it doesn't look like there are any more right now. If you would like to continue the conversation with Flavio, feel free to directly message him on Hopin, or we have breakout rooms under the Expo tab for the system engineering track, so you can hang out there, share audio and video, and have discussions. Thank you so much once again, Flavio. Thank you.