I'm Paolo Abeni. I work for Red Hat in the networking services team. I used to add bugs to the UDP protocol implementation inside the kernel, then I moved on to adding bugs to the MPTCP protocol implementation, and currently I mostly get blamed by Linus, as one of the upstream maintainers of the networking tree. And I'm Marcelo. I also work for Red Hat in the networking services team, and one of my main responsibilities in recent times has been integrating OVS hardware offloading into our products.

Today we are going to talk to you about networking performance. We will give a very brief introduction about networking performance in general, and then we will focus on a couple of case studies that we use to demonstrate some interesting, strange results, how to investigate them with common tools, and perhaps how to improve things in some cases. And we will draw some conclusions, hopefully.

What's the big fuss about networking performance? In the end it all boils down to measuring the maximum number of packets per second or messages per second, or the maximum throughput, that a given host is able to process, usually on a single core. That is mostly by convention, because we hope that things will scale over multiple cores (even if that is not always true), and also because the setup is much simpler when we use a single core. That is in turn somewhat unfair to architectures that have less powerful CPUs but possibly many more cores available.

Why do we do performance testing? For many good reasons: to detect bottlenecks, to avoid regressions, to tune setups, et cetera. When we talk about performance we are not interested in functional behavior: we assume that everything works as expected, and we are interested in raw numbers instead. For that we avoid well-known and very useful tools like tcpdump, Wireshark, packetdrill, et cetera, and we instead use packet generators, either in user space, like iperf or netperf, or the in-kernel packet generator, pktgen. Those tools usually provide the statistics we are looking for, but more often than not we also need the aggregate counters that the kernel can provide, which we can inspect with other tools such as nstat, which gives us per-protocol aggregate counters, or ss, which gives us per-socket information. And more often than not we are interested in seeing where our CPU cycles are actually spent, so today we will use the perf tool a lot.

So let's move to the case studies. The first one is a very simple one: receive throughput for a UDP application. Why this one? Because it's a very common benchmark — the first thing almost everyone, even telcos, asks is how fast we can go: how many packets are we able to receive?

The setup is very, very simple. We have a node that runs a packet generator, in our case pktgen. We use pktgen because we don't want the transmitter to be the bottleneck; we want to measure the performance of the receiver. The receiver runs on another host. It's a very simple UDP application — there is a URL on the slide where you can find it — that just reads packets and drops them. We use that one because it has a lot of command line options that can be used to configure its behavior. The two hosts are connected by a fast link, 10 gigabit in our case — the faster the better — and we are using Mellanox NICs on both the sender and the receiver.
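Just to make that behavior concrete, here is a minimal, hypothetical sketch in C of what such a receiver does — this is not the actual tool, and the port number is made up — reading one datagram per recvmsg() call and throwing it away:

#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/uio.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(9999),		/* arbitrary test port */
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	char buf[2048];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
	unsigned long pkts = 0;

	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("socket/bind");
		return 1;
	}

	for (;;) {
		/* one syscall per packet: read the datagram and drop it */
		if (recvmsg(fd, &msg, 0) < 0) {
			perror("recvmsg");
			break;
		}
		pkts++;			/* only the counter is kept */
	}
	printf("received %lu packets\n", pkts);
	return 0;
}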
Whatever hardware you use for this kind of experiment you will get slightly different figures, but the overall trend should be the same on most recent server-class hardware. So we want to see how fast we can go in this scenario — how many packets the receiver is able to process — and we want reasonably stable results. For that we pin the user space application to a given core, and we pin the kernel space processing to another core. We also disable irqbalance, which could otherwise move the kernel processing to random cores unexpectedly and make the figures we measure random; we want to avoid that.

That given, we do the first test with a mostly default configuration. The only perhaps unusual thing is that firewalld is disabled, so we don't have netfilter rules getting in the way. And this is what we see: 1.66 million packets per second, which is not bad — it's quite a bit higher than we would have got a few years ago with the same hardware. But we want to understand whether that is the maximum we can get.

So we attach the perf tool to the receiver. Perf can measure how many cycles are spent in each function executed by a given core and report that; we are using the command line report here. The main pieces of information are the name of each function where the CPU is spending cycles and the percentage of CPU cycles spent in it. You can see that most of the time is spent copying data from kernel space to user space — not that much, actually: roughly 20% of the whole CPU time. Below that, the next four topmost offenders are functions related to syscall overhead, let's say: there is the libc recvmsg wrapper, which is the syscall used by the application to actually fetch packets from the kernel, while the other ones are related to countermeasures for recent hardware vulnerabilities and to the SELinux hooks inside the kernel.

So the bottom line is that we are spending a lot of time on syscall overhead, and the application is receiving one packet per syscall. We could think that, if we used a different syscall that lets us process many packets with a single call, we would save a lot of time and possibly go faster. Such a syscall actually exists: it's called recvmmsg — the first "m" stands for "multiple" — and we can change the behavior of our tool with a command line argument to tell it to use it. So let's do that and measure what we see.

Surprise, surprise: we are slower than before. That is quite unexpected; we hoped it would go faster. Why? Simply running the top command line tool gives us some hints. As we can see, in the first experiment the user space process took roughly 88% of one CPU, and now it's taking much less. So the user space side really is going faster; but still, we are processing fewer packets. Why? Because the bottleneck, in this case, is not the user space process. The bottleneck here is the kernel space processing — the other process you can see in the top report, ksoftirqd, which is keeping a CPU fully busy.

So why do we see fewer packets? Because to actually wake up a process, the CPU needs to spend some cycles. The faster the user space process is, the more frequently it goes to sleep, the more frequently the kernel side has to wake it up, and the more CPU cycles are spent on wakeups — and the fewer CPU cycles that CPU has available to actually process packets.
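As an aside, here is what the batched receive looks like at the API level — a hypothetical sketch (the real tool enables this via a command line flag), reusing the bound socket from the previous sketch:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 64

static unsigned long drain(int fd)
{
	static char bufs[BATCH][2048];
	struct iovec iov[BATCH];
	struct mmsghdr msgs[BATCH];
	unsigned long total = 0;

	memset(msgs, 0, sizeof(msgs));
	for (int i = 0; i < BATCH; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len  = sizeof(bufs[i]);
		msgs[i].msg_hdr.msg_iov    = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	for (;;) {
		/* one syscall, up to BATCH packets returned at once */
		int n = recvmmsg(fd, msgs, BATCH, 0, NULL);
		if (n < 0) {
			perror("recvmmsg");
			break;
		}
		total += n;		/* payloads are simply dropped */
	}
	return total;
}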
So the bottom line is that, in our first perf investigation, we looked at the wrong CPU. We should have looked at the other one, the one running the kernel side of the stack. So let's look at that now.

This is what we get. The topmost offender — the function burning the most CPU cycles — is inet_gro_receive, and there is another function, still quite high in the ranking, with that "GRO" word in its name. Both of them belong to the GRO engine. The GRO engine is a very low-level component of the networking stack, in charge of aggregating as many packets as possible, as seen on the wire, into a single giant packet that will later traverse the whole networking stack. This technique lets the networking stack save a lot of CPU cycles when it is able to aggregate packets, because as many as 40 wire packets can be processed by the stack at once. The bad thing for this experiment is that the UDP protocol does not leverage GRO by default, so those cycles are completely wasted.

What can we do? We could disable GRO. That is something we can do in this experiment, because we are interested only in UDP. In general it's a bad thing to do, because there is also the TCP protocol, which everybody uses somewhere, and if you disable GRO, TCP performance will sink dramatically. Anyhow, we can disable it via ethtool, with the command line reported on the slide (something like "ethtool -K <device> gro off"), and repeat the experiment.

And finally we get some progress. If you compare this number with the one we had a few slides ago, you can see a measurable improvement. We have not changed our code at all — just a somewhat better configuration, at least for this test.

Still, we have a surprise. This is what the perf tool reports now as the topmost offenders for the kernel space processing, and as you can see it is completely different from what we had before. Yes, the GRO entries are gone, because we disabled it, but the other functions show quite different numbers with respect to before — very different CPU usage. Why? Because when we process a packet coming from the network, no matter what we do, we will take a cache miss on every packet: the packet contents are fresh, just placed into memory by the DMA engine, and from the CPU's point of view that memory content is completely new. That means a cache miss. Before, that cache miss happened inside the GRO engine, and that was one of the reasons why it was so costly. Now the GRO engine is not running anymore, but we still have the cache miss, and whatever function actually experiences it sees its cost explode, sort of.

We are still interested in improving the throughput somehow. Down the list, taking very few CPU cycles, we see the udp_v4_early_demux function. Even if it takes very few cycles, it is somewhat relevant, because the early demux code tries to look up a connected socket for each incoming packet, to avoid a later route lookup. But in our experiment the UDP socket is not connected, so that little amount of CPU cycles is completely wasted. We can avoid it with a simple sysctl that disables the early demux functionality (something like "sysctl -w net.ipv4.udp_early_demux=0"). We run the sysctl and repeat the experiment, hoping to see some improvement. And we do — and unexpectedly it's a huge improvement: 10%.

Whoa, we just removed a tiny overhead and got a relevant improvement. Why? Well, to be honest, the figures are not that stable overall. They are not stable because power management is enabled on the hosts we are using, it kicks in at unexpected moments, and when it does, the results get somewhat noisy. Anyhow, repeating the tests many times, the trend is clear: if you disable GRO you will see an improvement, and if you disable early demux you will see another improvement.

We are still interested in maximizing our throughput, so we can try something slightly different. Let's look again at our perf report and notice that there are a few functions related to the route lookup that are consuming quite a bit of cycles. And we mentioned before that the early demux functionality could avoid the route lookup. So we could try to re-enable it and change the configuration of our user space tool to actually connect the UDP socket upon reception of the first packet. That can only be done if the ingress UDP traffic belongs to a single flow — and in our experiment it is a single flow, given by the L4 4-tuple: source IP, destination IP, source port, destination port. So we re-enable early demux, change the UDP sink tool's configuration to connect the socket, and rerun the test.
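For illustration, here is a hypothetical sketch of that change — the real tool does it via a command line option:

#include <stddef.h>
#include <sys/socket.h>

/*
 * Learn the peer from the first datagram, then connect() the UDP socket
 * to that 4-tuple.  With early demux enabled, later packets are steered
 * straight to this connected socket, skipping the per-packet route
 * lookup.  Only valid when all ingress traffic belongs to that one flow.
 */
static int connect_to_first_peer(int fd, char *buf, size_t buflen)
{
	struct sockaddr_storage peer;
	socklen_t peerlen = sizeof(peer);

	/* first packet: remember who sent it */
	if (recvfrom(fd, buf, buflen, 0,
		     (struct sockaddr *)&peer, &peerlen) < 0)
		return -1;

	/* lock the socket onto that source address and port */
	return connect(fd, (struct sockaddr *)&peer, peerlen);
}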
And great, we see a relevant improvement: we have moved to more than 2 million packets per second from the initial 1.66 million, which is a more than 20% improvement — with no changes at all to the code of our application, just a slightly different setup.

If we go back and look at the top output, we see that now both CPUs are fully busy, so we might conclude that we have reached the end of the road and no more performance improvements are possible. That would be false, because we mentioned at the beginning that GRO can give a great boost to bulk transfers. GRO is not enabled by default for UDP, but it can be enabled on a per-socket basis if the application creating the socket requests it. Our UDP sink tool does not support that option, so no real figures for that. But if you fetch the source and modify it to enable UDP GRO — which is a simple setsockopt; see the sketch at the end of this case study — and then use recvmmsg, because at that point the bottleneck will be back on the user space side, then with this hardware you will see something around 3.5 million packets per second, which is much more than what we have now.

Is that the end? No, because if nobody from the security team is watching, you could try disabling SELinux, and possibly also retpolines and the other security mitigations. I'm not suggesting you do that — do it only in a controlled lab where everything is under your control and your responsibility. If you do, you can probably reach something around 4 million packets per second, which would be almost three times the initial figure, and that is probably the highest number you can get out of this kind of hardware.
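For completeness, here is a hypothetical sketch of that per-socket UDP GRO knob mentioned above — to be paired with the recvmmsg loop shown earlier and with suitably large receive buffers (e.g. 64 KB), since one returned datagram may now cover many wire packets:

#include <netinet/in.h>
#include <netinet/udp.h>	/* UDP_GRO, on recent kernels/headers */
#include <sys/socket.h>

/*
 * Ask the kernel to deliver GRO-aggregated UDP datagrams to this socket.
 * Requires a reasonably recent kernel and userspace headers.
 */
static int enable_udp_gro(int fd)
{
	int one = 1;

	return setsockopt(fd, IPPROTO_UDP, UDP_GRO, &one, sizeof(one));
}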
With that, over to Marcelo.

All right, thanks Paolo. Moving on to our next case study, I'll be covering a situation in which we partially use the hardware offloading that we can do with OVS and TC — so yeah, completely different from the use case before. I'll briefly explain how OVS hardware offload works, because it's not really common yet, but I'll assume you know how OVS itself works. Here we have a picture of how standard OVS works: packets come in through the NIC, they go through OVS, and then they are delivered into a network namespace through a veth pair. That's pretty standard.

With hardware offload, OVS leverages SR-IOV: it creates virtual functions with the NIC in switchdev mode, where flows are processed in a programmable way, unlike the legacy SR-IOV model in which the card does many things on its own. That's the benefit of using switchdev mode. Then the picture looks like this: you have the NIC, you have the network namespace, and, assuming the flow is already offloaded, packets go directly from the wire into the network namespace, and vice versa. This is the fully offloaded case.

Then there is the partially offloaded situation. That happens, for example, when you are doing conntrack or decapsulation, but for some reason you can't output to a virtual function inside the network namespace, so you have to use a veth pair. The card can't output directly into a veth, so it offloads all the processing up to that point and then has to resort to a software fallback for the rest. In the picture it's pretty much the same thing, except that the last step, going into the network namespace, is not offloaded — it is done in software. And on the way back to the wire everything is done in software, because once traffic comes out of the veth pair it's not possible to offload it anymore.

The network card used was a ConnectX-6 DX, and the sender and receiver were recent server-class machines, so the hardware is very fresh. The test was a simple TCP stream with iperf. These are the test results we got. We are testing just with the TC datapath — we are not using the OVS kernel datapath — and we leverage the skip_hw flag so that we can say: OK, run everything in software, or leverage the offload whenever possible. The idea behind trying this partial offloading is that using dedicated hardware is usually better than using a generic processor for the computational work, and at the same time it creates some parallelism almost out of the blue: while the host is doing the last remaining part for one packet the card has delivered, the card is already processing the next packet. So you get some parallelism for free, right? So we should get a performance bump — but no. Something happens, and we get worse performance than doing everything in software.

So what's happening here? Why did we go from 18 to 11 gigabits per second? That's quite a drop. If we check the entirely-software case — using skip_hw — the CPU usage on the sender side is quite OK: no CPU is maxed out, no bottleneck there. On the receiver side, though, yes, we are maxing out a CPU, so the receiver is the bottleneck. If we then move to the offload situation, on the sender side we are still not maxing out CPUs, and on the receiver side we see the same figures as before. What we can already conclude is that when we go to this partial offload situation, on the receiver side we apparently lose 50% of efficiency, because we are using the same amount of processing to do half of the work. Beyond that, we don't know yet what's going on.

So let's go top to bottom. First thing, check the TCP stats to see whether things are going right or wrong there. Do you see something off here? I don't: there are no retransmissions, there are no drops, it's really clean — but the numbers are lower. So the problem doesn't seem to be in TCP. Let's move down. This is the software case, our baseline — this one here.
We are not debugging to make something faster in absolute terms here; we are comparing A and B. This is one output from perf. Unlike Paolo's reports, the left column here is the accumulated CPU time that each function and its children are using. Nothing stands out — but this is the baseline. And when we compare it with the partial offload situation, it's not too different either. So how do we move on from here? One idea is to just pick a function, right? No — with previous knowledge, we know that this particular function is quite important: it's the one that gets called in the driver when it processes a packet the NIC has just delivered. So we can dive into it. If we didn't know that, we could go over every function; it would take a bit longer, but it would get us there.

Then we expand that view and we see these differences — here is the software one, and here the partial offload one. OK, we can see it's doing more stuff, and different stuff. But at the same time we don't have a good idea of what these numbers mean. OK, it's spending less CPU time in napi_gro_receive — but what does that mean? We can't make sense of it yet. Finding bugs is harder than finding Waldo.

So we go and check what those numbers mean. These screens count how many times each function was called. During this experiment napi_gro_receive was getting called 7 million times — and, as Paolo was explaining, this is the function that coalesces packets into a bigger one that the networking stack will process later. When we go to partial offload, it gets called 4 million times. And if you do the math between these two, it's pretty much the drop that we had in the throughput. OK, now we are starting to talk.
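As a rough sanity check on that statement, using the round numbers quoted above:

  4M / 7M ≈ 0.57   (ratio of napi_gro_receive calls, partial offload vs. software)
  11 / 18 ≈ 0.61   (ratio of TCP throughput, partial offload vs. software)

The two ratios line up closely, which suggests the regression tracks the number of packets going through the software receive path rather than anything at the TCP level.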
The other two functions that show up — the conntrack restore one and the representor TC receive one — are also getting called a lot here. Previously they were called too, but they just returned immediately, so they were basically blank calls; now they are quite present. It is also interesting to note that, even with partial offload, conntrack doesn't get entirely removed from the picture: we still have legitimate calls to it in the software processing, because the transmit path is not offloaded, right? And we also have calls to it inside the network namespace. So it's still not easy to make sense of these numbers.

Let's go back to that window and expand it, because now we have some more knowledge and it's easier to understand. When I expand the function calls, we can see that, because of the fallback to software, in order to restore the conntrack entry the driver is doing packet header parsing at the driver level, which is before GRO; it is consulting an xarray twice; it is doing two memory allocations — one for the skb extensions and one for the tunnel destination — and memory allocations, as you may know, are really not cheap; and it is consulting another rhashtable. All of this is done before GRO. So the driver does all this work for the packet and then hands it to the GRO engine, which realizes it belongs to the same flow as the previous packet and merges them — and all that per-packet effort is essentially redundant once the packet becomes part of the bigger aggregated packet.

That's why recovering this metadata from a packet that was partially processed by the hardware is actually more expensive than doing it entirely in software. In software we don't do any of this per packet: we just aggregate the packets and do that work only once. So we're talking about a trade-off between doing all of this 40 times, then aggregating and handing one packet to the network stack, versus simply aggregating the 40 packets first, handing them to the network stack, and doing the work once. It's another attempt at getting some benefit from the hardware that backfires. A proper solution would be either to avoid the situation, or to wait for the hardware vendor to support GRO hardware offload, so that the network stack and the hardware are more aligned in how they work.

Some conclusions. When dealing with performance, expect the unexpected: things may backfire and you may end up with more work to do. It is rarely a matter of just flipping a knob and everything is fine, because it really depends on the use case you are working on — there is no one size that fits all. You can optimize the system as hard as possible for one use case, but another use case is different, and the same change may work, may not work, or may even work worse than if you hadn't done those optimizations. And there is one tool that pretty much rules them all: perf. If you check the man page, it supports a lot of things that can help you. You will likely need some knowledge of the kernel, the drivers, and the many subsystems in there, but it's a very helpful tool that is well worth your time to understand. And that's it. Any questions? I think we confused them all.

Without SELinux — sorry, let me repeat the question. Regarding the first case study, I said that we could get better performance without SELinux. Yes, that is actually well known — but don't tell Paul Moore, who is the SELinux maintainer. Well, he knows, actually. If we go back a little bit, to one of the first slides — here — you can see that the fourth topmost offender is selinux_socket_recvmsg. That is the hook used by SELinux to enforce its policy. If you disable SELinux at boot time — not just making it permissive, but adding selinux=0 on the kernel command line — that function will not be called at all, and that overhead will go away. Here it is 3-something percent; at the higher packet rates we can now reach with UDP GRO it will be more visible, and removing it will give you some relevant gain. But don't do that.

Next question, about cache lines. Excuse me, can you repeat the last part? OK. So the question regards cache utilization: I mentioned that processing new packets will cause cache misses, and the suggestion is to disable caching for the packet data to avoid those cache misses, if I understood correctly. Yeah, but that would not solve the problem per se, because the CPU takes the cache miss precisely because it has to access the data — for example to fetch the MAC address for Ethernet processing, the IP addresses for IP processing, et cetera. If you don't have the cache in between, you have to go all the way down to main memory and spend a lot of CPU cycles, so in the end it's exactly the same as taking a cache miss there. Other questions? No? Then sorry, we are out of time. Thank you.