Hello, can you hear me okay in the back? Good. Okay, cool. All right, I think we'll get started. So my name is Eric Carver, and I'm going to talk to you a little bit about virtual networks, and more specifically about CPU utilization. After the talk feel free to come up and chat, or you can send me an email at any of the addresses listed here.

So this is our agenda, what we're going to cover. We'll talk a little bit about why we even care about CPU utilization; processors are super powerful nowadays, so who cares? We'll talk about the tests, how they're set up, and which tunnel types we're going to test. Then we'll go over the results, and finally we'll talk about what we learned.

So let's start with why we care. The short answer is money. CPU cycles aren't free; they cost money, especially if you're running inside of a virtual machine. Some providers charge by unit of work, and some providers have a credit limit, so whatever you can squeeze out of your machine, that's what you want to do. It can also affect the workload: if you're spending cycles on networking, you might have a better use for them in your actual application. Energy use is another good reason; data centers are big and they use a lot of power, so whatever we can save is useful. The last note here is the most interesting piece for me: these results show areas where we can improve. Some of them might point us at a bottleneck or something like that.

So here's a list of the tunnels we're going to look at, or at least some of them. First is L2 GRE, also known in Linux land as gretap. It's basically Ethernet inside of GRE. Of the IP tunnels it probably has the least overhead, about 38 bytes for the encapsulation. The second one we'll look at is VXLAN; that's probably the most common. Again, it's Ethernet inside of IP, but actually inside of UDP. It's a little bit bigger than GRE, a little more overhead. We'll also look at Geneve, which is a lot like VXLAN: a little bit different, but still inside of UDP, so they're very, very similar. And the last one we'll talk about is 802.1ad, also known as Q-in-Q. It's basically two VLAN tags, and it has the least overhead as far as packet data goes. We'll look at the native Linux tunnels, but then we'll also look at some Open vSwitch ones too.

So, sorry, I meant to start with the test setup; I apologize for my slides, I'm no PowerPoint master. The test setup is pretty basic: two physical hosts, and for each tunnel type we have a network namespace on each end. The data path is a 4.9.5 Linux kernel. The test tools we're going to use are iperf for traffic generation, and then mpstat from the sysstat tools, which gives us our CPU benchmark. Each test runs for about ten minutes: we're just running traffic and collecting CPU stats that whole time.
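For reference, the three IP-based tunnels from that list can be created with iproute2 along these lines. This is a minimal sketch; the device names, addresses, and virtual network identifier are illustrative placeholders, not the values from the actual test setup.

```sh
# L2 GRE (gretap): Ethernet inside GRE, roughly 38 bytes of encapsulation
ip link add gretap0 type gretap local 192.0.2.1 remote 192.0.2.2

# VXLAN: Ethernet inside UDP, using the IANA-assigned port
ip link add vxlan0 type vxlan id 42 local 192.0.2.1 remote 192.0.2.2 dstport 4789

# Geneve: also Ethernet inside UDP, very close to VXLAN on the wire
ip link add gnv0 type geneve id 42 remote 192.0.2.2

ip link set gretap0 up; ip link set vxlan0 up; ip link set gnv0 up
```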
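None of the actual test scripts were shown, but the setup as described maps onto a few standard commands. Here's a rough sketch of the measurement loop, again with placeholder names and addresses; it assumes an iperf server is already listening on the far host.

```sh
# One network namespace per tunnel type on each host, wired up via a veth pair
ip netns add tun-test
ip link add veth0 type veth peer name veth1
ip link set veth1 netns tun-test
ip netns exec tun-test ip addr add 10.0.0.1/24 dev veth1
ip netns exec tun-test ip link set veth1 up

# Sample per-CPU utilization once a second for the whole ten-minute run...
mpstat -P ALL 1 600 > cpu-stats.log &

# ...while iperf pushes traffic from the namespace toward the far end
# (assumes `iperf -s` is running on 10.0.0.2)
ip netns exec tun-test iperf -c 10.0.0.2 -t 600
```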
Stretching the test over time averages out noise and gets us a better set of data.

For my tests I found it was useful to limit the number of CPUs. Otherwise you end up with scheduling noise and things like that, and the test results weren't as consistent as I'd hoped. So I basically took my machine and disabled all but two cores, and that made for much better, more reliable results. I also did not use any tunnel hardware offload. Newer NICs can do hardware offload for VXLAN and Geneve, but I didn't use any of that, because it defeats the purpose: offloading would hide the CPU cost we're trying to measure. (There's a sketch of both of these knobs after the graphs below.)

So here's a little visual representation of what I just talked about; it's intentionally very basic. This is VXLAN on both sides: OVS on one side and a native bridge on the other. Traffic goes from the namespace on one machine down, over to the next machine, and up to the other namespace, so it's basically traffic between these two namespaces. Like I said, a pretty basic setup.

So now we can look at some numbers. Before we do, we need to keep some things in mind. These tests target a specific bitrate, because we're not worried about how much data we push through; we're worried about how much processor we're using while doing it. Each test targets a certain bitrate, so we're going to look at a relative comparison of utilization between the different tunnel types.

So here's a graph. The set of scripts I wrote for this dumps a bunch of data into gnuplot, and this is what we end up with. This is the transmit side for all the tunnel types, for a single iperf stream. Along the bottom we have different packet sizes: 64, 1446, and 9000 bytes. The reason for 1446 is to make sure we don't get any segmentation of the tunneled IP traffic. The left side is just CPU utilization. This graph is a little less interesting than the other ones I'm going to show you, because with a single stream there's not that much difference between a lot of the tunnels. One thing you will notice is that the native tunnels do a little bit better than the OVS-based tunnels, and we'll talk about why a little later. Q-in-Q does a little bit better in most places.

For this next graph we'll bump it up to 128 traffic streams, and then we'll see much more drastic results. If you didn't notice, there's a legend up here that shows the different colors; hopefully you can all see it okay. So again, the same kind of thing I mentioned on the first graph: the native tunnels do quite a bit better than OVS, it's just more pronounced here. You can see it on the left-hand side, especially with the 64-byte packets. It's quite a big jump; I think we're looking at about 15% more CPU utilization there. Again, we'll talk about why a little later. Nothing else too interesting here; this one's kind of an outlier that we'll talk about too, I think.

The receive side shows something pretty interesting. On the jumbo frames, VXLAN looks a little weird. VXLAN is this green one right here, that's the native tunnel, and then the OVS one is over here. It's a little odd that native VXLAN is quite a bit higher than some of the other native tunnels.
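As promised above, here's roughly what the two setup knobs look like. The NIC name and the exact ethtool feature flags vary by driver, so treat this as a sketch rather than the exact commands used.

```sh
# Take every core except CPU 0 and CPU 1 offline to cut scheduler noise
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    case "${cpu##*cpu}" in 0|1) continue ;; esac
    echo 0 > "$cpu/online"
done

# Turn off UDP tunnel segmentation offload so the NIC doesn't hide the CPU
# cost we're trying to measure (check `ethtool -k eth0` for supported flags)
ethtool -K eth0 tx-udp_tnl-segmentation off tx-udp_tnl-csum-segmentation off
```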
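And since the diagram just came up: the OVS half of that picture boils down to an OVS bridge with a VXLAN tunnel port, something like the following. The bridge name, VNI, and peer address are placeholders; the other host used the native vxlan device on a plain Linux bridge instead.

```sh
# OVS bridge with a VXLAN tunnel port pointing at the peer host
ovs-vsctl add-br br0
ovs-vsctl add-port br0 vxlan0 -- set interface vxlan0 type=vxlan \
    options:remote_ip=192.0.2.2 options:key=42

# Attach the namespace's veth endpoint to the same bridge
ovs-vsctl add-port br0 veth0
```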
Okay. Again, Q-in-Q looks pretty good, and we'll talk about why; it has a lot to do with the encapsulation being only about four bytes.

So here are some interesting data points. At 128 streams, Open vSwitch had, as I called out earlier, about 15% more processor utilization than the native tunnels at small packet sizes. Again, Q-in-Q is about 10% lower compared to the IP-based tunnels. And on the receive side, VXLAN was oddly an outlier among the IP-based tunnels.

So what does all this show us? Well, most of the IP-based tunnels, meaning GRE, VXLAN, and Geneve, actually behave pretty similarly, which I guess we can expect because they're all IP tunnels. VXLAN and Geneve are very, very close: they have a common encapsulation, they're based on UDP, and they also share some code in the kernel. I believe they use the generic UDP tunnel code, so that's why the results are so close.

And again, the native tunnels perform quite a bit better than Open vSwitch. Not drastically so, and you're getting a lot with Open vSwitch; you're getting a lot of flexibility, so it makes sense. That's how code works: you add features, and a lot of the time you lose performance. A lot of this has to do with Open vSwitch doing more packet inspection. OVS may match on flows, so it has to check whether every incoming packet matches a flow, and if so, it might apply an action. One thing I didn't do that I'd like to try is the same set of tests but with a lot of flows added to Open vSwitch, to see what happens; I would guess we'd see an even more pronounced difference. Maybe some of the Open vSwitch experts here can give us a guess. (There's a sketch of how that follow-up could start below.)

Again, Q-in-Q does well, and the reasons are that the encapsulation is significantly lighter, just based on packet size, and that we're not doing any routing. That's the other big difference: with all the IP-based tunnels, once you actually do the encapsulation you hand the packet over to the IP code, whereas Q-in-Q just goes straight out the device.

So, earlier we saw the oddity with VXLAN. This is definitely something that needs to be investigated; hopefully I can find some time after DevCon to do that. It seems really weird for the receive side to be so drastically different from the other tunnels, and we did not see that on the transmit side, so it's doubly weird, I guess.

So that's all well and good, but there are also a bunch of reasons why you might not care about these results. IP is convenient; even knowing all this, you may still want to just use your IP-based tunnels. We saw that there are some better-performing alternatives, but you might want to keep the convenience of IP. Q-in-Q also has inherent problems; it's been around for a long time. One of the major issues is that if you're using it in your network, all your core switches are going to be learning the MAC addresses inside that tunnel, so you may hit a scaling issue, especially in the world of VMs. One crazy idea: you could do a combination of these. You could do Q-in-Q locally, and then if you have a reason to get to a different data center or a different part of the network, you can go through an IP tunnel to get there. So it's not an end-all-be-all, pick-one-and-that's-all-you-have situation; you can mix and match.

So that's pretty much the results we have.
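For completeness, the Q-in-Q configuration is just two stacked VLAN devices, an outer 802.1ad tag and an inner 802.1Q tag, four bytes each on the wire, which is where the light encapsulation comes from. The device names and VLAN IDs here are placeholders.

```sh
# Outer tag: 802.1ad (service tag), 4 bytes on the wire
ip link add link eth0 name eth0.100 type vlan proto 802.1ad id 100

# Inner tag: regular 802.1Q (customer tag), another 4 bytes
ip link add link eth0.100 name eth0.100.200 type vlan proto 802.1Q id 200

ip link set eth0.100 up; ip link set eth0.100.200 up
```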
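The flow-table follow-up I mentioned could start from something as simple as pre-loading rules before re-running the same tests. The match fields below are arbitrary filler; the point is just to give the classifier more work per packet.

```sh
# Pre-populate the OVS flow table with a pile of rules
for i in $(seq 1 1000); do
    ovs-ofctl add-flow br0 "priority=10,ip,nw_dst=203.0.113.$((i % 250 + 1)),actions=drop"
done

# Keep a low-priority catch-all so the test traffic is still forwarded
ovs-ofctl add-flow br0 "priority=1,actions=normal"
```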
I do have some extra data points that we can look at toward the end, but I think we saw enough to say these three things in general. The IP-based tunnels are very comparable to each other, with a couple of outliers, and I think those outliers actually point to improvements we can make in the code. Q-in-Q had a significant performance benefit, but again, performance isn't everything for every setup. And the VXLAN receive side is a potential further improvement we can make by looking at the code.

So there were some other data points I didn't show because they were in the raw data. Let me back up here and show this. You saw earlier that I pointed out numerous times that Open vSwitch utilization was a little bit higher, and if you look at the raw data you can actually see a good indicator of where and why. I'm using two CPU cores, and for most of the native tunnels one core runs at roughly 80% utilization while the other is mostly idle, at 10 to 20%. That's not true in the Open vSwitch case; there it's more like 80 and 50. So it's quite a bit more significant: that second core is using a lot more cycles than with a native bridge.

Let me show some more. I ran a lot of different iperf tests. I also ran 32 streams, and this shows the same thing, not quite as pronounced as when we jumped to 128, but you can still see there's an odd outlier with VXLAN here, and it shows up a lot more in this set of data. This is again on the receive side; if we back up to the transmit side, it's not there at all.

Yeah, I'll end by looking at this slide again. IP does well, and I was actually really impressed with the performance of the IP-based tunnels; there's been a lot of work in that area, and it shows very clearly that they do very well. I think that's all I had, unless anybody has questions or wants to look at any more of the data I have. Good.

So the question was what the target bandwidth was for all of these tests, and the answer is that it varied based on packet size. At 64 bytes it was relatively low, only about 128 megabits, I think, is what I targeted. Let me back up: you can see that once I upped the number of streams it was significantly higher than for one stream. And once we got up to jumbo frames it was more like 4 gigabits per second.

That's correct. So the question is why bother showing 64 bytes in these results at all, and the answer is you always want to know your worst case. Yes, and you're going to hit MTU size at some point.

So the next question is why we don't see a major drop between, say, 1446 and jumbo frames at 9000. Let's back up to this picture: part of the test setup is that inside some of these namespaces there's an MTU value of 1446. Basically, once we go past that we're just fragmenting anyway, so there's really not that much difference between the two. (A sketch of these bitrate and MTU knobs follows below.) Any other questions?
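As mentioned, here's a sketch of those knobs. The talk doesn't say whether TCP or UDP iperf was used or exactly how the packet size was pinned, so this uses UDP datagram sizes as a stand-in; only the 128 Mbit/s, 4 Gbit/s, 1446, and 9000 figures come from the talk.

```sh
# Small packets at a low target bitrate (64-byte datagrams, 128 Mbit/s)
ip netns exec tun-test iperf -u -c 10.0.0.2 -b 128M -l 64 -t 600

# Jumbo frames: raise the MTU, then target a much higher bitrate
ip netns exec tun-test ip link set veth1 mtu 9000
ip netns exec tun-test iperf -u -c 10.0.0.2 -b 4000M -l 8972 -t 600

# The 1446 MTU inside the namespaces keeps tunneled packets from fragmenting
ip netns exec tun-test ip link set veth1 mtu 1446
```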
So the question is whether I accounted for CPU cache lines, which I think I'll translate to CPU scheduling noise. That's the reason I went down to two cores. I was testing this on a box that has, I don't know, 32 cores I believe, and there were definitely cases where the numbers weren't as good as I was expecting, and they were inconsistent. So the answer is, the only thing I did was limit the number of cores.

I'm sorry, can you say the first part again? So the question is whether there's any latency correlation between the tunnel types. Okay, that would be interesting to know; I did not measure that. I imagine that as utilization spikes, your queues are going to back up, and therefore your latency is going to go up, but I don't have any measurements to prove that.

In the back? Okay, so the question is whether I also tried the inverse of this: max out the CPU and then see how much throughput I could get. I didn't do that for this talk, so I don't have any data to share, but it would be interesting to know. By max out the CPU, do you mean max it out with data, or max it out with some other process spinning, something like that? I guess both cases would be interesting to investigate, but I don't have any numbers for them now.

Any other questions? Going once, twice. All right, thanks, guys.