Next up is Andy Gospodarek, a software architect at Broadcom, who will be talking to us about improving network latency and throughput with dynamic interrupt moderation.

Yeah, as expected, this is a talk with wide appeal to many conference attendees. Sometimes a topic that feels dry, like the kernel, or that doesn't contain a container buzzword, might seem problematic, but I'm really proud of this work we did, and I'm really glad I could come here today and share a little bit of it with you. In particular, I think this is a good fit for this track, surprisingly. So rather than calling it what it is, which I can't even read (you know, improving network latency with DIM), we're going to talk about auto-tuning your network, and I thought I'd add a little picture of our favorite Auto-Tune artist there.

So what is dynamic interrupt moderation? For those that need it, we'll go through a little bit of review, because people probably aren't familiar with how packets in the Linux kernel actually make their way from physical hardware into the kernel stack itself. The main idea is that we're going to tune the time between when the first frame arrives off the wire and when an interrupt pops. There are a variety of reasons to do this, and we'll go into those in a little bit, but this is the flow: an interrupt pops, we schedule a polling event, and that polling event ultimately reads the receive ring of the NIC. In this beautiful picture we have frame zero through frame N, plus a head and a tail; these are essentially the frames that have not yet been read out of hardware and marked as complete.

In the typical workflow, with the arrow going from left to right indicating time moving on, each upward-facing arrow signifies an interrupt, and the stack of five rectangles indicates five frames read out of the ring buffer. So with a fairly consistent flow of traffic coming in, the gap from this first arrow to the next one (as I'm waving my hands) represents the interrupt timing we would have: a few frames come in, we service them, another interrupt pops, more frames arrive, et cetera. In a steady state this looks pretty good.

If we have a short interrupt time, it of course means we process a really small number of frames in each polling event. That can be good if your concern is latency, and bad if your concern is throughput, because an interrupt is pretty expensive. If we think about doubling the interrupt period with the same traffic flow, we get a situation where, instead of receiving five frames with each polling event, we now receive ten. That's a good description of a workload where you want high throughput; the downside is higher latency.

As you might or might not be surprised to find out, this is not a particularly new problem; it's something people have been dealing with for a long time, and one of the first attempts to deal with it fell to administrators.
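Before getting into that history, here is roughly what the receive path just described looks like in code. This is a minimal sketch, not any real driver; the structure and names are illustrative and only exist to make the trade-off concrete: the coalescing timer decides how long the NIC holds off the interrupt, and the poll then drains whatever has accumulated between head and tail.

```c
#include <stdint.h>

#define RING_SIZE 256

struct rx_ring {
	void    *frames[RING_SIZE]; /* descriptors / frame pointers           */
	uint32_t head;              /* next slot the NIC will fill            */
	uint32_t tail;              /* next slot the driver will read         */
	uint32_t coal_usecs;        /* interrupt moderation timer: the knob   */
};

/* Called from the polling context (NAPI in the real kernel) after the
 * interrupt fires. A longer coal_usecs means more frames have piled up
 * between head and tail by the time we get here (good for throughput);
 * a shorter one means fewer frames per poll but a quicker reaction to
 * each of them (good for latency). */
static int poll_ring(struct rx_ring *ring, int budget)
{
	int done = 0;

	while (ring->tail != ring->head && done < budget) {
		/* hand ring->frames[ring->tail] to the network stack here */
		ring->tail = (ring->tail + 1) % RING_SIZE;
		done++;
	}
	return done;
}
```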
Yeah, it's literally been decades. I think I first came across an issue like this easily in the aughts, if that's what we're calling the previous decade, and I regularly had to talk to customers when I was at Red Hat to try to figure out how they should tune their devices. The first attempt at really dealing with this was in one of Intel's one-gigabit adapters: they had a hardware feature called AIM, or adaptive interrupt moderation. That was actually the source of what ultimately became a fairly long-running investigation into why someone was having a particular problem, because they were primarily concerned with low latency, not throughput. Unfortunately, at the time, when we were only dealing with a single receive queue, most people were concerned with throughput; that was one of the big tests that was done.

So one of the things about AIM is that it was liked by some and disabled by many. Like many hardware features that have existed in the past, there's always a little bit of angst: a hardware designer builds it, rolls out some software to configure it, maybe it doesn't work exactly as everybody expects, and then there's significant frustration about, you know, why does this feature break my network? It's kind of the same story over and over: it works for a lot of folks, but the lack of flexibility that exists in hardware isn't good enough for some, and people always seem to default to thinking that software is more flexible and better, which in many cases it is.

So at the time, one of the interesting things is that we sat around the office and postulated whether it would be good to have a user-space daemon that controlled this interrupt timing. This was when we were first starting to see tuned profiles come out on various Linux distributions, mostly Red Hat and Fedora. You could tune your workstation or your laptop for whether you were most concerned with high performance or with better battery life, and I think there were even some networking configurations available; many of these things twiddled bits in Intel's power management capabilities of the era. And so we thought: what if we did something parallel? What if we came up with something where an administrator could say, at the beginning of time, you know what, this is a workstation where latency is the most important thing, so let's tune for that; or, you know,
this is a file server where we care most about moving bulk traffic on a regular basis, so let's tune for that instead. So we sort of sat around the office and pondered whether that would be a good idea, and I think, when I look back on it now, we've come to the realization that it was completely and totally the wrong strategy. We could say it was blind luck that we didn't implement it, but realistically it was probably more about laziness than anything else.

So let's fast-forward a few years and think about where we are now: machine learning and AI everywhere. I'm amazed, sadly amazed, by how much of it is on our phones: things automatically presuming the time of day you want to do something based on where you're physically located. I don't know, it's funny to me how impressed I am by little tiny simple things that you're probably never taught in any CS or computer engineering program anywhere.

One of the first things that came to mind is a talk Tom Herbert gave in Montreal last year. He talked a little bit in his keynote about the impact of artificial intelligence, and you can see I've got a screen grab of his video on YouTube, with the link here, all very clickable for everybody right now. He talked about the fact that machine learning (and he's got a new company that I think machine learning will play into) raises questions like: will the latest congestion control algorithm, TCP BBR, be the last human-written congestion control algorithm that exists? It struck me, when I was thinking about this, how interesting it would be for that to be the last one written, and how, through machine learning, we could come up with better ones automatically. It's a little bit Skynetty, a little bit scary, but at the same time I think the power we have, massive compute power plus software's ability to do the same thing over and over (effectively like that mouse moving around on the screen), is good for us.

Coincidentally, Mellanox added support for what we're now calling DIM in their mlx5 25/50/100 gigabit driver in 2016. The fact is, I was trolling around looking at their driver and wondered: now, what is this operation here? It doesn't really make sense. They're doing something on receive.
They're doing a little bit of data gathering, it looks like, and they're kind of using it to make a decision later. And that's exactly what they were doing: they were calculating how many bytes were coming in, counting the number of times an interrupt popped, and using that data to come up with what they felt was an optimal setting for their receive interrupt timer. So if we go back here for a second, remember our two pictures: this one is pretty steady state, regular interrupts each servicing a small chunk of packets; this one has a longer interrupt period, servicing more bulk traffic. It looked like they were trying to figure out which timing was best based on the traffic that came in. Not pictured in either of these slides is the fact that each one of these frames could be a different size, which also plays into it. It's easy to think that a 64-byte packet and a jumbo 8k or 9k frame are the same, but realistically they take different amounts of time on the wire, because there are discrete bit times required to handle them. So I thought that was pretty interesting, that Mellanox had that, and we started looking at it.

This is basically how it works; on this slide, credit to Tal Gilboa from Mellanox, who gave a talk on this earlier this year: take a sample, compare that sample to previous iterations, and then decide whether or not you want to make a change. When we dug into it, it seemed pretty good.

The other cool thing, and one of the things we hear in a lot of talks (as a kernel developer, I'm okay with it), is escaping the constraints of the kernel. People feel the kernel limits them, and that DPDK or some other thing is better for their individual applications; I would 100% believe that. One of the other things this approach allows us to do, by running it in the driver, is escape the lock-in that the global ethtool API imposes when configuring these interrupt timers. In the past, and still today, because there's interest in keeping ethtool pretty static, if you configure interrupt timing it applies across all queues. We of course now live in a networking world where it isn't just a single queue receiving all this traffic: multiple cores are tasked with servicing it, which is how we get to 50, 100, and pretty soon 200 gig Ethernet on a server. So this lets us escape some of that lock-in.

What we really found is that because it operates independently per queue, we can also have different types of traffic handled by different cores. This is especially useful in a virtualization case, where you might have an application that needs to be low latency running in one VM, and another application that's ultimately serving as a storage destination. All of a sudden we can have the best of both worlds: we could run a netperf stream test and see pretty much maximum throughput, fully utilizing that core, and at the same time run a TCP_RR test with netperf and see low latency, because they end up being serviced by different CPUs.
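To make that "take a sample, compare it, decide" loop a little more concrete, here is a minimal sketch of the measurement half. This is not the Mellanox code or the upstream net_dim source, just the idea behind it: snapshot the interrupt, packet, and byte counters, turn the deltas into per-millisecond rates, and keep them around so the next window can be compared against this one. All of the names below are illustrative.

```c
#include <stdint.h>

struct dim_sample {
	uint64_t time_ns;   /* when the sample was taken              */
	uint32_t events;    /* interrupts fired so far on this queue  */
	uint64_t packets;   /* packets received so far                */
	uint64_t bytes;     /* bytes received so far                  */
};

struct dim_stats {
	uint64_t epms;      /* events (interrupts) per millisecond    */
	uint64_t ppms;      /* packets per millisecond                */
	uint64_t bpms;      /* bytes per millisecond                  */
};

/* Turn two raw samples into rates over the interval between them; the
 * decide step then only has to ask whether these rates went up or down
 * relative to the previous window. */
static void dim_calc_stats(const struct dim_sample *start,
			   const struct dim_sample *end,
			   struct dim_stats *stats)
{
	uint64_t delta_ms = (end->time_ns - start->time_ns) / 1000000;

	if (!delta_ms)
		delta_ms = 1;    /* guard against a zero-length window */

	stats->epms = (end->events  - start->events)  / delta_ms;
	stats->ppms = (end->packets - start->packets) / delta_ms;
	stats->bpms = (end->bytes   - start->bytes)   / delta_ms;
}
```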
So that was super cool. Now let's talk a little bit about the algorithm itself. It's not super amazing, and the great thing is that the algorithm I'm here talking about today is actually open source: it's in the kernel, you can look at it. So spending a lot of time explaining how it works probably isn't super valuable, because I know everyone loves reading kernel code; I know it's what helps them sleep at night, as well as the first thing they read in the morning.

In the typical case we have five profiles that exist right now. What you can see across the top is the different timer settings. Down here on the far left is the low-latency case, where we want the timer to pop really quickly; down on the far right, the 256-microsecond timer is the high-throughput case. The reference to left and right is something that's baked into the algorithm. Everything starts down here at the low-latency case, and I think it makes a lot of sense to start there rather than in the middle, because low-latency traffic is typically going to be small: quick sessions, a small number of bytes. So we default to that, and the rate at which we sample and make changes quickly moves us down the line to the right.

The decision tree is really pretty simple. We have our previous decision, either right or left; we compare the samples that we've collected on every single packet we receive and every single interrupt we process; and then we decide: is this better, worse, or the same as before? If it's the same, we park it, an analogy that probably applies to everybody that drives. If it's worse, we turn around: we go left in the case where we were previously going right. And if it's better, we keep going the same way, further to the right. The compare-samples piece can also be tuned a little bit depending on your workload or your speed, but one of the coolest things is that I've tested this across fast processors, slow processors, and super fast processors, if we want three examples, and it holds up across all of them. It's something that works in a small system, maybe even a 32-bit ARM case, and it works well on the latest Intel devices.
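Here is a rough sketch of that profile table and the left/right walk as just described. The microsecond and packet values are illustrative rather than the exact upstream net_dim tables, and the type and function names are mine, but the step logic mirrors the description above: same means park, worse means turn around, better means keep going.

```c
#include <stdint.h>

struct dim_profile {
	uint16_t usec;   /* interrupt coalescing timer */
	uint16_t pkts;   /* max frames per interrupt   */
};

/* Leftmost entry is the low-latency default; rightmost favors throughput. */
static const struct dim_profile profiles[] = {
	{ 1, 2 }, { 8, 8 }, { 64, 32 }, { 128, 64 }, { 256, 128 },
};

enum dim_dir { DIM_LEFT, DIM_RIGHT };

struct dim_state {
	unsigned int ix;    /* current profile index, starts at 0 (low latency) */
	enum dim_dir dir;   /* direction of the previous move                   */
};

/* better > 0: the last move helped; better < 0: it hurt; 0: no change. */
static void dim_step(struct dim_state *s, int better)
{
	const unsigned int last = sizeof(profiles) / sizeof(profiles[0]) - 1;

	if (better == 0)
		return;                             /* park: keep current profile */
	if (better < 0)                             /* got worse: turn around     */
		s->dir = (s->dir == DIM_RIGHT) ? DIM_LEFT : DIM_RIGHT;

	if (s->dir == DIM_RIGHT && s->ix < last)
		s->ix++;                            /* toward 256 us / throughput */
	else if (s->dir == DIM_LEFT && s->ix > 0)
		s->ix--;                            /* toward 1 us / low latency  */
}
```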
All right, so I mentioned Intel and Mellanox, but what about Broadcom? I mean, they're the ones paying for me to come here and talk about this. Of course, this is the big reason I'm here: we found this interesting, we ported it to our driver, and we really liked what we saw; in fact, other people confirmed they really liked what they saw too. So here are some super fun graphs.

With default and adaptive coalescing, in the first picture on the left, you can see that throughput was basically unaffected by the number of streams. This was a 25 gig NIC, which is why we're up there at the top: we can almost fully utilize it with just one core, and certainly once we hit two cores, or two streams, we're utilizing it a hundred percent. The point of this graph is to show that even with the small hit that comes with cataloging this information, we were pretty much right at maximum throughput; there's no hit there.

The graph on the right is a little more complicated to understand, so I'll explain it a little. The x-axis represents the number of streams in use, and the y-axis represents the total CPU utilization. Unsurprisingly, with one stream we're utilizing one core completely. The graph is kind of funny in that it looks like there's a two-and-a-half-stream example there; that wasn't really what we did. There's a data point at two, and it would be nice if whatever spreadsheet technology we used to graph this had chosen to put the lines at two, but anyway. At two streams, with the default coalescing settings we have in our driver, we saw much higher CPU utilization, because there wasn't the ability to adapt and have the interrupt timer move way out. On the graph on the right, lower is better, so adaptive is clearly winning as we scale up toward eight streams, or eight cores being used for receive traffic: we're still barely utilizing two and a half cores completely when you add it all up, versus probably close to four and a half or four and three quarters with the default settings. So we feel like this is going to be a huge win in the throughput case.

The other thing we did was some TCP_RR performance testing. I hesitate to show raw numbers here, because every time I get a new system in with a new processor these numbers all change, but we went ahead and put them in anyway. With our original static coalescing, at the best rate we could manage, we could do about 20,000 transactions per second; with adaptive it was a little bit less. I'll talk about why there's a 4% reduction, but to be honest we were really happy with this. The fact that we're paying attention to every interrupt and every byte that comes in, and doing computation on those (not on every packet, but statistically, within a certain number of packets, we analyze whether or not we need to make a change), and that this only caused a 4% hit in total in this single-stream test, we knew was going to be a real positive, at least for the people we were going after with this. And they were quite pleased.

We also confirmed that one receive ring can be optimized for low latency and another for high throughput. This was really the case that I think was most interesting to me. That flexibility just doesn't exist today in the Linux kernel, so by adding this feature we were able to provide something that really no one else, other than Mellanox, could do, and I was really happy about that.

What we decided to do, rather than just take Mellanox's code and completely add it to our driver (which seemed really weird in some ways), was that I worked with Tal Gilboa at Mellanox and we actually made a generic layer, a library. Now, yesterday, if you sat through one of the late afternoon talks, there was a panel where it was said that AI is not just about adding a library and thinking that everything just works magically. I won't necessarily refute that, but I will say that in this case, that's one of the cool things: you can just add a library. You add the right probe points within your driver, you add a function call that can set this value in your hardware, and you can just use it. In fact, after posting my first patch upstream I got several off-list emails from people who were interested. One of them happens to work for Broadcom, but not in my division, so I didn't know he was interested; the bcmgenet driver used this right away, and in fact they also adapted it and wanted to use it for transmit as well. This is a great example, in my view, of the power of this, because that's a driver for an ARM SoC that's typically embedded in set-top boxes.
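To give a sense of what "you can just add a library" looks like from a driver's point of view, here is a rough, self-contained sketch of the pattern: a probe point at the end of the poll loop, plus a callback that only touches the NIC when the chosen timer actually changes. The real library is include/linux/net_dim.h in the kernel (later moved to linux/dim.h); the types and function names below are made up for illustration and are not the exact upstream API.

```c
#include <stdint.h>
#include <stdio.h>

struct ring {
	int          queue;       /* queue index, used only for the printout  */
	uint32_t     irq_events;  /* running counters the poll loop updates   */
	uint64_t     rx_packets;
	uint64_t     rx_bytes;
	unsigned int coal_usec;   /* timer value currently programmed in hw   */
};

/* Stand-in for the library's measure/compare/decide step (sketched earlier);
 * it would look at the counters and return the timer this queue should use. */
static unsigned int dim_decide(const struct ring *r)
{
	return r->coal_usec;      /* placeholder: keep the current profile */
}

/* The expensive part: in a real driver this is a register write or firmware
 * message, so it is only called when the chosen value actually changes. */
static void hw_set_rx_coalesce(struct ring *r, unsigned int usec)
{
	printf("queue %d: rx-usecs -> %u\n", r->queue, usec);
	r->coal_usec = usec;
}

/* Probe point: called once at the end of every polling event for this ring. */
static void ring_poll_done(struct ring *r)
{
	unsigned int want = dim_decide(r);

	if (want != r->coal_usec)
		hw_set_rx_coalesce(r, want);
}
```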
So if you've used pretty much any of the triple-play offerings from ISPs, where you can plug a phone in, plug some Ethernet in, and it has Wi-Fi built in, that's the type of application this driver serves. In their case they've got a wide array of traffic patterns: you might have home-use traffic that is, you know, streaming video, so you can have large frames and you'll want to make sure you're optimized for that, but you're also going to have other flows that are very small and very short-lived. As soon as this came out, Florian Fainelli (he's the one that did this work) was excited about it, because they'd been pondering the fact that they saw such a huge difference in the way their systems performed when they used different values. So the fact that this could tune it for them without doing anything, they were pretty stoked about.

More drivers to follow? I don't know. I've talked to folks at Intel; they have a little something in their driver that does something similar, and they also have some hardware with some fun features. We additionally have, actually amazingly, lots of hardware IP blocks to try to handle this situation, more than just a basic interrupt timer. We've got several things that we don't completely expose because there's no API, and part of the challenge here was Michael Chan and I working out how we should handle this and how we can best give customers, and more importantly administrators, the opportunity to run this. The theory should be that when this is working, and it's in the distros, and it's everywhere, there should be zero support calls. Again, that's the theory: there should be zero support calls from anybody who says, oh, my network's not performing in this low-latency case, oh, it's not performing in this bulk-transfer case. This should be done; it should eliminate those calls. We have outsourced this work to the machines.

So I want to share just a couple of observations, some surprising, some not. For me, this has been a fun thing to work on, which, at this stage in the game, having worked on the kernel as long as I have, is sometimes a little bit rare. One of the first things we came across is that programming hardware can be expensive. And when I say expensive, we're still talking about milliseconds or microseconds, but it can be, and this is common across multiple hardware vendors. In fact, we spent a lot of time tuning and understanding when the ideal point to sample is, and when the ideal point to decide whether or not we should make a new change is.
Because you can do it so frequently that you see a much greater than 4% reduction in your low-latency tests, and this expense, when it runs on the same CPU that's receiving the traffic, is going to cause a small interruption in traffic. We talked about scheduling it on other CPUs, and we decided that was an experiment we could look at another time. But another thing to keep in mind is that the cost of doing these operations to hardware is never free. A good thing to remember.

The other thing we found is that we had a few benefits appear sort of unexpectedly. When we were doing some testing, a typical test case where you have a whopping two devices involved and you're transmitting from one to the other, and, as with almost anything, you have an experimental group and a control group: what we started with was our adaptive interrupt moderation on our test server running an upstream kernel, and another system just running an upstream kernel with our normal driver, and we slammed traffic at it and watched what happened. One of the things we found is that we were not getting the throughput we expected. We were sort of scratching our heads a little bit, saying, well, I would expect to see it moving up to the higher profile (we added some debugfs support so we could see this in real time), and it just wasn't happening as efficiently as we thought it could. Some of that was because I'd previously tested two systems back to back, and now we were doing the control group. The performance wasn't worse, it just wasn't as good as I thought it could be. What I realized is that on a sending system, despite not having any transmit interrupt moderation features enabled, ACKs basically get classified as low-latency traffic: ACKs are small, they're coming all the time, and the speed at which you receive an ACK definitely determines how quickly you're going to send out traffic again. So we actually did some tuning: we emulated what we thought the algorithm would have done and, on the sender, moved away from the bulk-traffic setting to just a low-latency profile, and we actually saw improvements in CPU utilization. So that was kind of fun. To me this is one of the examples we can point to where, had I or the other folks working on this spent a lot of time thinking about it ahead of time, we probably would have come to this conclusion. Maybe, maybe not, you never know; we might give ourselves too much credit. But the difference was that just trying this enabled something newer and maybe more fun than we expected, and it was an improvement. So for me this is a thing I'm going to continue to think about as a big win for AI showing us something we didn't think we could do before.

The other big takeaway for me is that the kernel has a ton of configuration knobs, a ton, and many of the folks that worked on them are no longer working on the kernel; they're doing the next most interesting thing they think exists, or they're too busy working on the next version of hardware or whatever. But I think there's a lot of low-hanging fruit out there for us to really examine among the different kernel config options.
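Coming back to the expense point for a second: as mentioned earlier, we don't analyze on every packet but statistically, within a certain number of packets or interrupts. A toy sketch of that gating is below; the threshold and names are made up for illustration, but the idea is that too small a value makes the bookkeeping (and any resulting hardware writes) eat into latency, while too large a value makes the timer adapt too slowly.

```c
/* Gate the measure/compare/decide work so it only runs once every
 * SAMPLE_EVERY_IRQS interrupts instead of on every packet. */
#define SAMPLE_EVERY_IRQS 64

struct dim_gate {
	unsigned int irqs_since_sample;
};

static int dim_should_sample(struct dim_gate *g)
{
	if (++g->irqs_since_sample < SAMPLE_EVERY_IRQS)
		return 0;            /* cheap path: just count and return  */

	g->irqs_since_sample = 0;    /* time to take a sample and compare  */
	return 1;
}
```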
Back to those configuration knobs: take, for example, just the discussion I had about the profiles that exist. Why do any of those need to exist? What would it take for us to figure out, with any of these things, what the ideal settings are for highest performance, or what the ideal settings are for low battery life? Even take things like the data-plane technologies that are of interest to me right now, whether that's BPF and XDP or even DPDK: why do we have to guess at the proper number of packets to batch? Any time we're doing reception we can improve packet performance by batching, so let's figure out how many that is automatically. Let's not spend four days with a person recompiling and testing over and over again to try to figure it out. So that's my encouragement for all eleven of you that are here: go forward and think about whether the areas you work in can be tuned automatically.

I also want to leave a little time for questions, but I want to make sure to give a shout-out to Gil Rockah, Achiad, and Tal from Mellanox, who came up with the initial design and implementation of this and pushed it into their driver, and to Rob Rice, Lee Reed, and Michael Chan from Broadcom. And, of course, to the copyright holders of all images used in the presentation. So that's all I've got. Questions? Please say no.

[Audience comment]

No, given more time I would have loved to have auto-tuned the entire presentation. Cool, well, thank you.

[Audience question]

Yes, absolutely. So the question was: outside of a lab environment, what sort of other applications have been tested? For me, that's mostly what I've done, because a lot of this was motivated by a specific requirement from a potential customer. They had some workloads that they weren't able to emulate with netperf particularly effectively, so they came up with this recipe and said: okay, if you can do this, and this, and this, and this, with no touch, you have a chance at winning. A lot of what I do in my job now is figuring out what it takes to do that. So they came to us with the TCP_RR test with netperf, some of the TCP stream tests, and some other specific things.

Aside from just system-to-system tests, the other thing this has been tested on pretty heavily, from our own interest for another reason, was actually sinking the traffic into a VM: a regular host pounding a VM with traffic, both TCP_RR and TCP stream, with the VM as the sink for the traffic. That's actually one of the interesting points to me; that's where this really shines, whether it's Broadcom hardware or Mellanox hardware, because the VM (which might be running virtio) has zero control over what's happening. So now you've got a way where it doesn't matter what workload is being run on those VMs; you've given them both a chance to be successful, because typically those separate IP addresses, those separate streams, are going to hash to separate CPUs.
So unless you have really bad luck, they're going to be received at different rates on different CPUs, and that, I think, is a big strength. I look forward to these upstream kernel changes rolling down into the main distros and being used in virtualized environments like that, whether it's OpenStack or other places. I think that's going to be key.

[Audience question]

Yeah, so it landed in January in Dave Miller's tree, so that probably means 4.16. So yes, it's freely available in everything shipping past that point. Did you have a... oh, yeah, absolutely.

[Audience question]

So, ESX: I can definitely say... okay, I shouldn't say definitely. The question was related to whether virtualization environments, ESX or KVM, et cetera, have this. Broadcom maintains our own version of, and collaborates on, an ESX driver, and no one on the ESX driver team has asked me anything about this, which is typically a sign that it hasn't been implemented. Not always, but no one's asked. In a KVM environment, if you're running a new enough kernel, this is available. So if the base kernel on your hypervisor is, I'm just going to go ahead and make a blanket statement and say 4.17, although really I think 4.16 or 4.15 is probably right. Like I said, it landed in Dave Miller's tree earlier this year, and his tree is a development tree, so it's always, you know, one version ahead; I always have to add one to whatever is there, because he keeps Linus's tags. I probably should have done that homework. But yeah, if you go out and run Fedora with this right now, or probably even Ubuntu 18.04, it's probably got a new enough kernel that it's going to be there.

[Audience question]

Okay, it keeps one: it knows the last state, and that's it. Yeah, very, very low overhead, and that's one of the things we really liked about it. It was kind of crazy how simple it was, how low-overhead it was, and how small an impact it had. The impactful part is actually the delay hit, a couple of milliseconds or less, that you take writing to the hardware if you have to make a change. The sampling itself we never had to worry about tuning; you're talking about one or two instructions that, with a good compiler, are probably going to slide right in with some other delay you already have in the network stack. The cost is really in how frequently we write to hardware. I can tune that and watch it change: if I write to hardware every hundred packets, throughput and latency suffer heavily because you're spending so much time writing out; obviously if you do it every million packets it's less useful, especially since most flows aren't that long. But yeah, it's very lightweight; I was shocked at how well it worked. And that's the thing, too: it doesn't have to be complicated. The base layer is created in such a way that if you wanted to do a much more complicated, stateful inspection, keeping more history and preemptively deciding, based on something that's coming in, which way you should go, you could do it. But this is such a great, easy, minimal-hit way to start.

Thank you, Andy. We have a coffee break from now until 11:20, and we will be resuming the session then.