Alright, can you guys hear me? Alright, so my name is Han, I'm a grad student at BU, and this is collaborative work from over the summer at Red Hat with my mentor Sanjay. He gave a talk earlier about deep learning, and here we're trying to apply that to something that's more on the systems side, so operating systems and the physical hardware itself. Before I talk about why we want to do this kind of configuration of the hardware, let me start with some motivation, mainly a little bit of the evolution of CPU architecture. Hopefully all of you are familiar with Moore's Law. For the longest time we've had the benefit of doubling the number of transistors on the processor, and it stayed on a very nice trajectory until about a decade ago, when we could no longer physically shrink those transistors. One of the things behind that is Dennard scaling. Dennard scaling is what made Moore's Law useful, because it basically said that as you double the transistor count, the power density, and therefore the thermal dissipation needed to cool all those extra transistors, stays roughly constant. So effectively, by doubling your transistor count you're using about the same power to run your applications while roughly doubling your performance. And about a decade ago is where that trajectory stopped being possible. That's the modern era we're in: we cannot shrink the transistors anymore without generating tremendous heat, and that also hurts performance because you cannot keep all the logic on the die operating at the same time.

To address this there have been multiple approaches, both from the software perspective, meaning systems research, and from the hardware perspective. For the longest time, systems research relied on the fact that every two or four years we could double our performance just by switching to newer hardware, and that shaped how software was designed. To deal with the fact that we can no longer rely on faster clock cycles, there's been a lot of user-level software written to take advantage of the hardware more directly. DPDK is a user-level library that gives you raw access to the hardware to process packets, and there are other libraries for taking advantage of accelerators, vector accelerators, which are now very common in almost all data centers. There's also been an explosion recently in unikernels. There are all these different unikernel projects where the idea is that you can use a high-level language to build a kernel that's specialized for your application, running as a single process in a single address space. The whole point is that a lot of the overhead in a traditional operating system comes from the fact that we wrote it to adapt to many different kinds of workloads; if you're really shooting for performance for a single workload, a unikernel might be better for that. And at the same time, a gigantic heap of different kinds of hardware has been added to systems too.
Beyond the different kinds of accelerators, there's also programmable hardware being added. There are programmable SSDs, where you can write logic that operates on data as it's being written to the SSD, there's programmable memory, there are FPGAs, and so on. There are a lot more kinds of devices to talk to, and there are also many more layers in the memory hierarchy now, about seven or eight if you count the newest non-volatile memory. With all this hardware, a question we can ask is: if we're already running customized software for a single application, where we customize the system itself, could we also customize the hardware for a single application, and what would that even mean? The hardware already seems fixed; how much more could you customize it?

In this case, the hardware we're focusing on is the network card. A network card exists in every system: every time you send or receive a packet, it goes through the network card and up through your operating system stack before the data reaches you. At a high level, this is how packet receive works in a modern system. The packet arrives at the NIC, and on the NIC there's a set of receive queues; every received packet is placed in a queue and DMA'd to a region of memory that your system has allocated. When the card is ready to inform the software, the device driver, that there is data to handle, an interrupt fires, and a software interrupt handler processes the data and passes it up through the network stack to your application. Transmit is the same in reverse: you put the usual headers on top of your payload, place it on a transmit queue, and send it out.

This pipeline is the same for almost all network cards. The card we're specifically looking at here is a 10 GbE network card from Intel, released maybe eight years ago; 10 GbE is still widely used, it's not that dated yet. The interesting thing is that if you look at the actual data sheet for this card, there are about 5,600 registers on it that you can write to. They're 32-bit values, they each affect how the card behaves in a different way, and they're spread across all the different parts of the card. Part of my job was to see which of these values actually make a difference to the operation of the device. Here's a table of some of them, and the configuration I'm going to talk about is this one entry: the hardware interrupt delay value.
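As a side note, on a Linux box the usual way to poke at this kind of knob without writing raw registers is the driver's interrupt-coalescing interface exposed through ethtool. Below is a minimal sketch of inspecting and pinning it from Python; the interface name eth0 is an assumption, not every driver supports the adaptive-rx sub-option, and whether rx-usecs maps exactly onto the register discussed in the talk depends on the driver.

```python
import subprocess

IFACE = "eth0"  # assumption: substitute the interface actually backed by this NIC

def show_coalescing(iface=IFACE):
    """Print the driver's current interrupt-coalescing settings (rx-usecs, adaptive-rx, ...)."""
    out = subprocess.run(["ethtool", "-c", iface],
                         capture_output=True, text=True, check=True).stdout
    print(out)

def set_rx_delay(usecs, iface=IFACE):
    """Disable adaptive moderation and pin the receive interrupt delay to `usecs` microseconds."""
    subprocess.run(["ethtool", "-C", iface, "adaptive-rx", "off"], check=True)
    subprocess.run(["ethtool", "-C", iface, "rx-usecs", str(usecs)], check=True)

if __name__ == "__main__":
    show_coalescing()
    set_rx_delay(40)   # the static 40-microsecond setting explored later in the talk
    show_coalescing()
```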
This hardware interrupt delay value is a value you can set on the card so that after it receives a packet, it delays for a set number of microseconds before firing the interrupt and letting the software know it's time to process the packet. When the processor takes the interrupt, it pulls off as many packets as it can, either up to some budget or until some time limit, and tries to process as many of the packets sitting on the card as possible. If you do the simple math of multiplying the possible values of these registers together, you get a huge configuration space, on the order of 10^20 possible states even for just a subset of them. For the rest of the talk, as I mentioned, we're interested in how you could configure this one value, the hardware interrupt delay, for your application. The general question of the project is: can you do it in an automated way, where you don't have to manually run your benchmark to tame the complexity of setting all these values? And since this is the machine learning track, we're trying to use machine learning techniques to learn the best way to configure the device; I'll get to that towards the end of the talk. The first part of the talk is just investigating whether setting this value actually makes a difference for a set of applications, and what it would mean to then apply machine learning techniques to learn it. Okay, I skipped that slide.

So the first question is: if we change these hardware configuration parameters from the default values set by Linux, can we get better performance for an application? Here we use a very simple toy problem so we can understand things end to end; we didn't want something complicated like a heavy benchmarking tool where packets arrive from some randomly sampled distribution. We want something simple enough that we can reason about how many packets we should be getting, how many interrupts we think we should be getting, and so on. So we use an application called NetPIPE between two machines on their own VLAN that is otherwise mostly quiet. They're connected through a switch and they ping-pong back and forth with a message of size M for some N iterations. It's a synchronous benchmark: one side sends, the other receives and sends the same message back as a response, plus some ACK packets and so on, and we measure the throughput of that ping-pong. The knob we're turning is the interrupt delay value, and it only affects a single machine, the server. We treat everything else as a black box; we don't care how the rest is configured, we just want the server to be the machine where we retune the interrupt delay.
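The manual search described next boils down to a very small loop: fix one delay value, run the benchmark, record throughput, repeat. Here's a sketch of such a harness under the same assumptions as above (interface name, example delay and message-size lists); run_netpipe() is a placeholder for whatever launches NetPIPE and parses its reported throughput.

```python
import subprocess

IFACE       = "eth0"                               # assumed server-side interface
DELAYS_USEC = [0, 10, 20, 40, 80, 120, 200, 400]   # example sample points, not the full sweep
MSG_SIZES   = [1024, 10 * 1024, 64 * 1024]         # example message sizes in bytes

def set_rx_delay(usecs):
    subprocess.run(["ethtool", "-C", IFACE, "rx-usecs", str(usecs)], check=True)

def run_netpipe(msg_size):
    """Placeholder: launch the NetPIPE ping-pong for one message size and return
    the measured throughput; the real invocation and output parsing are omitted."""
    return None

results = {}
for size in MSG_SIZES:
    for delay in DELAYS_USEC:
        set_rx_delay(delay)                 # only the server NIC is reconfigured
        results[(size, delay)] = run_netpipe(size)

for (size, delay), tput in sorted(results.items()):
    print(f"msg={size:>6} B  rx-usecs={delay:>4}  throughput={tput}")
```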
Initially we did a very static search, and that's what these graphs show: on the x-axis we're changing the interrupt delay value, on the y-axis is the measured throughput, and the different bars are the different message sizes we tested with. We've been building up a 3D surface as we go, because it would take a very long time to run all possible message sizes against all possible delay values, so we did a somewhat random sweep over message sizes to see whether anything interesting showed up. The interesting thing we found was that at a message size of 10 kilobytes, roughly in this range, if you set the interrupt delay value to about 40 microseconds you get about a 50% increase in throughput compared to 20 microseconds. There were obviously many more points; I'm just showing a couple. The default bar is the policy that exists currently, a more dynamic policy inside Linux, where every time you get an interrupt it uses statistics about how many packets it has seen so far and computes a new interrupt delay within some range. That makes sense, because Linux has to be adaptive to all kinds of workloads, and since we're measuring only one simple workload that sends one message back and forth, we hoped it would adapt to that too. But it's interesting that a very specific, completely static setting, where we set the interrupt delay once, never touch it, and just leave it, gives you roughly a 50% boost in throughput at a very specific message size.

So what is going on here? One thing we tried was adding instrumentation code to log every packet seen when an interrupt fires. In Linux packet processing there's something called NAPI, the "new API": every time an interrupt fires and your interrupt handler is called, you have a budget for how many packets you can process before you get context-switched out or have to stop. That budget limits how many packets you can process per interrupt; if there aren't that many packets queued, you just process what's there and stop. [Answering a question:] Right, and when you send a message, depending on how TCP segments it, there can be multiple packets on the wire, and that's exactly the effect we were trying to see. The measurement here is: apart from the payload packet, how many of the other packets processed were acknowledgement packets? In other words, on average, for every receive interrupt that fires, how many acknowledgement packets were batched in with the actual payload packet, since TCP piggybacks acknowledgements. What this shows in this example is that at a 40-microsecond delay we are more efficient at processing ACKs than at the other settings.
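A cheap way to approximate this "packets batched per interrupt" metric without patching the driver is to diff the kernel's own counters before and after a run: per-interface packet counts from sysfs and NIC interrupt counts from /proc/interrupts. A rough sketch, with the caveat that eth0 is an assumed name and matching interrupt lines by interface name is a heuristic (the driver's queue naming may differ):

```python
import time

IFACE = "eth0"  # assumption

def rx_packets(iface=IFACE):
    """Total packets received on the interface since boot, from sysfs."""
    with open(f"/sys/class/net/{iface}/statistics/rx_packets") as f:
        return int(f.read())

def nic_interrupts(iface=IFACE):
    """Sum per-CPU interrupt counts for /proc/interrupts lines that mention the interface name."""
    total = 0
    with open("/proc/interrupts") as f:
        for line in f:
            if iface in line:
                for field in line.split()[1:]:   # skip the "NN:" IRQ column
                    if field.isdigit():
                        total += int(field)
    return total

p0, i0 = rx_packets(), nic_interrupts()
time.sleep(10)            # stand-in for "run the benchmark here"
p1, i1 = rx_packets(), nic_interrupts()

if i1 > i0:
    print(f"~{(p1 - p0) / (i1 - i0):.1f} packets handled per receive interrupt")
else:
    print("no NIC interrupts observed in the sampling window")
```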
We hit some threshold at that delay value where we process more ACKs per interrupt than in the other cases. That doesn't necessarily explain everything; it's just one metric we measured while trying to understand the phenomenon we saw. A couple of other metrics we measured were the total number of interrupts fired for the workload and how many instructions were executed over the benchmark. One interesting thing is that we actually ran far fewer instructions here than in every other case, and fewer interrupts, and the workload is identical across all of them: we send the same messages for the same number of iterations, and the only hardware parameter we change is this delay. Our intuition for what's going on is that we're effectively polling more efficiently: every time the interrupt fires, the handler drains as much as it can from the card, but if your interrupt rate isn't matched to the rate packets are actually arriving, you're just wasting CPU cycles. So what this shows is that at this 40-microsecond delay, given the message size and the client's behavior, we hit a sweet spot where we're much more efficient at processing packets. A lot of this work is about looking at measurements that already exist in the system to see how it behaves as we tune these hardware parameters.

[Answering a question:] Yes, that's the total interrupt count, and it's not necessarily 5,000, because for this run we're sending a 10-kilobyte message 5,000 times, which implies the TCP stack splits each message into multiple smaller segments and we get a bunch of interrupts from those. The MTU is 1,500, the default.

The other unintuitive result is that the default, the Linux dynamic policy, fires far fewer interrupts than anything else we've seen, and its instruction count is slightly lower than the static settings but not by nearly as much, and yet its throughput is the same as the others. That's weird. To investigate what the dynamic policy is doing, we logged every interrupt delay value Linux computed over the whole run, and this is a dump of it. What it basically shows, I think, is that for this benchmark it's not doing a good job of adapting: it's doing a large search across the space. The interesting thing is that by default it only sets interrupt delay values from zero up to about 120 microseconds, which is what this peak is, but you can actually set the hardware much higher, over a thousand microseconds, if you want. Part of the explanation may be that past some point you're delaying too long, you don't respond fast enough, so the sender doesn't respond fast enough either, and you end up throttling yourself. That could be what's happening; I'm not saying the policy is bad in any way, just that for this particular workload it's probably not doing as well as it could.
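To make the flavor of that dynamic policy concrete, here's a toy sketch of the kind of adaptive interrupt-moderation heuristic being described: bucket recent traffic by packet and byte counts, then pick a delay from a small, capped range. This is illustrative only, not the actual Linux driver code; the classes, thresholds, and delay values are invented for the example.

```python
# Toy adaptive interrupt-moderation heuristic (NOT the real driver logic).
LOWEST_LATENCY, LOW_LATENCY, BULK = range(3)

def classify(pkts, bytes_):
    """Bucket traffic seen since the last interrupt into a coarse class."""
    if pkts == 0:
        return LOWEST_LATENCY
    avg_pkt = bytes_ / pkts
    if pkts < 16 and avg_pkt < 256:
        return LOWEST_LATENCY          # chatty, small packets: interrupt ASAP
    if avg_pkt < 1200:
        return LOW_LATENCY
    return BULK                        # large streaming packets: batch harder

def next_delay_usec(pkts, bytes_):
    """Map the traffic class to a delay, clamped to a narrow range."""
    delay = {LOWEST_LATENCY: 0, LOW_LATENCY: 25, BULK: 100}[classify(pkts, bytes_)]
    return min(delay, 120)             # note the cap, far below the ~1000 us the hardware allows

# Per-interrupt bookkeeping would call this with the packet/byte counts it just saw.
print(next_delay_usec(pkts=8, bytes_=512), next_delay_usec(pkts=64, bytes_=90000))
```

The point of the cap in the last line is exactly the limitation mentioned above: the heuristic confines itself to a small slice of the range the hardware actually supports.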
Another workload we ran, probably more realistic than ping-ponging packets back and forth, is the kind of thing you run when you dedicate the whole system to it: an in-memory key-value store, memcached. Most of the time you run it under some SLA, say you want the 99th percentile of your requests under a thousand microseconds. Our goal here is: is there a way, by tuning the interrupt delay, to maximize queries per second while maintaining that SLA? Which is somewhat unintuitive, because you'd think you should be polling all the time and never want to delay an interrupt, since you want to handle it immediately to protect your tail latency. The way we ran this is with the same server and client, except multi-core, eight cores on both sides, and we measure the maximum queries per second it can sustain while keeping the 99th percentile under a thousand microseconds, doing a linear scan through the interrupt delay values, and we plot that. It's fairly interesting: as you increase the delay, queries per second doesn't really increase, but it doesn't really decrease either, until you hit these later, extreme settings where you really shouldn't delay that much. If you think about what delaying the interrupt really does, you're delaying the handler code, so if you can delay as long as possible, then at the moment you do handle it you're much more efficient at processing those packets. The really interesting part is power usage. This is the same plot, except we measure the power drawn over time, and what it shows is that you can delay the interrupts while maintaining the same queries per second, but at lower power. We also did a comparison between the Linux default, Linux with busy polling, and the static delay. Busy polling is a setting you can enable where you give the kernel a budget: after the interrupt fires and packet processing starts, it keeps polling a bit longer, because you're latency-bound. We calculated the per-core interrupt rate, since this is an eight-core run, and compared them: with a static delay at this value, we actually get a slightly lower interrupt rate than even busy polling. So the hypothesis is that this static delay acts like a smarter version of busy polling: the moment your interrupt fires is the moment you can process packets at a much more optimal rate, rather than relying on heuristics or just spinning.
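For reference, the busy-polling configuration compared against here is controlled in Linux by a pair of sysctls that give the socket layer a per-call budget, in microseconds, to spin on the device queue instead of waiting for the next interrupt. A minimal sketch of flipping it on, assuming root; the 50-microsecond budget is just an example value, not the one used in the talk's comparison.

```python
import subprocess

BUSY_POLL_USECS = 50   # example budget; the talk does not specify the value used

# net.core.busy_read applies to blocking socket reads,
# net.core.busy_poll to poll()/select()/epoll waits.
for knob in ("net.core.busy_read", "net.core.busy_poll"):
    subprocess.run(["sysctl", "-w", f"{knob}={BUSY_POLL_USECS}"], check=True)

print(subprocess.run(["sysctl", "net.core.busy_poll"],
                     capture_output=True, text=True, check=True).stdout)
```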
If you really want to drive latency down there are other things to deal with, packets being copied to user space and so on, but what we're showing here is that the simple technique of delaying your interrupt, delaying when packet processing happens, can get you benefits that aren't obvious. Another interesting experiment: instead of worrying about tail latency, what if we just care about pushing queries per second by pipelining? Pipelining here means we issue 16 requests at a time, versus earlier where we weren't pipelining at all. Pipelining basically destroys your tail latency, but it can increase your queries per second for the same memcached workload. The red bar here is the throughput with the Linux default, and we found that if you increase the delay to a completely unintuitive level, you actually do best in terms of maximizing operations per second. We still need to spend time understanding this value, because almost nobody sets the delay on a network card this high.

[Answering a question:] Yes, when the first packet arrives, the controller waits that amount of time before firing; I'm talking about the actual hardware that fires the interrupt to the device driver, which then hands things to the network stack. The hardware fires an IRQ for which you've registered a handler function. And yes, that's the hardware limit for this device: you can delay up to about a millisecond on this specific card, that's the maximum.

Most of the talk so far has been about understanding what changing one of these hardware parameters means. The other thing we're interested in is whether, instead of doing this manual search, we could do it more intelligently by feeding it to some sort of machine learning algorithm, in this case reinforcement learning. You can think of the agent as the software or the device driver, the action as setting the device register to a certain value, and the environment as the network card together with the system software and the application; given that action, you observe some reward, such as better throughput or lower latency, and you close the feedback loop into a self-improving system. This is something we're still working on, because there are a lot of challenges with the learning aspect that we're still figuring out.
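As a stripped-down illustration of that feedback loop, the problem can be posed as a multi-armed bandit before reaching for full reinforcement learning: the "arms" are candidate delay values and the reward is a measured quantity like throughput or negated tail latency. This is only a sketch of the framing, not the system described in the talk; set_rx_delay() and measure_reward() are assumed helpers, and the random reward stub is there just so the loop runs.

```python
import random
from collections import defaultdict

DELAYS_USEC = [0, 10, 40, 100, 400, 800]   # candidate actions (example values)
EPSILON     = 0.1                          # exploration rate

def set_rx_delay(usecs):
    """Apply the action, e.g. via `ethtool -C <iface> rx-usecs <usecs>` (omitted here)."""

def measure_reward():
    """Run a short measurement window and return a scalar reward, e.g. observed
    throughput or negated 99th-percentile latency. Random stand-in for the sketch."""
    return random.random()

counts = defaultdict(int)
means  = defaultdict(float)

for step in range(200):
    if random.random() < EPSILON or not counts:
        action = random.choice(DELAYS_USEC)          # explore
    else:
        action = max(means, key=means.get)           # exploit the best delay seen so far
    set_rx_delay(action)
    reward = measure_reward()
    counts[action] += 1
    means[action] += (reward - means[action]) / counts[action]   # running mean

print("estimated best delay:", max(means, key=means.get), "usec")
```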
Let me talk about some of those challenges, because we're still trying to understand this both in a systems sense, what changing the hardware means to the system, and in terms of how to even present this as a problem to a learning algorithm. Should we use a reinforcement learning approach, or a more supervised one, where we run a bunch of real experiments, gather raw numbers, and feed them as labeled data to an SVM or some other classifier? Or should we stick with some statistical method or a heuristic that's simply different from what the kernel has now? There are also challenges in measuring goodness. Traditional reinforcement learning gets applied to things like robotics and image processing, where there's a fairly clear notion of whether it works or it doesn't. Here it's not that clear-cut, because what counts as good for one specific message size or one specific workload doesn't translate across different kinds of workloads, and that's something we're still trying to understand: how do we measure whether it's doing well? And applying this on actual hardware is largely unexplored territory.

[Answering a question:] Yes, generalizing across devices is a whole other box of questions. I honestly don't know whether it's possible, but if we can make even one application work, for me that's progress. The other issue is that we're learning from the physical hardware itself and how it operates. How could you even simulate that? Running each of these experiments is very expensive, because you have to go through the entire software stack and the hardware itself, and there are interactions you have no control over, versus a more controlled, simulated environment.
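The supervised alternative mentioned above, collecting labeled measurements offline and fitting a model that predicts a good setting, might look roughly like the sketch below. The data shape (message size, delay, measured throughput), the made-up numbers, and the use of scikit-learn's RandomForestRegressor are my assumptions for illustration, not the project's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical offline dataset: (message_size_bytes, rx_delay_usec) -> throughput.
# In practice these rows would come from sweeps like the harness shown earlier.
X = np.array([[1024, 0], [1024, 40], [10240, 20], [10240, 40], [10240, 100], [65536, 40]])
y = np.array([900.0, 870.0, 1400.0, 2100.0, 1500.0, 3000.0])   # made-up throughput labels

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# For a new message size, pick the delay the model predicts to be best.
candidate_delays = np.arange(0, 1001, 20)
msg_size = 10240
queries = np.column_stack([np.full_like(candidate_delays, msg_size), candidate_delays])
best = candidate_delays[int(np.argmax(model.predict(queries)))]
print(f"predicted best rx-usecs for {msg_size}-byte messages: {best}")
```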
So that's a somewhat abrupt end, because we're still working on the learning aspect; hopefully I'll have much better results to show later.

[Answering a question:] The policy for setting the interrupt delay lives in the device driver code itself. It's maybe a 40-line switch statement based on certain assumptions about the optimal way the device should be running, it switches between certain values, and that's it. So it could be adapted, it could probably be much better. And it doesn't have much application state: it literally looks at how many packets were received and how many were sent and uses that as its metric. Perhaps some of the other metrics we measured would be more interesting to feed into it. Also, ping-pong is a very simple application; with applications that have different demands, maybe it could adapt better.

Thank you for attending my talk; if you have any questions I'd be happy to answer them.

[Q&A] Yes, I think so too. The pipeline depth here is 16, so for every core we send 16 requests one after another, and in that case I think delaying longer is simply better. The default policy can never go beyond a certain delay; it's hard-coded to about a hundred-something microseconds, whereas the physical hardware lets you go up to a limit of about a thousand microseconds, so there's a part of the space it simply cannot explore. So maybe there's a setting out there that's much more efficient than the others for this workload. This one is actually interesting: the tail latency in this case was about six milliseconds for the pipelined requests, and at this eight-hundred-something microsecond delay it actually had the lowest tail latency, about 5.5 milliseconds. I have no idea why that's the case versus the Linux default. I don't know whether you care about tail latency at around five milliseconds, but maybe you do.

[Q&A] The device itself has about 500 kilobytes of packet storage, but once packets arrive it DMAs them to a memory region in your system; when you bring the device up, you tell it, here is the memory you can DMA into. I don't think I've ever seen the case where the device buffer itself overflows.

Any more questions? Thank you.