Okay, we just have one more minute here, but I guess I can introduce you and you can get started at 2:05. Hello everybody. Here we have Han, who is going to be talking about saving money by reducing power consumption in energy-efficient systems. Thank you all for tuning in. So I will hand it over to you; you can get started.

Thanks for the introduction. Hi, I'm Han, and I'm a graduate student at BU. This talk is part of my dissertation research that I've been doing with Red Hat for about a year so far, and it's mainly about the latest results we've had over the summer.

So, the motivation for this research is the fact that a modern machine you'd find in a data center contains a bunch of diverse hardware, and even within the hardware itself, the manufacturers have added layers of programmability. So as an application developer, or anyone who wants to optimize an application on this hardware, there are different ways of doing it. We've focused on a particular set of hardware settings, and the focus is on minimizing the time it takes to get your work done while also minimizing the amount of energy needed to do it. The main question I want to ask is: assuming your machine is only running a single kind of workload (think of clusters of memcached servers), can you actually tailor the hardware itself toward that single application?

To do that, we've focused on three hardware settings, found by looking through the manuals. The first is what we call the interrupt delay. This setting exists on the network card: it exposes an interrupt delay mechanism that lets you manually set how long the card will wait before firing a new interrupt when you get a packet. The second is frequency scaling, or DVFS. This is an ability that exists on Intel processors that lets you set the frequency of a particular core, so by setting a lower value you're effectively lowering the clock rate on that core. The final one is RAPL, a power limit they've also added on the processor package: something you can set on the entire CPU die to limit how much power it can draw over some time window.

For two of these, the interrupt delay and DVFS, there are currently policies inside Linux that dynamically modify them to adjust to the workload. We're interested in seeing what happens if you take out all of these dynamic policies and just statically set them. The simple approach we're taking is to sweep through each of these parameters, in different combinations, and see which ones, for a particular workload, minimize the amount of energy and time it takes to accomplish that work.
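As a rough sketch of what statically setting these three knobs could look like on a stock Linux box (the talk shows no code; the interface names here, such as eth0, the cpufreq userspace governor, and the intel-rapl powercap path, are my assumptions about a typical setup, and all of them require root):

```python
#!/usr/bin/env python3
"""Sketch: pin the three hardware settings for one point of the sweep.

Assumptions (not from the talk): the NIC is 'eth0', ethtool is installed,
the kernel exposes the cpufreq 'userspace' governor and the standard
intel-rapl powercap sysfs tree. Values below are illustrative only.
"""
import subprocess
from pathlib import Path

def set_interrupt_delay(iface: str, usecs: int) -> None:
    # NIC interrupt coalescing: wait `usecs` microseconds before firing
    # the next receive interrupt (ethtool's rx-usecs knob).
    subprocess.run(["ethtool", "-C", iface, "rx-usecs", str(usecs)], check=True)

def set_core_frequency(cpu: int, khz: int) -> None:
    # DVFS: pin one core's clock via the userspace cpufreq governor.
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq")
    (base / "scaling_governor").write_text("userspace")
    (base / "scaling_setspeed").write_text(str(khz))

def set_package_power_limit(watts: float) -> None:
    # RAPL: cap average power of the whole package (sysfs uses microwatts).
    limit = Path("/sys/class/powercap/intel-rapl/intel-rapl:0/"
                 "constraint_0_power_limit_uw")
    limit.write_text(str(int(watts * 1_000_000)))

if __name__ == "__main__":
    # One point in the sweep: 64 us delay, 1.2 GHz, 30 W package cap.
    set_interrupt_delay("eth0", 64)
    set_core_frequency(0, 1_200_000)
    set_package_power_limit(30.0)
```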
And here we're introducing the idea of using something called the energy delay product, or EDP. This is a concept that computer architecture folks have used when measuring processors. The energy delay product is a simple multiplication of the amount of energy it takes to do some work by the time it takes. The data I'm showing here is artificial: on the x-axis you have the time taken in some unit, on the y-axis you have joules, and the area under the slope is effectively your EDP. The interesting thing about using EDP is that there is a slope attached to it, which indicates the rate of energy consumption over time. So you can compare, in absolute terms, EDP as a measure of energy efficiency, and you can also compare and contrast how different methodologies, or different hardware settings in this case, change the rate of energy consumption. We're hoping this presents the data we collected in a way that's more interesting to look at.

Here I'll talk about our experimental setup, how we actually conducted this study. Effectively, we run a set of network applications. One version of each application runs on Linux, and then we have another that runs on a library operating system, a unikernel, so we can compare between systems. In the hardware, these are the three settings we're tuning: the interrupt delay, the power limit, and the frequency. We've instrumented a tracing infrastructure inside the network device driver. There is interrupt-handling code that fires every time the card wants to tell the software "hey, I have new packets and I want you to process them." Inside there, we've instrumented various counters, such as a measurement of how many joules of energy have been used since the last interrupt, along with the timestamp counter. We gather a set of other statistics as well: software statistics such as the number of bytes received and transmitted, along with how many instructions have been executed, how many cycles, and so on. So effectively, you can think of it as: we run the experiment, we collect this trace, and at the end we have a giant text file where every single line is a timestamp for when the interrupt happened, along with the hardware statistics we've gathered.

For this talk, I'm going to focus on one of the workloads we ran. It's a very simple workload called NetPIPE, and we're starting with something very simple because there are so many things in play, such as the hardware along with the operating system itself. With the NetPIPE application, all you do is have two machines, and the machines send a message of some fixed size between them, like a ping-pong back and forth.
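To make the EDP idea concrete, here is a minimal post-processing sketch; the field layout of the trace file is my assumption, since the talk only says each line carries a timestamp plus hardware counters:

```python
def parse_trace(path: str):
    # One record per interrupt: (timestamp_sec, joules, rx_bytes, cycles).
    # The field order is hypothetical, not taken from the talk.
    records = []
    with open(path) as f:
        for line in f:
            ts, joules, rx_bytes, cycles = line.split()[:4]
            records.append((float(ts), float(joules), int(rx_bytes), int(cycles)))
    return records

def edp(records):
    # Energy delay product = total joules consumed * total elapsed seconds.
    # The ratio energy/elapsed is the slope of the EDP plot, in watts.
    total_energy = sum(j for _, j, _, _ in records)
    elapsed = records[-1][0] - records[0][0]
    return total_energy * elapsed, total_energy / elapsed  # (EDP, avg watts)
```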
And in this case we set a static number of iterations to do this ping-pong, and for one of the machines, which we call the server or the victim machine, we set these hardware settings: the interrupt delay, DVFS, and RAPL. For each workload we sweep through different values of these three settings, and while that's running we collect a trace for every single run.

Given this, the first graph we have here is NetPIPE running with a message size of 8192 bytes, and for this workload we ran 5000 iterations. This is just a very simple EDP plot. The blue line is effectively default Linux, where it has the dynamic policies that govern the interrupt delay and the frequency scaling. This is stock Linux, version 5.something, as it comes, and this is the EDP we got when we ran it. So this is its default behavior: it took about 1.3 seconds, and the amount of joules used was about 20-something. The green line is what we call Linux tuned. This is where, after doing the sweep over multiple combinations of these hardware settings, we've graphed the run that has the minimum EDP, so minimum time and minimum energy. You can see it took drastically less time, and the rate of its energy consumption is higher than the default one, but the cumulative energy consumed is still way less. Then we have our library operating system. All these experiments are run bare metal, and this library operating system is effectively a C++ unikernel: event driven, no scheduling, nothing crazy going on inside the kernel, so it's very lightweight and very efficient. You can see that it uses even less time, and the rate of its energy consumption is only slightly higher than the default.

So given this initial EDP data, and the fact that we have a trace of every single interrupt that happens, basically the behavior on the server side, how can we start to analyze and explain this? Here I'll walk you through an explanation at this message size. When you think about it, the workload itself isn't doing much more than receiving the message and sending the same thing back, so the data can stay resident in the processor cache. So we have this first number here, which comes from some simple math: if you take the message size, which is 8192 bytes, and you take the speed of the network, which is 10 gigabits per second, and you just divide, you get roughly 6.5 microseconds. This number represents, assuming there are no switching delays or software delays, how fast the network can send 8 kilobytes of data, which is around 6.5 microseconds. And this number lines up very nicely within the range between the Linux tuned and library OS tuned interrupt delay values.
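That wire-time arithmetic is easy to check:

```python
# Time to put one 8192-byte NetPIPE message on a 10 Gbit/s link,
# ignoring switching and software delays.
message_bytes = 8192
link_bits_per_sec = 10e9
wire_time_sec = message_bytes * 8 / link_bits_per_sec
print(f"{wire_time_sec * 1e6:.2f} microseconds")  # ~6.55 us
```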
So effectively, you can think of it as: we've literally set the interrupt delay such that it matches the time it takes for the message to be sent across the wire, allowing a microsecond or so of switching delay because we have to go through a switch. In this case, we customized the interrupt delay so that it matches perfectly with the theoretical peak speed at which you can send that message, and therefore we've basically maximized the processing efficiency here. That explains the time differences. Inside default Linux there is a dynamic algorithm where, every time it receives new data, it does a computation to predict what value it should set the next interrupt delay to, and we found that that algorithm is just not very efficient, because it doesn't really adapt to your application-specific use case. So that explains the throughput.

The next thing we're curious about is how we can explain the energy benefits of this method. Why is it that we're able to tune these three parameters and decrease the energy? (I'm going to skip this slide.) One thing that we found useful is the cycle count. The cycles counter is a measure of how many cycles were spent non-idle, busy processing, and you can do a bit of math to convert that into a time granularity similar to the timestamps. In this case we're showing the same timeline with the fraction of time spent non-idle, or busy. What we see is that when you tune it, it spends the majority of its time busy processing: almost 60% for Linux tuned and 40% for the library OS, whereas in the default case it was only busy processing about 20% of the time. So what is the effect of these three values here?

The next thing we plotted was the per-interrupt energy. In this graph, every time an interrupt happens we measure how many joules were consumed, so it's the same timeline, and each individual point is how much energy was consumed since the previous interrupt. For Linux default, you see that most of the time it sat in this middle range where it did some work, and sometimes it peaked. The zero data points here you can ignore: the counter we use to read the energy can only be read at a granularity of one millisecond, but interrupts come in faster than that, so some of these we just set to zero because we cannot actually sample the energy at that point. So just focus on the middle part. What we can see here is that for Linux default, most of the time is spent here, and if you sum up every single dot and divide by time, you get energy over time, which is a measurement of watts.
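A minimal sketch of that busy-fraction estimate, assuming the cycles field in each trace record is the non-idle cycle count since the previous interrupt and that the core runs at a fixed, known frequency (both are my assumptions about the trace, not details given in the talk):

```python
def busy_fraction(records, core_hz: float) -> float:
    # `records` are (timestamp_sec, joules, rx_bytes, cycles) tuples, as in
    # parse_trace() above. Convert accumulated non-idle cycles into seconds
    # and compare against the wall-clock span of the trace.
    busy_sec = sum(cycles for _, _, _, cycles in records) / core_hz
    wall_sec = records[-1][0] - records[0][0]
    return busy_sec / wall_sec

# e.g. busy_fraction(trace, core_hz=2.6e9)
# -> ~0.6 for Linux tuned, ~0.2 for Linux default, per the talk's numbers
```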
So here, it spent the majority of its time at 17 watts effectively, and 17 watts, from some other measurements that we've done, is the amount of power the processor draws when it's just idling. So what this is saying is that the default policy, most of the time, is doing work up here, and the rest of the time it's just idling, not doing anything. That's why the time here is effectively wasted, spent idling, and while you're idling you're still consuming power. So that's a waste of the energy you're drawing, because you're just not doing any work. Whereas here, when we actually tune these hardware settings, we don't spend any time idling, for Linux tuned at least: those are the green data points here, and almost all of the time it's just doing work. With the library OS there's a bigger spread. I think that's because the library OS is much more efficient, just because of its code base, so when it finishes its work it's still able to take advantage of some idle periods: it still goes idle, but most of the time is also spent doing work.

So given this tracing data, we've done this analysis across the other applications too. Other than NetPIPE at the 8-kilobyte size I've talked you through, we have other message sizes, and we also have results from workloads where we run a Node.js web server. Overall we collected a lot of trace data, about two terabytes of it, from this work. Our future work, or really our current work right now, is that we're in the process of open-sourcing this trace data set on our website. We're also interested in seeing how we can extend the tracing from the network card into the application, to basically have an end-to-end tracing methodology that can explain how the application is behaving and how the TCP/IP stack is behaving, all the way down to the device driver itself. There are other opportunities here as well: we've started looking into whether we can replace the dynamic policy with something powered by machine learning, and how we might integrate these idle periods of the processor into that policy too. There's plenty of other hardware you can play around with, and by reading the manuals there might be other registers to look at also. All right, so that's my talk. Thank you.

Thank you so much. I've just been monitoring the chat, and I do not see any questions, but if anybody's watching and you have a question, feel free to put it in chat. We'll give it a minute or so, and if we don't have any forthcoming, I guess we can move over to the breakout session; I'll put a link to it in the chat in a little while. Okay, I don't see anything coming in, so thank you so much for your time.