Okay, let's start. So welcome to this session on energy efficiency in OpenStack, and thank you for coming. We are here from a team at Huawei in Munich, so not in China, even though Huawei is a Chinese company. This presentation was prepared mainly by Radu and Kurt, and we all work together in a team in Munich on our Huawei cloud solution, which is based on OpenStack. That's why all of this is relevant here.

In our team, we combine different perspectives on OpenStack. Our cloud solution serves both the IT side and the carrier side, so we can combine requirements coming from both. My role is mainly on OPNFV and the carrier requirements of OpenStack, while Radu's topics, for instance, are more on the big data side. So we cover quite a spread of topics. Kurt will say something about himself later. We have done some studies on this energy efficiency topic and want to report to you about them.

Why do we need to do that? The energy consumption of IT is immense. We found some interesting figures about a single YouTube video and what it costs to display it so many times: this single video has been displayed more than two billion times, and that leads to the large amount of energy needed to provide all of that. That's why we in OpenStack could improve something there and save a lot of energy, which is something I think all of us want to do.

So the motivation behind all this: how much energy is wasted because we have idle resources, and how can we improve that? Can we maybe do some rescheduling of VMs and, by that, improve the way we use the energy we have? What is the relation between the energy we use and the workload to be executed? And once we understand this better, can we make different decisions in the scheduling of VMs and, by that, save energy?

To answer that, we ran a number of studies and measured how much energy is used while performing various tasks; Kurt will report the details of these studies. We looked at the overall energy consumption of a cluster and then zoomed in to the single nodes in that cluster. When you look at the performance-to-energy ratio, that is, the amount of work you want to compute versus how much energy you need to do it, you can derive different strategies for doing VM scheduling better and saving energy that way. So let's look into the details.

Good. Okay, I'll get started with introducing the hardware setup. Before I do that, a quick word about myself: I have some history in the open source world, as a kernel engineer with SUSE and later in a number of engineering leadership roles there. I also have some history with OpenStack. I was a VP at Deutsche Telekom introducing the OpenStack-based hosting platform there, and I actually had the honor and pleasure to talk about that at the 2012 San Francisco summit and give a keynote about it. So it's good to be back. San Francisco was, I think, 900 people, so it was a different size of event, and this is pretty impressive. It's good to have the opportunity again.

Let me quickly give you a short view of the hardware. We had two blade chassis, our E9000, which we also call FusionCube. Running on that was FusionSphere; that's the name of the OpenStack-based cloud solution from Huawei.
In these blade chassis we have a number of blades, and I've written down the data, so if you look at the numbers you can make some sense out of it. This is a somewhat oldish Sandy Bridge based system. We had the large blades with a number of hard disks; they also provide the software-defined block storage. There's a Huawei solution for that as well, called FusionStorage, also known as DSware, and when we look at the results, we'll see the effect of using software-defined storage. There was also an object storage system on the left side, the UDS, but we didn't actually use that for these tests; it was there because we had this setup for another customer where we also did some tests.

In general, if you do such measurements, you need to make sure you have reasonably good sensor support, so you can actually measure the real-time power consumption. We were lucky that this blade chassis really delivers that: we can measure the consumed power in real time at the chassis level, we can do it at the node level, and we can also get the readings from the power supplies. So we can compare them and see whether they are consistent, and that was actually one of the first things we did, to understand how reliable those measurements are. They are exposed via a web interface, but fortunately the web interface is powered by an embedded Linux system, and you can use command-line tools as well. That helps with automated data acquisition.

When you measure a workload, you basically get a graph like the one on the right side. You see the idle power, which is the power of just one node, fluctuating a bit around, in this case, 116 watts. Once you make your system fully busy, the power consumption obviously goes up. And in the end, when you want to measure the energy that has been consumed, you take the area under this graph: the integral of the power over time.

So that's the approach we used. We scheduled a number of virtual machines with four vCPUs and 8 gigabytes of memory each, using the two clusters. If you look at the complete numbers later on, you should be aware that we didn't use all the nodes: we switched some of them off, and some were idling, because we had to reserve them for another project.

In the first set of measurements, we just used the Linux stress tool. I don't know if you know that tool; it generates a purely synthetic workload that simply keeps your CPU, your memory, or your system busy. That's really what we did to see how much the power consumption of our cluster changes with the load. And then we did the usual thing: a number of measurements, averaged, making sure we understood the error bars and saw that the results are very consistent. We started with 10 VMs and then went to 20, 30, 40, to see how the power consumption changes with increasing load. In that test, we used only the CPU and memory modes of the stress tool.

One of the additional research questions we had in mind was: is there a difference between a VM that's not running, that is hibernated, and an idle VM? Is that something we need to care about? If we want to be energy efficient, do we need to hibernate VMs? You can see the result of that first question right here: the power consumption really does not change. So an idle VM does not cost any energy, which is good news.
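As an aside on the methodology above: the talk only says that the chassis exposes real-time power readings through a web interface backed by an embedded Linux with command-line tools, and that energy is the integral of power over time. A minimal sketch of that last step, turning polled power samples into consumed energy with the trapezoidal rule; the file name and the "epoch_seconds watts" sample format are assumptions for illustration, not the team's actual tooling:

```sh
# Approximate energy from power samples (one "epoch_seconds watts" pair
# per line, as polled from the chassis) using the trapezoidal rule.
awk 'NR > 1 { wh += ($1 - t) * (w + $2) / 2 / 3600 }  # watt-seconds to Wh
     { t = $1; w = $2 }                               # remember last sample
     END { printf "energy: %.2f Wh\n", wh }' power_samples.txt
```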
And if I look at the efforts that have happened in the Linux kernel: they did the tickless-idle work at some point to make sure that if a system is idle, you don't have a constant timer tick waking up your CPUs. That's probably the reason why you can have a lot of VMs idling without causing any negative effects with respect to power consumption or performance.

What you see next is not very surprising: we basically maxed out the system with 40 VMs. After that, the power consumption would not increase anymore, and if you do the math, that is really because of the number of nodes we had available for the measurement. At that point, the management interface reported just a 65% load, but that was because we had reserved some nodes.

In the second set of measurements, we wanted to understand how different kinds of workloads influence the amount of power we consume. So we used the various modes of the stress tool, and what we wanted to do was fully occupy the nodes with the VMs putting stress on them. We actually used six stress threads in a VM that only had four vCPUs, so we made sure we used all the available CPU bandwidth. And then we put eight of those four-vCPU VMs on a node that had 32 hyper-threads, so we also made sure we consumed all the CPU available on the node.

Then we tested combinations of the various modes the stress tool offers. The CPU load is pretty simple: it just computes square roots all the time. The memory load does memory allocations and writes to that memory. The dio load does sync calls, which by itself probably doesn't do much, but if you add a hard disk load on top, it causes a lot of disk writes; we'll see that when we look at the results in a second. A second thing we did in this test was to look at whether it is worth powering off nodes, comparing the power that is consumed in that state.

You can see the results here; I hope this is readable. Going from left to right, we start with a measurement in the powered-off state. The blue bars are the complete cluster power consumption, and the red bars are the power consumption of one node. I've offset the one-node power consumption by two kilowatts so it fits on the same scale and is more easily comparable, so don't get confused by that. The first red bar on the left side is actually nine watts, plus that two-kilowatt offset. The reason I did this: in these tests we only loaded one node, and I wanted to see whether the overall cluster power moves the same way this one node changes its power consumption. And that is largely the case.

Going from left to right, it is the sixth set of bars, where you introduce a hard disk load, that this changes, because now we are actually putting load on other systems in the cluster. We have this software-defined storage, so there is a lot of calculation and replication going on in those storage nodes, and obviously all the disk accesses happen there. If you look at that bar, it is the first time the single node is no longer a good indicator for your complete cluster power consumption, and that is true for the measurements after it as well. But mainly, it is still the CPU that is the most variable piece of the power consumption. So, as I stated already: CPU and memory loads are what consumes the power on the nodes.
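For reference, the stress modes described above map onto the tool's command-line flags roughly as follows. The exact invocations used in the study are not in the transcript, so these are illustrative:

```sh
# CPU mode: six workers spinning on sqrt(), inside a 4-vCPU VM, so all
# available CPU bandwidth is used, as described above.
stress --cpu 6 --timeout 600s

# Memory mode: workers that repeatedly allocate and write to memory.
stress --vm 4 --vm-bytes 1G --timeout 600s

# I/O mode (sync calls) plus hdd workers that actually write to disk,
# which is what pushed load onto the software-defined storage nodes.
stress --io 4 --hdd 2 --hdd-bytes 1G --timeout 600s
```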
If you have an external storage system, or in this case it was part of the cluster, then by putting load on it, you can obviously see the effect of that as well.

So one of the obvious ideas you would have is: what if I can switch off idle nodes? That is a 100-watt difference, so it would be a good way to save energy. We could just schedule virtual machines such that a few nodes stay empty, then switch those off, and obviously bring them back on once we need them again. One caveat: most people who operate data centers don't do that. They don't trust systems that they switch off to come back on in a controlled way. So it might be a nice idea, but currently the industry would need to put some effort into making this very reliable, so that operators would actually trust switching off hardware. I would think that a lot of the failures people are used to come from spinning hard disks, because those are known not to come back on after you power them off and they change temperature. Today, with solid-state storage, that might not be so much of an issue anymore, but there is not a lot of testing that has gone into that approach.

Once we were done with those measurements, we thought: okay, we'll just group VMs on a node, and that's the most power-efficient thing. But there was one thing we had not considered, and it is one of those traps you can easily fall into when doing such measurements: we had simply put as much load on the system as we could, so the amount of computational work done was not constant. The faster the system ran, the more iterations of that infinite square-root loop it would execute, consuming more power but also doing more work. So it was actually not a very good benchmark for measuring energy efficiency. In reality, you have a workload where you want to complete a defined set of work. So we started again and defined a very simple workload for the next step.

The goal of that final step was to evaluate three possible scheduling strategies. Let me start from the bottom. That is the scenario where you try to group virtual machines on one host, one hypervisor, so you can switch off the other hypervisor and get rid of its idle power consumption. Unfortunately, again, that is a scenario operators don't like; they don't trust it. So we considered it, but we know more work needs to be done before that is accepted in the industry. Scenario two is the same thing: we still group the virtual machines on one hypervisor, but keep the other hypervisor idle, and see what the power consumption of that is. And scenario one is: we just spread the VMs out as much as we can, which is, by the way, what the Nova scheduler typically does, and see what the power consumption of that is.

This is the workload we chose. It is fairly simple, nothing sophisticated like SPECjbb or similar, so I acknowledge that some nicer work could be done using more realistic, real-world workload examples. But I think this is enough to prove the point that it makes sense to look at this, because we could observe some differences using this benchmark.
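The workload itself, described next, computes digits of pi with the bc calculator. The exact command shown on the slide is not reproduced in the transcript; a standard formulation uses the arctangent from bc's math library, since pi = 4*arctan(1):

```sh
# Print pi to 15,000 decimal places; a(1) is arctan(1) = pi/4 in bc's
# math library (-l). "time" reports the computation time.
time echo "scale=15000; 4*a(1)" | bc -l

# To fill a node with 32 hyper-threads, run 32 of these in parallel:
for i in $(seq 32); do (echo "scale=15000; 4*a(1)" | bc -l >/dev/null) & done; wait
```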
So what we did was calculate a huge number of digits of the number pi, just using standard Linux tools, the bc calculator. A command like the one above gives you the first 15,000 digits after calculating for a few minutes. That's what we did, and then we compared the placement strategies I just described. One comparison was the one-node situation, where we put a load that really used all CPUs, all hyper-threads I should say, of that node, against a situation where you have only 50% load on that node and use a second node to do the other 16 pi calculations.

And this is what we found. If I start with the top-left graph: this is the power consumption of the two nodes. Blue is node one, brown is node two. In the one-node scenario, node two is obviously just idling, so you see it sitting at 116 watts, whereas the first node obviously consumes a lot of power to do the calculations. In the two-node situation, both nodes consume power equally. If you actually take the sum, the two nodes consume more power than the one. But, and I now go to the top-right graph, if you compare the computation time, the two-node calculation is actually a lot faster: it finishes in about two thirds of the time the one-node calculation takes. If you think about it, that is probably not that surprising, considering that we have hyper-threading and shared resources on the CPU.

But if you multiply it out and compare what it means, how much energy in watt-hours we need for that calculation, we get a number of possible answers. The first bar, in blue, is: you use just one node and assume you can switch off the second node. That gives you the best energy efficiency. But if you need to keep the second node running, so it is idling, you get the red bar, and you see it is not as nice anymore. Compared to that, the two-node situation is actually better. Unless, again, you additionally count the idle time and the idle power consumption after you have finished your computation. So the question is: once you have completed that workload, is your cluster going down, or is it busy with other useful tasks? In the latter case, I would argue you should not count the fact that it is idling afterwards, because there is useful work to be done. But if it is just idling, you should count the idle power as well, because you cannot switch off, and that gives you the worst result.

So the message of this measurement is not a simple, clear one like "distribute your load and you save power." It depends on what other use you have for your cluster. It is that complexity that makes the result a bit tricky to understand. And the one thing we really learned from this is that measurements are hard to get right. We iterated a couple of times over this, and you really need to make sure you do things like accounting for idle power and asking: is that something I can get rid of by switching off, or something I can use for other useful things? In the latter case, I think it is legitimate not to count it; otherwise, you need to count it. And the same obviously goes for sensors and the environment. One of the things we really did was control the environment well, making sure there was no other workload and no disturbing effects. Otherwise, your results will just be off.
And then you should do the measurements multiple times, make sure you measure the variance, and if something looks strange, understand it before you actually trust your results.

But here is some preliminary advice. If switching off a host is an option for you, it is a good way to save power. One thing I should mention, though: Nova actually has support for switching off hypervisors, but the last time I tried it, it didn't work. At least the libvirt driver apparently does not expose that capability to Nova; maybe it does in the latest version, but the one I tested did not work. At the same time, Nova currently does not even have the notion of switching off hosts, so you would need to use another mechanism, and also another policy engine that does that for you. So currently, I think that is not even feasible. I think what we would need to do, if we wanted to go down this path, is talk to operators and understand: can we do something to make switching off hosts acceptable to you? And if that is the case, then we build the mechanism to actually do it automatically.

The second thing to take home is: distributing virtual machines can actually reduce energy consumption. That was a bit of a surprise when we saw it for the first time, and we had a number of discussions to really understand why that might be the case. We think the reason is the CPU power curve: how much power a CPU consumes depending on the workload is quadratic in the voltage, and CPUs increase the voltage when they increase the clock speed. So if you fully use a CPU, it actually runs at a higher voltage, and that enters quadratically. If the relationship were just linear, you would expect the placement strategy to play no role unless you overcommit the resources. But that is not the case, and I think that is the reason why we have seen this.

Actually, distributing your workload across many hosts is also a good strategy for performance, because you don't share resources and you don't risk preventing your CPU from using its turbo boost feature; it gives you the best performance. So for a number of scenarios, the current default behavior of Nova is actually not a bad one: it helps energy and performance, and that is good, because we don't have to make difficult trade-offs there. On the other hand, if you don't have any useful work for your cloud to do afterwards, so you are idling again and need to count that idle power consumption, then clustering your workload would be better again. It also depends a bit on the CPU, so one of the things we want to do is retest with later CPUs that have better idle power characteristics than Sandy Bridge.

And one thing that I completely kept out of this discussion so far: obviously, if you schedule a lot of VMs that belong to the same tenant and have some relationship, there are a lot of other things you want to consider. If those VMs talk to each other because they belong to one application, you may want to schedule them together on one host, because the communication overhead is a lot lower: you get lower latencies and better communication bandwidth. That might save power as well. This is typically expressed by affinity and anti-affinity rules in Nova, so we already have a good mechanism to do that.
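As a concrete illustration of those rules, a minimal sketch with the standard OpenStack client; the group and instance names here are made up, not from the talk:

```sh
# A server group whose members the Nova scheduler packs onto one host:
openstack server group create --policy affinity app-tier

# A group whose members must land on different hosts:
openstack server group create --policy anti-affinity db-replicas

# Boot an instance into a group via a scheduler hint:
openstack server create --image cirros --flavor m1.small \
  --hint group=<server-group-uuid> db-vm-1
```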
Anti-affinity you would use, for example, if you want to make sure, for resilience, that you have machines running on different hosts, so if one goes down, you survive. So that is the kind of goal these rules express.

You want to make the Nova scheduler learn to have some awareness of energy. For that, the Nova scheduler would need a model of how the power depends on the workload. For the CPU, a simple quadratic equation would probably do, so you could use three parameters to describe it, which I think is quite manageable. You could look those up in data sheets, or, if you have sensors, ideally you would use them. And if you want to go to the next level, you would want to understand some hardware details. There are things out there like the Haswell-EP CPUs, which have a lower maximum clock if you use AVX instructions. So you would maybe want to cluster all AVX-using VMs on one host, so the others don't take a hit from the lower CPU frequency. Those are things the scheduler currently does not understand, so awareness of those hardware details would be one of the missing pieces. And coming from the other side, awareness of the characteristics of a workload is what you would want to match against that, to come up with good policies.

If you have that, it would enable a variety of policies. One would be that you could somehow choose a point between minimal energy consumption and maximum performance. Sometimes, if you are lucky, there is not even a trade-off; but sometimes that trade-off has to be made, and as an operator you would like to be able to set at which point you want to be. Another thing you could do, if you have that hardware awareness and can read sensors, is thermal management. Something you sometimes see in data centers is hotspots, and then your AC, your cooling, has to make sure that hotspot doesn't get too hot, wasting a lot of energy that you could better save by spreading out the heat you generate. That would be another application, once you create that awareness in the scheduler, to avoid that.

Just yesterday I also had a nice discussion with Adam Spiers, whom I unfortunately do not see in the room. There are some more things you could do, and thanks, Adam, for writing this great blog post. Of course, you should not only consider all this when you schedule a new VM. From time to time, it might be helpful to say: let's analyze the state of my cluster, of my cloud. Do I want to do live migrations to move from a bad state into a better state? From an energy point of view, but of course you could also do it from a performance point of view. And there were also some ideas in there which I just want to make you aware of. You could do things like this: if you have virtual machines that run the same operating system image, chances are a lot of the memory pages are identical, because you have the same kernel and the same set of libraries with text sections that never change. You could share those pages and save some memory, some cache, some TLB space in your CPU, and achieve slightly better performance. That might also give you another few percent of power efficiency.

And finally, one of the concerns that people will have when they look into putting this into the Nova scheduler is the scalability of the scheduler.
Because the scheduler now needs to have more information available. It needs to consider more information and to optimize for harder problems. If you have a large number of hosts and you try to optimize this NP-complete problem, it can become very computationally intensive. And the amount of data you actually need to collect and keep current can become large. So there is this divide-and-conquer idea; I would probably think a step further and say that at some point, if a data center gets large, you would think about hierarchical scheduling. The top-level scheduler would just say, okay, let's put that VM on this rack, and then at the rack level you would actually take the decision where exactly it goes, to keep that manageable. That would be, I think, the approach that comes to my mind to solve that. But it is currently just an early idea, as opposed to something I know could be done.

Well, that is, I think, what I had to present. Maybe you do the conclusion? Okay, I can do that. So you see, there is significant room for further work on this, to find out what the improvements are that we really want to go for, and how we can have more sophisticated energy management in a cluster. OpenStack has a good opportunity to contribute to that, and we are ready to go that way. An essential part of it is that we need to understand the computation patterns. What does the workload look like? How many disk accesses? How much memory? Can we bundle the workload time-wise and then switch something off? So we also need to understand the work the computers have to do, and maybe we can measure these things. There are different ideas we are thinking about: how to gather the necessary information, how to draw the necessary conclusions from it, and then how to influence and implement a scheduler that can take advantage of this new information and, by that, save the energy. For that we need help: more discussion, of course, with the scheduler people, and we are happy to do that and go forward. We also would like to have more people in Munich working on this, of course, so you can contact us about that. And we would welcome other companies to help with a project pursuing these energy efficiency solutions. So I think we should spend the remaining time on discussion and questions.

So, questions? Yeah, maybe two housekeeping items first. The slide deck is up on SlideShare; I was careful enough to upload it this morning, after somebody reminded me in another session to do that. Second, if you have questions, using the mic would be nice, because the session is recorded, so everybody can hear you and the question is understandable in the recording. And if the mic is too far away, I will just try to repeat your question. Yeah, go ahead.

These tests were only with KVM, correct? The question was whether those tests were only with KVM, and the answer is yes. Well, to be very exact, Huawei has a slightly modified version of KVM, which Huawei calls UVP, but for all these considerations, it is KVM.

Fascinating presentation; I think it is a really good area to work on and a really good effort to start looking at energy consumption. I was wondering: this is within a data center, within a rack, within VMs. Have you thought about going between data centers, shifting load from zones with high load to zones with lower load, or following price or power availability, things like that?
We have not done that yet. My current understanding of the situation is that the scheduler currently does not have the information that some hosts may be more power-efficient than other hosts. We currently mostly assume a homogeneous infrastructure, where the cost of doing a computation on one host is the same as on another host. And you have to do some manual work with host aggregates to actually work around that in case your data center is not so homogeneous. So the first step would be to add that resource awareness to the scheduler, so it knows there are different hosts with different properties. And then, if you really run a large infrastructure, you might actually come to the point where you say: okay, this local optimization doesn't work anymore, I need to look at this at a more global level, and then we will probably think about this hierarchical scheduling idea again. But I think this is the long-term way to go, so I very much agree with you.

And a second point: maybe you have noticed we also did not consider the air conditioning of the data centers. When you include that, the figures will change slightly, but the air conditioning power goes up with the energy you use for computing, so some things move in parallel. When you consider multiple data centers, though, this becomes much more complex.

I think you could probably get some traction by linking this to corporate green initiatives, CO2 output, and ideas like that. So the interest may come more from corporate governance than from the operations and IT folks, for whom it is just kind of a hassle.

Yeah, thanks, I agree. I think there is some work we should be doing, and are starting to do, before we go out to large companies and talk to their corporate social responsibility people to get some support from them. But I agree, it is an important topic, and it matters on the cost side too: if you look at data centers, often about a third of the running cost is caused by energy consumption. So it is interesting from both the economic and the ecological point of view. And if we do that work, I think getting 10% efficiency out is realistic; the numbers really show that that is possible. Actually, I was hoping that before I got to the OpenStack Summit, I would get a nice white paper written and maybe a blueprint started; I didn't get that far, unfortunately. But I think that is one of the next steps, and fortunately, Adam already has some nice content on this, so we will probably combine forces.

Yes? So the question was: we looked at different policies, minimal power consumption versus maximum performance, and also whether we looked at this only at the rack level or at the data center level. The second part of the question is simple: we just looked at the setup I described, which was really two racks of machines, many nodes. If you do it at the data center level, the fundamental questions don't change. You still have the same optimization, it just becomes larger, and you have to consider a few more hosts. Which policy is optimal really depends on what you want to achieve. If maximum performance is the SLA you have with your customers, that is what you go for. If you have more than enough performance for what your customers need, you would probably go for the power savings.
That's why you want to be able to choose that point in the spectrum between minimal power and maximum performance.

Yes? Yeah, so the question was about avoiding hotspots: whether we do that to avoid outages or basically to optimize. We were really thinking about optimization, but we haven't implemented this hotspot avoidance yet, so this is really just an idea of where this resource awareness could take you.

So, okay, I am getting a signal that we are out of time. Maybe we can take a final one. Yes, so the question was: we do in general favor spreading out workloads, but the network overhead could actually overcompensate for that, and what would be the way to decide this? Currently, what I would think is: if a set of instances expresses an affinity, that should probably overrule the power efficiency policy, because somebody has put enough thought into it, knowing it optimizes their workload, to spend the time putting that affinity rule in place. So I would probably give that rule priority. But in the end, if you really want to know, you need to do the measurements and then decide.

Okay, well, thanks. This was a nice discussion. I hope to have some more discussions at the summit in some of the design sessions, and obviously, come talk to us. Thanks.