Thanks, everyone, for joining us today to discuss how we are trying to help with CPU power management. Let me introduce myself: Sylvain Bauza, I'm a senior software engineer at Red Hat, working on Nova for the last decade. I'm also the current OpenStack Nova PTL, and in the previous decade I was also an operator.

Hello, my name is René Ribaud. I'm a software engineer and a fairly new contributor. I discovered Linux and the FLOSS community in 1995, and previously I worked as a solutions architect, mostly on cloud and DevOps activities.

So let me explain how this story started. At the last OpenInfra Summit, in Berlin, someone from the BBC presented a tool they were using, called Scaphandre, for getting power measurements from their clouds, and when I saw it I thought it would be nice if we could help at least the operators with that. Also, if you remember, last year electricity bills became more expensive, so it seemed nice to help people use less power. And we are all facing an enormous challenge with global warming, so that was also one of the motivations to work on this topic.

After the summit, at the Antelope PTG, the Nova community started to discuss it. We had two topics: one on how to reduce power consumption, and another on power measurement and what we could improve there. So, first, about reducing power consumption: what can you do with Linux to reduce power?
You have multiple ways to do it. For active or idle processes, what you can do with RHEL is change what we call the CPU profiles, using a tool named TuneD. You have multiple profiles: for example powersave, balanced, or a new TuneD profile named cpu-partitioning-powersave that was created, I think, last month. What is nice is that these profiles can modify the C-states, constraining them in a way that is good for active processes like DPDK. You can also ask the kernel to stop (offline) CPU cores, which is something Linux supports, or you can reduce the CPU frequency using cpufreq, if the OS supports it. The only problem, and maybe you already know this, is that none of it happens automatically.

So when we saw that, we wondered how to do it automatically with Nova, and that is what we did last cycle, in Antelope. We created new config options: if you want Nova to manage power, you set an option to true, and then you have two strategies. The first one offlines a CPU core; the other one, instead of offlining the core, modifies the CPU governor. To be clear, a CPU governor is about modifying the frequency of the CPU. As you can see in the example, you can say you would like to stop CPUs 2 through 17 if none of the instances are using them. That said, you need to make sure those CPUs are dedicated to instances and not shared between instances, so this only works for pinned CPUs. So, basically... does it work?
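The options described in this part of the talk look roughly like this in nova.conf (the values are illustrative; the power-management options live in the [libvirt] section and the dedicated set in [compute]):

```ini
[compute]
# Physical CPUs reserved for pinned instance vCPUs (keep CPU 0 out of it)
cpu_dedicated_set = 2-17

[libvirt]
# Let Nova manage the power state of the dedicated CPUs
cpu_power_management = true
# Strategy: "cpu_state" offlines unused cores; "governor" switches them
# between performance and powersave governors instead
cpu_power_management_strategy = cpu_state
```

With this configuration, a dedicated core is powered down whenever no pinned instance is using it, and brought back online when an instance is scheduled onto it.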
So, demo time. Hopefully it works. In this example I have a host with eight CPUs, all of them currently active. I modify nova.conf: in this example I want only two CPUs to be dedicated, so I'm basically modifying the config options. Then I restart nova-compute (I'm using DevStack here). If I look back at the CPUs, as you can see, both of the dedicated physical CPUs are now stopped. Now I create a new flavor that asks for a dedicated CPU; in this example I'm asking for one dedicated PCPU. When I create the instance, just before it becomes active, Nova automatically onlines the CPU. In this example, given the instance was using CPU 6, that CPU is onlined, and when I delete or stop the VM, the CPU goes back offline. That's it for the demo.

OK, but how does this work? It's pretty simple: we use sysfs. With Linux you have the sysfs file system, and in it each CPU is a directory. What the libvirt driver does is that, when nova-compute restarts, we look at all the dedicated CPUs and we directly call sysfs to either modify the online file or modify the cpufreq scaling_governor file, if, by the way, the OS supports it.
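A minimal sketch of the sysfs mechanism just described (the paths are the real kernel interfaces; the helper functions are hypothetical, and on a real host the writes require root):

```python
from pathlib import Path

SYSFS_CPU = "/sys/devices/system/cpu"

def online_path(core: int) -> str:
    # Writing "1"/"0" here onlines/offlines the core (CPU 0 usually has no such file)
    return f"{SYSFS_CPU}/cpu{core}/online"

def governor_path(core: int) -> str:
    # Writing e.g. "powersave" or "performance" here switches the cpufreq governor
    return f"{SYSFS_CPU}/cpu{core}/cpufreq/scaling_governor"

def set_offline(core: int, dry_run: bool = True) -> str:
    """Offline a dedicated core, the way the talk describes Nova doing it."""
    path = online_path(core)
    if not dry_run:
        Path(path).write_text("0")
    return path
```

In the demo, deleting the instance pinned to CPU 6 corresponds to writing "0" to /sys/devices/system/cpu/cpu6/online.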
No, not all OSes support the scaling governor, but for those that do, you have that. And then, when you want to start or create an instance, we basically get the CPUs pinned to that instance and again call sysfs to either online the CPU or modify the governor.

Now that you have that, I personally think there are some quick wins. Given the current scheduler, I think it's nice to pack instances onto specific hosts, so the power usage of those hosts is better amortized. You have multiple ways to pack your instances: either you use the CPU weigher or the server-group soft-affinity weigher multipliers, or you can let your users ask for the same host with the SameHostFilter and its scheduler hint. If you want to pack instances after they're created, you can of course migrate them to the same host, or you can use Watcher if you want to do it somewhat automatically. That said, nothing is done automatically by the scheduler: there are no scheduling decisions related to power measurement, so you need to do this yourself.

As you understood, there are some limitations with the new feature. The first one, I think you already understood: this is only about dedicated CPUs, not shared CPUs, because we don't exactly know whether a shared CPU is in use and by how many instances. So this is hardly usable for public clouds, but the other clouds, which use dedicated CPUs, can of course use it. Also, if you have CPUs that are reserved for other processes, Nova won't be able to touch them, precisely because they are reserved; I was talking about DPDK, for example, that's one of the examples.
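The packing knobs mentioned above are regular scheduler configuration. As a sketch (the filter list is shortened for illustration; a negative CPU weight makes the weigher stack instances instead of spreading them):

```ini
[filter_scheduler]
# Negative value packs instances onto the most-used hosts instead of spreading
cpu_weight_multiplier = -1.0
# Favor hosts already running members of a soft-affinity server group
soft_affinity_weight_multiplier = 10.0
# SameHostFilter lets users request co-location via the same_host scheduler hint
enabled_filters = SameHostFilter,ComputeFilter,NUMATopologyFilter
```

None of this reacts to measured power; it just biases placement so that fewer hosts stay busy and more dedicated cores can remain offline.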
Nova won't be able to modify those other, reserved CPUs. As a reminder as well, it's not only CPUs that consume power: GPUs or NICs can also have a large power footprint, so that's probably something we could try to look at later, but for the moment this is only about CPUs. Also, if you really pack your instances onto a few specific hosts, you may get more power consumption from the top-of-rack switches, because you're sending all your instances to the same hosts and their NICs will see more usage. That could be a problem for the switches, maybe more power consumption, so we basically need to monitor that eventually.

I guess most of you are wondering what the difference actually is between an idle CPU and an active CPU, because that depends on the instances, on the workloads if you prefer. That's why I think it's also important, instead of just stopping or starting CPUs, to measure consumption in watts, and that's why we're moving to the next section.

OK, so, what we could improve on measurements. In fact, we looked at the BBC use case, and it's great, but they are measuring the consumption of the overall VMs.
So we thought we could improve it by giving the power consumption of the processes inside the guest VMs. The idea, and the goal, could be to enhance the dashboards with a more granular, more application-oriented view, so users could act on it.

So, what is Scaphandre? Scaphandre is the underlying tool used by the BBC folks for their use cases. It's a monitoring agent dedicated to energy consumption metrics. The author of the tool is Benoit Petit, it's developed by the community, and it's under the Apache v2 license. I would like to make a strong disclaimer here: this is a community tool, without productization or support by Red Hat.

How does Scaphandre work? When you start an instance of Scaphandre, it determines the server CPU topology and gathers RAPL data. RAPL data is provided by an Intel technology to estimate CPU and RAM power consumption. On the other hand, it gathers the system processes, and based on this collected data Scaphandre does some calculations to provide consumption from the host down to the processes: you get the power consumption of the host, of the CPU cores, and down to each individual process. Once you have that, you can export this data in various common formats: Prometheus, which is the most famous one, JSON, Riemann, a lot of formats. Combined with a tool like Grafana, you can build nice dashboards like the one you can see on the right.

So how do we expose and configure Scaphandre to get the guest VM data? On the left you have the standard configuration: you just run an instance of Scaphandre on the compute host, and Scaphandre provides the process consumptions; you then have to look at the QEMU processes to get the consumption of your VMs. Moving to the right, this is the configuration with the QEMU exporter.
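The per-process calculation Scaphandre does can be sketched roughly like this: take the host power figure derived from RAPL and split it across processes in proportion to the CPU time each one consumed over the same interval. This is a simplification of the real model, with made-up numbers:

```python
def attribute_power(host_power_w: float, cpu_time_deltas: dict[str, float]) -> dict[str, float]:
    """Split measured host power across processes proportionally to CPU time used."""
    total = sum(cpu_time_deltas.values())
    if total == 0:
        return {proc: 0.0 for proc in cpu_time_deltas}
    return {proc: host_power_w * t / total for proc, t in cpu_time_deltas.items()}

# Example: 40 W of host package power over a sampling interval during which
# one qemu process used 3 s of CPU time and another used 1 s
shares = attribute_power(40.0, {"qemu-vm1": 3.0, "qemu-vm2": 1.0})
```

In the standard setup, the "processes" here include the QEMU processes, which is why you can read a per-VM figure off the host-side metrics.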
First, you need to run Scaphandre with the QEMU exporter on the compute host. What it does is extract the data for each QEMU process into one directory per VM, and extrapolate the power consumption of these VMs in the form of a RAPL device. Basically, this is a kind of trick to emulate, or fake, a RAPL device for the guest VMs. Once you have done that, you need to share this data, via either virtio-fs or virtio-9p, from your host to the guest VMs. Finally, on each guest VM, you run Scaphandre with the --vm option to collect this data, and at that moment you can use whatever exporter you want, and it will give you the power consumption of the processes inside the guest VM. And that's it.

A quick glance at what virtio-fs is and what we have implemented in Nova to support it. virtio-fs is a shared file system: it allows you to share a directory tree between your host and your guest VMs. The benefit of virtio-fs is that it has the same semantics and performance as a local file system, so that's good. To use it via libvirt, you need to configure it in the libvirt XML files, and that's the reason we had to implement something: the XML file is managed by Nova, you cannot modify it by hand. That's the purpose of a feature we introduced, initially to support Manila shares. This feature requires QEMU 5.0 and libvirt 6.2, and, really important, the instances should use shared memory: it means that on these instances you have to configure either file-backed memory or huge pages to use the feature. The status of this feature is still in development, because we have to fix some issues with Manila, but it should land, hopefully, in this cycle.

So, what about the Scaphandre feature?
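For reference, the libvirt XML that Nova has to generate for such a share looks roughly like this (the host directory and mount tag are illustrative; the memoryBacking part is the shared-memory constraint mentioned above):

```xml
<domain type='kvm'>
  <memoryBacking>
    <!-- virtio-fs requires shared guest memory (memfd or hugepages) -->
    <source type='memfd'/>
    <access mode='shared'/>
  </memoryBacking>
  <devices>
    <filesystem type='mount' accessmode='passthrough'>
      <driver type='virtiofs'/>
      <source dir='/var/lib/scaphandre/vm1'/>  <!-- host directory to expose -->
      <target dir='scaphandre'/>               <!-- mount tag seen by the guest -->
    </filesystem>
  </devices>
</domain>
```

Since Nova owns this XML, hand-editing it is not an option, which is why the share has to be expressed through the Nova API instead.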
First, I would like to say that we named it like this for convenience, because it was easier for us to talk about it, but the idea is to share power data, so this is actually tool-agnostic. Basically, we leverage what we have done with virtio-fs, which I just mentioned, and the idea is to share a local directory on your compute host with the guest VMs. To enable the feature there are two steps. The first step is on the compute node: this is done with the share-local-fs option, where you supply the directory you want to share and a tag that will be used by the guest VMs. Then, on the instance, to enable it in Nova, we use the hardware power-metrics property, which can be set either as a flavor extra spec or as an image property.

Time for the demo, a quick proof of concept. It runs on the already configured compute host. I have created a new flavor passing the share-local-fs parameter, so now I will stop and resize a VM to use this new flavor, with the parameter enabling, let's say, the Scaphandre feature. When it's done, I start the demo instance. If we have a quick look at the nova-compute logs, we can see a filesystem section defining the shares. If we connect to the guest VM, you can look at its metadata using a curl command: that lists the fact that we have a share, with the tag called scaphandre, so we can use this tag to mount the share; that's done here. If we look into the share, we can see that we have Intel RAPL data, and then, because we have the right data, we can start Scaphandre. Scaphandre processes the data, and you can see that we associate power consumptions with the processes.
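The guest-side steps of the demo look roughly like this (the tag "scaphandre" is the one from the demo; the metadata URL is the standard OpenStack one, and the exact mount point and Scaphandre invocation may differ depending on versions):

```shell
# Inspect the instance metadata to find the attached share and its tag
curl -s http://169.254.169.254/openstack/latest/meta_data.json

# Mount the virtio-fs share using the tag exposed by the hypervisor
mkdir -p /var/scaphandre
mount -t virtiofs scaphandre /var/scaphandre

# Run Scaphandre in VM mode so it reads the faked RAPL data from the share
scaphandre --vm prometheus
```

From that point, the in-guest Scaphandre behaves as if the VM had real RAPL counters, and any of its exporters can be used.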
This is just a POC. The patch is marked not-to-merge as well, so it might change in the future, because I have some review comments asking to change some things. Also, don't look too closely at the data: it was more or less fake data, just for the purpose of showing it.

OK, what about the limitations? Right now we have some limitations due to virtio-fs and shared memory, and in fact the limitations also come from QEMU, because QEMU has to disable features like migration or snapshots when a file system is attached; there is some more development to do around that. As a result, we had to disable some operations: resize, migrate, shelve, live migrate, and evacuate. This is quite annoying for the moment, because if you want to use those operations, you need to unshare by stopping your VM.

An overall limitation, as Sylvain mentioned: we only have the CPU and memory consumptions, so we are missing the GPU one, even if it might be introduced into Scaphandre soon; and, a more complex topic, we don't have devices like NICs or FPGAs. And maybe one other limitation: RAPL, like I said, is an Intel technology, available on Intel and AMD, which means it's not supported on ARM or RISC-V CPUs.

Finally, what are the next steps, improvements, and ideas we have on these topics? First, QEMU upstream already has tickets and is already working on removing those operational limitations, so we could do the same in Nova when it becomes available; that's one axis. Another possibility: we have a colleague called Anthony Harivel, and what Anthony has done, simply put, is bring into QEMU the mechanism used by Scaphandre for calculating the power consumptions. So that would be available in QEMU if the patch is accepted, and at that moment the sharing would not be required anymore.
So that could be a nice improvement, obsoleting what we have done so far. Another alternative, instead of using virtio-fs, which I mentioned really quickly: we could use another protocol, virtio-9p, and virtio-9p does not require shared memory, so in the meantime we would not have to disable those operations. If you are interested in this feature, maybe we will discuss it in tomorrow's PTG sessions. We also have a tool called Kepler: Kepler is a tool that uses machine learning to predict consumption, so maybe that's another way to get better results, and maybe to deal with all the devices and things like that. And by the way, maybe Kepler will be fully supported by Red Hat, so it could be another alternative.

Well, that's it for our presentation. Maybe you have questions; we are here for that, so no worries. Any questions? Yes.

It's a compute-specific config option. So basically you either modify all your computes' configs or, I mean, you modify each compute, but yeah, basically it's per compute.

The second question: correct. So basically what we do, if you prefer, is that, say you reserve some CPUs outside of Nova, you modify the config options saying, OK, those are the reserved CPUs and I only want to take those dedicated CPUs. We basically use the dedicated CPU set that people are using, cpu_dedicated_set; that's the new config option name. Previously it was named...
...I think it was about pinned CPUs; I can't remember exactly, but that was the previous name of the config option. So basically we just take those specific CPUs and we stop them, not the other, reserved CPUs. By the way, I forgot about that: we also don't support CPU 0, because in general CPU 0 is used by the kernel and by other processes, so you can't just try to offline CPU 0; if you do that, you will have some problems. But yeah, it's basically just about the physical CPUs that you can give to Nova. You have to be careful to manage the CPUs that are not used by other systems: this feature does not make any checks that these CPUs are in the machine slice or that they are allocated to Nova.

Yeah, basically, if you already use pinned CPUs, that's something we have supported for, I can't remember, a lot of years, and what we do now is that we offline them if you don't use them. Any other questions?

You're talking about the offline/online stuff, so what do you mean by bare metal? Indeed. I mean, say, for huge pages or such things, again, it's just about CPU pinning. So again, with CPU pinning, the unused CPUs will basically be offlined.

What's the impact on consumption of disabling one core or not? Very good, I was pretty sure someone was thinking of it. So, about measuring that: it's a very good question. I mean, I haven't seen anyone trying to look at that yet, I would say. For the instances it would be good, but about the performance, what I would say is that, from what I heard, because I'm just a developer, right?
I'm not really testing it; I mean, surely I'm testing it, but not measuring the power. What I'm saying is that, for instances, it's nice in terms of power. For example, if you use NFV, with SR-IOV and DPDK and such, maybe you could try to use TuneD; that's what I said on a specific slide, for those other processes like DPDK. That's why I mentioned the TuneD profile, because that's what we saw at Red Hat: it also helps those processes.

Excellent question; I haven't thought about it yet. Yeah, I mean, in general, Nova doesn't... I mean, that won't be done by Nova automatically; that would be done by a deployment tool. So I would say that will depend on the deployment tool, about measuring the power. Good question. The problem is that if you stop the host and then restart it, between the power measurement and the time it takes, maybe it's not that good. So maybe you should keep it running and try to use, like I said, C-states for the other processes; I think that would be better, I guess.

And I think we are on time. So thanks. Thank you.