Hi! If you're not asleep already, you can join us for the next talk. I am Parul, and with me you have Krishna. We are going to talk about how you can improve the efficiency of your clusters using power-aware Kubernetes scheduling. We've had too many AI and LLM sessions, so this is a good refresher for you. Stay with us.

First, who we are. We are open source contributors working on the cloud native sustainability stack, with the major contributions coming from IBM, Intel, and Red Hat. Any Intel folks here? No. We have a few IBMers. Any Red Hatters? Okay.

At the cornerstone we have the project Kepler, which gives you energy observability metrics for your cluster, and around Kepler we have other projects that consume those observability metrics to do cluster optimization. Kepler is already a CNCF sandbox project, and feel free to drop by our kiosk on Friday, 10:30 to 12:30, if you want to know more about it. But today we are talking about PEAKS, which uses Kepler's observability data to do energy- and power-aware Kubernetes scheduling.

So what problem are we trying to solve? Is anybody here familiar with the Kubernetes scheduling framework? Okay. The Kubernetes scheduling framework gives you the provision to hook built-in or custom scheduler plugins into the scheduling flow, so you can try to solve whatever scheduling problem you have, whatever objectives you are trying to maximize or minimize.
Within the ecosystem, however, we found a lack of scheduler plugins that focus on power optimization or energy efficiency while also taking care of the other objectives they are trying to solve. So we thought of developing PEAKS to address this gap. What PEAKS does is essentially try to maximize the energy efficiency of your cluster while still letting you do other things, like optimizing for topology, network, or CPU utilization.

Our goal was to come up with a configurable scheduling plugin that minimizes the aggregate power consumption of the entire cluster. We wanted to implement it as a score plugin, while making sure we are not altering the default scoring plugins of the Kubernetes framework.

So, the solution: PEAKS, which I have been talking about for so long. PEAKS is a Kubernetes scheduler plugin that aims to optimize the aggregate power consumption of your entire cluster, and the important thing to note here is that it does this during scheduling. At the moment, PEAKS only does this optimization at pod placement. It uses a pre-trained machine learning model that correlates node utilization with power consumption to predict the most suitable node for your workloads, and these predictions are based on the resource needs of the incoming workloads and the real-time utilization of the nodes.

So let's go over the PEAKS workflow. This is going to be quite descriptive, so stay with me. We'll start with what happens in pre-processing.
This is what we need per scheduling cycle. For each node in your cluster, you extract some metrics. Right now we use energy consumption metrics that come from Kepler, and node usage metrics that come from Node Exporter or Load Watcher. The thing to note here is that you can bring your own metric provider. It doesn't really matter which metric provider you use, as long as it pushes the metrics into Prometheus and you can read the energy consumption and node utilization metrics from Prometheus.

So for each node you have these metrics, and then you create a node power model, which is a function of node utilization, node energy consumption, and the workload request that your pod has. You train a machine learning model for each node in the cluster.

The next step: you have an incoming workload, and the pod has resource requirements that can be read from its manifest. Note that this processing is done per node. You take a cluster node and its power model, and you predict what the change in the node's instantaneous power would be if you scheduled this pod on this node. Whatever that change in instantaneous power is, that is what is returned as the plugin score.
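To make the scoring idea concrete, here is a minimal sketch in Python (the actual scheduler plugin is written in Go against the scheduling framework). The linear power model, the wattages, and all names below are illustrative assumptions, not the PEAKS implementation.

```python
# Illustrative sketch only. PEAKS trains a model per node from
# Kepler/Prometheus data; here we assume a simple linear power model:
# watts = idle + slope * utilization.

def predicted_watts(util_fraction, idle_watts, slope_watts):
    return idle_watts + slope_watts * util_fraction

def plugin_score(node_util, pod_cpu_request, node_cpus, idle_watts, slope_watts):
    """Raw score: predicted change in the node's instantaneous power if a
    pod requesting `pod_cpu_request` CPUs lands on this node."""
    util_after = node_util + pod_cpu_request / node_cpus
    return (predicted_watts(util_after, idle_watts, slope_watts)
            - predicted_watts(node_util, idle_watts, slope_watts))

def normalize_scores(deltas, max_score=100):
    """Map raw power deltas to scheduler scores in [0, max_score]: the node
    with the smallest predicted power increase gets the highest score
    (Kubernetes score plugins conventionally report scores in 0-100)."""
    lo, hi = min(deltas.values()), max(deltas.values())
    if hi == lo:
        return {n: max_score for n in deltas}
    return {n: round(max_score * (hi - d) / (hi - lo)) for n, d in deltas.items()}

# A 1-CPU pod against a small low-power node and a large high-power node
# (made-up parameters):
deltas = {
    "node1": plugin_score(0.10, 1.0, node_cpus=8, idle_watts=30, slope_watts=60),
    "node2": plugin_score(0.10, 1.0, node_cpus=40, idle_watts=120, slope_watts=400),
}
scores = normalize_scores(deltas)
```

With these made-up parameters, node1's predicted power jump is smaller, so it gets the top normalized score and, all else being equal, receives the pod.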
You do this for all the nodes in your cluster: if there are more nodes, you repeat the process; if not, you move on to normalizing the plugin scores. We show a basic normalization function here, but essentially it gives a higher score to nodes with a smaller change in power. Once the normalized plugin scoring is done, you can assign a suitable weight to the PEAKS plugin. Kubernetes lets you assign a weight to each plugin, and that weight determines the priority or importance given to that particular plugin.

Once you have the weights and the scores, this whole cycle gives you the node that is the best fit for the pod, decided by the highest normalized score. If you have more pods to place, you just repeat the process; if not, your cluster is done, though that never really happens, you will always have more workload.

Now let's talk about power model training and inferencing. For each node you run some benchmarks. This is a screenshot of how we did it; we were using stress-ng for this particular experiment. We ran a lot of benchmarks, driving the nodes all the way up to 100% utilization, and we collected metrics, in particular CPU utilization and power consumption. Using those metrics you generate your node power profile. Inside the PEAKS plugin you import the power model parameters. Once you have the pod, you get its resource requirements, you get the node model parameters from the model, and then you predict the best node for that particular pod.

The prerequisites for this were a metric provider and a model. As I said, you can bring your own metric provider as long as it pushes metrics into Prometheus; we used Kepler and Node Exporter. You can also bring your own model: it doesn't really matter how you trained it, as long as PEAKS can use it. So the training itself is outside the scope
of PEAKS, but when we were building this we did have a training pipeline. It doesn't matter how you train your model; PEAKS just uses that model for inferencing. The important thing to note is that you can train your model according to your cluster's behavior. If you have AI-specific workloads, you can create the model for AI workloads; if you have edge-specific workloads, you consider those parameters in your model training and you get an edge-specific node model.

Next we have the experiments, and Krishna will give you an overview.

Yeah, thanks, Parul. So we have gone through what PEAKS is. Now I would like to take you through the value of PEAKS, how you can actually realize its value, so that we can all appreciate the work. Before I jump into the different use cases, let me first briefly explain the experimental setup. It is a two-node cluster that we used for evaluating PEAKS's performance, in particular a bare-metal Kubernetes cluster with two nodes, these nodes being heterogeneous in nature.
What heterogeneous means here is that the resources allocated to these nodes are not the same: one is a high-power machine and one is a low-power machine. You can see the CPU cores vary from 8 to 40, and the memory allocated to the bigger node is also much higher.

We then ran the benchmark workloads and tried to capture the nodes' power behavior, and these curves represent it: the one in blue is node one, and the one in orange is node two. The power curves depict two things. First, the nodes consume some power even when they are almost not utilized, even in the idle state, and this idle power is high for the high-power node and low for the low-power node. Second, during the active phase as well, the amount of power consumed by the high-power node is much higher. From this graph you can relatively rank the nodes: the low-power node is the more efficient one, and the high-power node is the less efficient one.

Another important aspect, as I said, is the scheme of evaluation for the use cases we go through to demonstrate the value of the PEAKS plugin. We first run some workloads using the default Kubernetes scheduler, then we repeat the same experiment replacing the default scheduler with the PEAKS scheduler. In both scenarios we collect metrics like node utilization and energy efficiency metrics, compare the energy consumption during the experiment, and see whether there is any saving in energy consumption using PEAKS. If there is, that saving is attributed to the value of the PEAKS plugin.
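The idle-plus-active shape of those power curves is what the per-node profiling step fits. As a rough, hedged illustration, one could fit such a curve from benchmark samples like this; the sample numbers below are invented, not the talk's measurements, and the real pipeline fits Kepler data collected under stress-ng load:

```python
def fit_power_model(samples):
    """Least-squares fit of watts = idle + slope * util from
    (utilization_fraction, watts) benchmark samples."""
    n = len(samples)
    mean_u = sum(u for u, _ in samples) / n
    mean_w = sum(w for _, w in samples) / n
    cov = sum((u - mean_u) * (w - mean_w) for u, w in samples)
    var = sum((u - mean_u) ** 2 for u, _ in samples)
    slope = cov / var
    idle = mean_w - slope * mean_u
    return idle, slope

# Hypothetical samples swept from idle to 100% utilization on one node:
samples = [(0.0, 31.0), (0.25, 45.0), (0.5, 61.0), (0.75, 74.0), (1.0, 90.0)]
idle_watts, slope_watts = fit_power_model(samples)
```

A model like this, one per node, is what the plugin would load as its power parameters; the online retraining mentioned later in the Q&A would simply re-run such a fit on fresh samples.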
This scheme of evaluation is common across all the use cases I will take you through, and this is the particular cluster setup on which we demonstrated them. But the use cases are generic enough to be carried out on any cluster, with any number of nodes, homogeneous or heterogeneous; only the amount of savings will differ.

Okay, so let me jump into the list of use cases. The first one is the most common configuration: a pod being deployed with the help of the Kubernetes scheduler. The second one is scaling your pods using the Horizontal Pod Autoscaler. The third is scaling using kubectl scale, another very popular CLI interface for operating on the different API objects. The fourth one is migration of a pod via explicit pod eviction. And the fifth one is a much more generic case, the Cluster Autoscaler. We'll see more details on each of these, but this is the gist of the use cases I would like to spend the next 10 to 15 minutes on.

The first use case is the deployment of a pod using the Kubernetes scheduler. Here we considered the cluster with the two nodes and assumed there is no other application or pod running on the cluster, and only one pod, belonging to our application, is to be scheduled. When that pod is scheduled using the default scheduler, because the default scheduler is not cognizant of the nodes' efficiencies, it can place the pod on any of the nodes. Assume it places the pod on the energy-inefficient node. If it runs for a while, about 10 minutes, you can see that the utilization of that energy-inefficient node sits at some level, around 12 percent. You could of course run a bigger workload, and the utilization would be much higher.

Then we repeat the same experiment, replacing the default scheduler with the PEAKS scheduler. This time, because the PEAKS
scheduler is aware of the energy efficiencies of these nodes, it chooses node n1, which is the most energy-efficient in this case. You can see the pod executing on this node, again for about the same period of 10 minutes. The graph on the right depicts the energy consumption of the cluster, across both nodes, as time progresses. With the default scheduler some amount of energy is consumed, and with PEAKS the amount of energy consumed is lower. The gap between the two curves represents the saving in energy consumed by the cluster across all its nodes. In this experiment, the saving we observed amounts to roughly 12 percent.

A few points to note before I jump to the next use case. First, the energy savings can be larger if you run the same experiment for a longer duration, because energy accumulates over time; run the workload longer and you will see more savings. Second, because the active power of the high-power machine is very high, savings grow as that node's utilization goes up. Here we are running these nodes at low utilization; if you increase the utilization, you will also see higher savings. So there are two ways of realizing higher savings. I would also like to extrapolate: if these pods were running on the cluster alongside other applications, we could realize similar savings. Just for the sake of understanding the concept more easily, we assumed there is no other application running, but these pods can be co-located with other applications' pods.

The next use case is scaling pods using a Horizontal Pod Autoscaler.
In this use case we have a particular deployment configured with a Horizontal Pod Autoscaler. The CPU percentage of 50 means that when the utilization of the resources allocated to the pods goes beyond the 50 percent threshold, the HPA creates a new pod. We want a minimum of one pod at any time and a maximum of 30 pods, as configured in the HPA controller.

When we run this experiment using the default scheduler, you can see that as the load increases, the HPA controller starts creating more pods. When the HPA controller wants to create a new pod, the placement decision is made by the default scheduler, and the default scheduler does something called spreading: it tries to place the pods equally across the nodes. During this experiment we created about 15 pods, and those pods were placed alternately between the nodes of the cluster. You can see that at any point in time, the utilization across the two nodes is at roughly the same level.

Now come back to the PEAKS scenario, where PEAKS is used for scheduling. When the load increases, the HPA controller does the same kind of scaling and decides that a new pod should be created, but this time the placement decision is made by the PEAKS scheduler plugin. Because PEAKS is aware of the energy efficiency of the nodes, it wants to first exhaust all the resources available on the most energy-efficient node, so it tries to keep the pods as much as possible on node one first. You can see that the utilization of node one goes all the way up to 100 percent before a pod is placed on node two.
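Concretely, an HPA configured as described for this use case (50% CPU target, 1 to 30 replicas) might look like the manifest below. This is an illustrative sketch, not a manifest from the talk; the target name is borrowed from the PHP Apache application mentioned in a later use case:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache          # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1            # "minimum of one pod at any time"
  maxReplicas: 30           # "maximum of 30 pods"
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # scale out beyond 50% CPU utilization
```

The HPA decides only how many replicas exist; which node each new replica lands on is still the scheduler's call, which is where PEAKS comes in.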
Initially node one is packed with pods, and once all of node one's resources are exhausted, node two starts getting packed. This is the point where node two starts getting packed, and you can see that at this level node one is at 100 percent utilization.

Now compare the energy consumption of the cluster across the two nodes under the two scenarios. The orange line represents the energy consumption across both nodes using the default scheduler plugin, whereas the blue one represents the energy consumption across both cluster nodes using the PEAKS scheduler plugin. You can see there is a gap between them, and as time progresses, the gap increases. This is because, as time progresses, the default scenario uses more of the resources of node two, which is energy-inefficient.

There is another observation you can make: after a point, the gap starts decreasing. Why does the blue curve start going up here? Because this is the point on the timeline where, using PEAKS, we started using the resources of node two, which is energy-inefficient.
As long as node two was not used by PEAKS, the gap kept increasing, because under the default scenario we were already using node two's resources. Here PEAKS also started using node two, and hence the gap starts decreasing. This demonstrates that using the Horizontal Pod Autoscaler with the PEAKS plugin, you can still realize energy savings. And using an HPA along with a deployment is a very common pattern, because the HPA lets us scale up and scale down automatically.

Let me go to the third use case, where we see the advantage of PEAKS with a very well-known CLI: kubectl scale. The kubectl scale interface is helpful for scaling the pods of an application up and down, where the API resources are not just Deployments but can also be StatefulSets, ReplicationControllers, ReplicaSets, and so on. There is a variety of application API objects that Kubernetes allows users to work with, and kubectl scale is the interface for scaling them up and down. Now, what happens?
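For reference, the command reads the same across these API objects; the resource names here are placeholders, not from the talk:

```shell
# Scale a Deployment (the scenario that follows goes from 2 to 5 replicas):
kubectl scale deployment/php-apache --replicas=5

# The same verb works for other scalable API objects:
kubectl scale statefulset/my-db --replicas=3
kubectl scale replicaset/my-rs --replicas=4
```

kubectl scale only changes the desired replica count; as with the HPA, the placement of each new pod is decided by whichever scheduler the cluster is running.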
Let's see with the default scheduler. Assume we have a deployment with two pods, and assume that, due to some scenario, these two pods are initially deployed on node one, the most energy-efficient node. Now think that the application owner wants to scale the number of pods up from two to five. The three new pods are scheduled by the default Kubernetes scheduler on node two, because the default scheduler's rule of thumb is to spread the pods across the nodes. It places the newly created pods on node two even though there is space to deploy them on node one, which is more energy-efficient; it simply cannot do otherwise, because it is not aware of the energy efficiency of these nodes.

On the other hand, consider the scenario with the PEAKS plugin: we again have two pods initially deployed on the most energy-efficient node, and we do the same experiment, scaling the number of pods from two to five. This time the new pods are deployed on node one, because the scale-up is handled by the PEAKS scheduler plugin and it is aware of the energy efficiency.

Okay, let me go to the next use case, migration of a pod. This is the fourth use case. Consider the scenario where, at some random time instance, certain pods are running on the cluster nodes. Assume we have an application running with its two pods on two different nodes, but we know that node one is more energy-efficient. Ideally you have two options here: you can let the application run as is and complete, or you can try to move the pod from node two to node one, because running on node one is more energy-efficient.
So you try to delete that pod manually. This time, because the API object, the ReplicaSet, knows that one of its replicas has been deleted, it needs to create a new pod to meet the desired state of two replicas. But the pod is created on the same node again, because the underlying scheduler is the default one.

On the other hand, repeat the same experiment with the PEAKS plugin. Assume there is another application running that is already consuming most of the resources of the energy-efficient node, and our particular application, PHP Apache, is running with two pods, one on node one and another on node two. Assume that after a while the other application, a CPU stress test, exits; when it terminates, it releases the resources allocated to it. Now we have the same scenario: either we let our application run as is, or we try to move its pod to the energy-efficient node. So this time, run the same command again and delete the pod running on the energy-inefficient node. Again the ReplicaSet knows that one pod has exited, so it creates a new pod, and this time the scheduler plugin allocates it on the energy-efficient node. This mimics the auto-migration of pods from energy-inefficient nodes to energy-efficient nodes at random points in time.
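Such eviction-driven migration could be automated along the lines below. Everything here is a hypothetical sketch: the function, names, and threshold are assumptions, and only the pure decision logic is shown; the actual eviction would go through the Kubernetes API (for example, `CoreV1Api.delete_namespaced_pod` in the official Python client).

```python
# Hypothetical sketch of a CronJob-style migration loop: find pods to
# evict from the least energy-efficient node whenever the most efficient
# node has spare capacity. All names and the threshold are assumed.

def pods_to_migrate(pods_by_node, node_efficiency_rank, node_util,
                    util_threshold=0.8):
    """Return pods running on the least efficient node, but only if the
    most efficient node is under the utilization threshold, i.e. it has
    headroom to receive them after eviction and rescheduling."""
    ranked = sorted(node_efficiency_rank, key=node_efficiency_rank.get)
    best, worst = ranked[0], ranked[-1]
    if node_util[best] >= util_threshold:
        return []  # no headroom on the efficient node; leave pods alone
    return list(pods_by_node.get(worst, []))

victims = pods_to_migrate(
    pods_by_node={"node1": ["web-a"], "node2": ["web-b", "web-c"]},
    node_efficiency_rank={"node1": 1, "node2": 2},  # 1 = most efficient
    node_util={"node1": 0.3, "node2": 0.6},
)
# Each victim would then be deleted so its ReplicaSet recreates it and the
# power-aware scheduler places the replacement on the efficient node.
```

The point of keeping the decision logic pure is that the loop body (list pods, evict, wait) stays a thin wrapper around API calls that can be driven by a Job or CronJob.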
We did this using a script, but it is always possible to exercise the same behavior with automation like Jobs and CronJobs, which can monitor the resource usage of the energy-efficient nodes, and if they find the energy-efficient nodes underutilized at any time, move workloads from energy-inefficient nodes to energy-efficient nodes automatically. All of this can be automated.

The last use case I want to touch upon uses the Cluster Autoscaler. In the previous use case we migrated the pods of one given deployment. What if we want to do that across all the deployments of a cluster? Why can't we? If we do it across the cluster, migrating all the pods running on the most energy-inefficient node of the cluster to any other node, ensuring that the most energy-inefficient node is drained of pods, then the Cluster Autoscaler will identify that this node is underutilized and will automatically delete it. By doing that we are not only saving the active power (when we migrate a pod, we are saving on the active power), but by shutting down the node we are also saving on the idle power.

If you look at these two graphs: when any cluster node is in the idle state, it still accounts for some energy, known as idle energy, and this accumulates over time. Even if you just keep the node idle, thinking your node is not doing any work, it is still consuming a lot of power, and it is a significant portion of the node's total power. So shutting down nodes is one of the best ways to achieve energy savings, and this is the use case that demonstrates shutting down a node.

Before I move on to the other aspects, one note because of the time constraints.
We are not actually doing a live demo, but this is a snippet of the logs from a live cluster, demonstrating that an incoming pod is placed on the node with the smaller jump in energy consumption rather than the node with the larger jump. That means we make the placement decision for the incoming pod such that the change in the cluster's energy consumption is smaller, which is the key decision in our context.

Okay, there are some FAQs; I will come back to this slide later. Before that, let me spend a couple of minutes on future work. For example, the pod migration scenario I explained: we have demonstrated the possibility of energy savings with pod migration, but one can automate the same workflow using scripts very easily. That is one future work item, an immediate milestone for us. The same goes for shutting down nodes to minimize power consumption: the use case I referred to was done with manually run scripts, but one can write automation to exercise pod migration across the nodes of the cluster in an automated fashion.

The next possible enhancement is the Vertical Pod Autoscaler. We have seen that with the Horizontal Pod Autoscaler there is a benefit in terms of energy savings. When the VPA tries to scale up the resources allocated on a node for a particular pod, today it does so on any node, without being aware of energy efficiency. You could make the VPA energy-aware by performing the scale-up on nodes that are energy-efficient, ensuring that it scales up the pods running on the most energy-efficient nodes.
Another one is kubectl scale; okay, I will not spend much time on that. Here we have listed some of the related work, and here are some of the repositories: the first repo is where the plugin score is implemented, the PEAKS repo is more for the project management, and the last repo is the Kepler repository, which is one of the dependencies of our project. I'll hand it over to Parul.

Yeah, before we get into the questions, I just wanted to note that we are working on a KEP that we will be creating with the Kubernetes scheduling SIG, so keep monitoring our PEAKS repository. We also have a community meeting that happens once a month. We are trying to build a community and we want more contributors, so if you have any suggestions, want to participate in any discussions, or want to present a use case to us, please open a discussion or an issue on our repository.

Now, a few acknowledgments: Felix was not able to make it to our session, but he has actively contributed, and we have one more contributor as well, so thank you to both of them. And now we are open for questions. Any questions? Yeah?

Hello, thank you for the presentation. I saw the future work slide at the end, and I'm asking about where the power is created, you know, because it's important to know where the electricity comes from, especially for the carbon. Will PEAKS take that into consideration? Sorry for my French English.

Oh, so if I can rephrase: you are asking whether, during the energy-efficiency and power-consumption minimization process, we also consider carbon efficiency, whether we consider where the nodes are running. If I got your question right, you are asking whether we also assign a value or a priority depending on the source of energy, right?
Yeah, exactly.

So we did dive into that. Two years ago, in 2022, we were working on a use case where we were trying to consider carbon as well as the source of electricity. But we want the process to be very transparent and open-source based, and it was very hard to get that data. In Europe it is still possible if you use Electricity Maps, but in the US it is just very hard to come by. Because we want our policies to be super transparent, we are not working on that right now; we explored it, but we are not pursuing it.

And just to add to what Parul said: optimizing for energy efficiency and minimizing carbon are not exactly the same thing. There is a different talk, on Caspian, coming tomorrow from the same group we belong to; please step into that session if you want to really learn about carbon optimization. So, we have one more question? Yeah, sure.

Hello, thank you for the talk. I noticed all the pods got scheduled to a single node. What happens if I have an anti-affinity policy? Does the scheduler respect that?

Sure. Because the scheduler plugin follows the scheduler framework, it does respect pod affinity, anti-affinity, node labels, all these aspects. Only the nodes that remain after the filter phases are considered, so this plugin respects the scheduling framework.

At the beginning of your talk you referred to creating a profile for servers. Would that profile be the same for the same configuration of server? Like if you create a rack where all the servers are the same?

Not necessarily, not necessarily.

Why would that be?
Yeah, because even though the hardware configuration may be the same, the applications running may be different, and different workloads may use different amounts of the servers' resources. Hence your power modeling needs to be cognizant of the workload type that is going to run.

And there are also two components to it. You can have an offline model that you train once and never revisit, but you can also keep an online model, where, depending on what is happening in the cluster at that moment, you keep retraining your model. So it is very unlikely that two nodes will have identical node model profiles.

Okay, this is the last question, sorry.

Okay, two quick questions. You talked about creating a model per node. When does this occur? What is the overhead, exactly?

So yeah, as we said, that is a pre-processing step. Let me quickly go back to the slide. At the start, before you have enabled the scheduling, you pre-process the node model for each node in the cluster; that is pre-scheduling. But again, as I mentioned, there are two approaches, online and offline. For offline you just do it once, but for online this process keeps going, and that is when you get the model parameters using, for example, a webhook.

Okay, thanks. And the second question: processors have power-efficiency modes and can basically switch between power modes. Did you try to leverage this feature?

Sorry, say that again? The power modes, the CPU can downclock.

Yeah, that is the DVFS functionality, dynamic voltage and frequency scaling. This can work with DVFS, though we didn't include any particular use case with DVFS. But it should definitely work with DVFS too, since the model is cognizant of the frequency of the node. So yeah, that's about it.
And I'll just bring up the QR code. So yeah, drop by our community meeting, and we have a kiosk on Friday as well, 10:30 to 12:30, so feel free to come talk to us if you're interested in exploring a use case, or start a discussion, whatever works for you. Thank you so much, and please also provide your feedback by scanning this QR code. Thank you.