Hello everyone. I am Parul Singh and I work as a senior software engineer at Red Hat. Today I'm going to talk about how you can improve the energy efficiency of your cluster, make sure that you're consuming less energy, and, if possible, control the carbon footprint of the workloads you run on the cloud.

So who are we? We are a group of people taking a community-based approach to environmental sustainability. We are also part of the CNCF Environmental Sustainability group, and if you want to check out the proposal, the QR code will take you there. In general, our mission is to advocate for, develop, support, and help evaluate environmental sustainability initiatives in cloud native technologies. We also aim to identify values and possibly provide incentives to service providers so that they can reduce their energy consumption and control their carbon footprint through cloud native tooling. The great thing is that, as of June, Kepler is part of the CNCF Sandbox, so that's a great step for us.

What initiated bringing sustainability into computing? In 2021, an ACM technology brief estimated that the ICT sector, the information and communications technology sector, accounted for 1.8 to 3.9 percent of total global carbon emissions. To give you context, that is more than the carbon emissions of Italy and Germany combined. So we do emit a lot of carbon.

This brought us to ask some questions. How can you measure energy consumption indirectly, when you don't have access to the racks in the data center and cannot install specialized hardware? How do you measure the energy consumption of workloads on a cloud, where you may get the total energy consumption as a tenant, as does every other tenant, but how can you pinpoint the energy consumption of a particular workload? And when you're running in a private, hybrid, or public cloud, where you are not the only one sharing the resources, how do you attribute power to specific processes, containers, and pods?

With all these thoughts in mind, we designed our cloud native sustainability stack. It has many projects, but today I'm just going to talk about Kepler and the Kepler Model Server. Before that, I want to give you an idea of the principles we follow to attribute energy consumption. We did a bunch of experiments, read a lot of papers, and came to the conclusion that power consumption can be attributed to the resource usage of the processes, containers, pods, and so on that are running. Here is one example, where we only consider CPU usage and CPU power consumption: let's say you have a pod with one container, and it is consuming 10 percent of the CPU. Then you can say it contributed 10 percent of the total CPU power consumption. If it is consuming 50 percent, you can say it attributed 50 percent of the CPU power consumption.
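As a minimal sketch of that ratio principle, illustrative only and not Kepler's actual implementation (Kepler attributes power from eBPF-collected counters across CPU, GPU, and DRAM), the attribution could look like this:

```python
# Ratio-based power attribution: split total CPU power across pods in
# proportion to their CPU usage.

def attribute_cpu_power(total_cpu_power_watts: float,
                        pod_cpu_usage: dict[str, float]) -> dict[str, float]:
    """Attribute total CPU power to pods proportionally to usage."""
    total_usage = sum(pod_cpu_usage.values())
    if total_usage == 0:
        return {pod: 0.0 for pod in pod_cpu_usage}
    return {pod: total_cpu_power_watts * usage / total_usage
            for pod, usage in pod_cpu_usage.items()}

# A pod using 10% of the CPU is attributed 10% of the CPU power:
print(attribute_cpu_power(100.0, {"pod-a": 0.10, "pod-b": 0.90}))
# {'pod-a': 10.0, 'pod-b': 90.0}
```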
So the first project is obviously Kepler, which stands for Kubernetes-based Efficient Power Level Exporter. It uses software counters to measure the power consumption of hardware resources and exports them as Prometheus metrics. Down the line we are thinking of moving from Prometheus and adopting OTLP, so that any monitoring stack compatible with the OTLP protocol can consume our metrics. Kepler employs a three-pronged approach: it reports per-pod energy consumption, including CPU, GPU, and RAM; it works both on bare metal and on virtual machines; and it exports the energy metrics as Prometheus metrics.

Obviously, the goal of Kepler is to measure power consumption; you don't want Kepler to measure that power consumption while itself contributing 20 percent of the power usage. To minimize that, we made Kepler very lightweight, which is why we use eBPF to attribute power consumption to a particular process. And, as I'll talk about later, you don't always have access to hardware power readings through RAPL or ACPI; in a VM, for example, you cannot access the power meter, so in that case we use machine learning models to estimate energy. This is the bottom-up approach: for data collection and aggregation, Kepler uses, as I said, software counters and power meters to calculate the power consumed by the hardware, and for data modeling and presentation, Kepler converts that power consumption into energy estimates using machine learning.

How are these machine learning models formed? They come from the Kepler Model Server. When the node energy is not provided to you, in the absence of a power meter, Kepler relies on pre-trained models that can estimate energy consumption. Right now the stack we're using is TensorFlow, Keras, Flask, and Prometheus. We use two kinds of models: the CPU core energy consumption model, trained on features like CPU architecture, current CPU cycles, current CPU instructions, and CPU time; and the DRAM energy consumption model, which uses CPU architecture, current cache misses, and memory working set.

There are two phases, the training phase and the serving phase. For training, the Kepler Model Server has agents sitting on each of the nodes; these agents scrape the node metrics and export them to Prometheus. The Kepler Model Server then scrapes Prometheus and forms the dataset, for training as well as for testing, trains the model, and evaluates it. If it is of acceptable accuracy, you can use the model, serving or exporting it through Flask endpoints (down the line, again, we are going to go with OpenTelemetry), or you can load the model in memory to do the estimates.

So how do you decide which model to use, and when? Generally it depends on the available measurements. If you have access to the total power, you use power ratio modeling, where a workload's share is the ratio of the power consumed by its processes to the total power. But, as I mentioned, when power metrics cannot be measured, for example in a virtual machine, you estimate power using usage metrics as the input features of a trained model. You can do three levels of estimation: node total power, which includes the fan, the power supply, and internal components such as CPU and memory; node component power; and pod power.

These are the various scenarios of when to use which model. Take the first row, bare metal x86 with a power meter: you measure total power using ACPI, you measure node component power using RAPL, and pod power is just a ratio. In the last row, a pure VM scenario where you have access to neither node power nor node component power, pod power is purely a power estimation. You will have these slides, and the link will give you more information on the various metrics that we use.
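As a rough sketch of that selection logic, mirroring the simplified scenario table above (my own illustration, not Kepler's actual code; the strategy names are assumptions):

```python
# Pick a pod power strategy from the available measurements.

def choose_pod_power_strategy(has_total_power_meter: bool,
                              has_component_power: bool) -> str:
    if has_component_power:
        # Bare metal with RAPL: pod power is a usage ratio of the
        # measured component (CPU/DRAM) power.
        return "ratio-of-component-power"
    if has_total_power_meter:
        # Only ACPI total power: split node power by usage ratio.
        return "ratio-of-node-power"
    # Pure VM: no power meter at all, so fall back to a pre-trained
    # model fed with usage metrics (CPU cycles, instructions, cache
    # misses, memory working set, ...).
    return "ml-estimation"

print(choose_pod_power_strategy(True, True))    # bare metal -> ratio
print(choose_pod_power_strategy(False, False))  # pure VM -> ml-estimation
```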
As I said, Kepler works on bare metal as well as on VMs, but now Kepler goes beyond Kubernetes: we have created an RPM that you can run on a Linux server to estimate the power consumption of individual processes outside the Kubernetes ecosystem.

I have a few screenshots to share that show a very interesting Grafana dashboard, at least it's interesting to me. For this cluster we have six nodes, and Kepler runs as a DaemonSet, so there is an instance of it on each of the nodes, and we have a Grafana route. This is the first dashboard: we use a third-party API to compute the carbon intensity by region, and this is for the United States. The various colored graphs are the carbon intensity by region; BPA, which is the step-like graph at the bottom, is the lowest, while the purple one at the top, MISO, is the highest.

The second dashboard: now that you have the carbon intensity by region, you can use Kepler to translate that carbon intensity to a particular process. You know the energy consumption or energy estimate, and you know the carbon emission in kilotons per hour, and using these two values you can correlate the carbon footprint of each of the namespaces or each of the pods. For example, here we are using the BPA region: the carbon intensity on the left-hand side ranges anywhere between 0.6 and 1, while the power consumption of all namespaces is a straight line. Now with the MISO region, the carbon footprint ranges between 4 and 5, but the power consumption is constant. So you can see that you can control your carbon emissions if you schedule your workloads in a region that has lower carbon intensity.
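As a hedged sketch of what that dashboard correlation computes: carbon footprint is roughly pod energy multiplied by the region's carbon intensity. The Prometheus URL below is a placeholder, and the Kepler metric and label names may differ across versions:

```python
# Correlate Kepler's per-pod energy with a region's carbon intensity.
# The URL is a placeholder; the metric/label names
# ("kepler_container_joules_total", "pod_name") may vary by Kepler version.

import requests

PROM_URL = "http://prometheus.example.com:9090/api/v1/query"

def pod_energy_joules(pod: str) -> float:
    """Query a pod's cumulative energy (in joules) from Prometheus."""
    query = f'sum(kepler_container_joules_total{{pod_name="{pod}"}})'
    resp = requests.get(PROM_URL, params={"query": query}).json()
    return float(resp["data"]["result"][0]["value"][1])

def carbon_footprint_grams(pod: str, intensity_g_per_kwh: float) -> float:
    """Carbon footprint = energy (kWh) x carbon intensity (gCO2/kWh)."""
    kwh = pod_energy_joules(pod) / 3_600_000  # 1 kWh = 3.6e6 J
    return kwh * intensity_g_per_kwh

# e.g. in a region at 450 gCO2/kWh:
# print(carbon_footprint_grams("my-pod", 450.0))
```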
We have applied Kepler in the open source. We worked with IBM to design something called CLEVER, a Container Level Energy-efficient VPA Recommender; we also worked with Microsoft and the KEDA team on carbon-aware scaling with KEDA; and something that I worked on is carbon-aware scheduling: if you have access to the details of the source of your power, can you control the carbon intensity of your workload, relying more on renewable energy and less on fossil fuels? So that is all. This QR code will take you to the sustainability stack; this is the link to Kepler and the Model Server, but you can check out the other projects as well. And now I guess we have time for a few questions; if you don't mind, you can also make suggestions.

Audience: Your work is very interesting. It would basically also capture the case where, for example, one workload triggers a machine to be scheduled in a region, so that changes your total consumption by one more machine, because one workload in particular needs to scale up and one more machine is scheduled. If I am seeing it right, it would capture that, because you see the consumption of the whole cluster, and then you could relate it to the region?

Parul: So, if I got it right, what you are asking is: if I control the machines that are in a region, can I somehow control the carbon intensity of the workloads? And by one more machine, do you mean a machine or a workload?

Audience: Virtual machines.

Parul: Virtual machines, okay, got it. So, as I said, Kepler will not give you the carbon intensity; you have to get that from a third party, but if you have access to that, then definitely you can. If you're adding a machine, you can estimate the total for all the nodes present in the cluster, and you definitely get the workload estimate from Kepler, so tying these together can be done. What we are finding hard is getting reliable carbon intensity data: everybody has carbon intensity data, but they don't talk about how they produce it, there's no transparency, and we just have to accept what they give us. So we are working toward a more specialized, open source, and transparent way to get this carbon intensity data, but given access to reliable carbon intensity data, that can definitely be done.

Has anybody used Kepler or heard about it before? I know we have one person who uses the project, but anybody else? Okay, I'm sure you will now.

How much time do we have? Okay. You asked how we use carbon intensity information to do the scheduling; I have some slides in case somebody is interested. As I said, we rely on a third-party API to get the carbon intensity. What we did was develop a carbon intensity forecaster that scrapes these third-party APIs to tell you what the carbon intensity is going to be some steps into the future; for this experiment we were forecasting carbon intensity a day ahead. First you get the carbon intensity data from the forecaster: there's a cron job that queries the forecaster for the carbon intensity of each node. If the carbon intensity is super high, we label the node red; if it is somewhere in the middle, we label it yellow; and if it's the lowest, we label it green. Kubernetes has features called labels and taints, and using those we control this: when a node has really high carbon intensity, we label it red and also apply a taint to it; when a node has the lowest, it gets labeled green and is not tainted.

Tainting a node means that pods are evicted from it if they do not tolerate the taint. This is the pod spec (see the sketch after this answer): here the pod explicitly declares that it prefers a node labeled green for carbon intensity, and it does not tolerate red-tainted nodes indefinitely; the toleration seconds we set, for the sake of this experiment, to five, which means that if the pod is scheduled on a node tainted red, it will stay there for just five seconds and then be evicted, just to make clear that the workloads are moving. So as nodes change color, the scheduler, the inherent Kubernetes scheduler, will look at the pod spec and at the taints and labels on the nodes, evict the pod from node one, which is going from green to red, and schedule it onto node two, which is going from red to green.

Somebody asked me this question yesterday in the workshop: what if I don't care about carbon intensity and I don't want my pod to be evicted? In that case, you don't need to provide this additional information in your pod spec; just don't provide the node selector and don't provide the tolerations, and the pod will be treated as usual. Yeah, thank you so much.
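For reference, here is a minimal sketch of the kind of pod spec described above, written as a Python manifest dict; the carbon-intensity label/taint key and its color values are my assumptions for illustration, not necessarily what the experiment used:

```python
# Pod that targets green-labeled nodes and tolerates a red NoExecute
# taint for only five seconds before being evicted. The
# "carbon-intensity" key and color values are assumptions.

pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "carbon-aware-workload"},
    "spec": {
        # nodeSelector is a hard constraint; a preferred node affinity
        # would express a softer "prefers green" instead.
        "nodeSelector": {"carbon-intensity": "green"},
        "tolerations": [{
            "key": "carbon-intensity",
            "operator": "Equal",
            "value": "red",
            "effect": "NoExecute",
            # Five-second grace period on a red node, as in the demo,
            # after which the pod is evicted.
            "tolerationSeconds": 5,
        }],
        "containers": [{
            "name": "app",
            "image": "registry.example.com/app:latest",  # placeholder
        }],
    },
}

# Omit nodeSelector and tolerations entirely if you don't care about
# carbon intensity; the pod is then scheduled as usual.
```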