Happy Friday, everyone. I hope you are having a great time at the conference. My name is Parul Singh and I work as a senior software engineer at Red Hat. Today I am introducing our sustainability-in-computing stack, and we will see how you can monitor energy consumption and build energy-efficient systems the cloud-native way. The agenda is simple: I will start by describing the existing state of affairs in terms of energy and sustainability in computing, then I'm going to introduce a sustainability stack, which as of now comprises three projects: Kepler, the Kepler model server, and PEAKS, which is a scheduler. And at the end I'll have a demo, because obviously we all love demos.

In 2021, an ACM technology brief estimated that the information and communication technologies sector, or ICT sector, contributed between 1.8 percent and 3.9 percent of global carbon emissions. To give you context, that is more than the global carbon emissions of Italy and Germany combined. So this prompts us to ask questions like: how can we minimize the energy consumption of computing, be it on cloud or on premise, where you may or may not have access to the devices in case you want to install specific hardware components to measure energy consumption? We need to come up with a way to measure energy consumption directly, to measure the energy consumption of individual and specific workloads, and to attribute power on shared resources to processes, containers, or pods.

So let's talk about the energy measurement methodology: what seems ideal, and the reality of how things work at the kernel level. The first target we have is frequency, and you would think you could just measure frequency by monitoring each circuit, but that is not possible, because in reality the Linux kernel's CPU frequency governor dynamically changes the frequency to save energy and/or for performance. So the solution is to use the average frequency as an approximation. The second target is capacitance, and you would think that to measure capacitance you could just monitor the number of circuits powered on, but again, in reality there is no way to do that on a CPU, so the fix is to use the number of CPU instructions to approximate it. And at last there is execution time, which is just the duration that circuits stay powered on, but you cannot monitor that duration either, because again there is no way to do so on the CPU, so the solution is to use CPU cycles to approximate it. Taken together, these three proxies stand in for the textbook dynamic-power relation, in which dynamic power scales with capacitance, the square of the voltage, and frequency over the time the circuits are active.

So our methodology is based on the principle that CPU usage is directly proportional to CPU power consumption, and we did not make this up: there is a big research community that has published many white papers, which I have put in the references at the end, on attributing CPU usage to power consumption. Feel free to dig into those white papers if you want to. What we're saying is that you can use software counters to measure the power consumption of hardware resources. For example, in this diagram, let's say you have a Kubernetes cluster with two workloads running, and each workload has one pod. The first pod has one container and consumes 10% of the CPU, which means it is attributed 10% of the CPU power consumption. The second pod has five containers and uses 50% of the CPU, which means it is attributed 50% of the CPU power consumption. So essentially, power consumption is attributed according to the resource usage of processes, containers, and pods.
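To make that attribution rule concrete, here is a minimal Python sketch of the idea. This is not Kepler's actual code; the function and pod names are hypothetical.

```python
# A minimal sketch (not Kepler's implementation) of proportional power
# attribution: each pod is attributed CPU power in proportion to its
# fraction of total CPU usage. All names here are hypothetical.

def attribute_power(cpu_power_watts: float, cpu_fraction_by_pod: dict) -> dict:
    """Attribute CPU power to pods in proportion to their CPU usage fraction."""
    return {pod: cpu_power_watts * frac for pod, frac in cpu_fraction_by_pod.items()}

# The example from the talk: one pod at 10% CPU, one pod at 50% CPU.
print(attribute_power(100.0, {"workload-1-pod": 0.10, "workload-2-pod": 0.50}))
# -> {'workload-1-pod': 10.0, 'workload-2-pod': 50.0}
```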
With that, let's look at the projects in our sustainability stack. The first project is Kepler, which reports the energy consumption of various hardware resources. The second project is the Kepler model server, which trains regression models for predicting the energy consumption of workloads. And at last we have PEAKS, a scheduler that takes information from both Kepler and the Kepler model server to schedule workloads on the nodes that best fit the criteria.

Okay, now let's look at the project Kepler, which is short for Kubernetes-based Efficient Power Level Exporter. Quite a complicated name. Kepler uses software counters to measure the power consumption of hardware resources and exports it as Prometheus metrics. One of the main goals of Kepler is to measure per-pod energy consumption. So Kepler, in short, provides the capability to report the energy consumption of different hardware resources like CPU, GPU, and RAM; it supports bare metal as well as clouds; and it uses the cloud-native stack, like the Prometheus metrics exporter and Grafana, to render this information in a very user-friendly dashboard. Kepler is also very lightweight, because it uses eBPF to minimize its own computational energy. You do not want a program that calculates the power consumption of your cluster to itself be very power intensive; you want it to have a bare-minimum footprint, and that's why we used eBPF to collect this information. And at last, it uses machine learning models to estimate power consumption: the model server trains models tuned to the software stats with respect to power consumption. Again, we followed and studied various white papers, which I have attached in the references, that support this approach.

So let's look at the architecture of Kepler. Here we have a layered architecture, and we'll take a bottom-to-top approach, starting with the bottom-most layer, the data collection layer. Kepler collects data with eBPF programs that attach to Linux kernel tracepoints and performance counters to gather information such as process ID, cgroup ID, CPU cycles, CPU time, CPU instructions, and cache misses. This information is pushed to user space, the data aggregation layer, in conjunction with other stats from cgroups, the GPU, and the hardware monitor. Kepler then exports these stats as Prometheus metrics, and the model server uses this information to train models that establish the relationship between the energy consumed by a pod and its software stats.

Now let's talk about the Kepler model server. One of the core goals of Kepler is per-pod energy consumption prediction in a Kubernetes cluster. This prediction is done by establishing a relationship between performance counters and energy consumption, as I explained before. To be able to do that, we need to provide reliable and accurate machine learning models which can be used to accurately predict the energy consumption of Kubernetes workloads given the performance counters. The Kepler model server is implemented using TensorFlow Keras and Flask, and the main reason for using Keras over other libraries like scikit-learn is that Keras operates more efficiently when it comes to deep neural networks.

Next, let's talk about the models within the model server. Currently the Kepler model server implements two linear regression models. The first is a CPU core energy consumption model, based on features like the CPU architecture, the current number of CPU cycles, the current number of CPU instructions, and the current CPU time. The second model aims to predict DRAM energy consumption, based on features like the CPU architecture, current cache misses, and current resident memory. Both of these models are based on supervised machine learning.
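To give a feel for what such a linear regression model looks like in Keras, here is a minimal sketch; the three features, the synthetic training data, and the layer setup are illustrative assumptions, not the model server's actual code.

```python
# A minimal sketch of a linear regression model in Keras, in the spirit of
# the CPU core model described above. Feature choice and data are illustrative.
import numpy as np
import tensorflow as tf

# Illustrative features: CPU cycles, CPU instructions, CPU time.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(1),  # a single linear unit == linear regression
])
model.compile(optimizer="adam", loss="mse")

# Placeholder data standing in for scraped Kepler metrics.
X = np.random.rand(1000, 3)
y = X @ np.array([0.5, 0.3, 0.2]) + 0.1  # synthetic linear energy target
model.fit(X, y, epochs=5, verbose=0)

print(model.predict(X[:1], verbose=0))  # predicted energy for one sample
```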
Now that we have talked about the models, let's see how we train them and how we use them for prediction. In the diagram, you can see that there are two nodes, and each node has a power consumption estimator agent. Kepler exposes performance counters that come from these power consumption estimator agents, which sit on all the nodes within the Kubernetes cluster. Each agent exposes the node energy metrics and the performance counters to the Kepler metrics collector. As you can see on the top left, the Kepler metrics collector then aggregates this information and exports it as Prometheus metrics. The Kepler model server scrapes the metrics and converts the Prometheus metrics into sufficiently large TensorFlow datasets, which are used to train and retrain the DRAM and CPU core energy consumption models. Metrics like the root mean square error and the R-squared value are used to check whether the models are acceptable for Kepler to use. Once the models are deemed fit, they are exposed to Kepler for prediction via HTTP endpoints; for now we have used Flask for that. And once the models are available at the HTTP endpoints, Kepler uses the Kubernetes pod performance counters to make predictions for the energy consumption of individual workloads.
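As a concrete illustration of the scraping step, here is a minimal Python sketch that pulls a Kepler metric from Prometheus over its standard HTTP API. The metric and label names are assumptions that may vary by Kepler version; the /api/v1/query endpoint itself is standard Prometheus.

```python
# A minimal sketch of scraping a Kepler metric from Prometheus, the first
# step of the training pipeline above. Metric/label names are assumptions.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumes a port-forwarded Prometheus

def scrape(query: str) -> list:
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Hypothetical per-container energy metric exposed by Kepler.
for series in scrape("kepler_container_joules_total"):
    print(series["metric"].get("pod_name"), series["value"])
```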
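And here is a minimal sketch of the train-validate-serve loop just described: a stand-in model, an acceptance check on the error metrics, and a Flask endpoint. The threshold and the endpoint path are illustrative assumptions, not the model server's actual values.

```python
# A minimal sketch of "train, check RMSE/R-squared, then serve over HTTP".
# The threshold and endpoint path below are illustrative, not Kepler's.
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

# Stand-in model; in reality this is retrained from scraped Prometheus metrics.
model = tf.keras.Sequential([tf.keras.Input(shape=(3,)), tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
X, y = np.random.rand(500, 3), np.random.rand(500)
model.fit(X, y, epochs=3, verbose=0)

# Acceptance check: only expose the model if its error metrics look sane.
pred = model.predict(X, verbose=0).ravel()
rmse = float(np.sqrt(np.mean((y - pred) ** 2)))
r2 = 1.0 - float(np.sum((y - pred) ** 2)) / float(np.sum((y - np.mean(y)) ** 2))
assert rmse < 1.0, "model rejected"  # illustrative threshold

app = Flask(__name__)

@app.route("/model/cpu-core", methods=["POST"])  # hypothetical endpoint path
def predict():
    features = np.array(request.json["features"], ndmin=2)
    return jsonify({"energy_joules": model.predict(features, verbose=0).ravel().tolist()})

# app.run(port=8100)  # hypothetical port
```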
Okay, now that we have talked about Kepler and the Kepler model server, let's see the last project in our sustainability stack, which is PEAKS, or the Power Efficiency Aware Kubernetes Scheduler. It has a workflow similar to the Kepler model server's: it obtains thermal, cooling, and power consumption metrics from Kepler. In the same fashion as before, these metrics are exported by the power consumption estimator agents, collected and aggregated by the Kepler metrics collector, and then exposed to Prometheus, and PEAKS downloads these metrics from Prometheus. As of now, as I mentioned, we only have two models, the DRAM and the CPU core models, but as we work on PEAKS we are also going to implement four models that PEAKS can use to obtain the appropriate power, energy, and carbon estimates from the model server, and it uses these metrics in its scheduler plugin. We are at a very early stage of development of PEAKS, and we are still conducting a lot of experiments on which metric is best suited to find the right node. We are considering various tradeoffs, such as energy efficiency versus the total energy consumption of the cluster, and load packing versus spreading. So this project is still under active work and we are figuring things out, but the basic concept is similar to what I explained before: we will have models used for machine learning prediction, and the same pipeline we set up for Kepler and the metrics exporter will be reused to capture node stats, which will later be used by the scheduler plugin to figure out the best node to schedule those tasks on.

So let's get to the demo. I will show you how to install Kepler on a Kubernetes cluster. Before you do that, you have to make sure the operating system provides support for cgroup v2, that it provides the kernel headers required by eBPF, and obviously that the kernel itself supports eBPF. For the demo setup, I will be using a MicroShift cluster. MicroShift is a lightweight OpenShift specifically optimized for the edge, and the reason I'm using it for this demo is that it's very simple to start up: you can just use systemd to start and manage MicroShift on an RPM-based host. I will also be using the monitoring stack provided by the kube-prometheus operator, and for the dashboard I'm using Grafana, which is installed by the kube-prometheus operator itself.

As you can see, I have a MicroShift cluster running, and there is nothing else running on this cluster as of now. Next, I'm going to install the kube-prometheus operator, which also deploys Grafana, so it's pretty straightforward: I'm applying the kube-prometheus operator manifests, and it will take some time. Okay, as you can see, the monitoring manifests have been applied, and now it's time to install Kepler, so again I will just apply the Kepler manifests. Okay, the deployment has been applied, and it has created a bunch of resources. It deploys Kepler as a DaemonSet, and it also exposes a service through which Kepler exposes the metrics to the Prometheus endpoint. Other than that, it also has a couple of supporting RBAC resources deployed in the namespace.

Okay, now you can see that the Kepler pod is running. I will go back to the monitoring namespace to see the Grafana service, and you can see that the Grafana service is running on port 3000. So to see the Grafana dashboard, all I have to do is port-forward 3000 to my host machine, and once that is done, you can see the Grafana dashboard. In the first quadrant, you can see that the namespace I'm observing is monitoring and the pod is Grafana; the first quadrant gives you the pod's current energy consumption, which you can see as the line graph there. In the fourth quadrant, you have the various pods running inside the monitoring namespace and each pod's total energy consumption, and it is very easy to identify which pods are consuming the most energy. In the third quadrant, you have the various namespaces running inside the cluster and the total pod energy consumption of each namespace in the cluster.

Now I'm going to run a simple Python app that uses a Monte Carlo algorithm to estimate the value of pi, and let's see what kind of visualization we get from it. So here is my Python app, running inside the Python app namespace, and the pod is the Monte Carlo Python app pod. You can see that the current energy consumption of this pod is shown in the first quadrant, and if you look closely, the total current energy consumption almost overlaps with the CPU core energy consumption, which is quite obvious because this application is very CPU-bound: it uses a lot of CPU, whereas the DRAM energy consumption is almost negligible. One of the cool things about Grafana is that you can view these visualizations in various ways; for example, you can also see this as a time series line graph.
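For reference, the kind of CPU-bound load generator used in the demo looks roughly like this; a minimal sketch, not necessarily the exact demo app.

```python
# A rough sketch of a Monte Carlo pi estimator: sample random points in the
# unit square and count how many land inside the quarter circle. Looping
# forever gives Kepler a sustained CPU-bound load to measure.
import random

def estimate_pi(samples: int) -> float:
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    return 4.0 * inside / samples  # area ratio approaches pi/4

if __name__ == "__main__":
    while True:
        print(estimate_pi(10_000_000))
```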
That would be all. Thank you so much for watching my presentation. I hope you enjoy the rest of the conference, and have a happy weekend. Thank you. Bye bye.