Hi, so we are going to talk about sustainability in the container-native way. My name is Huamin Chen; I'm working in Red Hat's emerging technology team. My name is Chen Wang; I'm from IBM Research, where I'm a research staff member. The agenda for the day is that we're going to talk about power management and power measurement in theory, and how we do it in practice. We'll introduce our project for power measurement, called Kepler; we'll talk about what we're going to use this project for and what's coming up in energy conservation; we're excited to share what has been happening in the community; and Chen will give you a demo of the project.

As a refresher, energy measurement is not something new; it has been going on for many, many years, and we have both theory and methodology. In reality, we care more about the question of how we are going to manage energy consumption, directly or indirectly, especially in an operating system that is shared by a lot of running containers and processes. And then comes the second question: once we have the energy measurements in place, how can we attribute this energy consumption to different processes or containers fairly and accurately? So it's about those two questions.

If you go to the wiki page, you'll find a very exotic explanation of energy measurement in the digital world. The gist is that there are three components of CPU energy consumption: dynamic power, short-circuit power, and leakage. You can go deeper into the wiki page for the details, but what matters here is that dynamic energy, which is consumed by running CPU instructions, is the major component of CPU power consumption. There is also the leakage current, which is the current needed to keep the circuits powered even when nothing is going on. The dynamic power is determined by three factors: the capacitance, which reflects the number of circuits that are switching; the voltage, which in a real data center is not something you can tune, so we treat it as static; and the frequency at which the circuits operate. So keep in mind that there are two things, the capacitance and the frequency, that we want to get a hold of when we come to power measurement.

Now we come to the methodology. If you are an electrician, you can attach power meters to the circuits and measure the power of each circuit, the durations, and the frequencies. In the software world, that is not something we can do, so we have to think of something creative. The way we do the measurement in software is to use certain things from the CPU: hardware counters and system measurements. For frequency, we use the average frequency, which we can read from a CPU counter. Specifically, if you are running x86 CPUs, there is the APERF counter, from which you can read the average frequency over a time interval; that is what we use for our estimation. For capacitance, we do not have a great way; it is essentially impossible to monitor the number of circuits that are switching. So we do something creative and use the CPU instructions instead, because CPU instructions use the circuits on board in a very deterministic way.
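For reference, the textbook relation behind those three factors is the standard CMOS dynamic power equation (this is general background, not a formula specific to Kepler):

    P_dyn ≈ α · C · V² · f

where α is the activity factor (the fraction of circuits switching), C the capacitance, V the supply voltage, and f the clock frequency. Since V is effectively fixed in a data center, a software estimator only needs proxies for the α·C part, which the instruction counts stand in for, and for f, which comes from the average frequency counter.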
So by measuring the number of CPU instructions, you get a good correlation with the capacitance. Finally, we also want the execution time, the time the circuits stay on. Since we do not know exactly how long the circuits stay on, we use something indirect: the CPU cycles. As long as the CPU is running, the number of cycles we count is a good measurement of the execution time of the circuits.

Now, with the measurements in mind, the second question we're going to answer is how you attribute those measurements to the consumption of each process or container. If you are familiar with this space, you have seen a lot of software projects and packages trying to answer this question, and they come at it from different perspectives. Everyone agrees that we have to use utilization as a proxy and attribute the energy consumption based on utilization, but that is the end of the agreement. The disagreement is in how to measure the utilization: based on time, based on actual counting at the kernel level, or based on measurements of the whole stack? Our attribution methodology is to measure the execution time at the kernel level. Whenever a process gets scheduled to run, we start counting; once the process is scheduled out, we take a snapshot, and we accumulate those stats as the basis for our estimates. In this way, we can fairly and accurately attribute the energy to processes without loss of transparency.

Putting things together: the way we measure the physical CPU, memory, or any digital device is to use these methodologies as approximations of real-world energy measurement, and we use models derived from machine learning to improve our accuracy. When we come to attribution, we use the utilization actually consumed by the process or container to estimate the energy consumed by that entity without losing accuracy. This probably sounds simple, but these are the fundamental ways in which all the projects differ from one another.

Now let me introduce the project we have been working on collaboratively. It's called Kepler: Kubernetes-based Efficient Power Level Exporter. If you are a science hobbyist, you'll appreciate the name: it relates to the astronomer Johannes Kepler, who gave us the laws of planetary motion, and also to NASA's Kepler telescope. It basically means the same thing: you have to have some way to measure precisely and proportionally in the real world.

Kepler is focused on three things, and we want to do those three things well. The first is reporting: we collect information from the system about the GPU, CPU, RAM, and the other hardware resources that consume the energy we are trying to measure. We do this both on bare metal and on virtual machines, so if you are running machines on, for example, the hybrid cloud, you can have some of those energy stats come up as well. And because we live in a Kubernetes world, we use the CNCF ecosystem: Prometheus is the way to go, so we export the stats to Prometheus.
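To sketch the attribution idea concretely (this is our simplified reading of the proportional scheme, not an exact formula from the talk): if E_node is the energy the node consumed over a sampling interval, and eBPF gives each container's accumulated CPU cycles over that same interval, then

    E_container ≈ E_node · (cycles_container / Σ cycles_all)

and the same pattern can be applied per component (CPU, DRAM, GPU), using that component's own utilization metric as the weight.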
The way we collect this information, the performance counters, the CPU cycles, and the CPU time, is through eBPF. This gives us the ability to attach to kernel trace points and collect snapshots of the performance counters without incurring a lot of overhead. Lastly, we use science-based regression models to improve the accuracy of our energy estimates. And because we are in Texas now: the first paper I read about using performance counters to estimate power consumption at different levels of the system was done by professors from the University of Texas, about ten years ago. So it's an interesting coincidence that we come here to give this presentation.

The architecture reads bottom-up. At the bottom, we collect data at the kernel level using eBPF. The eBPF program attaches to the process switch function, collects certain information, takes a snapshot of the performance counters, computes the deltas, and pushes that information up to the data aggregation layer. On the left side of the data aggregation, you have the energy accounting stats: on x86 platforms, you have RAPL; on other CPU platforms, you could have other CPU energy counters, and we want to collaborate with vendors to see how we can get that information from them. On the right side of the equation, you see the different stats we collect at the aggregation layer: stats from the kernel via eBPF, stats from the cgroup file system, and stats from GPU libraries. Specifically, in this case, we use NVIDIA's GPU library to get the GPU power and process data. We also collect data from the hardware monitors, information like frequencies, and we'll have information like temperatures as well in the future. We put everything into a regression model, which I'll explain later, that is derived by the model server. We use simple regression models, like linear regression, over the stats; in fact, a lot of the research suggests linear regression is very powerful, and it gives us a lot of accuracy in energy estimates. At the data presentation layer, we package the data in formats that are useful. Prometheus is what we currently support, but this could also work with OpenTelemetry; we are going to investigate that later.

Kepler itself is not able to figure out which models should be used for the power estimates, so we use a thing called the model server to make that happen. Kepler, as a process, exports all these metrics, the power stats, the RAPL data, the performance counters, the cgroup file system stats, information like that, through Prometheus. On the other side of Prometheus is the model server. It takes these stats, runs continuous online learning, and builds up the model. Once the model satisfies our requirements, Kepler downloads it through the model server's endpoints and uses it for pod-level power consumption estimates, and those estimates are exported through Prometheus again. So you see there are two export points: one feeds the model server, and the other is for reporting the pod-level energy consumption. And with that, I will turn it over to Chen for the demo of running Kepler and the dashboard. So, thanks for having me.
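Before the demo, a rough picture of what the model server produces may help. A minimal node-level linear model could look like the following; the actual feature set and training details in Kepler's model server may differ, so treat this as an illustration only:

    P_node ≈ β₀ + β₁ · (instructions/s) + β₂ · (cycles/s) + β₃ · (cache misses/s)

where the coefficients β are fitted online against RAPL readings as ground truth, and the fitted model is then downloaded by Kepler and combined with the per-process counters to produce pod-level estimates.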
So basically, today we will show a very, very simple demo: running the Kepler exporter in a MicroShift cluster installed on a bare-metal machine that has an NVIDIA GPU card. The necessary prerequisites are that you install the NVIDIA GPU operator, of course, and then we install Kepler and kube-prometheus, to visualize the Prometheus data and load our Grafana dashboard. So let's start with a brand-new MicroShift cluster on the bare metal. And sorry, I pre-recorded it, because we needed to wait a certain time for the workload to show the energy data.

So now we are in a brand-new MicroShift cluster. The first thing we do is deploy the Kepler exporter. We just go to GitHub, clone the Kepler repo, and there is a deployment YAML file under the Kubernetes deployment directory. What it has is a bunch of RBAC rules, service account configurations, and the daemonset YAML for the exporter. Here, in the arguments, we enable GPU as true, so later we will show not only the energy consumed by the CPU but also the energy consumed by the GPU. Then we go ahead and apply the Kubernetes deployment manifest, and all the cluster roles and service accounts are created. Then we see the Kepler exporter pod running in the monitoring namespace, and we can first check the available metrics from the log messages. Here we can see, for the different pod names, performance counters like CPU time, instructions, disk writes and reads, et cetera. Based on those hardware counters, Kepler has a model to estimate how much energy a pod consumes per component, such as CPU, GPU, and DRAM.

Another way to check the availability of the Kepler metrics is through its endpoint, which is on port 9102. Here we export all the CPU scaling frequencies, and the pod CPU energy current, meaning the energy consumed per second by this pod. We also show the DRAM, CPU, and GPU energy as different components, and of course the total energy consumed per pod.

Then we need to install the kube-prometheus project. We already cloned that project here, so we just go into the folder. If you go to the kube-prometheus GitHub repo, there are three lines of installation commands you can use; you just copy them, and they set up the whole necessary Prometheus stack. So we do the same here: we go into the kube-prometheus folder, and it takes about two seconds to install everything. Now Prometheus is also installed in our cluster.

What we do next is go back to the Kepler project and configure the ServiceMonitor custom resource, which configures how Prometheus scrapes the Kepler daemonset. It is a simple CR: it lets you choose the application selector labels and the scraping interval. We go ahead and create this CR in the cluster. Because kube-prometheus is already installed, the Prometheus operator automatically recognizes this ServiceMonitor CR. We double-check that the ServiceMonitor is available in the monitoring namespace. Then we validate via the Prometheus UI. What we can do is a simple SSH tunnel to the bare-metal machine, plus a port-forward of the Prometheus endpoint, to reach it from our localhost directly. This SSH tunnel forwards the remote 9090 to our local 9090.
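For readers following along, the steps above boil down to commands roughly like these. The repo URL is real, but the manifest path, resource names, and ports are recalled from this demo and may differ across Kepler versions:

    # clone the Kepler repo and apply the exporter manifest (path may vary by version)
    git clone https://github.com/sustainable-computing-io/kepler
    kubectl apply -f kepler/manifests/kubernetes/deployment.yaml

    # inspect the raw metrics on the exporter endpoint (port 9102, as in the demo)
    kubectl -n monitoring port-forward daemonset/kepler-exporter 9102:9102 &
    curl -s localhost:9102/metrics | grep -i energy | head

    # reach the in-cluster Prometheus from a laptop: SSH tunnel, then port-forward
    ssh -L 9090:localhost:9090 user@bare-metal-host
    kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090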
And then we have another port-forward command to forward the Prometheus endpoint to 9090 on the remote host. So then let's go check the Prometheus UI. Here we have all the energy consumption metrics available. We deliberately grouped the pod-level metrics into one object, so it includes all the necessary performance counters and also the energy consumption metrics for the different components of a particular pod. In this way, you get all the data in one query (example queries against the Prometheus API follow at the end of this walkthrough), and it can later be used by all types of resource management optimizers, like autoscaling and scheduling; if you have your own particular controllers, that data is available to them as well.

Let's take another example. Here is the current energy per pod, and if we choose a particular pod here, the Alertmanager in the monitoring namespace, part of the monitoring stack, we can see some data. Because Kepler has only just started running, it has less than one minute of data.

What we do next: in the Kepler project, we also have several dashboards you can try out. Similarly, we SSH-tunnel to the remote host and port-forward the Grafana dashboard to our localhost, and then we show an example of how to import our Kepler dashboard into the default Grafana. So now we are tunneling through the remote port 3000, and we can check the default Grafana dashboards first. The default password is admin, so we need to change it when we first log in. Here are some default Grafana dashboards, available in kube-prometheus, that show the resource utilization of your cluster. Next, we import one of our Kepler dashboards. I already cloned the repo onto my local laptop, so I just choose the dashboard file from there, and the only thing you need to configure is the Prometheus data source. And then it works. This is a simple dashboard to visualize the Kepler metrics. At the top, we convert the energy consumption into the equivalent pounds of coal, petroleum, and natural gas we would burn for a particular application, and you can choose the application by namespace.

Next, we deploy two types of workload. One is a CPU-intensive workload, which basically generates random numbers; it is a simple command line, cat /dev/random, and we deploy two replicas, with just the Ubuntu image. Similarly, we test a GPU-intensive workload, a vector addition workload we got from an NVIDIA repo; it uses only one GPU. Now all the pods are running, so we go back to the Grafana dashboard. First, let's change to a smaller time window; this is still the monitoring app. Here I cut about one minute of video to wait for the workload to run. The default pods are on, and we see there is still just a small amount of data, similarly for the GPU; you can already see the less-than-one-minute of data. Here, for the GPU workload, the major energy consumption is from the GPU, and if we switch to the CPU-intensive workload, the GPU and DRAM usage is basically zero and the CPU is 100% of the total. So that is just a simple demo of how Kepler visualizes the energy consumption per application for the different components. The bottom two charts show the per-application power usage in this namespace.
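For reference, the kind of per-pod queries behind these panels can also be issued against the Prometheus HTTP API directly. The metric and label names below are placeholders in the spirit of the demo's pod energy metrics; substitute whatever names your Kepler version actually exports on its /metrics endpoint:

    # current power per pod, as a per-second rate over the last minute
    curl -sG localhost:9090/api/v1/query \
      --data-urlencode 'query=sum by (pod_name) (rate(pod_total_energy[1m]))'

    # the same idea, restricted to the GPU component
    curl -sG localhost:9090/api/v1/query \
      --data-urlencode 'query=sum by (pod_name) (rate(pod_gpu_energy[1m]))'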
And the bottom-right one shows, over time, how the energy consumption builds up for different namespaces. What is also important: we use the default Grafana dashboard to show the resource usage of the Kepler daemonset itself. You can see the received and transmitted packets are only 70 to 100 packets per second, and the CPU usage is only 0.02 cores. And memory, let's go back to the memory a little bit: the memory is about 150 megabytes. Yeah, sorry, here: it is always below 150 megabytes. OK, that's all of the demo, and let's go back to the presentation.

So in summary, the data Kepler is able to export right now includes all those CPU statistics: the energy consumption, the CPU time share, the frequency of the CPU, and the related hardware performance counters, like the number of instructions per container. For memory it is the same: it shows the cache misses per pod and the resident memory size per pod as well. For GPU, we can show the energy consumption and resident memory size, as well as the I/O stats. In the future, we want to integrate this energy consumption with the carbon intensity APIs, so we not only show the energy consumption per pod but also the carbon footprint per application. And to take further advantage of Kepler, we also have ideas like using this data to directly do sustainable resource management, including autoscaling and scheduling, to further optimize cluster and resource management.

So we welcome everyone to join us in this sustainable computing community, and we welcome all comments, issues, and PRs, both in the repo and on the Slack channel. And we want to send specific thanks to Intel and Weaveworks for their good comments, suggestions, and ideas. That's all of the talk. Thank you.

There were labels being emitted that were averages. Is that the case? Maybe I misread it. I think it is. Wouldn't the cardinality be very high?

Oh yeah, that's a very good point. So when we build the... could you repeat the question?

Yeah, so the question: when we have some of the labeled counters, some of the counters have a certain level of correlation...

Well, in some of the counters I was seeing averages. Let's go back to the screenshot. Oh, no, not the screenshot, I think it was in the demo, actually. I think some of the labels were averages, if I was reading it correctly.

OK. Here? Yeah. The energy per pod, I think it was, or something like this one. OK, the average CPU frequency? Sorry, almost there. Yeah, so this one. That's it? Well, maybe this one. Yeah. OK, the average CPU frequency. Wouldn't the cardinality on that be very high? Because you're emitting the average as a label, so basically any time it doesn't match, you emit a new metric series.

So for the new metric: we take this as an average over the sampling period, and the average is across all the cores, so it is an average over the cores and over time. If you are looking at the scale, the absolute number itself is probably not something you are interested in, but the range is: each of the CPUs has its own range, and as long as we stay within a certain range, the energy is about the same.

So it's not the absolute number?

Right, it is not the absolute number.
We are talking about the frequency range. My understanding is that when you tune x86, you have certain scales, like 10 or 20 steps you can tune to, and as long as you fit into one of those steps, it is going to be a similar frequency. The absolute numbers would be something very hard to get anyway: you have so many cores over so long a period that there is going to be a lot of variation.

So based on the scale, then you kind of...

Yeah. But still, the next question is whether you want core-level granularity or just average, node-level granularity. That is going to be scientific research as a follow-up. Yes; Scott, this gentleman first, if you do not mind. Yeah, go ahead.

I was wondering about the linear model. Can you explain more about that?

Yeah, so the question is whether the linear models are valid and how we are going to test this. The answer is that we learn from the scientific research papers. In the research I just mentioned from the University of Texas, the way they did it was to attach power meters to each of the components, the CPU, the memory, and the disk. They used linear models to estimate the power consumption, compared with the power meter readings, and came up with an accuracy. In the ideal world, we would do the measurements the same way, but we do not have that expertise in the lab. So we just assume that the readings we get from the RAPL counter are good enough, and that is our ground truth when we come up with the estimates.

What values to emit? The model server needs input from the processor in order to know what to base that on, and I'm curious what the bootstrapping process is.

Yeah, so the question was how the model server and Kepler work together, what the proper bootstrapping process is. That is a very good point, and it is still under construction. The ideal way, in my opinion, is that we export all the node-level stats first, and those are picked up by the model server. The model server comes up with the regression model and then tells Kepler to go check the model server's endpoints, download the model, get the coefficients, and estimate the power consumption. That is the bootstrapping process. But once that bootstrapping process is finished, we have a model in place, and that model can hopefully be ported to other environments as well. So when another Kepler process starts, it already has a basic model to start with, and during the learning process it keeps emitting the node-level stats, and the model server continues to improve on this initial model and comes up with a better model. So I understand this is going to be a lot of learning, training, and validating, and there is a lot of machine learning expertise to build into this. We are still developing this process, and if the community has better ideas or better expertise that can help us, that is very much appreciated.

So regarding the question of the model change versus the other change, the difference between the... Sorry, I did not quite get your question at first; if you don't mind, just repeat it.

What if what we thought was, say, seven instructions equals ten watts is actually seven watts, versus the process actually just using three fewer watts than it did previously?

Oh yeah, so let me just rephrase the question.
The question is how we are going to tell the difference between the process actually using fewer watts and a change in conditions. My gut feeling, based on this research again, is that if the counters and the linear regression models are sufficiently accurate, they should be resilient to changes in conditions, for example the power profile drifting over time. But I take your point that certain things are not built into the regression model yet, even in these scientific research papers. So that is an investigation we have to carry on in the next phase. There are uncertainties. I trust your intuition, and I do believe there is certain knowledge we can still leverage, but whether that brings us to the end game is still to be known and to be investigated, and hopefully there will be some resolutions or methodologies we can establish in the future. Yes?

If the model gets updated, is there a way to distinguish between a decrease in power because the model was reflecting the wrong data before, versus an actual power decrease?

I see. Yeah, so this is about model accuracy versus the ground truth. I think this comes down to the validation part. When we train a model, we also validate, right? So we take, say, 100 samples for training, and some of them are held out for validation. If the validation does not pass, we are not using that model at all. We only use the model once it passes the validation phase, say 95% accuracy on validation. If the model is broken and only gives, say, 80% accuracy, then we do not use its predictions as the end result.

Yeah, I can add something here. In online training, what additional ground truth data can you get? If you already have a very accurate baseline model, you already know what the core containers are consuming, the default containers in Kubernetes. Then any additional new application you add gives you new data, and that data serves as additional data you can use to improve your model. Does that answer your question?

We still have half an hour until dinner, so feel free. Do you have a question from the virtual audience? All right. So we solved the sustainability issue for the world; we're good. Thank you. Thank you.