Okay, this is a talk about observability using Kepler. Kepler is a tool to measure the energy consumption of applications; the work started as a partnership between IBM and Red Hat. I'm going to introduce the main architecture of Kepler, and Sally has been working on enabling OpenTelemetry to export Kepler metrics, especially for the edge use case.

As a quick outline of what to expect from this talk: some motivation, then what Kepler is, then what we have been doing with OpenTelemetry and Kepler for edge computing, and a short demo.

First, some motivation for why we measure energy consumption. Climate change is an important topic, and reducing CO2 emissions matters. Even for people who are not primarily concerned about the climate right now, reducing energy consumption is important for data centers because of cost: data centers account for a large share of the energy consumption, and the CO2 emissions, in the world. On top of that, some governments, especially in the European Union, are pushing companies to report the energy consumption of their AI workloads, so measuring energy consumption is becoming more important than before. The problem, then, is how to measure the energy consumption of workloads on public and private clouds. This is becoming important to understand, and once users understand their energy consumption, there are techniques to minimize it, and with it the CO2 emissions. Note that CO2 emissions are not strictly linear with energy consumption, because they depend on the energy source.
And then: how do we attribute energy consumption to containers or pods on the cloud, and how do we aggregate it? Kepler is the project doing that.

So what is the Kepler project right now? It recently became a CNCF Sandbox project. It is, again, a project to measure energy consumption, to help minimize the environmental impact of workloads on the cloud, and to identify opportunities to reduce CO2 emissions. Kepler is the main project, but there are related side projects. The main one is the power exporter, which is the main repository. We also have the model server, which creates power models for public clouds where we don't have real-time power metrics; the model database; the operator; CLEVER, a project that dynamically changes the CPU frequency to save energy for workloads; and PEAKS, a Kubernetes scheduler that is aware of the energy consumption and CO2 emissions of workloads. For example, different times of day and different locations have different CO2 emissions, so a scheduler that is aware of that can optimize placement across a big cluster.

The Kepler community is growing. There are a lot of people from IBM and Red Hat, where the project started, and a lot of folks from Intel contributing to the discussions. The list here only shows people contributing PRs to the main repository; there are more people contributing to the project's discussions who are not listed, so sorry if I didn't include your name. And again, several companies are helping with the open-source work.

About the name: Kepler stands for Kubernetes-based Efficient Power Level Exporter. It can also run standalone outside Kubernetes, but in the beginning it was created only for Kubernetes, which is why Kubernetes is in the name.
So Kepler collects software and hardware counters to measure resource utilization at the process level. We also collect the power consumption of the hardware on bare-metal nodes, and in environments where we cannot collect real-time power consumption from the hardware, we use power models. We export these metrics to Prometheus, and this talk is actually about a use case for also exporting metrics to OpenTelemetry.

So Kepler's first premise is reporting power consumption. Strictly speaking it reports energy consumption, but if you divide the energy by time, you have power. And it reports at finer granularities: not only the total node power, but also the CPU, DRAM, and GPU power. It works on bare metal and also on VMs, and supports Prometheus and OpenTelemetry.

The idea is that it has low overhead. I have another talk on Wednesday with much more detail about performance, but as a general introduction here: it has low overhead. It uses eBPF to collect metrics from processes, and we have measured that Kepler's own resource utilization stays low at scale. The overhead it introduces in the applications it is monitoring (monitoring can introduce some overhead) is also low. As I mentioned, for scenarios where we don't have access to power metrics from sensors, we have the power models, and we use machine learning to create them.

Very quickly, the base assumption of how we distribute power is as simple as this: if a process is using 10% of the CPU, then 10% of the CPU's energy consumption is attributed to that process. Of course, CPU utilization can be determined in different ways. It can be CPU time, it can be...
If we have hardware counters, it can be instructions and cache events; there are a lot of components that determine CPU utilization. But in general, the point is that CPU utilization is linearly related to power consumption.

Okay. So this is the Kepler architecture. Kepler is natively integrated with Kubernetes: we watch the Kubernetes API server to get information about the pods, so that we can use it to aggregate metrics. From the API server we get the pod names and container IDs. Inside each node, Kepler runs as a DaemonSet, monitoring all the processes, and it gets the process ID, container ID, pod name, and so on. We export these metrics, and then the user can aggregate the power or energy consumption using the Prometheus metrics. As I mentioned, we collect the eBPF metrics, and we collect the energy consumption of the node: on a bare-metal node we can read the power from the sensors; on a virtual machine we need to use a trained power model. To create the power model we have the model server, as I mentioned before. Using all this information, we attribute the energy consumption to each process and export it.

There are different ways to deploy Kepler. On bare metal we have access to the sensors, so that is the scenario where we can export the real metrics. On virtual machines, public clouds currently don't expose power metrics inside the VM; maybe in the future they will, and there is a third case where cloud providers could eventually make use of that. For now, on VMs we use the trained power models, which of course have some limitations. Whether a private cloud, or maybe a public cloud in the future, exposes power metrics depends on user demand: if users request it more and more, cloud providers will expose it.
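The proportional-share assumption described above (a process using 10% of the CPU is charged 10% of the CPU's power) can be sketched in a few lines. This is an illustration only; the function and field names are not Kepler's actual internals.

```python
# Sketch of proportional power attribution: each process is charged the
# same share of node power as its share of node resource utilization.
# Names here are illustrative, not Kepler's real implementation.

def attribute_power(node_power_watts: float,
                    usage_by_process: dict[str, float]) -> dict[str, float]:
    """Split node power across processes proportionally to their usage."""
    total = sum(usage_by_process.values())
    if total == 0:
        return {pid: 0.0 for pid in usage_by_process}
    return {pid: node_power_watts * u / total
            for pid, u in usage_by_process.items()}

# Example: a process with 10% of the total CPU usage gets 10% of the power.
shares = attribute_power(100.0, {"nginx": 10.0, "db": 60.0, "batch": 30.0})
print(shares["nginx"])  # 10.0
```

The same idea applies per component (CPU, DRAM, GPU): each component's measured power is split by that component's per-process utilization.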
And it's already possible to run Kepler on bare metal today, measure the energy consumption of the VMs, and expose that, so we can have a pass-through: the power metrics are exposed from the host into the VM, and when Kubernetes is running on top of the VMs, it can consume them too. The advantage is that this doesn't need prediction; it uses the real energy consumption from the bare metal, and it can also expose the idle power. I'll have more details on that in the talk on Wednesday if anyone is interested.

Okay, so without repeating too much: the basic power model is what we call the ratio model. The ratio is what I described with the 10% example: the process's resource utilization divided by the node's resource utilization, multiplied by the node's energy consumption. So 10% of the CPU utilization gets 10% of the energy consumption; that's the ratio power model.

We can also do estimation with a trained power model, using machine learning: we run a set of experiments on a bare-metal node, collect all the resource utilization and the energy consumption of the different resources, and do regression analysis. As simple as that. We try different algorithms, which have different accuracies, and the one with the best accuracy is stored online so that people can use it. Kepler is also flexible here, because a power model is specific to the hardware: each CPU model has a different power-consumption curve. We officially support some CPU models, but we cannot provide power models for every kind of CPU, so we ask the community to help create power models for different CPU models and architectures.
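The regression step described above can be sketched roughly as follows. Kepler's model server tries several algorithms and keeps the most accurate; ordinary least squares is used here purely as an illustration, and the feature columns and numbers are made-up examples, not the real training data.

```python
# Sketch: fit node power as a linear function of resource-utilization
# counters collected during training runs on a bare-metal node.
import numpy as np

# Rows: observations; columns: e.g. normalized CPU time and cache activity
# (illustrative features, not Kepler's exact feature set).
X = np.array([[0.1, 0.2], [0.4, 0.3], [0.8, 0.7], [1.0, 0.9]])
y = np.array([12.0, 25.0, 48.0, 60.0])  # measured node power in watts

# Add an intercept column and solve ordinary least squares.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_power(features) -> float:
    """Estimate node power (watts) for a new utilization sample."""
    return float(np.dot(np.append(features, 1.0), coef))
```

A model trained this way on bare metal can then be shipped to a VM, where only the utilization counters (not the sensors) are available.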
The Kepler model server, again, collects the metrics and is used for the case of VMs that don't have real-time power metrics, where we need to predict them somehow. It also provides the power models for workloads. For the model server, we have the CPU and DRAM energy consumption models, which receive parameters such as the CPU architecture; as I mentioned, that's important. If the CPU architecture is not available, it will fall back to a generic power model, but be aware that a generic power model doesn't necessarily fit all types of CPU, so the results must be taken with care. The model server runs alongside the main Kepler DaemonSet. For training, there is another daemon that collects metrics from Prometheus, does the training, and exports the power model; Kepler can then download this power model later and use it for estimation.

As I mentioned, Kepler is not only for Kubernetes. Especially for the edge case, it can run standalone: it can be installed, for example, via RPM on a Linux server on Red Hat machines, and deployed on bare metal or VMs in standalone mode to collect the power metrics and export them. Okay, so I'm going to hand the talk over to Sally now.

Yep. So yeah, I started working with Kepler at the edge. I was doing work on MicroShift, which is a low-footprint Kubernetes platform meant for the edge, running on Red Hat Device Edge. At the edge, Prometheus and Grafana are often too expensive, especially at the far edge, so I knew that OpenTelemetry would be a good fit.
OpenTelemetry can act as a stand-in for Prometheus, and also as a connection point to a central data hub that has your full observability stack with a Prometheus data source and Grafana. Kepler comes with a really wonderful Grafana dashboard. OpenTelemetry is a great project that has brought the whole observability community together in the CNCF, and a lot of integrations are possible. Like I said, it provides a vendor-agnostic way to collect metrics, and really all telemetry; you may funnel it off to an observability vendor, but you don't have to. At the edge, you might want to collect all of your Kepler data and then aggregate it at a remote central location where you can process and manage it.

The collector can run as a sidecar to the Kepler DaemonSet, which is how I run it in MicroShift, but it can also run as a container, with systemd, or as an RPM. It's really flexible; you all know this, you're at Observability Day. Currently, Kepler is instrumented for Prometheus, and in the collector config that I'll show, I use the Prometheus receiver. There is some recent work embedding the Prometheus receiver in the SDK so that you can use OTLP metrics directly, but for now it's the Prometheus receiver. The idea is to allow a single connection point from each of your edge devices to funnel your metrics up to the central observability stack, where it's more full-featured.

Centralized dashboarding is an initiative going on in the Kepler project now. So, we do have a demo; do you think it'll just play? Let's try it. It doesn't have sound, so I've got to watch it. I'm going to play it full screen, and then we just hit escape when it's done. Alright, gotta be fast.
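The pipeline described above (Prometheus receiver scraping Kepler at the edge, forwarding via OTLP to a central stack) might look roughly like this collector config sketch. The scrape target, port, and endpoint are illustrative placeholders, not values from the demo, and TLS settings are elided.

```yaml
# Sketch of an OpenTelemetry Collector config for the edge scenario
# described in the talk. All endpoints below are placeholders.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kepler
          scrape_interval: 30s
          static_configs:
            - targets: ["localhost:9102"]  # assumed Kepler exporter port

exporters:
  otlp:
    endpoint: central-hub.example.com:4317  # central observability stack

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```

The single `otlp` exporter gives each edge device one connection point up to the central stack, which is the pattern described above.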
So this is MicroShift, and this is what you get out of the box: some nice OpenShift-y stuff like storage, DNS, and the service CA operator, and it's OpenShift 4.13. Then I just have the one workload, Kepler, because that's what I was showing. Here I'm taking all of the manifests from a write-up tutorial I put together on Red Hat Emerging Technologies. I'm running the collector as a sidecar in the Kepler DaemonSet; it works really well. What's next? Oh, I set up mutual TLS, which is a bonus if you follow that tutorial.

Okay, so then I also set it up as an RPM running on a local, RHEL-based edge machine, and you can see I'm using the Prometheus receiver, and the exporter is sending off to a Kubernetes cluster that has a full-featured observability stack. And yep, there it is in all of its glory. It's beautiful; I love pretty colors, and that's why I love Kepler, because I love the dashboard. You've got to have that at the end. And you can see that all of the metrics were also explorable with the Prometheus data source. The last thing you want is for the tool that monitors your carbon footprint to be creating too much of a carbon footprint itself; it's good to keep it small.

Yeah, so as I mentioned, we have another talk about Kepler: the project update and deep dive. It will discuss what we have done this year and what we are going to do in the future, so if you are available, please join the next talk; it's in the main KubeCon track. Questions?

Sorry, she gave me the nod, so I just kind of went with it. So something I thought might be interesting to go into is the use cases. What's that? I think it's off. I think it's on. Do I just need to be really loud, or maybe taller? So I was going to say, I think what would have been really interesting is going into the use cases that you would actually see from IoT. I saw the dashboard for all of half a second; I saw there were CO2 emissions that you're looking at and that sort of thing.
But I would have been really interested to see what the use cases are for using OTel with IoT devices on the edge.

I can answer; I have an idea. Besides the goodwill of saving the earth, a lot of countries and governments are mandating that you report your carbon footprint, or carbon usage, as we call it; it's not exactly carbon, it's an estimation of carbon. So that's one reason, and I think it's one of the reasons why companies, enterprises, AI companies, are going to be mandated to do it. It can also save you money by optimizing your power usage and extending the life of your hardware devices, things like that. But mostly it's to save the earth. Did that answer your question?

Can you make a relation between the power usage and the green energy from the cloud providers? I mean, is it greener to run on this provider or on that provider; can you make that relation?

Yeah, we can do that; it just depends on the information the cloud providers are providing. There are some possible estimations: we know the data-center locations, so we can know where the application is running, and there are some assumptions we can make from that. But the real energy source of a given data center is not open right now; the cloud providers are not exposing it. So we can work with assumptions; we're not 100% accurate, but we can make assumptions. For example, we know that some locations are more energy-efficient, say colder environments, things like that. And there are some websites in the US where we can see, more or less, the energy source for regions, for example by state or location, and we can base assumptions on that. If the cloud providers expose more information in the future, it will be more accurate.
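The kind of estimation discussed above (measured energy combined with an assumed regional carbon intensity) is a simple multiplication. The intensity value below is an illustrative placeholder, not real grid data.

```python
# Sketch: estimating CO2-equivalent emissions from measured energy,
# given an assumed grid carbon intensity for the region (gCO2e per kWh).

JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules

def co2_grams(energy_joules: float, grid_intensity_g_per_kwh: float) -> float:
    """CO2e in grams = energy in kWh x regional carbon intensity."""
    return (energy_joules / JOULES_PER_KWH) * grid_intensity_g_per_kwh

# Example: 7.2 MJ (2 kWh) in a region assumed to be at 400 gCO2e/kWh.
print(co2_grams(7.2e6, 400.0))  # 800.0
```

This is why the same workload can have very different emissions on different providers or in different regions: the energy term is measured, but the intensity term depends entirely on the local energy source.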
I do have another question, if I may. I love this topic, and I think it is very important; it's almost too late for us to only be starting this discussion now. But how confident are we in the numbers we have there? I know it's all estimations, but how close do we think the estimations are to the actual numbers? Have we run some kilowatt tests against real workloads and seen how they actually relate to the estimations?

Yeah, right. So for the energy consumption, we have verified that. On bare metal we have the real energy consumption coming from the sensors; we just take that data and split it across the processes, so what we export is the real energy consumption. For the VMs, we have also verified that for a given CPU model, when we are using the right model for it, it's accurate. For the CO2, as I mentioned, we need more metrics; we need to know the energy source. Right now it's a general estimation, and it needs to be improved. Thank you.

So, we are looking at the model itself and then the serving of the model. For one, are you using platforms to serve the model, and are you creating inference model sets? And two, I don't know if you can hear me, two is: are you using GPU-driven modeling right now?

Right, so... can you repeat the question about the models? Yes. And then, are we using GPU computing? Because with inference-based models, we don't necessarily always have to. So, the power models are for CPU, DRAM, and the node power related to that. For the GPUs, the use case we have right now is where we have full access to the GPU. For example, from the IBM Cloud point of view, the GPU is fully exposed via passthrough, so we have access to the power consumption of the GPU, and we do not use power models for the GPU: we get the measured energy consumption and split it between the processes based on each process's GPU utilization.
Okay, sorry, I guess I didn't phrase my question well. I'm talking about the actual model itself that you're running. There's a cost to that model that can technically be more than the value of the observability you're getting, right? Especially if it's a GPU-driven model over large data sets; that's an expensive model. Inference models are typically CPU-driven, tiny, compartmentalized. So my question is: are you going to create the models you use for the platform as inference sets in the future?

So, about the model we are creating, if I understood: you are asking whether it has to be trained on GPUs, like AI models are. Our model is very simple and lightweight to train, for Kepler itself. But I understand the concern about the energy consumption of AI workloads, and we are measuring that: the energy consumption of the CPU part, and, as I was mentioning, the energy consumption of the GPU part, for which we have power models. We measured the overhead of Kepler; I didn't plot the energy consumption of Kepler itself, but as I mentioned, Kepler's overhead is very low, it uses low CPU utilization, so it's also using low energy. Got you.