I think we are going to start. Thank you everybody for making the session. Really appreciate you sticking around this long and choosing this session and not the one next door. My name is Rima Iantel. I am a chief architect responsible for the Telco vertical at Red Hat. And... Yeah, thank you Rima. My name is Athanas Athanasov. I'm a senior cloud native software engineer at Intel. Thank you. All right, today we are going to talk about power optimization. You have probably noticed that sustainability is becoming a hot topic, and power optimization is one aspect of sustainability. It has two advantages over some other topics that fall under the umbrella of ESG, which is environmental, social and governance: it's quantifiable, and it can potentially save you money. As my power company tells me every month when it sends me an email saying I haven't saved any money compared to my neighbors. So we are going to address specifically node tuning optimizations in this talk. We are going to talk about how profiles can be applied to the node to tune power-specific settings. Then we're going to talk a little bit about how what Intel has contributed with their Kubernetes Power Manager improves what you can do. And we're also going to have a short demo all the way at the end. It's going to be prerecorded, because we're not brave enough to do it live. For power optimization on the node, we are only going to talk about what you can do with capabilities that are already present in the CPU. There is some work happening to do more power savings optimization in a wider context: outside of the node, for the whole cluster, for multiple clusters, for the whole domain. And when I say domain, what I mean, in the context of a telco for instance, is a whole RAN (radio access network) or all of a 5G core network. But we are going to start small and build up. So on the node, we're going to talk about the capabilities that the CPU exposes to the Linux kernel.
And what you can do from the Kubernetes layer: which types of constructs you can use to configure the capabilities of the CPU to give you power optimization. I want to start with TuneD. I mentioned profiles that you can apply, and TuneD is a utility that provides a framework for configuring certain power optimization settings on different hardware, on your node. It uses profiles, and some of the profiles come out of the box, included with the system, that you can apply based on the type of workloads you're going to be running on your system. Those workloads can range from, say, databases or virtualization, to low-latency, jitter-sensitive telco network functions. Each one of them requires very specific tunings on the system to operate well, even in the context of power optimization. On the screen here, I have some of the common types of workloads and the types of profiles that you have available out of the box. You can also create custom profiles if you have special workloads. From the CPU capabilities that I mentioned, I want to concentrate on two right now: P-states and C-states. When I say P-states, that's voltage and frequency control of the CPU. You probably know that each CPU has a certain rating for its frequencies, a range of frequencies that it can support. The higher the frequency, the faster you execute instructions on each core, and the faster you do some useful work. But at the same time, the higher the frequency, the more power you're consuming. So at the system level, what's responsible for controlling the P-state of the CPU? First of all, you can just leave it to the CPU to decide itself; it's smart enough to say: okay, I'm seeing this pattern of usage, and I'm going to adjust the frequency. You can also lock the frequency.
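As a concrete illustration of the profiles mentioned above: a custom TuneD profile is just an INI file placed under `/etc/tuned/<name>/tuned.conf`, and it can build on one of the out-of-the-box profiles via `include`. The profile name and summary below are made up for this sketch; `powersave` is a stock TuneD profile.

```ini
# /etc/tuned/telco-powersave/tuned.conf  (hypothetical custom profile)
[main]
summary=Power-saving baseline for non-latency-sensitive workloads
include=powersave            # start from the stock powersave profile

[cpu]
governor=powersave           # cpufreq governor applied to the cores
energy_perf_bias=power       # bias the hardware toward energy saving
```

You would list the available profiles with `tuned-adm list`, activate this one with `tuned-adm profile telco-powersave`, and confirm it with `tuned-adm active`.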
If you know what type of workload you have, you can say: I want my frequency locked at this state. For instance, you can lock it at the highest frequency because you have low-latency or high-performance workloads and you want everything done as quickly as possible; at that point you're going to be consuming the maximum possible power on each core. And you can do this tuning per core as well. What's doing it in the Linux kernel is the cpufreq subsystem, and it uses governors, which are predefined; you can run a command to see which governors you have available, and I'm showing some examples here. You can choose a governor that saves you power, you can choose a governor that gives you performance, or you can find something in the middle. Then you have C-states, and C-states are responsible for how much your CPU is going to rest, sleep, hibernate. The C-states are defined in something called ACPI, the Advanced Configuration and Power Interface. It's an open standard, and it tells the operating system what it can do in terms of controlling certain hardware, including how you put certain hardware to sleep, how you configure it, et cetera. These C-states are defined as part of that open standard, and they start from C0, which is an active state. That means your CPU is always ready to do work; whether it has work to do or not, it's always ready. It's not going to sleep, it's going to be alert all the time and consuming power. As you move up in C-states, C1, C2, et cetera (how many C-states you support depends on the actual model of the CPU), certain parts of the microprocessor can go to sleep. So if there's no activity happening, you can turn off certain functionality on your cores.
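The per-core knobs described above are exposed through the standard Linux cpufreq and cpuidle sysfs interfaces. A sketch of inspecting them on a node (the available governors, the C-state names, and the state numbering all depend on the CPU model and driver, and writing requires root):

```shell
# Which cpufreq governors does core 0 support?
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

# Read, then set, the governor for a single core
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Which C-states does the CPU expose, and what is their wakeup cost?
cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/latency   # microseconds

# Disable a deep C-state on core 0 (state index is hardware-dependent)
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
```

The latency files make the "deeper sleep, slower wakeup" trade-off concrete: deeper states report larger exit latencies, which is exactly why latency-sensitive workloads pin the core to C0.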
And actually it's a bit more complex than that, but we're not going to go into that level of detail. We have multi-core CPUs, of course, and we have hyperthreading, and some of that also plays a part, but not in our particular case. If you're interested, look it up; it's pretty fascinating how it works. Okay, so we have C-states, and you can also leave it up to the CPU to decide which C-state to be in, or you can proactively control it and dynamically reconfigure it for a particular C-state if you know what your workload is and what its needs are. If you know you're running workloads that cannot withstand any sort of latency, you are never going to put your CPU to sleep; you're going to stay at C0. But if you know that you have a workload that can handle latency... because the deeper the sleep, same as with us, right? The deeper your sleep, the longer it takes you to wake up. Same for the CPU. So you control it from that point of view, knowing what your workload is. There's also something called uncore frequency, and that covers everything on the CPU that's not a core, like the QPI bus and controllers, et cetera. And Athanas can tell us a little bit more about it, because it's something that's coming in the future; it's not available for control right now. Right, the uncore on the processor also has its own frequency. If you think about a processor today, it also consists of a mesh which connects the cores to each other, and there are caching agents; there are multiple components which can be controlled through this frequency. The frequency is important for data movement: if you have a workload which needs to move a lot of data from one core to another, it might be a good idea to tweak this frequency. That's why we are integrating this capability into the Power Manager. We will have it in our next release.
Currently it's still under development, but with our next release you can use it. All right. So let's look at what the Kubernetes contribution is to this whole power savings picture. I'm actually going to start at the bottom, because I can. The Performance Profile Controller is responsible for taking your definition of what you want the node to look like from the power perspective (and from some other perspectives; you can tune other things outside of power optimization), and it distributes the relevant bits to the other components I list here. Kubelet, which, I hope you know, is an agent that runs on every node and is responsible for the lifecycle of the containers running on the node, has components that control what the CPU assignments and layout look like for a container: using the CPU Manager and the Topology Manager, it can assign specific cores and specific placement for your workload. That's where Kubelet plays its role. Then you have the Cluster Node Tuning Operator, which maintains certain tuning rules, and those tuning rules again apply a lot of different things, including some of the power tunings. And the last one is the Machine Config Operator, which controls certain configurations of the operating system. Here I have a diagram; it's pretty high level, and we don't have enough time to dig deeper into it, but what you do is start with your performance profile and define what you want your specific node performance profile to look like. The Node Tuning Operator extracts the relevant information from this configuration and from the other relevant configurations you have defined. Basically, you define which nodes you want to apply this to, what the settings are that you want to apply, et cetera. The Node Tuning Operator takes all that information and applies it to the node.
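A minimal performance profile of the kind described above looks like this in OpenShift; the profile name, the core ranges, and the node label are illustrative values for this sketch:

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: telco-node-profile        # hypothetical name
spec:
  cpu:
    reserved: "0-1"               # cores kept for housekeeping / system daemons
    isolated: "2-15"              # cores handed to latency-sensitive workloads
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""   # which nodes the profile targets
```

The Node Tuning Operator watches this resource and translates it into the TuneD, Kubelet, and Machine Config changes on the matching nodes.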
On the node you already have a Kubelet running, and the Kubelet communicates some of that information to the relevant bits underneath, including the operating system directly on the node. Then, for the performance optimization specific parameters, you have TuneD running on the node, because the TuneD service runs as part of the Node Tuning Operator's namespace. Basically, it passes the parameters to the daemon, and the daemon then applies those parameters directly, through your drivers, to your hardware. And then your CPU can adjust the P-states and C-states, and, as we said, uncore frequency control is coming up soon. So, I've mentioned the Node Tuning Operator several times already, and these are some of the things it can control. You can turn CPUs on and off using the Node Tuning Operator. You can select specific governors per core or per group of cores, and you can granularly control the configuration per pod running on the node; I'm going to go into a little more detail on that. If you want to permanently turn off CPUs, you can do that. Right now it requires a reboot: you can specify which CPUs you want to turn off, and obviously that's going to save you power, because the offlined CPUs are not consuming any power, but it means that if you want to turn them back on, you have to reboot the node. So not every type of deployment can handle that. Where we see people using it right now: some of our customers are over-provisioning their hardware in expectation of future growth, and they end up with too many CPU cores on their machines sitting there idle, consuming power. So they can plan in advance and turn off those cores when they're doing deployments, and then they have their own ways of managing when they have to reboot. Anyway, this is how you do it through the performance profile: you can offline the CPUs, and it's just a configuration change.
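Offlining cores through the performance profile is a one-field change; in OpenShift the field is `spec.cpu.offlined`. The core ranges below are illustrative:

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: offline-spare-cores       # hypothetical name
spec:
  cpu:
    reserved: "0-1"
    isolated: "2-7"
    offlined: "8-15"   # these cores are taken offline and draw no power;
                       # bringing them back online requires a node reboot
  nodeSelector:
    node-role.kubernetes.io/worker: ""
```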
It's supposed to become hot-pluggable at some point, so watch this space. And this is something that's very valuable, especially for our telco customers, because they're running some workloads that are latency-sensitive alongside workloads that are not. If you're looking at the workloads, we're talking about data plane workloads versus control plane workloads. Until recently, when we provided best practices to our customers, our default performance profile was geared toward the low-latency workloads, until we realized that a lot of the workloads are not really that sensitive, and we were consuming too much power because we were setting C0 and the highest, turbo, frequency on every core. So what we can do, using the workload hints and pod annotations, is say: okay, this workload is low latency, it needs to be performant, so everything needs to be set for performance; and the rest of my workloads are not that sensitive, so I want to set them for power saving. Now you can do that through the same model that I described earlier. And now I'm going to hand off to Athanas, who's going to talk about the Kubernetes Power Manager. Thank you. Right. I will start with why we looked at implementing a Power Manager for Kubernetes. If you think about Kubernetes as a software layer, the whole idea of Kubernetes is to abstract the machine away from applications, and there is a natural gap in that respect. We as Intel, and other vendors, provide the machinery, the processors, with different kinds of capabilities which allow you to configure the power utilization, or to apply certain tweaks to the power utilization, for a workload. And this is one of the benefits of using the Power Manager: when we implemented it, we tried to bridge the gap between Kubernetes applications and the platform. That was the main idea.
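The workload-hints-plus-annotations model sketched above can look like this in OpenShift: the profile defaults the node toward power saving and opts in to per-pod power management, and an individual latency-sensitive pod then asks for C0 and the performance governor via its annotations. Names, images, and core ranges are illustrative.

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: mixed-workloads           # hypothetical name
spec:
  workloadHints:
    realTime: true
    highPowerConsumption: false   # default the node toward power saving
    perPodPowerManagement: true   # let individual pods request performance
  cpu:
    reserved: "0-1"
    isolated: "2-15"
---
apiVersion: v1
kind: Pod
metadata:
  name: dataplane-pod             # hypothetical workload
  annotations:
    cpu-c-states.crio.io: "disable"           # keep this pod's cores at C0
    cpu-freq-governor.crio.io: "performance"  # and on the performance governor
spec:
  containers:
  - name: dataplane
    image: example.com/dataplane:latest       # hypothetical image
    resources:
      requests:
        cpu: "2"
        memory: "1Gi"
      limits:                     # requests == limits -> guaranteed QoS class
        cpu: "2"
        memory: "1Gi"
```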
There are a bunch of capabilities that come with the Power Manager, and some of them are targeted at exactly the example you heard before. Imagine we have high-priority applications and low-priority applications, and we want to somehow assign more power, or higher frequencies, to the high-priority applications, which are mission critical. And the other applications, which run without issues at a lower frequency, we want to assign to a power saving profile, if we have one. Our architecture, as you see, is structured as an operator and consists of several controllers. From left to right: first, the user configures the system through a configuration file, a configuration YAML, which is a custom resource, and there is a controller which processes this configuration. The configuration lists all the profiles that have to be made available on the compute nodes. This is processed, and the configuration controller brings up a so-called node agent and deploys the profiles that were configured. After that the user can use those profiles in their workloads, and they come in several types, similar to TuneD: we provide a performance profile, a power saving profile, and something in between, balance-performance and balance-power. That's the whole idea. Next slide. This is a summary of the capabilities of the whole software. On the left, the node agent is a utility, a DaemonSet, which runs on the nodes that were configured to use the power profiles, and it uses the C-state capabilities of the processors to put cores in and out of sleep. We have additional capabilities; for example, to point out the Time of Day controller.
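The configuration custom resource described above can be sketched roughly like this for the Kubernetes Power Manager; the node selector label is an assumption for this example:

```yaml
apiVersion: power.intel.com/v1
kind: PowerConfig
metadata:
  name: power-config
  namespace: intel-power
spec:
  powerNodeSelector:
    power-node: "true"            # hypothetical label marking participating nodes
  powerProfiles:                  # profiles the node agent should make available
  - "performance"
  - "balance-performance"
  - "balance-power"
```

The configuration controller reacts to this resource by deploying the node agent DaemonSet onto the matching nodes and creating the listed profiles there.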
You could imagine, for example, that in the telco world the load during a certain period of the day is not that high, and you might want to configure your profile based on the time of day. We can do that with the Time of Day configuration. So we have capabilities to control C-states and P-states, and we also have capabilities to select different pools of cores where to apply those C- and P-states, according to the nature of the applications: whether they are high-priority applications, or lower-priority applications which we can put in a shared pool. What does the whole configuration look like? We see some examples of the different profiles. Specifying a profile is quite straightforward; it is basically a custom resource definition. On the top left, you see an example of a performance profile. You configure it by selecting the performance profile type, and you can also tweak a frequency range for the profile, a min/max frequency range. This kind of CRD is understood by the controller and deployed automatically to the corresponding node agents. The Time of Day capability, we mentioned that already: you might have a case where you want to control the profile by the active time of day, and that is another CRD; you see how it can be configured in that example. Let's look at the whole thing in a more practical example. We did a deployment of an actual workload together with the Power Manager in an OpenShift environment. We picked a microservice workload called hotel reservation from DeathStarBench. If you're not familiar with that workload, it's a hotel reservation system with a classical three-tier architecture: you have a front end, then you have some microservices doing the business logic, and the last tier is databases and caches. So we picked this deployment and we tried to play around with the Power Manager.
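A profile with a min/max frequency range, as described above, can be sketched along these lines; the profile name and the exact frequency values are illustrative, and the field names follow the Power Manager's PowerProfile CRD shape:

```yaml
apiVersion: power.intel.com/v1
kind: PowerProfile
metadata:
  name: performance-example       # hypothetical profile
  namespace: intel-power
spec:
  name: performance-example
  max: 3700                       # upper frequency bound, MHz
  min: 3300                       # lower frequency bound, MHz
  epp: "performance"              # energy/performance preference hint
```

Once created, the controller propagates the profile to the node agents, and pods can then request it by name.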
So what are the steps to use the Power Manager together with TuneD in an OpenShift environment? It can be summarized in four quite simple steps. First, we needed to configure the TuneD software package on OpenShift. After that, we followed the deployment recipe provided by the Power Manager. We instantiated the power profiles which we need for a certain set of applications, and at the end we deployed the workload. Here you see our TuneD configuration for using the Power Manager. There are some important things to point out. We mentioned that we need P-states: in the TuneD profile configuration we need to make sure that the P-states driver is actually enabled on the system, and you can specify that in the boot loader option. We also configured further options which are required to run the Power Manager in its current implementation. Then we instantiated two profiles. We had a shared pool profile, which is for our lower priority workloads, and you see those lower priority workloads will run in the range of 1,000 to 1,500 MHz. So we put them at a lower frequency; this is our power saving profile. And we have a performance profile for the more important workloads, which is in the range of roughly 3,300 to 3,700 MHz. After instantiating those CRDs with the profiles, we can use them in the actual pods for the workloads. You do that in the resources section, where we request, for example, power.intel.com/balance-power, or the performance profile. What you see there as well: the core count you pick for your workload has to match. For example, in the performance profile case we had the high priority workload, which we wanted to put in the guaranteed quality of service class; and to make the Power Manager work and pick a pool of cores to assign the performance profile to, we had to match that core count.
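A pod requesting one of the instantiated profiles through its resources section can be sketched like this; the pod name and image are made up, and the key point is that the extended resource count matches the CPU count and that requests equal limits, so the pod lands in the guaranteed QoS class:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: high-priority-service     # hypothetical workload
spec:
  containers:
  - name: app
    image: example.com/app:latest # hypothetical image
    resources:
      requests:
        cpu: "2"
        memory: "1Gi"
        power.intel.com/performance: "2"   # must match the cpu count
      limits:
        cpu: "2"                           # requests == limits -> guaranteed QoS
        memory: "1Gi"
        power.intel.com/performance: "2"
```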
And now I will switch to a short demo from my colleague Lukas, who is unfortunately not here. He recorded something for us, so we can see this in action. Hi, my name is Lukas Danijuk. I'm a cloud native software engineer at Intel, and I'd like to demo the Kubernetes Power Manager on an OpenShift platform. We will start by verifying that the Power Manager and the required TuneD profile are applied. We also have a shared profile and a performance profile already created. Note that the performance profile is currently using the performance governor and the shared profile is currently using the powersave governor. We also have the hotel reservation workload from DeathStarBench deployed. Currently, no pod is requesting any power profile, so all the cores are in the shared pool. To verify this, we can also log into the node and check that all the CPUs are in powersave mode, as per the shared profile. We have a version of hotel reservation ready that already requests the performance profile. Finally, we can apply the changes. Now we can verify that the CPUs requested by those pods are in the performance pool. And that is it for this quick demo. Thank you. That was our demo of the Power Manager integration in an OpenShift environment. Yep, that was it from us. If anybody has any questions, feel free to ask; I think there are some microphones at the ends, or you can just come up. Hi, thank you very much for the talk. My question is: did you observe any changes in the lifespan of the device by changing frequencies and playing around with that? Sorry, which device? The CPUs; did they live longer or shorter, or did that have any effect? This hasn't been around long enough for us to really notice. Okay, all right. This is all still used in lab prototypes; I haven't seen any large deployments that are actually utilizing this yet. Ask us again, maybe next year in Paris. Yeah, that would be great. Can I add a follow-up question?
Which is: have you tried making your clusters or your nodes carbon-aware, so that you trigger this not only by the time of day, which I heard you can do, but also by the carbon intensity, so that if it goes up, your frequencies go down? Okay, that's coming up; that's future work. I don't know if you've heard about Kepler. Yes. So the plan is to basically connect this functionality to the metrics that Kepler collects and make some decisions based on that. And when I was talking about cluster-wide or multi-cluster-wide decisions, that would all work together: using Kepler and maybe connecting it to autoscalers. One type of autoscaler could be a frequency-based autoscaler. Basically, you would figure out what the sweet spot is for each core to work at, which frequency and C-state balance for each type of workload, and place the workloads based on those decisions, so that each CPU in each node is at that sweet spot, if possible. Yeah, just to add to that: at Intel we also work on another framework called intent-driven orchestration, which could also fit nicely with that question. There you could have an intent to control the cluster based on carbon footprint; as an example, you declare that you desire a certain level of carbon footprint, and then the whole intent-driven framework tweaks the components under the hood so that you keep the level you specified. Thank you very much. All right, we have time for one very quick question. Thanks for the talk. My question is: do you have some numbers on how much energy you can actually save using those measures, because the savings are probably quite minor? So yes, we have some measurements in the telco space which were done recently.
They are actually quite good numbers, which we published recently as Intel for telco workloads at Mobile World Congress, where they were saving up to 30% in power. That's quite impressive, especially in Europe, where power is expensive. Thanks. Thank you. Thank you everybody, thanks for coming and sticking around.