Hello everyone. My name is Daniel Wilson and I'm presenting along with Atanas. We're going to talk about environmentally sustainable AI via power-aware batch scheduling. I've been doing this work as an intern with Intel under the leadership of Atanas and Chris Cantalupo, with the support of all of these other people around the GEOPM team at Intel. So let's begin.

For some background on why we're looking into this: AI workloads demand a lot of power. One thing you might notice when you look at what's happening in these systems is that the aggregate impact, say total energy consumption, often points at inference. However, when you look at the single-run impact of an individual run of your AI models, it's the training side that has the really heavy impact all at one point in time. That gives us a big focal point where our power management decisions can have a large impact.

GEOPM, the project I've been working on during my internship, focuses on power management decisions in HPC systems. A lot of what we've been looking at in HPC is also relevant in cloud computing systems, in particular with batch-oriented training of AI models. One key point I want to emphasize up front is that some workloads use power more effectively than others. This is visualized in the power cap diagram in the bottom right: as a power cap goes from some high number to some low number, the runtime of your applications increases. How much it increases for an individual application is application-specific; it depends on the components of the computing system and how the application uses those components. Another point: some estimates place the footprint of training a single LLM on the order of 300 to 600,000 tons of CO2, so this energy impact ultimately contributes to a high carbon impact as well. These are our motivations for investigating how power control can help modulate these systems.

To work toward that, we're building a software stack for Kubernetes power management. We do this by leveraging the resource management features of Kueue within Kubernetes, utilizing Kubeflow's MPI Operator to work with MPI-oriented workloads, and using GEOPM's software power management framework. Several questions come up through this prototype work: can we port our power management solutions beyond HPC-specific environments and expose them in more cloud-oriented environments as well? The relevant aspects include sharing computation resources in cloud environments, using edge resources that have cyclic demand, and dealing with AI inference computations as well as training computations.
Now, the software stacks we use for batch scheduling of AI workloads in Kubernetes need to support distributing AI training computations across a cluster, inter-node communication with support for HPC fabrics, and abstract compute engines for highly optimized solutions. We use several open source components and plugins for Kubernetes to achieve this, and I'll talk in more detail about the ones we ended up using in our test applications.

I also want to go into a bit more detail on why it's useful to have power capping at the batch-oriented job scale: once you have a control knob to modulate power consumption in your job queues, it opens you up to additional optimizations of your system's energy usage, for example in where you route your power. The diagram here illustrates two different power distribution policies you might use in your software. As in the diagram from a few slides ago, the bottom axis is the power cap and the vertical axis is the time to completion; again, as you decrease the power cap, the time increases. The two power management policies shown either balance the amount of time spent in your applications or balance the amount of power you allocate to those applications. If you take the naive solution of saying, given some amount of cluster-wide power available, I'll just uniformly distribute it across all of my computing infrastructure, then you run into the point marked by the vertical dashed line, where different applications encounter different amounts of slowdown. If instead you have application performance awareness in your power management mechanisms, you can balance the slowdown, or the time, across those applications: you give different power caps to different applications in order to achieve uniform slowdown across them. This isn't something we've finished implementing in our Kubernetes prototype; it's something we've more recently been working on in the HPC space, but the work we're presenting today is a step toward enabling it.

The workflow we end up with has basically three steps. First, you need to be able to configure your power limits within Kubernetes. Second, you need to model how those power limits impact the performance of your applications. And lastly, once you know what you want to set your power caps to, you need to be able to run your applications under a specified power cap. To achieve these steps, we use Kueue to configure power limits, introducing a power constraint into the Kueue system. For modeling, we provide scripts that let you run sweeps of your application, executing either the entire application or a representative proxy of it under multiple power caps, and then feed the resulting metrics into a script we provide to generate the power-performance model, along the lines of the sketch below.
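As a rough illustration of what such a power-performance model can look like, here is a minimal sketch; it assumes the low-order polynomial and ordinary least squares fit that come up later for CosmicTagger, and the exact functional form used by the provided scripts may differ:

```latex
% Sweep: run the workload under caps P_1, ..., P_n and record runtimes T(P_i).
% Slowdown relative to the maximum-cap runtime:
s(P) = \frac{T(P)}{T(P_{\max})} - 1
% Polynomial model, fit by ordinary least squares over the sweep points:
\hat{s}(P) = \beta_0 + \beta_1 P + \beta_2 P^2,
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( s(P_i) - \beta_0 - \beta_1 P_i - \beta_2 P_i^2 \Bigr)^2
```

Once fitted, the model can be inverted: given a slowdown tolerance, solve for the corresponding power cap, which is how the recommendation script described later works.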
And then lastly, we provide a mechanism to actually run your applications under those different power caps, informing the Kueue scheduler of the caps and then enforcing the caps you've informed Kueue of.

To implement all of that, we utilize three new features of some open source software in Kubernetes. First, we use ResourceFlavor extensions for Kueue, which, through a patch we provide, allow the MPI Operator to use power as a first-class resource: we create some resource devices and then control them through the Kueue system. Second, we utilize sidecar containers, a new feature in Kubernetes. Sidecar containers are useful for us because they give our power capping mechanism a place to execute alongside your applications: the mechanism starts when your application starts, and when your application ends, it severs a connection with a daemon we have running; the daemon detects the severed connection and restores your old power settings. Third, we integrate GEOPM, a software power management framework for computing systems. We just came out with the 3.0 release of GEOPM, and on top of that release we've been working on an alpha branch that includes a gRPC-based interface, which we utilize in this work. Altogether, combining these three layers lets us apply energy optimization techniques in our power management decisions, saving energy and making more efficient use of the computing resources available to us.

I will now speak a little about the realization in Kubernetes, how we implemented that approach. Looking at the diagram from right to left, we have the components inside Kubernetes: the Kueue engine, which is basically responsible for batch scheduling. We defined a ClusterQueue with a specific ResourceFlavor, as it's called in Kueue terminology; you can see it in the picture, and we call it `itto.com/power` (sketched below). It gives us a quota for power across the cluster, across that queue, which jobs can request from. On the left of the picture, you see what happens in terms of pod specs and the components we create to start an AI training job. The first box, the job itself, consists of two containers. We mentioned the sidecar container, which is actually an init container with sidecar properties. This is a feature enabled in Kubernetes 1.28, where you can declare that an init container should not block the containers that follow it until it has fully executed, but should instead run in the background; when the main container finishes, the init container is destroyed. This feature was really good for us, because we could set the power cap initially, and when the application, the training process, finishes, the container gets destroyed and this resets the power cap, so the node goes back to its original power level. In the job specification, which follows the Kubeflow MPIJob format, we request this special resource, the power resource limit, with some amount of power chosen by the user, expressed in watts, which is then passed to a DaemonSet (at the bottom of the diagram) running a device plugin.
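Here is a minimal sketch of that queue-side configuration, assuming Kueue's v1beta1 API; the flavor and queue names are illustrative placeholders, the extended resource name is taken as rendered in the talk, and the quota matches the 5 kW example used later:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: power-flavor              # illustrative name
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: power-aware-queue         # illustrative name
spec:
  namespaceSelector: {}           # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["itto.com/power"]
    flavors:
    - name: power-flavor
      resources:
      - name: "itto.com/power"
        nominalQuota: 5000        # cluster-wide power quota in watts (5 kW)
```

Jobs admitted through this queue draw watts from the `nominalQuota` the same way they would draw CPU or memory from a conventional flavor.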
The device plugin is really responsible for accounting for these power limits on the nodes and later exporting them in an environment variable to the containers, so that they can be consumed by the GEOPM service. The last component is the GEOPM service, which is responsible for executing the actual power capping command. We provide several interfaces to hardware from different vendors: there is oneAPI support for Intel hardware, and there is NVIDIA driver support for GPU power control.

To illustrate the flow, we have this cartoon. Imagine a user has a set of jobs, in this case four, and starts submitting them. You see Jobs 1, 2, and 3 requesting 1500, 1000, and 2000 watts, and we defined a queue, through the mechanisms provided by the batch scheduling component, to support up to 5 kilowatts. Then the user comes with a fourth job requesting 1500 watts. This cannot be satisfied at that time, as the three earlier jobs are already scheduled on the cluster and in execution. So the fourth job is put on hold and waits until one of the other jobs frees resources, meaning power, and then it gets executed. In terms of components, we have the TensorFlow jobs executing the AI training, each requesting from a specific cluster queue a power limit provided by the user. In Kueue, we configured the cluster power limit to 5 kilowatts. And we have the GEOPM DaemonSet, which receives commands to set node power limits. Take Job 1 requesting 1500 watts as an example: when we issue that job, a client triggers a command to the GEOPM DaemonSet requesting that power limit be set on the node, and that is what then happens on the system.

How do we configure the different power limits? The queue power limit, or cluster power limit, is a standard ResourceFlavor in Kueue: you can specify different resources in Kueue, like CPU, and define certain quotas for them, and you can do that for any device that is exported through a device plugin in Kubernetes. So we used that device plugin capability and defined the power device plugin, which really just does the accounting for this power cap. As you can see, it's quite simple; this did not require any change in Kueue itself in how it's specified. For the job, we took an MPIJob coming from Kubeflow, and in it we specified the two containers I showed in the previous picture. We have an init container, which requests the actual power cap, and since it is defined through the 1.28 sidecar capabilities, this init container will not block; the actual training container starts, and when it finishes, the init container is destroyed automatically. The actual power cap is requested as part of your resource limits, where it is recognized by the device plugin and accounted for accordingly.

Now I will hand back to Daniel to discuss the results we have. Thanks. This slide goes over a specific example of one AI training workload that we tested this on. The point here is that we're able to extract power-performance models by running power sweeps with the infrastructure explained over the previous few slides. The plot on the left shows the results of running a power sweep with an application called CosmicTagger.
CosmicTagger is an open source application that classifies background versus foreground events in cosmic data, or in cosmic simulations as well. We were able to run it inside a containerized environment within Kubernetes alongside the power capping infrastructure we've described, and we looked at running it under several different GPU power caps by utilizing that infrastructure. One key thing you can see here is a polynomial-shaped trend line going from the maximum power cap available on the 4-GPU system we're evaluating all the way down to the minimum power cap available on the system. At the minimum power cap, you see around 10% slowdown on average. What we're able to do, then, is execute the application under multiple such power caps, which generates a collection of log files from the GEOPM monitoring container that runs alongside the application. We can feed those trace files into a modeling script, also available here, which generates this polynomial model of slowdown as a function of the power cap. The idea is that the lower your power cap, the greater your slowdown, so we fit that relationship with an ordinary least squares fit. Once you have this kind of model, the goal is that instead of just blindly submitting every training run of this application to the system, you take the model and ask: based on my work schedule, how much slowdown can I tolerate for my training once it starts running? In the example illustrated here, if the user is willing to tolerate 5% slowdown while their application is running, then we recommend somewhere around an 850-watt power cap on the GPUs. The script we provide outputs this directly: you put in 0.05 to say you're willing to take a 5% slowdown, and it tells you which power cap to use.

While implementing this, we learned quite a few things about putting it all together, and we have some ideas about ways it can be extended in future work. One thing we thought was really cool is the new sidecars feature in Kubernetes. I'll talk about this in more detail on the next slide, but the key point is that the sidecar feature really simplifies what we have to do inside job wrappers, like the power management layer we're using. We also found that Kueue is able to limit continuous resources like power, but it takes some adaptation; not a lot, but some, which we were able to implement. The way to make this work is to represent power as a discrete collection of devices. This is what Atanas was describing a few slides ago: for each unit of power you want to be able to schedule, you generate a device, as Kubernetes sees devices, that you can schedule against. So if you want to schedule at watt-level granularity and your system can take power caps up to, say, a kilowatt per node, you would generate 1000 devices for that node. Then, after you've allocated all of those devices, you can request them in the limits or requests of the jobs going into your queue, as in the sketch below.
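Here is a minimal sketch of such a job, following the Kubeflow MPIJob format described earlier. The image names and the LocalQueue name are hypothetical placeholders; the `kueue.x-k8s.io/queue-name` label and the `restartPolicy: Always` sidecar marker (Kubernetes 1.28+) are the standard mechanisms, and `itto.com/power` is the extended resource name as rendered in the talk:

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: cosmictagger-training
  labels:
    kueue.x-k8s.io/queue-name: power-aware-local-queue   # hypothetical LocalQueue name
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    # Launcher replica spec omitted for brevity.
    Worker:
      replicas: 1
      template:
        spec:
          initContainers:
          - name: power-cap-sidecar
            image: example.com/geopm-powercap:latest     # hypothetical image
            restartPolicy: Always   # sidecar-style init container: runs alongside the app,
                                    # torn down when the pod's main container exits
          containers:
          - name: training
            image: example.com/cosmictagger:latest       # hypothetical image
            resources:
              limits:
                itto.com/power: "1500"   # requested node power cap in watts; one device per watt
```

When this job is admitted, the device plugin's accounting consumes 1500 of the queue's 5000-watt quota, matching the flow in the cartoon above.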
One of the things we found is that it would be nice if continuous resources could be enabled in some future change. I don't think we have something in mind yet for how that would work, but it would be excellent to hear any ideas for doing this without generating a large set of devices.

There are also several opportunities to build on this in future work. Once you have a mechanism in place for job-level power capping, you can start to explore what power caps are doing to your individual jobs, and how you can use those caps at a cluster level, or some broader level, to achieve your other system-level infrastructure objectives. One thing we believe you could work on here is evaluating power oversubscription opportunities in your infrastructure. If you have some level of power oversubscription in your system, your goal is basically to say: with that oversubscription in place, I want to maximize the performance of my system under some power constraint. With a job-level power capping infrastructure that's performance-aware, you can make decisions about where you route your power among the applications and the components they execute on, and with that information you can aim for higher system-level efficiency, because you can send less power to the applications that don't need it in order to send more power to the applications that do (one hypothetical formalization of this is sketched at the end of this section).

Another thing we'd love to do in future work is integrate with container-scoped metrics. As I mentioned at the start of this presentation, this project started as an effort to take a power management infrastructure largely developed for HPC systems and figure out what new challenges and new solutions are possible in the cloud space. One assumption carried over from the HPC space, at least in this current iteration, is a single tenant per node: there's always one application executing on a given node, because we control power without awareness of what's happening in individual containers that may be executing on the same node. Other projects have already started investigating this problem; in particular, the Kepler project is a big one, working on getting container-level power metrics even though you're typically only able to monitor the hardware at a much broader scope than what's visible within containers. Beyond that, it would be great to integrate with a solution such as Kepler and work toward per-container power control. We anticipate the control side will run into challenges similar to those being encountered on the monitoring side, mainly because if we can only set power caps at the level of an individual GPU, GPU tile, or CPU package, applications will be executing at a much smaller granularity. However, there are many other power-related controls available in these systems, so similar to what Kepler does in modeling those lower-level components against the more broadly package-scoped or GPU-scoped ones, it would be great if we could model the controls as well.
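Returning to the oversubscription idea above: one hypothetical way to formalize it, using the per-job power-performance models described earlier (this is a sketch of the idea, not a formulation given in the talk), is to choose per-job caps that balance predicted slowdown while staying within the cluster budget:

```latex
% Given fitted slowdown models \hat{s}_j(P_j) for jobs j = 1, ..., m and a
% cluster-wide power budget P_{\mathrm{cluster}} (possibly oversubscribed),
% choose per-job caps that minimize the worst predicted slowdown:
\min_{P_1, \dots, P_m} \; \max_{j} \; \hat{s}_j(P_j)
\quad \text{subject to} \quad
\sum_{j=1}^{m} P_j \le P_{\mathrm{cluster}},
\qquad
P_j^{\min} \le P_j \le P_j^{\max}
```

At the optimum, power shifts away from jobs that are insensitive to their caps and toward jobs that are sensitive, which is the balanced-slowdown policy from the earlier diagram.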
Another thing we'd like to investigate is elastic resource allocations. What I mean by that is being able to specify, for a job, some minimum guaranteed amount of power based on the user's request, while we as the system operators, or an automated operator, can choose to boost beyond that as needed. There are a lot of ways to apply this. The two examples I'm most excited about are these. First, allowing more power during periods of low carbon intensity or low energy cost: in cases where you have mixes of power availability in the grid you're operating under, carbon intensity changes over time and energy costs change over time, so wouldn't it be great if we could take our more efficient applications and give them a little extra power at those times of lower cost and lower intensity? Second, many applications have properties that change over time: as applications enter different phases of execution, they may have points where they're more or less efficient with the power they're using. One thing we'd like to do here is introduce combined solutions that do live monitoring of application performance together with live monitoring of the potentially changing power objectives at the system level, and combine those into a system-level policy that adapts over time and is aware of application changes as they occur. This is something we're already working on in the HPC space, but again, I'd love to figure out the new challenges in the cloud space so we can bring it there as well.

I mentioned there was something I wanted to say about sidecar containers, and this slide goes into more detail on that. The key benefit for us of the new sidecar feature in Kubernetes is that it simplifies what we can do with job wrappers. By a job wrapper, I mean having our monitoring and control start with the application and end with the application. The way sidecar containers work is that there's an init container that is a prerequisite for the app container, but when the application ends, the sidecar container also ends. In general, we think this is useful for any kind of prologue or epilogue work, which we see quite often in the HPC space with batch-oriented workloads. How this actually makes things simpler is illustrated in the example at the bottom. Without sidecars, the container holding our monitoring application needs to be aware of what the other container is doing: we have to share a process namespace and monitor whether the application process we're checking is still alive, so we have to know what that application is. With sidecars, we can instead just tell our monitoring application to run forever, and the Kubernetes infrastructure will terminate it once the application is terminated, as in the sketch below.
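Here is a minimal sketch of that pattern, assuming Kubernetes 1.28+ with the sidecar containers feature enabled; the image names and the monitor's command are hypothetical placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-monitoring-demo
spec:
  restartPolicy: Never
  initContainers:
  - name: power-monitor
    image: example.com/geopm-monitor:latest      # hypothetical image
    restartPolicy: Always    # marks this init container as a sidecar
    # The monitor needs no awareness of the app container: it simply runs
    # "forever", and Kubernetes tears it down once the app container exits.
    command: ["/bin/sh", "-c", "start-monitor & sleep infinity"]   # hypothetical command
  containers:
  - name: training-app
    image: example.com/training-app:latest       # hypothetical image
```

Contrast this with the pre-sidecar approach, which would additionally need `shareProcessNamespace: true` and app-specific liveness polling inside the monitoring container.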
What this allows is a more abstract implementation, because the wrapper needs no awareness of the actual application being monitored. With the prototype we put together, we did have to apply some patches to existing software, the MPI Operator and Kueue, for them to work alongside sidecars; otherwise things just wouldn't start, appearing to hang before the jobs could actually be scheduled. One thing we hope to do is work with those teams to see if there's a way to get official support into those extensions.

Everything I've talked about in today's presentation is provided online on our GitHub. It ends up being used in quite a few places within our GitHub page, and we hope this site map is a helpful way to navigate where all of it lives. The places I'd recommend you check out first: the GEOPM homepage, which covers the power management infrastructure as a whole, and the cloud branch, which contains all of the changes we made to enable gRPC in the GEOPM infrastructure and which also contains, in its subdirectories, a specific folder for the KubeCon changes. I'd recommend that directory in particular; there's a README in there outlining all of the components we've talked about today. Certainly check out the rest as you find it interesting; those are just suggested starting points.

In closing, I want to recap a few points I've covered throughout this talk. First, AI workloads demand a lot of power. Second, power cap sensitivity of performance varies by workload, so it's important to understand, for a given workload, what will actually happen if we apply different power caps. Third, with Kueue and sidecar containers, it's easy to use GEOPM, or any other power manager, to apply job-level power caps. We have several ideas for future work that should continue making power management broadly available across containerized jobs. Lastly, I definitely recommend checking out the experimental cloud branch of GEOPM, which provides a gRPC interface for interacting with GEOPM for power management and for accessing its other features. Thank you for listening, and please raise any questions.

Hi, Olivier Tardieu, IBM Research. Very nice work, very nice talk. We've been exploring very similar things, very similar systems. Many questions; maybe just one of the first ones. You talked about training jobs and about user tolerance to slowdowns. In our experience, for these kinds of jobs, the tolerance to slowdown is zero: not one, not two, just zero. Do you have any insights on where this is more applicable, GPUs versus CPUs, and what kinds of workloads you're planning to use this for?

I can try to answer with some ideas of where this can fit. For example, I was thinking of ADAS use cases, where you have a fleet of cars coming in overnight to ingest data, and usually this data is ingested in a data center and used in a training scenario where the newly collected data has to be used to refine the models. Usually the jobs have to finish in exactly eight hours, but say you have different models that finish in different time frames: some finish in five hours, some finish in three hours.
So maybe you can apply a slowdown to some of them. You have a total of eight hours to finish all of these jobs, so you do some sort of packing: you allow some slowdown for some of the jobs while still trying to fit in the eight-hour time frame, and hopefully this gives you a better performance-per-watt ratio for those workloads. That was just one idea, but maybe Daniel has other examples.

I think another aspect is that users often say they won't tolerate something until there's an emergency. One thing we've seen in the HPC space is that people have said similar things: if you're running a super expensive HPC system, you probably aren't interested in power capping at all, because it could slow something down, and the amortized cost of that system is just being flushed away when you do that. In reality, emergencies happen. Sometimes they're budgeting emergencies, like in the past year, when there were big spikes in natural gas pricing: one data center in particular needed to shut down about a third of its systems for about a quarter of the year in order to meet its energy budget. So when you have systems that supposedly cannot tolerate any slowdown, but they hit some other constraint revealing that they actually can tolerate running at a lower level, you're going to have users that start being a little more realistic, saying: you know what, maybe I am willing to accept two percent slowdown, instead of saying zero all the time until there's an emergency where I don't get to run at all. And if you can tolerate even small amounts, between zero and five percent, we expect there are still several opportunities for energy savings. The plots here, on the left, show someone else's work evaluating LLMs under GPU power caps, demonstrating something similar to what we saw in our curves: at subtle power caps you get a slight increase in time, but you do get savings in energy. So if systems start charging users for energy-related costs, users might also be more incentivized to take very slight slowdowns in order to save energy.

Hi, thank you for your presentation, it's really good. I have a question about the internals of GEOPM: how does it actually control power consumption? Which devices can it control, like CPU, memory, NIC, or PCIe devices? How does that really work?

For the data shown in the curves on these slides, we were just doing GPU power caps, interacting with the NVML software library to achieve that. The GEOPM software is also capable of using MSRs to interact with CPUs for their power caps, and we can do DRAM power caps through RAPL in that same interface as well. For that, we either interact directly with the MSR driver in Linux or with the msr-safe driver, which provides a batch interface.

So in summary, it's GPU, CPU, and DRAM? Yeah, and the DRAM comes through the RAPL software power management interface for us. Okay, thank you.

Hi, you mentioned that one of the future directions is finer granularity, power management at the container level. I was wondering how that control at the container level would theoretically work. Would that have to be through the kernel itself and cgroups?
Yeah, so one way I imagine that could be possible is integrating with something at the kernel level. I don't think there's currently support for applying specific power capping policies at the per-process level, but if you can integrate with scheduler events, there may be opportunities to model, not a power cap directly, but the relation of other power controls to power consumption. For example, on systems where you can only do package-level power capping, you may be able to do core-level frequency limiting, which you can combine with power-performance models to build your own software power capping infrastructure. A challenge there is that there are latencies associated with changing frequency limits, so it would probably require something more involved that integrates with the Linux scheduler, so it can batch things together and avoid frequency state changes when that's not actually what you want at that point in time. My opinion is that it's possible to get more control than we have right now; it may be difficult to get ideal control. Thank you. Thank you. Hey everybody. Thank you all.