All right, thank you for joining our talk. My name is Huamin Chen, and I'm joined by Chen Wang from IBM. It's our pleasure to meet you here, and we are going to share what we have done for sustainable cloud AI. First, the agenda: we are going to show how we deploy cloud AI, especially foundation models, and what we can learn by measuring the performance and power metrics; how to discover the best way to optimize energy consumption for foundation models; and some future work directions we are going to explore. This will include a demo from Chen.

First of all, we want to give an overview of the energy and environmental impact of foundation models. Even though it has only been about five years since the first large language models were introduced, we still see exponential growth. Looking at the metrics used to evaluate large language models, the number of parameters, basically the connections between neurons, has grown from 94 million all the way to the rumored 100 trillion parameters of GPT-4 in only five years. That kind of growth is surprising. On the other side, if you look at the training cycles, what is the cost? The cost can be monetary or environmental. Looking at the numbers reported in the paper, the amount of energy and the carbon footprint of these training cycles is quite impressive. OPT from Meta emitted about 137 tons of carbon during training. If you consider that a typical car emits about five tons of carbon every year, they say that is equivalent to about 40 cars driving for a year. And if models keep getting bigger and bigger, the amount of carbon this incurs is not something small. So we want to do something in that direction, but we are not going to talk about training; we are talking about inference.

This chart is from a Hugging Face paper. You see two dimensions here. The X axis is the number of requests in a ten-minute interval, so intervals with very little traffic fall toward the left side, closer to zero. The Y axis is the energy consumed by the serving server in that interval. You see a big cluster at the corner, which tells us two things. One is that even when there are no requests, the GPU still spends a lot of energy idling. That is not trivial: if you have this wonderful server serving a wonderful model and you are not receiving many requests, you are still spending a lot of energy on idling. The other is that most of the energy being spent could actually be capped, so you can potentially put a power cap at the level you believe is best. But as we will see later in the presentation, the power cap does not always work.

A quick word on the model we are using in our evaluation. We are using Bloom. Bloom is a model from BigScience; you can get it from Hugging Face. It has multiple variations, from smaller ones with only a few hundred million parameters to the big one with 176 billion parameters. You can scan the QR code to get the Dockerfile, and that's step number one.
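Just as a rough sketch of what that step looks like in practice (the repo URL, image name, and port below are placeholders, not the exact ones behind the QR code):

```bash
# Sketch: containerize and run the Bloom inference server locally.
# <bloom-server-repo>, the image tag, and the port are placeholders; the exact
# Dockerfile and settings are in the repo behind the QR code.
git clone <bloom-server-repo>
cd bloom-inference-server

# Build the image; podman build works the same way if Docker is not installed.
docker build -t bloom-inference-server:latest .

# Run it with access to one GPU (requires the NVIDIA container toolkit)
# and expose the HTTP serving port for prompts.
docker run --rm --gpus all -p 8000:8000 bloom-inference-server:latest
```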
As we heard in the keynote this morning, many data scientists and machine learning developers are very good at building models, but oftentimes the end users want to be able to deploy those workloads in their own environments. So a Dockerfile is the first step to get this done. Once we are able to containerize the model and the serving layer, it becomes much more affordable and accessible for end users. We are also using the Stable Diffusion model. These are the QR codes for the web UI and the Stable Diffusion Dockerfile. Again, this is very helpful for end users: if I want to run Stable Diffusion on my laptop, I do not want to replicate everything from Hugging Face; I can just run it as a container, and that solves a lot of usability issues in my own local environment. And if you are running the workloads in your production system, containerization is also the first step to make them cloud native and accessible.

Once we containerize the foundation models, the next step is naturally to build the container images and make the containers available to your Kubernetes deployments. In our evaluation, the two containers are built from our repos and pushed to the Quay image registry, so we can pull the images from there. These are the two links to the YAML files. When you deploy the containers on Kubernetes clusters, for the best usability you probably need GPUs. In the environment we used, we evaluate the GPU power consumption of each container, so we need resource requests to claim a GPU for each of the models. That gives us two models running in our system: the first one is Bloom, the second one is Stable Diffusion, and each of them uses one GPU. By the way, all the environment setup and YAMLs are available in the GitHub repo we provide for this demo. Once we have these deployments, we create a service. The idea of the service is to make the foundation model accessible both to end users, as you saw in the earlier UI, so you can manually type your prompts into Bloom or Stable Diffusion and generate your text or image, and to our later evaluation, where we use these services to send prompts so we can evaluate the responses and generate the workload while we measure the power consumption.

So the whole process for making these measurements happen is: first, we create the necessary resources for each of the deployments. We have to install the NVIDIA GPU operator, because that's the platform we evaluate on, and we create the GPU requests so each of the models, Bloom and Stable Diffusion, runs on its own GPU. Then we use the NVIDIA Data Center GPU Manager (DCGM) exporter in conjunction with Prometheus to obtain the utilization as well as the frequency and power information, on top of which we run queries to obtain performance-per-watt information. We then use that information to derive how to optimize your workload's energy consumption without losing performance. The analytics, which are very interesting, will be covered in more depth by Chen Wang. A rough sketch of the deployment and service pieces described above is shown below; after that, a little bit about energy conservation and power optimization.
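Here is a minimal sketch of the kind of deployment and service just described; the image name, labels, and port are placeholders, and the actual manifests in our GitHub repo differ in the details:

```bash
# Minimal sketch: deploy a containerized Bloom server that claims one GPU,
# and expose it inside the cluster with a ClusterIP service.
# Image name, port, and labels are hypothetical placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bloom-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bloom-server
  template:
    metadata:
      labels:
        app: bloom-server
    spec:
      containers:
      - name: bloom-server
        image: quay.io/example/bloom-inference-server:latest   # placeholder image
        ports:
        - containerPort: 8000
        resources:
          requests:
            nvidia.com/gpu: 1      # claim one GPU for this model
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: bloom-server
spec:
  type: ClusterIP
  selector:
    app: bloom-server
  ports:
  - port: 8000
    targetPort: 8000
EOF

# With the DCGM exporter and Prometheus installed, power and utilization can be
# queried per GPU, for example (PromQL):
#   avg_over_time(DCGM_FI_DEV_POWER_USAGE[10m])   # average power draw in watts
#   DCGM_FI_DEV_GPU_UTIL                          # reported GPU utilization
```

The nvidia.com/gpu resource request is what the NVIDIA GPU operator's device plugin uses to pin each model's container to its own GPU.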
So the theory is DVFS, dynamic voltage and frequency scaling. As you all know, the energy used by a GPU comes from its electronic circuits: running at a higher frequency or a higher voltage consumes more energy than running at a lower frequency or voltage. Oftentimes the main tuning knob is the frequency at which the GPU or CPU runs. You can reduce the energy consumption, but that comes at the cost of extending the execution time: the lower the frequency, the more time you spend on the same device running the same program. So how can we find the balance, the optimal frequency versus execution time, such that performance is not degraded while the overall energy consumption is reduced? That is the goal we want to achieve with DVFS-based tuning.

On an NVIDIA GPU, you have three ways to do it. One is to just put a power cap in place and rely on NVIDIA's own internal mechanisms to cap the power consumption of your workloads, which, as we are going to discuss, has its own limitations. The second is to adjust the frequencies at which the GPU operates. There are two frequencies you can tune. One is the streaming multiprocessor (SM) clock, which is basically the computation frequency. The other is the memory clock, which determines how fast data is transferred between the memory and the processing units. In the environment we are using, the V100, we can only tune the SM frequency; the memory frequency cannot be tuned. The very last one is voltage: if you operate at a higher voltage at the same frequency, you consume more power. But on NVIDIA devices the voltage is not tunable. On your own laptop you probably have other ways to tune the voltage, but this is not possible on data center GPUs. So we focus only on the first knob, power capping, and the second, SM frequency tuning, and see how much performance and energy change we can get. With that, I will hand it off to Chen, and she will explain and show you the demo.

Thank you, Huamin. A little bit about myself first: I'm Chen Wang, a staff research scientist at IBM Research. I was mostly focused on Kubernetes and cloud native work over the past five years, and recently I am moving toward AI systems support, especially cloud native AI systems and platforms. So come to me, drop by for a chat if you have any questions related to those. For this talk, I'm going to go into more technical details, and we have the GitHub repo, so you can scan the QR code to get the full tutorial. The whole setup, the Bloom server and the Stable Diffusion server with all the GPU tuning, only takes around 30 minutes in your Kubernetes cluster.

Let's first take a look at the two control knobs Huamin already mentioned. The first way you can do energy conservation on GPUs is NVIDIA GPU power capping. It is a feature that allows users to set limits on the power consumption of the GPU. If the GPU workload demands more power than the cap allows, the GPU may reduce or dynamically adjust the clock speed or other performance parameters to stay within that limit. Conversely, if the workload doesn't consume more than the cap, the GPU will just operate normally without any throttling.
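In practice, both knobs are just nvidia-smi commands; here is a rough sketch (the 100 W cap and the 877/300 MHz clocks are the values we use later in the demo on a V100, and the supported ranges differ on other GPUs):

```bash
# Inspect the allowed power range and the currently enforced power limit.
nvidia-smi -q -d POWER | grep -E "Min Power Limit|Max Power Limit|Power Limit"

# Cap GPU 0 at 100 W (the value must be within the min/max range shown above).
sudo nvidia-smi -i 0 -pl 100

# List the memory/graphics clock combinations this GPU supports.
nvidia-smi -q -d SUPPORTED_CLOCKS

# Pin application clocks on GPU 0: memory clock 877 MHz (fixed on the V100),
# graphics (SM) clock 300 MHz. Values must come from the supported list.
sudo nvidia-smi -i 0 -ac 877,300

# Check which clocks the GPU is actually running at right now.
nvidia-smi -q -d CLOCK

# Reset application clocks back to the defaults when done.
sudo nvidia-smi -i 0 -rac
```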
So you can simply use the nvidia-smi command line tool to look at your GPU and see the minimum and maximum power limits you can set. Setting the GPU power cap is also simple: you just use sudo and nvidia-smi to specify your GPU ID along with the new power limit you want to set. This is an example I tested on my own GPU server; on the right, if you show the range, it will show you the minimum power limit and the maximum power limit.

Another pretty useful knob is the GPU clock frequency setting. It lets you manage the performance and power settings of your GPU, including modifying the GPU clock frequencies. You can run nvidia-smi -q -d SUPPORTED_CLOCKS to list all the supported frequencies you can set, including both the memory frequency and the graphics (SM) frequency. The graphics clock ranges from around 100 MHz up to 1380 MHz; that's for the V100 GPU, and for other devices like the H100 or A100 it may vary. Similarly, you specify the GPU ID, the memory clock frequency, and the graphics clock frequency to change your GPU frequencies.

In this demo, we will show you how to containerize these foundation models and serve them as services in your Kubernetes cluster using open source solutions. Both the Bloom server and the Stable Diffusion server are open source, and we created the deployment YAML files for you to deploy them in your Kubernetes cluster with one click. We also developed load generators using real prompt datasets from the open source community, and those datasets are available in the GitHub tutorial as well. While we run the load generators to generate load on both the Bloom server and the Stable Diffusion server, we can see how the energy and GPU temperature climb and how they are using the GPU. Then we will show how the GPU frequency tuning and power capping impact the performance, especially the end-to-end performance of each query we run. In this way, we can find a nice trade-off between the inference performance you want from the generative models and the power saving, or power efficiency, you want to achieve. It's a 30-minute demo in total, but for time purposes I sped it up a little, to about 15 minutes.

Now we are in a Kubernetes cluster. What you need are several tools: the NVIDIA CUDA GPU driver, the NVIDIA GPU operator, the Prometheus stack, the Grafana dashboard, the DCGM exporter, et cetera. Then we go ahead and clone the Hugging Face-provided Bloom server. They already provide the Dockerfile for it, and we don't need to change it; it already has the APIs, the GUI, the web UI, all the things you need. The Docker build process might take around 10 to 15 minutes, but I sped this part up. Again, there are a lot of nice open source tools around the Stable Diffusion models: they developed the web UI for it and provide the Dockerfile for it, so it's all available online. For this one, you can scan the QR code, and the tutorial has all the links. The only thing you need is to enable the API for the load testing purpose, so we changed the Dockerfile a little bit to enable the API option.
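As a hedged sketch of that kind of change, assuming a web UI like the widely used AUTOMATIC1111 one, which accepts an --api launch flag (the Dockerfile in the linked repo may wire this up differently):

```bash
# Sketch: enable the HTTP API when launching the Stable Diffusion web UI so the
# load generator can call it programmatically. The image name, port, and the way
# the flag is injected are placeholders; the Dockerfile in the linked repo may
# wire this up differently (e.g., by baking the flag into the image).
docker run --rm --gpus all -p 7860:7860 \
    -e COMMANDLINE_ARGS="--api --listen" \
    stable-diffusion-webui:latest

# With the API enabled, text-to-image requests can be sent over HTTP instead of
# clicking through the web UI, which is what our load generator relies on.
```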
Without that, it would just be a web UI that we would have to drive manually. Docker is not available on my server, so I use Podman. When that is done, we want to clone this repo; it's on my GitHub, the OSSNA 23 demo. I have also pre-built those images in my Quay repo, so you can easily pull the latest version to test on your own Kubernetes cluster.

In this demo repo, we provide the two manifest files for deploying both the Bloom server and the Stable Diffusion server. The only thing you need to add for those models is to specify how many GPUs you are using; here the limits and requests are set to one GPU. Another thing is that you may want to create either a ClusterIP or a LoadBalancer service to expose the service outside the cluster. Here I just use a ClusterIP and port-forwarding to expose it.

We use the same server as the load testing server, so we need a few detached screen sessions to port-forward Prometheus, Grafana, the Bloom server, and the Stable Diffusion server to localhost, so that our load testing can send queries directly to those endpoints to fetch both the metrics and the inference results. If you are familiar with Kubernetes, these are just the regular load testing procedures. Now all the ports are forwarded to localhost, so in my local browser we can see the web UI of Stable Diffusion and also the web UI for Bloom. In Grafana, you may want to use the dashboard provided with NVIDIA's DCGM exporter, which shows the temperature of each GPU, the power consumption of each GPU, and the utilization of each GPU; that's pretty useful. Here is the dashboard, and we can see for each GPU how the temperature is changing and how the power usage is changing; it also shows the graphics clock frequency, the memory utilization, et cetera.

Let's use this simple prompt, "Welcome to Open Source Summit 2023", to see how the Bloom server works. This is the 560 million parameter Bloom model; if you deploy the seven billion or one billion parameter model, the quality of the output would be better. Similarly, we show that the Stable Diffusion server can generate a realistic photo based on the prompt. Those prompts are from Hugging Face datasets, which provide really nice prompt datasets for these big models.

To run our load generator, the only thing you need on your load testing server is a Python environment, Python 3.9, and all the required packages are listed in the repo. You just enable the virtual environment and install those packages, and the Python load generator script is ready to go. Now we start a new detached screen to run the Stable Diffusion load generator, which sends 20 realistic prompt queries to the Stable Diffusion server to generate load on the GPU. You can see that if you refresh the Grafana dashboard, the GPU temperature goes up when you have load, and the power usage also goes up, to around 150 watts. Similarly, on the other GPU, where the Bloom server is deployed, we generate some Bloom model load. In the Bloom server, by default you can choose a batch size from one to 32, with 32 as the maximum; that means you send 32 queries at the same time and they are processed in parallel on the GPU.
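Roughly, what the load generator does per batch looks like this sketch; the endpoint path and JSON payload here are placeholders, not the Bloom server's actual API, and the real load generator in the repo uses the server's own schema:

```bash
# Sketch: send one batched inference request to the port-forwarded Bloom server
# and measure the end-to-end latency. Endpoint path and payload schema are
# placeholders; the real load generator uses the server's own API and a prompt
# dataset rather than a repeated prompt.
ENDPOINT="http://localhost:8000/generate"    # placeholder port-forwarded endpoint

# Build a batch of 32 copies of the same prompt.
PROMPTS=$(python3 -c 'import json; print(json.dumps(["Welcome to Open Source Summit 2023"]*32))')

START=$(date +%s.%N)
curl -s -X POST "$ENDPOINT" \
     -H "Content-Type: application/json" \
     -d "{\"text\": $PROMPTS, \"max_new_tokens\": 180}" > /dev/null
END=$(date +%s.%N)

# Per-query latency is roughly the batch latency divided by the batch size.
echo "batch latency: $(echo "$END - $START" | bc) s"
```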
Batching like that maximizes your GPU utilization; it is pretty common when you deploy these generative models. It improves your GPU utilization without hurting your performance much, as we will see in more results later. What you see here are the batch IDs and the response time per batched query; on average, the response time per query for Bloom is around 120 milliseconds. If you have ever used ChatGPT, you put in a prompt and wait for the response, and it is easily more than one second. With this open source 560 million parameter Bloom model, you just need a V100 GPU and the response time is easily below one second.

Next we want to try power capping, to limit the maximum power the GPU running Stable Diffusion can use. We just saw that the power usage of the Stable Diffusion server is around 150 watts, so if we lower the power cap to 100 watts, what happens to the response time of the queries? Let's set it to 100 watts and choose GPU one, which was the first one to ramp up after we deployed the load generator for Stable Diffusion. It's done: the limit changed from 150 watts to 100 watts. Next we check the screen for the load generator. By the way, the Bloom load generator with only 20 batches finishes quickly, so this time we change it to 100 batches so it runs longer. And remember this number, the response time of the Bloom load before we do any GPU frequency tuning or power capping. Then we go back to the screen of the Stable Diffusion load generator. What you can see here is that after we lower the power cap, the response time for each query goes from 36 seconds to 52 or even 62 seconds, so it almost doubles. When you limit the power, your response time will of course go up, but as long as that is acceptable, you can easily use power capping or frequency tuning to reduce the overall energy consumption of your cluster. If your cluster is not running a heavy load and those latencies are acceptable, why not tune it down and save more power?

Similarly, for the Bloom model we change the frequency of GPU zero to 300 MHz. Right now it is still running at full frequency and the response time is around 120 milliseconds. We use this command to tune the graphics clock frequency: the GPU ID is zero, the supported memory clock is 877 MHz, which is not tunable, so we keep it at 877, and the graphics clocks range from around 100 MHz up to 1380 MHz, so let's choose 300 MHz. If you look at the dashboard, you can already see the Stable Diffusion model's power consumption has dropped from 150 watts to 100 watts. Then we go ahead and change the other GPU's frequency to 300 MHz. It takes a while to take effect, and then the next batch of queries has an average latency of around 650 milliseconds, with the latest one going up to 770 milliseconds. You can also use nvidia-smi -q -d CLOCK to check the current GPU frequency; yes, the graphics clock is 300 MHz. When we refresh the Grafana DCGM exporter dashboard, we can see the clocks. Both models still report 100% GPU utilization, but the power has dropped significantly, to below 50 watts, and we will show later in the results that it is even lower than the idle power.
So even running below 50 watts of GPU power, you can still get less than one second per-query response time from your Bloom model. In other words, if you don't have much load on your GPU cluster and you don't tune the GPU frequency, you consume more power than if you enable GPU frequency tuning and run the model under a light load, and you still get a response time below one second for all your customer queries.

Later we did some larger-scale experiments to understand the performance and energy trade-off across different batch sizes, under both power capping and GPU frequency tuning. In the first experiment we fix the generated output token length for the Bloom model: whatever you give as a prompt to the server, the number of generated output tokens is limited to 180. We vary the batch size, the number of queries sent to the GPU together, from two to 32, and we vary the power cap from 100 watts to 250 watts. What we observe is that reducing the power limit from 250 to 100 watts has no impact on the latency; the latency just goes down as the batch size goes up. That means even the lowest power cap does not impact your Bloom query performance at all. The energy consumption doesn't change either: once the batch size is above four, the energy consumption is similar regardless of the power cap. What this says is that the 560 million parameter Bloom model is not large enough to use the full capacity of your GPU, so you have plenty of room to reduce energy without impacting your query performance at all. The larger the batch size, the better you use the GPU, and you get even better per-query performance, while the energy consumption doesn't increase significantly with increasing batch size.

In another experiment, we change the maximum output token length for the Bloom server from 20 tokens to 180 tokens and fix the batch size at 16. Similarly, we draw the curves for different power cap limits. The left chart shows that once the maximum token length is over 100 tokens, the energy consumption doesn't increase anymore. So it's fine to allow larger outputs as long as the server permits it; your server might allow a maximum of 200 or 400 tokens per request, similar to the ChatGPT you are using. With the batch size fixed at 16, the latency will of course increase with the number of tokens, because for each token you need to go through the whole transformer network, but it is almost flat once the token length passes 100. So the average query latency and energy consumption for 100 batches increase with the token length, but they do not increase further once the token length exceeds 100. This experiment again says you have full flexibility to reduce your power limit and energy consumption without impacting your query performance much. A V100 GPU is more than enough for the 560 million parameter Bloom model; it runs at full frequency even with a 100 watt power cap. Another thing we found is that the GPU utilization metric is not a good metric.
The GPU utilization metric right now only tells you that at least one of the accelerator units is being used, so it shows some utilization even when you are nowhere near the GPU's full capacity.

Next, we wanted to reduce the energy consumption of this Bloom model even further. We fix the maximum token length at 200, set the batch size to the maximum of 32, and tune the GPU frequency from 150 MHz up to 1380 MHz. We then look at the maximum per-query latency for Bloom queries at each frequency. The chart on the bottom left shows that the maximum latency you get is a little over 1.4 seconds, which for an interactive experience is pretty acceptable; I have used ChatGPT a lot and it is easily more than a second. If you run at the full frequency and full power limit, around 1.4 GHz, the latency goes down to around a hundred milliseconds. And if we look at the energy consumption, or rather the power consumption, under frequency tuning, shown in the chart below, the power consumption at frequencies below 500 MHz is even lower than the idle power. Most of the GPU servers we have idling today are simply wasting energy; they could be used to run these large models without even paying more energy cost. The good thing is the temperature is also below the idle temperature: in this case the green line, the first GPU, is idling, while the temperature of the second GPU, the one running Bloom, is even lower than the idle GPU's temperature. This set of experiments tells us that for these large generative models we can still use the GPU when the load is low and save a huge amount of power without sacrificing much performance. Our future work is to first understand the most efficient point on the performance-per-watt frontier, and then leverage that optimal point to auto-tune the frequencies and lower the GPU cluster's energy consumption. So thank you, that's all of our talk, and now we have time for questions.

In an organization that is running this infrastructure, who would be responsible for this level of tuning? Something like a custom Kubernetes scheduler that interacts with the NVIDIA GPU tuning based on load. Interesting, interesting. Does this have any relevance for cloud? Like, if I'm running on AWS, could I do this there? Yes, for any on-prem Kubernetes cluster, as long as you have root access to the machines, to the Kubernetes nodes, you can easily tune the power cap and the GPU frequency without problems. And how many people are actually running on premise? Well, for all kinds of GPU workloads you need the... It may not be on-prem, but you may have GPU servers from AWS and set up your own Kubernetes clusters. AWS doesn't give me a reward for using less of their electricity. For example, if you have root access to the node, the DCGM exporter will still show you the power consumption. So the two questions are: if I want to save energy, can I do it? And can I do it on the cloud? The incentives of the cloud provider and the incentives of the end user are sometimes in harmony and sometimes in conflict.
The cloud operators do not necessarily give you that access or pass along the benefit of what you have saved, but end users want to make this possible with capabilities from the cloud. The solution, in my opinion, is to work with AWS to find these tuning knobs and help AWS reduce the energy cost, so you as an end user can enjoy those benefits, for your sustainability goals and targets. We see that positive transformation already happening; we just need AWS to do more to make that available to end users. Yeah, this is a pretty advanced use case. Yeah, yeah. Either way, that power can be used to do real work; that's just an amazing benefit for everybody. Yeah, the custom auto-scaler is going to be the trick. Yeah, but whoever is going to provide the service will be the customer. Yeah, cool.

Similar to yours, does this work, or will this work, with the NVIDIA DRA Kubernetes stuff that they just talked about? Do you know the DRA driver for Kubernetes? No, I have no idea. Do you mind just explaining what that is? It's something I only overheard, but it's essentially a driver framework for Kubernetes where Kubernetes becomes more aware of what is going on with things like GPUs, and it seems like it intersects with what you could do here. At this point, to our best knowledge, we have not found any frequency auto-scaling utilities from NVIDIA, and the power capping provided by NVIDIA apparently does not always work. So we anticipate that if we are able to get more APIs and more tuning knobs from NVIDIA, then we can potentially make that happen. All these test results show that if we are able to tune the frequency based on utilization, we are able to meet performance and performance-per-watt objectives and do better than power capping. So if the DRA from NVIDIA has the same capabilities, I think that's going to be very good news for end users. Thank you.

Based on what you've demonstrated, I'm guessing there can be cost savings from this both on-prem and on the cloud. Is that correct? Can you validate this assumption? Yeah, I think that is a correct assumption. We have not quantified the exact benefits yet, we have not done the scale tests where a lot of people run the workload, and we have only done this for inference, not training yet. But I believe this is going to be a very positive trend for the future, as you can see: ChatGPT is everywhere, and Google's recent announcement making Bard and other models available only makes the problem more complicated. So the earlier we make this happen, the better we can do for the future. And there are potentially also savings from cooling, since there was some reduction in temperature as well. Exactly. And it even shows that you can consolidate more queries onto a smaller number of GPUs as you increase the batch size, which is what the machine learning community is trying to do: they want to handle larger batch sizes in one shot. All these experiments on Bloom show that even on a V100, and we now have the A100 and H100, the GPU capacity is not fully utilized. Thank you.

Have you seen this model deployed at any enterprise customer or at scale? I believe customers are using certain kinds of models, but the experiment is generic.
We picked a language model and a text-to-image model, Stable Diffusion, just for this investigation; customers who are running language models or text-to-image models should face the same situation we described. As long as they apply the same methodology on their own end, they should be able to see the same benefits. Well, currently, have you had any experience with it being used? You know, these models are still very new, so I believe the enterprise customers will find the limitations: first they have to make it happen, then they find the problems, and in the end they will find the solutions. So we are ahead of the curve at the moment. Good things.