Hi everybody. My name is Raz, and this is Natasha. We are both from Run:ai, and we love GPUs. Now, before talking about GPUs, please raise your hand if you use GPUs. Okay, now keep your hands raised if you have ever felt that you don't have enough of them. Great, so I see that some hands came down and some are still up. Trust me, even if you feel you have enough GPUs today, sooner or later you might have a feeling that you don't.

Okay, so just so we're on the same page here: GPUs are highly optimized for deep learning computations. They can dramatically speed up compute-heavy processes. That's why people choose to run their deep learning and AI models on GPUs. So we want more and more GPUs, because we want to do more AI. But GPUs are so expensive that we can't just keep buying more and more. That's why we need to try to get the most out of our existing GPUs, and that's a big challenge. So in this talk we're going to discuss GPU utilization: why it's often low, how we can detect it, and we're going to show you some tools that we built that can help you in this situation. Now Raz is going to talk a little bit more about it.

So, it's true that GPU utilization is a big issue, but it's really only one aspect of a bigger problem. And the problem is how to efficiently manage and provision GPU resources. Now, GPU provisioning is not a term you hear every day, and you may not even be aware that it exists. But the truth is that if you use Kubernetes, it does it for you. And we will see what Kubernetes does and also what it doesn't. But first, let's try to understand why GPU provisioning is not an easy task, and we'll do this by example. Let's say that you are the manager of a team that uses GPUs. Now, this is true both for small teams of a few members and for large organizations with hundreds or thousands of users. In our example, we have a team of four, and we have four GPUs, so each one of the members has their own GPU. (Alright, sorry. Can I hand you a mic? Your attached one doesn't seem to be working, so people can't hear you online. Thank you. So, should I go back? Can you hear me now? Great. Alright.) So, let's say we have four members and we have four GPUs, and each of the members has their own GPU. And then comes a new member. Now, you as the manager probably think that you need more GPUs. Let's go through all the things that happen while you are out there on the journey to getting these extra GPUs. First of all, you feel that you are letting the new member down, because you remembered only at the last minute that you need these GPUs. You start fighting with the IT department to get more GPUs, even though you already have a few. You have to explain to your boss why the new project is being delayed. And all this time, the new member is waiting for her GPU resources. Now, trust me, this is how I look when I wait for my resources. And the truth is that you don't always need more GPUs, and we will see why. First of all, not all tasks require all the resources that a user gets; some tasks are just smaller. Secondly, some of the work is done outside of the GPU, leaving the GPU unused, for example writing code or building the model. Also, some GPU workloads have heavy CPU and IO phases. In addition to that, people take breaks.
Now, this could be a short 10-minute coffee break, lunch, or even entire days off. During this time, the resources can sit almost entirely unused. And generally speaking, not all the resources are being properly used. So instead: when the new team member comes, rather than getting more GPUs, what you want is to use these idle GPU resources. Until now, I was talking about a team of people sharing GPUs, but this is exactly the same case for a Kubernetes cluster running different AI workloads. Because, very similarly to people, different AI workloads are not identical: they have different loads and different resource consumption.

Now, it took us quite some time to fully understand why it's so hard to fully utilize a GPU, but I think that now we can say that there are two fundamental reasons for it. The first one is that many GPU usage patterns are underutilized by definition. For example: when you do remote debugging with VS Code or PyCharm; when you write code interactively with a Jupyter notebook or just a Python shell; when you have idle AI workloads just waiting for new requests; and when you have not-so-heavy tasks like small models or small batch sizes. The second fundamental reason is that GPUs are provisioned statically and coarse-grained. This means that most of the time you will see a GPU provisioned for either a single user or a single AI workload. In order to overcome these challenges, GPUs must be provisioned in a smarter way. They have to be provisioned dynamically, which means not one GPU per user or per single AI workload. Additionally, they should be provisioned at a finer grain, for example in fractions of GPUs. And they should also be overprovisioned: you should run more workloads than the available GPUs, very similarly to other resources like CPU, memory, and storage devices. Now, unfortunately for us, Kubernetes does not support overprovisioning of GPUs. It also does not support fractional provisioning of GPUs. But luckily for us, it does provision our GPUs in a dynamic way.

Now, if you're not using Kubernetes just yet, for example if you're running on the host machine directly or over SSH, we created the right tool for you. We call it genv, and it stands for GPU environment management. It's 100% free and open source. It is highly inspired by existing environment management software like pyenv, Conda, and others, and it works in a very similar way, helping people provision their GPUs in a smarter way. genv is a terminal-based tool, just like pyenv and Conda, but we did create integrations for common IDEs. We have an extension for VS Code, we have an extension for JupyterLab, and we are working on a plugin for PyCharm as well, so that every GPU user can use their GPUs in a smarter way, natively, in his or her working environment. genv and all these extensions are open source and available on our GitHub page, and I recommend you go check them out and see if they can be useful for your team.

Now, going back to Kubernetes: we said that Kubernetes does provision GPUs in a dynamic way, but the problem is that it does it at pod granularity. This means that most of the time you will see one GPU provisioned for one pod. When this pod gets scheduled, a GPU is assigned to it, and as long as the pod is alive, this GPU cannot be reassigned to other pods, even if the GPU is idle and unused for any of the reasons that we saw earlier. Now, this must change if we want to fully utilize GPUs.
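(Editor's note: to make the pod-granularity point concrete, here is a minimal sketch, not from the talk, of how a whole GPU is requested in Kubernetes today using the official Python client. The namespace and image are hypothetical; the key point is that the nvidia.com/gpu resource accepts only whole integers, and the device stays bound to the pod for its entire lifetime.)

```python
# Minimal sketch: how Kubernetes provisions GPUs today, at pod granularity.
# Assumes the official `kubernetes` Python client and a configured kubeconfig.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="trainer", namespace="team-a"),  # hypothetical names
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="my-registry/trainer:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    # Extended resources like nvidia.com/gpu must be whole
                    # integers -- no fractions -- and the assigned GPU cannot
                    # be reclaimed until the pod terminates, even if idle.
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="team-a", body=pod)
```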
Furthermore, even the smartest provisioning mechanism is not enough, because GPU resources should not be an afterthought. The point in time at which we deploy an AI workload to production cannot be the first time that we ask ourselves how many GPU resources we need. The GPU resource requirements should be properly defined in the development phase of the project, just like any other project dependency, for example Python packages or a Docker base image. And I truly believe that they should be specified as infrastructure as code. Now, if you use genv for your project, it allows you to save the resource requirements as code, as part of your Git repository. Okay, so the truth is that there is no magic button for these problems. We have to change our mindset, and now that we have a better understanding of it, I'm going to pass it to Natasha and she will continue with the second part.

Thank you, Raz. Okay, so now we understand the problem, our responsibilities, and the power we have in our hands. But where does it lead us? What can we actually do about it? Well, you cannot improve what you don't measure. So the first thing we need to do is to get a better view and better understand the GPU utilization in our cluster. We want to know if we have some idle GPUs and to see if there are pending workloads. So let's drill down and see: are you really out of GPUs? Now, going back to the case of AI workloads in our Kubernetes cluster, we know that different workloads have different loads and different resource consumption. But how can we actually measure this in reality? You guessed right: it's GPU utilization. This is the official definition of GPU utilization, but it basically just means how much the GPU is being used. So if utilization is low, or even zero, it means the GPU is not being fully used. In order to detect low utilization in a cluster, we want to find all the GPUs whose utilization is low. Let's see how we can do it.

As I said, if you have a cluster with several GPUs but you have not installed Kubernetes yet and you're considering it, you can use rntop, which is Run:ai's open-source top-like tool for monitoring GPUs in a cluster over SSH. We don't have time to look at it today, but it's really cool, so you can take a look at it. But for our test case now, we do have a Kubernetes cluster. Let's say we have two nodes with four GPUs each, and let's get a better look at what's going on on those GPUs. So this is the dashboard we built. We're going to go over everything here in just a few minutes, but what is the first thing you notice? We have eight allocated GPUs out of eight total right now. And what do we see on the right? The problem. We said before that we have a problem; now we can actually see it. We have two pending pods that request a GPU and are not allocated due to lack of resources. But at the same time, we have two GPUs that are allocated by pods that are not using them. They are idle. So this is a big problem. Now I want to go back a little and explain how I built this dashboard, and then we'll return to it. To identify our idle GPUs, we need to know the utilization of each GPU at a certain moment. If the GPU utilization is zero and there is a pod running on the GPU, then it is idle while allocated. This is the worst kind of GPU to have in your cluster: you're paying for it, and you can't even use it. So we need some sort of component to export this data for us, and there are existing tools that can help us with this.
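(Editor's note: as background only, and not something shown in the talk: the raw per-GPU utilization signal that such tools ultimately expose comes from NVIDIA's NVML library, which you can also query yourself. A minimal sketch, assuming the nvidia-ml-py bindings and an NVIDIA driver are installed.)

```python
# Minimal sketch: reading per-GPU utilization straight from NVML,
# the same underlying signal that GPU monitoring tools expose.
# Assumes `pip install nvidia-ml-py` on a machine with an NVIDIA driver.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        # util.gpu is the percentage of time a kernel was executing over the
        # last sample period -- the same definition discussed later in the Q&A.
        print(f"GPU {i}: {util.gpu}% compute, {util.memory}% memory activity")
finally:
    pynvml.nvmlShutdown()
```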
The first tool is the NVIDIA DCGM exporter. It's a DaemonSet that exports metrics about the GPUs on each node. If you're working with GPUs and Kubernetes, you probably already have it installed as part of NVIDIA's GPU Operator. We also have Prometheus, which I'm sure you've all heard about; it's a monitoring system that collects and stores metrics. And we have Grafana, which comes bundled with Prometheus and allows us to build dashboards like the one we saw earlier. They're all easy to install and easy to use, with a pretty UI and everything. I installed them on my demo cluster and opened the Prometheus UI.

Now take a deep breath: we're going to deep dive into some PromQL for a few minutes. This is a screenshot from the Prometheus UI, just raw metrics without the fancy dashboards. This is the metric we are interested in: DCGM_FI_DEV_GPU_UTIL. It comes from the DCGM exporter. Each record here stands for a different GPU, and the value is the utilization. It's a percentage, so the value will be between 0 and 100. The labels of the metric give us more information about each GPU, so we're able to identify it. For example, we have the Kubernetes node of the GPU; we also have its UUID and the GPU index on the node. And if there's a pod running on the GPU, we will have the pod name and namespace; if there's no pod running on the GPU, those labels will just be empty. Now, out of all the records here, we are interested in the ones that actually have a pod running on them. We can use PromQL, which is the Prometheus query language, to perform queries on the metrics. This is how we filter for the desired results: we select only the records with a non-empty pod label, and the records that remain stand for the GPUs with a pod running on them. Next, we would like to filter for the idle allocated GPUs, so we want the utilization to be 0, which means the metric's value equals 0. This is also PromQL. And there you have it: those are the idle allocated GPUs. They have a pod running on them, and their utilization is 0.
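(Editor's note: if you want the same answer programmatically rather than in the Prometheus UI, a minimal sketch along these lines should work. The Prometheus address is hypothetical, and the exact label names can vary with the exporter's configuration.)

```python
# Minimal sketch: asking Prometheus for idle allocated GPUs over its HTTP API.
import requests

PROMETHEUS_URL = "http://prometheus-server.monitoring.svc:9090"  # hypothetical address

# The PromQL from the talk: GPUs that have a pod running on them (non-empty
# `pod` label) but whose reported utilization is zero.
QUERY = 'DCGM_FI_DEV_GPU_UTIL{pod!=""} == 0'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    print(f"idle allocated GPU {labels.get('gpu')}: "
          f"pod {labels.get('namespace')}/{labels.get('pod')}")
```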
So I used some more PromQL to aggregate all this data, on this metric and on other metrics as well, and I put it into this Grafana dashboard. It's also open source, of course; you can take a look at it later to see the Prometheus queries there and how they're used in Grafana. So now that we understand where this data is coming from, we can analyze what we see. Here we have the number of idle allocated GPUs, and below it the pods that are running on those GPUs. And here we have the requested GPUs, so we can also see how many GPUs each of those pods is requesting and holding idle. Now, if each user or team uses a different namespace, which is a good practice, then we also know the owners of those idle GPU pods, and we can kindly contact them and ask them to delete their pods, or maybe delete the pods ourselves if we have the permissions for it. We can also see the users who submitted the pending pods, so we may know who's going to be angry with us about their pods not running. Here we have six GPUs allocated out of eight total. Now, some organizations might not even know this information, that they have available GPUs. These are not idle allocated, because there's no pod running on them; they're just available. So it's an easier problem, and I know it sounds funny, but you should just execute more workloads. Use your available GPUs; you paid good money for them. A good visualization of the GPU utilization also helps. Here we have the GPU utilization over time; each line stands for a different GPU, and you can adjust the time span at the top of the Grafana page, so it can be the last five minutes or the last 24 hours, whatever you want. We also show the average GPU utilization across all GPUs in the cluster, and we can look at the GPUs with the lowest utilization.

So now that we have all this data, and we understand the GPU utilization and can see it and visualize it, what can we actually do about it? Well, the first and easiest thing, and we said this already: we can delete the idle GPU pods. Either the user or the admin can do that, if they both have access to the dashboard. Another thing we can try to do is to avoid the idle GPU pods in the first place. For example, you run a process inside a Jupyter notebook and press the little run button; then the process finishes, but the notebook stays open and keeps the GPU allocated. Instead, you can finish working on your model inside the Jupyter notebook, and when you want to execute it long-term, you can run it as a Kubernetes Job. That way, when the process inside finishes, the pod and the job terminate, and the GPU becomes available for other pods. And that way the pending pods we saw earlier, for example, can start running, and this is our goal.
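(Editor's note: a minimal sketch of that idea, again with the official Python client; names and image are hypothetical. Submitting the training script as a Job means the pod terminates when the process exits, releasing the GPU for the pending pods.)

```python
# Minimal sketch: run a finished model's training as a Kubernetes Job instead
# of leaving a notebook pod holding the GPU. Names and image are hypothetical.
from kubernetes import client, config

config.load_kube_config()

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-model"),
    spec=client.V1JobSpec(
        backoff_limit=0,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="train",
                        image="my-registry/trainer:latest",
                        command=["python", "train.py"],
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "1"},
                        ),
                    )
                ],
            )
        ),
    ),
)

# When train.py exits, the Job completes, the pod terminates, and the GPU
# becomes schedulable for other (e.g. pending) pods.
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```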
So let's recap. We understood the problem and presented some easy steps you can take to improve the situation. But it takes more: you need to look behind the scenes, maybe think of a better work process, one that suits you and your organization. You have all this data presented in the dashboard to help you get a better understanding of your GPU utilization. So you can use the dashboard, deploy it, change it, add more information to it, adapt it to your needs, maybe use Grafana alerts. The power is in your hands. After you do all that, you can continue to the next steps, which are, for example, smart provisioning, smart scheduling, and increasing the utilization of your existing GPUs. But we don't have time to talk about those today. Thank you everybody for joining us. As we said during the presentation, every tool we mentioned is open source and available on our GitHub page. We'd love to hear more from you about the way you use GPUs and the problems that you encounter, so here are our email addresses; feel free to contact us. We will also be at the Run:ai booth, booth number S63, so just walk by in the next couple of days; we'd love to hear from you. Thank you.

Thank you, Raz and Natasha. Excellent talk. I'll open it up to questions. Oh, there are loads. I was actually going to ask one, but I'd better let the audience go first. You go first.

I just had a question: it strikes me that if you're going to do it dynamically, do you need to know the length of time somebody's going to use the pod, so that you're not switching them? Or, if it's idle and the person who owns it, the way you've described it, is very much having it taken back, then that first person no longer has access to a GPU. I just wondered if you could elaborate on that.

Yeah, that's a great question; I can try to answer. If the pod is idle right now, don't rush to delete it right away; models have some idle time and then go back to not being idle. But if you see a pod being idle for a day, for a long time, and you don't even have to know that this person went on vacation, you just see it's idle on the dashboard, then you can delete it. Yes, the person will lose access to the GPU, but, well, they shouldn't in the first place run workloads that keep GPUs idle for 24 hours. So that's our approach to these problems.

If I may add: what you suggested is a great idea, but it might be a bit too simple and naive. I mean, it would be beneficial: you can specify the amount of time that you think a pod is going to run, as a hint to the orchestration system. But this is too simple when you're talking about big clusters with many different AI workloads. What you want to have is a system that automatically detects idle resources and idle pods, takes those resources, and brings them automatically to other pods. So it's a great idea, but it might be too naive, and we're looking for more advanced ways to improve this without specifying hints manually.

What we've found with using the DCGM metrics is that utilization is actually defined as whether you've done one CUDA operation in the last one second. So you could do a CUDA memcpy and you'd get utilization. I'm kind of curious if you've been able to develop anything deeper to actually understand how many CUDA cores are really left and how much more you can push the GPU.

Yeah, that's a great question, because GPU utilization might be misleading. As you said, it might also be too naive, because if you do something very small on the GPU in every sampling window, it might seem that you have high GPU utilization. This is an inherent problem with looking at GPU utilization. There might be other metrics that could be created, for example to understand how many CUDA cores are actually running. There are NVIDIA technologies, for example MPS, that give you a bit more insight into this, and there are also profiling tools from NVIDIA that will help you further understand how much you actually use your GPU. GPU utilization is just a very basic metric, but it gets you at least 80% of the way in the direction you're heading.

So I'm curious what you see as the future of the roadmap of time slicing or context switching of GPUs. That does seem like the holy grail. I'm a system operator, not a data engineer, so what I see as a system operator is: why aren't we treating our GPUs like CPUs with multi-threading? What do you see as the future of this? Is it around the corner? Is it decades away? Where is the technology going? This is another great question.
I think the way we see it is that we believe we can have a system that automatically detects idleness and resource consumption in real time. And then, just like with other, more mature resources that have been around for decades, we can develop technology that treats GPU resources in a very similar way: you just submit your workloads, and the system underneath understands in real time how many resources every workload needs and whether a workload can give them up temporarily. If it gives them up, then for that temporary period of time we can hand these resources to another workload. This is the holy grail, and the holy grail is to make GPUs behave very much like CPUs. Nowadays, when you submit CPU workloads, you don't really care which CPU core they're going to run on; the operating system handles it for you. We believe you can achieve the exact same thing with GPUs. And no, it's not decades away; it's here, and it's going to happen real soon, I believe.

So, earlier you described a little bit about looking at batch workloads and jobs in order to relieve the pressure on the GPUs. Is that the primary spot you're looking at to try and relieve this GPU pressure? How do we move some of those workloads; what does that continuum look like?

I'm not sure I understand the question... I think I understand. If you're working interactively on the GPU, then you do need the notebook, and you're doing interactive work on it. But I meant that if you're debugging or building a model and then you've finished it and want to train it for a long time, that's where it's better to switch to a job and not a notebook. Does that answer your question?

I think that's part of it. Even during that debugging cycle, I have a data scientist working for 8 hours a day; they take breaks, but they're still interactively attached. Even as you described in the presentation, there's a period of time when they went to lunch for an hour and a half, and I could use that GPU somewhere else. But because it's nominally an interactive job, it's still holding that GPU. Can you relate to that?

This is a really great question. Here we presented some basic things that people can do today, but if we're talking about more advanced approaches, then your point is absolutely correct, because we believe you can have a system that detects this GPU idleness. When you decouple CPU resources from GPU resources, you can leave the Jupyter notebook running: the user can still use the notebook for visualizing data and for writing code interactively, but as long as nothing is running on the GPU, the system underneath can hand the GPU resources to another pod that shares them, in a controlled way, managed by this system. So the holy grail is that you will not have to stop your job entirely; you will just have a system that manages it underneath. Does that answer it?
Alright, did you get a chance to compare schedulers, like Mesos versus Kubernetes? Mesos has a good scheduler based on Dominant Resource Fairness, a very decent scheduling algorithm. Did you get a chance to compare them, so that we can see which is a better scheduler for resources like GPUs?

Alright, so we believe that we understand AI workloads in a very intimate way, because we go through the whole stack: the Kubernetes layer, the Linux layer, and the GPU and CUDA layers themselves, as well as the applications and frameworks like TensorFlow and PyTorch. So we believe that we can create a Kubernetes scheduler that is fully dedicated to AI workloads, and we believe that we can build the software that is needed in all these layers instead of relying on others to implement it for us. I'm not familiar with the specific scheduler that you mentioned, but we have a lot of thoughts about existing schedulers, we do draw our own distinctions between them, and I'm sure we can talk about it later.

Cool, time for one more question, if anyone has one.

Have you found that the provisioning and metrics differ by workload type? A lot of what you've explained is really good for batch processing, but there are a lot of workloads on the same cluster that want to hold on to a GPU longer, and that might be appropriate for that type of workload.

Yeah, exactly. I mean, there are many types of workloads: for example, interactively building a model, training the model, and inference servers as well. And inference servers can be idle too, because they are waiting for new requests to arrive; until those requests arrive, you can hand their GPU resources to another workload. So this is true for inference clusters as well.

Great, thank you again. Let's have another round of applause for Raz and Natasha.