Hi, hello, and welcome, everyone. This will be a talk about increasing GPU utilization in Kubernetes clusters. And I wanted to start by thanking you for still being here after this exhausting week. I hope you are not as tired as I am and will stay with me through the talk.

So, my name is Maciek Mazur. I'm a principal AI engineer at Canonical, the company behind Ubuntu. And I'm typically working on the other side of the fence: I know that you are mostly cluster admins, DevOps engineers, and people operating Kubernetes clusters, and I'm a user of the great stuff that you build. I wanted to tell you a little bit about how we are actually using it, what our goals are, and how we built a couple of projects last year on infrastructures that varied from 7,000 to 12,000 GPUs in a single project setup or cluster.

So why are we even interested in Kubernetes for AI/ML, from the ML engineer's perspective? Our workflow is typically very structured. There is a process that we follow called MLOps, and basically there are a lot of small steps that need to happen using various types of compute. From our experience, Kubernetes gives us repeatability, pipelines, portability, and scaling: all the standard stuff that you realized ten years ago. It took us a little more time, but I think we have now started using it properly as well.

But why is this talk about GPUs, and why do we even need them in the cluster? The actual thing we are doing is mostly matrix multiplication. A lot of people will present ML as something very complicated, but mathematically these are fairly standard computations. And why GPUs? Because it's very similar to graphics: your screen with RGB pixels is also a matrix. So all the silicon built by NVIDIA, by AMD, by Intel is basically a tool designed to do a lot of similar computations. That's why GPUs became super popular in machine learning, and that's why we use them very heavily in our processes.

So typically, in any Kubernetes cluster, you need your GPU operator, something that makes the device visible to the cluster. If the cluster is bigger, you would also go with a network operator to connect the GPUs together. That's a fairly standard example of a stack from NVIDIA; a minimal sketch of a pod consuming a GPU exposed this way comes right after this section.

But OK, we want to use GPUs. Which ones? And is there really any difference? You can go with the setup from NVIDIA; that's the most popular one and how most clusters are being built these days. But you also have contenders: AMD with the ROCm kits, which is also working a lot on the optimization of the frameworks, and the same goes for Intel with OpenVINO and the extensions to TensorFlow and PyTorch, like ITEX and IPEX. So there are multiple options.

But what we as users mostly look into is the pure statistics and power. If you look at the lineup of GPUs from NVIDIA, there will obviously be performance metrics and benchmarks made by people. But a very important metric from my perspective as a user and solution architect is performance per price of the GPU. And if you rank the list by that, it does not look the same at all. So there are a lot of caveats in terms of which GPUs you actually choose for the computation you want to do and what the mix should look like.
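To make the device-plugin part concrete, here is a minimal sketch of a pod requesting one whole GPU through the nvidia.com/gpu resource that the NVIDIA GPU operator's device plugin advertises. The pod name and image tag are my own placeholders, not anything from a specific deployment:

```yaml
# Minimal sketch: consume one GPU exposed by the NVIDIA device plugin.
# Pod name and image tag are placeholders for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]          # prints the GPU the pod was given
      resources:
        limits:
          nvidia.com/gpu: 1            # whole-GPU request; slicing comes later
```

If the operator stack is healthy, the pod lands on a GPU node and nvidia-smi shows exactly one device.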
So the price/performance is a very important aspect, especially since the most powerful GPUs are new technology and need to be priced a little higher because of the development cost. That's something we look into very closely.

And how do we actually use GPUs? There are a couple of main categories. We have training environments, where we take the data, train the models, and perform the fine-tuning processes, and we have inference environments. On the training side, we typically start our work in a notebook, or VS Code or IntelliJ hooked up to a notebook server. We have our MLOps pipelines on Kubernetes with Kubeflow. We do processing tasks, which are also GPU-accelerated if you go with Spark and RAPIDS. And we test a lot in this type of environment. But we also need to do inference, where we actually serve the models to end users, and that has completely different requirements in terms of computational speed. We also do a lot of serving in real time, especially with IoT devices or analysis of signals from a telecom network: a lot of pre-processing, with GPUs as an additional hardware type.

But what is super important is sharing that infrastructure among multiple teams and multiple users, because GPUs are that expensive. Price/performance is important, so we need to figure out how to actually fit the workload onto the GPU in a proper way. And there are multiple ways to do it. You can time-slice the GPU, basically giving everyone a time slot when they are using it. But the most popular way, especially in big clusters, is MIG or another kind of GPU slicing. Basically, if you have a 40 GB or 80 GB GPU, you slice it into smaller pieces to efficiently place workloads that do not require the full GPU. If you have 40 GB on the GPU and your training process requires one gigabyte because it's something super simple, reserving the whole GPU is very counterproductive. A hedged sketch of both sharing modes comes after this section.

And if we have more GPUs, we need to start looking into networking. This is typically a big bottleneck, because GPUs are very fast by themselves, and if they are in one server, there are also fast ways to link them. But if you have multiple data centers, especially in different locations, or even different availability zones or fire compartments, you need fast networking. There are a number of technologies, from RDMA over Converged Ethernet (RoCE) through InfiniBand to, most recently, Spectrum-X, that allow you to have networking fast enough to pool those GPUs together into bigger training processes. But this increases deployment complexity, because we need a network for all of our data and everything we do normally, plus a dedicated network for any GPU-related tasks and processing. So there are a couple of options: InfiniBand and Spectrum-X come from NVIDIA, while RoCE can be used with basically any type of hardware. These are the choices you would look into in the networking space.

So when you have those GPUs connected and you have a big pool of them, the next question is how to schedule the resources. From our perspective as users, an ML job looks like this: some CPU-based computation, then some GPU-based work, then a bit of idle waiting while data is fetched from disk, and then rinse and repeat.
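Coming back to the two sharing modes, here is a hedged sketch, assuming the NVIDIA k8s-device-plugin and a MIG-capable card. The ConfigMap layout follows the plugin's time-slicing config as I know it, and MIG profile names depend on the GPU model, so treat the replica count, the profile, and all names as assumptions to verify against your plugin version:

```yaml
# Sketch 1: time-slicing, one physical GPU advertised as 4 schedulable replicas.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config          # hypothetical name
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
---
# Sketch 2: a small job requesting a hardware-isolated MIG slice instead of
# a whole GPU (profile names vary by card, e.g. 1g.5gb on an A100 40GB).
apiVersion: v1
kind: Pod
metadata:
  name: small-training-job           # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: my-registry/small-trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
```

The difference matters beyond utilization: time-slicing gives no memory isolation between the tenants sharing the card, while MIG isolates each slice in hardware, which is exactly the leakage concern that comes up with compliance later in the talk.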
So basically what you have is a queue of things that need to run on top of your hardware. The problem is that when what you want to put in the queue exceeds the available GPU space, that's when you start having problems and need to do more resource-aware scheduling. And in a bigger setup the problem is even bigger: a single GPU node is easy to manage, you can use taints and tolerations and describe your pods in a way that makes sense, but if you have a full data center or multiple racks like this, the problem gets more and more complex.

So we typically start from the place where we actually interact with the system. That would be Kubeflow in this case, because it gives you a nice way to generate pipelines and notebooks; the QR code will bring you to an easy way to install it. That's the front-end part. But what's actually important, and what this talk is about, is what happens behind Kubeflow and what enables using it on very big infrastructures.

The first project that we utilize very heavily is PaddlePaddle. That's basically something that allows you to split a bigger task or training job into smaller pieces, because obviously, in any kind of scheduling, smaller things are easier to fit into the queue, right? Queuing a 70 GB workload on an 80 GB GPU gives you very little flexibility; if you have a lot of small jobs that each require five gigabytes of GPU memory, it's much easier to distribute them.

And then, the standard Kubernetes scheduler is not really great with various types of hardware, so what we typically use on those bigger infrastructures is Volcano. That's a really great project that allows you to perform gang scheduling and more complicated assignments of things. If you have a job that is actually composed of multiple tasks, because you managed to split it with PaddlePaddle, the next thing you do is calculate how many resources the job needs so that you can do proper workload placement: OK, I need 20 GB of GPU memory, five CPU cores, and that amount of RAM; these are my options, and that's how I can place the workload. A hedged Volcano example comes after this section.

Volcano by itself works really nicely as a scheduler on a single Kubernetes cluster. But in reality, whenever you have a bigger training environment, you need to run on top of multiple Kubernetes clusters, because a single cluster will not really be able to serve something like 12,000 GPUs: at four GPUs per node, that's 3,000 nodes, which is still too big an infrastructure to run as one cluster. So you will have different ways of splitting and pooling the resources. The actual scheduling works pretty much as in the picture here: every job can be scheduled, but jobs can also be preempted or removed from the queue, and pods can even be evicted if a higher-priority thing comes in. So if you have multiple teams, you can allocate priorities between them and define which job takes precedence, for example when a queue overflows or you hit similar problems. And if you have more Kubernetes clusters, the Armada project is really great if you want a single point of entry that then distributes work across multiple clusters, each with its own scheduler.
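As a sketch of how those pieces look in Volcano terms, assuming the stock Volcano CRDs: a per-team queue, a PriorityClass for deciding precedence, and a gang-scheduled job whose workers are placed all together or not at all. All names, sizes, and images here are hypothetical:

```yaml
# Per-team queue; the weight is an assumed fair-share value.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a
spec:
  weight: 4
---
# Priority used when queues overflow and something must be preempted.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: team-a-high
value: 1000
globalDefault: false
---
# Gang-scheduled job: minAvailable=4 means all four workers start together.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: split-training-job               # hypothetical name
spec:
  schedulerName: volcano
  queue: team-a
  priorityClassName: team-a-high
  minAvailable: 4
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: my-registry/paddle-worker:latest   # placeholder
              resources:
                limits:
                  cpu: "5"
                  memory: 32Gi
                  nvidia.com/gpu: 1
```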
So, after this long intro to the technologies we use, let me bring you to the real-world examples, the projects that I was working on before coming here and over the last six to nine months. There was one common denominator in all of them: public-sector-related big clouds that give compute capacity to multiple tenants. The size of the infrastructure was between 10K and 15K GPUs, which means around 3,000 nodes, and typically a few thousand users split into various teams. So that's basically the main problem: sharing the resources in an efficient way. They do everything on those clusters; it's not only running training and inference, but also storing the data. And typically, if it's a sovereign cloud environment, people look into open source projects and want to build it full-stack, end-to-end open source. And there are a lot of strict requirements about security and SLAs.

The setup that one of the projects was using had a couple of different open source projects as databases. The new additions to the standard stack you would see here are the vector DBs. What we saw works really well, especially with internal LLMs and RAG, trying to get internal data and knowledge into the LLM, is OpenSearch or Postgres with the pgvector extension. These are the two dominant options that we saw working well in those environments, and the rest is a fairly standard MLOps toolchain. We also worked with them on a fully guided journey, starting with architecting that environment, putting it into production, and now expanding it to more nodes.

But what are the learnings? The first thing we learned is to listen to various people. The biggest pitfalls we hit were actually not coming from the data scientists or from the end users of the models, but from places like the compliance department and legal people, and a lot of other problems in terms of how stuff is shared. That matters for GPU sharing, because some of the methods of allocating a GPU to a workload allow memory leakage. And obviously, if you are running workloads in the government space, there are strict requirements about who can see what and what type of data, so you really need to watch out, especially if you slice a GPU into smaller pieces, so that you don't have any leakage between the workloads.

The hardware choice that worked really well for us is a mixture of different things. First, the bigger portion, around 70% of the cluster, would be something big: H100, H200, powerful GPUs with a lot of memory. We can slice them into MIG slices and run the big jobs and the small jobs, very flexible there. Around 20% would be mid-range GPUs that allow us to make scheduling more efficient; that's the only reason they exist here, so that you have some additional capacity to place the workloads that spill over the memory of your big GPUs. And around 10% would be low-end GPUs, something like the L40, which is best in terms of price/performance, dollars per FLOPS. That's for inference, for testing, and for putting even the smallest pieces of a job onto a GPU when you don't require that level of speed and that fast connectivity. One hedged way to steer workloads across such tiers is sketched after this section.

Another thing is that people actually still use VMs, which is really unfortunate in this particular space if you want to put them in the mix, but they are nevertheless still important.
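One hedged way to implement that 70/20/10 mix at scheduling time is plain node labels plus a nodeSelector. The gpu-tier label and its values are my own convention for illustration, not a standard label:

```yaml
# Nodes would be labeled out of band, for example:
#   kubectl label node <node> gpu-tier=high   # H100/H200 class
#   kubectl label node <node> gpu-tier=mid
#   kubectl label node <node> gpu-tier=low    # L40 class: inference, testing
#
# An inference pod pinned to the low-end tier (all names hypothetical):
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  nodeSelector:
    gpu-tier: low
  containers:
    - name: server
      image: my-registry/inference-server:latest   # placeholder
      resources:
        limits:
          nvidia.com/gpu: 1
```

In practice you would combine this with the MIG resource names from earlier, so that small jobs land on slices of the big tier and whole-GPU inference lands on the L40s.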
They use those VMs first for performance on the databases, and also for a lot of legacy workloads that they pull data from. Especially in the federal sector, we have seen tons of people still using things like SAP, legacy deployments from which they need to pull their data. We solve this problem with the open source project LXD, on the same cluster with the same resources, by defining resource pools with MAAS (Metal as a Service) and flexibly scaling the amount of things still running on VMs, shrinking it down every month as they migrate to something newer.

One additional thing: whenever you design this type of architecture, you need multiple storage types. Obviously, a standard S3 object store, that's fine, but you also need some additional local storage, and one thing people typically forget about is big-file storage. There are a lot of cases where either the LLM model itself is a file that exceeds your file system limits, or you have big files like field maps from the oil and gas industry that will not fit on standard storage. So that's another thing that can bite you; a hedged volume sketch comes at the end of this section.

On scheduling complexity: we tried various schedulers and ways of sharing the GPUs, and the combination of MIG, Volcano, and Armada was the best one, the thing that actually allowed us to increase GPU utilization. And we typically use Kubeflow as the UI part to control it, because those projects by themselves, and MIG as a technology, are super complicated and not that easy for the end user. So that's the setup that worked best for us over the course of those projects.

And networking is the main bottleneck. If you have an infra with any new GPUs, you have to go with Spectrum-X; that's the only thing that will give you enough performance and enough scalability in those clusters at the H100 and H200 level. And especially with the Blackwell chip announced a few days ago, that will be the networking you will use, so if you want your cluster to be ready for that expansion, it's super important. Make sure that you have three dedicated networks: one for the GPU traffic, using one of those technologies I mentioned; a dedicated storage network, because that is also sometimes a huge bottleneck whenever a fractional piece of a job needs to fetch additional data from your S3 or any other place; and your regular network for everything else. Also, whatever automation tooling you are using needs to be aware of the top-of-rack switch and be able to reconfigure it. During scheduling, when jobs are being rebalanced and the number of nodes in each cluster or the scheduling rules change, you will in a lot of cases end up in a moment where you have to reconfigure your networking. And if your automation tooling is not able to see it, that will be super difficult and will require a lot of manual work and hacks. A hedged sketch of attaching pods to a dedicated GPU network also follows below.
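For the big-file point, a minimal sketch, assuming your cluster offers some shared-filesystem storage class (CephFS, NFS, or a parallel filesystem); the class name and size here are placeholders:

```yaml
# Sketch: a shared claim big enough for large artifacts such as LLM weights
# or oil-and-gas field maps. storageClassName is a hypothetical placeholder.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-store
spec:
  accessModes:
    - ReadWriteMany          # many pods can read the same weights concurrently
  storageClassName: shared-fs
  resources:
    requests:
      storage: 2Ti
```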
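And for the dedicated GPU network, one common pattern, sketched here under the assumption that Multus is installed as the meta-CNI: a secondary NetworkAttachmentDefinition that pods join by annotation. The CNI type, interface name, and IPAM settings are assumptions for illustration only; a real RDMA fabric would use the SR-IOV or host-device plugins matching your NICs:

```yaml
# Sketch: secondary network for GPU/RDMA traffic via Multus.
# Interface name, CNI type, and IPAM range are illustrative assumptions.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: gpu-fabric
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens6f0",
      "ipam": { "type": "whereabouts", "range": "192.168.100.0/24" }
    }
---
# A training pod opts into the fabric with an annotation:
apiVersion: v1
kind: Pod
metadata:
  name: rdma-worker                      # hypothetical name
  annotations:
    k8s.v1.cni.cncf.io/networks: gpu-fabric
spec:
  containers:
    - name: trainer
      image: my-registry/trainer:latest  # placeholder
      resources:
        limits:
          nvidia.com/gpu: 1
```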
So a typical deployment, in terms of the hardware under the bare-metal Kubernetes you would want to run, splits more or less like this. On the left-hand side of the slide, you see the management rack. This is non-GPU compute where you have your control plane and your CPU-based compute for things like UIs (your Ollama web UI or whatever project you use to expose the LLM to people), the nodes that run the vector database, data ingestion, and so on. On the right-hand side, you have your GPU nodes and storage nodes; from the networking perspective, it's really good to place them in the same rack.

And then you can see two different mentions of an observability stack here. That's also something we learned the hard way: not to mix those two. The observability for the ML pipelines and the tools running there versus the observability for the underlying hardware, it was much better for us to separate those.

Then, from the tooling perspective: with those nodes and racks placed as on the previous slide, you put MAAS in to control them and add them to the control plane, and then you can run Kubernetes clusters with Volcano, the GPU and network operators, and MIG configured, and your tools on top of those clusters. But what's nice is that if you have a cloud management platform on top, and multiple of those clusters shared across multiple teams, you have your job queues, and then the resource pools in MAAS give you flexibility. Whenever your observability sees that team A keeps needing additional resources, scheduling a lot of jobs while resources sit idle on the cluster of team B, you can rebalance: move nodes from resource pool B to resource pool A and put them in the other Kubernetes cluster. That's basically an exercise we do weekly, over the weekends, rebalancing all the resources and keeping each cluster right-sized. Because whatever you agree at the initial stage, in the workshop asking the customer for requirements, "team A needs 200 GPUs, team B needs 400", that's an assumption, and in reality it typically changes a lot, sometimes even to the complete opposite.

And the whole observability and evolution loop, while it sounds obvious, is super important: we monitor the utilization of the GPUs, the network, and all the other components of the system, and then do a full reconfiguration. Besides the resource pools, we also change the priority order between the teams in the scheduling queue, and we change the slicing of the GPUs: if we see that all the jobs we run utilize at least two or four slices each, it means it doesn't make sense to slice in such a granular way and lose performance on it. And there are also a lot of optimizations you can do on the framework level itself, for example in KServe, in how you handle the requests; that's an additional place to look into.

And if you want to ask questions about those projects, or you have something similar, this is how you can reach me and contact me, and I'm happy to talk more with you after this. Thank you for listening.