Good to be here. I'm excited to be here. I'm Ronen Dar, CTO and co-founder of Run:ai, and I'll be speaking about training large language models on Kubernetes. One of my goals in this talk is to convince you that scheduling is a critical component when it comes to operating AI clusters. We saw quite a few talks on scheduling today, and also in previous KubeCons and previous AI days, so I think a lot of people are already convinced of that. But I'll try to convince those who are not. So let's start.

So OpenAI, right? A year ago they did this amazing thing and launched ChatGPT to the entire world, and the powerful capabilities of language models suddenly became available to everyone, essentially. That created a lot of excitement around LLMs, around generative AI. We all feel it, right? These are amazing days to be in the AI space. So OpenAI launched ChatGPT, and shortly after they raised $10 billion from Microsoft. $10 billion, that's a huge number. In Microsoft's announcement they spoke about the goals of the investment, three main goals: the first is to build a supercomputer for OpenAI, the second is to use OpenAI's models in Microsoft products, and the third is to be the exclusive cloud provider for OpenAI. So that $10 billion mostly, almost entirely, goes to compute, just the compute power that OpenAI needs to train models, deploy models, and run their operations. And since then a lot of startups have raised billions of dollars: Inflection AI raised $1.3 billion; Anthropic just recently raised $4 billion from Amazon and $2 billion from Google. Crazy numbers, and it mostly goes to compute power.

We've been seeing this trend over the last 10 years: explosive demand for compute power in AI. This graph shows the computing power required to train state-of-the-art models versus the time at which those models were published. You can see the blue region and the red region, which are the deep learning era and the large language model era. Since 2012, 2013, there has been a huge increase in the computational power required to train state-of-the-art models. In the last decade we saw eight orders of magnitude of growth. Eight orders of magnitude, that's a 100-million-times increase in the computing power needed to train state-of-the-art models. And with large language models right now we're seeing this big jump again. The large language models are the red dots, so roughly two orders of magnitude more computing power to train LLMs compared to traditional deep learning models. So there is a big promise in AI and LLMs, but also a huge problem when it comes to computing power.

What we see right now is companies struggling with compute for AI. Sam Altman tweeted about that; it's a famous tweet about the eye-watering compute costs. Microsoft warned investors in their previous financial reports that they might not reach their goals because they won't have enough chips, enough GPUs essentially. So the big companies are struggling, and so are small companies. We saw the GPU shortage over the last year. Companies are struggling with just getting access to compute. Sometimes you'll go to your cloud provider and you won't find a GPU to spin up for days.
When it comes to high-end GPUs like the H100, you can wait days until you get access to even one GPU. That's crazy. So access to compute is a big problem. And what we're seeing right now is a shift in how companies consume cloud resources. Companies are securing their access to GPUs, to AI compute, up front. There is a shift from on-demand, spinning up resources whenever you need them, to reserving more and more instances, reserving blocks of GPUs and keeping capacity for your own use. And when it comes to compute capacity for your AI workloads, managing it, operating it, and providing access to it is key. It's key to your AI development, to your AI initiatives, because when it's done wrong, what we see time after time when working with customers is that a few things happen. First of all, GPU utilization. I think we're all hearing about it: GPU utilization is typically much lower than 20%. So you have very expensive GPUs and less than a fifth of them is actually being utilized. But more importantly, even when companies have secured GPU access, time after time people still feel that the GPU is the bottleneck; just getting access to a GPU is still a bottleneck, and there are long waiting times to get access to compute. That means the productivity of teams is degraded, and it also means the quality of the AI is degraded. So managing and operating compute capacity, and access to that capacity, is key.

So what solutions do we see right now, how are companies managing access to compute? We see a few, and I want to go over them quickly. The first, straightforward approach is giving static allocations of compute resources to users. For example, each user gets a whole GPU machine in the cloud. It's for their use only and it's always available to them. That's the good part. But static allocations are somewhat of a problem, because AI requires dynamic access to compute. Most of the time, people need just fractions of GPUs to run their notebook, to build and debug their code. And just occasionally they need a lot of computing power, to fine-tune LLMs or run multiple experiments in parallel. This is a usage profile of our own users, the way they access GPUs and run workloads over time. We see, for example, user two running a big workload on 16 GPUs for a few days, fine-tuning a model on two nodes of GPUs, and then sitting almost idle for weeks. And user four is idle for weeks, then launches a lot of jobs in parallel using more than 20 GPUs for around ten days. So access to compute is very dynamic and bursty. When you give static allocations, that's not efficient: either the allocation is too high and a lot of GPUs sit idle most of the time, or the allocation is too low and users are limited by the quota they were given. For example, if user two had gotten just 8 GPUs, those workloads could not have run. And if user four gets an allocation of 8 GPUs, most of those GPUs will be idle most of the time, but then the bursts of workloads either won't run or will run for much longer, so time to results grows when allocations are too low.
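To make the static-allocation picture concrete, here is a minimal sketch of how it's typically done on Kubernetes: a namespace per team with a hard ResourceQuota on GPU requests. The team name and the quota value here are illustrative, not from the talk.

```yaml
# Static allocation: team-a can request at most 8 GPUs, and that cap
# stays the same whether the team is bursting or completely idle.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                       # illustrative team namespace
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-gpus
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "8"     # hard, static GPU cap for the team
```

This is exactly the trade-off just described: nothing lets user two borrow idle GPUs from another team's quota during a burst, and nothing reclaims user four's GPUs during the idle weeks.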
So static allocations are a problem: you get idle GPUs, and people don't get access to as much compute as they need, when they need it. So there's a need for dynamic quotas, quotas that adjust to the needs of users at every point in time, where users can share their idle GPUs with each other.

The second solution we see is teams trying to share GPU machines between themselves in a manual way, with Excel sheets or through a Slack channel, just coordinating among themselves. And doing that is not fun. First of all, it's manual coordination. But we also see a problem that we call GPU hugging: people don't want to let go of their GPUs, because if they did, they would have to fight to get them back, since someone else would take them. So we see this absurd situation where people are hugging GPUs they're not actually using, while other people on the team are waiting for access to GPUs. So that's also not the way to go. But there is something to the idea of giving users a guaranteed quota, guaranteed access to compute whenever they need it.

The last thing we see is trying to share a GPU cluster with an orchestrator, with a scheduler. We have HPC schedulers, schedulers from the high-performance computing world that were developed more than 10 years ago, like Slurm. As we see it, they provide a good solution for part of the problem, running batch workloads and batch training. But AI also requires interactive sessions, for users to use their Jupyter notebooks or any other IDE, and there is also a need to deploy models as services, as applications. HPC schedulers were not built for that, so when it comes to interactive workloads and inference, they provide a more limited solution. So HPC schedulers are great for part of the problem, as we see it. Kubernetes, on the other hand, is somewhat of a mirror image of that. Kubernetes was built for running microservices, for running applications, so it's a perfect fit there: you get load balancing, auto scaling, a lot of the tools you need when you're deploying services. And it's also good for running interactive sessions. But when it comes to batch scheduling, we all know it, right? We saw talks about it. There are limitations, there are gaps in the default Kubernetes scheduler. The good news is that solutions exist today, so people can use them to build managed AI clusters with Kubernetes and solve the whole problem.

So let's recap what's needed for AI, and for scheduling specifically. There's a need to schedule interactive sessions, batch jobs, and inference. Kubernetes can do all of that, but there are gaps: dynamic quotas, since the default scheduler only gives you static allocations; guaranteed quotas; sharing resources in a fair way with fair-share algorithms; hierarchical queues; advanced queuing; and gang scheduling for distributed workloads. All of those are gaps in Kubernetes today.
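To make the gang-scheduling gap concrete, here is a minimal sketch of what a gang-scheduled training job looks like with Volcano, one of the community schedulers mentioned in a moment; the job name, queue, image, and GPU counts are illustrative placeholders, not a configuration from the talk.

```yaml
# Gang scheduling with Volcano: the job starts only when all 4 worker pods
# can be placed at once, so a distributed run never starts half-scheduled.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-finetune
spec:
  schedulerName: volcano
  minAvailable: 4                 # the "gang": all 4 pods or none
  queue: research-team            # queues carry quotas and fair-share weights
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: registry.example.com/llm-finetune:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 8   # 8 GPUs per pod, 32 in total
```

Kueue approaches the same gaps from a different angle, with cluster-level queues that hold quotas and allow idle capacity to be borrowed between teams, which maps onto the dynamic-quota idea above.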
And when it comes to GPU provisioning, same thing, there are gaps. GPUs can be provisioned dynamically in Kubernetes, but the options for provisioning just fractions of GPUs are somewhat limited, and over-provisioning GPUs is also very limited. Why not over-provision GPUs the way you over-provision CPUs? But as I said, the good news is that there are solutions today, and the community has done amazing work here. When it comes to scheduling, there are tools like the Volcano and YuniKorn schedulers that cover those gaps, and Kueue, which we heard about today, together with the scheduling framework, so they provide ways to overcome those gaps. When it comes to advanced GPU provisioning, we heard about dynamic resource allocation today, another framework from the community that can close those gaps. And we at Run:ai are doing that as well. We built our own scheduler, and we provide a layer of advanced GPU provisioning with fractional GPUs and a way to over-provision GPUs inside Kubernetes. We've worked on that over the last five years, we did a lot of work there, and essentially we have an enterprise product built on Kubernetes, providing the GPU scheduler, the layer for GPU provisioning, and all the tools that cluster admins, researchers, and data scientists need to easily train and deploy models on top of Kubernetes, with all the good stuff under the hood, the scheduler and everything that's needed. So that's Run:ai, an enterprise product. I won't speak a lot about us, but I will invite you to visit our booth. That's our booth, our people are there, and we love every KubeCon we're at, so please come and visit us.

In the second part of this presentation, I want to talk about the community work that we're doing. We have a few projects going on, and I want to speak about one of them. This project is about making it very easy for the community to train LLMs, to train transformers, on Kubernetes, with all the data preparation and preprocessing that is needed. It's joint work with NVIDIA, with the NeMo framework by NVIDIA. We're working on it together, and in our repository we provide optimized scripts with optimized configurations that people can use to launch LLM training jobs, big multi-node workloads, very easily. Everything is optimized; you just need to run it. That's the goal, with how-to guides and everything. So we have our repository, and all of it is also under the NeMo framework and its docs. This is work by our engineer, Omer. He's our lead engineer, he's amazing, and he did all of this work. He was supposed to be here, but he got stuck in Israel, so he couldn't join me. But I wanted to give him the recognition; he did amazing work.

We also did benchmarking work on training performance on Kubernetes. We had a setup of four DGX machines with A100 GPUs with 80 gigabytes of GPU memory, 32 GPUs overall, connected with InfiniBand. And we used this software stack on top of Kubernetes: the NVIDIA GPU operator; the network operator, which simplifies a lot of the installation and setup required to run distributed training in an optimized way; and the Kubeflow training operator, which we and our customers use, a great tool that we highly recommend. And we trained GPT-3 with 5 billion parameters with Kubernetes and without Kubernetes, on bare metal.
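For a feel of what launching a run like this through the Kubeflow training operator looks like, here is a minimal sketch of a PyTorchJob spanning four nodes with eight GPUs each; the job name, image tag, and omitted training arguments are illustrative assumptions, not the exact configuration from our repository.

```yaml
# One PyTorchJob describes the whole distributed run: 1 master + 3 workers,
# each pod taking all 8 GPUs of its node (32 GPUs total).
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: gpt-5b-pretrain                      # illustrative name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch                      # the operator expects this container name
              image: nvcr.io/nvidia/nemo:24.01   # illustrative tag
              resources:
                limits:
                  nvidia.com/gpu: 8
                  # plus an RDMA/InfiniBand resource, whose exact name
                  # depends on how the network operator is configured
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/nemo:24.01
              resources:
                limits:
                  nvidia.com/gpu: 8
```

The training operator fills in the rendezvous environment (MASTER_ADDR, WORLD_SIZE, RANK) for each pod, which is what lets the same launch command scale from one node to four without changes.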
And we measured the throughput, in terms of the number of tokens processed per second, on one node, two nodes, and four nodes. These are the results we got. You can see here the difference in throughput with and without Kubernetes, that is, the overhead of the software layer that comes with Kubernetes. And you see that the overhead is really small, less than 2% in this case. You can also see the scaling here, throughput versus the number of GPUs. With and without Kubernetes the curves are very close to each other, and the results are very close to perfect linear scaling with the number of GPUs. We also trained a smaller model, GPT-3 with 126 million parameters, again with and without Kubernetes, with MPI. And again the difference is very small. In terms of scaling, it's harder to scale smaller models across many GPUs, so here we see some degradation in how close the results are to perfect linear scaling. And that's it. You can get access to those scripts and configurations in our repository and in the NVIDIA NeMo docs. That's our booth, and that's it. Thank you very much. Thank you.

So we actually have quite a lot of time for questions. The microphone is in the middle, it's open, bring it on. I can kick things off. You had a reference to on-premises and cloud resources in one of your slides, I think the previous one. So I have two questions here. One is, how do you solve the challenge of merging on-premises resources with cloud resources? Do you see a challenge there, not only for multicluster on-premises, but when you start crossing boundaries like this? The second one is, you mentioned ASICs; can you talk a little bit more about how you integrate those with Kubernetes, and any challenges that might be there?

Yeah, okay, that's a great question. Two great questions, actually. One is about multicluster, multicluster scheduling essentially. You can have one cluster on-premises and another cluster in the cloud, and scheduling across clusters could be beneficial: when resources run out in one cluster, you may want to migrate workloads to another cluster, and so on. We actually provide a control plane on top, a multicluster control plane, so you can orchestrate, view, and monitor all your resources and workloads across clusters from a single location, and you can set policies and quotas for each cluster from a single pane. In terms of multicluster scheduling, that's interesting, and it's a hard thing to do. There are projects like that; Armada by G-Research is an amazing project, they're working hard on that, and there's great work there. I would love to see how this topic gets moved forward by the community, for sure. And you spoke also about, what was the second question? I guess especially, yeah, especially ASICs for inference and how you manage that. That's a great question. Of course, most of our customers are using NVIDIA GPUs; most of the world is using NVIDIA GPUs. In terms of ASICs, integrating ASICs into Kubernetes and scheduling workloads on top of them, that's an interesting topic. We're working on those things. Of course AMD is there, TPUs are there, and more ASICs are there.
I think we'll see more and more from them in the future. And I think that today it's relatively easy to integrate ASICs into a Kubernetes cluster and schedule workloads on them. There are challenges around monitoring, around exposing metrics in a simple way, but those are not big issues. From what we see, it's something that is doable in a relatively short time. Things like fractionalizing ASICs, the way you can fractionalize GPUs, that's harder, that's more difficult. But in terms of scheduling workloads on ASICs, Kubernetes is in a good state right now.

Okay, I don't see any additional questions. So thanks again for a great talk. Thank you. Oh, there's one. I'm sorry if I came in a little bit late, but you mentioned fractionalizing GPUs. Do you do anything beyond MPS or MIG, those things that come with the operators that NVIDIA provides? With MIG? Can you repeat the question? I didn't catch it. Yeah, so I'm assuming fractionalizing a GPU means that you carve it into little pieces, which NVIDIA does with MIG or MPS. Are you doing anything else, or are you orchestrating those things?

Okay, so we're doing a few things. As you said, there are a few ways to fractionalize GPUs. One way is with MIG, and that gives you hardware isolation: you get hardware-isolated slices of the GPU, and that's great. But it can be less flexible; when you do things in hardware, it's less flexible than doing things in software. And there are also approaches to slicing GPUs that are software-only; MPS is one of them. What we're doing is a few things. First of all, fractionalizing GPUs is a problem that needs to be solved at the Kubernetes scheduling layer: scheduling workloads on fractions of GPUs is something the default Kubernetes scheduler doesn't support out of the box. So that's one thing that needs to be solved. The second thing we do, and I think it's unique to us, is that we sit at the CUDA layer: we intercept CUDA calls, we manage access to the GPU, and we allow multiple workloads to share that GPU in a controlled way. With software you can be much more flexible in how those workloads access the GPU. For example, if you're running two workloads on a GPU with MIG, then at every point in time those two workloads get just the MIG slices that were provisioned to them. But when you do it in software, you can essentially give the entire GPU to one of the workloads if the other workload is idle. That flexibility you can only get with software, and that's one of the advantages of using software to fractionalize GPUs. Then of course there are questions about what happens when the workloads collide and how you control that; that's why we sit at the CUDA level, to control what happens when workloads collide, so they share the GPU in a controlled way. Awesome. Then let's thank Ronen again and switch to the next talk. Thanks very much. Thank you again.
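To make the MIG side of that last answer concrete, here is a minimal sketch of a pod requesting a hardware-isolated MIG slice, assuming the NVIDIA GPU operator is configured with the mixed MIG strategy on an A100 80GB; the pod and image names are placeholders. A software-based fraction, by contrast, is typically requested through scheduler-specific annotations or resource names rather than a standard Kubernetes resource.

```yaml
# A notebook pod asking for one 1g.10gb MIG slice of an A100 80GB,
# instead of a whole nvidia.com/gpu. The slice is hardware-isolated,
# so its size stays fixed even when neighbouring workloads are idle.
apiVersion: v1
kind: Pod
metadata:
  name: notebook                            # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: jupyter
      image: jupyter/base-notebook:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
```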