All right. Welcome, everyone, to our talk: Navigating the Processing Unit Landscape in Kubernetes for AI Use Cases. Today we're going to be talking about processing units. We'll go over the basics of what they are, and then we'll talk about how they're used in Kubernetes. I'm a developer advocate at Google Cloud, where I focus on GKE and open source Kubernetes. I'm also a co-chair of the Special Interest Group for Contributor Experience in open source Kubernetes, so if you have any questions about contributing, feel free to let me know. I'm also a CNCF ambassador and a co-host of the Kubernetes Podcast from Google. And my name is Mofi. I am also a developer advocate at Google focusing on GKE, and these days my focus is mostly around running AI workloads on Kubernetes. And hello, everyone. I'm Rob Koch. I'm a principal at Slalom Build, and I'm also an AWS Data Hero. Sorry to be sandwiched in between two Googlers here. I also work in the CNCF Deaf and Hard of Hearing Working Group, where I'm the co-chair, and I'd like to welcome the rest of the group that's supporting me today. All right, folks, we're going to start off with some heavy stuff. I'm going to tell you the truth about containers. Did you know that you can write containers in about 100 lines of bash? Containers really are primarily made up of a couple of core Linux kernel components: cgroups and namespaces. cgroups allow you to do resource sharing. You've got so much CPU and so much memory on your machine; using a cgroup, you can say this process gets this much of the processor and this much of the memory. The other core component is namespaces, and namespaces are a logical isolation mechanism: a way of separating processes from one another. Aside from that, a container is basically just a process, and that's going to come into play a lot as we go through our content today. So let's start talking about processing units. There are actually a whole bunch of different types of processing units. Not sure if you're aware, but there are a lot of them, and they all end in "PU" because they're processing units. Today, though, we're going to be talking about them in the context of Kubernetes. For a processing unit to be compatible with, or supported by, Kubernetes, it basically needs to be supported by the Linux kernel, it needs to have a device driver that's compatible with Kubernetes and the hardware Kubernetes runs on, and it needs to be supported by the Kubernetes scheduler. Finding all of these components can be pretty hard; there's nowhere in the docs that actually has all of this written down. But basically there are three main types that are supported: CPUs, GPUs, and TPUs. FPGAs are actually also supported in Kubernetes, and there's a talk going on right now about FPGAs and Kubernetes, so check out the recording of that later. FPGAs are mainly used in personal or on-prem use cases rather than in cloud environments, so we won't go into detail on them here, but check out that other talk if you want to learn about them.
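Before we move on to CPUs, here's that "a container is basically just a process" point made concrete: a minimal Go sketch, assuming a Linux host and root privileges, in the spirit of the 100-lines-of-bash idea rather than anything from the talk itself.

```go
// Run a shell inside new UTS and PID namespaces. Inside, the shell sees
// itself as PID 1 with its own hostname, yet it is still just a process.
// Linux-only; needs root (or the right capabilities) to create namespaces.
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```

Add a cgroup on top of this to cap CPU and memory, and you have most of what a container is.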
So first up, we're going to talk about CPUs. A CPU has a lot of work happening behind the scenes: it's constantly reading from memory, writing to memory, and executing instruction after instruction. CPUs are general purpose, so there's a lot of different things you can do with them. You can write code and have it processed there, handle your messaging, and keep improving the speed of everything you do every day, with many different things happening per second. But even with billions of executions, a single core can't keep up with everything, so we have multi-core CPUs for the purpose of processing more; the cores can work in parallel at the same time, all running together. CPUs also receive lots of instructions and do the basic math to create 3D avatars, say, or process graphics that way. So you can see here the basic process of how a CPU works. At the bottom it's a switch, like an on-off switch, a one or a zero. Input gets written to the processor, as you can see here, and the processor decides what's going to be done with it based on the set of instructions being passed. The result gets written to memory, and then the control unit picks up whatever's there in memory, processes it, and it keeps going and going until it eventually gets to the output. So this is all well and good, and CPUs can do all sorts of general things, which is how our phones and computers generally work. But there's one little problem. The CPU, as Rob said, is basically either doing work or not doing work, and if it's not doing work, it needs to go figure out what the next instruction is, and that's stored in memory. Every time it has to go back to memory to fetch the next instruction, that takes a lot of time and makes CPUs quite slow. That's what we call the von Neumann bottleneck. There's a way we can adjust our architecture to make this at least a little bit faster: take the memory and, instead of having it outside the CPU, put some of it inside the CPU. The closer it is, the faster it's going to be. So modern CPUs generally have multiple cores, and each one of those cores has some memory in it, an L1 and possibly an L2 cache; there can also be an L3 cache shared within the multi-core CPU. All of these layers of memory give the CPU progressively faster (or progressively slower, depending on which way you're looking at it) memory to access, to make it feel faster. So when you think about how a CPU works, it's like what we talked about: the CPU is constantly reading from memory what the next instruction is and computing it. In this line of code, if you have a variable y = wx + b, it has w, x, and b stored in memory, so it will multiply w by x, add b, and then store the whole thing into y. If that's running in a loop, it has to constantly go back to see what w, x, and b currently are in memory. In the best case, that loop runs continuously and the values sit very close by in the L1 cache, but sometimes they don't, because the CPU also does other things in your computer. If the CPU turns to other processes, it will have to go back and reload those instructions and values into its caches. And every time it reads from main memory, that's a round trip over the memory bus, which adds latency and produces heat, over and over again.
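As a tiny illustration of that loop (just a sketch with made-up values, not code from the talk):

```go
// y = w*x + b in a hot loop: every iteration re-reads w, x, and b and
// re-stores y. Whether those reads hit the L1 cache or main memory is
// the difference the cache hierarchy exists to hide.
package main

import "fmt"

func main() {
	w, x, b := 2.0, 3.0, 1.0
	var y float64
	for i := 0; i < 100_000_000; i++ {
		y = w*x + b
	}
	fmt.Println("y =", y)
}
```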
So in Kubernetes, there's a wide variety of CPUs you can use. There are of course single-core CPUs, but we rarely see them these days, so I have them crossed out here. Generally these days you'll see multi-core CPUs, and those multi-core CPUs can themselves be architected in different ways. There are ARM CPUs, and x86 and x64; we probably learned about these in school. Kubernetes supports a wide variety of these architectures. Here I've got an example of a Kubernetes cluster by Justin Garrison that uses Intel processors, and a Raspberry Pi cluster that Mofi made that uses ARM processors. So both kinds of CPUs work in Kubernetes. Now let's talk about the special type of accelerator that a lot of us have been hearing about for the last few months to a year: GPUs. Before we talk about GPUs, why do we care about this kind of special hardware for AI/ML workloads? It comes down to a concept called embarrassingly parallel problems. In computer science, there are certain kinds of problems where the next iteration can be computed without looking at the previous iteration, or at anything else running in parallel. These are also called perfectly parallel or pleasantly parallel. I like those terms better than embarrassingly parallel; it kind of makes it sound like there's something shameful about being parallel. In reality, those problems are really nice for us to solve with multiple processors doing the work at the same time. So, GPU stands for graphics processing unit. GPUs have actually been around for a while now. They were initially created to render the graphics from your video games onto your computer monitor, and we used them to speed up frame rates when playing video games. It turns out that when you're doing things like machine learning, the actual task of computing the next cycle of your neural network is very similar: you can do many of those computations in parallel and compute them at the same time, so GPUs work really nicely on that kind of workload. The main concept here is that a CPU has a very few really fast cores, like the four-core architecture we saw. You can buy up to 16 cores in a consumer machine, and in the cloud you can even get up to a few hundred cores in a single VM. But what if we took that same idea and put a few thousand cores in the same machine, so we get something like the speed of a CPU core with the work done in parallel across a few thousand cores? That's exactly what they did for GPUs.
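Here's that embarrassingly parallel idea in miniature, as a hedged Go sketch: goroutines standing in for what a GPU does across thousands of cores.

```go
// Every y[i] = w*x[i] + b can be computed without looking at any other
// element, so the work splits cleanly across parallel workers.
package main

import (
	"fmt"
	"sync"
)

func main() {
	w, b := 2.0, 1.0
	x := []float64{1, 2, 3, 4, 5, 6, 7, 8}
	y := make([]float64, len(x))
	var wg sync.WaitGroup
	for i := range x {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			y[i] = w*x[i] + b // no dependency on any other y[j]
		}(i)
	}
	wg.Wait()
	fmt.Println(y)
}
```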
A single GPU usually has multiple processing clusters. The cores are grouped together, and each processing cluster is made up of a bunch of streaming multiprocessors, also known as SMs. Each SM has L1 caches, plus access to L2 caches, for quick access to memory. The problem of going back to memory for the next instruction is still a struggle for GPUs, and these caches within the GPU architecture are how they address it. The whole processing cluster then communicates with high-bandwidth memory so it can handle a large volume of data at the same time. This is roughly what a typical GPU structure looks like. The green parts here are the actual cores, which make up the streaming multiprocessors; those combine to make up a processing cluster with a shared L2 cache in between, and each of the processors also has an L1 cache for quicker memory access. When the GPU needs to access data that's not in the cache, it goes through the memory controller to talk to GDDR memory. On the latest iterations of GPUs you'll see things like GDDR5 or GDDR6 memory, and each generation of those memories is faster and faster, but also consumes a lot more energy to move the information. And this is the same image, but for the NVIDIA H100, now the second-latest GPU from NVIDIA; this slide became outdated in two days, because NVIDIA had their GTC conference on Monday. The H100 is the same picture we saw here, scaled up a lot bigger: it has a lot more of those green things, the cores, with L1 caches between each of them and a shared L2 cache between the groups, and two of those processing clusters together make up the H100. For the B200 that was announced two days ago, this same structure got scaled to twice this size. So we keep putting bigger and bigger numbers of cores into the same system to be able to handle more and more data. Now take the same task we saw for the CPU, the calculation of y = wx + b, which by the way is a computation you'd run when training a model on MNIST, the handwritten-digit dataset. Instead of doing the computation one core at a time, you do it on thousands of those cores, calculating in parallel across all of them and collecting the results at the end, which gives you all the outputs at the same time. Each individual computation is probably a bit slower, because a single CPU core is probably faster than a single GPU core, but when you have 5,000 of them working at the same time, your actual workload speeds up. Now let's talk about TPUs. TPUs are specialized hardware built by Google Research for doing matrix computation. TPU stands for Tensor Processing Unit, and this is the definition of a tensor, from mathworld.wolfram.com: "An nth-rank tensor in m-dimensional space is a mathematical object that has n indices and m^n components and obeys certain transformation rules." So, who in the audience understood this definition? All right, those of you who understood, please explain it to me after, because I did not. Let me try to explain the definition in a simpler form, by rank. At rank zero, a tensor is basically just a number; we all know numbers from programming or math. At rank one, it's an array, or a vector in math. At rank two, it's a 2D array, an array of arrays. Above rank two, they're generally just called tensors. So tensor is the general term for a matrix of any rank. It turns out that in machine learning, when you're doing neural networks, most of the data is very naturally represented in tensor form, and libraries like TensorFlow and PyTorch understand tensors; that's what they compute on to figure out the next step in your neural network. So TPUs are designed to be really good at computing tensors. A TPU is a matrix processor designed for neural network workloads: thousands of multipliers and adders connected together in what's called a systolic array architecture. What that basically means is that each unit takes input at one end and passes along two things: the product of its two inputs, and a running summation that it continuously accumulates. It creates almost a pipeline within your system for the calculation you're trying to do.
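To get a feel for that pipeline idea, here's a toy Go sketch. It assumes nothing about real TPU internals; it just models a chain of multiply-accumulate stages, each holding a fixed weight, with a partial sum flowing through.

```go
// A toy "systolic" dot product: each stage adds its own w*x product to the
// partial sum passing through, so no stage ever goes back to main memory.
package main

import "fmt"

func stage(w, x float64, in <-chan float64, out chan<- float64) {
	defer close(out)
	for acc := range in {
		out <- acc + w*x
	}
}

func main() {
	w := []float64{1, 2, 3}
	x := []float64{4, 5, 6}
	in := make(chan float64)
	ch := in
	for i := range w {
		next := make(chan float64)
		go stage(w[i], x[i], ch, next)
		ch = next
	}
	in <- 0           // seed the running sum at the front of the pipeline
	fmt.Println(<-ch) // 1*4 + 2*5 + 3*6 = 32
	close(in)
}
```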
So the TPU host that sends data to the TPU will push a batch of data through the input queue and store it in the high-bandwidth memory (HBM) inside the TPU. After the calculation is done, the output queue collects the results, for later use or for feeding a different input queue. To perform a matrix operation, the TPU loads the parameters from HBM into the matrix multiplication units; those are the internal cores of the TPU that do the calculation. So if you have data like we had before, instead of computing one value at a time, we load all the input parameters into the TPU at once: all the numbers get chunked up and sent to the TPU at the same time. That just happened there. In the next step, we load the other data. You had wx + b; we just did the x, and now we load the b that we're going to add, from HBM into the TPU. As each multiplication is executed, the result is passed to the next step, and the output is the summation of all the multiplication results. The benefit of the TPU during this whole process is that your machine is no longer going back to memory for the next instruction, because everything was loaded at the beginning. All of the numbers are loaded up inside the TPU, the input parameters get pushed through, and both the multiplications and the summations are calculated at the same time. The pipeline progresses until it reaches the end of the systolic array, and at the end you can collect all the results at once. Because the TPU doesn't have to go back and forth to memory, it's a lot more efficient in terms of energy usage; it isn't reading over the wire back and forth. So now let's talk about how these processing units get used in Kubernetes for the AI/ML space. The first one is the CPU, which is kind of the main unit of compute for any kind of workload. And as Kaslin explained before, containers are just cgroups and namespaces; Kubernetes does not really reinvent the wheel with containers, it just has APIs that talk to the underlying runc and container runtime to manage the container underneath. So if you've looked at any Kubernetes YAML definition before, you've probably seen a file like this. YAML is everybody's favorite. Yes, yes, yes. Everybody loves YAML. In Kubernetes we write YAML like this to define our workload. What we're looking at in this case is the block of resources and requests. We've defined one for memory, which is our RAM (we're not really talking about that as a compute unit per se), but the thing we're interested in here is CPU: we're saying I need 250 millicores of CPU (250m). What that means, we're going to talk about in a second. If you want to know everything that happens when you apply a YAML file like this, go to the schedule page for this talk: there's a PDF of the slides, and you can click a link that goes into a lot more detail about exactly what happens. But the part we're interested in is how Kubernetes then schedules our workload onto the hardware underneath. When a new node gets attached to our Kubernetes cluster, it lets the cluster know, through the kubelet, what kind of hardware it has: how much CPU and memory is available for doing work. The API server learns that information from the kubelet. And when a new workload comes in, meaning a YAML file gets applied to our Kubernetes context, the API server talks to the scheduler and asks: okay, I need 0.25 CPU and 2 gigabytes of RAM; what do you have available that I can run this workload on? If the scheduler finds something, the pod gets created on that node.
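Here, roughly, is the shape of the manifest being described. This is a minimal sketch with illustrative names and image, not the actual file from the talk.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo             # hypothetical name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # stand-in image
    resources:
      requests:
        cpu: 250m            # 0.25 CPU: a cgroup share, not a physical core
        memory: 2Gi
      limits:
        cpu: 500m
        memory: 2Gi
```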
A pod, for reference, is multiple containers that share the same namespaces for network and mounts, so a pod is just an abstraction on top of a container. And once it starts, that process talks directly to the hardware on the node and uses its resources to run the workload you're trying to run. Now, when you set a request and a limit, what we're basically asking the Kubernetes API server is: give me at least Y, the request, and at most X, the limit. If the scheduler can find a node with free resources greater than the request, it will schedule the pod on that node. If your pod asks for more while it's running and the limit allows it, it can get more, possibly restarting the pod to start with more resources; and if it can't get any more resources, it will fail. So if you've ever run a Java workload on Kubernetes, you've probably seen the dreaded OOMKilled on your pod: it asked for more RAM than it could have, the Kubernetes side said, I can no longer give you any more RAM, and the pod gets out-of-memory killed. Now let's take a step back and think about what it means to say something like 0.25 CPU. You can't really go to Best Buy or some hardware store and say, I want a laptop with 11.5 CPUs. That doesn't exist, right? So in the world of Kubernetes, when you ask for a fractional CPU, what that basically means is that the scheduler and the container runtime, using cgroups and namespaces, will stop that particular process from using more than 0.25 CPUs' worth of cycles every second. It is not carving out a quarter of a physical CPU to hand to the process. The process still runs on the entire node; it's just a software-level limit that stops the process from using more CPU cycles every second. So let's try to see that in action, and hopefully that will work. Okay, ooh, reconnect. It is not doing it, fantastic. Welcome to the live demo segment. Okay, so the code we're running here is fairly straightforward. It's written in Go, and the main idea is that I run a goroutine, a thread, that does a bunch of calculations. It just does a multiplication, and every time it does, I count how many times it has run. If everything goes well, at the end of this program, which runs for about ten seconds, I get a printout that says the calculation ran for ten seconds and did that many calculations. And what I have here is a Kubernetes, oh boy, okay, a Kubernetes cluster made up of a bunch of node pools, four different ones. They're e2-standard machines, which are general purpose: two CPUs on this one, eight on this one, and 32 on this one. Okay, so I'm going to run the workload, kubectl, oh, first I have to show the workload: cat job1.yaml. So the job says: I have a nodeSelector that selects the particular node I'm targeting, I've built the Go application into a container image, and I'm giving each job a request of 500 millicores of CPU. And it's the same exact job definition targeting the different node types. Okay, don't do live demos, people. You know what, I'm going to abandon that particular demo real quick. Don't worry, there's another one. I'll come back to that demo in a second and continue with the other part first.
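The demo's source wasn't shown in full, but from the description, a reconstruction might look something like this. Treat the structure and numbers as assumptions.

```go
// Spin one busy goroutine per visible CPU, multiply for ten seconds, and
// count iterations. Under a 500m cgroup limit, NumCPU still reports every
// core on the node, which is exactly the point of the demo.
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

func main() {
	fmt.Println("visible CPUs:", runtime.NumCPU())
	var count atomic.Int64
	done := make(chan struct{})
	for i := 0; i < runtime.NumCPU(); i++ {
		go func() {
			x := 1.000001
			for {
				select {
				case <-done:
					return
				default:
					x *= 1.000001 // the multiplication being counted
					count.Add(1)
				}
			}
		}()
	}
	time.Sleep(10 * time.Second)
	close(done)
	fmt.Printf("ran for 10s, did %d calculations\n", count.Load())
}
```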
So basically what's supposed to happen, and what I was going to show, is that when you run a workload on Kubernetes and it asks how many CPUs there are on that particular node, it will actually see all the CPUs that exist on that hardware, even if you said to give it only 0.1 or 0.5 CPU, because it's running as a process on the node itself. So what ends up happening is that if I have a Go program that says, spin up as many goroutines as there are CPUs on this node, it will try to spin up that many goroutines. But because there's a cgroup limit that says, don't give this particular process more than 0.5 CPU, the kernel will constantly swap your work out, every half cycle. So even though you see all 32 CPUs as available, your workload does not actually get all 32 CPUs; it constantly gets swapped out. So if you have a workload that requires constant CPU, you probably want to give it the full node. For really CPU-intensive workloads, although a single node can run multiple pods, you probably want to get into the world of running a single pod per node, where the pod gets the whole hardware underneath. Okay, so we'll come back and see what happened to the demo in a second. But CPUs are pretty well understood. Kubernetes understands CPUs from the get-go, because that's what Kubernetes' own processes run on. But GPUs, TPUs, and all the other PUs that may exist are not something that's available on every single node, so Kubernetes core does not, and should not, care about these other PU types. This is what's known as being out of tree for Kubernetes: if you go to the Kubernetes source code and search for GPU or TPU, you're not going to find anything. Now, how are these things found and used in Kubernetes? The scheduler needs to know which nodes have these processing units, the container runtime needs to be able to talk to this hardware, and finally the kubelet needs to know about these, quote unquote, general devices. If you want to learn more about how exactly this works, there's a great talk from last KubeCon from David (I forget the other speaker's name); the link is here, you can go take a look, and they go into a lot more detail. I think they're giving a second talk at this conference, going even further into devices. But in a nutshell, what basically happens is this: you have the GPU driver at the bottom here, and that GPU driver talks to something called a device plugin. These device plugins are something the hardware provider writes. The device plugin talks to the kubelet and registers itself as a named entity. In the case of an NVIDIA GPU driver, the name could be something like nvidia.com/gpu. An AMD device would have a different name, but most cases these days are NVIDIA. The API server now knows there's a certain device called nvidia.com/gpu. When a new workload comes to the API server asking for that resource, the API server tells the scheduler: okay, I know of a node that has this nvidia.com/gpu available. The scheduler then creates the pod on that node, and the pod talks to the container runtime and runc to create the containers on the node. Those containers talk to the hardware directly, because they're just processes running on that particular VM.
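As extra context on how that registration happens in practice: device plugins typically ship as a DaemonSet, so one copy runs on every accelerator node and registers with the kubelet over a socket. This is a trimmed sketch modeled loosely on the NVIDIA device plugin; the image tag and details are illustrative.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      containers:
      - name: nvidia-device-plugin
        image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0   # example tag
        volumeMounts:
        - name: device-plugin
          # the kubelet's plugin registration socket lives in this directory
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```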
So a lot of people have concerns about using containers and Kubernetes for AI/ML workloads, because they think they might lose some performance by going through some sort of virtualization. In reality, there shouldn't be any performance difference between using a container and talking to these devices directly, because containers are just processes that talk directly to the hardware. And the example we saw for CPUs is the very same for GPUs and TPUs. You have resources that you set with the named entity, for example nvidia.com/gpu for NVIDIA GPUs, and the number of GPUs you want; in this case, I want eight GPUs. You also have the nodeSelector, which selects a particular node. This helps the scheduler figure out which nodes it knows to have the GPUs we're looking for. And these labels are self-created: they're something either your cloud provider attaches to the node pool, or something you add yourself if you know a particular node has a particular GPU. The same goes for TPUs; the example is very similar, with google.com/tpu as the name the TPU driver registers itself with through the kubelet, and a nodeSelector doing the same thing. So, the second part of the demo, which, okay, great, I have to go back. I have a YAML definition of a TPU job that runs a fine-tuning workload on a large language model. I'm going to try to kick this job off, if it wants to work with me this time. But even if it doesn't, if you want to see how it works, come by the Google booth; we actually have a couple of demos of that same workload running on screen, and we can show you how it works underneath. The other workload I'm running here is an inference workload, serving a large language model you can talk to. You've seen examples of talking to Gemini or ChatGPT; we have a very similar example running. This is running an open model, and I'm just going to reload this and ask it a question. It should probably know what Kubernetes is. It's KubeCon, so if it doesn't know, we should teach it. So I asked it, what is Kubernetes, and this is a very small model compared to the bigger ones, a 2-billion-parameter model, and it came back with something. Now I'm going to ask it something fun: write a poem about Kubernetes for its 10-year anniversary, and it's going to come back. So the way this works is that I have a pod running on my cluster, hopefully it comes back. Okay, I can't tell if the poem is good or bad, because I can't read all that quickly, but it looks to be rhyming for the most part, so I'm going to guess it's good. What's happening here: I switch context with kubectx and run kubectl get po, and I have this one pod, hopefully it comes back, running in this cluster, a TGI Gemma deployment that has a GPU attached, which loads the entire model into GPU memory, and then a Gradio service that we talk to to send prompts and get responses back. So in terms of talking to this particular GPU-bound workload, the actual process is fairly similar to what we're already used to: very much the same way we talk to CPU- and memory-bound workloads.
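For reference, here's roughly the shape of the GPU spec described a moment ago. The label key is a GKE-style example and the image is hypothetical; your provider's labels will differ.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100   # provider-attached label (example)
  containers:
  - name: trainer
    image: example.com/training-image:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 8   # the named entity registered by the device plugin
```

For TPUs the pattern is the same, with google.com/tpu in place of nvidia.com/gpu and a TPU-specific node selector.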
With that, we're going to move on to... yes, sorry that the demo got away from you this time; that seems to be the way with live demos, right? But in general, the CPU's task is to run the code you write: the CPU understands what you've written, writes to memory, and works through it with a lot of transactions. With a GPU, you can play a game with smooth graphics and avatar generation, and all of that is really wonderful for generating those images; repurposing that concept for machine learning makes the process a lot faster. Machine learning that would take days to process on CPUs can go down to seconds with this, as you can see. Adding a TPU is more of a specific use case, with specific software, TensorFlow libraries and so on, that can run in parallel, so processing those tasks gets a lot faster still. So, reasons you should use Kubernetes for AI/ML. Kubernetes of course generally uses containers, and containers are language- and framework-agnostic; they're just a process running on a machine, so whichever framework and language you use is fine. Containers generally, and also Kubernetes, are open source, which means they can run on a wide variety of hardware and environments. And Kubernetes is meant to manage many machines, so when you have AI workloads that sometimes need a whole lot of resources and sometimes don't (they're very bursty), it's a really good way to make sure you're using your resources efficiently. Yeah, we have so many varieties of hardware out there that can be used for very specific use cases, and a lot of the time they can run parallel tasks. And on hardware optimization: you're already running machine learning processes, and that costs time and money, so the right hardware can reduce cost, improve time efficiency, and get those products out to the market faster. That's what we all want: to get everything faster and get the fast answer out there. So, a couple of tips and tricks for using CPUs. For critical batch workloads, as I said before, having the process swapped out can often be very costly, so if you're running a lot of batch workloads, you might want to run one pod per node; we have customers that run clusters of 5,000 or even 15,000 nodes on GKE this way. Consider using CPUs with higher clock speeds for more demanding workloads: the most basic CPUs you can get from your cloud provider are probably the cheaper ones, and they do pretty well for most kinds of workloads, but most cloud providers, and even your own data center, can offer more expensive, higher-clocked CPUs. And finally, a number of ML workloads can offload work to the CPU and RAM, so if you're running a GPU or TPU workload with a lot of memory and you give it too little CPU, some libraries that want to offload part of the work won't be able to; keep that in mind. And a couple of things for GPUs and TPUs. Label your nodes appropriately so you can find them easily, and use taints to stop non-GPU/TPU workloads from being scheduled onto GPU/TPU hardware, because if you're using up that resource for non-ML workloads, you're just wasting resources at that point (see the sketch below). If you're using a multi-node setup, try to keep the nodes close together: if it's your own data center, schedule them appropriately, and if you're on a cloud provider, use their semantics to find out how to get them closer. And finally, for sharing GPUs, there are features like Multi-Instance GPU (MIG), time-slicing, and the Multi-Process Service (MPS) that let the same GPU serve more workloads.
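Here's the labels-and-taints tip as a sketch. The taint key, label, and image are illustrative choices, not prescribed names; some providers apply a taint like this to GPU node pools automatically.

```yaml
# First, taint the GPU nodes so ordinary pods stay off them, e.g.:
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
# Then only pods that tolerate the taint (and actually want a GPU) land there:
apiVersion: v1
kind: Pod
metadata:
  name: needs-gpu
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  nodeSelector:
    accelerator: nvidia              # self-created label, as mentioned above
  containers:
  - name: app
    image: example.com/ml-image:latest   # hypothetical
    resources:
      limits:
        nvidia.com/gpu: 1
```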
So that leads us to, I know we have a short amount of time left, but I'll try to condense this slide a little bit, the sustainability portion of things. I'm sure you've seen ARM out in the laptops being used these days, the Apple M1 and M2, and we've got Google Chromebooks as well as server-side ARM, and there's a lot of benefit there, with our batteries lasting a lot longer. All of the cloud providers, Amazon, Google, and others, are working really hard to reduce that footprint and go a little more carbon neutral, so they're adding it to the pillars of their well-architected frameworks and various things like that. Google, I know, lets you pick a specific region in the UI based on where it's more carbon neutral, which is really nice. And then, this is kind of a lot of information here, but we'll probably have more and more PUs coming in the future, so this slide is just emphasizing that; feel free to check it out later. So, in review. This was a lot of information, and I know you probably weren't able to take all of it in, so here are a few things to take away. I hope you learned that containers essentially are just cgroups and namespaces, and basically just processes on your machine. CPUs are great multi-purpose processors, while GPUs make processing power even stronger through parallelization. TPUs both parallelize and load all of their instructions in at once, which makes them faster because they're not going to memory as often. And through the power of device plugins, all of these accelerators can work with Kubernetes. Kubernetes doesn't add anything in between; it's just enabling you to make use of what's there with the applications you're running. So with that, that's the end of the talk. Of course, provide your feedback if you'd like; it's on the same page where you found the talk. And with that, do we have any time for questions? Not really, but there's nothing after this, so we're going to be hanging out here and outside if you have any more questions about specific topics. Thank you for coming to this talk. Thank you.