So, I'm Freddie Roland, and with me is Adrian Kiris, who is a software engineer at NVIDIA, part of the cloud orchestration team in the networking business unit. Our day-to-day work is to enable networking technologies in Kubernetes. Today we'll talk about dynamic resource allocation, also known as DRA. DRA is a new API for requesting resources in Kubernetes.

Okay, so let's take a look at the agenda. First, we'll go over the different resources available for your workload and how you actually request them. Then we'll talk about device plugins, how they work, and what their limitations are. Then we'll go over DRA and its main APIs. After that, we'll do a deep dive into the DRA driver flows, and we'll also go over the steps you need to take in order to build your own DRA driver. Lastly, we'll go over CDI, the Container Device Interface, which is the part of the container runtime that is required by DRA drivers.

Okay, let's start. So, Kubernetes is all about running workloads inside containers, right? But not every workload has the same requirements. For example, if you have a CNF application, like a router or a firewall, you will have some very specific networking requirements. Or if you are using DPDK for this application, you will need hugepages. And in AI, for example, GPUs are required, both for training and inference. In training, we may need multiple GPUs across multiple nodes, and maybe some fast networking so they can share data efficiently between them, maybe using GPUDirect over RDMA.

So what are the resources we can allocate to our workloads? First, we have the regular ones: CPU, memory, hugepages. Then we have storage-related resources. And finally, we have the device plugin resources, for example nvidia.com/gpu.

Okay, so where do we see these resources? In the node status, we actually have two sections. The first one is capacity, the second one is allocatable. Capacity is the whole pool of resources that we have on this specific node, and allocatable is the portion of that available for scheduling workloads. The kubelet is in charge of reporting the node status, so it is also in charge of reporting the available resources. In the first part you see what we can call the built-in resources, like CPU, hugepages, and memory. And in the second part, we have some examples of device plugin resources.

Okay, next. Here is an example of allocating CPU, memory, and hugepages. Under the spec of your pod, under each container, you have two sections: requests and limits. The scheduler will look at the requests part and search for a node that has enough resources to actually satisfy this request, and according to that it will decide where this pod will eventually be scheduled.

In storage, we have several options. First, we have ephemeral storage; some call it scratch space. For example, if you want to download some large file or keep some state in local files, you can use this one. But you need to understand that it is not persistent: if your pod is restarted, all your data will be lost. For persistent storage, we have a few options. The first one is what we call the in-tree volume plugins. In this example, we have an NFS mount: you just specify the NFS server and all the needed parameters, and you get the mount inside your pod.
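To make that concrete, here is a minimal sketch of a pod along the lines just described: built-in resources under requests and limits, plus an in-tree NFS volume. The image name, server address, and export path are placeholders, not values from the talk:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cnf-app
spec:
  containers:
  - name: app
    image: registry.example.com/cnf-app:latest  # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
        hugepages-2Mi: 1Gi     # hugepages also require an equal limit
      limits:
        cpu: "2"
        memory: 4Gi
        hugepages-2Mi: 1Gi
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    nfs:                       # in-tree NFS volume plugin
      server: 10.0.0.5         # hypothetical NFS server
      path: /exports/share
```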
So why is it called in-tree? It is because the implementation of these volume plugins is part of the Kubernetes core code. And that was actually not very convenient for storage vendors, because having this code inside the Kubernetes code base meant they were tightly coupled to the Kubernetes release cadence. So if they had a bug fix or wanted to release a new feature, they needed to wait for the next release.

So as an evolution of the in-tree volume plugins, we got CSI, the Container Storage Interface. It actually gave storage vendors full freedom to implement and release at their own cadence, so they can fix bugs and add features. They just need to implement the API defined by CSI.

So what do we have in CSI? We have a storage class. In the storage class, you have a name, and you have the CSI driver that will eventually provision and expose the volume to your pod. In addition, you can have a bunch of parameters. These parameters are free-form, meaning you can put whatever you want there, but they are very limited in their structure, because they are just a string-to-string key-value map.

Next, we have the persistent volume claim. In the claim, you specify some parameters, for example access mode and size. And most importantly, you can also specify the storage class name, which actually determines which provisioner will eventually provision your volume. DRA, dynamic resource allocation, takes its main approach from this API: it takes the main idea of the storage class and the claim, and extends it to any resource, not only storage.

OK, so how do you actually request the volume inside the pod? You have a volumes section under the spec, and there you say which PVC you want to have in your workload. In this case, the PVC was already created beforehand.

OK, next. The next resource type we have is the device plugin. So why do we need device plugins? Sometimes, as you know, you have specialized hardware. For example, here we have a BlueField-3 DPU, an A100 GPU, and a ConnectX-6 NIC, and we want to be able to utilize this hardware inside your workload. And like we saw, Kubernetes does not support specialized hardware out of the box; there's only a limited set of resources it is aware of. So here comes the device plugin, to help us actually utilize these resources.

So how does it work? A device plugin is a kubelet plugin, meaning it runs on the node. It will first register itself with the kubelet and say, OK, this is the resource that I'm handling. Then it exposes a gRPC interface to the kubelet. The most important method here is ListAndWatch: the kubelet asks the plugin to give it a list of the available resources, and since it is a streaming API, if there is a change in the status, the device plugin can update the kubelet with the change. The second important one is Allocate. Allocate is called by the kubelet just before creating the pod, and the device plugin gives the kubelet a list of instructions to be passed on to the container runtime, describing exactly what is needed to access this resource.

OK, so as I mentioned, these resources are also visible in the node status. Here we have two examples: the first one is a GPU, the second one is an SR-IOV VF resource.
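As a sketch of what that looks like in the node status — the counts and the SR-IOV resource name here are illustrative, not from the talk:

```yaml
status:
  capacity:
    cpu: "64"
    memory: 256Gi
    hugepages-2Mi: 4Gi
    nvidia.com/gpu: "8"         # device plugin resource
    nvidia.com/sriov_vf: "16"   # illustrative SR-IOV VF resource name
  allocatable:
    cpu: "63"                   # capacity minus system reservations
    memory: 250Gi
    hugepages-2Mi: 4Gi
    nvidia.com/gpu: "8"
    nvidia.com/sriov_vf: "16"
```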
And how do you actually request them inside your pod? Under resources, you have requests, and the resource name goes like domain slash name of the resource. So here we are requesting one GPU and one SR-IOV VF resource. As you can see, this interface is, let's call it, countable: it's just a number.

So what are the issues with the device plugin framework? First of all, you cannot have shared resources. Let's say, for example, you have a GPU that is able to serve different workloads at the same time. Using device plugins, you cannot do that. Why is that? Because resources don't have a name; it's just a number. So if you would like to request a specific one, or reuse one that has already been allocated, you don't have the possibility to do that.

The second point is unlimited resources. If you are familiar, for example, with KubeVirt, which runs VMs inside Kubernetes, they have a device plugin for KVM, and it advertises a count of 1,000. That really doesn't make sense, because KVM is not a limited resource; it's just a capability of the CPU. But since they want to use the other things that come with the device plugin framework, they still need to advertise a count. So it's kind of a hack; the number doesn't actually have any meaning.

The last one: you don't have the possibility to do advanced configuration. Let's say, for example, that you have two GPUs and you want a different configuration on each of them. The device plugin framework doesn't give you the possibility to do that; everything will be configured the same.

So here comes DRA, to actually answer all of the issues we just mentioned. So what is DRA? It is a new way of requesting resources in Kubernetes. It started in Kubernetes 1.26. You will need a container runtime that supports CDI, the Container Device Interface; you can see here the versions of containerd and CRI-O that already have this support. It is still in alpha, meaning that if you want to try it out, you will need to enable a feature gate. The idea behind it is to offer an alternative to the device plugin framework we mentioned earlier. And similar to CSI, the idea is to give full control to the vendors. Like we mentioned, storage vendors can now release at their own cadence; we want to do the same for other resources.

And it actually takes the same approach. So if you remember, we had a storage class; now we have a resource class. And where we had a persistent volume claim, now we have a resource claim. So the idea is similar. But in addition, we have some things that are a little better. For each resource class, you can have a CRD, defined by the vendor, that acts as the class parameters. So if you remember the string map in the storage class, now the vendor of the DRA driver has full freedom to put whatever they want in the parameters; it can be much more complex than what we had before. The resource claim has the same thing: it can point to a vendor-defined CRD with rich parameters for each claim. We also have a resource claim template, which we'll explain in the following slides.

OK, so first of all, how does the spec of the pod change? As an end user, this is the most important thing you need to know. It's a little bit more verbose, but keep in mind that it gives us a lot more flexibility when requesting these resources. On the left, we have the device plugin configuration with the count we mentioned earlier — we want two GPUs — something like the snippet below.
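A minimal sketch of that old, countable style (the image tag is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: old-way
spec:
  containers:
  - name: ctr
    image: nvidia/cuda:12.2.0-base-ubuntu22.04  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 2   # just a count; devices have no names,
                            # no sharing, no per-device configuration
```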
With the new way, on the right, you have a new section under resources called claims, where you give a list of names of the resource claims that you want to use. Then you also have a new pod-level section called resourceClaims, where you configure, for each claim you want to use, what its source is. In this example, it is a resource claim template, which is configured on the right. Each time we reference this resource claim template, a new resource claim is created with the spec defined in the template. So the idea is that every time you refer to a resource claim template, a new resource claim is created; it does not reuse an existing one. And lastly, we can see that in the spec we have a reference to the resource class.

OK, let's take a look at the resource class. First of all, all the examples here are from an existing DRA driver, a Kubernetes DRA driver for GPUs that has been implemented by Kevin Klues from NVIDIA. He also did a great talk about it with Alexey Fomenko from Intel at the last KubeCon; you can check it out, we'll give a link at the end. So the resource class defines, first of all, a name for the resource, and then the DRA driver that will actually be bound to this resource. Same as the storage class, it is created by the cluster admin.

OK, next. We mentioned that we also have the possibility to have parameters for the resource class. How do we do that? We just configure a reference, in the form of API group, kind, and name, to a CRD that the DRA driver implements. And then you can have vendor-specific parameters; in this example, whether the GPUs are shareable.

OK, so we have a resource claim template and a resource claim. What is the difference? Like I mentioned earlier, a resource claim template creates a new resource claim each time it is referenced, while a resource claim reference always refers to the exact same object.

All right, so we mentioned that the resource claim can also have parameters, and that gives us a lot of possibilities. In this example, we have a GPU selector on the resource claim, meaning we actually want either a T4 GPU, or a V100 with less than 60 GB of memory. So you can imagine there is a lot of flexibility here: you can request the same type of resource but with a different configuration on each instance.

OK, next: how can we actually share resources between workloads? Here is an example of different containers in the same pod: you just point to the same claim. Since claims now have a name, it's quite easy. We have a claim named gpu, and in the resourceClaims section you define its source. So here you pre-create your resource claim, and then you can refer to it from two different containers in the same pod. And the same goes for sharing between different pods, again using the name of the pre-created resource claim. One thing to mention: the DRA driver implementer needs to mark in the resource claim that the resource is actually shareable; otherwise, the scheduler won't allow this kind of configuration.

So we saw that DRA solves the sharing issue we mentioned, like we just saw. It also solves the unlimited resources issue, because you are not required to expose a count of resources; you can easily implement a DRA driver that doesn't have any limits. And the last one is a lot more flexibility in configuration: each instance of the same resource can easily have its own configuration, as in the sketch below.
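Putting the pieces together, here is a minimal sketch of the new way. The API group gpu.resource.example.com, the class name gpu.example.com, and the GpuClaimParameters kind are illustrative placeholders in the style of the example driver, not exact manifests from the talk; the API version shown is resource.k8s.io/v1alpha2, the alpha version from the Kubernetes 1.27 era:

```yaml
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
metadata:
  name: gpu.example.com
driverName: gpu.resource.example.com    # the DRA driver bound to this class
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: gpu-template
spec:
  spec:
    resourceClassName: gpu.example.com
    parametersRef:                      # optional vendor-defined claim parameters
      apiGroup: gpu.resource.example.com
      kind: GpuClaimParameters          # hypothetical vendor CRD
      name: my-gpu-params
---
apiVersion: v1
kind: Pod
metadata:
  name: new-way
spec:
  containers:
  - name: ctr
    image: nvidia/cuda:12.2.0-base-ubuntu22.04  # placeholder image
    resources:
      claims:
      - name: gpu            # references the pod-level resourceClaims entry
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: gpu-template   # a fresh claim per reference
```

To share one device between containers or pods, you would instead pre-create a ResourceClaim and reference it via resourceClaimName, so every consumer points at the same object.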
So now, Adrian will take us on a deeper dive into the different flows.

All right. Thanks, Freddie, for providing us an overview of DRA. So yeah, we're a bit short on time, but let's try to make it. We'll go through some high-level flows here to understand a bit what happens under the hood with DRA, then we'll see what is required to implement a resource driver and some helpers for that. And then we'll have some time for questions, hopefully.

All right. So what is the anatomy of a DRA resource driver? Essentially, it's composed of two separate but coordinating components: a centralized controller, which runs with high availability, and a node-local kubelet plugin running as a DaemonSet. And we also have a set of CRDs, as Freddie explained. The centralized controller coordinates with the Kubernetes scheduler to decide which nodes an incoming resource claim can be serviced on. It allocates the resource claim once the scheduler picks the node, and it's also in charge of deallocation. The kubelet plugin is essentially in charge of all the node-local operations. It will publish the node-local state to the centralized controller, it will prepare resources when requested by the kubelet — we'll see that later — and it will also handle the unprepare requests. As for the CRDs, each resource driver can define its own: driver-specific resource class parameters, resource claim parameters, and optionally additional CRDs, for example to store global state or per-node state to keep track of allocated resources. And that's it.

In regards to the allocation modes, there are two allocation modes used. One is immediate allocation, which means the allocation happens immediately for a resource claim: once the resource claim is created, the resource driver allocates the resource on a specific node, and any pod which references this claim will get scheduled onto that node. Delayed allocation, also known as wait for first consumer, delays the allocation of a resource claim until a pod references it. At that point, resource availability is considered as part of pod scheduling, in the sense that the pod's requests in their entirety — CPUs, device plugin resources, other claims — are taken into consideration in the scheduling decision. We'll see how this happens.

Let's dig into the immediate flow. The flow is the same at the beginning: the admin deploys the resource driver, the kubelet plugin, and the CRDs, and defines the resource class. A user creates a resource claim for that resource class. At that point, the centralized controller picks it up and proceeds with the allocation of this resource on some node in the cluster. Once it's allocated, it updates the resource claim status with the resource handle. This contains, essentially, a string blob which is passed through the system back to the kubelet plugin of the DRA driver, as well as the node on which the resource was allocated. At that point, the user creates a pod which references that resource claim, and the Kubernetes scheduler kicks in: it inspects the pod, sees that it references a resource claim, and proceeds to schedule the pod onto the node where the resource was allocated. It's a long process, right?
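In the alpha API, the claim itself selects between the two modes; a minimal sketch, again assuming resource.k8s.io/v1alpha2 and the illustrative class name from before:

```yaml
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: my-gpu
spec:
  resourceClassName: gpu.example.com
  allocationMode: Immediate     # allocate as soon as the claim exists
  # allocationMode: WaitForFirstConsumer (the default) delays allocation
  # until a pod referencing the claim is being scheduled
```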
So once the node was selected for the pod, the kubelet picks it up. It will then, again, see that this pod references a resource claim, and it will call the kubelet plugin via gRPC, passing it the claim information. The kubelet plugin performs the preparation needed and returns a set of CDI device identifiers — we'll discuss them at the end — which are then passed to the container runtime, and the container is spun up, exposing the devices.

All right, that was the immediate allocation. Now we'll complete the picture with the delayed allocation. The initial flow is essentially the same: the admin deploys whatever is needed, and the user creates a resource claim. At that point, one thing to note is that the centralized controller does not kick in — again, it's wait for first consumer. The user then creates a pod referencing the resource claim. At that point, the Kubernetes scheduler picks it up: it looks at the pod, looks at the resource claim, and creates an object called PodSchedulingContext. This object is used to coordinate the operation between the different DRA drivers and the Kubernetes scheduler for this pod. The scheduler sets a list of potential nodes — essentially, nodes where the pod may run. On the other hand, the centralized controller reads those potential nodes and tries to narrow down the list by updating this object with a set of unsuitable nodes, a subset of nodes which this pod should not be scheduled on. This operation is repeated for all resource drivers until a scheduling decision is made. Once the scheduling decision is made, the Kubernetes scheduler updates the PodSchedulingContext with a selected node. At that point, the centralized controller picks up the selected node and proceeds with the allocation onto that node, same as in immediate allocation.

So this was a quick rundown of the two allocation modes and how they work with Kubernetes. And now let's discuss, at a high level, how you would write a DRA driver. Essentially, you first need, of course, to define a name for your driver, and to define the CRDs which are to be referenced as the resource class and resource claim parameters; these are custom parameters for your resource, which may be global or per resource allocation. You decide how the controller and the plugin are going to coordinate and communicate: is it per-node CRDs, is it gRPC with some database, a combination of the two? The key concept is that you essentially need to represent the following: the set of available resources in the cluster or on the node, the set of allocated resources, and the set of prepared resources. In addition, you need to provide a default implementation of your resource class to be distributed with your driver, so a user can use it. And then, of course, there's the implementation: the implementation of the controller, and the implementation of the kubelet plugin. Both of them include some boilerplate code — to interact with the Kubernetes APIs in the controller case, or to interact with the kubelet — as well as, of course, the business logic for the two. OK, so this was a long list.
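To illustrate the negotiation described in the delayed-allocation flow above, here is a sketch of what a PodSchedulingContext object might look like mid-negotiation, using the v1alpha2 shape; the node and claim names are made up:

```yaml
apiVersion: resource.k8s.io/v1alpha2
kind: PodSchedulingContext
metadata:
  name: new-way               # same name and namespace as the pod
spec:
  potentialNodes:             # written by the scheduler
  - node-a
  - node-b
  - node-c
  selectedNode: node-a        # set once the scheduler decides
status:
  resourceClaims:
  - name: gpu
    unsuitableNodes:          # written back by the DRA controller
    - node-c
```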
So to help you do that, there are a bunch of packages created by the Kubernetes ecosystem. The first one is the controller package from the dynamic resource allocation helper code in the Kubernetes project, which implements most of the boilerplate code to interact with the Kubernetes DRA API objects. It defines a driver interface which you need to implement, and we'll go over that. Once you implement it, you provide it to the New method, you get a controller, and you just call Run — I'm oversimplifying a bit, but at a high level that's how it works. For the kubelet part, there is an implementation of the registration with the kubelet over gRPC, so the registration is already provided for you. You just need to provide the gRPC implementation of the node server — the gRPC server which prepares and unprepares resources — and again call a Run method there as well. The gRPC API is defined in the kubelet APIs in the Kubernetes project. And that's it for the Kubernetes part. We also have a bunch of CDI helpers here; you can reference them later. Essentially, they will help you create the CDI device specifications to be consumed later by the container runtime. And, I think most importantly, there is the example driver. There's a DRA example driver which is fully functional on top of mock GPUs. You just need a kind cluster to bring it up, and there is a pretty good README with step-by-step instructions on how to run it. There you can inspect the different parts. It serves as a reference implementation which you can take as a reference, fork and extend, or rewrite.

All right, so let's see what the driver interface in the controller looks like. It has a couple of methods; we'll quickly go over them. There are GetClassParameters and GetClaimParameters. Nothing too fancy here: we discussed the vendor-specific CRDs for the class and the claim, and these are the getters for them; they return the specific instance of the vendor's CRD.

There is the Allocate call, which essentially performs the allocation of a resource. Notice the selected-node field: it is empty in the case of immediate allocation, where you need to choose a node on your own, and it has a value in the case of delayed allocation, because of the whole PodSchedulingContext flow we went through. Essentially, you get the claim, the claim parameters, the resource class, and the resource class parameters, and you need to return an allocation result. This struct will eventually contain that string blob with the information about the allocated resource, as well as the node where the resource is available.

The Deallocate call essentially deallocates the resource. It's called when the resource claim is deleted, and it should free the resources that were created for this claim.

UnsuitableNodes gets called during the wait-for-first-consumer flow, where we need to negotiate with the scheduler about which nodes we can be scheduled on. It accepts the potential nodes, and it needs to update, in the passed-in claim allocation objects, the unsuitable nodes for each claim — again, as we discussed before. So you update the struct with the nodes you don't want to be scheduled on.

For the node part, there are NodePrepareResource and NodeUnprepareResource. These, again, run on each node, in the kubelet plugin. NodePrepareResource will prepare the resource: it will generate the CDI device specification and return the CDI device IDs.
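We'll properly introduce CDI in a moment; for now, here is a sketch of the kind of CDI device specification a plugin might write. CDI spec files usually live as JSON under /etc/cdi or /var/run/cdi; the YAML form is shown here for readability, and the vendor, class, and paths are made up:

```yaml
cdiVersion: "0.5.0"
kind: example.com/gpu              # vendor.com/class
devices:
- name: gpu0                       # full device ID: example.com/gpu=gpu0
  containerEdits:
    deviceNodes:
    - path: /dev/gpu0              # char device exposed to the container
    env:
    - EXAMPLE_VISIBLE_DEVICES=gpu0
    mounts:
    - hostPath: /opt/example/lib   # host mount made visible in the container
      containerPath: /opt/example/lib
      options: ["ro"]
```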
A couple of notes on NodePrepareResource: you will get the resource handle in the request — that string blob we talked about earlier. The call must be idempotent, and currently, at least, you have under 10 seconds to finish the call. It gets called when a pod referencing the claim is created. NodeUnprepareResource does the opposite of prepare: it gets called when the pod is deleted, and you need to perform the cleanup for the resource. And again, this call must be idempotent as well.

And let's talk a little bit about what CDI is, which we mentioned a couple of times before. CDI stands for Container Device Interface. It's essentially a JSON-formatted specification which describes how a device should be exposed to a container. It contains information such as the device nodes that need to be exposed, like char devices, environment variables, host mounts, and hooks that need to be run. It's sort of a standardized way to expose devices to containers. It gets consumed by the container runtime, like containerd or CRI-O, to expose the devices to the container. And that's an example of a CDI device specification, like the one shown earlier; as I said, you can dig into it later.

And the next thing is just a list of a couple of resources which we referenced throughout this presentation. It's all here; you can check them later. And with that, I think we are done. 12 seconds to go. Thank you.