Hello, everybody. I'm not sure if you're here for DRA and Kubernetes resources or you just sort of found the room because of the coat check, which is right over there. Everybody getting ready to get going, some dynamic allocation of listeners, maybe; we'll see what happens. I'm Mike Brown. Unfortunately, Alexander could not come; he's stuck over in the EU. If you need to talk to Alexander about the DRA architecture work he's been doing, please contact him; we'll have some links at the end. It feels like days long past, right? Kubernetes still seems very new to me, but it's been around for a number of years, we've done some of these things, and we're ready to do some updates, if you don't mind. I'm a maintainer of containerd, OCI, and the Kubernetes CRI API. We work a lot with the CRI-O team; we make sure we do device management support in at least a compatible way, so that when your schedulers or whatnot throw things over to a container runtime.
It doesn't matter whether it lands on us or on the CRI-O team; that's pretty much where we're at. Again, if you know CSI, device plugins, and that kind of stuff: there have been modifications to support CDI, but for the most part the resource management inside of Kubernetes has been consistent. If you want a GPU, you say "I want GPU number one." And if another pod also needs GPU number one, you're stuck waiting, or you have to scale out and go to another node where GPU number one is available. So this demand on accelerator devices is because of AI and other things going on, such as confidential containers. There's this expectation that you're going to be able to get an isolated set of memory and an isolated set of CPUs that are secure in confidential containers, and then you have the opposite when you're doing inferencing, it's all in one company, and you want to share the GPUs available on that host across all these pods. But if you don't use something like batching technology, you might get stuck, because your pods aren't going to be able to run on that host without a lot more resources. And because you're having to pull those additional resources in, there are power management problems. We want to be green; we want to share these resources. If you only want a slice of the GPU for a small period of time, we should be able to share it. Those are some of the goals here, along with enclaves: we want to be able to isolate certain tasks in a VM, for example. There are big hardware changes coming. You've probably heard about performance cores, physical core groups, L2 caches, L3 caches, CXL caches, new techniques for piping memory around inside a host, and these hosts can be scaled vertically or horizontally. The new memory modules, like SNC, support RAID 0 style striping, so you can actually do slicing and get much faster memory.
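That "GPU number one" style of allocation is what the classic device-plugin API gives you today: a pod asks for a count of an opaque extended resource and gets exclusive use of whole devices, with no way to express attributes or sharing. Roughly (image name is a made-up placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-job
spec:
  containers:
  - name: main
    image: registry.example.com/cuda-app:latest  # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1  # whole device, exclusive; no memory/attribute selection
```

If another pod also wants that device, it waits or lands on another node, which is exactly the scale-out problem just described.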
Okay, a lot of things that we've had in the mainframes for a long time are now coming to PC architectures, stacks, and racks. If you're interested in CXL or anything like this, please come up after this discussion. So how do we deal with this new, complex world where sometimes I want to isolate more, sometimes I need to share, and I don't have enough abstractions at the Kubernetes API level to really be able to express this? We need a topology-aware manager, a topology-aware CRD if you will; we need new ideas for how to tell these schedulers and accelerators how to work in this new world. It's not vanilla Kubernetes anymore, right? You're not just running Kubernetes with all resources available to do it. The other part down here at the bottom, where we're adding GPUs that have 32 gigabytes of memory: the same kind of thing applies. LLMs require many, many gigabytes of memory. It's not enough to say "I want a GPU"; you need to be able to say "I need a GPU that has 128 gigabytes of RAM, please." And if you don't have that ability, if it's only set up to run a web service on a host, then it's not quite expressive enough. You don't really have a way to specify this in the Kubernetes API, at least not without writing your own custom controllers. So this is the set of features we're currently working on. We've got a DRA driver, and DRA technology that went into alpha, and it looks like it might get reverted; we'll see, hopefully it won't. But we're looking for people to help us integrate this DRA functionality with some of the autoscalers, like cluster autoscaler: if it decides that it wants to open up another node, we need to tell it
We need to tell them that Before a priori that you know we want to we want to run more resources because we're only using 10% of The other resources that are needed for that particular class of application and I'm just bought a new word class of application we'd like to be able to obfuscate if you will a little bit the on the pod spec requests that you want to Be able to do you want a certain class of resources that has been predefined for this particular application type and In that that gets somehow translated Down at the bottom layers to the devices and the configurations for the devices that integrate with each other At the bottom end when he actually runs the containers in the pods, okay? I've been working on this thing called NRI for a while with the I need a drink so NRI plug-ins is is a common Device API and or Damian API that we're hosting in the container run times That allows somebody like NVIDIA, you know or AMD or Intel to to walk in or any other type of Networking devices to walk in and say when these class of application runs on on this host I Want you to allocate this network device. I want you to allocate a slice of a GPU with a certain amount of memory or it's confidential containers I want to be able to allocate a New layer that's isolated Across you know with certain amount of memory that's attached and isolated from other tasks in the system Okay, and NRI plug-ins allows us allows these device manufacturers to do that So we're pretty excited about it Okay It utilizes all kinds of technology when you create a container. 
There's a certain sequence of steps, you know: create, start, run, stop, all the CRUD operations. At each point in time we have access to all the container's OCI specification modifications, the mounts being created by Kubernetes; everything is there, and we can pass that over to these plugins so that they can augment, enhance, or modify the OCI specification before you start the container, or after it starts, or after it stops. When a container stops, the plugin knows: okay, this guy is no longer using that percentage of the resource, and because it's a class, it knows how to deal with the other containers, the other processes, the other pods that were sharing that resource. So if somebody was only getting 10% but wanted a quality of service of 15%, now that we're no longer splitting the resource ten ways, we can go ahead and give that other consumer their 15%. And this resource manager device plugin will be able to integrate with the node resource topology manager sitting up there in the API server and send notifications about what's changed, so that it can tell the autoscaler: hey, I don't need another node, because utilization on this one just went down. You've probably heard about plugin events; this is like those events on steroids, because this device manager plugin is using ttRPC, we're directly engaged with the manager, and it will know exactly what happened. The task took an OOM? Immediately, because we get the exit code and all that sort of thing, the device manager will know, and it will be able to go ahead and get rid of all that memory, wipe it, because it was in a secure enclave. Okay: DRA. What is possible with this device plugin model now?
Containers that are in different pods can share the same device. Nobody's excited? Come on, come on: one GPU being shared by two containers, one in pod A and one in pod B, with the same class. So you can now have multiple pods, or a group of pods running a scheduled set of transactions or whatnot, that all access the same device, be it a memory device, be it a CUDA device; AMD supports CUDA on PyTorch. As long as the devices understand the class and they've got a device manager plugin running as an NRI plugin, they'll have this immediate notification and be able to quickly update and allocate the resources dynamically for these containers. And yes, it's in containerd 1.7; we shipped it a year ago, so it's been around, and it's big. When we shipped it in containerd 1.7 we did highlight it as experimental, and it has been modified quite a few times since then, but it goes GA in containerd 2.0, probably in March, maybe February; we're not sure, I think February 14th is the date currently scheduled for containerd 2.0 GA. Just a couple of days ago we did go ahead and make containerd 2.0 beta available, so this is available today. Additionally, CRI-O has the same device model, the same NRI plugin model; we're sharing it together. If anyone wants to come up and ask about it, I'll explain some of the history we've had with NRI and runc plugins and things like that. I've been working with Mrunal Patel on this particular area of the specification, runc hooks for example; probably seven years ago we were sitting around... okay, never mind, I'll tell you later. A little bit more on NRI: this is the high-level architecture, which we've talked about a little. There's a config for the plugins that get loaded in the container runtimes on init.
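The plugin flow described here, a daemon notified at container lifecycle events that returns adjustments to the OCI spec, can be sketched in Go. The types below are simplified stand-ins for the real API in github.com/containerd/nri, and the "gpu-class" label and device path are made up for illustration:

```go
package main

import "fmt"

// Simplified stand-ins for the real NRI types; the actual plugin API in
// github.com/containerd/nri is considerably richer. These structs only
// model the idea of a lifecycle hook returning an adjustment.
type Container struct {
	Name   string
	Labels map[string]string
}

type Adjustment struct {
	Env     []string // environment variables to add, "KEY=value"
	Devices []string // device node paths to inject into the OCI spec
}

// CreateContainer models an NRI create-container hook: when a container
// belongs to a (hypothetical) "gpu-class", the plugin injects a device
// node and an env var before the runtime creates the container.
func CreateContainer(c *Container) *Adjustment {
	class, ok := c.Labels["gpu-class"]
	if !ok {
		return nil // container is not in our class; leave it untouched
	}
	return &Adjustment{
		Env:     []string{"GPU_CLASS=" + class},
		Devices: []string{"/dev/gpu-slice0"}, // hypothetical shared GPU slice
	}
}

func main() {
	c := &Container{
		Name:   "inference",
		Labels: map[string]string{"gpu-class": "shared-small"},
	}
	if adj := CreateContainer(c); adj != nil {
		fmt.Println(adj.Env[0], adj.Devices[0]) // prints: GPU_CLASS=shared-small /dev/gpu-slice0
	}
}
```

A real NRI plugin registers with the runtime over a Unix socket and implements hooks such as create, start, and stop; the runtime merges the returned adjustments into the container's OCI spec before starting it.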
We only check once for new plugins, but you can always restart containerd, and when you restart containerd it automatically reconnects to the kubelet. So if you had to change a device plugin, you could do that without actually bringing any of your pods down, which should work fine. So, the design here: again, you can modify the OCI specification for each container. Bit of an eye chart here; we're showing a lot of the NRI plugins that are currently available. If y'all want to come up and talk about them afterwards, please do. Let's take a pause: anybody have any immediate questions on NRI or anything we've talked about so far? Raise your hand and walk up to the mic. Okay, we're good. So, QoS class resources. This is something a little bit different from the dynamic resource allocation stuff we've been talking about, and the reason it's different is that quality-of-service class resources are things where we can tell the node resource discovery agent what is available in your system. This is closer to: okay,
I've got a GPU, and that GPU can be published, and then the scheduler can look at what's available on the node and make decisions about whether or not to deploy a pod to a device based on the type of thing that's available. But now this class resource is slightly different again. We talked a little earlier about having a class definition; that class definition could be for dynamic resources, but it could also be for static resources available in the host, and we can make those available and make decisions based on, say, the bandwidth of that device. The bandwidth of the device can now be published up into the node resource topology CRD, where we can publish additional information about the GPU, for example, or the type of memory, how much bandwidth it has, and that can enhance the existing architecture of Kubernetes. When we talk about things that might be a little iffy, this isn't one of them, I don't think. But it's very, very important to enhance these class resources and have a slightly cleaner API at the pod level. The architecture is pretty clean: the runtime NRI plugins can enforce the quality of service. We talked a little about quality of service earlier, and if quality of service is something you want, you should be able to define it with some kind of class specification. That will be the class you ask for, and then it'll flow through the scheduler, and the kubelet will say: hey, let's create this pod with this quality of service. And if we get a reject, that can come back up the standard process, right?
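As a concrete flavor of this, containerd already has experimental support for classes like Intel RDT cache/bandwidth classes, where picking a class for a pod has been prototyped through annotations. The annotation key below follows that experimental containerd convention and is not a stable Kubernetes API; the class name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: latency-sensitive
  annotations:
    # experimental containerd convention for selecting an RDT class;
    # the long-term goal described in this talk is a first-class field
    io.kubernetes.cri.rdt-class: gold
spec:
  containers:
  - name: main
    image: registry.example.com/app:latest  # hypothetical image
```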
Sorry, that quality-of-service class is not available. Although, ideally, we should have been able to integrate with the scheduler earlier and not actually schedule something that wasn't available. But as you know, we have noisy neighbors, right? A noisy neighbor is somebody who is suddenly taking up all the resources: it goes to max memory when we thought we had shared memory headroom, or we've got a hard drive that's filling up because a 20-gigabyte LLM just got pulled down and is being uncompressed and the disk is getting full. So: sorry, for that particular device, our quality of service has gone down. This is our vision of the future, but it's already available in alpha, and we believe it's going to make it. This is a request to get involved. Show of hands: does anybody here work with devices? Okay, maybe about ten. I expect you'll probably be coming up here to talk with Evan or myself about this. If you're interested in the NRI stuff, it goes beyond just resource management. Because NRI hooks in at the OCI CRUD level, if you will, you can actually do additional logging types; you can do forks and tees; you can check for certain things that are happening, for security reasons or performance reasons. And again, because it's a very low-level Unix socket connection without all the gRPC baggage, it's fairly quick, and you'll be able to have a daemon that's just looking for a particular class, or a particular part of the OCI spec, like a volume name or something like that, and then run your extra logging. Okay, but yeah, that's it; we can take questions. We have Evan here; Evan's one of the architects of this. Or I can just repeat the question.
[Audience] Hey, so I came here wondering: we have a constant issue with what we do for integration environments.
We have spiking loads, so we have to allocate more CPU and memory than we actually need at runtime. Would this help us in that case? We have spiking loads, like when a service starts up it just needs a lot of CPU, but only for a couple of seconds, and then it settles.
[Mike] I don't think anybody's really talked about spike cases yet, have they? I don't see why we couldn't include in the topology a definition of spike availability, if the CPU is not already fully allocated by quality of service. I don't see why we couldn't allow one to spike; or we could NACK it.
[Audience] So we'd want a filter, or just some way to say: borrow this for a little bit of time and give it back, so that we can have some resource limits. Because we're running into requesting the limits too high, and we're trying to pack too many things into one cluster. So I'm trying to figure out: is there a solution for that coming down the road?
[Mike] This is actually designed to handle that, I guess, is the appropriate way to say it, as opposed to "no, insufficient resources available; autoscaler, please give me another node," which I believe is the current answer. Yeah, I think this is actually designed to fix that problem.
[Audience] Would the NRI plugin be a way to do CPU set pinning without having to have the guaranteed QoS at the pod level? Because we have issues with setting CPU limits and CFS not playing very well if we introduce a sidecar or anything that breaks the guaranteed QoS there. We'd like to still give you an integer CPU amount.
[Mike] Yes and no. It's not currently designed to support metrics or stats; we haven't hooked in stats or metrics, we just flow them up to the kubelet.
[Audience] Like a CPU set, so that I get one dedicated CPU.
[Mike] Yes, it would. cgroups v2 is also fairly new, and there's an enhancement in runc, in the next version of runc that's going out; I believe once that happens, we should be able to hook that in with another quality-of-service class and the node resource topology manager. And yeah, you could certainly do it with a plugin: you could check whether that CPU set is actually defined in the spec, you could see if somebody asked for it, and then you could NACK it, pull it out, or reject the container if you wanted to, from NRI.
[Audience] Quick question: the resource is actually managed by the operating system, even in the case of a GPU and so on, right? So what guarantees can we provide? When we give users an API saying you can set a limit, say on a GPU, or say that workloads can share, and so on, we can't really provide any guarantees. So what's the story?
[Mike] You want to know how the magic works? A little bit, okay. Down in the guts of this thing, we not only have the ability to talk to the resource managers that have device integration capabilities; we can also run code before, in the init of the container, before we transfer control and run your command. We can also hook in and run code after the container has been created but before it's been started, and the same thing when it fails: we can run code. So because of that, because we can actually run code in the container, we can do lots of fun stuff.
[Audience] But it can also be misused by users who think they are better than CFS, or than what the operating system does, right?
[Mike] Oh yeah. We'll specify for admins and security professionals where these configs are located and how they can be manipulated. By default this is off in containerd, for example; you can enable it with a switch if you have root access. Yeah, pros and cons. Any more? Okay, well, go ahead.
Dr. Max is our quantum expert.
[Audience] You're making my question more difficult. So, is this going to make DRA and Kubernetes useful for AI?
[Mike] Very much so, very, very much. Kevin's not here.
[Audience] Is it going to give the workload on a node full access to the GPU, as if Kubernetes didn't exist? It's been a big problem getting GPUs running at full load, right? You have to allocate GPUs for new nodes even though a GPU is only being run at 5%, for example.
[Mike] Evan can probably answer that question a little bit too; Evan's from NVIDIA.
[Evan] Yeah, hi. So, just to clarify: once the container has access to devices, it's running with access to those devices without interaction with the kubelet, or Kubernetes, or any Kubernetes objects or components, because you inject the device nodes and the libraries into the container, and so it is as if you had started that container yourself. So if your question is whether you'd be able to use the entire GPU without giving some away to Kubernetes, something like that: that's already what happens.
[Audience] The question is whether or not Kubernetes will be in the way, or is it just doing the work to allocate and then getting out?
[Evan] Yes, it's doing exactly that. Depending on the resources you have requested for whatever workload you want to run, the idea is that Kubernetes, the scheduler, depending on what mechanism you're using, makes the decision of which node or set of nodes something needs to land on, and once that happens, your process has direct access to those resources. Once it's started, Kubernetes is no longer involved; it's just a placement and scheduling problem.
[Mike] And to be a little more specific: if another pod wanted to come in and it was requesting the same class, it could join. Yes.
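The "inject the device nodes and the libraries" step described here is what CDI (the Container Device Interface) specs encode. A rough sketch of a CDI spec, with the vendor name, library path, and env var made up for illustration:

```yaml
cdiVersion: "0.5.0"
kind: example.com/gpu
devices:
- name: gpu0
  containerEdits:
    deviceNodes:
    - path: /dev/gpu0                      # device node made visible in the container
    mounts:
    - hostPath: /usr/lib/libgpu.so.1       # hypothetical vendor library
      containerPath: /usr/lib/libgpu.so.1
    env:
    - VENDOR_VISIBLE_DEVICES=gpu0
```

The runtime applies these edits to the OCI spec at create time, which is why, once running, the workload talks to the device with no Kubernetes component in the path.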
[Evan] So if you are using something that allows sharing, then you could allow multiple processes to share the resources. I think the current model is basically exclusive access, unless we allow oversubscription to some extent; but basically, if you're requesting a single GPU, you have access to that GPU. Kubernetes also does the accounting: the kubelet does the accounting in terms of how many of those devices are available, and once one has been allocated, it's no longer allocatable to any other processes or jobs that need access to it.
[Audience] Then the last question: is it any kind of GPU? Meaning, is there a specific connection to CUDA, or any specific interface to a GPU, or can I plug in, let's say, the Mac GPU that I'm working on?
[Mike] Of course, the device manufacturer is going to have to have a device manager to do anything vendor-specific. But today with PyTorch we actually already have a nice little layer where, if you want a GPU, you select "cuda" as a default, for example; and if you wanted to run on AMD Radeons, they support that CUDA layer as well. So once you've requested a GPU for your pod, that would already today be allocatable for both AMD and NVIDIA.
[Audience] But that allocation will be independent of the GPU vendor? My workload will need to know that I requested a Mac GPU.
[Mike] Oh yeah. The great news about all this is that when you're inside the container, inside your code, if you want to say "if mps," you're going to be able to say "if mps," and we as a container runtime won't know anything except the platform that was selected, and that the class was, sort of, a GPU class. You want to add anything? Okay. That's it, we've got six minutes; come on up.