Hello, hello everyone. Let's hope we get more people today. It has been a long time since we last chatted. I hope you can hear us; sorry, my internet is acting up a little bit. No problem. Good morning everyone. Maybe we should start.

All right, welcome everyone to today's CDI meeting. It is the 23rd of February. On the agenda we have the following topics: the logo, the first draft implementation, and a demo. Are there any other items people would like to add to the agenda?

We also have the old topic about non-root containers and usage of devices; we just need to agree on the next steps.

Okay. The logo and the draft implementation I'm basically just going to skip over. Please go look at the slides and give us feedback, either through Slack, as comments on the slides, or as a reply to the email. On the draft implementation there is nothing new; just please take a quick stab at reviewing it. It's not a big implementation. And that's it. Do you want to bring up your topics?

Wait a second. Regarding the first implementation: it's just the CDI part, as a reference. Have you already published it as a Podman pull request, or created something like that?

We have a small part of an implementation that I showed last time, but I didn't make a pull request, because it relies on the CDI pull request actually being merged, just to get the wiring for the import right.

Okay. Next question, regarding the demo: we have almost the same set of people we had last time. Do you know if, let's say, Kevin from your side is going to join?

He unfortunately couldn't join today, but he'll be able to join next time. And I think Renaud can't join these meetings anymore; he has a standing conflict, so we might need to reschedule. I was with Renaud a couple of days ago and he wanted to rejoin the meetings, because it's getting concrete now, with implementations in Podman and CRI-O, so I think next time he will join. But let's see.

Sounds good; that's one more person than last time. I'm wondering: does it make sense to repeat the demo, or just point to the previous recording? What do you think?

I mean, it's okay not to redo the demo; we can always point to the recording, I think it's on YouTube.

Well, I'm just curious how many people are interested in seeing it. Okay, let me ask differently: Zvonko and Rodney, you are the two new people who haven't seen the demo. Do you want to see it live with a description, or watch the recording offline? Both ways work fine for us; I don't want to waste anyone's time.

Both work for me as well; if we have time, I would love to see the demo.

There needs to be a quorum, so if other people are not interested I can take it offline.

I'm interested in seeing the demo too.

Okay, then let's do it. All right, here is how we'll structure it: I will say a few introductory words to explain the basic idea of what we are trying to achieve, and afterwards we will see the demo in two parts: Ed will show how it works for FPGAs, and Ukri is going to show how it works for GPUs.
So, first of all, we are talking about how to use CDI. Right now CDI sits at the level of how we describe devices, in a form that can be consumed by the runtimes and attached to containers. What we want to show is a step further: how it could potentially be used in the Kubernetes world. The main idea is that we want to have CRD objects which describe devices, so that the object is something a pod can consume. Exposing devices as CRD objects gives us a lot of freedom in how devices are allocated, how the device lifecycle is managed, and so on. We have a preliminary idea of how this maps onto a few objects; let me walk through it. First, we have the idea of a device class. This kind of object contains information about the vendor and the set of devices which are compatible: for example, several generations of GPUs can be treated the same, or several FPGA cards might be compatible with multiple functions. So this part contains the common parameters and names the vendor-specific provisioner or controller which handles these types of devices in the cluster. Then we have the device claim; practically, this is how the workload says "I want to use this particular kind of device." Inside that object we want a set of parameters describing exactly what kind of device the user wants. For example, it can say "I want a device of class FPGA or GPU," but it can also provide parameters like "I want this particular amount of memory on the device," or "I want a time slice, if the device supports sharing"; depending on the device there can be a lot of different parameters. Then we have a controller which actually handles the claim, and based on the claim parameters it creates the device allocation object. This device allocation contains the actual information about the device which can be consumed by the workload: practically, it is the part of the JSON file that needs to be created for CDI to be consumed, but also topological information for Kubernetes, for example on which nodes it is available, plus whatever additional information needs to be exposed to Kubernetes. And then it is consumed by the pod: as I mentioned, the CDI JSON file created from the device allocation will be consumed through the standard, proposed CDI interface, by one or multiple containers within the pod. (A rough sketch of these three objects follows below.) To demonstrate this whole concept we didn't want to re-implement everything from scratch. What we realized is that the current Kubernetes CSI components give us all the pieces we can, I would say, misuse to demonstrate the concept. So we are piggybacking on CSI, its controller and node agents: persistent volumes to demonstrate simple device allocation, and ephemeral storage primitives when we want to demonstrate more complex deployments with multiple pods, where you have a template and every pod allocates a similar kind of device for its workload.

With that, I'll hand over to Ed, who can show the actual thing.

Hi everyone, I'm Ed Bartosh and I'm going to continue with the demo. But before the demo, let me share my screen; I would like to show a diagram of the allocation workflow.
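As a rough sketch, the three objects just described might look something like the following as CRDs. Everything here is illustrative: the API group, the kind names, and the fields are invented for this note and are not part of any published spec.

```yaml
# Hypothetical shapes for the proposed objects; all names and fields invented.
apiVersion: cdi.example.org/v1alpha1
kind: DeviceClass                      # "device class": vendor + compatible devices
metadata:
  name: intel-arria10
spec:
  vendor: intel.com
  provisioner: fpga.cdi.example.org    # vendor-specific controller for this class
  parameters:
    interfaceID: "ce48969398f05f33946d560708be108a"
---
apiVersion: cdi.example.org/v1alpha1
kind: DeviceClaim                      # "device claim": what the workload asks for
metadata:
  name: my-accelerator
spec:
  deviceClassName: intel-arria10
  parameters:                          # vendor-specific request parameters
    acceleratorFunction: gzip
---
apiVersion: cdi.example.org/v1alpha1
kind: DeviceAllocation                 # created by the controller from the claim
metadata:
  name: my-accelerator-alloc
spec:
  claimRef: my-accelerator
  nodeName: node-1                     # topology information for the scheduler
  cdiDevice: intel.com/fpga=region0    # what ends up in the CDI JSON on the node
```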
Let's start from the CDI components. There are two main ones: the controller and the node agent. The controller usually runs on the master; the node agents are responsible for collecting device information, so they need access to the devices and therefore run on the nodes. This is actually implemented as a CSI plugin, so it has similar components; as Sasha said, we are piggybacking on the CSI implementation. We didn't touch the control plane components like scheduling and so on; we just implemented it as a CSI driver. In the workflow we have the storage class, which describes the device type, the device family if you wish. Then we have the persistent volume claim, which acts as the device claim: it carries the specific allocation parameters, passed as PVC annotations. When the pod is created it references the PVC, and the PVC references the storage class, so at pod creation time we have all the information: from the storage class, from the PVC, and from the pod, of course. Pod creation triggers two CSI calls. The first is CreateVolume, which goes to the main CSI controller. By that time the controller has information about all the devices in the cluster, because the node agents have reported it, so it has everything it needs to work out which node can satisfy the parameters; the parameters, again, come from the storage class, the PVC, and the pod. It picks the node and contacts the CSI node agent, and the node agent takes the necessary steps to mark the device as allocated. When the pod finishes, the deallocation request travels the same way as allocation, just in the other direction. The next call, also made when pod creation starts, is NodeStageVolume. In the CSI world the aim of this call is to mount the volume on the node, so it can later be bind-mounted into the pod's filesystem when the pod is created; it happens before the pod exists. That call is handled by the CSI node agent. What the agent does is simple: it has the information about the allocated device, and it creates the CDI JSON file on the node filesystem. By the way, we are using the CDI spec for this, at least the part about the devices. This file contains the information about the devices first, and then the parameters, and the parameters can be passed to the container as, for example, environment variables. And then we have the runc wrapper. I actually omitted all the CRI pieces here, because they don't matter in this picture; at the end of the day, runc is called. The wrapper reads this JSON file, reads which devices should be accessible inside the container, and updates the OCI spec: it sets the devices there and sets the environment variables. Then runc, not the wrapper, creates the container; the wrapper exists only to read the CDI JSON and update the spec (a minimal sketch of such a wrapper follows below). runc creates the container, and the container has access to the device and to the environment variables that come from the parameters. That's the picture, at least how I see it. If you have any questions, ask along the way. Next, I will demonstrate how the allocation works with FPGA devices; it's a very simple demo.
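Before the FPGA walkthrough, here is a minimal Go sketch of the wrapper idea just described. This is not the demo's actual code: the CDI file path, the bundle handling, and the location of the real runc binary are all invented assumptions made for the illustration.

```go
// Minimal sketch of a runc wrapper: read a CDI-style JSON file, patch the
// OCI config.json in the bundle, then exec the real runc.
package main

import (
	"encoding/json"
	"os"
	"syscall"

	specs "github.com/opencontainers/runtime-spec/specs-go"
	"golang.org/x/sys/unix"
)

type cdiSpec struct {
	Devices []struct {
		ContainerEdits struct {
			DeviceNodes []struct {
				Path string `json:"path"`
			} `json:"deviceNodes"`
			Env []string `json:"env"`
		} `json:"containerEdits"`
	} `json:"devices"`
}

func patch(oci *specs.Spec, cdi *cdiSpec) {
	if oci.Process == nil {
		oci.Process = &specs.Process{}
	}
	if oci.Linux == nil {
		oci.Linux = &specs.Linux{}
	}
	if oci.Linux.Resources == nil {
		oci.Linux.Resources = &specs.LinuxResources{}
	}
	for _, d := range cdi.Devices {
		// Device parameters travel as environment variables.
		oci.Process.Env = append(oci.Process.Env, d.ContainerEdits.Env...)
		for _, dn := range d.ContainerEdits.DeviceNodes {
			var st unix.Stat_t
			if unix.Stat(dn.Path, &st) != nil {
				continue
			}
			major := int64(unix.Major(uint64(st.Rdev)))
			minor := int64(unix.Minor(uint64(st.Rdev)))
			// Create the node inside the container...
			oci.Linux.Devices = append(oci.Linux.Devices, specs.LinuxDevice{
				Path: dn.Path, Type: "c", Major: major, Minor: minor,
			})
			// ...and allow it in the devices cgroup.
			oci.Linux.Resources.Devices = append(oci.Linux.Resources.Devices,
				specs.LinuxDeviceCgroup{Allow: true, Type: "c",
					Major: &major, Minor: &minor, Access: "rwm"})
		}
	}
}

func main() {
	// Assume the wrapper is invoked from the bundle directory; real runc
	// receives the bundle via the --bundle flag.
	if raw, err := os.ReadFile("config.json"); err == nil {
		var oci specs.Spec
		var cdi cdiSpec
		cdiRaw, cdiErr := os.ReadFile("/etc/cdi/device.json") // assumed path
		if json.Unmarshal(raw, &oci) == nil && cdiErr == nil &&
			json.Unmarshal(cdiRaw, &cdi) == nil {
			patch(&oci, &cdi)
			if out, err := json.Marshal(&oci); err == nil {
				_ = os.WriteFile("config.json", out, 0o644)
			}
		}
	}
	// Hand over to the real runtime (assumed to be renamed to runc.real).
	_ = syscall.Exec("/usr/bin/runc.real", os.Args, os.Environ())
}
```

In a real wrapper the CDI file would be the one written by the node agent during NodeStageVolume, but the shape of the patch is the same.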
The next part will be a more complex demo; this one is as simple as I could come up with, and it's screencasted, so I'll just run the screencast. We have two nodes, with the control plane pods running, and we have the CDI controller and a CDI node agent running; if we had more nodes, more CDI node components would be running here. Then we have two FPGA devices on the node. For the first one, I would like to point out these two parameters. The first is the interface ID: it effectively describes the device class, in this case an Arria 10 FPGA device. The accelerator ID is basically the function which is currently programmed into the device. That can be changed, and the interface ID can also change if you upgrade the firmware and such, but we consider them roughly constant for the current device class. And we have another device, which is basically the same device, but as you can notice, the function is different: it's programmed with a different function. When we say we want a device, what we actually want is some accelerator function, for example gzip or some other compression algorithm, and that is specified by this accelerator ID. We will be using this ID in the PVC. So we expect the first device to be allocated and accessible inside the container, while this other device won't match the parameters, because we want that particular accelerator function. Let's start by creating the storage class. As I said, it describes the device family, in this case the Arria 10 family, and it has the interface ID parameter, which is common to both our devices. We create the class. Next we create the persistent volume claim, which is treated as the device request. Just creating the claim doesn't trigger any device allocation yet, because we don't have a pod. As soon as a pod references this PVC, the volume allocation request, or device allocation request in our case, is created and processed by the CSI machinery. We created the PVC; next we create the pod. Let's look at how the pod references it: since we are using CSI, we have to go through the usual mountPath and volumes plumbing, and the PVC is referenced through the persistentVolumeClaim field. We create the pod, and everything happens as I tried to describe in the picture. Then we can simply check whether the first device, the one with index zero, is accessible inside the container. It is. That's practically the demo. So again, let me switch back to the diagram: we had one set of parameters passed from the storage class parameters, and another parameter, the function ID, passed as an annotation through these calls, and in the end the device was accessible inside the container. That's it; the next demo will be shown by Ukri. Thanks.

Just a small comment: because we are using CSI for this demonstration, some of the parameters are passed as annotations. But the end goal is to use proper CRDs, so vendors can have custom fields which they specify based on their device requests and device needs.
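To make that flow concrete, the objects in the FPGA demo could be reconstructed roughly as below. The provisioner name, the IDs, the device path, and the PVC annotation key are all invented stand-ins; the actual demo used its own names.

```yaml
# Illustrative manifests for the FPGA demo flow; names and IDs are invented.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: arria10                        # the "device family"
provisioner: cdi.intel.com             # the CDI controller posing as a CSI driver
parameters:
  interfaceID: "ce48969398f05f33946d560708be108a"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fpga-claim                     # acts as the device request
  annotations:
    cdi.intel.com/acceleratorID: "d8424dc4a4a3c413f89e433683f9040b"
spec:
  storageClassName: arria10
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Mi                     # size is meaningless here, but CSI requires it
---
apiVersion: v1
kind: Pod
metadata:
  name: fpga-pod
spec:
  containers:
  - name: workload
    image: busybox
    command: ["sh", "-c", "ls /dev/intel-fpga-port.* && sleep 3600"]
    volumeMounts:
    - name: device
      mountPath: /tmp/device           # the mount itself is just a CSI artifact
  volumes:
  - name: device
    persistentVolumeClaim:
      claimName: fpga-claim
```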
Okay, my turn. I hope you can see my screen. This is the same thing I presented a couple of weeks ago: I took the CDI idea into a GPU context. Basically, I was interested in figuring out how we could improve on, let's say, sharing of GPUs and things like that, getting multiple pods running in a bigger deployment and seeing how it works in situations like that. So instead of trying to exactly match a device like Ed showed, I created resources for the GPU, memory and millicores, which can be consumed by pods more or less dynamically until they run out, to see if I could do that in addition to the exact matching of allocations. In this demo we have more nodes, three of them: two Coffee Lake NUCs, small nodes with one GPU each, and a Comet Lake S machine which has two GPUs. You can also see one machine at the top running the CDI controller pod; it could be any machine in the cluster really, typically the master like Ed said, but in this case it's not. If we look at the device files on the nodes over SSH, the cards are here; like I said, the NUCs have only one GPU each, but the Comet Lake machine has two GPU cards. For these GPUs there is no proper memory query or capability discovery, so I just hardcoded those things: I created two resources, memory and millicores, and each GPU gets 1000 millicores and four gigabytes of memory through the hardcoding instead of discovering the capabilities. These values are then just moved into the map of parameters. By the way, if you're interested in the demo code, the link is here. Let's go back to the actual demo. In the bottom-left corner you can see a deployment YAML. The cluster has four GPUs with four gigabytes of RAM each, 16 gigabytes in total, and we've got a deployment with 10 replicas, each trying to get 1.5 gigabytes, 15 gigabytes in total, which from a pure numbers point of view should fit. But each GPU only has four gigabytes, so you can basically fit only two such pods per GPU, and four GPUs times two pods means eight pods should get up and running. The millicore amount shouldn't be as limiting a factor as the memory here, since each GPU has 1000. And we're using ephemeral volumes, because if we didn't, we wouldn't get an individual volume created for each pod; that's a little drawback of using CSI. So the expected end result when I launch this deployment is that eight pods come up running and two end up pending, if the resource management works. Let's see. And it seems it is still working. The expectation is also that since the Comet Lake machine has the two cards, it should be running more pods; we should see four pods running there. And that's correct. And these NUCs have only two pods each, so everything seems fine. There's still one check we can do: let's see what kind of devices got mounted into these pods. Again, the expectation is that we should see only two pods using card number one. We have J4LHQ, which is over here; it should be on the two-card machine, and it is. And there's another one over here, Lex4GV, which should also be on the Comet Lake machine, and that's correct. So it kind of works, as long as you don't try to allocate more than one GPU.
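For reference, the deployment launched above could look roughly like this, with a CSI inline ephemeral volume so that every pod gets its own allocation. The driver name and the two parameter names are guesses standing in for whatever the demo actually used.

```yaml
# Illustrative deployment: 10 replicas x 1.5Gi against 4 GPUs x 4Gi means
# only 8 pods can be satisfied (2 per GPU) and 2 should stay pending.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-consumers
spec:
  replicas: 10
  selector:
    matchLabels:
      app: gpu-consumer
  template:
    metadata:
      labels:
        app: gpu-consumer
    spec:
      containers:
      - name: workload
        image: busybox
        command: ["sleep", "infinity"]
        volumeMounts:
        - name: gpu
          mountPath: /tmp/gpu          # the mount itself is just a CSI artifact
      volumes:
      - name: gpu
        csi:                           # inline ephemeral volume: one per pod
          driver: gpu.cdi.example.org  # invented driver name
          volumeAttributes:
            memory: "1500M"            # invented parameter names
            millicores: "100"
```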
Allocating more than one GPU per pod is basically where we have a bit of an issue with using storage: if you try to allocate two, they don't get allocated and considered at the same time, and that creates issues. Now, if I added another deployment, it should end up pending, and if I then deleted the first one, which may take some time, the pods that were pending should eventually end up running. So it works like this. Basically any kind of params could be created; I just chose to do memory and millicores in this sort of dynamic fashion.

I've got a question. How do you test this? Let's say you have an application that requires CUDA, access to the library. How do you expose the content of those drivers inside your container, so that the application can reach and make use of that library, which in turn makes use of the specific hardware? How do you test that? That's basically the question I'm asking.

I would like to direct any CUDA-related questions towards NVIDIA, really. I don't know if Renaud can fill in how they do it, but at Intel we try to do our stuff upstream, basically.

I mean, it's not specific to CUDA. I guess what I'm asking is: this framework you are thinking about has to take into account the fact that there are some sources, some code, some libraries that might need to be imported, or exposed, into the pods. I'm just saying we have to think about that, and about how that is going to be done within the context of this framework.

Let me answer that. In general, importing, well, adding anything to the container is possible: the CDI spec allows you to add new mounts to a container, or, let's say, prestart hooks. So, for example, the way NVIDIA does it is that inside the CDI JSON file, information about the config scripts and CUDA libraries is injected. It's of course a separate question whether it should be done that way, but it's a fact of life, and yes, it can work that way (a sketch of what such a file could look like follows below). The idea is that this CDI JSON file is created by a vendor-specific node agent. Similarly to CSI drivers, each vendor provides two components: the controller, which holds the custom vendor logic for allocation across the whole cluster, and the node agent, again vendor-specific code, which transforms the allocation object into the actual CDI JSON. So if a vendor has specifics, for example this library injection, it is the node component that injects those bits. Makes sense?

Yes, it does; it answers the question. Thank you.

Right. From my side, this is mostly what I had to demonstrate. We didn't get much further than this for the time being, with storage as our tool, so I suppose I'll stop sharing here.

Yeah. And again, storage is just a tool to show the concept. The final idea of how it should end up is that we are considering using CRDs, and then we need to come up with a way for those CRDs to be specified and referenced by a pod as a consumable resource. On that note, I will stop.

Awesome, thank you very much for the demo. I just wanted to ask Zvonko if you have any feedback or impressions.

I need to digest some things, but my first thought was that every GPU vendor will implement sharing in a different way: AMD with cgroups, NVIDIA with MIG; I don't know how Intel is partitioning their GPUs.
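Here is the promised sketch of a CDI JSON file that a vendor node agent could generate, including the library mount and hook injection just mentioned. The paths, the hook binary, the environment variable, and the version string are all made up; only the overall shape follows the CDI spec.

```json
{
  "cdiVersion": "0.3.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "gpu0",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/nvidia0" }
        ],
        "env": [
          "VENDOR_GPU_MEMORY=4G"
        ],
        "mounts": [
          {
            "hostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1",
            "containerPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1",
            "options": ["ro", "nosuid", "nodev", "bind"]
          }
        ],
        "hooks": [
          {
            "hookName": "createContainer",
            "path": "/usr/local/bin/vendor-gpu-hook"
          }
        ]
      }
    }
  ]
}
```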
In that last screenshot, you were requesting resources via millicores and memory. Are we going to talk about an interface for how we request those resources? I mean, millicores could be the right term, but most GPUs have something like CUs, compute units. Are we going to make an abstracted interface for requesting those resources? A GPU could implement sharing with time slicing, or via the number of CUs it provides. So that would be my question: how are we going to talk about the interface for requesting the resources?

Ukri can talk about this in more detail, but my generic answer is this. You are correct: vendors may all have different ways to control the additional parameters of their devices. Luckily, for GPUs, at least in the upstream kernel there is a discussion going on so that upstream kernel drivers will get the same unified cgroups interface to consume, or to specify, those resources. It's an open question when it will be merged into the upstream kernel and when it will stabilize, but at least some intermediate patches exist, and on our side we are testing them as well. However, for other classes of devices, for example FPGAs, I hardly believe there will be a standard interface through cgroups or through the kernel. So we might end up with scenarios where some of the parameters for the device are handled in prestart or pre-create hooks, in scripts executed before the container actually starts; for example, we handle programming the FPGA through hooks, and for other types of devices it might be some other vendor-specific logic. The whole idea is that we are trying to provide a mechanism where, first, the user has the ability to specify all the kinds of parameters required for the vendor and the particular device type, and second, there is a whole pipeline to deliver those parameters down to the point where they can actually be consumed and put to use, for example through kernel drivers or through any other mechanism. Because, practically, think about it: the same concept could work for devices over a fabric; you just need to deliver these parameters to the fabric controller so it can prepare the devices based on those parameters and whatever else is needed.

Yeah, the parameters shown in this demo were just examples, and there was no intention at all of trying to force them down anybody's throat. We tried to make this demo able to use any kind of parameter names, so there's really no limit on what they could be. It's an interesting thought to actually try to unify some of those for GPUs, but at least I didn't have any intention in that direction at the moment. We're open to any kind of discussion, of course.

Yeah, it was clear that this was not the final stage and that you don't want to have millicores for every device. But thinking of GPUs, there is some abstraction, as you just said, that we could anticipate having; it just won't fit an FPGA, and it won't fit, let's say, a NIC or something else, where we cannot say millicores or memory or anything like that. That's completely fair to say. And that's why I asked: are we going to think about adding an interface for some specific device classes, or are we going to leave this completely open, make it somehow configurable, so the controller picks up and manages whatever annotations we add there? I'm just thinking about how to move forward. Okay, great, thanks.
My initial idea, when I first thought about this, was that I don't want to limit vendors to a specific set of parameters or fields. If you look at the existing CSI layer, you can see there is a hardcoded set of parameters, a set of fields which has to be filled in; for example, you need to request the size of something, regardless of what kind of storage it is. I don't want to repeat the same thing here. The idea of using CRDs is that a really device-specific controller processes the schema, processes the parameters, and then just says: okay, the allocation is here, and this is the object which represents the allocation; and the node agent on the node knows how to handle that object and expand it into the CDI JSON. That way we can have any kind of parameters. For example, imagine we add a network device: we could pass a list of VLANs which need to be pre-created for those devices and then exposed. Or, let's say, we could have optional devices, or devices which can say "we can be shared within one node but shouldn't be shared between multiple nodes," or "we can be shared within one pod but not between multiple containers." That's the kind of flexibility in how it can be extended. The goal is to give a flexible interface so that vendor logic can handle allocation not only within one node but across the cluster, including multiple devices and over-the-fabric scenarios, plus flexibility in what kinds of resources can be consumed within one pod, so the scheduler knows what is accessible and the actual handling is done by the runtime.

Okay, that makes complete sense; that's why I also mentioned that every vendor will have their own way of sharing, and we need to adapt to their needs and not lock them into anything. Is the plan going forward to extend the device manager and device plugins to resemble the functionality you created here, or do you want to create a completely new component in Kubernetes, like your own CDI agent and the CDI node thing?

I don't believe the existing device manager, as implemented right now, is capable of, let's say, a step-by-step migration. What I'm thinking is to create a mechanism inside the kubelet similar to the CSI part; well, actually not only the kubelet, it will touch things across the whole stack. First of all, we need a mechanism in the PodSpec to reference a CRD as a consumable resource, per container or as a pod-level resource; both are valid cases (a hypothetical sketch follows below). Second, we will need a similar piece of algorithm on the scheduler side. Currently the scheduler handles storage so that a pod is not scheduled until the claim is fulfilled and we have an actual allocation, with topology information about where the volume exists, and only then does the scheduler decide where to place the pod. Here we would do the same thing: the scheduler should wait until the allocation is fulfilled, and afterwards schedule the pod to an appropriate node where the allocation is available. And one more thing, for the kubelet: it needs a CSI-like mechanism to do those few steps, publish, prepare, the basic things which need to happen before the container runs.
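As a thought experiment, a PodSpec able to reference such claims might look like this. None of these fields exist in Kubernetes today; both the pod-level and the per-container reference are invented here to illustrate the two cases mentioned above.

```yaml
# Hypothetical PodSpec shape; the deviceClaims and claims fields are invented.
apiVersion: v1
kind: Pod
metadata:
  name: claim-consumer
spec:
  deviceClaims:                    # invented pod-level list of claims
  - name: accelerator
    claimName: my-accelerator      # references a DeviceClaim object
  containers:
  - name: main
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      claims:                      # invented per-container reference
      - name: accelerator
```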
Devices are the major use case, but in reality what we are talking about is a primitive that a pod can reference as a consumable resource. It might not even result in an actual resource attached to the container; it might just help the scheduling logic of Kubernetes. Right now CRDs sit on top, and the pod is the smallest unit; eventually, if we implement what we are showing here, a CRD will also become a consumable object, with input.

Okay, thanks. I don't know, how do you all feel: good idea, bad idea?

As I said, I need to digest everything, but looking at the concept, it looks good to me for now; I need to think about it more. And as I said, there are definitely connections to be made to scheduling, topology hints, device manager, device plugins, topology manager, CPU manager, all of those connections as well.

Just to pick up on that: right now we have a big problem with the topology manager, CPU manager, and device manager, which is that they make decisions, and there is the big problem of migrating workloads between components. Making a decision for devices is a heavy thing: if you have a device, you cannot move it, it's physically attached. Some resources, like CPU cores, are easy; migration is almost instant. Memory is a bit more complex: you have to actually move the memory, which is an expensive thing. And on top of this mechanism we have the problem that, for example, the existing storage layer is completely invisible to the topology manager. So you might end up in a situation where, yes, your pod is consuming the device and that part is efficient, but its storage is connected to a different PCI bus, and this one workload slows down processing on the whole system. There are several ways to do this differently; I don't want to go into them in this particular discussion, but if we want really efficient resource management, we need to consider how storage is handled and how resources are allocated across the whole node as well.

So you are also involved in the topology-aware scheduling work, so I suppose you have this CDI use case in mind going forward, with all the gaps that we have in topology-aware scheduling, right?
We have, but again, I have to say it will be a multi-stage thing. We can improve small things right now within the current design, but in the very long term we will need to revisit some of our decisions. At least my, sorry, not my, our way of thinking about what Swati and Alex are currently doing is that we are trying to make sure the information exposed to the scheduler is enough to make good decisions. But as soon as we start to add, let's say, vendor-specific logic, for example this NVIDIA MIG, a fabric of interconnected devices, the topology inside a node might become completely irrelevant when we are building multi-node workloads. So we need some flexibility in how to expose it and how to use it in the scheduler, and it might not be as trivial as the current device plugin interfaces we have right now.

Yeah, okay, I agree. Okay, we are almost at the top of the hour; are there any other things we should discuss before closing?

Yes, we actually have one more item, which is non-root containers, and it's great that Mike at least joined. So Mika, the stage is yours.

Yes, I just wanted to follow up on this topic. It's a long-running topic; we discussed it, I think, more than six months ago. I prepared a document in July, we had some conversation, I got some feedback on the document, but then it's been mostly silent on what the next steps should be. About a month ago I created two POCs based on the proposals I laid out in the document and sent a reminder to SIG Node asking for feedback on those POCs, but I got zero feedback on those as well. So I'd like to have the conversation about what the next steps should be. I created my POCs using containerd, so Mike, since you are here, would you be willing to take a look and give your feedback on what I have?

Hey Mika, I'll take a look, thanks.

I can send you the links; I have them maintained in my private repository. One thing I was discussing with Sasha today was that I could also create a pull request for the containerd repository. I just rebased my POC onto the latest master today and everything seems to be fine. Would that be fair?

Do you want to update the status on the open issue in Kubernetes, in k/k?
I can do that, I can do that.

Then I'll just follow the link and review. If you want to push that code, that's great; we're a little swamped.

Either way, I can create the pull request to the containerd repository as well, but right now I have the code living in my private fork.

I'd actually suggest: let's also create a pull request to CRI-O, so we have pull requests for both major implementations and we do the implementation the same way.

I can certainly do that, but let me first share the update to this k/k issue with the latest code I have, and then I can also submit it as a pull request to the containerd repository, and do the same for CRI-O as well.

Yeah. I haven't looked at your document recently; did you choose one of the options?

I think I put a couple of options there; I'm not sure if you picked one. To be fair, that was June of last year.

Yes, it was a long time ago; it was around the time I was starting my summer vacation. When I got back I had a vague memory of reading through it and making some suggestions, but I don't remember the details, so I'll have to refresh my memory and then get back to you. But yeah, if you put something in there that updates it a little bit, just as a reminder; I've got it on my screen, of course. We could definitely use the help on PRs here, so we can keep the conversation going, and I'm at least committed to getting this closed and the problem sorted out. So, you know, it sounds like you did it both ways?

Yeah, I have two POCs; both ways are implemented. Our preference is the opt-out model, and the original idea that we would use runAsUser and runAsGroup as the user ID and group ID.

Yeah, I hate to have to do it with annotations, but I think that's the only way to get it through cleanly right now, until we've moved the CRI past beta; when we get to a 1.0 we'll be able to update the security context fields. But Mike, we would need to update the security context fields for this particular change, right?

Right, that's correct; that's why I'm saying we can just do it as an annotation.

Yeah, and well, Mika, we have two implementations, and we experimented with both; I think we just have to pick one.

Yeah, both work. With the PR Mika is going to create, we'll just include the one which we think is more user friendly and more friendly to device plugin developers.

Yeah, that's fair. I didn't really care which direction it was; if you guys want to pick, go ahead.

Okay, I'll post the update to the issue. And somebody was asking about the document: yes, it is available as a Google Doc link in the issue.

Thanks Mika for taking the lead on this. Thanks Mike. All right, we are at the top of the hour. Thank you everyone for joining today, thank you for listening, for watching our demos, and for asking questions. Let's continue over email and chat again in two weeks. See ya. Thanks. Bye.
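As a closing illustration of the annotation-based approach discussed above: an opt-in gated by a pod annotation could look something like the sketch below. The annotation key is invented for this note; the real key would be whatever the eventual containerd and CRI-O pull requests settle on.

```yaml
# Hypothetical pod using an invented annotation to request that injected
# device nodes be owned by runAsUser/runAsGroup instead of root.
apiVersion: v1
kind: Pod
metadata:
  name: non-root-device-user
  annotations:
    devices.cri.example.org/ownership-from-security-context: "true"
spec:
  securityContext:
    runAsUser: 1000      # device nodes would be chowned to this UID...
    runAsGroup: 1000     # ...and this GID inside the container
  containers:
  - name: workload
    image: busybox
    command: ["sh", "-c", "ls -ln /dev && sleep 3600"]
```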