Let's start with a few words about me. I'm a Golang developer and tech lead at Converge. We mainly deal with Kubernetes and cloud native stuff; we're building a platform for AI and machine learning on top of Kubernetes. I'm a husband and a father, and I'm in the middle of my skipper course — actually near the end — and I'm really excited about that.

So, what is this talk about? We will talk a bit about containers and device sharing, then about how this works in Kubernetes and how Kubernetes manages resources, then about the production-grade challenges, and finally about Meta GPU, which is our open source project.

Let's start with how we can share a device on Linux. Take a simple example: I create a simple loopback file, and I would like to mount it and use it from many different processes. I can mount it read-only as many times as I want, but if I try to mount it read-write, it will fail. It's a pretty obvious example, but it teaches us one simple fact: sharing a device between processes on Linux depends on two things — the device driver and the configuration. In some cases I will be able to share a device, and in other cases I might not.

With that, let's move to a second simple question. We are talking about Kubernetes, AI, and GPUs, so: how easily can I share a GPU device between two Docker containers? What do you think? The simplest thing we can do is just find out. I have a very simple TensorFlow script that uses GPU index zero, and I would like to run two Docker containers and see whether both will run and both will use this GPU. Let me show you quickly: this is my bash script, these are my two containers, and I will just execute the script. My two containers are up and running. If I check the logs — these are the logs for the gpu-1 container, and these are the logs for the second one — both Docker containers are up and running, which means I was able to share a single GPU between two different containers. What about nvidia-smi? You can see these processes belong to those containers, and everything seems to be working.

So if I come from the non-Kubernetes world, but I do want to use Docker — containers in general — I can say that I am able to share my GPU device between my containers, which is really good. Now I will clean this up.

So what about Kubernetes? Will it be possible to achieve exactly the same behavior there? Again, it's an open question — it might be yes, it might be no — but we can easily find out. Let me jump to my second panel: I'm going to deploy exactly the same container, but this time using a Kubernetes manifest on a Kubernetes cluster.
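A sketch of the kind of manifest being applied here — the image name, script path, and labels are assumptions, not the exact file from the demo:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mnist-train
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mnist-train
  template:
    metadata:
      labels:
        app: mnist-train
    spec:
      containers:
      - name: train
        image: tensorflow/tensorflow:latest-gpu   # assumed image
        command: ["python", "/train.py"]          # assumed script
        resources:
          limits:
            nvidia.com/gpu: 1
```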
I will create two replicas of this container inside the deployment, and I will request one GPU — exactly the same scenario as with Docker. Now, if I apply it: you can see I created the deployment, and one container is running, but the second one is pending. Any idea why it's pending, and what it is waiting for? It's waiting for a new machine. Okay, let's see why. We can easily describe the pod, and indeed: insufficient nvidia.com/gpu. At least, that is what Kubernetes thinks is happening — it says I don't have enough GPUs. Let me run nvidia-smi and see what is going on. Indeed there is only one process running, my GPU utilization is zero, and I'm using only two gigabytes of memory out of a total of almost twelve. So Kubernetes is simply not aware of the fact that I do have enough GPU capacity; it just says I don't have enough.

And this behavior doesn't seem right, because coming from the Linux world I was able to share my GPU device, and then I took Docker and I was also able to share the GPU device — so why doesn't Kubernetes allow me to achieve the same behavior? Kubernetes is based on Linux and on containers, and yet we see that it is not possible. So let me clean this up, and now I want to achieve exactly the same behavior with Kubernetes as I had with Docker. I will change my deployment manifest a bit and try to apply it again.

What kind of changes will I make? First of all, I will remove the resource limit — the nvidia.com/gpu entry — completely: in the YAML file I'm about to apply you won't find nvidia.com anywhere anymore. That is the first change. The second change is that I will add one environment variable, NVIDIA_VISIBLE_DEVICES, and statically set it to zero. Now let me create this YAML and see what happens. Okay, I have a single container; let me scale it to two containers and see whether it runs. It is running — really cool. So let me scale it to four containers — oh, not 41; although, by the way, 41 would work as well. And yes, I have four containers, which is awesome. What about nvidia-smi? In nvidia-smi I see four processes. So I was able to share my GPU: I didn't install any device plugins, nothing, and I achieved exactly the same behavior as with Docker and with plain Linux. From this point I could go to my manager and say: hey, I can share GPUs, I can do fractional GPUs, without even writing a piece of code.

But if that were true, this talk probably wouldn't be happening — there is a reason I'm presenting. And that is exactly what I'm going to talk about next.
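For reference, the "hack" I just applied boils down to a pod template roughly like this — the image and script are assumptions, and the next part of the talk is about why this is not a real solution:

```yaml
    spec:
      containers:
      - name: train
        image: tensorflow/tensorflow:latest-gpu   # assumed image
        command: ["python", "/train.py"]          # assumed script
        env:
        - name: NVIDIA_VISIBLE_DEVICES            # expose GPU 0 directly
          value: "0"
        # no resources.limits section: the scheduler has nothing to count
```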
We usually use Kubernetes when we have at least plans to use more than one node. And with more than one node we will have many different device aspects: I might have a different number of GPU units and a different amount of memory on each device, and I need to address that somehow. Also, how will I choose the right node? That is something I need to take care of in some way. How am I going to calculate total availability and capacity? In my example I started with one pod, went to two, scaled to four — it's completely unmanageable; someone needs to calculate and enforce what the availability and capacity are, and into how many pieces I want to split my device. And finally, what do I do if some devices are not healthy? Today we have all these node groups in the clouds, scale-to-zero policies and so on, and my device might not be healthy yet — I need to address that as well. There are probably many other reasons I'm not mentioning here, but they exist.

So with that, what kind of resources do we get out of the box in Kubernetes — which resources does Kubernetes manage for us? We have CPU, memory, ephemeral storage, and huge pages. All of these are available out of the box, which means Kubernetes is aware of the total capacity and of what is allocatable. That helps us, as end users, schedule our workloads: I declare how much CPU and memory I'm going to request, Kubernetes knows whether there is enough, matches the right node for me, and so on. But obviously we have many more resources than these four, and one of them is the GPU — yet Kubernetes itself does not address GPUs in any way. This is where custom node resources and the device plugin come in. Kubernetes allows us to extend it and implement the required logic for our device: I can decide how the device is going to be allocated, what exactly should happen when it is allocated, and so on. Very briefly, a device plugin is a gRPC service: it communicates with the kubelet over a Unix socket, and you need to implement five RPC methods (GetDevicePluginOptions, ListAndWatch, GetPreferredAllocation, Allocate, and PreStartContainer) plus one registration call.

All right, so let's see how Meta GPU actually works. In terms of architecture we have three binaries: the metagpu device plugin itself, which runs as a DaemonSet on each Kubernetes node; mgctl, a binary for management; and a Prometheus exporter that exports metrics. The sharing logic: by default we split each GPU unit into 100 pieces, each piece representing 1% of the card, and memory and GPU utilization are computed relative to the allocated amount of metagpus. So if I have one GPU card, I have 100 metagpus; if I have two cards, I have 200 metagpus.

Okay, so let's try to make it fractional again — but this time in the right way. I will create the YAML file, and as before, one pod is still pending. So I will edit the deployment, and this time, instead of requesting nvidia.com/gpu, I will request 30 metagpus — converge.io/metagpu — which is 30% of a card. I save it and look at the pods: now my two pods are up and running, which is exactly what I want. Now I can try to scale it — let's scale to four and see whether that works. And, obviously, it won't: I have 100 metagpus in total, I allocated 30 to each of three pods, so the fourth one can't run.
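The relevant change to the pod template is roughly this — a sketch, with the resource name written the way it is spoken in the demo; 30 means 30 of the 100 shares of one GPU:

```yaml
        resources:
          limits:
            converge.io/metagpu: 30   # request 30% of one physical GPU
```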
Now, if I look at the node itself and describe it, what I find is that I now have converge.io/metagpu with a capacity of 100, of which 90 are allocated. This actually lets me manage resource allocation in a manageable way: it's no longer the case that any user can create as much as they want — they can create exactly up to the available resources.

The next thing I'd like to show you is how GPUs behave natively inside a Kubernetes container. Let me show you something. I will create — again, it's the same script, MNIST training with TensorFlow — and within a second my container is up and running. Now, I'm running as root: I have SSH access to this machine, and as root, if I run nvidia-smi, I see my process up and running, which is expected. But if I run the same command from inside the container, I don't see any processes at all. Why is that? Something is going wrong here. Usually, when we run in the cloud with all these immutable nodes, the data scientist won't have SSH access to the node. And if I'm just running my MNIST model training and want to see the utilization — as a data scientist I simply want to see the actual memory usage — I won't be able to get that information.

We wanted to address this issue, so we created the mgctl binary, which allows us — and not only us, but our users — to see all the relevant information about their workloads. Again, as root I can execute this command, but now I get much more information. nvidia-smi doesn't give me any information about Kubernetes itself — I only get the process name, the memory, and the PID — but I'm running inside Kubernetes, so I'd like to see, for example, the pod name, the pod's namespace, how many metagpus are allocated to this pod, and some node information, like which node it is actually running on. And if I run the same command as the container owner, I still get useful information that isn't available with nvidia-smi. I believe this is pretty useful, because users no longer need root access — they are completely fine running this command inside their containers. Okay, I'll clean this up as well and move forward.

So what about limit enforcement? Once we are able to split — to share — a single GPU, we need some way to enforce the amount of memory or GPU utilization that has been allocated to each share. Kubernetes does the scheduling for us, but whereas cgroups enforce regular memory limits, there are no cgroups for GPUs, so we need to address this problem too. Let me create — okay, now I've created two MNIST training jobs again, and if I run mgctl, what I see is that my second container has turned red. You see it's red. Why is it red? Because it uses more memory than it should: it requested 10 metagpus, which is 10% of the GPU unit.
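The share-to-budget math behind that 10% figure is simple; a hypothetical helper (names and numbers are illustrative, not the project's code) showing how a metagpu request maps to a memory budget:

```python
# Each physical GPU is split into 100 metagpus; a pod's memory budget is its
# share of the card's total memory.
def memory_budget_bytes(metagpu_request: int,
                        gpu_total_bytes: int,
                        shares_per_gpu: int = 100) -> int:
    return gpu_total_bytes * metagpu_request // shares_per_gpu

# e.g. 10 metagpus on a ~12 GiB card -> a budget of roughly 1.2 GiB
budget = memory_budget_bytes(10, 12 * 1024**3)
print(f"budget ~= {budget / 1024**3:.1f} GiB")
```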
So it should use up to roughly one gigabyte, but in fact it uses much more. What can I do about that? I can issue an alert, or I can just pick up the phone and say: hey, dear user, you are using more than you should — and that's if I'm a nice person. But I can also enforce it the same way cgroups enforce it: by killing the process. I can achieve this enforcement by changing the configuration — the memory enforcement option is currently set to false, and I will change it to true. Now I need to restart my device plugin to make sure it picks this up. My device plugin has been restarted, and in a few moments we'll see that this container gets terminated. From now on, every time this container starts and uses more memory than it should in this scenario, it will be killed and will fall into the CrashLoopBackOff state. Then we can choose how to address it: either reduce the amount of memory the workload uses, or increase the amount of metagpus requested.

How does the event look? Currently we don't have events, but we do have the Prometheus exporter, which can be used with Prometheus and Alertmanager, and based on that users can create alerts. And it's an open source project — it's completely open on GitHub — so any contributions are welcome.

With that, that's basically all I wanted to share with you. A quick recap: maximum utilization, limit enforcement, dynamic resharing, ease of use, and Kubernetes native — this is what we are trying to achieve. It is open source; please use it, please contribute. Thank you. Does anyone have any questions?

Hello — thanks for the talk. I was wondering if you might comment a little on how this interacts, if at all, with the vendor tools that do the same thing — from NVIDIA, for example, the multi-instance GPU (MIG), which I saw in your driver screenshot.

Good question, thank you. First of all, not every GPU card supports MIG; there are many other, older cards that don't support it, and we still want to share them. That's the first comment. The second thing is that configuring MIG is sometimes not an easy thing to do, and it has limitations: you might need to reboot the node, and you cannot easily re-share, because if you want to re-partition you need to stop all the workloads, and so on. The plan for this project is to use NVIDIA's open source libraries — NVML, for example — and build cloud native, Kubernetes native tooling on top of them that simplifies the day-to-day work.
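As an aside, the kind of per-device and per-process data that tooling like mgctl and the exporter surface can be pulled from NVML. A minimal sketch with the Python bindings (the nvidia-ml-py package) — purely an illustration, not the project's actual code:

```python
# List memory, utilization, and compute processes for GPU 0 via NVML.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # total/used/free bytes
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # .gpu / .memory in percent
    print(f"used {mem.used / 1024**2:.0f} MiB of {mem.total / 1024**2:.0f} MiB, "
          f"gpu util {util.gpu}%")
    for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        print(f"pid={p.pid} usedGpuMemory={p.usedGpuMemory}")
finally:
    pynvml.nvmlShutdown()
```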
As a user in a team, especially when you're running in a multi-tenant setup, how do you get information up front about what GPU resources are on a node before you start scheduling? And if you do schedule and there is no suitable node, is there any possibility of this working with cluster autoscalers, so that more GPU nodes get scaled up?

Right. So what I showed is the converge.io/metagpu resource. The bigger plan is to create logic that lets you bind a resource name to a specific GPU, so you might imagine having multiple resource names bound to different cards. For example, I could even create a resource called converge.io/dmitri, and that resource would represent a particular GPU that only I can use. Regarding multi-tenancy and how you can see all these devices: we have different visibility scopes, and when you run the command with the device-level visibility scope, you see the complete spec of the device. To see devices across different machines, right now you need to run the command on each node. We have plans — we first need to see how the community responds, whether people like it and want to use it — and then we'll put effort into a central control plane where you can see all your devices, Grafana dashboards, alerts, everything that is really needed for a production-grade open source project. Thank you.

One quick question: is it possible to overcommit — to request, for example, many devices, but only enforce the limit if a user really goes above the cap?

Good question. Yes, you can overcommit, but then there are two scenarios: your container might get killed, or it might keep running as usual. To avoid overcommitting, you need to add support inside your code. What does that mean? When I run my TensorFlow scripts, I specify the total amount of memory available to them, so there needs to be a tango between the device plugin and your code. In a certain way it is very similar to Java: if I run Java without the min and max memory parameters, it will use all the memory regardless of what I configured in my Kubernetes limits and requests. So in short, yes, overcommitment is possible, and to avoid it you need to apply changes at the code level. Or you can leave your code as is and decide what should happen when an overcommit occurs: either kill the container, or leave it running and get an alert for it. Thank you very much. Thank you, guys.
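A footnote to that last answer: one way to cap TensorFlow's GPU memory from inside the workload looks roughly like this — a sketch, not the exact script from the demo, and the 1024 MiB figure is an assumption that would be matched to the pod's metagpu share:

```python
# Cap TensorFlow's GPU memory so the process stays inside its share.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)],  # MiB, illustrative
    )
```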