Right, so let's start. I'm a software engineer at Red Hat, and this is just a quick summary of what I do. So, like some of the people previously, I'm working on oVirt, and then on KubeVirt, which is the virtualization-for-Kubernetes thingy. I mostly work on the node and host management stuff. The things you can see on the slide are pretty much the features that I've had my hands on, and the last thing is I have a blog, read my blog. It's cool.

Now, this talk is very cross-stack, because for device assignment we have to start really at the low level. We're starting somewhere at the kernel driver level, then going up through KVM, QEMU, libvirt, all the way up to the cloud. So let's do a small poll. Who here knows about device assignment for VMs? That's not bad. And who knows about device plugins in Kubernetes? Oh, that's a bit worse. Okay, I'm hoping to change that. So the stack that we will go through pretty much starts at the vfio-pci driver and then goes all the way up to the Kubernetes level.

So, moving on, let's start with a philosophical question: what even is a device? It turns out you can ask 10 people and get 10 responses, because it depends on how far you are from the kernel itself. For someone, a device is the actual GPU that you buy in a store and plug into a computer. For someone in the kernel, it could be just a bunch of memory regions. For us working on the node, it's usually exposed as several paths in the system. It could be a sysfs PCI device, or just something under /dev. For example, /dev/kvm is also a device.

Now, to deal with PCI devices and virtualization, we can't just transport the device into the VM, right? There needs to be something to handle the isolation for virtualization itself, and for that there's the vfio-pci driver. This is really a quick overview, because this by itself is a whole talk. So if you're interested in VFIO PCI, you can see my other talk. What's important is that to properly assign a device to a VM, it needs to be bound to the vfio-pci driver.

Unfortunately, it's not that easy, because devices come with different degrees of isolation. You may have a device that's perfectly isolated. On the other hand, you may have several devices that are not really isolated from each other, and in that case they're not really suitable for virtualization. This isolation is expressed in something we call IOMMU groups. I really like the example of a consumer-level GPU. If you have a regular graphics card, it probably comes with a sound card on it for the HDMI output, and in this case both of these devices on the single card will be in one IOMMU group. When devices are bound to vfio-pci, they become accessible at /dev/vfio/<group number>.

Here's a more graphic example. It's taken from a different talk I gave, but it nicely sums up the situation with IOMMU groups. So you have several devices, and some of them are behind PCI bridges. In this case device 0, in IOMMU group 1, is in an IOMMU group by itself. There's no bridge on its way to the CPU, and this is the perfect case, because for virtualization we can pretty much assign the device by itself. We don't have to deal with IOMMU groups or anything. So that's great. The second case, groups 2 and 3, is also great. There happens to be a bridge that supports PCI Express Access Control Services (ACS), and this allows it to isolate each of the devices in its own group. So again, a perfect case. The not-so-perfect case is when you have a bridge that doesn't support PCI Express ACS: all the devices behind that bridge, and even the bridge itself, are going to be in a single group. In that case you can't really assign just device 3 or device 4; you will probably have to work with them as one whole device.
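To make the grouping a bit more concrete, here's a minimal Go sketch, assuming the standard Linux sysfs layout, that lists every IOMMU group and the devices inside it. This isn't KubeVirt code, just the walk you could do by hand on any host:

```go
// List IOMMU groups and the PCI devices inside each one.
// Minimal sketch, assuming a Linux host with the IOMMU enabled;
// the paths are the standard sysfs layout.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	groups, err := filepath.Glob("/sys/kernel/iommu_groups/*")
	if err != nil {
		panic(err)
	}
	for _, group := range groups {
		// Each group directory has a "devices" subdirectory with one
		// entry per member device, named by its PCI address.
		devices, err := os.ReadDir(filepath.Join(group, "devices"))
		if err != nil {
			continue
		}
		fmt.Printf("group %s:\n", filepath.Base(group))
		for _, dev := range devices {
			// Devices sharing a group can only be assigned together.
			fmt.Printf("  %s\n", dev.Name())
		}
	}
}
```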
Now, moving up the stack, we have libvirt. libvirt is a library and a daemon to manage VMs. I'll base the talk on the KVM/QEMU stack, but libvirt is also able to manage Xen and other hypervisors. So really, QEMU is just handy for me in this case because I've worked with it the most, but in the end this should be applicable to other hypervisors too. What does libvirt do? You may have seen some of the previous talks with crazy QEMU command lines. If you try to have a proper VM with all the devices that need to be there, the virtualized ones and the real ones, it gets really long. libvirt tries to make this better by creating an abstraction above the command line. It's an XML format, but that doesn't matter too much; it's mostly accessed programmatically anyway.

Now, the important part about libvirt is this snippet, where you can see that the device is given by its address. You're referring to one specific device, not the IOMMU group, not anything else, really just the one specific path. That doesn't play well with the IOMMU concept, because of the grouping, and we'll probably have to work around that in the future.
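For illustration, here's a small self-contained Go sketch that renders that kind of snippet. The <hostdev> layout follows libvirt's documented format for PCI assignment; the concrete address values are made up:

```go
// Render the libvirt <hostdev> element that assigns one PCI device
// by address. The XML shape is libvirt's documented hostdev format;
// the address below is just an example.
package main

import (
	"encoding/xml"
	"fmt"
)

type Address struct {
	Domain   string `xml:"domain,attr"`
	Bus      string `xml:"bus,attr"`
	Slot     string `xml:"slot,attr"`
	Function string `xml:"function,attr"`
}

type Hostdev struct {
	XMLName xml.Name `xml:"hostdev"`
	Mode    string   `xml:"mode,attr"`
	Type    string   `xml:"type,attr"`
	Managed string   `xml:"managed,attr"`
	Source  struct {
		Address Address `xml:"address"`
	} `xml:"source"`
}

func main() {
	dev := Hostdev{Mode: "subsystem", Type: "pci", Managed: "yes"}
	// Note: this names exactly one device by address -- nothing here
	// expresses the IOMMU group that device belongs to.
	dev.Source.Address = Address{"0x0000", "0x06", "0x12", "0x5"}
	out, _ := xml.MarshalIndent(dev, "", "  ")
	fmt.Println(string(out))
}
```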
Again, going higher. This is actually a pretty different path. So for containers, this is again like a 101. Keep in mind this is different from VFIO, but we still need to know about it, because those two things are going to merge at some point. For containers, we luckily don't need any kind of special driver. On the other hand, we can really just pass through anything, because what we are giving the container is a path to the device. It can be the /dev path, or the /sys path, or pretty much any other path in the system. In the case of the technology that has, like, a whale in its logo, it supports several flags, like --device, which pretty much sets up the cgroups and makes sure the device is accessible in the container. Then, through a volume, you can expose the path itself. The fun part is that you may have to run the container in privileged mode, but that depends pretty much on what kind of endpoint you are trying to access. So it's a guessing game.

Why is this useful? Well, some devices like GPUs expose, for example, just the direct rendering endpoints instead of having the whole memory region thingy. There are several files that you need to do the rendering itself. There are also toolkits: if you're using CUDA, CUDA itself is exposed as a path. So again, you can get that into a container pretty easily.

Now, this is all great, but it doesn't really work when you're in a cluster, right? You don't have one host, and you're definitely not going to be setting up Docker on each host in that cluster. This whole thing just becomes a building block for Kubernetes and how it does device assignment. So that's where we are headed. I'm not sure how many introductions to Kubernetes you have heard today; I hope it's not that many. I've tried to really extract only the points that are required to talk about device assignment. So we have this idea of declaratively orchestrated containers through pods, where a pod is just several containers.

All of these objects within Kubernetes are just resources, and that's the really important part. Remember, everything is just a resource. For this talk, I'll just be showing the resources as YAMLs. So I believe, okay, we don't have resources yet; you'll see multiple resources later on.

The first idea of device assignment for Kubernetes was added in version 1.3. I'm not sure how long ago that is, but it's quite some time, and there are people and companies using it for machine learning stuff. It is very vendor-specific, as you can guess from the name. What it does is allow you to express the need for some number of NVIDIA GPUs in your container. Or, yeah, in the container, sorry. And that's pretty much it. So this is how it looks. This is a normal pod spec, and you can see that there's some container, and the container requests a resource that happens to be named NVIDIA GPU.

There are several problems with this approach, one of them being the fact that it's vendor-specific. So if you don't have this kind of GPU, then, well, you're out of luck. The other problem is that you may want some specific kind of GPU, right? Maybe part of your cluster has these big, beefy GPUs that have a lot of memory, while others are used just for rendering, and there's no way to express that. For this reason, this pretty specific concept has been deprecated in favor of something called device plugins.

Device plugins are the main thing that's interesting for Kubernetes and for this whole device ecosystem that we may have in Kubernetes. They're not really specific to VMs; they're just a way to express some kind of resource that is in the system. They were added in Kubernetes 1.8, and that's not a long time ago; it's, I guess, several months. The presentation will sometimes shorten it to DPI, but that's just a presentation implementation detail. It's still very early, in alpha. There was a discussion about enabling them by default, so Kubernetes clusters wouldn't have to have the feature gate added, but it has been postponed, so they're not yet enabled by default.

So what really is a device plugin? Turns out it's pretty simple. It is a binary that runs a gRPC server, or maybe more servers, and is responsible for tracking the resource, reporting it, and doing the allocation for the container itself. There are three endpoints. One of them is register: when the device plugin boots up, it needs to register itself with the Kubernetes node agent, the kubelet. You do that via the register API, where you tell the kubelet: oh, hello, I'm a device plugin, I track this resource, just keep me in mind. There's an Allocate call, which is the one that's actually called right when you're creating the container and it needs to get the device. This is the place where you can do some kind of setup, make sure the device is in the correct state, and then pass it to the container. And there's of course ListAndWatch, which is used for tracking the devices of the resource. Maybe, I'm not sure, there could be a fourth call that's currently in implementation, called Initialize; that's a call that would be invoked right before the container is started, so you don't really have to do device setup in Allocate. But it's still under discussion in Kubernetes itself.

So the fun bit is that we have one gRPC server per tracked resource. This is great if the resource you're exposing is something that you could call "my GPU" or whatever.
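To give you an idea of the shape of that contract, here's a simplified Go sketch. These are emphatically not the generated gRPC types from the Kubernetes tree, just illustrative declarations mirroring the calls I described:

```go
// Simplified sketch of the device plugin contract; the real API lives
// in the Kubernetes device plugin proto, these types are illustrative.
package deviceplugin

// Device is one allocatable unit of the tracked resource.
type Device struct {
	ID     string
	Health string // "Healthy" or "Unhealthy"
}

// AllocateRequest names the devices a container should receive.
type AllocateRequest struct {
	DeviceIDs []string
}

// AllocateResponse tells the kubelet what to wire into the container.
type AllocateResponse struct {
	DevicePaths []string // host paths to expose, e.g. /dev/vfio/<group>
}

// DevicePlugin is what the plugin's gRPC server has to implement.
type DevicePlugin interface {
	// ListAndWatch streams the current device set and re-sends it
	// whenever availability or health changes.
	ListAndWatch(send func([]Device) error) error
	// Allocate is called right when a container that requested the
	// resource is created; device setup can happen here.
	Allocate(req AllocateRequest) (AllocateResponse, error)
}

// Registration is the remaining piece: on startup the plugin dials the
// kubelet's unix socket and announces the resource name it tracks.
```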
The problem with virtualization and devices is that you probably want to expose multiple devices. And I have this code that I really like. You don't really have to understand it, but the thing is, there's a for cycle that runs for each device in the system and starts the actual gRPC server. So in the end, my machine has roughly 50 PCI Express devices, and this thing just starts 50 gRPC servers just to expose something like this. That's not perfect. It's still alpha; let's say it will get better.

But yeah, when we actually have the device plugin that is able to, in this case, track PCI Express devices, this is what you get in the node description of the Kubernetes node. There's some namespace, which happens to be my blog right now, and then what this plugin exposes is the actual vendor ID and device ID tuple. I really wanted to use a colon, but you can't use a colon, so it looks different from lspci, but I mean, we get there. And then there's the number of resources that are actually available. In the case of PCI Express, you'll usually see a lot of devices in quantity one, because a lot of devices are just things like CPU temperature sensors, and those are not really great for assignment. On the other hand, the ones that you see, for example, in quantity four may be network interfaces or something like that. This one is definitely a network interface.

Now that you have this stuff exposed at the node level, you need to get it to the actual running containers. And in this case, this is the pod spec, the specification of how your pod is built, and it's similar to the actual NVIDIA device plugin, where you just have the request for the specific device. There's one implementation detail: the requests and limits need to equal each other for it to work properly, but again, that's something that might be improved later on. I'll show a small sketch of such a spec in a moment.

So what are these plugins? They're a really flexible tool to advertise any resource. It does look like they were initially built to just expose one kind of cloud-ready resource. On the other hand, you can hack around that, and really think about the things you can advertise that way. Take /dev/kvm. /dev/kvm happens to be a device that we might just expose this way and mount into a container. That is great, because then the container could contain a VM, an actual VM that could run. So that's what's really cool about device plugins: they're flexible enough that you can expose anything that you can represent by a path in the system, make sure it's in the container, and the whole scheduling and tracking of the resource is something you get for free.

Something to keep in mind is that there are some rough edges. For example, when the container that used the device dies, there's no signal of that event back to the device plugin. So if you would like to do some cleanup, you're currently out of luck.
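Here's the sketch of such a pod spec I promised, built with the k8s.io/api Go types rather than YAML. The resource name is a made-up example; substitute whatever your device plugin actually registered:

```go
// Build a pod spec that requests one device exposed as an extended
// resource. "example.com/1af4_1000" is a hypothetical resource name.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	devices := v1.ResourceList{
		v1.ResourceName("example.com/1af4_1000"): resource.MustParse("1"),
	}
	pod := v1.Pod{
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "demo",
				Image: "fedora",
				Resources: v1.ResourceRequirements{
					// Implementation detail: requests and limits
					// need to equal each other to work properly.
					Requests: devices,
					Limits:   devices,
				},
			}},
		},
	}
	fmt.Printf("%+v\n", pod.Spec.Containers[0].Resources)
}
```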
Now, KubeVirt. So, who has heard of KubeVirt here? Great, it's improving. The idea of KubeVirt is having a Kubernetes add-on that allows you to run VMs in Kubernetes, in the same cluster, using the same or similar resources that you would normally use. To do that, it has some custom resources for the virtual machine itself — you could see the talk before about migration, for example — and several services that are there to manage the resources. This thing is based on libvirt, and this is where it all comes together, because we have device assignment for VMs through libvirt, and we have device plugins in Kubernetes, right? So it starts to make sense to have this in one place.

And this is the actual architecture diagram, and this is a really sad thing: it was up to date five days ago, but it's now outdated. A few things have changed. But anyway, this is the whole architecture, where you can see there's an extra service close to the API server that handles our cluster requests. On the other hand, we have a service running alongside the kubelet that handles the node level, and the node level is the important part here. It's called virt-handler, and it previously talked to libvirtd to actually inject VMs into containers. This has changed, and now libvirt is running in each pod where a VM is running. So that's a bit of a difference, but it doesn't really change the way we can do device assignment.

So how can we get devices there? It turns out the whole thing we need to do is just create a device plugin, plus some way of transforming the idea that there are some devices in the pod to libvirt itself. The device plugin already exists. It was recently moved: it was previously in my personal GitHub repo, but now it's under the KubeVirt GitHub repo. You can fetch it, you can try it. It pretty much works. Mostly, but we'll come to that.

What it does is, first, it has to ensure that vfio-pci is actually loaded on the host. This already makes it a pretty evil thing, because it runs modprobe inside containers, so we have to make sure that the container has the correct capabilities, that it's privileged. Pretty evil stuff already, and we're only at the first point. The next part is that the device tree, at least for PCI, is nicely exposed under sysfs. So we can just traverse the path you see, /sys/bus/pci/devices, and get some info about each device there: the vendor ID, the device ID, and the IOMMU group. And then there's just the gRPC part, where it's reported back to the kubelet. That is what we have.
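That host-side part is simple enough to sketch. A minimal Go version of the traversal, assuming the standard sysfs layout and leaving the gRPC reporting out:

```go
// Walk /sys/bus/pci/devices and read the vendor ID, device ID and
// IOMMU group of each device -- roughly what the plugin reports.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	devs, _ := filepath.Glob("/sys/bus/pci/devices/*")
	for _, dev := range devs {
		vendor, _ := os.ReadFile(filepath.Join(dev, "vendor"))
		device, _ := os.ReadFile(filepath.Join(dev, "device"))
		// iommu_group is a symlink to /sys/kernel/iommu_groups/<n>.
		group, err := os.Readlink(filepath.Join(dev, "iommu_group"))
		groupID := "none"
		if err == nil {
			groupID = filepath.Base(group)
		}
		fmt.Printf("%s vendor=%s device=%s iommu_group=%s\n",
			filepath.Base(dev),
			strings.TrimSpace(string(vendor)),
			strings.TrimSpace(string(device)),
			groupID)
	}
}
```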
It turns out that the amount of things we don't have is probably bigger than the amount of things we have. One of the missing parts is IOMMU group awareness. This turns out to be really tricky in the case of device plugins, because they really work at per-device granularity. So you can't really say, oh, this device is actually bound to another device. This might be worked around by having health reporting for each device — this is one thing I forgot about device plugins: they also allow you to check the health of devices, and this might be abusable to plug in the IOMMU group awareness. You would report devices within one group that are not to be assigned as unhealthy. It's kind of hacky, but it might work. Maybe a cleaner solution is coming in Kubernetes 1.11, hopefully, and that is device topology. That would be really nice, because if you had the device topology expressed in the node, you could say: I want this device, and I expect it to be bound to multiple other devices. There might even be some hosts with that device but a different topology, and in that case this would work really nicely.

Now, obviously there's the device deallocation problem. The fun part with vfio-pci is that if you bind a device to it, the device is then unusable for what it was doing previously. So if you bind a NIC to vfio-pci, it doesn't really work as a NIC in the system anymore; it's just this stub that is to be passed to the VM. So in case you would want to actually deallocate that device after the VM is done running, there's no way to do that currently. Funny enough, I was thinking a bit about that, and it turns out that the VFIO endpoint itself is just opened by QEMU. So you could run inotify against that endpoint and track the close call: when the file is closed, you can deallocate the actual VFIO part. This could solve a lot.
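A brain-dump-level sketch of that idea, using raw inotify via golang.org/x/sys/unix. The group number is made up, and error handling is minimal:

```go
// Watch a /dev/vfio/<group> endpoint and react when it is closed,
// i.e. when QEMU releases the group. Sketch only; "42" is made up.
package main

import (
	"fmt"
	"unsafe"

	"golang.org/x/sys/unix"
)

func main() {
	fd, err := unix.InotifyInit()
	if err != nil {
		panic(err)
	}
	// IN_CLOSE covers both close-after-write and close-after-read.
	if _, err := unix.InotifyAddWatch(fd, "/dev/vfio/42", unix.IN_CLOSE); err != nil {
		panic(err)
	}
	buf := make([]byte, 4096)
	for {
		n, err := unix.Read(fd, buf)
		if err != nil || n < unix.SizeofInotifyEvent {
			return
		}
		for off := 0; off < n; {
			ev := (*unix.InotifyEvent)(unsafe.Pointer(&buf[off]))
			if ev.Mask&unix.IN_CLOSE != 0 {
				// The group endpoint was closed; this is where
				// deallocation could be triggered.
				fmt.Println("VFIO group endpoint closed")
			}
			off += unix.SizeofInotifyEvent + int(ev.Len)
		}
	}
}
```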
It's still not perfect, because some devices should not be deallocated. Certain GPUs, for example, if they're rebound to their original driver, actually reboot your host. That's great. And what we also don't really have is the edge case handling. This is something that's just really missing on our side; it's not that anything is missing in Kubernetes, but you still have to take care of what happens when the kubelet dies, when the device plugin dies. This really needs to be added.

So how do we bridge these two things on the API level? The first thing we have — it's not exactly what we have, it's just an idea of what we could have — is this specification of the device itself. We request something that has vendor 1000, device 1000 in the VM spec. So this is the KubeVirt part. We need to translate that into the actual pod specification, right? There's a tool for that. It's based on the concept of Kubernetes initializers. A Kubernetes initializer is, again, a thing that runs in the cluster and is able to mutate the specification before it is posted, or before it is recognized in the API server. And you can use these to do pretty much any mutation of the spec itself. So in this case, you could actually try to initialize every pod submitted to the cluster, look up the corresponding VM resource, and then do the device transfer, which is kind of trivial; it's almost a string operation. It might not be needed at all — that's the fun part. It's just the very clean Kubernetes way, but since we already have virt-handler, and virt-controller, the cluster element, this thing could be done on that level. So yeah, this was just an idea that might not make it to a real state in the future.

So that's it. Well, almost. We have the device in the pod, we have the VM XML. The problem is that libvirt expects addresses, and in the pod we might not have the device address properly expressed. Like, if the pod mounted the whole /sys, then every device in the system would be in that /sys within the container. So how do you figure out which device addresses are the ones that you actually want to assign? This is really just a brain dump, I'd say. I was thinking about it: initially you mount the whole /sys, and that's where you get the problem of how to find out the addresses of the devices. You could try to mount just the subpath, just the device path, but then there's a problem, because anyone else might require that the pod has the whole /sys mounted, and that would get overridden. So yeah, something else, like a sidecar file for that. I guess we're not there yet; these are just ideas. So if you can think of a different way that would be cleaner, feel free to comment.

So that's pretty much devices in KubeVirt in 40 minutes. It's quite a lot of stuff, but these are the main parts. There's a proposal for how to do that; if you're interested in the whole device plugins thing or in KubeVirt, give it a read. The device plugin itself has now moved under the KubeVirt umbrella. Then we have the initializer that I will probably delete someday. And yeah, that's it for me. Any comments or suggestions about this approach and the whole state of the thing are welcome.

So this is the real summary: VMs are real, and so is the device assignment. It works if you work around some of the rough edges.

Yep, it's question time. So the first question was if I have a demo. I can probably show you a demo after; I wouldn't run it here, because I'd have to SSH over to my host. So we can stop by later and I can show you. What was the second question? USB. Not yet. Give me like a few hours and I will write the device plugin for that. There would be pretty much the same problems as with PCI. Again, just exposing the devices would be pretty simple. Again, it would start around one gRPC server per USB device, so don't plug in all 128 devices. And again, you get to the problem of how you actually find that in the pod. But yeah, pretty much the same thing.

Yeah, so I will try to repeat a short part of the question. How is this whole thing really cloudy? Because device assignment is an anti-pattern for VMs, and it's kind of an anti-pattern for containers too. Did I get that right? Yeah, it's not really for the general cloud use case. This is really for when you have a cluster that has very specific workloads. It's not just about GPUs; you still have network interfaces, and that is happening in Kube a lot, right? How do you make your container have super high network performance? This is really for the 5% of use cases that are important to the people that try to get these in. So I wouldn't say that this is a really generic cloud way of doing things; it's really for specialized applications. But yeah, I'm trying to do it as cloudy as possible. This whole device address manipulation is just an implementation detail. So if you were to use KubeVirt with device assignment, this whole thing really translates to a few slides back, where you would just specify this, and we do all the magic behind it, and then you get your device.

Yeah, okay. The question is how this is integrated into the scheduler, and how it works with live migration. The first part is great: I wrote a lot of code for that, exactly zero lines of code. The reason for that is that this plugs into what the kube-scheduler already does. It's able to track the resources — the node exposes this resource, the pod requires this resource — it just merges these together, and then it's able to find the nodes where the resources are available. The other question was how this works with live migration. The answer is: it doesn't.

Any other questions or comments? Okay, so I could use a special USB driver to... so it's USB over Ethernet. Yeah, yeah. So there's the USB-over-IP driver that can be used to use USB over IP. I think we could work with that. I just don't see all the challenges that we would have implementing it, but again, as long as we can get that represented as a device path, we could probably get it in as a device plugin. The question is whether you would require the capability to use the remote USB, or would you require the device itself to be present? You know, this would be the main pain point that maybe doesn't really fit Kubernetes right now.

But if there are — is there a question? Okay, there's a question. Yeah, the question is about vendors and hardware compatibility. So with the legacy plugin, there could be something; I haven't really seen the documentation itself, but I think they say you need to have some specific kind of GPU. In the case of VFIO, the thing is, we don't put any expectations on the GPU itself.
It's really about the vendor, and there's some documentation about it. We know which devices work, we know devices that don't, and there's also a class of devices that tries to block you from assigning them, and there are workarounds for that. So the list is... there's not a single page. Some of it is in the upstream documentation for oVirt, for example; most of the people try to ask on the VFIO users list. I guess when vendors get into this more, they might publish something, really. But it's not the case that having an AMD GPU would make this unusable. It is still usable; it just depends on your luck, pretty much.

Okay, so the question is if I'd use this in production. The answer is no. This is still early; you can see that some of the issues are still unsolved. So I'm just developing this and trying to figure out how to make it work at the lowest level. So, like, this whole thing, or just part of the thing? Well, I could do some demo, as I said, outside. But I'd say not yet; there's no complete demo where I would launch a node in Kubernetes that would... Yeah, well, you could use it for machine learning; that's probably the best demo for it. Bitcoin — definitely, mine bitcoins on it. Yeah, these would be the main demo use cases. Yeah, Bitcoin as a cloud, great idea. Let's productize that.

So that's it. One more. Yeah, so I believe that at least for some features, OpenStack Nova is copying my work from oVirt. So there is a kind of feedback loop around how devices work. And maybe Fabian wants to interject. So the answer is that there is a person from OpenStack Nova working on the same things that I work on, and I'd say we're just copying each other's work. So I am aware of how OpenStack Nova does it; I did it in oVirt. There are some differences that we know about, and maybe the KubeVirt way will be a merger of both approaches that will hopefully take the best parts of both and leave the worst parts out.

Right, that's it. Thanks for coming.