virtual machines. So as you probably know, GPUs are used to accelerate graphics rendering, for instance to play video games, or for design and modeling software. To play video games, that's important too. For data visualization: what you see there is an example where we show, in real time, about 1 billion data points using this visualization software. Playing video games, in case you forgot. They are also quite good for any kind of parallelizable computation, for instance machine learning and artificial intelligence. You have here a training set that NVIDIA uses to train its AI for driving. Human genome analysis. High-quality 3D rendering, when you don't do it in real time. Now, for those of you who are interested in GPUs: at SIGGRAPH 2018, about three days ago, NVIDIA announced a new generation called Turing, and for the first time they claimed they can do real-time ray tracing. So why does it matter? Because it's actually not that new. Our friend Ulrich Drepper at DevConf was talking about GPUs being something like 120,000 threads that you can use at the same time. So how can you use that much GPU power? Well, thanks to a guy, Íñigo Quílez; you see his photo here. He's been working for Pixar and companies like that. So here is a very simple example. This is real-time ray tracing. If you want to check that this is real time, this is the actual time on the clock: you can check against your own watch that it's the right time. Now, this is pretty naive; it's something I did about six or seven years ago. You can do much better now, for instance this. It's the same idea, a real-time clock. You may recognize a Dalí work, you know, the melting clock, but now it's showing the real time. And you can do a number of different effects, characters, landscapes. This is a realistic-looking rock, except you will see that it changes shape over time. So this is all computed.
Now, this one is taxing my GPU a little bit here when running full HD. You can create moods. All this is done in real time; this is not a movie. You can create landscapes that have physically nonsensical features. OK, so how does it actually work inside? The first level at the top there, the application, is going to call an API, one of a number of graphics APIs like OpenGL, Vulkan, et cetera. Then that goes through a GPU driver, which sends it in some proprietary format, what I call the graphics bit stream, to the chip, through the chipset driver in the Linux kernel. The graphics card renders that into a frame buffer, and the frame buffer is then converted to a digital signal that goes to your screen. Now, the compute acceleration that I told you about, for instance for artificial intelligence, follows a similar path, except of course with different APIs, and you don't send the output to a screen. Now, why would you want to virtualize GPUs? Well, one reason, for instance, is compatibility, for software development and testing, if you want to run a guest operating system like Windows, for instance. For flexibility: this can allow you to have video streaming to thin clients, and this enables cloud gaming. And of course, all the benefits you get from scalability, management, et cetera, you can get with graphical devices as well. Now, in terms of large-scale deployment, this is the Titan supercomputer, powered by NVIDIA GPUs. Most of the compute power in this kind of machine now is in GPUs. And of course, we can have swarms of GPU-accelerated nodes. OK, so the problem is that when you want to actually virtualize this, there are many, many solutions, and it's a bit complex. The simplest and most naive one is to have a full-device emulation of a VGA-class device. This is really how we did it in the 2000s or so.
And the good thing about it is that it's very compatible, for instance with old software and old hardware. And it works at guest boot, so you can emulate that before the firmware is even loaded. It supports practically all the virtualization features: you can do migration, things like that. You can have as many concurrent virtual machines as you want. And you can have remote access to it; I'm going to talk about remote access at the end of this talk. Now, the con of this approach is that it's very slow, because you're emulating a device that was not designed for virtualization. It has tons of legacy quirks, around memory for instance. It doesn't do 3D; VGA doesn't have 3D by default, so if you have 3D, it's all software. No compositing, so no modern desktop. And that's basically the reason why you want to move to something else. The next step up is to use a paravirtualized kind of interface that you expose to the guest. Your graphics API now talks by sending some specific commands that are going to be converted by the host driver. So now you have the guest channel in there, and you have a virtual driver that talks to the actual driver. It's slightly more complicated; it's obviously simplified in this picture. But basically, now you're sending a stream of virtual graphics commands, and the rest, the bottom of the stack, is the same as before. Now, the pros of that are that you can accelerate 2D, and you can get some 3D. Basically, what you don't get is specific card features. It's flexible. The cons are that you can't expose the vendor features from the card, because what you're exposing is a virtual device that doesn't have the actual card's features. That limits the virtualization features as well. And the performance is at best medium; you don't get direct buffer access to the card, things like that. So basically, you can get multiple VMs running at the same time, but you don't get the best performance.
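To make the idea concrete, here is a toy sketch of what that paravirtualized path looks like: the guest-side virtual driver serializes abstract, device-independent drawing commands, and the host driver decodes and replays them against whatever real GPU is present. All names here are hypothetical; real protocols such as virtio-gpu are binary and far more involved.

```python
import json

def guest_encode(commands):
    """Guest-side virtual driver: serialize abstract, device-independent
    drawing commands into a byte stream for transport to the host."""
    return json.dumps(commands).encode()

def host_decode(stream):
    """Host-side driver: decode the stream so the commands can be replayed
    against the real GPU (or a software renderer)."""
    return json.loads(stream.decode())

cmds = [
    {"op": "clear", "color": [0, 0, 0]},
    {"op": "draw_rect", "x": 10, "y": 10, "w": 64, "h": 64},
]
# The host sees the same abstract commands the guest emitted.
assert host_decode(guest_encode(cmds)) == cmds
```

The point of the exercise is the trade-off from the talk: the wire format carries generic commands, so any host GPU can execute them, but vendor-specific card features have no way to appear in the stream.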
Now, when you run multiple virtual machines at the same time, there is another difficult problem, which is that you have to schedule rendering on a device, the GPU, which is not necessarily designed for that. Context switching on a GPU is problematic. Resource allocation is also an issue, because now you don't have a dedicated card for one guest workload; you have to share it. And the GPU capabilities are hard to expose, as I mentioned. Migration is tentative at the moment, part of the problem being how you feed back the screen output: if you switch from one host to the next, then the graphical output switches from one machine to the other, so you lose some of the benefits. And remote access is a problem in this case, namely where you put the remote access component. So the next step, in order to get better performance, is to do GPU device assignment. In that case, you pass the graphics API calls directly to a vendor driver that resides in the guest, and it talks directly to the hardware through vendor-specific commands. So now you're poking a hole through the virtualization layer. That hole, well, the safety issues can be somewhat mitigated, but that's really how you do it. The pros of this approach are that you get near-native performance in the best case, since you're talking to the card directly. You have good compatibility with new features, because now you can see the card, so you know what kind of features it exposes and can therefore switch to a more modern rendering API, because you use the vendor driver in effect. You get the latest APIs; you get the best graphics features. The cons are that, for instance, the boot console is problematic in this case, because you're no longer connected directly to the hardware, so basically you get a black screen. The setup is not very flexible, because the card is attached to a specific VM. And migration is a real problem in that scenario.
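As a rough illustration of what device assignment involves on the host side, here is a sketch of the sysfs steps used to hand a PCI GPU over to the vfio-pci driver. The `driver_override` and `bind` files are real Linux sysfs interfaces, but this helper is my own hypothetical wrapper; on real hardware it needs root and an IOMMU enabled in the kernel.

```python
import os

def bind_to_vfio(bdf, sysfs="/sys"):
    """Hand the PCI device at address `bdf` (e.g. "0000:01:00.0") over to
    vfio-pci, mimicking the usual echo-into-sysfs procedure. Hypothetical
    sketch; requires root and an enabled IOMMU on a real system."""
    dev = os.path.join(sysfs, "bus/pci/devices", bdf)
    unbind = os.path.join(dev, "driver", "unbind")
    if os.path.exists(unbind):
        with open(unbind, "w") as f:      # detach from the current driver
            f.write(bdf)
    with open(os.path.join(dev, "driver_override"), "w") as f:
        f.write("vfio-pci")               # make the next probe pick vfio-pci
    with open(os.path.join(sysfs, "bus/pci/drivers/vfio-pci/bind"), "w") as f:
        f.write(bdf)                      # trigger the (re)bind
```

Once bound, the device can be given to a VM (for instance with QEMU's `-device vfio-pci,host=...`), which is exactly the "hole through the virtualization layer" described above.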
No sharing either; I'll explain in a minute why not. So the next step: to see the difference between these two cases, I have to switch back and forth, because the difference is not completely obvious. The main difference is how we talk to the hardware at the lowest level. There, we are going to find a way to share the same GPU across multiple VMs. So in the single-GPU case, that's the vGPU case, basically: we split a GPU between multiple VMs, and now that's a GPU that knows how to be split and knows how to support virtualization. So we get some compatibility, the same benefits as before, but you get multiple VMs, and that's the big benefit here. Now the con is that you still don't have much flexibility. The hardware requirements are high, because you need a lot of extra GPU power, since you are going to split it across machines. And in general, it's done in a static way. Migration and features like that are still problems. Now it turns out that when you do this kind of device assignment, it might seem very simple: I take a GPU, assign it to a VM, what could go wrong, right? Well, it turns out it's not that easy. You need hardware support, not just in the GPU itself, but in the chipset as well, because you need some kind of isolation between the GPU and the memory it can touch. So you need to be able to enable the IOMMU, for instance. And then there are a number of hardware quirks that you need to be able to deal with. In terms of topology, things are not completely isolated the way they would be when you, for instance, virtualize a CPU. This is in a sense relatively similar to what happens with NUMA when you run virtual machines with CPUs. What happens is that your devices might have some sort of arbitrary topology in the physical world, so you can't just create your IOMMU groups and split them any way you want. As long as you have a single GPU, it looks fine.
It's basically a straight line, so no real issue there. And if you have two GPUs that are connected to the same PCIe bridge, they can basically talk to one another; it's all within a given peer-to-peer domain, so the two GPUs can see one another, talk to one another, et cetera. If you have multiple GPUs in a scenario like this, you're still going through the same bridge. But if you have to go through inter-processor communication, with QPI for instance, QuickPath Interconnect, then it becomes much slower. So that means it's not transparent: the way you lay out your GPUs is not completely transparent, and you have to deal with that. Which means that, from a management point of view, device assignment is not something that is as simple as, for instance, allocating memory or allocating CPUs. Essentially, at the moment, it amounts to pinning devices to hosts. And this means it's not really designed for, or not very applicable to, a cloud style of solution at the moment. And I'm talking here about the case where devices are not really shared. So in order to mitigate that, we want to use the mdev framework. In that case, Linux gives you an interface to these mediated devices, mdev, that lets you get the kind of acceleration you need in virtual machines: guests can still talk directly to the memory of the device, but they can't overwrite some other location. Now, mdev is a really complicated topic; I can't cover it today, but I invite you to look up this presentation if you want to see exactly what you get out of it. The problem is that it exposes an interface that is not very easy to manage at the moment. There are things like type descriptions, number of instances, et cetera; all this is at the moment completely vendor-specific, so you sort of have to have your management software learn the details of what's inside.
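For reference, the mdev interface just mentioned is exposed through sysfs: each parent device advertises vendor-specific types under `mdev_supported_types`, and writing a UUID into a type's `create` node instantiates a mediated device. The helpers below are hypothetical sketches of that interface, assuming the standard sysfs layout.

```python
import os
import uuid

def list_mdev_types(parent, sysfs="/sys"):
    """List the mediated-device types a parent device advertises, with the
    number of instances still available for each (type names are
    vendor-specific, which is the management headache described above)."""
    base = os.path.join(sysfs, "class/mdev_bus", parent, "mdev_supported_types")
    types = {}
    for t in sorted(os.listdir(base)):
        with open(os.path.join(base, t, "available_instances")) as f:
            types[t] = int(f.read())
    return types

def create_mdev(parent, mdev_type, sysfs="/sys"):
    """Instantiate one mediated device by writing a fresh UUID into the
    chosen type's `create` node; returns the UUID for later use."""
    dev_uuid = str(uuid.uuid4())
    path = os.path.join(sysfs, "class/mdev_bus", parent,
                        "mdev_supported_types", mdev_type, "create")
    with open(path, "w") as f:
        f.write(dev_uuid)
    return dev_uuid
```

The returned UUID is what you would then pass to the hypervisor to attach the vGPU instance to a VM; everything beyond these generic nodes (what the type actually provides, licensing, quirks) is vendor-defined.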
And there are a number of things that are not exposed, like device quirks, driver limitations, whether it can support multiple instances or not for a single VM, these kinds of things. All the stuff related to licensing is hard to manage that way as well. So I've shown a number of different ways to split your GPU. If we try to compare them, well, we see that a native GPU will give you the best software availability and performance, for instance, while if you emulate a GPU, performance takes a big hit and scalability takes a big hit, but you gain in security and flexibility. And if I keep going like this and try to compare all the various cases, I see that there is no one big win. It's not like there is one that is the best; it's really a series of trade-offs. This is why it's still very much work in progress for everybody. For instance, in terms of features, this is a relatively recent announcement of live migration for NVIDIA GPUs. Sorry, the one before there; this is a demo. I was telling you about the problem with migration, where you see that it switches from one monitor to the next. If you have to run across a data center in order to see the output of your machine, that's not very convenient. Another thing that shows that this is still very much work in progress is that all the resource allocation and sharing is still very much on a per-vendor basis. The choices are not completely settled, which means that from a management point of view, there is no single way to do it. So, to take a Facebook analogy, it's complicated. Now, let's add remote access on top of this. SPICE is Red Hat's solution for remote access, and it tries to address some of the issues we've been talking about here, notably things like migration, et cetera. Most virtual machines are used remotely, so when you connect to them in a data center, you basically get a stream of video updates. Now, this enables a number of use cases like cloud gaming.
You can stream high-quality 3D to your phone, for instance, because H.264 and streams like it are completely asymmetric: it's much easier to decode than to encode, so you can have a big heavy machine that does the encoding in the cloud, and then a lightweight machine on the other side for visualization. This means you can bring your work environment to any device. It has many nice features. For KVM users, the solution to get this kind of thing is SPICE, and we're working on making SPICE streaming-capable like this. To explain a little bit the evolution compared to what exists today: you can do remote access in multiple ways, either from the guest or from the host, and the historical solutions send 2D commands, basically. They intercept drawing commands somewhere, send them over the wire, and render them on the other side. You can intercept them in the guest, and that would typically be Microsoft Remote Desktop or that kind of solution. Or you can have, in the case of SPICE, the SPICE server do it for you, in which case you get the remote console as well for your VMs, et cetera. For video streaming, it's more or less the same idea: you can do it from within the guest or from within the host. I'm going to show the trade-offs in a minute. What happens when we do this is that instead of sending graphics commands in a device-specific format, like we did before, we switch to a network kind of graphics format. If we do that from the guest, for instance if you were using terminal services, you would use something like this, where there is some kind of software rendering stack in the guest, which then sends the data over the network; you're basically using the virtual network the normal way. Now, this is widely available. Most operating systems have it, and it may be as simple as X11 in the case of Linux.
It's transparent for most users, but of course it only works after the guest has booted, so you have no console in that case, and it does require that the guest network works. Why is that important? Because in some cases you may want a guest that you can access remotely, graphically, to do something, but whose network does not go outside: you want the network to stay constrained inside the VM. And that's why we also have host-based remote access. In the case of SPICE, we have a SPICE server component that does the encoding, and the way this works is that we have a driver in the VM, in KVM basically, that intercepts 2D commands and behaves mostly like a VGA device, then transmits that to the SPICE server; the SPICE server sends it over the network, and you have a SPICE client on the other side that renders it as graphics. That works pretty well for 2D, but it doesn't work very well for 3D. If we want to do 3D, we do need some kind of streaming, something that looks more like an encoder that generates video on the fly. Now, if we want to do that with vendors like NVIDIA, they really insist on doing it the same way as is being done for virtual GPUs. Remember that in the vGPU case, what we had was that the GPU driver resides in the guest; that's why you need to do the encoding there as well. So that means you're basically encoding H.264 from your frame buffer within the guest, which means that you need, in the guest, some kind of agent that sends the data to the other side over the network. This is a little bit complicated to set up, and it has the same kind of drawbacks as before, because you have to wait until the guest has booted, and you need an alternate solution pre-boot that shows you the console before it boots.
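A quick back-of-the-envelope calculation shows why the agent encodes H.264 instead of shipping raw frame buffers over the network; the 8 Mbit/s bitrate below is my own assumption for reasonable 1080p60 quality, not a figure from the talk.

```python
# Raw 1080p60 RGBA frame buffers versus an H.264 stream of the same content.
width, height, bytes_per_pixel, fps = 1920, 1080, 4, 60

raw_bps = width * height * bytes_per_pixel * fps * 8  # bits/s, uncompressed
h264_bps = 8_000_000                                  # assumed encoder target

print(f"raw: {raw_bps / 1e9:.2f} Gbit/s")
print(f"H.264: {h264_bps / 1e6:.0f} Mbit/s (~{raw_bps // h264_bps}x smaller)")
```

Roughly 4 Gbit/s of raw pixels versus single-digit Mbit/s after encoding, which is why the heavy encode happens near the GPU and only the cheap decode happens on the thin client.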
And finally, something that is not done yet, but that I hope will happen within maybe the coming year or so, is to do host-side streaming, still using hardware acceleration. In that case, we send the graphics down there, and we still have the splitting capabilities that we talked about before with the GPUs. Now we're using the vGPU driver here, and we do the encoding host-side, meaning that the whole network stack now happens on the host. If you do it that way, the next problem you have is that you need to find a way to share the features: share the encoder, share the frame buffers, et cetera. Why is this a problem? Because your graphics API is doing the rendering on the guest side, right? Basically, all the rendering is done in the guest, and you need to find a way to pass the guest data, preferably without copying it, in a way that the host driver can see it. There are some developments going on in the Linux kernel at the moment to try to enable that, but it's a relatively complicated problem. You have to realize that the rendering may happen at 60 or 120 frames per second, so you need to be able to pass the data from guest user space down to the host kernel without copying it, because copying is pretty expensive. And I'm actually done, and I spoke too fast. Sorry, I'm out of time; I removed a number of slides. Thanks for the presentation. I wanted to ask: in the Kubernetes community, whenever you request a GPU, you get the whole GPU, and sometimes you might not utilize the GPU fully, right? So the work you're doing here with splitting GPUs is very interesting. This vGPU, if I wanted to try it out, where would I find it? Okay, so some of the stuff that I showed is actually working today.
When you say you want to try it out, under which kind of conditions? Like, I want to download it and install it on a three-node or five-node Linux cluster, for experimental purposes, not production use, right? Yeah, this is probably going to be experimental, because part of the limitation of utilizing GPUs in Kubernetes is not being able to split the resources: if you want to request how much CPU you want, you can do that, but apparently you can't do it with GPUs. So at the moment, this is still very much work in progress. If you want to have vGPUs... Oh yeah, okay. So NVIDIA vGPU is supported in RHEL 7.5. If you want to play with it, I think you could with CentOS, but you have to buy the GRID software from NVIDIA. Yeah, but as I understand it, today it's supported in one-to-one configurations, so it means it can't split, I think. No, it supports vGPU, but not with SPICE yet, that's all. Okay. Yep. So it's supported with RHEL 7.5 and with RHV 4.2, but I don't think you want to set up RHV or oVirt, so you just have to get the GRID 6 software from NVIDIA. Is there an open source alternative, like OpenCL, right now? No; at the moment, as far as I understand, none of the vGPU capabilities that NVIDIA offers are supported by Nouveau, for instance, if that's what you're asking. Like, yeah, OpenCL, as an alternative to NVIDIA. For OpenCL, okay, I would need to check, because apparently my understanding and Karen's are different. My understanding was that at the moment, NVIDIA was still offering only one-to-one, basically whole-card configurations, and that the rest was still unsupported. Karen says otherwise, so I think I'm probably wrong on this. And maybe that is only for remote viewing. For compute-only in full open source, I don't think there is anything that works, if you want it fully open source.
So Intel is also working on vGPU, with KVMGT, and that would be fully open source. Okay, there's another question in the front. I understand why the streaming agent approach is not ideal, but are there any open source streaming agents I can use right now with oVirt/RHV, or with KVM and libvirt? So the streaming agent itself is open source, and it's built around plugins, which we are trying to open source because they use APIs that are normally public. There is no reason we could not open source them, but it's still under discussion whether we actually can or not. Now, the fully open source one does not use hardware acceleration; that's the problem. It basically does MJPEG encoding and things like that, which is software only. So it's good for testing purposes, but it won't give you the hardware acceleration: it will give you hardware-accelerated rendering, but not encoding. Okay, I mean, I know that SPICE does MJPEG; is that part of the SPICE agents and all? Yes, that's what I'm talking about; it's built into the SPICE guest agent right now. And there are others; you can of course use other solutions like VNC or like GX or whatever. If you use VNC, for instance, you will have very fast rendering on the card, but you will get very slow transport over the network. Okay, and there are lots of proprietary streaming agents, though, I guess. Yes; GX is an example. Okay. Any other questions? Thank you. There is another one. So assuming there are three hardware vendors, you know, NVIDIA, AMD and Intel: do you have to buy professional graphics adapters for virtual GPU, and for Intel, do you have to have like a Xeon and all? Okay, so as I mentioned in one of the earlier slides, in order for this to work, you really need good support for the IOMMU. From the card side, it really depends on the vendor; each vendor has a different approach.
So for NVIDIA, you typically want recent cards; to get vGPU, you really need cards that are dedicated to this. For historical reasons, AMD had IOMMU support baked in earlier. It doesn't necessarily mean it's better supported in Linux, but at least from a hardware point of view, older cards have better chances of being isolated; it doesn't mean the driver support is good. As for Intel, Intel works in a very open source way, so their approach, from a software support point of view, is nice to have. The problem is that the performance, at the moment, is not the best among the vendors. Yes, okay, but are those hardware features only in the Quadros rather than the GeForces, and only in the Radeon Pros rather than the Radeons? Yeah, so as Karen said, it's not just that; it's specific software. At least in the case of NVIDIA, it's a specific software license, the GRID software. Basically, in order to activate it, you have to talk to a GRID license server that tells the software you have a license for this or that configuration. So it's not just the hardware; it's also a software license. And there is also a specific GRID SDK that lets you take advantage of some of the features. For instance, I mentioned that we're using public APIs for the streaming agent; those are part of this GRID SDK: APIs for how to capture the frame buffer, how to stream it and encode it, things like that. Looks like we are out of questions. Thanks a lot for your attention. I'm sorry for talking so fast.