Hi. Good morning from London. My name is Daniel Stone and I'm here to talk to you this morning about the state of upstream virtualised graphics. I'm the graphics lead at Collabora. We're an open source consultancy, and a lot of our work has been with projects like the upstream Linux kernel, Mesa and many others, with a specific focus on virtualisation, including in the automotive industry. So firstly I'd just like to give a general recap about virtualisation and graphics and the context in which we've done a lot of this work: the where, the why and the how. I'd like to go through the demand, the use cases, the three ways in which we generally achieve graphics virtualisation, the way it's implemented today, and what we're looking to in the future so we can have better, safer, more performant and more featureful virtualised graphics. At the end I'd also like to give a bit of a look forward to neural networks and machine learning acceleration, since that is a very important use case, and one which is surprisingly close to GPUs and how we accelerate them, just with a different problem field. So where do we see virtualised graphics? The answer to that one is pretty simple: it is everywhere. Graphics isn't just the immediate things that are in front of your face anymore, which sounds a bit odd, but consider where graphics in general started. It was very simple: an arcade-style environment where you had one thing being rendered directly to the screen, and that was the only use we had for graphics acceleration and the only context in which any of it took place. But now, if you consider the workloads that we have and the way it's all put together, it's not just that render-and-present directly to the screen.
Not only is it a much more complex, deeper pipeline with a lot of intermediate processing, we also have things like machine learning, and we have some offline workloads where you might be either streaming or generating content on the fly. You might be using general-purpose GPU compute for things like computer vision, or any kind of analysis up to and including ADAS. So it's really shifted throughout the whole stack. On the client and on single-purpose devices, think of your phone or your laptop, we have this desire to have single silicon but with multiple assurance boundaries. That's something we're familiar with from virtualization as a whole and from security techniques like sandboxing, where even though we're running on a single platform, we really want very discrete parts of that platform with very different properties and strong isolation from each other. For instance, you might have a trusted domain that should be somehow privileged alongside the QM domain, or you might have a very secure trusted base that you want to ensure is always rock solid, while still being able to run arbitrary general-purpose applications which you trust much less. In the data center, we're seeing a huge demand for virtualization so you can pool your resources and have greater efficiency. The cloud as a whole has really been driven by virtualization, and it's effectively about commoditizing. The cloud has commoditized CPU time, it's commoditized storage, it's commoditized network access, but we still have GPUs as this kind of special thing which you need to pay more for, and since they're becoming so indispensable it's really important to be able to commoditize them as well. And this goes out to the edge, where you want to make that available as a kind of bridge between your big data center cloud and your smaller, more focused clients.
Our other kind of hybrid model is things like streaming, where you might have a relatively low-capability client, be it because of cost limitations or thermal limitations or anything like this, and you want to do your workloads, your computation and your rendering on some bigger, more centralized machines, then use the fact that network access has scaled to some extent more than client capability to deliver that down to the client. That is actually quite a powerful tool for helping to bridge the digital divide and bring everyone up to the same baseline level of capability, regardless of the power of the thing you have in your hand. So we know we need it and we know it's been done. Let's take a look at how it's been done, because there are three very distinct ways in which you can approach virtualized graphics. The trade-offs also go in three directions: there are different levels of security, different levels of portability and different levels of performance, and each of the three approaches we'll be looking at makes those trade-offs differently. So which approach is best for you depends on how you weigh those trade-offs as well. The first approach is direct hardware virtualization. This is kind of similar to CPU virtualization, in that the hardware essentially gives you a segmented view of a part of the actual GPU hardware, and this is going to be isolated and restricted in some way. It's the thing that gives you the highest possible performance. But unlike CPU virtualization, there's no standard for it, so we don't have something similar to the x86 and Arm virtualization extensions for GPUs.
Though most vendors have a way to do direct hardware virtualization, it is completely vendor-specific, and for many vendors this is a kind of value-add which is only available in certain SKUs or only available with certain licenses, and so on. So all in all, that gives you the highest performance but the lowest portability, because you are tied into this particular vendor stack and possibly a particular segment of the GPUs they offer. The security is also a complete unknown; it really depends on the vendor. Some of them are aggressively isolated at the hardware level and very secure, some of them almost not at all. At the complete other end of the spectrum we have API virtualization. The only standards we do have in graphics that are completely vendor-independent are high-level, user-facing APIs: OpenGL and OpenGL ES, where I'll just use OpenGL to refer to the two interchangeably, and Vulkan, things like that. The approach to virtualizing these is that you have a shim in the guest which captures the API calls, transports them down between the domains, and replays them inside the host. This is extremely portable because everyone implements those APIs, so it's a completely vendor-independent solution which requires no hardware support and no specific software enablement from your vendor driver. But the performance ceiling, the highest possible performance you can achieve, is not close to the direct hardware virtualization approach. It also has high processing complexity within the host, so possibly one of your more trusted domains is actually executing quite a complex amount of code. There is a midpoint approach, which we call the mediated approach.
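The capture-transport-replay shim behind API virtualization can be sketched very roughly in a few lines. This is a toy model, not the real VirtIO-GPU wire protocol; all class and method names here are invented for illustration, and a plain list stands in for the transport queue between domains.

```python
import json

class GuestShim:
    """Guest side: records API calls instead of executing them."""
    def __init__(self, transport):
        self.transport = transport

    def call(self, name, *args):
        # Serialize the call and push it across the domain boundary.
        self.transport.append(json.dumps({"call": name, "args": args}))

class HostReplayer:
    """Host side: replays captured calls against the real driver."""
    def __init__(self, driver):
        self.driver = driver

    def replay(self, transport):
        for packet in transport:
            msg = json.loads(packet)
            getattr(self.driver, msg["call"])(*msg["args"])

class FakeDriver:
    """Stands in for the host GL driver; just logs what it executes."""
    def __init__(self):
        self.log = []
    def enable_blend(self):
        self.log.append("enable_blend")
    def bind_texture(self, unit, tex):
        self.log.append(f"bind_texture({unit}, {tex})")
    def draw(self):
        self.log.append("draw")

transport = []                       # stands in for the virtio queue
shim = GuestShim(transport)
shim.call("enable_blend")
shim.call("bind_texture", 0, 7)
shim.call("draw")

driver = FakeDriver()
HostReplayer(driver).replay(transport)
print(driver.log)                    # every guest call re-executed on the host
```

The key property, as described above, is that nothing vendor-specific crosses the boundary: only standard API calls do.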
Some vendors, though not too many, offer this as a kind of hybrid between both worlds, where they expose to the guest something that looks very close to a direct hardware interface, but it does require specific support within the guest and usually some specific support within the host as well. It's kind of a proxy approach to the hardware. This offers quite high performance, and the isolation is medium: it's not fully direct within the hardware, and we do still have that software intercept. So let's take a closer look at how we do API virtualization today, because this is one of the most widespread approaches, and again the one that's the most standard and the most portable. VirtIO, which we all know as the standard from OASIS, is a completely vendor-neutral solution, and this includes VirtIO-GPU as well as something called VirGL. I'll be talking today specifically about the Mesa and Linux implementation of this. There are other implementations inside other operating systems, hypervisors and so on, but since the Mesa and Linux implementations are the open source ones available to us, that's what I'll focus on. As I said, conceptually all these do is capture the OpenGL or OpenGL ES workload from the client, transport it down to the host, and then replay it on the host side. It's kind of an OpenGL driver in reverse. What do I mean by that? Look at how GL works, both from the application perspective and the driver perspective. Every frame, the application has to emit all of its state one by one. It has to tell you which textures it's using, how it wants to blend, which programs it wants to use, and the geometry of the draw call it's going to make, and then push that draw call through. This happens every single frame. So let's take an example here: the application has three textures.
It wants to use them all within a draw. It wants blending enabled, so you can have some nice alpha transparency. It wants to use a depth buffer. Every frame, the application will tell GL: enable blending, bind this texture to texture unit zero, bind this other texture to texture unit one, bind texture two, make sure depth is enabled, and now draw. For the second draw it wants to do this frame, it doesn't want to use blending. So: disable blending, disable the second texture unit, and now draw again. And this happens every single frame. It's like OpenGL has no kind of persistent memory; you have to tell it over and over exactly what you want to do, and it ends up being very verbose. This really doesn't map particularly well to the hardware we have, which fundamentally wants to be handed its own internal, pre-compiled state rather than a direct piecemeal interface. So we have a kind of intermediate cache, where the driver looks at all of the state the application has requested and compiles it into a format which is useful for the driver and the hardware. To make this efficient as an OpenGL driver, we have a state cache: we look at all of the state, and if we've seen it before, if we've compiled it before, then we have a little look-aside cache where we can just pull out the compiled state we've already generated and submit that to the hardware. This is what every single driver will do. Mesa does it, and has a very well-optimized implementation, and any vendor driver which is even remotely performant has to have this kind of cache, just because OpenGL is so verbose. So when I said we essentially implement it in reverse for a virtualized implementation: when we're using this API virtualization approach, we take everything the app has told us about its state.
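The look-aside state cache just described can be modelled in miniature. This is purely illustrative (real drivers like Mesa key on far more state than this), but it shows the essential trick: the app re-emits the same verbose state every frame, and the driver only pays the compile cost the first time a given state vector is seen.

```python
class TinyGL:
    """Toy GL driver with a look-aside cache of compiled state."""
    def __init__(self):
        self.state = {"blend": False, "depth": False, "textures": ()}
        self.cache = {}            # state key -> compiled hardware blob
        self.compile_count = 0

    def enable(self, cap):
        self.state[cap] = True
    def disable(self, cap):
        self.state[cap] = False
    def bind_textures(self, *texs):
        self.state["textures"] = texs

    def draw(self):
        # Hash the accumulated state; compile only on a cache miss.
        key = (self.state["blend"], self.state["depth"], self.state["textures"])
        blob = self.cache.get(key)
        if blob is None:
            self.compile_count += 1        # the expensive part
            blob = f"compiled{self.compile_count}"
            self.cache[key] = blob
        return blob                        # submitted to the hardware

gl = TinyGL()
for frame in range(100):                   # the same two draws, every frame
    gl.enable("blend"); gl.enable("depth"); gl.bind_textures(0, 1, 2)
    first = gl.draw()
    gl.disable("blend"); gl.bind_textures(0, 1)
    second = gl.draw()

print(gl.compile_count)                    # only 2 compiles for 200 draws
```

Two hundred verbose draws collapse into two compiled hardware states, which is exactly why every performant GL driver carries this machinery.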
We have our own little compiled state, we send that through to the host, and then the host explodes this compiled state back out. Because we only have an OpenGL driver on the host, we have to take our small, compiled, optimized state and, on the host side, tell the host GL driver that interfaces with the hardware: enable blending, bind texture 0, bind texture 1, bind texture 2, enable the depth test, and draw. And you can see quite clearly how this is a limiting factor for the API virtualization approach. In the guest we have to do this state tracking and state caching, even in a virtualized implementation, to achieve any kind of performance, and then the host also has to do it, because that's what interfaces with the hardware. So we're already paying twice the overhead of the state tracking, the state cache, and just emitting the draw calls. It's a lot of CPU time just to translate between OpenGL's notion of a single piece of global state for the whole application and the hardware's model, where it really wants compiled states that are a close representation of the hardware configuration, with the client just referencing that already-compiled state. But despite the fact that this overhead is our efficiency ceiling, it's still surprisingly stable and surprisingly performant. We have at least tens of millions of users that we know about who are using this for real workloads, even to play games. We have a very high level of API conformance, and part of this has been driven by people who want to play games in virtualized environments. So despite its limitations, it does work well today. But we're also looking to the future, and we do think we can do better. The way we think we can do better is by having standardized virtualization using Vulkan.
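The "driver in reverse" step, where the host has to re-expand the guest's compact compiled state into verbose GL calls, can be sketched like this. The function and field names are invented for illustration; the point is only the shape of the expansion.

```python
def explode_state(state, host_gl_log):
    """Expand a compact compiled-state blob back into verbose GL calls,
    the way a VirGL-style host must drive its local OpenGL driver."""
    calls = []
    calls.append("glEnable(BLEND)" if state["blend"] else "glDisable(BLEND)")
    for unit, tex in enumerate(state["textures"]):
        calls.append(f"glBindTexture(unit={unit}, tex={tex})")
    calls.append("glEnable(DEPTH_TEST)" if state["depth"]
                 else "glDisable(DEPTH_TEST)")
    calls.append("glDraw()")
    host_gl_log.extend(calls)          # each call hits the host GL driver
    return calls

# A small compiled blob crossing the boundary fans back out into six
# individual GL calls, which the host driver then state-tracks *again*.
compiled = {"blend": True, "depth": True, "textures": [5, 6, 7]}
host_gl = []
explode_state(compiled, host_gl)
print(host_gl)
```

This fan-out, repeated per draw per frame, is where the doubled CPU overhead described above comes from.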
Much like we have VirGL as part of VirtIO-GPU, we also have a solution called Venus, which is VirGL but for Vulkan. This is a really good match, because Vulkan conceptually works much closer to how modern hardware works. Modern hardware wants you to take a bundle of state, compile all of it up front, and have a state object that you can reference, pointing the driver at that state object and saying: OK, use this, with these fairly limited parameters. Vulkan is a very thin driver in that sense. OpenGL drivers have a lot of overhead; they have all of the state tracking and the cache mechanisms I was talking about, to deal with the fact that the hardware and the model of the GL API are different. Vulkan is based around having very thin drivers, at the expense of having the apps do a little more tracking and be a little bit smarter about how they work. One of the nice things about having such a thin API is that we can validate, much more easily than with GL, that people are using the Vulkan API correctly. A lot of work has gone into a standardized Vulkan validation layer from Khronos, which allows you to validate that this really is correct Vulkan API usage. So it's much easier to reason about how Vulkan is being used than GL, where we have this huge surface area with a lot of interactions, and validating that is really expensive just because it's such a deep, large software layer within the driver. The way Vulkan drivers work is just one more step. Using my previous example, where we have two different kinds of draws, the application knows it will have these two kinds. So when it starts up, it says: I want a representation of the hardware state for when blending is enabled, I have three textures and a depth test is enabled; compile that. Then it says: I want another state with no blending and only two textures; compile that too.
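The pipeline-object model just described can be sketched in the same toy style. This loosely mirrors the shape of `vkCreateGraphicsPipelines` and `vkCmdBindPipeline`, but the Python names are purely illustrative.

```python
class TinyVulkanDevice:
    """Toy model of Vulkan's up-front pipeline compilation."""
    def __init__(self):
        self.pipelines = {}
        self.submitted = []

    def create_pipeline(self, blend, depth, texture_count):
        # One up-front compile of the whole state bundle.
        handle = len(self.pipelines)
        self.pipelines[handle] = (blend, depth, texture_count)
        return handle

    def draw(self, pipeline, params):
        # Per-draw cost: a handle plus a few parameters, nothing more.
        self.submitted.append((pipeline, params))

dev = TinyVulkanDevice()
# Startup: the app declares both state bundles it will ever need.
opaque_blended = dev.create_pipeline(blend=True, depth=True, texture_count=3)
unblended      = dev.create_pipeline(blend=False, depth=True, texture_count=2)

# Per frame: no state re-emission, just "use that state over there".
for frame in range(3):
    dev.draw(opaque_blended, params={"vertices": 36})
    dev.draw(unblended, params={"vertices": 12})

print(len(dev.pipelines), len(dev.submitted))   # 2 pipelines, 6 draws
```

Compare this with the GL sketch earlier: the compile step has moved out of the per-frame path entirely, which is what makes the driver, and therefore the virtualization proxy, so thin.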
Then every time it draws, it just tells the hardware: use this first piece of state, here are the parameters for this draw only, do that. So instead of repeated calls over and over, where we tell the driver every single bit of state every time, we just say: that state over there, use that please. And when we layer virtualization on top, Venus, which again is part of VirtIO, an open, vendor-neutral standard, aims to be quite a thin proxy for Vulkan. Unlike with OpenGL, we don't need to do a lot of the state-caching tricks, and we don't need to pretend that the hardware looks like something it's not. We have a very thin layer where the hardware maps very well to the Vulkan driver on the host, and that maps almost directly to what we implement in the guest. So we remove a lot of this complexity: the application emits its Vulkan, we push that down to the host, the host emits its Vulkan, and that looks very close to what the hardware already has. And using the Vulkan validation layers to ensure correct API usage, we can put that inside the host, and the host can ensure that the guest's use of the Vulkan API is correct. So we can have a healthy bit of scepticism: another point where the host doesn't have to trust that the guest is doing the right thing, because it's very easy to validate Vulkan usage and ensure that it's being used correctly. We do have implementations of Venus available in Mesa, in the Linux kernel, in the crosvm hypervisor, and also, I think perhaps not merged yet, inside the QEMU hypervisor as well. But as I said, anyone is free to implement Venus; it's a completely royalty-free, vendor-neutral standard. But, and there's always a but, we still need to support OpenGL. So even though we have this great, glorious future where we're using Vulkan as an efficient virtualization layer, how do we support OpenGL on top of that?
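The host-side validation gate described above can be sketched as a thin proxy that checks each guest command before the real driver sees it. This is a toy illustration of the idea only; in reality the checking is done by the Khronos Vulkan validation layers, not hand-rolled code like this, and all the names below are invented.

```python
class ValidatingHost:
    """Toy Venus-style host: forward guest commands, but refuse
    anything that isn't valid API usage (e.g. bogus handles)."""
    def __init__(self):
        self.executed = []
        self.created = set()

    def submit(self, cmd, handle=None):
        # Validation step: reject use of pipeline handles never created.
        if cmd == "bind_pipeline" and handle not in self.created:
            raise ValueError(f"invalid pipeline handle {handle}")
        if cmd == "create_pipeline":
            self.created.add(handle)
        self.executed.append((cmd, handle))   # reaches the real driver

host = ValidatingHost()
host.submit("create_pipeline", 1)
host.submit("bind_pipeline", 1)          # fine: the handle exists

try:
    host.submit("bind_pipeline", 99)     # buggy or malicious guest
except ValueError as e:
    rejected = str(e)

print(len(host.executed), rejected)
```

Because Vulkan's surface area is so much smaller and more explicit than GL's, this kind of check is cheap, which is what lets the host treat the guest with scepticism rather than trust.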
One of the good things about Vulkan being a very close match to the hardware and being very low-level is that it's a good host for a lot of things. You can write your game engine on top of Vulkan, you can write your GPGPU compute directly on top of Vulkan, but you can also use it as a sort of lowest common denominator. You can combine both of these approaches: Vulkan is the thing we support in the host and over the virtualization layer, and then within the guest, on top of the Vulkan implementation, we add a layered implementation where OpenGL is translated down into Vulkan. This really contains all of the complexity of OpenGL within the guest. If we look at what happens when we combine these two: the application is still doing this very verbose, piece-by-piece state emission every frame, but inside the guest driver we do the state tracking and the state compilation, and we cache these Vulkan state objects. The host only ever sees the Vulkan state objects; it doesn't see the full frame-by-frame complexity that comes from tracking all of the OpenGL state. All the host sees is that the guest is using Vulkan, and the fact that the application is using OpenGL is just a detail that the host doesn't even know about. This is also available today. Within Mesa we have, again, a vendor-neutral project called Zink. It was originally a Collabora project that we developed specifically for this kind of virtualization use case, where we wanted to contain OpenGL within the guest. But it's had a lot of traction and a lot of adoption across the industry. It's been pushed specifically for game workloads, where there's been some really intensive performance work, and sometimes it even outperforms a native host GL driver, even though it's a layered implementation.
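Combining the two earlier sketches gives a toy picture of this Zink-style layering: a GL front end in the guest absorbs the verbose per-frame state, compiles it into Vulkan-like pipeline objects, and the host only ever sees pipeline handles and draws. All names are illustrative, not Zink's actual internals.

```python
class VulkanOnlyHost:
    """The host side: it only ever sees Vulkan-level commands."""
    def __init__(self):
        self.commands = []
    def create_pipeline(self, key):
        self.commands.append(("create_pipeline", key))
        return len(self.commands)           # opaque handle
    def draw(self, handle):
        self.commands.append(("draw", handle))

class LayeredGL:
    """Guest-side GL front end layered on the Vulkan-only host."""
    def __init__(self, host):
        self.host = host
        self.state = {"blend": False, "textures": ()}
        self.pipelines = {}                 # GL state key -> pipeline handle

    def enable_blend(self): self.state["blend"] = True
    def disable_blend(self): self.state["blend"] = False
    def bind_textures(self, *t): self.state["textures"] = t

    def draw(self):
        key = (self.state["blend"], self.state["textures"])
        handle = self.pipelines.get(key)
        if handle is None:                  # compile once, in the guest
            handle = self.host.create_pipeline(key)
            self.pipelines[key] = handle
        self.host.draw(handle)              # host sees only Vulkan

host = VulkanOnlyHost()
gl = LayeredGL(host)
for frame in range(10):                     # verbose GL, every frame...
    gl.enable_blend(); gl.bind_textures(0, 1, 2); gl.draw()
    gl.disable_blend(); gl.bind_textures(0, 1); gl.draw()

creates = [c for c in host.commands if c[0] == "create_pipeline"]
print(len(creates), len(host.commands))     # 2 creates, 22 commands total
```

Twenty verbose GL draws reach the host as just two pipeline creations plus twenty handle-referencing draws, with all of OpenGL's statefulness contained in the guest.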
And we're well on our way to having that be officially conformant as an OpenGL implementation through Khronos. There are other examples of this approach, like the ANGLE project from Google, which implements the same concept, because as an industry we're looking towards Vulkan being the single implementation that vendors provide for each piece of hardware, and sort of commoditizing OpenGL on top of that. So, I promised I wouldn't just speak about graphics. If we look at what this means for neural networks and machine learning acceleration and how we can achieve that, the landscape is quite different. In terms of neural network acceleration, it is a bit of a wild west right now. Everyone agrees at a high level that some of the frameworks and toolkits, TensorFlow with MLIR and PyTorch in particular, are sort of the standard. But in terms of how it's implemented, all of them get extended differently by vendors to support their hardware, because we have no low-level specification for how to deal with the hardware at a level lower than PyTorch or TensorFlow. There's nothing similar to OpenGL or Vulkan. And while you often want to design and iterate in frameworks like TensorFlow and PyTorch, they are quite big frameworks, quite heavy and involved, and you probably don't want to be running them all the time. There is a specification called ONNX, the Open Neural Network Exchange format, which you can use as a vendor-independent representation to convey your machine learning workloads. But still, there's definitely a gap in how we approach these devices: how we have a common implementation for them, common expectations of what they do, and portability between all the different devices.
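The idea behind an exchange format like ONNX, a vendor-independent description of the compute graph that any backend can load and execute, can be illustrated with a toy example. To be clear, this is not the real ONNX schema (ONNX uses protobuf with a defined operator set); it is a minimal JSON analogy of the same idea.

```python
import json

# A tiny network described as a vendor-neutral graph of operators.
graph = {
    "inputs": ["x"],
    "nodes": [
        {"op": "MatMul", "inputs": ["x", "W"], "output": "h"},
        {"op": "Relu",   "inputs": ["h"],      "output": "y"},
    ],
    "outputs": ["y"],
}
serialized = json.dumps(graph)         # portable between "vendors"

def run(serialized, x, W):
    """A trivial 'backend' interpreting the graph on plain Python lists."""
    g = json.loads(serialized)
    env = {"x": x, "W": W}
    for node in g["nodes"]:
        if node["op"] == "MatMul":     # here: dot product of two vectors
            a, b = (env[n] for n in node["inputs"])
            env[node["output"]] = sum(ai * bi for ai, bi in zip(a, b))
        elif node["op"] == "Relu":
            env[node["output"]] = max(0, env[node["inputs"][0]])
    return [env[name] for name in g["outputs"]]

print(run(serialized, x=[1.0, -2.0], W=[3.0, 1.0]))   # [1.0]
```

The portability comes from the fact that the description carries only operators and connections, never anything about the device that will execute it; any number of interchangeable backends could consume the same `serialized` blob.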
So, having seen that need, we do have a lot of work in flight. One of the projects we've been doing at Collabora is helping to enable some of the inference devices you see within SoCs. We've been enabling those within Mesa and the Linux kernel, and that work is available for upstream review and merging. There's also a framework within the Linux kernel called accel, which is just about to be merged, and this is the first time the kernel will have a generic framework for NPU- and TPU-type machine learning devices, similar to how we have DRM establishing common expectations for GPUs. One of the things we're hoping to achieve with this is to establish some fairly low-level expectations and baselines of how the hardware should function and what a hardware model looks like. We're definitely pointing to a future in which we can virtualize NPUs, similar to what we've done with GPUs: making them more universally accessible, commoditizing them for all sorts of different use cases, and making them very portable between vendors, so it's much easier to take your workloads and execute them in a data center, on the edge, or down on client devices, and just make them as available and easy to use as can be. So that's what I have for today in terms of the presentation. If you want to find out more about virtualization in particular, there are a lot of great talks from Jerry, who leads the AGL working group on virtualization. If there are any questions, I'd be happy to take any of those. But please feel free to go look at these open, vendor-independent implementations, which are available in open source projects like Mesa, the Linux kernel and the hypervisors, take them and use them. Thank you very much.