Hello, everyone. I'm Tina Zhang from the Intel virtualization team. In this session, my colleague Vivek and I are going to present a solution that combines a passthrough GPU with a para-virtualization-based display to achieve the goal of zero-copy scanout buffer sharing. Although the implementation is specific to GPU and display virtualization, the idea behind it is generic. We believe the solution can be leveraged by other kinds of passthrough devices that have a similar requirement to share DMA buffers across domains. Okay, then let's get started.

These are the disclaimers. Here is the agenda. We have four parts. I'll cover the first two, the motivation and background introduction as well as the architecture design. My colleague Vivek will present the implementation details and give a summary.

Let's start with the multi-GPU platform introduction. A multi-GPU platform basically means there is more than one GPU provided. If a platform has two GPUs, usually one of them can provide better performance but consumes more power, while the other one, which might be designed as an integrated GPU, may not be able to match the performance of the powerful one, but doesn't need as much power to do display or rendering. In this way, a multi-GPU platform gives users more choices for balancing power and performance.

There is a use case in the client virtualization field where, with multi-GPU support, a powerful passthrough GPU is provided to a virtual machine to accelerate 3D applications running within the virtualized environment, while the integrated GPU handles display. That way every application, no matter whether it runs in the guest or on the host, can be displayed on the monitor controlled by the host's integrated GPU. Such a system can provide good power and performance benefits for users.

Sharing guest scanout buffers between the passthrough GPU and the integrated GPU is one of the key requirements in such a system. A CPU copy might not be acceptable here, as it hurts the user experience too much, and an unnecessary GPU copy isn't good for system power saving. So sharing the scanout buffers is considered a must here. Since a scanout buffer is basically a kind of dma-buf, we can see that a cross-domain dma-buf sharing mechanism is needed in such a system.

But sharing a dma-buf owned by a passthrough device is challenging. First of all, the hypervisor has no visibility into a passthrough device's dma-buf resources. Another challenge is that a dma-buf might be located in the passthrough device's private local memory, which may not be accessible to other devices. So in short, due to the lack of virtualization knowledge, a passthrough device may not be able to share its dma-buf.

Since it might not be feasible for a passthrough device to export its dma-buf to another device or to the host, a para-virtualization-based dma-buf exporter is proposed. The essential idea of the PV-based dma-buf exporter is illustrated in this picture. The PV frontend works as the cross-domain dma-buf owner and has its dma-buf backed by guest pages. That's important, because this way we can make sure a shared dma-buf is located in a place where it can actually be shared. Then, after the PV frontend exports a dma-buf, the passthrough GPU driver imports it for rendering. At last, the dma-buf gets returned to the PV frontend for display; a rough kernel-side sketch of such an exporter follows below.
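Roughly, a guest-page-backed exporter like the frontend described above might look like the following minimal sketch using the standard in-kernel DMA-BUF API; the pv_* names and the pv_exporter_buf bookkeeping are hypothetical illustrations, not the actual driver:

```c
#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

/* Hypothetical exporter state: the guest pages backing one scanout buffer. */
struct pv_exporter_buf {
	struct page **pages;
	unsigned int nr_pages;
};

static struct sg_table *pv_map_dma_buf(struct dma_buf_attachment *attach,
					enum dma_data_direction dir)
{
	struct pv_exporter_buf *buf = attach->dmabuf->priv;
	struct sg_table *sgt;

	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
	if (!sgt)
		return ERR_PTR(-ENOMEM);

	/* Build a scatter-gather table over the guest pages... */
	if (sg_alloc_table_from_pages(sgt, buf->pages, buf->nr_pages, 0,
				      (size_t)buf->nr_pages << PAGE_SHIFT,
				      GFP_KERNEL)) {
		kfree(sgt);
		return ERR_PTR(-ENOMEM);
	}

	/* ...and map it for the importing device (e.g. the passthrough GPU). */
	if (dma_map_sgtable(attach->dev, sgt, dir, 0)) {
		sg_free_table(sgt);
		kfree(sgt);
		return ERR_PTR(-ENOMEM);
	}
	return sgt;
}

static void pv_unmap_dma_buf(struct dma_buf_attachment *attach,
			     struct sg_table *sgt,
			     enum dma_data_direction dir)
{
	dma_unmap_sgtable(attach->dev, sgt, dir, 0);
	sg_free_table(sgt);
	kfree(sgt);
}

static void pv_release(struct dma_buf *dmabuf)
{
	/* Free the page array and other bookkeeping here. */
}

static const struct dma_buf_ops pv_dmabuf_ops = {
	.map_dma_buf   = pv_map_dma_buf,
	.unmap_dma_buf = pv_unmap_dma_buf,
	.release       = pv_release,
};

/* Export the guest pages as a dma-buf that the passthrough GPU can import. */
static struct dma_buf *pv_export(struct pv_exporter_buf *buf, size_t size)
{
	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);

	exp_info.ops   = &pv_dmabuf_ops;
	exp_info.size  = size;
	exp_info.flags = O_RDWR;
	exp_info.priv  = buf;

	return dma_buf_export(&exp_info);
}
```

Because the backing storage is ordinary guest system pages rather than device-private memory, any other device or the host side can map this buffer, which is exactly the property the PV exporter is built around.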
Similar to the frontend, the PV backend works as a dma-buf exporter on the host side. It creates a dma-buf using the related information in the metadata passed by the frontend and exposes that dma-buf to the host side. Last but not least, the PV-based dma-buf exporter must support a buffer producer-consumer synchronization mechanism.

Virtio-vdmabuf was our first try at implementing this solution. Although it works, it introduced a set of new APIs, which would have required many code changes in the userspace graphics stack in order to be used. At that time, we got good advice from the upstream community that virtio-GPU already has cross-domain buffer-sharing support and would be easy to use. Besides, as a virtual GPU device, virtio-GPU is already supported by the userspace graphics stack, so we decided to go with virtio-GPU.

Here is the high-level design picture. In this picture, you can see that with a virtio-GPU and a passthrough GPU, from the guest's point of view it looks like a multi-GPU use case. Or, if the virtio-GPU is working in 2D-only mode, then it's a virtio-GPU-display plus passthrough-GPU-render-only case. Both cases are supported by the latest userspace graphics stack, and no matter which case it is, the idea is the same. The userspace GPU driver is in charge of graphics buffer allocation, and it understands which buffer is a scanout buffer. So a code change is added to the passthrough GPU's userspace driver to make sure that a scanout buffer is allocated through the virtio-GPU frontend. After the allocation, with the help of the userspace driver, the scanout buffer is exported as a dma-buf and gets imported into the passthrough GPU driver, where it can be rendered; a userspace sketch of this export/import step is shown below. At last, the buffer is committed to the virtio-GPU frontend for display. The virtio-GPU frontend passes the metadata of this scanout buffer to its backend, where a host dma-buf is exposed to the host world based on the information in that metadata, so there is no buffer copy here. Finally, with the help of the host display server and the userspace graphics stack, the scanout buffer is imported into the host GPU driver, where it can be displayed. In that way, the zero-copy goal is achieved.

Okay, I'll stop here and let my colleague Vivek cover the remaining parts. Hi Vivek, you can start presenting.

Thank you, Tina. Hi, my name is Vivek Kasireddy. I am a graphics software engineer at Intel. Today I'll be talking about some of the performance issues we encountered with virtio-GPU, and also a bit about the QEMU UI. Just a bit of background on why we are pursuing this: there are more headless GPUs that are going to be available in the next few months, particularly the STV platform, and also SR-IOV VFs. So the headless GPU use case will become mainstream given these impending releases.

When we started working on virtio-GPU, we noticed that there was at least one CPU copy and a couple of GPU copies done as the guest frame buffer data was transferred from the guest to the host. When we looked around to see what options were available to eliminate these copies and improve performance, we found that the concept of blob resources had already been added to the spec. It was a collaboration between Gerd Hoffmann, who is the virtio-GPU and QEMU UI maintainer, and the crosvm developers. But the patches were not integrated; they were only in RFC or alpha form. So we took a look at those patches, made some improvements, added some more changes, and finally got them merged. That's the status of those patches.
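For illustration, the export/import step referenced above might look like the following userspace sketch, using the standard libdrm PRIME calls; the function name and its arguments are hypothetical placeholders for what the guest userspace driver stack actually does:

```c
#include <stdint.h>
#include <unistd.h>
#include <xf86drm.h>

/* Export the scanout buffer's GEM handle from the virtio-GPU (display)
 * device as a dma-buf fd, then import it into the passthrough (render)
 * GPU. No copy happens: both devices end up referencing the same pages. */
int share_scanout(int virtio_fd, uint32_t scanout_handle,
		  int render_fd, uint32_t *render_handle)
{
	int dmabuf_fd, ret;

	/* GEM handle -> dma-buf fd on the exporting device. */
	ret = drmPrimeHandleToFD(virtio_fd, scanout_handle,
				 DRM_CLOEXEC | DRM_RDWR, &dmabuf_fd);
	if (ret)
		return ret;

	/* dma-buf fd -> GEM handle on the importing device. */
	ret = drmPrimeFDToHandle(render_fd, dmabuf_fd, render_handle);

	/* The fd can be closed once imported; the buffer stays referenced. */
	close(dmabuf_fd);
	return ret;
}
```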
A little bit more about blob resources. The concept behind the blob resources feature is actually related to the dma-buf buffer-sharing framework, particularly on Linux. Dma-buf makes it possible to associate a file descriptor with any buffer, so that it can be shared between user space and kernel space, or between drivers for that matter. Before this feature was merged, the data associated with the resource, which is actually the frame buffer, was memcpy'd from the iovecs into a shadow buffer on the host, and then a texture was created from it. After merging this feature, we can instead associate a dma-buf fd with the frame buffer resource and create a texture directly from that. We also added support for mapping huge pages to the udmabuf driver, which is the centerpiece of this whole thing and which I'll talk about a little more in the next few slides.

Here is a pictorial representation of how things work. As you can see, the kmsro framework is also one of the key components here. It enables the guest compositor to create scanout buffers via the virtio-GPU DRM driver, while the rendering is done, and the temporary buffers are created, via the passthrough GPU's DRM driver. Once that happens, the buffer is passed on to the virtio-GPU device in QEMU, which then forwards it to the UI module. The UI is the one that actually renders the guest frame buffer onto its own buffer and finally submits it to the host compositor for display. I'll talk a little more about these components in the next few slides.

One main challenge we encountered while implementing this feature is that there is a chance the guest may reuse the frame buffer that was shared with the host while the host is still using it. As you can see, this is a synchronization problem, and we figured it needed to be solved. One way we found that was relatively easy to implement is via sync objects, so I'll say a little about sync objects on the next slide.

Basically, we create a sync object after the blit is done by the UI module, then extract a file descriptor from the sync object and have QEMU wait on that file descriptor. Until the file descriptor signals, the guest is blocked from rendering, as you can see in the pipeline below, as the buffer is passed from one module to the other. The creation of sync objects is done using the EGL APIs you see here, enabled by the EGL Khronos fence sync and Android native fence sync extensions (a minimal sketch of this follows below). The UI modules currently implemented in QEMU that use EGL for rendering, such as SDL and GTK, can make use of these sync objects. One other advantage is that once the sync object is signaled, it also means the host is done using the buffer, so the guest is free to reuse it, delete it, or do whatever else it wants with it. It also ensures that the rate of rendering by the guest or the guest compositor does not exceed the host monitor's refresh rate, so the guest does not waste GPU cycles on frames that would never be shown. In terms of status, all but one of the patches have been merged; the last one still remains to be reviewed and merged.
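A minimal sketch of the sync-object step described above, assuming the EGL_KHR_fence_sync and EGL_ANDROID_native_fence_sync extensions are available and the EGL display and context are already set up; the helper name is hypothetical:

```c
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>

/* Create a fence right after the UI blit and extract a file descriptor
 * that QEMU's main loop can wait on before letting the guest reuse the
 * shared buffer. Returns -1 on failure. */
int create_fence_fd(EGLDisplay dpy)
{
	PFNEGLCREATESYNCKHRPROC create_sync = (PFNEGLCREATESYNCKHRPROC)
		eglGetProcAddress("eglCreateSyncKHR");
	PFNEGLDESTROYSYNCKHRPROC destroy_sync = (PFNEGLDESTROYSYNCKHRPROC)
		eglGetProcAddress("eglDestroySyncKHR");
	PFNEGLDUPNATIVEFENCEFDANDROIDPROC dup_fence_fd =
		(PFNEGLDUPNATIVEFENCEFDANDROIDPROC)
		eglGetProcAddress("eglDupNativeFenceFDANDROID");

	/* Insert a native fence into the GPU command stream, right after
	 * the blit commands that were just issued. */
	EGLSyncKHR sync = create_sync(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID, NULL);
	if (sync == EGL_NO_SYNC_KHR)
		return -1;

	/* The fence must be submitted to the GPU before its fd can be duped. */
	glFlush();

	/* The fd becomes readable once the GPU passes the fence, i.e. once
	 * the host is done using the guest's buffer. */
	int fd = dup_fence_fd(dpy, sync);
	destroy_sync(dpy, sync);
	return fd;
}
```

Once this fd is handed to the main loop, waiting on it gives exactly the producer-consumer behavior described above: the guest's next flip is held back until the host's blit has retired.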
As I briefly mentioned earlier, we also looked at the remaining copies and found that in order to eliminate all the GPU copies, we would need a UI backend that does not use EGL, because if you use EGL, then the buffers are allocated by EGL, and the application, in this case the QEMU UI, has no control over those buffers. So, as I discussed, using blob resources eliminates the CPU copy but not the GPU copy. We found that possibly the only way to eliminate the last of the GPU copies is by simply presenting the dma-buf associated with the guest scanout buffer: converting it to a native buffer and presenting it directly to the host compositor. As you can see, with this approach there is no need for the UI module to use EGL or to render anymore; we can directly submit the guest scanout buffer to the host compositor. A Wayland protocol sketch of this step is included at the end.

One of the advantages of this approach is obviously the fact that there can be no copies at all. However, that is only possible if the buffer submitted by the UI module to the host compositor is placed on a hardware plane, which makes it truly zero-copy. If a hardware plane is not available, then the host compositor inevitably has to blit the frame buffer onto its own scanout buffer. So with this UI backend, the maximum number of copies is limited to one: the inevitable blit when no hardware plane is available.

However, there are some drawbacks to this new Wayland UI backend. Its effectiveness is essentially limited to integrated GPUs, because with a discrete GPU there is local memory involved, and blits or copies need to be done from local memory to system memory. Also, the guest compositors need to do double-buffered rendering, which most Wayland-based compositors do, but some older X-based compositors still don't. And obviously, you cannot draw window decorations such as menus or other buttons. In terms of status, we already have a working proof of concept, but we found that there is a massive overhaul underway of the QEMU UI modules, particularly the D-Bus display implementation. After that work is done, the patches for this UI backend can be reviewed and hopefully merged.

In summary: as discussed, for the dma-buf rendered by the passthrough GPU to be shared, the PV-based solution using virtio-GPU is the most efficient and generic way of doing it. And as I discussed, using the blob resources feature eliminates the CPU copy, but in order to eliminate all the GPU copies except for at most one, we would need to use the Wayland UI backend. Lastly, in terms of acknowledgments, I'd like to thank Gerd and Daniel for providing feedback and ideas. Thank you.
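To make the direct-presentation idea above concrete, here is a minimal sketch of wrapping a guest scanout dma-buf as a native Wayland buffer via the standard zwp_linux_dmabuf_v1 protocol and committing it to a surface; the helper names, the single-plane XRGB8888 layout, and the already-bound protocol objects are all assumptions, not the actual QEMU implementation:

```c
#include <drm_fourcc.h>
#include <wayland-client.h>
#include "linux-dmabuf-unstable-v1-client-protocol.h"

/* Wrap the guest scanout dma-buf fd in a wl_buffer; no copy is made. */
static struct wl_buffer *
wrap_guest_scanout(struct zwp_linux_dmabuf_v1 *dmabuf, int dmabuf_fd,
		   uint32_t width, uint32_t height, uint32_t stride,
		   uint64_t modifier)
{
	struct zwp_linux_buffer_params_v1 *params =
		zwp_linux_dmabuf_v1_create_params(dmabuf);

	/* Single plane, offset 0; format modifier split into hi/lo words. */
	zwp_linux_buffer_params_v1_add(params, dmabuf_fd, 0, 0, stride,
				       modifier >> 32, modifier & 0xffffffff);

	struct wl_buffer *buffer = zwp_linux_buffer_params_v1_create_immed(
		params, width, height, DRM_FORMAT_XRGB8888, 0);
	zwp_linux_buffer_params_v1_destroy(params);
	return buffer;
}

/* Attach the wrapped buffer to the UI surface. If the host compositor can
 * place it on a hardware plane, the path is fully zero-copy; otherwise it
 * performs the one inevitable blit onto its own scanout buffer. */
static void present(struct wl_surface *surface, struct wl_buffer *buffer,
		    int32_t width, int32_t height)
{
	wl_surface_attach(surface, buffer, 0, 0);
	wl_surface_damage(surface, 0, 0, width, height);
	wl_surface_commit(surface);
}
```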