Okay, let's start. My name is Shan Haitao. I'm an engineering manager at Intel, managing the XenGT project. XenGT is a software-based Intel graphics virtualization solution, which probably sounds quite similar to the other talk here. We also tried to set up a demo on this machine, but for some reason it cannot drive the VGA connector directly, so my colleague David has a demo at his seat. If you want to see one, just go there.

For the agenda: first I will give you some background, really just to show why we need graphics virtualization in a virtual machine. Then we will quickly go through the existing approaches, or at least the ones I know of, and try to compare them a bit. After that we will present our XenGT architecture, show you some performance data, and finish with a summary.

Okay, starting with the background. Graphics computing is quite important. It is used for entertainment such as games and video, for UI acceleration such as Windows Aero or Compiz, and for general-purpose GPU computing. The key point is that when you move all of these workloads into a virtual machine, you still want the same experience there. That is basically why we need graphics in the virtual machine.

So what can we use this for? On the client, you can launch multiple virtual machines that still provide a rich user experience comparable to native. On the server side, you can share the GPU resources for transcoding, GPU computing, and VDI. For the embedded area I do not have a specific usage model to offer, but if you have one, XenGT can provide the capability to share the GPU among virtual machines.

Okay, the existing approaches. The first one is device emulation. I think you already know this: it is the QEMU window that shows you the VM's display. We can do some optimization to make the display a little better, but we cannot do the 3D part; we cannot emulate a modern GPU, due to its sheer complexity and the bad performance of emulation.

Next we come to the split driver model. Basically there is a front end and a back end that talk to each other; that is just the framework. On the graphics side, you forward an API such as OpenGL or DirectX from the guest to the host for processing. It achieves pretty good performance, and multiple solutions already exist. It is hardware agnostic, but it is also quite complex: for example, you need to support many API versions, and if the guest uses DirectX while the host uses OpenGL, translating between them is still a hard problem.

The next one is device pass-through. From the performance point of view this is really good, because you achieve almost native performance. But there is no multiplexing at all: you can only support one VM.

Okay, let's talk about our XenGT architecture. Here I want to emphasize that with XenGT we are proposing a mediated pass-through.
What do I mean by mediated pass-through? Look at the I/O virtualization spectrum at the bottom of the slide. On the left side you have the device emulator, which has the most multiplexing capability but the worst performance; on the right side you have direct pass-through. Our solution sits somewhere in between: we pass through the performance-critical resources, and for everything else we provide mediation. By doing this we achieve performance quite close to native, and we still provide moderate multiplexing.

The next slide shows a little more detail about the software components involved in the XenGT architecture. We run the native graphics driver in the virtual machine. It can access some of the hardware directly, while accesses to the other portion of the hardware are trapped by Xen and forwarded to the vGT manager for mediation. This vGT manager is a kernel driver that resides in Dom0. Here I want to emphasize that, from the vGT point of view, Dom0 is just another guest. The Dom0 Gen graphics driver, which as we all know is the i915 driver, also only accesses part of the graphics resources directly; its accesses to the privileged resources still need to be trapped and forwarded to vGT for processing. To do that, we essentially de-privilege the Dom0 i915 driver, so that its privileged accesses are also trapped and forwarded to vGT, which lives in Dom0 as well. Any questions about this picture? I guess it's okay.

Next I want to talk about Intel Processor Graphics. This is not the traditional diagram with all the 3D and media decode pipelines; it is a simplified diagram from the resource point of view. We show it because, as I said, we try to divide the resources into the performance-critical ones and the others. First I give you an overview of the resources; on the next slide I will show how we decide which are passed through and which are mediated.

This is the Intel GPU; I don't know about other vendors. First is the graphics memory. On Intel hardware there is a single global graphics virtual memory space and multiple per-process graphics virtual memory spaces, as you can see in the picture. Intel integrated graphics has no onboard video memory, so all of this virtual memory space is backed by system memory, which means we have page tables to do the translation: the global GTT for the global space and the PPGTT for the per-process spaces. Next, we divide the engines into the render engine and the display engine, each with its internal context. Everything else not covered specifically in the picture we just call global state.

Then we did some profiling of the access frequency through the GPU interface. This is the main input for deciding which resources we pass through and which we mediate. We ran typical workloads, and most of the accesses are related to the command buffer, and the command buffer itself resides in graphics memory. So our decision is: we pass through the graphics virtual memory space and the command buffer (I will cover the command buffer in more detail in the next slides), and we mediate everything else. A small illustrative sketch of this split follows. Okay.
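To make the pass-through versus mediation split concrete, here is a minimal, purely illustrative sketch of how a mediator like the vGT manager might handle a trapped register write. The names, structure layout, and BAR split are hypothetical simplifications, not taken from the actual XenGT code.

```c
#include <stdint.h>

/* Hypothetical, simplified layout: a 4 MB register BAR whose lower half is
 * MMIO registers and whose upper half is the global GTT (roughly how Gen
 * hardware lays this out, but the numbers here are illustrative only). */
#define VGT_MMIO_BYTES  0x200000u
#define VGT_GTT_OFFSET  VGT_MMIO_BYTES

/* Hypothetical per-VM state kept by the mediator. */
struct vgt_instance {
    int      vm_id;
    uint64_t gm_base;                   /* start of this VM's graphics memory partition */
    uint64_t gm_size;                   /* size of that partition                       */
    uint32_t vreg[VGT_MMIO_BYTES / 4];  /* per-VM virtual copy of the MMIO registers    */
};

/* Provided elsewhere in this sketch. */
extern void vgt_shadow_gtt_write(struct vgt_instance *vgt, uint32_t gtt_off, uint32_t val);
extern void vgt_emulate_side_effects(struct vgt_instance *vgt, uint32_t offset, uint32_t val);

/* Called by the hypervisor for a trapped write into the register BAR.
 * The performance-critical resources (the VM's own aperture and command
 * buffers) are mapped to the guest directly and never reach this handler;
 * only the mediated resources are trapped. */
void vgt_mmio_write(struct vgt_instance *vgt, uint32_t offset, uint32_t val)
{
    if (offset >= VGT_GTT_OFFSET) {
        /* Global GTT entry: validate that the target belongs to this VM's
         * partition, translate guest pages to machine pages, and write the
         * real GTT (the shadowing described later in the talk). */
        vgt_shadow_gtt_write(vgt, offset - VGT_GTT_OFFSET, val);
        return;
    }

    /* Everything else (display engine, global state, ...) is mediated:
     * record the value in the per-VM virtual register and emulate effects. */
    vgt->vreg[offset / 4] = val;
    vgt_emulate_side_effects(vgt, offset, val);
}
```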
The first part is the global graphics virtual memory space. Essentially we partition this space among the virtual machines, and we use ballooning to achieve the partitioning. There is a difference compared with the usual memory ballooning logic. With normal ballooning you essentially say, "hey, give me four megabytes of memory; you have too much, give some to me." For the graphics virtual memory space we say the same thing, but it becomes "give me four megabytes of memory starting from this address." That is the difference in our ballooning. By doing this, each VM can only access its own portion of the resource.

The next thing I need to mention is the page table. We partitioned the global graphics virtual memory space, but the memory itself is backed by system memory, so the translation lives in the GTT, and we cannot let the guest access the GTT directly. GTT accesses are therefore mediated. For the portion of the GTT where the guest has access rights, that is, where the guest owns the corresponding graphics virtual memory space, we validate the entry, translate it, and populate the physical global GTT. The other parts are simply virtualized. This is how we ensure isolation.

Next is the per-process graphics virtual memory. The concept is quite similar to CPU page tables: there is a page directory base pointer, quite similar to CR3, and then two levels of page tables do the translation. To virtualize this we again use the shadow page table technique. I think that's all for this part.

Okay, let's come to the command buffers. This is the interface through which the CPU submits workloads to the GPU for processing, and through which the GPU reports which commands have already been processed. The CPU submits a workload by first writing the commands into the command buffer, saying do this, this, and this, and then updating the tail register. The GPU then starts processing; whenever it has parsed a command and submitted it to the execution engine, it advances the head register to point to the next command.

In our handling of the command buffer, we allow the guest to write the command buffer directly. The command buffer sits in the guest's own graphics memory partition, so the guest has direct access to it. But we cannot allow the guest to update the tail register directly. If the guest is the current render owner, meaning it owns the render engine right now, we track this write and update the hardware accordingly. If the guest is not the current owner, we just queue the workload without touching the hardware, and the queued workload runs the next time a context switch picks this VM as the target. A small sketch of this tail-write handling follows.
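As a rough illustration of the trapped tail-register write just described, here is a minimal sketch. The structure and function names are hypothetical, and real command submission on Gen hardware involves more state (ring space accounting, per-ring registers, and so on); this only shows the owner-versus-queue decision from the talk.

```c
#include <stdint.h>
#include <stdbool.h>

struct vgt_instance;

/* Hypothetical bookkeeping for one guest's view of the render ring. */
struct vgt_ring {
    struct vgt_instance *owner_vm;    /* which vGT instance this ring view belongs to */
    uint32_t             vtail;       /* last tail value written by the guest         */
    bool                 has_pending; /* commands queued but not yet on hardware      */
};

/* Provided elsewhere in this sketch: is this VM the current render owner,
 * and the raw write to the physical ring tail register. */
extern bool vgt_is_render_owner(struct vgt_instance *vm);
extern void hw_write_ring_tail(uint32_t tail);

/* Called when the hypervisor traps the guest's write to the ring tail register.
 * The command buffer itself was written by the guest directly, without any trap. */
void vgt_handle_tail_write(struct vgt_ring *ring, uint32_t new_tail)
{
    ring->vtail = new_tail;

    if (vgt_is_render_owner(ring->owner_vm)) {
        /* Render owner: propagate the tail to hardware so the GPU
         * starts fetching the newly written commands. */
        hw_write_ring_tail(new_tail);
    } else {
        /* Not the owner: remember the submission; it will be replayed
         * when the scheduler next switches the render engine to this VM. */
        ring->has_pending = true;
    }
}
```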
Then we come to render engine sharing. As I already said, the render engine is shared just like the CPU: at any given time only one VM can be the render owner, meaning only that VM can really submit workloads to the hardware. We use a simple round-robin schedule; I think the time slice is 16 milliseconds. The basic render context switch flow is this: first we wait for VM1 to finish its in-flight work, then we save the MMIO registers for VM1, flush the internal TLBs and caches, and issue a hardware context switch command. Then we restore the MMIO registers from VM2's settings and submit VM2's queued workload. (There is a small pseudo-code recap of this flow right before the questions.)

Now the display engine part. Here we currently provide two modes. The first is the direct display mode: when you do a VM switch, the vGT driver updates the display plane's surface base register to point directly at the guest's framebuffer. That is why we call it the direct display mode. The other one is the indirect display mode. Here the vGT driver provides an API that exposes the location of the guest framebuffer, its format, and other related information, and it is then a compositor in Dom0 that composites this framebuffer onto its display. So in this mode vGT does not drive the display hardware itself; we just provide the location, the format, whatever you need, so that you can do the compositing. I believe the compositing approach presented earlier could also be combined with the API we expose through this indirect display mode. Any questions so far? Okay.

Next I will show you the performance. First of all, this is not the final performance; we just want to give you an idea of what it will look like. This chart shows one virtual machine running four workloads, compared with native and with the direct pass-through solution. As you can see, vGT achieves pretty good performance compared with VT-d pass-through and with native. Next I show the performance with two VMs. We use the single-VM case as a reference, normalized to 100, and then sum the performance of the two VMs. You can see the sum is nearly 100, so basically the aggregate performance of two VMs is about equal to running a single VM.

That brings me to the summary. First, we need graphics virtualization so that we can sustain a consistent and rich user experience in the virtual machine. Our XenGT architecture achieves good performance because we pass through the performance-critical resources, and it supports moderate multiplexing by mediating the rest. The current status is that we can support up to four virtual machines sharing a single GPU on 4th generation Intel Core processors. We have published our code and hosted it on GitHub, and we will keep updating it to reflect our progress, recent development, and bug fixes. What I really want people to do is: if you are interested, please try it and give us feedback, whether bugs, your usage models, or your specific requirements, so that we can consider them and integrate them in the next release. We have not yet pushed all the patches upstream; that is our target for next year. But the code is already published. That's all. If you have questions, please ask; if you want to see a demo, please go to David.
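Before the questions, here is the promised pseudo-code recap of the render context switch flow described above. It is a minimal sketch under the assumption of the simple round-robin scheduler from the talk; all helper names are hypothetical, and the real flow on Gen hardware touches many more registers.

```c
#include <stdint.h>

struct vgt_instance;

/* Hypothetical helpers corresponding to the steps described in the talk. */
extern void wait_for_ring_idle(struct vgt_instance *vm);      /* step 1 */
extern void save_render_mmio(struct vgt_instance *vm);        /* step 2 */
extern void flush_tlbs_and_caches(void);                      /* step 3 */
extern void issue_hw_context_switch(struct vgt_instance *vm); /* step 4 */
extern void restore_render_mmio(struct vgt_instance *vm);     /* step 5 */
extern void submit_queued_workloads(struct vgt_instance *vm); /* step 6 */

#define VGT_TIME_SLICE_MS 16   /* round-robin time slice mentioned in the talk */

/* One render-engine switch from the current owner (prev) to the next owner (next). */
void vgt_render_context_switch(struct vgt_instance *prev, struct vgt_instance *next)
{
    wait_for_ring_idle(prev);       /* 1. wait for prev's in-flight commands     */
    save_render_mmio(prev);         /* 2. save prev's render MMIO state          */
    flush_tlbs_and_caches();        /* 3. flush internal TLBs and caches         */
    issue_hw_context_switch(next);  /* 4. hardware context switch command        */
    restore_render_mmio(next);      /* 5. restore next's render MMIO settings    */
    submit_queued_workloads(next);  /* 6. replay next's queued tail writes       */
}
```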
Which card? Basically the fourth generation, code name Haswell, I mean the CPU-integrated graphics. If you buy a fourth generation i7, i5, or i3 processor, as long as it has Intel integrated graphics, we support it. We also support the Xeon E3 server processors, because those server processors also have Intel integrated graphics. Oh, sorry, I don't know that one. The key point is that you need the Haswell generation of Intel processor graphics.

How many VMs? Oh, that's a hard question. Currently we are limited because we partition the graphics memory, so we are limited by the total graphics memory and the aperture size. For example, a single Windows VM needs 128 megabytes of aperture, and the hardware only has, I think, 512. So we cannot support that many.

So we're talking about a two-year time frame then? I'm a software guy, not a hardware guy, so it's hard for me to say. But I will give feedback to the hardware team that people want to run a thousand virtual machines on Intel graphics. In three months.

Okay. I'm kind of curious about the GitHub repo as well, a couple of things. One, why did you just put patches in the repo rather than actually putting the code in it? It seemed like an odd way of using a GitHub repo. I think that is because we don't have enough space in the GitHub account, so I can only host the patches rather than the whole Xen and kernel trees.

And the other one: looking at your Xen patch, it is pretty large, and it seems very dedicated to this purpose, as opposed to going for the more general-purpose secondary emulator approach. I was wondering why you needed to bury so much code directly in Xen. Sorry, I don't quite understand the last question. Your Xen patch is, at least as far as I can tell, over a thousand lines, and you seem to have a dedicated I/O emulator buried directly in Xen. You mean a dedicated what? You have a dedicated emulation model: you trap MMIO reads and writes directly in Xen, and I was wondering why you needed to do that. We need to provide the emulation somewhere; in any case we need to trap the guest accesses to these resources. Why didn't you do that in QEMU? You would get all the MMIOs there anyway. That's a good question. David, do you know the answer? I will repeat it for the mic: early on, that decision was made to provide a fast path to the driver, where the real guts of everything are. If you look at the kernel patch, it is even bigger. We just wanted the shortest way through.

You said that the command buffer is directly shared with the guest, so the guest is free to write whatever it likes into the command buffer. And considering that there is some shared state, at least the GGTT is shared across all the VMs, is it possible for a malicious guest to use the command buffer to read or write memory that is actually shared? First, you can only access your own portion of the graphics memory, and the command buffer is in your own share of graphics memory, so you cannot reach other VMs' graphics memory; we provide that isolation. The next question is: what if there are pointers inside your share of graphics memory which point somewhere else, to another VM's graphics memory? As I said, when you finally submit the workload you write to the tail register, and at that point we scan all the commands and find out whether you are trying to access a resource that does not belong to you. So you do validation of each command? Yeah, yeah, sure. A rough sketch of such a scan follows.
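To illustrate what that command scan might look like, here is a minimal, hypothetical sketch: at the trapped tail write, walk the newly submitted commands and check every referenced graphics memory address against the VM's own partition. The command record shown here is made up; real Gen command parsing decodes variable-length commands and knows which dwords carry addresses.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one VM's graphics memory partition. */
struct vgt_gm_partition {
    uint64_t base;   /* start of the VM's share of the global graphics space */
    uint64_t size;   /* size of that share                                    */
};

static bool address_in_partition(const struct vgt_gm_partition *p, uint64_t gma)
{
    return gma >= p->base && gma < p->base + p->size;
}

/* Made-up command record used only for this sketch. */
struct gpu_cmd {
    uint32_t opcode;
    uint64_t gma;        /* graphics memory address referenced by the command */
    bool     has_gma;    /* does this command reference graphics memory?      */
};

/* Scan the commands submitted between the old and new tail; reject the
 * submission if any referenced address falls outside the VM's partition. */
bool vgt_scan_submission(const struct vgt_gm_partition *p,
                         const struct gpu_cmd *cmds, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (cmds[i].has_gma && !address_in_partition(p, cmds[i].gma))
            return false;   /* malicious or buggy guest: do not submit */
    }
    return true;            /* safe to propagate the tail write to hardware */
}
```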
Is everything open source, even the demo you show in the slides? Yeah, the demo is built with these patches. Everything is there, nothing internal. So just to repeat for the mic: it's all open source. Other questions?

What about support for OpenGL ES in a scenario like this? That is up to the driver. Because we run the native driver in the guest, if the native driver supports it, we support it.

More questions? Thank you. Thank you, everyone.