Okay, so my name's John Baboval. I work at Citrix on XenClient Enterprise. I'm the maintainer of the device model for XenClient Enterprise, and lately I've been doing a bunch of graphics virtualization work. And I'd like to talk about removing copies from the display path. So I'm gonna go through a little bit of basic stuff: how bits get on the screen with Xen, some of the existing optimizations that are already in the path, and how we can use GEM to help. There's not a lot of time in this session, so some of this is gonna be a little bit boiled down.

So this is a quick version of how graphics works in QEMU. There are two sides to the graphics stack in QEMU: the hardware and the UI. The hardware side is the device-specific emulation, so this can be your Cirrus VGA or what have you. The UI is device independent, so you can have your SDL rendering code or VNC. It allows you to mix and match, and they're pretty well separated from each other. So basically what happens is you have a linear frame buffer in your guest's VRAM, and there's a periodic timer that calls hw_update. That goes through and processes the guest pixel data into a display surface, and it calls dpy_update to tell the UI what portions of the screen have changed. Then the UI does a refresh: it takes the updated regions and puts the new bits on the screen.

So this takes us to somewhat of a worst case scenario for how bits can end up on your screen. Your guest has a frame buffer somewhere that it's been software rendering into, and it's gonna write those pixels to emulated VRAM, so they get copied over into your emulated VRAM in QEMU. QEMU is going to look at that scan line by scan line, because it might not be in a format that's friendly to your UI, and it's gonna copy it over scan line by scan line to your display surface. The display surface is then passed to the UI through the dpy_update call.
And it will then blit out onto your window. There may or may not be a bunch more copies after that, but hopefully you don't care because they're all done in hardware.

Now hopefully no one's ever running in the worst case scenario. The most typical optimization that already exists is to do a foreign page mapping, so that you have shared VRAM between the guest frame buffer and QEMU's virtual VRAM. That eliminates the first copy and you just have a shared buffer.

Okay, so the next optimization that can happen is called shared buffer mode. Inside QEMU you have the display surface, which is passed from the hardware to the UI. If the buffer in that display surface is in a format that the UI knows how to display natively, there's no reason to copy the data out of the VRAM into the display surface. You can just put a pointer to the existing frame buffer into the display surface, and you're left with just one copy, from the display surface into whatever your final destination is.

In order to optimize that process, because you don't have emulated writes to VRAM, Xen provides dirty page tracking. So when you do the hw_update you can generate a list of calls to dpy_update with the scan lines on which we have reason to believe the pixel data has changed. So you only have one copy, and hopefully it's not a full frame copy, only a subset of your screen.

Okay, we'll talk a little bit about graphics hardware next. This is the really short version. A lot of systems have a unified memory architecture, which means the GPU is connected to the same memory bus as the CPU and it can scan directly out of VRAM. And GPUs have their own virtual address space, which means they can scan out of any domain's memory if the GPU is programmed right. It's the really short version, so that's how GPUs work.
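As a rough illustration of the dirty page tracking step described above, here is a minimal Python sketch that turns a per-page dirty bitmap into merged scan-line spans, one span per dpy_update-style call. The function name, the 4 KiB page size, and the list-of-booleans bitmap are assumptions made for the example; the real QEMU and Xen code uses its own structures.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def dirty_scanline_spans(dirty_pages, stride, height, page_size=PAGE_SIZE):
    """Map per-page dirty flags onto merged (first_line, line_count) spans.

    dirty_pages: one boolean per page of the linear frame buffer
    stride:      bytes per scan line
    height:      number of scan lines in the frame buffer
    """
    spans = []  # each entry is [first_line, last_line], inclusive
    for page, dirty in enumerate(dirty_pages):
        if not dirty:
            continue
        first = (page * page_size) // stride
        if first >= height:
            break  # page lies past the visible frame buffer
        last = min(((page + 1) * page_size - 1) // stride, height - 1)
        if spans and first <= spans[-1][1] + 1:
            # Overlaps or touches the previous span: extend it.
            spans[-1][1] = max(spans[-1][1], last)
        else:
            spans.append([first, last])
    return [(a, b - a + 1) for a, b in spans]


# 1024 pixels wide at 32 bits per pixel: one page holds exactly one scan line.
spans = dirty_scanline_spans(
    [False, True, True, False, False, True, False, False],
    stride=1024 * 4,
    height=8,
)
```

In this example dirty pages 1, 2 and 5 collapse into two spans, lines 1 to 2 and line 5, so only three scan lines would be copied instead of the full frame.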
So that leads us to the obvious solution: if you set up the GPU page table to map the linear frame buffer of the domain, and you program the CRTC base address to point to that mapping in GPU address space, you have bits on your screen, there are no copies, and it's simple.

Unfortunately nothing's simple. GPUs are actually a lot more complicated than that, so you don't necessarily know how to manipulate the page tables of your GPU. The vendor might not tell you how to do it. Even if the vendor does tell you how to do it, next week they come out with a new GPU and then it doesn't work anymore. So there have been lots of implementations of zero copy video in virtualized environments, and generally they ship, and then at some point they break, and then they go away. And the other problem is that you probably want to use your GPU for more than one thing at a time. You probably want to have a windowing system running, so you can show your VM frame buffer and some other application, or multiple VMs, or what have you.

So in order to solve that problem the idea is to use GEM. The Linux kernel has drivers for a lot of GPUs. There's a lot of great code that's been contributed to open source to take advantage of. And the API is relatively standard, which means we can write this once and have minimal change when the next GPU comes out, and we can rely on somebody else to update the driver when the new GPU is available.

The trick about GEM is that it can give you some memory to use with your GPU, but we don't just want any memory. We want to tell it specifically which memory to use. So when you have a GEM object, basically what you have is a bunch of GPU-specific information that you don't care about. It's behind the scenes; it's for managing state inside the driver. And then you have a scatter-gather list of the pages that make up the pixels in the object. Those pages are filled in by functions that are already hooked in GEM.
So it's relatively easy to add a new get_pages call and a new put_pages call so that you can fill in the scatter-gather list with the right pages, and you theoretically don't care about the GPU-specific state. So you fill in the right pages and you pass the object to KMS. The kernel sets the mode to point to that frame buffer and you're done.

But you need to know which pages, and GEM needs to know which pages. So the question becomes: how do you get the right pages? Earlier we talked about an existing optimization in QEMU, where QEMU does a foreign page mapping, and that gets us a local virtual address for the frame buffer. Unfortunately, it doesn't get us the machine addresses, and the GPU needs machine addresses. Also, a Linux scatter-gather list is filled in with page structs, so the memory needs to be known to the kernel's VM, and foreign pages are kind of magic. The hypervisor goes behind the scenes and fills in the right entries in your page table so you can resolve a virtual address. But then when you go to ask Linux what the physical address is for that virtual address, it has no idea, and you get all sorts of crashing.

The grant mechanisms in Xen work around that. So we could use grant tables. Some code in domU would allocate a big grant table and create a grant reference for each page in your linear frame buffer. You'd have some mechanism to transfer the grant references from the guest to dom0. And then inside the grant code there's something called the M2P override, which takes a struct page in the kernel and overrides the machine address of that page. So when you go and look up the address, it sees that it doesn't actually hit, and then it goes into this lookup table and gets you the machine address you've overridden it with. And it works.

But there are some caveats. You have to have a cooperative guest, and ideally you wanna be able to do this without having some sort of PV video driver in the guest.
It really inflates the size of your grant tables, and it sets up a bunch of virtual mappings to go with these addresses that nobody ever looks at. When you look at the pages in a GEM object from user space, they get resolved through the graphics aperture instead of through the virtual addresses of the associated pages. So the virtual addresses are never used.

So we can sort of do a hybrid approach. We can skip setting up a foreign page mapping and just translate the guest PFNs using an existing hypercall, translate_gpfn_list. You pass a list of the guest physical frame numbers of your linear frame buffer to Xen, and it happily increases the reference count on all of them and hands you a list of machine addresses back. You can then just use the M2P override section of the grant code to override some pages, and you give the pages to GEM. So now you have no guest knowledge and no unnecessary mappings, but you do have page structures. And then you get a nice GEM object representing your frame buffer.

So in order to do this, a new ioctl is added to GEM. It's very simple because the get_pages and put_pages functions are already hooked. So basically it takes some additional parameters and it overrides get_pages and put_pages. It calls into Xen with the translate hypercall, it sets up the table, and it fills in the scatter-gather list. And it basically looks like this: you give it the guest frame number, you give it the size, you give it the domain ID, and it gives you a GEM handle. So now, this is kind of a bit of a jumble, but now all of your buffers are the same pages. So nobody has to copy anything anywhere.

So now you have this GEM object that represents your guest frame buffer, and there are a few things you can do with it. You can use KMS and turn it into a scanout buffer. You get an entire VT in your dom0 Linux. There's no overhead whatsoever. So your screen is updating, CPU usage is zero, and all the copies are being done in hardware.
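The ioctl described a little earlier takes a guest frame number, a size and a domain ID, and returns a GEM handle. What follows is a hedged sketch of what such an argument block might look like, using Python's ctypes purely to show the layout. The structure name, field names, field widths and padding are all hypothetical and are not taken from the actual patch. The helper assumes the frame buffer is contiguous in guest physical space, which is what passing a single starting frame plus a size implies.

```python
import ctypes

PAGE_SIZE = 4096  # assumed page size in bytes


class GemForeignCreate(ctypes.Structure):
    """Hypothetical ioctl argument block; names and widths are assumptions."""

    _fields_ = [
        ("gfn", ctypes.c_uint64),     # in:  first guest frame number of the frame buffer
        ("size", ctypes.c_uint64),    # in:  frame buffer size in bytes
        ("domid", ctypes.c_uint16),   # in:  Xen domain ID that owns the pages
        ("pad", ctypes.c_uint16),     # explicit padding to keep the layout fixed
        ("handle", ctypes.c_uint32),  # out: GEM handle for the new object
    ]


def guest_frames(gfn, size, page_size=PAGE_SIZE):
    """List the guest frame numbers covering a linear frame buffer.

    This is the list that would be handed to the translate hypercall.
    """
    count = (size + page_size - 1) // page_size  # round up to whole pages
    return list(range(gfn, gfn + count))


# A 1024x768, 32-bit frame buffer starting at guest frame 0xF0000 (made-up values).
args = GemForeignCreate(gfn=0xF0000, size=1024 * 768 * 4, domid=5)
frames = guest_frames(args.gfn, args.size)
```

Keeping the request down to a starting frame, a size and a domain ID means the caller never sees machine addresses; the translation and the page override stay inside the kernel.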
You don't get the nice guest events from X, so you need to find another source for your user input, but that's not too tough. You can get them from the /dev event devices. That doesn't solve the using-your-display-for-multiple-things problem, though. It's taking over your entire screen.

So there are a couple of other options. If you want to use this inside of X, you can convert it to a PRIME object. A PRIME object is a file descriptor. The idea is that you can take your GEM object and share it with multiple processes in user space and still have some reasonable permissions. And when you get this file handle you can give it to the DRI2 extension in X, or to DRI3, which doesn't really exist yet but will soon, and do what you will with it. Or, more interesting, you can convert it to an EGL named image and map it to a texture, and then you can do all sorts of hardware accelerated OpenGL things with it. And you can even use this technique to display hardware accelerated frame buffers with XenGT. So Haitao, who I don't see here, will be demoing that later, so you should be sure to go see his demo.

I chickened out on the demo. But the code is on GitHub, so you can go check it out. And I have a spec and a document which I will put somewhere and tell Lars so he can tell the rest of you, but I haven't put it online anywhere yet.

Any questions? Where did Lars go with the microphone? Did that go too fast?

The translate hypercall, is that part of Xen now?

It is part of Xen, although currently there are no consumers of it that I'm aware of.

I don't even think it's in there. It was removed in 2009.

Yeah, it was pretty ancient.

It's gone.

So the Xen side of it's still in but the Linux side of it isn't.

No, I'm pretty sure the Xen side of it's gone as well.

Where's Ben? Ben, what version of Xen are we using?

4.3.

Okay.
I'll have to look and see if it's in a patch that I've applied to Xen, but I was pretty sure that...

Isn't it interesting that I have a patch too, that I applied to Xen, that does exactly the same thing?

Okay. If it's not in Xen, in order to do this you'll need to put it back. XenClient Enterprise has a list of patches that get applied to it, and I wouldn't be surprised if it came out of that list of patches and I just didn't realize it.

So there's another way around to doing this, which is the way we implemented it, and which is certainly simpler, I think. You allocate the memory in the host, or in domain zero in this case, and then actually map it into the guest as the frame buffer pages. Then you don't have to worry about any of the foreign mappings or translation on the domain zero or host side, and it certainly simplifies things from that point of view.

Right. So you can definitely do it both ways, but doing it that way adds a different problem, which is that you've allocated a GEM object that's backed by machine pages, and you need some way of asking GEM what pages those are so you can give them to the guest. So either way you end up playing a little bit of a mapping trick, but yeah, you can certainly do it in either direction.

Okay. Do you have any more questions?

Did you do any performance measurement? I mean, this should be quite a lot faster.

Yeah, so when you're running using KMS there's really nothing to measure. There's literally nothing happening on the CPU while you're displaying, at least in dom0. So from that perspective the performance is very good. And then if you're using one of the other mechanisms, it depends a little bit on your GPU, because you're relying on your GPU to do the rendering. It's performance-wise very similar to what XenClient XT does with Surfman, so if you want a performance comparison, that would be a good thing to compare it to.
The key difference is that you don't have to maintain a proprietary GPU driver; you're using the code in Linux. So, like I said, zero copy video is nothing new, but zero copy video without doing a lot of code maintenance is the idea here.

More questions?

Thanks very much, John.