I spoke yesterday, so you may recognise me. I'm a principal software engineer at Citrix. My main responsibilities are PV drivers and virtual GPUs, so this talk is aimed more at the virtual GPU end of things. The talk is entitled "Multiple Device Emulators for HVM", so I guess that gives away the approach we took. But first I'd like to discuss how emulation takes place. I'll then move on to where we started in our work with NVIDIA to bring virtual GPU into XenServer, and the current implementation, which is available in tech preview now and will be shipping in product towards the end of the year. And then I'd like to go on to where we expect to take the implementation, and possible spin-offs from what we've done.

So, emulation is done by basically trapping I/O port accesses, which cause explicit VM exits, or page faults in MMIO regions. Some devices Xen emulates itself, so critical things like the HPET and the RTC are emulated in the hypervisor, but mostly a trap like that will result in an ioreq data structure being built by Xen and passed to QEMU (there's a rough sketch of that structure below). So that gives you two options for device emulation: if you want to add a new device, you've either got to go with QEMU or you've got to build it into Xen.

This is just a slide of the general NVIDIA GRID vGPU architecture. The basic idea is that in-guest the OS sees an NVIDIA GPU; it looks like a piece of hardware. This has advantages, because it means you don't need a special driver: you can just use NVIDIA's standard hardware driver to talk to the thing. And actually, to be honest, underneath what you've got essentially is a hardware GPU; it's just a slice of the actual real GPU. So you can kind of think of this as SR-IOV, but it's not actually SR-IOV hardware: you need some software to make it work. So yeah, the bits in dom0: the kernel driver would normally be the physical function driver in an SR-IOV world, but essentially you've got this extra device model on top, which creates the virtualisation layer for you, and you need an emulator to make that work.

So how do we do it? Well, there are a couple of options. Obviously you could build it into Xen. This was never a particularly popular choice because, for a start, the code is closed source. We don't even get the source from NVIDIA, so it's definitely not GPL. And it needs to coexist with the VGA implementation, which was in QEMU. So QEMU is obviously going to be the natural home. The problem with putting it into QEMU is that it was basically a large patch to qemu-traditional, and I mean quite big. Largely what it did was put its hooks all over the VGA code and then create a binary plugin interface for the actual implementation, which, as I said, was closed source. So this probably wouldn't gain much traction upstream. It wasn't the sort of thing I looked forward to forward-porting onto upstream QEMU. And if we weren't actually going to be able to upstream it, then it was going to be a patch we'd have to carry in perpetuity, and it would be generally pretty hard to rebase, I suspect. So we didn't like that approach either, so we needed a third way.

Some work that Julien Grall had done in XenClient was to come up with the idea of an ioreq server abstraction. Normally QEMU talks to Xen using a variety of HVM parameters and bits and pieces spread out all over the place.
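As an aside, the ioreq mentioned earlier is the unit of currency here. Paraphrasing from memory from xen/include/public/hvm/ioreq.h, each vCPU's slot in the shared page looks roughly like this; the exact field order and widths vary between Xen versions, so treat it as illustrative rather than authoritative:

```c
#include <stdint.h>

/* Paraphrased from memory from xen/include/public/hvm/ioreq.h; the
 * real layout packs some of these as bitfields and differs between
 * Xen versions. Illustrative only. */
struct ioreq {
    uint64_t addr;         /* port number, or guest physical address    */
    uint64_t data;         /* data to write, or buffer for data read    */
    uint32_t count;        /* repeat count, for rep-prefixed string ops */
    uint32_t size;         /* access width in bytes                     */
    uint32_t vp_eport;     /* event channel used to kick the emulator   */
    uint8_t  state;        /* NONE -> READY -> IN_SERVICE -> RESP_READY */
    uint8_t  data_is_ptr;  /* 'data' is a guest paddr, not a value      */
    uint8_t  dir;          /* 1 = read, 0 = write                       */
    uint8_t  type;         /* port I/O, MMIO copy, PCI config, ...      */
};
```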
But if you take that list of things that it needs, which is basically a shared page with a per-vCPU data structure for passing synchronous ioreqs from Xen to the emulator, another page for passing buffered, asynchronous ioreqs from Xen to the emulator, and an event channel to signal that ring; if you take all those things and hang them off the same data structure, then you can create this ioreq server abstraction. And once you've done that, you can have several of them. So you stick them in a table, and when you take the VM exit to do an emulation, you can go and look up where you want to send it.

So what we've got is a table that allows I/Os to be steered according to port address ranges, MMIO address ranges, or PCI bus/device/function number. The bus/device/function steering is the hardest to do, because you don't actually get a trap that carries a bus/device/function number in the x86 world. PCI config space access is done indirectly: you get an address cycle that goes through the 0xCF8 port and then a data cycle that goes through 0xCFC. So basically you have to trap the 0xCF8 write in Xen, hold on to that information, and then when the 0xCFC access comes in, you steer it to the right place at that point.

So what we did was put the necessary patches into Xen, which are pretty small on the whole; something like, I don't know, maybe 100 lines of code, not very much. I patched QEMU to use the new ioreq server abstraction, which was just some extra library calls in libxenctrl (there's a hedged sketch of those calls below), and then I created a new emulator from scratch, which really wasn't very hard. I mean, my first stab at a secondary emulator was just a dummy PCI device with an I/O BAR and an MMIO BAR, and I think in total that was less than 100 lines of code; it was pretty small. Then I ported the code that NVIDIA had written in the QEMU patch onto that emulator, so we had a standalone emulator that would emulate the NVIDIA PCI device, and I also piled a VGA emulator into that device model so that the two could talk to each other.

The implementation is still in domain zero with all that lot, though, which is obviously a bit concerning from a security point of view. Given that we don't have access to the source, we can't really audit it; all we can do is get security statements from NVIDIA. So in future I'd like to try to isolate it. Given that we now have a separate emulator that you can basically put anywhere you like, I thought the best idea would be to go with a disaggregated solution: we'd create an appliance domain for vGPU, pass the physical GPUs through to that domain, and run the emulator in that domain, which would then service the actual guest VMs.

And then we'd need some console implementation. Now, the console implementation we have at the moment is a bit on the hacky side. Basically what I did was create another graphics device model in QEMU which really isn't a device model at all; all it does is render a console on top of the video RAM from the guest. We split the video RAM up in a slightly funky way (there's a sketch of the split below). The aperture is 16 megs wide, and all we do is split it into four and twelve. So we tell the guest it's got four megs of video RAM. That means, if it's in text mode or in a mode that's not 32-bit colour depth, we software-scrape that video RAM and render an actual 32-bit colour depth graphical console in the other twelve.
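To make those "extra library calls" concrete, here's a minimal sketch of how a secondary emulator might register itself, using the ioreq-server calls as they later landed in libxenctrl (circa Xen 4.5). Names and signatures here are from memory, so verify them against your xenctrl.h; the port range and bus/device/function are arbitrary examples:

```c
#include <xenctrl.h>

/* A minimal sketch of secondary-emulator registration. These calls are
 * the ioreq-server interface as it later went upstream in libxenctrl;
 * names and signatures are from memory, so check them against your
 * xenctrl.h before relying on them. */
static int register_emulator(xc_interface *xch, domid_t domid)
{
    ioservid_t id;
    int rc;

    /* Create the ioreq server: Xen sets up the shared ioreq page, the
     * buffered ioreq page and the event channels behind this id. */
    rc = xc_hvm_create_ioreq_server(xch, domid, 1 /* handle_bufioreq */, &id);
    if (rc < 0)
        return rc;

    /* Steer a port I/O range to this emulator (the VGA registers, say)... */
    rc = xc_hvm_map_io_range_to_ioreq_server(xch, domid, id,
                                             0 /* 0 = port I/O, 1 = MMIO */,
                                             0x3c0, 0x3df);
    if (rc < 0)
        goto fail;

    /* ...and all config cycles for one PCI device. The 0xCF8/0xCFC
     * latching described above happens inside Xen; the emulator just
     * names the bus/device/function it owns. */
    rc = xc_hvm_map_pcidev_to_ioreq_server(xch, domid, id,
                                           0 /* segment */, 0 /* bus */,
                                           5 /* device */, 0 /* function */);
    if (rc < 0)
        goto fail;

    return 0;

fail:
    xc_hvm_destroy_ioreq_server(xch, domid, id);
    return rc;
}
```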
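And a toy model of the video RAM split just described; all names here are invented for illustration, not taken from the actual device model:

```c
#include <stdint.h>

/* Toy model of the 16 MB aperture split: the guest is told it has 4 MB
 * of video RAM, and the remaining 12 MB is where a 32bpp console gets
 * software-rendered when the guest isn't already in a 32bpp mode.
 * All names invented for illustration. */
#define APERTURE_SIZE   (16u << 20)
#define GUEST_VRAM_SIZE (4u << 20)
#define CONSOLE_OFFSET  GUEST_VRAM_SIZE
#define CONSOLE_SIZE    (APERTURE_SIZE - GUEST_VRAM_SIZE)   /* 12 MB */

/* Pick the surface the console should display: guest video RAM
 * directly when no translation is needed, otherwise the scraped and
 * converted copy in the upper 12 MB. */
static uint8_t *console_surface(uint8_t *aperture, int guest_bpp)
{
    return (guest_bpp == 32) ? aperture : aperture + CONSOLE_OFFSET;
}
```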
And then we map the actual QEMU console over that. In the case where the guest actually is in 32-bit colour depth, we put the QEMU console over the guest's video RAM and it just goes out without any translation at all. That works very well. And when you're actually running the NVIDIA device rather than emulating VGA, it DMAs a composited frame buffer 25 times a second into that video RAM, and so you just get a console automatically. It was just an easy solution. But if we go with the appliance domain approach, then I'd probably go with a LibVNC-based console directly in the appliance domain; I think that would make more sense. Although I've not used LibVNC, so there's still some investigation work to do there.

Another thing that we did was patch QEMU. I believe we probably don't have to patch QEMU to make it work. So I'd like to come up with the idea of an ioreq server that's basically a catch-all; call it ioreq server zero or something like that. You can talk to it using the old-fashioned HVM params and the standard interfaces, and if you don't have a secondary emulator that traps a particular I/O range or a particular PCI device, then everything goes to server zero and QEMU gets it by default, which is how it always worked. So you don't need to change QEMU at all. (That dispatch logic is sketched below.)

So, possible spin-offs from having the ability to run secondary emulators. Well, one of the things we've discussed in the past with XenServer is the idea of the Windsor architecture, which is disaggregation of XenServer into driver domains and service domains. But obviously one of the things you still need is emulated hardware for HVM guests that don't have PV drivers. So if you've got a QEMU running in dom0, as is traditional, and then you've got a driver domain with your network hardware in it, how do you get the I/O from QEMU in dom0 to the driver domain? Well, then you've got to build a PV path from dom0 to the driver domain, so you just increase the length of your emulated path. And possibly the performance may even be so bad that you'd actually notice it above the overhead of emulating I/O anyway. It's certainly going to be more complicated to set up. So given the ability to have disaggregated emulation, it would make more sense to just have a standalone emulator for the network running in the actual network driver domain. Similarly, if you had storage, you'd have an emulator running in the storage driver domain. So maybe we can do that. We certainly already have a user-space tapdisk process that serves as a PV backend, so we could just make it serve as an emulator as well; I'm sure it wouldn't actually be that hard.

One problem you have when you have multiple emulators, though, is the QEMU unplug protocol, which is a bit of a strange thing to drive. When you bring up an HVM Windows domain, for instance, you've got emulated hardware in there, and it has to use the emulated hardware to boot to a certain level. But once you're up and running inside the kernel, you want to unplug that emulated hardware, so you write to some magic I/O ports in QEMU and your emulated hardware disappears. Those I/O ports are implemented by the Xen platform PCI device, even though they're not actually part of the PCI device itself. But obviously, if that's running in one emulator and you've got your network backend running in a different emulator, the unplug is not going to work.
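Here's the catch-all dispatch idea as a self-contained toy. This is not Xen code, just the shape of the lookup; each server claims a single port range here for brevity, whereas the real table also steers by MMIO range and PCI bus/device/function:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the catch-all dispatch; not Xen's data structures. */
typedef struct ioreq_server {
    bool is_default;                 /* the catch-all "server zero" */
    uint64_t port_start, port_end;   /* claimed port I/O range      */
    struct ioreq_server *next;
} ioreq_server_t;

static bool claims(const ioreq_server_t *s, uint64_t port)
{
    return port >= s->port_start && port <= s->port_end;
}

/* Walk the registered servers; if no secondary emulator claims the
 * access, it falls through to the default server, so an unmodified
 * QEMU keeps seeing everything it always did. */
static ioreq_server_t *select_server(ioreq_server_t *servers,
                                     ioreq_server_t *server_zero,
                                     uint64_t port)
{
    for (ioreq_server_t *s = servers; s != NULL; s = s->next)
        if (!s->is_default && claims(s, port))
            return s;
    return server_zero;
}
```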
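And, anticipating the discussion of unplug alternatives that follows, here's a purely hypothetical sketch of what a first-class, hypercall-based unplug could look like. None of these names exist in Xen; they only illustrate the idea of replacing the magic I/O ports:

```c
#include <stdint.h>

/* Hypothetical first-class unplug interface; every name below is
 * made up for illustration and does not exist in Xen. */
#define HVMOP_unplug  64              /* made-up hypercall sub-op */

#define UNPLUG_CLASS_NIC   (1u << 0)  /* unplug by class, as today... */
#define UNPLUG_CLASS_DISK  (1u << 1)

struct xen_hvm_unplug {
    uint32_t classes;   /* UNPLUG_CLASS_* mask, or 0 for one device     */
    uint32_t sbdf;      /* segment/bus/device/function when classes==0,
                         * so individual devices can be unplugged too   */
};

/* Xen would turn this into an abstract "unplug" ioreq broadcast to
 * every registered ioreq server; each emulator unplugs what it owns. */
```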
So, for unplug, there's going to have to be some other way of doing it. Now, Julien already did some work on that. He created disaggregated QEMUs, which is probably what we want to use. To make the unplug work, he pushed the emulation of the I/O ports down into Xen, which I think is a good thing, it's the natural way to do it, and then created an abstract unplug ioreq which would be broadcast to all emulators. Whoever owns the network backends would unplug at that point; whoever owns the storage would unplug at that point. But I'd like to go a little bit further, maybe. One of the things I thought about was creating a new first-class interface in Xen: perhaps we could actually have HVMOPs to do the unplug, along the lines of the hypothetical sketch above. Obviously we'd need to support legacy guests that don't know about them. But if you had a new guest, perhaps we should move to a new interface, so that future front-ends know about these new hypercalls and will make them; and eventually, when the old front-ends disappear, we can get rid of the old I/O port emulations. Similarly, if we have patches to QEMU so that we can run fully disaggregated QEMUs, then we don't need to implement the unplug in QEMU using the old-fashioned I/O port emulations: we could have it understand these particular ioreqs, and then we could potentially tell the front-ends that there is no catch-all emulator there, so they don't have to use the legacy mechanism. And maybe we can unplug not just by NIC or disk class: we could perhaps look at unplugging individual discrete devices, which is actually very useful from a Windows point of view, where you may want to run one emulated device and one PV device. We certainly find that useful in logo testing, for instance, where if you have the same driver for all your disks, Windows happily off-lines one and then blue-screens because your system disk has gone away. So that's a future idea. Actually, I forgot to click through the slide there.

But I'll now attempt to move on to a demo. This may not work, because the network's been a bit flaky of late. But this is a host back home that I've got an NVIDIA GRID K2 in, which is two GPUs, and I'd just like to try starting a VM so you can see the emulator in action. You get a bit of noise out of xenstore when the toolstack talks to it. Obviously the font's a bit lousy there; I can increase that a bit. Yeah. Nice. The network's done a lock-up on me. Yeah, I was afraid this might happen. But you can basically see the emulator VGA is rendering a VNC console. XenCenter doesn't know anything about virtual GPU in this version, so it's still telling you it's got a GPU assigned and you'd better use RDP to talk to the guest, even though you don't actually need to. But if the network is still alive in any way, now that the VM is up, we might be able to log into it over ICA. Now, this is a host that's running in the lab back in Cambridge, and I'm talking to it over the VPN, over a Wi-Fi network, from Edinburgh. So the bandwidth ain't great, but ICA is actually surprisingly usable. You'll probably notice some H.264 artifacts on it, but basically it's there, and we may even be able to kick off the Heaven benchmark and get some meaningful interaction out of it. It takes a little while to load. I mean, the actual frame rate we're going to see is nowhere near what the application is actually rendering at.
It has an FPS counter in the top right; that's what it's actually doing. We're probably only seeing about 10 to 15 frames per second at best. The NVIDIA vGPU model limits the frame rate at around 60. Sorry? Yeah, the default settings, yeah. It'll take a little while to ramp up now, but you should be able to see the FPS counter in the top right start to increase. We're seeing massive dropouts occasionally as network congestion hits us, but you can see it's basically doing around 50-odd frames a second. And that's using a device model which gives you a quarter of the K2 GPU, so using that device model on all your VMs, you basically get eight VMs on that particular card; it's got two physical GPUs on it. So yeah, I'll leave that running and take any questions.

So have you considered... you've got your multiple I/O requests thing.

Yeah.

Your I/O request providers, which sounds cool. Have you considered changing QEMU to explicitly register?

Yes, yeah, exactly. I would advocate doing the catch-all thing for QEMU compatibility, obviously, because we want this to work with older QEMUs, potentially without modification, since this is a distinct emulator for a distinct use. But in future we would like to use multiple QEMUs, and once we go down that route, they would need to explicitly register their I/O ranges, so we would have to add the code to QEMU to do that. But it would mean that you could take an old QEMU from a distro and still be able to use it as a single emulator.

I have a follow-up question, which is, with respect to the unplug protocol, if you... ...pass these unplug requests, then in principle you could have some kind of snooping I/O request registration, where each QEMU would say, well, I need to know about these writes, but I don't care about saying whether they've succeeded or not.

Yes, indeed you could do that, but since you're potentially going to the trouble of registering explicit I/O ranges in QEMU and making modifications anyway, why wouldn't you just add an extra ioreq type at that point to say, I'll listen for unplug ioreqs? You may be able to build more functionality into them in the future.

Well, if I were a hypervisor maintainer, I wouldn't want that weird shit in the hypervisor.

Well, I mean, the old I/O ports, basically; that's why I suggest adding new HVMOPs to make it more explicit. I think keeping the old I/O ports there is just a legacy thing we potentially have to support. But I certainly wouldn't want to add any more to that interface than is already there; I'd prefer to replace it with something much better.

OK. I don't have a question per se, but I've already done exactly what you suggested to Julien's ioreq server patch, so that it has no catch-all. It's currently in the XenClient tree and it's shipping in the current version of XenClient Enterprise.

OK, cool. I'll just grab it.

Actually, I should also point out that I asked you about the translate hypercall yesterday. Well, we basically added the translate hypercall in to make this work as well, so I suggest we just put that back in upstream.

In the architecture where you have the service VMs, one for network and one for disk, for example, and you run the network backend in the network service VM along with the network emulator: do you have a way to somehow restrict the operations of the emulator, or can it just map any random page of the machine? Because otherwise...

Yeah, I mean, at the moment, yes. Obviously, the emulator does run in dom0.
It can map any old page it likes in the machine. We'd probably have to come up with some mechanism to restrict that. I'm not entirely sure what that would look like yet, but yes, some sort of policy would definitely be the right way to go.

Do you know why NVIDIA is not using an SR-IOV approach?

My suspicion is it's just expense. I mean, SR-IOV is going to cost you in silicon to implement that stuff, and there's probably a whole load of compliance stuff you have to go through to make SR-IOV work. So why would you? You can do the job with an emulator that's actually pretty small. And also, having an emulator there has advantages. At the moment, none of this stuff can be migrated, because there's actually no way of sucking out the state and replaying it into new hardware, but you could imagine doing that with an emulator. Most of the state is held in the guest anyway, so all you're doing is taking the register state, which you've got in the emulator anyway, packaging that up, shipping it somewhere else, and then, as long as you can replay that into another virtual instance and say, go and start reading the DMA queue from the guest again, then potentially you could migrate this. With SR-IOV it's not clear how you'd do that.

Which kind of NVIDIA card would support this technology?

At the moment, it's just the GRID K1 and K2, if I can actually get this network to talk again. Yeah, I don't know price-wise what the cheap end of it is. I mean, the K1s are cheaper than the K2s; they actually have four GPUs on the K1, but they're smaller, lower-power GPUs.

But if I run... I was just curious if it was something more mainstream and the card was just...

Actually, if I go in on the console, maybe that would work better. So at least I can show you: if I just run nvidia-smi, which is NVIDIA's little tool, then you can see what you've got there. That's actually two GRID K2s in there, so there's two GPUs on each, and you can see the vGPU processes running at the bottom there. Another thing is that, whilst that Heaven run is going, if I run top you should be able to see the vGPU process appear eventually. But actually, no, it's really not using very much CPU. Once you're up and running, you've basically given a slice of the hardware to the guest and set up all the mappings, so it really doesn't have much to do at that point; it's all done by the hardware. We done? I think we're done.