Yes, so I've just been informed that I'm using the old Xen.org logo. Right — so, as many of you know, my name is George Dunlap, and I'm going to be giving a technical deep dive on PVH, the new virtualization mode. PVH has been talked about for quite a while, and it's been worked on for quite a while; however, the patches haven't been submitted yet. And although people have a high-level idea of what PVH is supposed to be about, there isn't a widespread understanding of the technical details of how it's implemented, and there aren't a whole lot of people who have actually done the work of digging into the patches and understanding them. So my goal for this talk is to give you a technical overview of PVH at a relatively detailed level — for one, even if you're not going to do any programming, so that you can understand the characteristics, advantages, and disadvantages of the mode when you decide whether or not to use it; and hopefully so that many of you can actually approach the code, understand what's going on, and help us fix issues with it and improve it.

In order to do that, I'm going to start with an overview of PV and HVM and some of the things we wish were better about them, and then talk about PVH. Then I'm going to do a fairly technical description of PVH from Xen's perspective, then PVH from Linux's perspective, and then we'll talk about some of the outstanding issues with PVH that still need to be sorted out.

OK, so: why PVH in the first place? The big issue with PV as we have it right now is the PV MMU code in Linux. Most of the pvops are out of the way and not very expensive, and most of the Xen code in Linux is in its own little area, where the rest of the Linux crowd is happy for it to live and have its own life without having to worry about it. But the PV MMU code is right in the guts, in the middle of the x86 code, and this has caused a lot of pain for the kernel community as a whole. It turns out it's very easy for people who aren't even touching the Xen code, or anything related to Xen, to accidentally break things that work in Xen — or vice versa. And so this causes unhappiness for the Linux maintainers, the x86 maintainers, and the Xen maintainers, and we would all much rather be doing something else.

Another issue with PV is that in 64-bit mode, 64-bit hypercalls are very slow. The reason for this is that we need at least three levels of protection in order to run a hypervisor: hypervisor mode, guest kernel mode, and guest userspace mode. When Xen was being developed for 32-bit operating systems, there were user and supervisor modes, which could easily be used to separate the guest user from the guest kernel. We still needed something to protect the hypervisor from the guest kernel, and there was this thing called the segment limit which basically no one else was using, so we commandeered that and said, fine.
We'll use this to help protect the hypervisor from the guest kernel. Now, about the same time that Xen was finding a new use for this basically unused processor feature, the AMD folks were coming up with their new 64-bit architecture. I wasn't involved in the discussions, obviously, but what I presume happened is they said: look, here's this thing that nobody actually uses, and every additional feature in a processor is expensive to implement — so they just got rid of it. Which means that for 64-bit we don't have three levels of protection anymore. So in order to implement three levels of protection, the guest kernel for 64-bit has to run in user mode, and every time you switch between guest user and guest kernel in 64-bit PV mode, you have to go up into Xen and do at least some level of page table flushing and so on. So 64-bit hypercalls are very expensive.

[Audience correction] Oh — I'm sorry, yes: 64-bit system calls. Sorry about that. It's 64-bit system calls that are slow, not hypercalls. Thank you.

So, we could use HVM. HVM basically makes an entire clone of rings zero through three, so you again have four levels of protection for the 64-bit system calls. However, it has some things that we're not exactly happy with. First of all, you have to have a QEMU process, which is an extra level of complication and extra memory that needs to be managed. You're still doing a legacy boot, so you start in 16-bit mode and have to work your way all the way up to 64-bit mode. And there are a number of devices that have to be emulated inside of Xen — the virtual APIC, virtual timers, and so on.

So the idea that has been kicked around for a long time now is doing PV in an HVM container: take the best aspects of PV and of HVM and make a new mode that combines the best of both. About two years ago, Mukesh Rathor at Oracle began the work of implementing PVH, and he started posting patch series to xen-devel at the beginning of this year — January, so that's ten or so months ago now. It's gone through a lot of iterations, and I have taken the most recent set and done some significant revisions to them. So what I'm going to be describing today is the state of the art — what things are like right now. The patches haven't been checked in yet, and things are not in their final form, but if I give you the state of the art right now, you can understand where things stand, and maybe engage in the discussion and see where things are going. And I do want to emphasize: Mukesh did the vast majority of the work for this, and he gets the credit.

As for why I did the revision: for one thing, there were a bunch of things the Xen maintainers weren't happy with, and I thought I could do a decent job of changing some of those things. But one of the big things was that I didn't actually have a very deep understanding of PVH mode and the interface between Linux and Xen, and I wanted to get a better understanding of it. I thought that forcing myself to rewrite the patches — and also forcing myself to give this talk — would enable me to get a much better understanding. So I'm happy for people in the audience to correct me if I'm making mistakes; but hopefully the fact that I'm not a super expert in it, though I've learned quite a bit, will mean that I can make it accessible to other people who aren't experts either.

Right, so: PVH from Xen's perspective. At a very high level: you begin with an HVM guest; you disable the HVM-specific things that you don't need; you start it in 64-bit mode and you keep it there; and then you enable a PV path for a handful of things. That's the super-high-level view. Going into more detail — the things that get disabled, from Xen's perspective.
So you disable the device model, QEMU, and that means disabling all of the MMIO emulation. You disable the emulated hardware — Xen emulates a number of things that are required for performance reasons, APICs and PITs and such — you disable that for PVH guests. You disable nested HVM, and MSI-X... and I think that's it, yes.

Then, to put it in 64-bit mode, you set initial values for CR0, CR4, and EFER to put it in 64-bit paging mode. It turns out there are a handful of things that need to happen in HVM mode when you transition from non-paging mode into paging mode; so for PVH guests you need to make a special call so that this can happen when the vCPU first boots, because it's never going to switch from non-paging into paging mode — it starts in paging mode. And then you basically just stop the guest from changing the paging mode: you disable writes to EFER, and you don't allow the guest to change the paging-related bits in CR0.

As far as PV paths: you enable a set of PVH hypercalls — if you're really interested, I have a slide later with the exact hypercalls in it. You need a PV E820 map: because you're not starting with a BIOS, you need a PV way of getting the memory map. PVH has a special way of doing vCPU boot, and I'm going to cover that on the Linux side of things. We have PV CPUID: there are a number of special cases for CPUID that have to do with dom0 — if you look in the code, there's a whole bunch of extra things that have to be done for dom0 — so we just take the PV path for that and give you that answer. And at the moment we take the PV path for programmed IO; we're going to talk about that a little later in the issues section.

So, from Linux's perspective, sitting inside of Linux: some simple things — for PVH, xen_hvm_domain() is false and xen_pv_domain() is true — but the biggest thing is that you just have Linux act natural. There's a huge number of special cases for PV right now.
You no longer have to do any of them. One of the interesting things is that the fact that we're doing what's called auto-translation — which means the guest is in charge of its own page tables — has a number of interesting side effects that are important to understand, so I'm going to go into that. And then we use the PVHVM callback vector setup, and there's a PVH-specific vCPU bring-up, which I'll describe.

All right, so, things disabled — this is a nice list to have. There's no PV IDT. There are no PV IRQ ops — no special callbacks that happen after an IRQ. There's no PV CPUID. We have the native syscall entry. There are no PV VM assists, and no event and failsafe callbacks. There's no need to set the IO privilege level when doing certain kinds of IO. And in particular, for MMU ops: there's no need to pin the page tables, no need to do pfn-to-mfn conversion, no need to special-case the page table protections, and there's only one PV MMU op that we need to keep a special case for, and that is "TLB flush others". Presumably that's because it's cheaper and easier to simply ask Xen to flush the other guys' TLBs than to go through the whole work of sending IPIs and things like that yourself. So this is much, much cleaner, and should hopefully result in a much better interface with Linux.

Now, auto-translation. In the PV case, page tables are controlled by Xen, and we have the real mfns inside the page tables. So you get things like: when a PV guest talks to Xen, it can say, here's a grant table, or here's an mfn — please map it here.
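To make the auto-translation point concrete, here's a toy model — purely my own sketch, not Xen's actual data structures or hypercall interface — of a p2m table that stands between the guest's frame numbers (gpfns) and the real machine frames (mfns), including the "punch a hole in the p2m, then put a foreign mfn in that p2m entry" pattern:

```c
#include <assert.h>
#include <stdint.h>

#define P2M_SIZE    16
#define MFN_INVALID (~(uint64_t)0)

/* Toy p2m: indexed by the guest's frame number (gpfn), holding the
 * real machine frame number (mfn) that backs it. */
static uint64_t p2m[P2M_SIZE];

void p2m_init(void)
{
    /* Pretend the domain's memory is backed by mfns 100..115. */
    for (int i = 0; i < P2M_SIZE; i++)
        p2m[i] = 100 + (uint64_t)i;
}

/* For a special page (e.g. a grant frame), first make a hole in the
 * p2m; returns the mfn that used to back that gpfn. */
uint64_t p2m_punch_hole(uint64_t gpfn)
{
    uint64_t old = p2m[gpfn];
    p2m[gpfn] = MFN_INVALID;
    return old;
}

/* Auto-translated foreign mapping: instead of "map this mfn at this
 * address", the request is "put this mfn in this p2m entry"; the
 * guest then points its own page tables at the gpfn. */
void p2m_map_foreign(uint64_t gpfn, uint64_t foreign_mfn)
{
    p2m[gpfn] = foreign_mfn;
}

/* Roughly what the hardware does with a gpfn found in a guest PTE. */
uint64_t translate(uint64_t gpfn)
{
    return p2m[gpfn];
}
```

The point of the sketch is only the ordering: in the auto-translated world, every mapping has to go through the p2m, whereas in classic PV the guest could hand Xen a raw mfn directly.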
It doesn't actually have to worry about the guest pfn — the mfn can be used just the way it is. In the PVH case, the page tables are controlled by the guest, and what it writes in the guest page tables is what's called a gpfn: the guest's idea of the pfn, not the actual backing mfn. And so every page that is mapped in an auto-translated guest like PVH must be in the p2m. This is one of the key differences between PV and PVH, and it has a number of side effects. For any kind of special page, like the grant frame, you have to make a hole in the p2m and then map it in there. Any time you map a foreign page — if you're running QEMU or something like that in dom0, for instance — rather than having the hypercall that says "map this mfn at this address", you have a hypercall that says "put this mfn in this p2m entry", and then you can map it yourself by writing your own page tables. And there are a number of similar things like that.

OK, so, the PVH vCPU bring-up. The vCPUs are brought up via hypercalls, just like in the PV case — presumably because in the HVM case you have a local APIC and things like that which you can use to send the bring-up messages. However, because the guest controls the IDT, a lot of the PV code for loading up the processor can't actually be guaranteed to work properly. So what we do instead is set only a handful of registers. In the PV case you would typically set almost all of the context for the vCPU before bringing it up; in the PVH case you set only the guest GS and a couple of other things — CS and EIP — and the rest of the state has to be set by the CPU itself as it boots. So, a minor difference there.

OK, a couple of things in PVH that are not yet working at this point. 32-bit hasn't been implemented yet. Virtual TSC — the only TSC mode supported is reading the hardware counter directly from the CPU. Shadow mode is not yet implemented; this is not a guest-visible thing, but if we have some time I can talk a bit later about HAP versus shadow. And vCPU hotplug on the Linux side is not yet implemented. And I'm sure there's a larger number of FIXMEs in the code that haven't been fixed either.

All right, so, a couple of issues. One is that the original idea of PVH was to have a lightweight container: rather than having the full heavyweight HVM with all the different stuff in it, you would get rid of most of that stuff and have a nice lightweight little container that does just a few things — and surely that should be faster. However, the reality was that just getting the minimum functionality of HVM requires most of the code that's already in the HVM path. Does that make sense? So Mukesh started by saying: instead of using the existing HVM code, we'll make an entirely new thing, a different path that we use for PVH. But as it turned out, the two paths were — I don't know the exact number, but say 70% the same. The result was a very large amount of code duplication, and this is really bad from a lot of perspectives: it's hard to see exactly what's going on, and it's hard to maintain and debug things going forward. So in the current patches that I've done, we use the HVM path for the PVH container, and we just put in a couple of special cases for PVH. One of the results of that is that the patch series itself is a lot smaller — there's only a handful of things you need to change.

Another issue that we're in the process of dealing with right now is PIO instructions. There are a number of things that the PV path will give you: it allows the guest direct access to PIO instructions. There's apparently a thing called the PV PIT — I'd never heard of this before, but if you look into the PV code,
there are a couple of special cases where we say, OK, here's a PIT — and some guests may want to use that. And there are a couple of other tricks you have to do to be able to access the PCI config space. But one of the doozies is that many platforms use special IO instructions not as normal programmed IO — which is just a read or a write — but as something that actually acts as a function call into the platform firmware. SMM — System Management Mode — is a thing that a lot of the people who write platforms, motherboard manufacturers and such, use to get little hooks in under the operating system. And what we have to do to make that work is to actually execute the exact same IO instruction in Xen, using the guest's general-purpose registers. So there's a whole thing where, on the stack, we make a little tiny function; we load up the guest's registers; we execute the one IO instruction, which then traps into SMM; SMM does whatever magic it's going to do and changes the general-purpose registers; and when we return back into normal mode, we take all the general-purpose registers, load them back into the guest context, come back into Xen, finish the thing we were doing, and return into the guest — which then, of course, takes the general-purpose registers and puts them back into the—

[Audience] Can we nuke the BIOS authors from orbit instead? — Unfortunately, no.

So anyway. Right now there are a lot of special cases in the PV path that we have to have if you're going to run dom0 as PVH — which is one of the main purposes for which Oracle started the PVH work. However, there are some problems with this method. Right now, from the HVM code we just call into the PV IO path. One problem is that, in order to allow the PV code to properly emulate things for PVH guests,
there were a number of ugly changes we had to make. Probably three or so of the patches in the series were fairly ugly changes that had to do with making this possible — most of the other changes are very straightforward, just simple switches, but these are not very nice. Moreover, there's apparently a race condition, because there's a sort of double checking: there are checks that happen in the hardware before you even get the exit, and then after that there are other checks that Xen does, and the fact that there's a race between these two sets of checks is a potential security issue. We've literally just been talking about this earlier this week.

The basic reason that a lot of this stuff happens in the first place is that we need two sets of access controls: one for user processes, and one for the guest. The guest operating system needs to be able to say "this process is allowed to do these things", and then Xen has to be able to say "this guest is only allowed to do these certain things", right?
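As a toy illustration of those two layers of access control — my own sketch, with entirely hypothetical names and layout, not Xen's or Linux's actual representation — an IO-port access conceptually has to pass both the guest kernel's check on its user process and the hypervisor's check on the guest as a whole:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t allowed[65536 / 8];   /* one bit per x86 IO port */
} io_bitmap_t;

static bool bitmap_test(const io_bitmap_t *bm, uint16_t port)
{
    return (bm->allowed[port / 8] >> (port % 8)) & 1;
}

static void bitmap_set(io_bitmap_t *bm, uint16_t port)
{
    bm->allowed[port / 8] |= (uint8_t)(1 << (port % 8));
}

/* Layer 1: the guest kernel's policy for one of its user processes. */
bool guest_allows(const io_bitmap_t *process_ports, uint16_t port)
{
    return bitmap_test(process_ports, port);
}

/* Layer 2: the hypervisor's policy for the guest as a whole. */
bool xen_allows(const io_bitmap_t *domain_ports, uint16_t port)
{
    return bitmap_test(domain_ports, port);
}

/* An access goes through only if BOTH layers permit it. */
bool io_access_ok(const io_bitmap_t *process_ports,
                  const io_bitmap_t *domain_ports, uint16_t port)
{
    return guest_allows(process_ports, port) &&
           xen_allows(domain_ports, port);
}
```

The PV problem described above is that there is only one hardware mechanism available for these two logically separate checks, so one of them has to be done by emulation in Xen.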
So in the PV case we only have one set of hardware permissions to do that with, and therefore we use the hardware to enforce the guest's userspace permissions, and we use emulation in Xen to enforce the guest permissions. PV only has one set; but in PVH we actually have two potential sets of permissions, because the guest has its own HVM state, and then the VMX code itself has another set of controls that we can use on behalf of the guest operating system. So it's possible that we may not need to do all this crazy IO emulation any more, and those crazy IO instructions which need to be executed on the stack may be able to be executed in the PVH context instead. That would also be a big win for Xen as a whole, because then we could get rid of some of these crazy paths. But that's just an idea at this point — we'll have to see whether it actually works.

OK, so I have a little bit of time here, and one more thing I want to bring up about PVH: HAP versus shadow. There are a couple of key differences between the two. HAP is hardware-assisted paging — also known as EPT in the Intel world, or, in the AMD world, SVM I think it was called. The basic idea is that you let the hardware do the translation instead of having Xen do it. So if you do a page table update under HAP, it's really, really cheap: it's just a memory write, exactly the same as the cost of a page table update in bare-metal mode. In shadow mode, there are the guest page tables, and we keep a shadow copy of those page tables inside of Xen.
Ultimately, Xen has to be involved in every update: every time something is updated in the guest page tables, Xen has to update it in the shadow page tables before it can actually be used. Now, before HAP came out we did a lot of work optimizing this path, but ultimately Xen has to be involved in every single translation before it's used. So page table updates under shadow are very, very slow.

Then there's TLB effectiveness. HAP is using the same hardware, and TLBs on x86 only have 16 entries; however, each entry can point to a 4k page, a 2-megabyte page, or a 1-gigabyte page. So HAP allows you to use superpages to make more effective use of the TLB. I think this is another one of the reasons Oracle is particularly interested in HAP in PVH mode: to be able to use superpages for databases, where this is pretty important. In shadow mode you can allow guest superpages in the guest page tables, but when those are mapped into the shadow page tables they are still only 4k pages at the moment, and so shadow will not give you the higher TLB effectiveness that HAP has. So again, here's another place where HAP is a definite win.

However, when it comes to TLB miss costs, things are turned around. Under shadow, the worst case with a 64-bit guest is that you may have to do four memory reads — levels one, two, three and four may each require you to read memory. Under HAP, your best case — if your guest is using superpages and your p2m is able to use superpages — is nine memory reads; and in the worst case, if your guest is using 4k pages and you don't have superpages in your p2m, you'll have 16 memory reads — [from the audience: 25] — oh, right.
OK, so my numbers there were wrong: the best case is three per level, so that should be three times four — 12 — and 25 in the worst case. Anyway, it's a lot more than four; it's even worse than I thought. So, does this matter? I haven't done any performance tests recently, but it used to be the case that a kernel build under shadow would be 30% slower than under HAP, because a kernel build has very, very good TLB locality and a huge number of page table updates. However, for other things like SpecJBB — SpecJBB sets up the page tables once and never touches them again, so there's basically zero page-table-update cost for shadow, but it has very, very poor TLB locality — the result is that, typically, when we've done this kind of measurement in the past, shadow is about 30% faster than HAP. All this to say that shadow isn't dead — it will still be important in the future. So, moving to PVH: one of the effects is that we now have to deal with this HAP-versus-shadow question, and that's something we're going to have to continue looking at in the future.

OK, that's all I have. So: we talked about PV, HVM, and PVH; we looked at PVH from Xen's perspective and from Linux's perspective; and we talked about a number of issues in PVH. Hopefully I've now given you a technical overview, and you'll be able to better understand the characteristics, advantages, and disadvantages — and hopefully many of you will be able to approach the code, to improve it and to fix it. And with that, I'll take any questions.

[Question] A comment: with hardware-assisted paging, if you're using large pages, then those 25 accesses go away, and the SpecJBB slowdown goes away. Have you tested that?

OK — so you'd suggest maybe doing another test, looking at PVH only with large pages, or something like that.
[Questioner] Yeah — so in general, there's basically no case where shadow is faster than hardware-assisted paging if you're doing it right. — So you're talking primarily about the guest side? — About Xen making sure that you're using superpages in the p2m. — Yes; Xen tries very hard to always do that, right. My second question was about not emulating an APIC. With the hardware APIC virtualization that's coming in the Haswell processors, aren't you concerned that not being able to do posted interrupts would actually make PVH slower than HVM in the long run? Have you considered doing APIC emulation so that you can take advantage of future hardware support?

I haven't looked at that at all. Does anyone else want to comment on that?

[Question] You were mentioning the boot case, where I was not completely clear. When you're booting a Linux kernel in the PVH case, you're going to jump directly into the kernel at a spot where paging has already been turned on?

So that's what happens in PV mode right now, OK? In PV mode you have the domain builder, which — inside userspace — constructs the PV kernel and initrd and then jumps right to it, and the paging modes are already set up.

[Question] But you cannot do that in the PVH case?

Yes — yes, you can. When I was discussing the implementation of that, it was with respect to HVM, because I was talking about the changes from HVM into PVH: in HVM you start in real mode and then you go through 16-bit, 32-bit, and so on. OK, more questions? My god, you're making me run. I'd suggest the next question come from the guy over there.

[Question] But does it make sense — you describe, for example, that we don't have to start in real mode, and that things can be done in an easier way. But on the other hand, on the guest side all that code already exists anyway; it's already there, it's already implemented.
You're not going to have to implement it again. So the real reason to have something like this would be performance — and so how much performance gain do you get on top of PV-on-HVM?

Right — that's another thing I haven't had time to actually test yet. Sorry — that was a joke; sorry, I wasn't positive until you actually got there. Yes — so the main reason, one of the big things that Oracle wanted from this, their target, is a PVH dom0. You basically can't run a normal PVHVM kernel as dom0. You could try — there was a talk earlier today about running a pure HVM dom0, and we don't know the technical details of how much harder or how much uglier that may be than just running PVH. But PVH is essentially the only alternative to a PV dom0, and PVH will certainly be better in many cases than a PV dom0 because of the 64-bit system call overhead.

OK, we've run a bit over time; I can still take one more question. No? Thank you. Thank you.