I hope it is. Okay. So, hello everyone, and let me welcome you to FOSDEM and to our Virtualization and Infrastructure-as-a-Service devroom. I'm the first speaker of the day, and in my presentation I'm going to talk about how you can run Windows guests on KVM efficiently.

So, in your infrastructure you're running virtual machines, and some of these virtual machines are Linux VMs, and some of them are probably Windows VMs. Does it make any difference, from the virtualization stack's point of view, which operating system you are running in your guest? Well, it depends. In theory it doesn't, because with QEMU and KVM we are actually trying to emulate some existing physical hardware when building a virtual machine, right? But then, if you boot your Linux guest on KVM and take a look in the kernel log, you will see something like this, and you will realize that your guest knows pretty much everything about the fact that it's running virtualized: it knows that it's running on KVM, and it's actually using some paravirtualized features.

So, why do we do that? Well, the key thing is that when we are trying to emulate physical hardware in software, some interfaces were not designed for that, and emulating them can actually be slow in some cases. How do we usually solve these problems? Well, if the hardware interface we need to emulate is slow, and we cannot make it fast, we come up with our own solution: we invent a so-called paravirtualized interface, which is fast and which is software-friendly, right? But then, when we have our own interface, we have to put support for this interface into the guest operating system, because it doesn't know anything about it. But the question is, what do we do about proprietary operating systems like Windows? How do we put these interfaces there? We don't have the source code. Well, we can probably try writing drivers, and that's actually what we do, for example, with virtual devices, right?
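To make the slide he is pointing at concrete: a sketch of the kind of kernel log lines a Linux guest on KVM typically prints (the exact wording varies between kernel versions, so treat these lines as illustrative, not verbatim):

```shell
# A freshly booted Linux guest reveals in its kernel log that it knows
# it runs on KVM and uses paravirtualized features:
dmesg | grep -iE 'hypervisor|kvm'
# Typical lines include:
#   Hypervisor detected: KVM
#   kvm-clock: Using msrs 4b564d01 and 4b564d00
#   clocksource: kvm-clock: mask: 0xffffffffffffffff ...
```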
But the thing is that not everything is a device from Windows' point of view, and some very core features of it, like interrupt handling or the clock source, are actually not devices, not drivers; they are in the core of the operating system. So you may have a hard time writing such drivers for a proprietary operating system. And moreover, there are multiple different Windows versions, and you basically have to check that your solution works for every one of them.

So, what else can we do? Well, we know that KVM is not the only hypervisor out there. There are other, proprietary hypervisors, and the thing is that these hypervisors have to solve the exact same issues, because for them these hardware interfaces are also slow, and they also had to come up with their own interfaces. In the Windows world, this hypervisor is called Hyper-V, and we do emulate Hyper-V in both KVM and QEMU. There are basically two different kinds of emulation there. We emulate the core features, which in the Hyper-V world are called enlightenments, and that's where the "enlightening" in my talk's title comes from; I'm going to talk about this first part. Device drivers are something which would make it possible to replace, for example, virtio: if we write VMBus device drivers, then we won't need virtio drivers for Windows. There is such an effort, and Virtuozzo, as a company, is currently working on it, but it's not currently upstream, and I'm not going to talk much about it in my presentation.

So, the Hyper-V features which we emulate: where can you read some documentation about them? There is none in QEMU and KVM for you as a user. And in libvirt, you get this; that's basically it. Probably not much; you may or may not understand what these features are. And if you want to know more, you can go and basically read the specification: the Hyper-V folks were generous enough to publish their spec, the Hypervisor Top-Level Functional Specification, on the Microsoft website. Or you can listen to me now.
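For reference, the libvirt side he shows looks roughly like this (a sketch of a domain XML fragment; element availability depends on your libvirt version, and note that hv-crash is exposed through a `<panic>` device rather than a `<hyperv>` child element):

```xml
<!-- Inside <domain>: most enlightenments live under <features>/<hyperv> -->
<features>
  <hyperv>
    <relaxed state='on'/>
    <vapic state='on'/>
    <spinlocks state='on' retries='8191'/>
    <vpindex state='on'/>
    <runtime state='on'/>
    <synic state='on'/>
    <stimer state='on'/>
  </hyperv>
</features>
<!-- The Hyper-V clock source is a <clock> timer, one of the "notable
     exceptions" that is not configured under <hyperv>: -->
<clock offset='utc'>
  <timer name='hypervclock' present='yes'/>
</clock>
```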
So, what features do we have in KVM, and what are they needed for? I'll be showing you both the QEMU syntax and the libvirt syntax for enabling each feature, and I'll tell you a few words about what each feature does.

So, let's start with this one. It's called relaxed timing. It's enabled by hv-relaxed in QEMU, and in libvirt most of these Hyper-V enlightenments are enabled like this, in the hyperv element under features, but there are some notable exceptions which I will show you. This feature basically tells your Windows that it's running virtualized, so it should disable all hard watchdogs on different events, because different operations can take different time when you're running virtualized, right? So, if you keep a hard watchdog there, your Windows can crash. And actually, modern Windows versions don't require this: they will detect the hypervisor CPUID flag and enable it automatically. But for all Windows versions it makes sense to enable it.

Paravirtualized APIC. It's enabled by hv-vapic, and it basically provides a shared page for each CPU to assist with dealing with the APIC. And the notable feature here is paravirtualized end-of-interrupt. Here is a good example of when emulating a hardware interface is slow. When you have an interrupt, say a level-triggered interrupt, pending, your hypervisor will stop your guest, inject the interrupt there, and resume your guest. Your guest will notice the interrupt and will start doing something about it, like launching an interrupt service routine. But when it's done, it needs to somehow signal the fact that it's done with the interrupt and it's ready to receive the next one, right? And in hardware, in a physical APIC, you basically write to a register. The operation is pretty fast, right? You write to the register, it resets the bit, and then you can receive the next interrupt. But if you do it under a hypervisor, you will get a VM exit, right?
So, your guest will be stopped, you will drop into the hypervisor, and the hypervisor will basically mark that the interrupt is not pending anymore and resume your guest. It takes time. So, the so-called PV end-of-interrupt was invented. The guest just clears one bit in the shared page, and the hypervisor will periodically look at this bit; when it's not set anymore, we are ready to inject the next interrupt. We don't need to do it synchronously most of the time. And there is a side effect: this feature is also required for the enlightened VMCS feature. I will tell you about that feature later.

Paravirtualized spinlocks, enabled by hv-spinlocks, where you can tell QEMU how many spin attempts to make before giving up. The thing is, there is a core concept of a spinlock, right? When two CPUs are trying to get the same resource, they may use the cheapest possible locking: basically checking a variable in memory to see if somebody else is doing something with the shared resource. You set it to one, you do something, you reset it, right? The other CPU looks at it: oh, it's busy, someone else is doing the job, and it just spins. It doesn't do anything; it constantly checks the state of this indicator to see when it can proceed. In the virtualized world this can take significantly longer, because the virtual CPU which actually took the resource may not be running at this moment, right? It can happen that it took the resource and then it was scheduled out, and some other guest is running there. So your CPU which is trying to get the lock will have to wait for quite some time. Instead, we can basically give up and give a chance for other virtual CPUs, or other guests on the same physical CPUs, to run, right? And that's what the feature does. We also have a counterpart in KVM itself, but Windows cannot use that KVM feature, so we enable this Hyper-V feature instead.
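Putting the flags covered so far together, a QEMU command line might look like this (a sketch: recent QEMU accepts the hv-* spellings shown, while older releases used underscores like hv_relaxed; the image name is a placeholder):

```shell
# Sketch: enable relaxed timing, the paravirtualized APIC (with PV-EOI),
# and paravirtualized spinlocks with 0x1fff (8191) spin attempts before
# giving up. Flag spellings vary across QEMU versions.
qemu-system-x86_64 \
  -enable-kvm \
  -machine q35 \
  -cpu host,hv-relaxed,hv-vapic,hv-spinlocks=0x1fff \
  -m 4G \
  -drive file=win10.qcow2,if=virtio   # win10.qcow2 is a placeholder
```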
The virtual processor index, hv-vpindex, basically creates a virtual model-specific register where each CPU can read its own index. In KVM these almost always match the order in which the vCPUs were created: CPU one will get one, CPU two will get two. But the thing is that we need this model-specific register for some features I'm going to tell you about. If Windows doesn't see this feature, it won't use PV TLB flush and PV IPIs, for example, because in those hypercalls CPUs are actually specified in VP-index terms.

Runtime information, right? So, you have a virtual CPU, and sometimes it runs, sometimes it doesn't, and some other virtual CPU, or the host, is doing something on the physical CPU. And if you want to do some fair scheduling, for example, you may want to give your tasks equal slices of time to run; but the thing is, you think that your task is running when actually it's not, and something else is running there, and how can you know that, right? So there's a protocol, again a model-specific register, where Windows can read the information about for how long the VP was actually running and for how long something else was running there. But the thing is, the way it's done in Hyper-V, it's done through a model-specific register, not a shared memory page, so reading it will trap into the hypervisor. So it's kind of slow, and Windows, as far as I know, doesn't use it for scheduling by default, because switching between tasks would become really slow. I'm not exactly sure when it actually does use the feature, but maybe sometimes it does.

Crash information, that's quite interesting. So, your Windows crashes; everybody knows that, right? So you will get a blue screen of death, but the thing is that not all of them are the same, right?
So, especially if you're running VMs at a larger scale, you may want to know whether you're actually seeing the same crashes on different hosts or different crashes, and how many different crashes you have, so you can analyze them. Windows can provide some information on crash, basically five registers, I think, and you can get this information. If you enable the feature, and you're running QEMU through libvirt, you can get this information too; I think you need to use a QMP command, so it's not that easy to get it from libvirt, but you will get it in the log by default, I think. Windows will tell you basically where it crashed and some parameters, the registers. By comparing these in the logs, you can see whether you're seeing the same crashes or different crashes; it can come in handy in some situations.

The clock source is actually one of the most important enlightenments, and the thing is that in some workloads we need to get timestamps pretty frequently.
For example, we're trying to timestamp records in a database, or network packets, so your operating system will constantly be reading from whatever clock source it has. But the thing is, which clock source is it trying to access? On physical hardware, nowadays it's usually the TSC, a register in your CPU, which is usually good. But in a virtualized environment you cannot rely on that, because your VM can, for example, migrate, and there's going to be a jump in the TSC value, and the jump can actually be backwards. Not nice. So virtual machines came up with the concept of a paravirtualized clock source, and in the KVM world it's called kvmclock. But Windows is not going to use your kvmclock by itself, right? So we emulate the Hyper-V clock, which is basically the same concept: a shared memory page with two values. To get a timestamp, the guest reads the TSC register from the processor, multiplies it by a scale, and adds an offset. If your VM migrates, the hypervisor will update these values, and the reading will stay consistent, so it won't jump anywhere. It's quite useful, and it speeds Windows up a lot; if I have some time, I will show you a benchmark at the very end of the talk.

The synthetic interrupt controller. That's the core component for building VMBus; VMBus is the key component for creating the PV devices in Hyper-V, which I'm not going to talk about. It's basically a communication protocol between the guest and the host: you can post messages and signal events. It's not interesting by itself unless you have some VMBus devices, which are not yet implemented, but this enlightenment is required for Windows to use synthetic timers.

And synthetic timers, yeah. A synthetic timer is something like an alarm clock, right? You want to get an event in, say, one second: you set a timer, you get an event. And Windows does this pretty frequently. Again, in the hardware world you can use
something like the TSC-deadline timer, right? You set the next TSC value, and you get an interrupt when it is reached. Under a hypervisor this is going to be quite slow, because you have to program it every time there is an event, and again, that means you will be exiting to the hypervisor for each event. With this enlightenment you can set a periodic timer instead. And actually, there was an update of Windows 10 and Windows Server 2016 last year which changed the frequency of setting up these timers, and there was a huge performance regression for Windows guests under KVM: users were seeing their guests constantly spinning, consuming 30% of a CPU even when idle. You enable this, and it goes away, because Windows sets the timer once and gets events when it needs them, without any hassle.

TLB flush. Again, as you know, when you remap something in memory, you may want to flush the TLB, the translation lookaside buffer, which is the fast translation from virtual to physical addresses. And in the x86 world, if you want to flush this buffer on other CPUs, you send IPIs there, basically interrupts, and you wait for them to perform the flush. In the virtualized world it may happen that the CPUs you want to flush are not actually running, so first, it's kind of pointless to flush the buffer there in the first place, and second, you will spend quite some time waiting for this to happen. So they came up with the concept of a paravirtualized TLB flush: you tell the hypervisor to do the flush operation on your behalf, and the hypervisor actually knows which vCPUs need flushing and which are not running and don't require flushing. This speeds up overcommitted environments significantly, when you have more virtual CPUs than physical CPUs.

It's a pretty similar concept with the paravirtualized IPI, but here we cannot just drop the IPI, because these inter-processor interrupts have to happen. The thing it gives us is that we can send IPIs to, for example, more than 64 CPUs at a time with this
and in hardware you would have to do a VM exit for every 64 CPUs you want to send to, so it becomes cheaper.

Yeah, there are also a couple of, like, useless things you can do. You can set hv-vendor-id; Microsoft Windows doesn't care what you put there, you can put "KVM" or "Microsoft Hv", it doesn't really matter. The other one is paravirtualized reset, hv-reset, another model-specific register which allows your guest to reset itself. And the thing is that even genuine Hyper-V doesn't recommend using it, so the feature is there, but for no particular reason at this moment. Maybe for some very old Windows guests it was required; for modern guests it's not.

Then there are also a couple of features which are required if you're running nested guests: if you're running Hyper-V on KVM, or if you're enabling some security features in Windows which actually enable Hyper-V underneath. First, if you want to get a stable clock source, and I just told you how important it is to have a stable clock source, then when running nested you will need a couple of additional enlightenments. One of them tells your level-1 hypervisor about your TSC and APIC frequencies; the other one tells it when they change, for example when you migrate your level-1 guest with all its guests somewhere else, so it actually needs to know that the frequency changed, and that's how you do that. This is not currently fully supported in KVM: it doesn't actually send these re-enlightenment events yet. So if your CPU is modern enough and you have TSC scaling, it's not an issue, but if you're running on older CPUs, your clock may start ticking at the wrong frequency. It can happen; we know about it.

Enlightened VMCS. I was giving a talk about it last year; it's a pretty complex feature. But the thing is that to run virtualized guests you're dealing with the so-called VMCS state on Intel, and you're using specific CPU instructions which are, first, not
very fast, and second, if your level-1 guest is building this state for its level-2 guest, you don't know what it's actually doing there, because it runs on the CPU natively, so you basically have to read the whole state. There is a PV protocol for that which speeds things up.

So, we have more features in the works, and this one is already on the mailing list, and that's why I put it on the slides. If you're running Hyper-V on KVM, it would also like to see synthetic timers there, but it cannot use synthetic timers in their current shape, the shape in which Windows uses them, because it doesn't set up this full infrastructure; Hyper-V is a very minimal hypervisor there. It wants a simplified mode, and the simplified mode is getting an interrupt instead of a VMBus message. For that there is the timer direct enlightenment, which is already implemented in KVM and which will land in QEMU shortly, I believe.

So, as I promised, some benchmarks, so you understand how important these enlightenments are. This is the Hyper-V clock source. What we do in the test is basically spin doing clock_gettime(), which is basically "what's the time right now" in the operating system. If you run it with and without hv-time, you will see a tremendous difference, because with hv-time it's basically reading from memory, so it's not very different from reading the TSC register on bare hardware. Without hv-time, it means a VM exit to the hypervisor every time, so the speedup is great here.

Enlightened VMCS: this is when you're running a nested guest and you do some operation which actually traps into the hypervisor. CPUID, as you know, gives you the CPU features you have, but it always needs to trap into the hypervisor, and you will see that with hv-evmcs we achieve about a 10% difference here.

Then PV TLB flush. The test case is quite complex, and this is only part of it, but the thing is, we are doing mmap and
munmap of some big file in chunks, and this operation is known to cause TLB flushes on other CPUs. Then what we do is run the same test on the same host, just adding more and more virtual CPUs to our guest. As you can see, while the number of virtual CPUs matches the physical ones, there is almost no benefit from the feature; it's the same as sending these IPIs and doing the flush natively. But as we go overcommitted, with more and more vCPUs, with PV TLB flush, on the right, the number stays more or less the same, because we don't really need to flush the vCPUs which are not running, and they cannot all be running at the same time. Without PV TLB flush, you see the same test case slowing down on the same physical host.

So, that was it from me. Thank you for listening. Any questions? Yes?

"Just regarding the features you mentioned: in which versions can we expect to have them and to make use of them?"

The question is in which versions we can expect to see these features, and I'm guessing that you're asking about both KVM versions and QEMU versions, right? So, everything I was telling you about today is already upstream in KVM, including the synthetic timers' direct mode. For QEMU, I don't actually remember off the top of my head, but I think that everything except PV TLB flush, PV IPI, and enlightened VMCS was there in 2.12 or something; in 3.0 we were adding PV TLB flush and enlightened VMCS, something like that. So if you grab the current QEMU, it has everything but the synthetic timers' direct mode, which is on the mailing list as an RFC. I'm also trying to come up with a simplification, something like an hv-all option, which would enable all Hyper-V features for you. It's a little bit controversial, because the question is what happens when you migrate such a VM, right? Your other host may have different Hyper-V enlightenment support, like different KVM versions, so the libvirt folks prefer to have all these enlightenments listed explicitly, and prefer
to keep them fine-grained, and they may not support it; but in QEMU it may actually come in handy for development and test cases on a single host, so people suggested something like that. So, expect to see this feature in the near future.

More questions? Yes. Oh, so many. At the back, you were the first to raise your hand, so please go ahead.

Yeah, the question is why these features are not enabled by default, and what the cost of enabling them is for the guest operating system. So, the cost is basically zero. The notable exception is enlightened VMCS, because enlightened VMCS comes with a penalty: for example, you will have posted interrupts disabled, and for some workloads, when you have physical hardware which is actually able to deliver posted interrupts, that's going to be a slowdown. In other cases, when you don't have such hardware, it will be a speedup. So this feature we cannot enable by default. For the rest, the cost is zero: even if your guest operating system is not using them, you can enable them for a KVM guest and you won't notice anything. Why don't we enable them by default? Probably because of how the virtualization stack is designed, and the most important thing there is migration, right? If you don't need these features but you enable them all, then later you cannot migrate this VM to some host which doesn't have them, because from the hypervisor's point of view we don't know whether the guest is using a feature or not. Or we would have to come up with an interface to ask whether the guest is using the feature and whether we can disable it, and we don't have that in either QEMU or KVM.

So yeah, thank you guys very much; we're out of time, so I will take your questions here in the corridor. Yes.