Good morning and welcome to the virtualization track. My name is Vitaly, and I'm going to tell you what we do to run Windows and Hyper-V guests better on KVM.

So in your infrastructure you're running VMs; some of them are Linux VMs and some are probably Windows VMs. The question is: does that make a difference from the virtualization stack's perspective? In theory you could say it doesn't, because with KVM and QEMU we're emulating physical hardware: we build a virtual computer, and then some operating system runs on it. But if you boot Linux on KVM and look at the kernel log, you'll see that Linux knows quite a lot about the fact that it's running on KVM: it's using kvm-clock, it's using PV TLB flush, and features like that.

Why does that happen? The thing is that when we emulate hardware interfaces in software, they're not always as fast as we would like. So how do we solve that? We come up with a different interface which is software-friendly, and such interfaces are usually called paravirtualized — don't mix that up with Xen PV, that's a different story. But if you come up with your own interface, you need to add support for it to your guest operating systems, otherwise they won't use it.

So what do you do with proprietary operating systems like Windows? You can try writing drivers, and in some cases that works — for example, we write virtio drivers for Windows so it can use virtio devices. But some very basic features, like the clock source or TLB flush, are not devices from Windows' point of view, so you can't solve them with device drivers.

What do you do in that case? Well, KVM is not the only hypervisor out there. In the proprietary world, Microsoft has its own hypervisor, Hyper-V, and they have to solve exactly the same performance problems we are solving in KVM. So they already have the interfaces we need. If we emulate those interfaces in KVM, we automatically make Windows run faster and better. And that's what we do — it's called Hyper-V emulation in KVM.

It splits into basically two parts: core enlightenments, the very basic features, and device drivers on top of VMBus. The VMBus devices are not yet upstream and that's a broader topic, so today I'm going to talk about the core enlightenments.

So what are these features, and where can you read about them? There is no documentation for them in KVM; there is some in libvirt, and that's about it — not much. If you ask where to read more, you'll probably be pointed to Microsoft's docs, the so-called Top-Level Functional Specification (TLFS) for Hyper-V, which describes these enlightenments to a certain level. Or you can listen to my talk. So let me go quickly through the enlightenments; I'll show you how to use each one from QEMU and libvirt, and I'll tell you what the feature is and why we need it.

OK, the first feature is so-called relaxed timing, hv-relaxed. On the slide you can see the QEMU syntax for it and the libvirt syntax.
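Roughly, it looks like this — a minimal sketch where everything except the hv-relaxed part is a placeholder; older QEMU versions spell the flag hv_relaxed:

```
# QEMU: enable the relaxed-timing enlightenment on the virtual CPU
qemu-system-x86_64 -enable-kvm -m 4096 -cpu host,hv-relaxed ...

# libvirt: the equivalent goes into the <features> section of the domain XML
<features>
  <hyperv>
    <relaxed state='on'/>
  </hyperv>
</features>
```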
In libvirt, most of these features will look like that, though there will be a couple of notable exceptions. On the QEMU command line you simply pass -cpu with whatever model you have, plus hv-relaxed. What it does is tell Windows that it's running virtualized and that it should disable the strict watchdogs on timing, because some operations may take significantly longer than they do on real hardware, and a strict watchdog would make the guest crash. Newer Windows versions don't actually look at this: when they see a hypervisor signature in CPUID, they switch to relaxed timing automatically. But it's good to enable it for all the Windows versions you run.

The next enlightenment is the virtual APIC, hv-vapic. What it is, is an assist page: for every virtual CPU, the guest maps one page which is shared between the guest and the hypervisor, and it's used to speed up some APIC operations. The most important one is virtualized EOI, end of interrupt, and it's a good example of why emulating a hardware interface isn't fast. Say a level-triggered interrupt happens: a pending bit is set in some register and it has to be signalled to the guest. The first exit happens there — the hypervisor stops the guest, injects the interrupt, and resumes the guest, and that takes some time. The guest now sees the interrupt and handles it, but then the hypervisor needs to know when it can deliver the next interrupt, so the guest has to signal that it finished processing. That's what end of interrupt is for, and on real hardware it's just a write to a register, which is fast. Under emulation, that write means another trip back to the hypervisor to process the EOI. The paravirtualized assist page is a different interface: the guest can signal asynchronously, through shared memory, that it has finished processing the interrupt. With that, you spend roughly half as much time in the hypervisor on interrupt processing.
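On the QEMU command line it's just one more flag in the -cpu list (only the relevant part shown):

```
# hv-vapic: virtual APIC assist page, fast (paravirtualized) EOI
-cpu host,hv-relaxed,hv-vapic

# libvirt equivalent, inside <features><hyperv>:
#   <vapic state='on'/>
```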
Another one is paravirtualized spinlocks. If you don't know, a spinlock is a very basic concept: two CPUs are trying to access the same resource, and access has to be mutually exclusive. You create a variable — think of it as an integer somewhere — and while one CPU holds the resource, the other keeps spinning: it constantly checks the variable to see whether the resource is free, and when it becomes free, it does its work. Operating systems use these locks when a resource needs to be held for a very short time — you can't sleep there, you can't access external hardware there — but it happens a lot. The problem in the virtualized world is that waiting for such a spinlock can take much longer: think about what happens when the virtual CPU that actually holds the lock gets preempted because another guest is running on that physical CPU. Your other virtual CPU will just keep spinning. So instead, after a while, the guest can yield to the hypervisor and give some other CPU a chance to run while it's waiting for the spinlock. We have this for other hypervisors like Xen, and we also have it for Windows on KVM. You configure it by setting the number of spin attempts before the guest gives up — it's good to spin at least some reasonable number of times, a thousand attempts isn't much — and if it still can't get the lock, it goes to the hypervisor and some other CPU gets to execute.

Virtual processor index, hv-vpindex, is a very simple enlightenment. It adds one extra register, an MSR (model-specific register), where the index of the CPU can be read. This is in addition to the already existing APIC CPU number, and on KVM the two will almost always match: on CPU one you'll read one there, on CPU two you'll read two. But this enlightenment is required by other enlightenments: when Windows makes hypercalls, for example for the paravirtualized TLB flush, it uses these indices to indicate which CPUs need to be flushed.

Runtime information, hv-runtime. You're sharing your physical CPU with other guests, so sometimes your virtual CPU runs and sometimes it doesn't, and inside the guest you may want to use that information for better scheduling. For example, you're trying to give equal time slices to your applications, and if your CPU wasn't actually running during a slice, you're not giving them a fair share. So there's an MSR KVM can expose which tells Windows how much time was spent actually running the guest and how much was not. But since it's an MSR, reading it is fairly costly, so as far as I know Windows doesn't use it for scheduling; it may use it for something else, but I don't know exactly what.

Crash information, hv-crash. Your Windows guests crash sometimes and you see a blue screen of death — it's still blue after all these years. You may want to analyze these crashes, especially when you run at larger scale: if you have hundreds of guests on hundreds of nodes, you want to know whether the crashes you're seeing are the same crash or different ones. For that, Windows can report, for example, the exact instruction pointer where it crashed, and then you can compare crashes with each other. So there's an interface for it; you enable it with hv-crash, and you can then get the data from QEMU — only via the API, I think — or find it in your libvirt log: you'll see the crash parameters, the instruction pointer and some additional data like basic registers.

The Hyper-V reference clock source, hv-time, is actually one of the most important enlightenments, and if I have some time left I'll show you benchmarks of how important it is. There are workloads which read the time very frequently — for example timestamping of logs, database records, or network packets — and in those environments you want to get a timestamp as quickly as possible. If you don't give Windows the Hyper-V clock source we have, it will fall back to something else, and that something else, like the HPET, may not be as good or as fast. So give it to your Windows guests and you'll see a significant speedup.
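Cumulatively, the flags covered so far would look roughly like this on the QEMU command line (a sketch; 0x1fff is just a commonly used spin-retry count):

```
# relaxed timing, virtual APIC, PV spinlocks (give up after 0x1fff spins),
# VP index, runtime MSR, crash MSRs, Hyper-V reference clock
-cpu host,hv-relaxed,hv-vapic,hv-spinlocks=0x1fff,hv-vpindex,hv-runtime,hv-crash,hv-time
```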
The synthetic interrupt controller, hv-synic, is an extension of the normal interrupt controller, and it lets the guest and the host communicate in terms of messages and events. By itself this enlightenment is not very interesting to you, but it's required for the not-yet-implemented VMBus devices, and it's required for synthetic timers, which I'm going to talk about next.

Synthetic timers, hv-stimer, are another extension on top of the synthetic interrupt controller, and they let Windows program periodic interrupts — for example, getting an interrupt from the controller 2,000 times per second. Last August there was a significant regression for all Windows guests running on KVM, caused by some Windows 10 update: without this enlightenment, you would see your Windows guests consuming a significant portion of a physical CPU — going up to something like 30% — even when they were completely idle, not doing anything. It turned out they had simply changed the frequency of the timekeeping interrupt they use, and without this enlightenment that means 2,000 times per second the guest goes to the hypervisor, programs the next timer, waits for it to be delivered, and goes back to the hypervisor again — that's where the 30% of a physical CPU goes. When you enable this enlightenment, the issue goes away.

TLB shootdown, hv-tlbflush. I'm not going to talk much about this, but sometimes on the x86 architecture you need to flush the TLB on other CPUs. How is it done? By delivering interrupts — IPIs — to those CPUs so that they perform the flush themselves, because on x86 you can only flush your local TLB; you can't flush anything on another core. Now, it turns out that if some of the virtual CPUs are not currently running on the hypervisor, you don't actually need to flush anything there right away — but your guest doesn't know which CPUs are running. This enlightenment lets the guest perform the shootdown only on the CPUs that are actually running at that moment, so it doesn't have to wait for CPUs that don't really need to perform the flush.

Paravirtualized IPIs, hv-ipi, are a similar story, but where the previous enlightenment is specifically optimized for TLB shootdowns, these IPIs do get delivered, because they're generic: you don't know what the IPI was sent for, or whether the target CPU really has work to do or not. So again, to speed their delivery up, you can add this enlightenment and Windows will run faster.

There are also a couple of things which aren't really enlightenments but are implemented on the QEMU side. You can change your vendor signature — that's what Windows will see — though it won't act on that information in any way. You can call yourself KVM, you can call yourself Hyper-V, it doesn't matter; Windows won't use it, but you can. And there's one pretty useless enlightenment called hv-reset, which is just another way to reset the guest. Even genuine Hyper-V doesn't suggest using it, but it's there, and we implemented it too — it was pretty easy to do.
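In QEMU syntax these are added the same way — a sketch, with an arbitrary vendor-id string as an example:

```
# synthetic interrupt controller and synthetic timers (hv-stimer needs hv-synic
# and hv-time), PV TLB shootdown and PV IPIs (both need hv-vpindex),
# plus the optional vendor-id override and hv-reset
-cpu host,hv-vpindex,hv-time,hv-synic,hv-stimer,hv-tlbflush,hv-ipi,hv-reset,hv-vendor-id=KVMKVMKVM
```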
I have a few more minutes, so: in case you're running nested environments — Hyper-V with its own guests on top of KVM, or Windows security features which automatically enable the Hyper-V role for you — you may be interested in a few more enlightenments.

First, you may want a good clock source in your level-2 guests. If you're running Hyper-V on KVM, the Windows you actually see is your level-2 guest, and you want a stable clock source there for exactly the same reasons I told you about. For that you need two enlightenments: hv-frequencies and hv-reenlightenment. Frequencies tells your guest the exact TSC and APIC timer frequencies. Re-enlightenment matters when your level-1 guest migrates together with all its own guests: it needs to learn about that fact so it can update the clock sources of its level-2 guests. So it's not needed when you're not running nested; it is needed when you are. The re-enlightenment feature is currently not fully implemented in KVM, so it will run, but if you migrate the whole environment you may see some glitches, like your clock ticking at the wrong frequency.

Enlightened VMCS is probably too hard to explain in the minute I have, but the idea is this: on the Intel platform you manage guest state with special instructions, and those instructions are not as fast as plain memory accesses; also, we don't really know what was changed in the guest state, so when the level-1 Hyper-V manages its guest's state we have to reload the full VMCS, and that takes time. There's an enlightenment which makes this faster, though it disables some hardware features like posted interrupts. So depending on your workload, you may or may not want this feature.

The last one, which is currently in the works, is direct synthetic timers. It's exactly the same story as with the synthetic timers I told you about, but Hyper-V itself doesn't actually enable the SynIC, so it needs another way to be notified that a timer expired. This mode lets the hypervisor deliver the timer expiration as a plain interrupt, and Hyper-V will only use synthetic timers if you have it; it won't use the normal synthetic timers. So to solve the same idle-CPU issue for nested Hyper-V you'll need this. It's already implemented in KVM; I'm currently implementing it in QEMU and will do libvirt after that.

In the last minute I have, let me show you a few benchmarks of how important these features are. Clock source: a simple loop reading the clock, running in Windows. Without hv-time it takes about 17,000 cycles; with hv-time it takes about 400 cycles, so the improvement is tremendous. Enlightened VMCS — that's a nested-guest improvement: we run CPUID in a loop in a nested guest, and we get almost 21,000 cycles without the feature and about nineteen and a half thousand with it, so roughly a 10% improvement; for some workloads that's significant. And TLB shootdowns: the full example is fairly complicated, but essentially we run multiple pthreads doing mmap and munmap — mapping parts of a file into memory and unmapping them — an operation known to require TLB flushes on other CPUs, and we run overcommitted, with more virtual CPUs than there are physical CPUs. The application is the same, the physical host is the same, and we just keep adding virtual CPUs to the guest. The more vCPUs we add, the slower it gets without the enlightenment; with the enlightenment it stays almost flat, because we don't need to flush anything on CPUs which aren't running, so even being overcommitted doesn't change anything.

That was basically it. Any questions? All of this is upstream except the direct synthetic timer mode, which is coming in QEMU — the patches are already on the mailing list, I sent them last week, so eventually they'll get merged; the rest is already there. I'm also working on a QEMU feature which will make your life easier. It's called hv-all: it basically enables all the Hyper-V enlightenments you have available, except Enlightened VMCS, because as I told you, that one disables some other features, and for some workloads we can't really tell up front whether it helps — if you have devices which deliver posted interrupts directly to your guest, you'll actually see a regression, and if you don't, you'll see an improvement; QEMU can't know which it is. So that one won't be enabled by default; you'll have to pass hv-all,hv-evmcs explicitly. Everything else will be enabled automatically, and I encourage you to enable them all, because they're not known to bring regressions.
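Until hv-all lands, enabling them individually would look roughly like this (a sketch — the exact set and spellings depend on your QEMU version, and hv-evmcs is deliberately left out for the reason above):

```
# all the "safe" enlightenments from this talk, on one -cpu line
-cpu host,hv-relaxed,hv-vapic,hv-spinlocks=0x1fff,hv-vpindex,hv-runtime,\
hv-crash,hv-time,hv-synic,hv-stimer,hv-tlbflush,hv-ipi,hv-reset

# extra flags for nested Hyper-V-on-KVM setups:
#   hv-frequencies,hv-reenlightenment
# and, only if your workload benefits from it:
#   hv-evmcs
```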
Yes — the benchmarks were all on Windows Server 2016, with or without the Hyper-V role: this one was without it, and that one was with it, because we needed nested guests there. So it's 2016, but as I said, we're not aware of any regressions on other Windows versions when you enable these enlightenments, so it's pretty safe to enable them all. Only if you run into issues should you try disabling them — and if you do find an issue, just talk to us, come to the upstream mailing list and we'll happily look at what's going on.

Thank you. Yes, last question, Kristoff. Well, they have the specification, and the thing is that they have old Windows versions which they still support, so they can't easily change these interfaces without breaking Windows — and we're trying to piggyback on that fact. They're more or less sane people, so they try to make sane changes to the specification. Windows Server 2019 is already released, but there's no specification for it yet, so we don't know what we can do there; probably there will be some new features which we'll again try to implement in KVM to make life easier for Windows guests.

OK, one last question and I'm done — sorry, actually no time for questions. Catch me here afterwards and I'll answer all your questions. Thank you, guys.