Good morning, everyone. My name is Kashyap Chamarthy. I work for Red Hat as a test engineer in the cloud department. This talk is about nested virtualization, specifically Intel and KVM on KVM, since that's what I had access to. The reason behind it was partly curiosity, and partly that I had access to one of the newer Intel machines, a Haswell machine, which has improvements for nested virtualization.

Before I proceed, just a quick poll. How many of you use KVM here? Quite a few. How many of you use KVM with Intel? Almost everyone. How many of you have used nested virtualization? That's very cool. Do you see a real use for it, or is it just because it's a cool thing? All right. Yeah. Because when the patch set was initially posted by the Turtles project authors, there was a lot of discussion about use cases, what's valid and what's not, whether there is a real need. But it has evolved a lot over time.

Anyhow, that's a quick agenda: a bit of an overview, the optimizations that are in place to resolve the bottlenecks of nested virtualization, a little bit of configuration information, and a simple initial test with the latest KVM git. Obviously, I couldn't test all combinations and possibilities; that's one of the biggest problems, you almost always get an explosion of the test matrix. And then some conclusions.

In regular virtualization, at the bottom of the stack you have x86 hardware, Intel or AMD. Then you have your regular hypervisor, which exposes the /dev/kvm character device, and you have your guests running as any other process on your host. It's fairly simple, and it's been there for a while. Each virtual guest competes with the other processes for resources, and the kernel handles the scheduling aspects.

With nested virtualization, a new layer is introduced, so a bit of terminology here: L0 is your level-zero physical host; L1 is your regular guest, or guest hypervisor; and L2 is your nested guest. So you have an L1 guest hypervisor which runs its own associated guests, the nested guests. Introducing this brings some overheads. Before we see those, let's look at the use cases.

One of the more popular use cases is the user-controlled hypervisor: say there's a developer who wants to test an application on different distributions. Instead of going to a cloud provider and asking for three different operating systems, you ask for one beefy hypervisor, and then he or she can manage their own associated guests inside that user-controlled hypervisor. I believe this use case is more appealing to cloud providers. There are more use cases: you can demonstrate an entire virtualization setup, with its associated guests, in a single VM. And obviously there is OpenStack, cloud infrastructure software, where you can run the entire infrastructure in one bulky guest, so it's a bit easier to demonstrate, and performance would also be slightly better with all the latest improvements. And of course there is also live migration of hypervisors, which I haven't tested at all; I was just curious whether anyone in the audience has had an opportunity to test that. There are more use cases that were discussed on the mailing list, one being hypervisor-level rootkits and security-related aspects.
But again, that's just theoretical, I guess, for now.

So where is the bottleneck? The primary bottleneck is the number of VM exits and VM entries. Whenever a guest hypervisor needs to execute a privileged instruction, it has to relinquish control to the host hypervisor, and that is a costly operation. A privileged instruction could be access to hardware resources like the time and date, or something like the CPUID instruction. Giving control back to the guest hypervisor is the VM entry. That's one of the bottlenecks. And when an L2 guest, a nested guest, exits, a single L2 exit can translate into multiple L1 exits; the term for this is exit multiplication.

How are these resolved? Well, at least two new things have been introduced. One, with the upcoming Haswell architecture, is a feature called VMCS shadowing. The VMCS is the virtual machine control structure, a processor-specific data structure which contains the guest and host CPU state. With VMCS shadowing, the L0 hypervisor can define another VMCS, a shadow VMCS, for your L1 hypervisor, so that your guest hypervisor doesn't have to exit to L0 and can store the state of its guests directly in the shadow VMCS structure. These are fairly detailed aspects of the x86 Intel architecture which I am still learning and have only just started to explore. But yes, at a very high level this is essentially a processor optimization. With shadow VMCS, from my minimal tests, the number of VM exits and entries is reduced. It's not a huge performance difference, but at least a 4 to 5% improvement has definitely been observed. So that was the CPU optimization, VMCS shadowing.

The other optimization is at the memory level. EPT has been around for a while, since the Westmere architecture, three or four years. EPT, extended page tables, provides second-level address translation in hardware. Before that there were shadow page tables, which were extremely slow because of the number of times translation has to happen, from guest virtual address to guest physical address, and guest physical address to host physical address; the number of exits also increases a lot more in that case. With nested EPT, the extended page tables feature is exposed to the L1 hypervisor. It essentially emulates extended page tables on your L1 hypervisor, so that L1 can use EPT when it's running its associated nested guests. It's a fairly complex thing for me to understand; the MMU layer is like a rabbit hole when I try to learn and explore it, so this is a very user-space-level kind of understanding from my testing. With nested EPT, an initial test showed almost a 10x performance improvement; you'll see the result in a little while.

Now a bit of configuration. These are the parameters which have to be enabled on your host. The first is the nested parameter for the kvm_intel module; then enable_shadow_vmcs, which is available from the Haswell architecture onwards, I don't know exactly when that's released, probably in a few months; and then EPT, which is enabled by default. And that's just an XML snippet; I'm using libvirt in this case, with KVM.
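As a rough sketch of the host-side part of that configuration, it looks something like the following; the parameter names are the kvm_intel module options I just mentioned, while the modprobe.d file name is only my choice:

    # Enable nested virtualization and shadow VMCS for the kvm_intel module;
    # EPT is on by default, but spelling it out makes the intent explicit.
    echo "options kvm_intel nested=1 enable_shadow_vmcs=1 ept=1" \
        > /etc/modprobe.d/kvm-intel.conf

    # Reload the module (all guests must be shut down first).
    modprobe -r kvm_intel
    modprobe kvm_intel

    # Verify the parameters took effect (each should print Y).
    cat /sys/module/kvm_intel/parameters/nested
    cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
    cat /sys/module/kvm_intel/parameters/ept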
So to expose the VMX instruction set to the L1 hypervisor, you could use a snippet like that, or you can expose the complete host CPU model by using host passthrough. But I think that will create problems when doing live migration, because the target host would also have to be of the same architecture. And this is just a libvirt invocation: when we do CPU host passthrough in libvirt, you can see the resulting QEMU-KVM invocation for the L1 guest hypervisor. It's using -cpu host, meaning it's exposing the host CPU model to the L1 hypervisor. The QEMU-KVM command line is fairly ugly, but people are working to get that resolved; yesterday I was at the KVM forum, and there was a lot of discussion around improving these aspects.

A simple workload which I tested was a defconfig kernel compilation in 10 iterations. Just run it in a loop, clean it, ensure data is flushed, and also drop the page caches so that the compilation timing is consistent, and then output it to a file. And that's the result with VMCS shadowing enabled and disabled across L0, L1, and L2. Without focusing too much on the numbers, you can see at a high level about a 5% improvement with VMCS shadowing enabled on L0. The difference in compilation time between the two is not really large, as you can notice. But yes, I confirmed this with some of the Intel authors of nested virtualization, and they do indeed say this is reasonable; it was also discussed on the KVM mailing list.

Sure. Why would you get any benefit at L0 by turning VMCS shadowing on? No, L0 is just the baseline; it doesn't matter here, it's almost the same. Sorry? It's within the noise of the measurement, yes. You don't really see any difference in L0; L0 should be the same. I just calculate an average and put it in the table, the same for all the other rows. It's just an approximation. But right, the most interesting case here is L2.

Are the same cases shown with EPT enabled and disabled? Yes, but I haven't written them down here. I posted them on my GitHub page, so you can check out the results there.

Is there any downside to enabling the ability to run nested KVM? Does it slow down or add overhead to the non-nested case? Why isn't it on by default? Well, at least the KVM developers and maintainers still call it experimental at the moment. And also, not a lot of people have been reporting bugs or writing in with questions, so there's probably not enough confidence yet to enable it by default. So it's not an issue of overhead added by enabling it; it's purely a question of whether it's stable enough to be on by default? Yes, that's one main aspect. And the workloads I have been testing are mainly Linux on Linux, no graphics whatsoever, just minimal core installs. Of course, you expose a lot more possibilities and combinations once graphics and other kinds of guests, like Windows, are involved. So yeah, that's probably a good reason. I've been running this basic test for a while, like six to seven months, maybe more, with the latest KVM git, and it has been quite consistent; I haven't seen any arbitrary crashes and such. So maybe you might want to give it a shot with the latest KVM. But yes, some people do find it a bit more unstable.
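The workload loop I described a moment ago looks roughly like this; a sketch of my setup rather than the exact script, with the source path and log file name only illustrative:

    #!/bin/bash
    # Time ten single-job defconfig kernel builds (run the same loop on L0, L1,
    # and L2), flushing dirty data and dropping page caches between runs so the
    # timings stay consistent.
    cd ~/linux && make defconfig
    for i in $(seq 1 10); do
        make clean
        sync                                   # flush dirty data to disk
        echo 3 > /proc/sys/vm/drop_caches      # drop page cache, dentries, inodes
        { time make -j1 ; } 2>> ~/kernel-build-times.log
    done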
Some of the other projects, the oVirt project for instance, run a lot of these kinds of tests as well, and I've seen some reports where they found a lot of instability. But the workload I've been running is a fairly minimal Linux core install, and it does not exercise a lot of other cases. It's arbitrary crashes, I guess, when you're running different workloads. On the nested KVM? Sorry, can you say that again? At the higher level, arbitrary crashes on your host. For instance, I had a crash which I was only able to trigger twice, when I tried to shut down or force power off an L2 guest. I posted it anyway and asked the KVM maintainers whether it made any sense. They said it does look new, so let's just post it to the list, where people who are more in the trenches can maybe answer it. But I haven't got a response yet; it's been just two weeks, I guess, and people are always very busy.

And that was, oops, yeah. This was a single make process, so if I ran more jobs, make -j4, it would significantly reduce the time. Sorry? Yeah, this was single, because to show consistency across L0, L1, and L2, I ran with a single make process. I also ran another case. What about an SMP guest? Yes, I'm coming to that. For instance, another case where I ran L2 with two virtual processors, L1 with four, and L0 with eight. Actually the next slide is that one. It's erroneously marked as a single make process there, but it was slightly different: all the L2 runs were with two virtual processors, all the L1 runs with four, and L0 with eight. As you can see there, in the L2 case, with shadow-on-EPT at the top and nested EPT at the bottom, you see almost a 10x improvement in kernel compilation time. This was with make -j2, so two virtual processors. With a single make, I just logged into the machine and saw it was something like 19 minutes with shadow-on-EPT and somewhere around eight minutes and 30 seconds with EPT-on-EPT, something like that. You can see the difference with and without nested EPT there, but when we run more workloads we will probably get more insights and trigger more kernel cases. This was just 10 iterations, not too much.

Say that again, I'm sorry. When I tried with more virtual CPUs, did it work as well as with one? Yes, it did work consistently with more virtual CPUs.

Well, that was quick, but as a summary you can say that with current KVM git you definitely see a very visible performance improvement. Current KVM git has the patches for nested EPT; VMCS shadowing support has been there for a couple of months more, while nested EPT has only just been merged into current KVM git. And again, there are a lot of test combinations possible. Even just running this make test took quite a few combinations and some effort to keep track of what was running where. So definitely more test combinations need to be exercised; a kernel compile may not reproduce the exact real-world case. I was planning to do some of these tests; one of the KVM developers suggested what to run inside an L1 guest hypervisor so that it exercises different operating systems, and it's also quite robust at adjusting workloads.
So that's one of the cases which I'm interested in trying, and we can run more fine-grained tests to understand MMU- and CPU-bound workloads. One thing which I didn't mention at all is I/O: I didn't even get a chance to test the I/O improvements, so no I/O workloads have been tested. That's another case where definitely more tests can be done, plus more testing with OpenStack where you run nested guests, or whatever guests can be run with KVM enabled in your level-one guest, and any other cases you can dream up.

If you want to try this from KVM git, it's fairly straightforward; I just wrote it down for completeness' sake. I checked out the queue branch; the KVM git queue branch is where all the testing happens before changes are pushed to the next tree. So that's how you could go ahead if you want to try it, but this should be in 3.12 very soon. Most of my notes, whatever I've been doing, are posted on GitHub with complete details, so if you'd like to try it out, maybe you can consult them.

Well, I think that's about it. Any questions? Sure, that's fine, I'll take the first one. No, so the question is, have you tried VMware on KVM? No; that's one of the cases the Turtles project tried. But I was doing this as a part-time curiosity, so I didn't get time to test any other hypervisor apart from KVM. Even with just KVM I was testing some cases; for instance, there was a shadow-on-EPT case where the L2 guest was not even booting, and I had to bother the KVM maintainers on IRC. They finally figured out the issue and submitted a patch, and then I had to compile the kernel on L1 again, install it, and get it running. It's just too time-consuming to try out too many combinations.

I had a question about the last set of performance numbers you showed. Why was it that L2 was significantly faster with doubly nested EPT, but L1 was actually faster with shadow-on-EPT? Why would shadow-on-EPT matter when you're not actually running a second-level guest? This one? Yeah, L1 was faster with shadow-on-EPT than with nested EPT, but what's underneath shouldn't matter if you're not running a second-level guest. Well, yes, that's one of the things I have to investigate; when one of the reviewers looked at the test and the slides, that was one of the questions. So yes, I have to investigate what the bottleneck is in the shadow-on-EPT case.

There was a question over there. What about L3? Sorry? What about L3? I haven't tried it yet, but again, the same problem will be there with L3, the number of VM entries and VM exits; I certainly have not tried it, I don't know. Do you envision any use, or is this just for fun? Well, I certainly have not tried it. I thought of doing it, but let's focus on a few meaningful tests. The Turtles project, which is the one that introduced nested virtualization for Intel, certainly claimed to run L3 guests, but I have not tried it, so I do not want to say that I tried L3 guests, just L2.

Is the VMCS only one level deep? Is there only a single instance, or is there a VMCS per guest? Yeah, it's a data structure, so software VMCSs are exposed as well; for each virtual machine you can use a different VMCS, as I understand it. So I do not think it's a single instance. For instance, I was running four L2 guests, consistently, without any crash. Yeah, four shadow VMCSs are used there.
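For reference, the checkout I mentioned earlier looks roughly like this; a sketch, where the kvm.git URL and the usual kernel build steps are my own reconstruction rather than something taken from the slides:

    # Clone the KVM maintainers' tree and switch to the 'queue' branch,
    # where patches are staged before they move to the 'next' branch.
    git clone git://git.kernel.org/pub/scm/virt/kvm/kvm.git
    cd kvm
    git checkout queue

    # Build and install a kernel from it, reusing the running distro config.
    cp /boot/config-"$(uname -r)" .config
    make olddefconfig
    make -j4 && make modules_install && make install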
Well, yeah, that's certain. Let's see if we can have a quick look; I have the machine here running. It's a remote machine, so if it's accessible it will be a bit slow. We still have a bit of time, right?

So the bottom here is L0, the physical host. You can list the guests, and one regular guest is running. To ensure this is indeed running with VM extensions enabled, let's grep for the QEMU process. You can see it is running with -cpu host, so it does expose the /dev/kvm character device. This machine running here is the L1 guest. To prove it is indeed the L1 guest, let's just log in from here, list the guests, and open a console to the regular guest. So it's indeed a regular guest. Let's check for the character device: yes, /dev/kvm is there. And then let's see how many guests we have there. One is already running; we can just start the other two. Let's clear that. Two of them are running, so it's running fine. We could even run a make process to see if it...

No, I didn't catch the question quite correctly. These are fully virtualized guests; there is no paravirtualization. Well, yes, that's one of the cases: you still have to try Windows and other sorts of guests to exercise it, and that certainly needs to be tested. I don't know what your experience has been. All right, any further questions? I think we're ahead of time. But thank you very much.