Thanks for the introduction. I'm from Red Hat, and at Red Hat I'm on the virtualization development team, the team primarily responsible for developing KVM and the stack around it: QEMU, libvirt, and so on. My job is a little different, though, because I don't usually work on our own hypervisor. What I do is support Linux as a guest on hypervisors other than KVM — mainly Hyper-V and Xen.

The idea for this talk was heavily inspired by last year's KVM Forum talk by Jim Mattson from Google about testing KVM on KVM. That is really important and interesting, but it mainly described optimizations we can do when KVM is the level 0 hypervisor. When we run on other hypervisors, we don't have such options, especially when we run on top of a public cloud with a real, production Hyper-V underneath. Can we use the options that hypervisor gives us to be a good guest? And actually, are we a good level 1 hypervisor or not? That's basically the idea of this talk.

So, what's the setup I'm talking about? Basically, we have Hyper-V installed on bare hardware. As you may know, Hyper-V is a type 1 hypervisor: the hypervisor itself sits between the hardware and the management partition, and the management partition on Hyper-V always runs Windows. What we're going to do is run Linux in a VM, and inside this VM we'll try to run a level 2 guest with KVM. Only Linux, for now. So that's what I'm going to try to make possible.

Now, why would you actually care, why would you want to try that? Well, nobody forces you to. But it may be because of Azure, since Azure runs on top of Hyper-V. It may happen that your customer, for example, is moving to Azure, and you have no choice: you have to move your workloads, or containers, or whatever you've got, to this cloud provider. And sometimes there are non-technical reasons as well.

So, what can you do with nested virtualization in such a virtual machine? First, you can try splitting big instances up. For example, an M128 instance, if you know it, has 128 vCPUs — that's a pretty big machine, so sometimes you may want to divide it into several parts and use them separately. You may be interested in running secure containers — stuff like Intel Clear Containers or Kata Containers, which you just heard about from Fabian. You may actually want to run a virtual machine of your own on this cloud, and for that you definitely need nested virtualization. Or you may just want to move some old workload you have, or do some testing and debugging there.

Nested virtualization in Hyper-V is relatively new: they only introduced it in 2016, after many years of shipping Hyper-V without it. Why did they add it? What they told me is that the main driver was Windows Server 2016, which added security features that are virtualization-based.
And if you want to use those features inside a virtual machine, that virtual machine itself has to be able to run virtual machines — so even a plain Windows Server 2016 setup inside a VM wants nested virtualization.

The next thing to know is that it's not enabled by default. You create a virtual machine, and KVM won't load there: it will tell you, sorry, no hardware support — there are no VMX extensions. So you have to enable it (on Hyper-V that's a per-VM setting; the Set-VMProcessor PowerShell cmdlet can expose the virtualization extensions to a VM). And after you do that, it just starts working — at least that's the expectation. You can load KVM there, you can create a virtual machine there; you can basically run KVM and the whole stack on Azure hardware. So, great, right? Are we done?

Well, we're geeks, we don't just use things — we test them. It's really interesting to see how well this actually performs, especially if you want to run it for real: you need to know how much performance you lose compared to level 1. So what I was interested in when I started working with this several months ago was: okay, cool, it works — but how does it perform? I started doing several benchmarks, trying to look at various aspects of such a setup. And I think every particular workload requires separate examination, because different workloads have different implications. Benchmarking virtualization is tricky in the first place, and nested virtualization is even more complicated.

First example — I hope you can see it there — a classical example of how you benchmark virtualization in general; I'll explain why later. It's a simple tight loop of CPUID and RDTSC. I hope everybody in this room knows these. RDTSC returns the cycle counter from your CPU — it's just a counter, a single fast instruction: you run it, you get the count. And then there is CPUID, which returns the CPU's feature bits — and, unlike RDTSC, it always traps to the hypervisor when virtualized. So we measure how fast CPUID executes. Very simple.

So, the results. If you run this on bare metal, on some Skylake processor, you will get a bit over 100 cycles per iteration. In a level 1 guest — and by level 1 I mean an ordinary, non-nested guest — you get somewhere in the low thousands of cycles; KVM and Hyper-V are similar there. But if you run it in a nested, level 2 guest on that same hardware, you will see something like 20,000 cycles. More cycles per exit means worse performance, right?
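For reference, the loop in question can be as small as the following C sketch — my reconstruction of the idea, not the exact benchmark behind the numbers above.

```c
/* Sketch of the CPUID-cost microbenchmark described above (a
 * reconstruction, not the talk's exact code). CPUID always causes a
 * VM exit when virtualized; RDTSC reads the cycle counter. So
 * delta/iterations approximates the round-trip cost of one exit. */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static inline void cpuid(uint32_t leaf)
{
    uint32_t a, b, c, d;    /* results discarded; we only time the exit */
    __asm__ __volatile__("cpuid"
                         : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                         : "a"(leaf), "c"(0));
}

int main(void)
{
    const uint64_t iters = 1000000;
    uint64_t start, end;

    cpuid(0);                /* warm up */
    start = rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        cpuid(0);            /* each iteration traps to the hypervisor */
    end = rdtsc();

    printf("%llu cycles per CPUID\n",
           (unsigned long long)((end - start) / iters));
    return 0;
}
```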
So, is it bad? It is. But can we actually make it better? To answer that, we first need to understand how virtualization on Intel works in general. And I'm sorry, I will only be talking about Intel virtualization today. AMD is conceptually similar but technically different, so I won't go into it — and there's no public nested support on Hyper-V for AMD yet anyway. I'm not going to claim whether they have it internally or not, but nobody outside of Microsoft has seen it, so I didn't include it here.

So, how does this thing work on Intel? Your hypervisor — it doesn't matter whether it's Hyper-V, KVM, or anything else, they all do the same — prepares a special area which describes your guest. It's one page of memory, four kilobytes: the VMCS. It's a special area of memory which you cannot access directly; you need special instructions, VMREAD and VMWRITE, to work with it. In it you describe your guest: what its starting state is, what features it has, all the different controls, et cetera. Once you've filled in this memory describing your guest, you run it, and the guest executes directly on the CPU. You don't do anything from the hypervisor — the hypervisor is not involved at that point. Only when your guest does something special which requires attention from the hypervisor, something which cannot be done while the guest runs, do you trap back into the hypervisor. The hypervisor is supposed to handle the exit and resume the guest. And that's how it works: we wait for the exit, we handle whatever it is. With several guests we sometimes have to schedule them out and schedule them in, so it's a little more complicated, but in a nutshell it's like that. At the first level of virtualization, it's pretty simple.

Okay, so how do you think nested virtualization works? If you didn't know anything about how it actually goes, you might think the level 1 hypervisor just does the same, right? It prepares this VMCS area for its level 2 guest and runs it. And when something happens in the level 2 guest, we jump into the level 1 hypervisor, which handles the exit, resumes level 2, and keeps going. But that's not how it is. And honestly, speaking about nesting is genuinely confusing — I may even mix up level 1 and level 2 myself. It's really not easy, sorry.

So how does it actually work? The level 0 hypervisor is basically the only real hypervisor in the system. The level 1 hypervisor prepares its VMCS, but it cannot run it directly. That VMCS first needs to be merged with level 0's idea of the level 1 guest itself. Level 1 may already have restrictions imposed on it, and we cannot produce a guest which has fewer restrictions than its parent, right? So these two ideas need to be merged. So when level 1 tries to run its guest, we actually trap into the level 0 hypervisor. Level 0 analyzes the state which level 1 just created, merges it with its own idea of how this guest needs to run, and then runs the level 2 guest. And level 2 doesn't really run on top of level 1 — it actually runs in parallel to it, as a sibling.

The interesting thing is what happens when we trap. Take the CPUID instruction we used before — it is always intercepted. We get into level 0 first; that's the only real hypervisor in the system. Level 0 looks at the exit: oh, this exit is probably not for me, I'm not supposed to handle it. So it passes the execution flow to the level 1 hypervisor, and the level 1 hypervisor sees: oh, my level 2 guest just exited. But it didn't trap there directly — the execution flow came through level 0. And that's slower, of course. It gets especially slow when the level 1 hypervisor starts editing the guest state: it does those special instructions, VMREAD and VMWRITE — you're probably familiar with them — and by default, every one of those traps into the level 0 hypervisor too. So Intel came up with a feature a few years ago called shadow VMCS: level 0 can load a special shadow VMCS area for level 1 — not the one being executed, but one that level 1 can access directly with VMREAD and VMWRITE, without trapping. Nobody exits, and it can be pretty fast.
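To make that flow concrete, here is a toy model in C of who receives each level 2 exit. Everything in it — the names, the exit list, the intercept rule — is invented for illustration; the real logic lives in KVM's nested VMX code.

```c
/* Toy model of nested VM-exit handling on Intel. Invented names and
 * rules, purely illustrative -- see KVM's nested VMX code for the
 * real thing. */
#include <stdbool.h>
#include <stdio.h>

enum exit_reason { EXIT_CPUID, EXIT_EPT_VIOLATION, EXIT_MSR_WRITE };

/* Did the VMCS that L1 built for L2 ("vmcs12") ask to intercept this? */
static bool l1_intercepts(enum exit_reason r)
{
    return r != EXIT_EPT_VIOLATION;  /* pretend L0 keeps EPT faults */
}

/* Every L2 exit lands in L0 first -- the only real hypervisor. */
static void l0_handle_l2_exit(enum exit_reason r)
{
    if (l1_intercepts(r)) {
        /* "Reflect" the exit: switch from the merged VMCS ("vmcs02")
         * back to L1's own VMCS and resume L1 in its exit handler.
         * L1 then inspects and edits state with VMREAD/VMWRITE, each
         * of which traps to L0 again -- unless shadow VMCS (or, later,
         * enlightened VMCS) lets L1 touch it without exiting. */
        printf("exit %d: reflected to L1\n", (int)r);
    } else {
        /* L0 handles it itself and resumes L2 directly; L1 never
         * learns this exit happened. */
        printf("exit %d: handled by L0, L2 resumed\n", (int)r);
    }
}

int main(void)
{
    l0_handle_l2_exit(EXIT_CPUID);
    l0_handle_l2_exit(EXIT_EPT_VIOLATION);
    return 0;
}
```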
But even with shadow VMCS, when we want to run our level 2 guest, we still need to trap into level zero: level zero has to analyze all the state, merge it with its own idea, and run level 2. That's how nesting works on Intel.

So, back to our question: can we make it better? Well, we cannot make it as good as running at level 1. Those are architectural limitations: we will always be trapping into level zero, and level zero will always be doing some work for us — merging the two ideas of the guest and resuming it — so each VM exit will cost significantly more than it costs at level 1. But we can still do better. And Microsoft does, in their own stack. They have a publicly available specification — you can download it and see all these enlightenments described there. They created a feature called enlightened VMCS. The idea is that level zero and level one communicate through a well-defined structure in memory. Level one suddenly becomes aware: oh, I'm running nested, so I can cooperate with the level zero hypervisor — I will tell it what I actually changed in the structure. Otherwise, if level one is doing something to a structure we have no insight into, we have to analyze the whole thing every time, because we simply don't know what changed, right? So they created this paravirtual protocol for the interaction, and the protocol is simply faster.

There is a patch series for supporting this in KVM. It already works, and I would expect it to get merged — not this merge window, but the next one, because this window was already open while we were still working on the series. And the result: we go down to somewhat less than 9,000 cycles. If you remember what I showed you, we had around 20,000 before. So this alone is a pretty decent speedup.

Next benchmark. That first one was about as synthetic as it gets, so let's try one simple, real thing: a tight loop which just asks what time it is right now. It may seem very synthetic too — why would a program do that? But it actually happens. If you have an Apache server running and it's writing to its logs, it timestamps each entry in the log. So it needs to get the time, and it will be doing something very, very similar to this: reading the time from the system. So let's see how that performs.

On bare metal it's around 55 cycles, so it's pretty fast. Level zero or level one, it's not that different: 55 to 70 CPU cycles, nearly the same. But in level 2 you get something like 1,500 cycles — and that's with both the Spectre and Meltdown mitigations already in place everywhere. By the way, I had to redo the measurements for this talk multiple times: the kernel was changing so fast that it just didn't want to show me the same numbers twice. At some point I had two sets of numbers in here and wanted to show both, but then I thought: what's the point of showing you two sets? The conclusion doesn't change.

Well, 1,500 cycles just to read the time is not enough to perform well — you may be thinking that, and you're correct. So, what next? It doesn't perform, and we need to figure out why. You may already have a very good idea: something with our clock source is wrong, right? We probably have a better one somewhere. So what do we have? If you check on level 1, we get the TSC page clock source from Hyper-V. And on level 2, we get kvm-clock.
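The measurement loop is again tiny; a sketch along these lines (a reconstruction, same caveat as before). The clock source names come from sysfs, which is also where the observation above comes from — the exact strings vary a bit between kernel versions.

```c
/* Sketch of the clock_gettime() microbenchmark (a reconstruction).
 * With a stable clocksource, glibc stays on the vDSO fast path and a
 * read costs tens of cycles; if it has to fall back to a real
 * syscall, the cost jumps by an order of magnitude, especially with
 * the Spectre/Meltdown mitigations. The active clocksource can be
 * read from
 * /sys/devices/system/clocksource/clocksource0/current_clocksource. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    const uint64_t iters = 1000000;
    struct timespec ts;
    uint64_t start = rdtsc();

    for (uint64_t i = 0; i < iters; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);

    printf("%llu cycles per clock_gettime()\n",
           (unsigned long long)((rdtsc() - start) / iters));
    return 0;
}
```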
And if you go read the guidelines — the "how do I tune my system for best performance" documents from the hypervisor vendors — you will see that these are actually the recommended clock sources for both of them. For Hyper-V, they say: use the Hyper-V TSC page. For KVM, they tell you: kvm-clock is probably what you want to use on KVM. And both should perform very well. But for some reason they don't perform very well here. Why? At this point you have to use some tools, like perf, to figure it out — or even go and read the code. So I did. And it turns out that currently, KVM only treats the plain TSC clock source as a clock source good enough to enable its masterclock mechanism. What does that mean? It means that if the clock source in your L1, which is doing the virtualization, is something different from TSC, the stable-clock bit won't get passed to your level 2 guest. And that's where the difference comes from. When the clock source is not marked stable, the guest kernel can't use the vDSO fast path. And the vDSO means you are not doing a syscall: you read the clock entirely in user space. If you stay in the vDSO, you pay those roughly 70 cycles you saw. If you do a syscall, post-Spectre, you pay something like 1,500 cycles. There's your difference.

So, can we actually teach KVM that the Hyper-V TSC page is a perfectly good clock source? Yes, we can, and we did: a super simple patch, and it all starts working — I will show you the numbers later. But first, there's a catch: migration. Imagine you want to migrate your level 1 guest, with all its level 2 guests inside. And you migrate. What happens then? You may get migrated to a host whose CPU has a different TSC frequency. You can't rule that out.

And here's where this TSC page clock source matters, and why we need it. The TSC page is basically just a memory page shared with the hypervisor, which tells us that when we read the TSC from the hardware, we need to multiply it by a scale and add some offset. So the hypervisor can scale the raw TSC and shift it. When we actually migrate, the hypervisor updates this page for us, and when we read the clock next time, the reading is still correct: we won't see a jump, we won't see time speeding up or slowing down — we see the same continuous clock as before. And that's exactly what we need, because if time jumped or the effective frequency changed under you, the time measurements in your programs would simply be wrong. So we really need this, and it happens: the hypervisor does it for us automatically at level 1. Nothing to do there.
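To make the scale-and-offset mechanism concrete, here is a simplified sketch of the read side in C. The formula matches the spec, but the layout is reduced: the real page also starts with a sequence counter that readers loop on so they never see a half-updated page.

```c
/* Simplified sketch of a Hyper-V TSC-page style clock read. The page
 * is shared with the hypervisor; on migration the hypervisor rewrites
 * scale and offset so guest readings stay continuous. Field names are
 * simplified and the sequence-counter retry loop is omitted. */
#include <stdint.h>
#include <stdio.h>

struct tsc_page {
    volatile uint64_t tsc_scale;  /* 64.64 fixed-point multiplier */
    volatile int64_t  tsc_offset;
};

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* reference time = ((tsc * scale) >> 64) + offset (100 ns units) */
static uint64_t read_ref_time(const struct tsc_page *p)
{
    unsigned __int128 prod = (unsigned __int128)rdtsc() * p->tsc_scale;
    return (uint64_t)(prod >> 64) + (uint64_t)p->tsc_offset;
}

int main(void)
{
    /* Fake page just to show the arithmetic: scale = 0.5 in 64.64
     * fixed point, zero offset. The real page is provided (and
     * updated) by the hypervisor, never built by the guest. */
    struct tsc_page page = { .tsc_scale = (uint64_t)1 << 63,
                             .tsc_offset = 0 };
    printf("ref time: %llu\n",
           (unsigned long long)read_ref_time(&page));
    return 0;
}
```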
But what happens to level 2 guests? In level 2 we have more or less the same kind of clock source — kvm-clock, a pvclock-style clock source, the same idea from the clock source point of view. And it's supposed to work the same way: when KVM migrates its own guests, it updates their pvclock pages, so the readings stay correct. But here, nobody actually told KVM that we've just been migrated. KVM is itself a guest now, and a guest is not aware of migrations. We need to somehow tell it: please stop your guests and update all the pages before you let them execute again, so the readings stay correct. And we need to do this exactly when we migrate. So, can we do that? Yes, we can. In the specification we can find the so-called reenlightenment feature, and this is how it works: you can ask Hyper-V to send you an interrupt when you get migrated, and you can ask it to emulate all TSC accesses until you manage to update all the pages. We implemented that, and after that, these are the numbers we get. As you can see, the issue goes away completely: such workloads now run in level 2 at basically exactly the same speed as they run in level 1. And it works.

Okay, so those were microbenchmarks — synthetic tests, you might think. How does all this affect real workloads, and should you actually use it today, or wait for the patches to march into the kernel? Well, I took a couple of bigger workloads and did my measurements.

The first one was iperf. If you don't know it, iperf is a performance benchmark for networking. It uses big packets; it's not an NFV-style small-packet test, not something like DPDK. We just take the channel and try to saturate it: we send traffic in parallel streams and look at how well they perform. The configuration is somewhat important here: the hosts are pretty powerful, we have a 40G network link between two nodes, and since the card is a Mellanox ConnectX, we can actually do SR-IOV. SR-IOV in Hyper-V is kind of special, because only a few network cards are supported nowadays — Mellanox, and I think that's about it — but I was lucky to have one which is supported.

For the level 2 guest — this was on a recent kernel, after the Spectre/Meltdown fixes — the question is how we connect the level 2 network to the real network. And we don't actually have many options, because even with virtual functions we can't do normal bridging: we can't have more than one MAC address on the interface — level 0 simply won't allow us to. So I did what we get by default; the only thing I changed from the defaults was the number of queues on the virtio-net device, making it two. So, okay: we have a big level 1 guest, a level 2 guest on one NUMA node, and we run iperf.

These are the numbers. As you can see, if you run in level 1, you need about 4 parallel streams to saturate this 40 gigabit link. With level 2 it's worse, but it's not like the synthetic samples, where everything was many times slower. Especially with the enlightened VMCS, we are basically able to saturate the network as well — we just need more streams. And even with a normal VMCS, where we pretend we don't know anything about the hypervisor underneath, the difference is not that big.

I deliberately didn't fine-tune this setup, so you can go ahead and reproduce it. What you can actually play with is tuning; we can probably make it better, of course. You can use vCPU pinning in L2, or in L1, especially if you have multiple NUMA nodes. You may make your level 2 guest NUMA-aware, or you may pin it to a single NUMA node — and that's what I did here: everything for the level 2 guest sits on one NUMA node.
There are multiple techniques to play with here — multiqueue in particular — and there are multiple presentations and talks on how to do it properly; I'm definitely not an expert. I played with it a little, and it turned out that with two queues I get the best results — not sure why. You can also tune the card driver itself, but I found that the Mellanox driver does a great job by default: it creates as many queues as you have CPUs on one NUMA node and assigns them accordingly. So I didn't have to change much there.

Okay, the next part is almost the same, but without SR-IOV. In case you can't get a network card which works with SR-IOV in Hyper-V — which is likely — you have to use the paravirtualized device, the so-called VMBus network device, netvsc, which you may be familiar with. Other than that, the setup is exactly the same. So I just switched to netvsc and measured the difference. And you see something like this. In level 1, instead of four streams you now need five or six, but we can see that it saturates the device equally well. With level 2 it becomes more complicated: we're only reaching about 30 gigabits, and our enlightened VMCS doesn't actually help us a lot here, as we can see.

So why doesn't it help? Well, it's probably because we have too many copies of the same data. L2 uses virtio-net and copies packets to the L1 host; L1 copies them to the VMBus ring; the VMBus ring is observed by Windows, which takes each packet and copies it to the real queue of the network card. So we're not able to get past those 30 gigabits. That's the price of the copies.

Here we could also try things like pinning, but with netvsc it's more complicated. Paravirtualized Hyper-V devices talk over VMBus, and instead of real interrupts they use so-called channels. And these channels get distributed across CPUs automatically. Unlike IRQs, where you can do something like run irqbalance and it will try to spread them across different CPUs, the kernel driver distributes channels kind of randomly at creation time, and it stays like that — you cannot change the assignment. So what we can do is try to pin things in the guest to match where the channels landed, for example. Which is not really practical advice, because next time you may get a different distribution — you'd have to go and check every time. And in Hyper-V there is currently no way to re-assign a channel, so we cannot fix this in the kernel for now. Maybe Microsoft will add a way to do that and give us a notification for it, but so far we're not aware of one — at least it's not in the public specification. What you can play with is the number of channels on the netvsc device: a side effect of changing it is that the channels get torn down and re-created, and you end up with a different distribution. So if you don't like your distribution, change the number of channels. A hackish approach, but it works.

Okay, so the last benchmark was a kernel build. Nothing fancy — I wasn't running some special script, I just build a kernel. And to compare apples with apples, I had to reboot my level 1 between runs: in one case it ran a stock kernel, in the other a recent release candidate plus my enlightened VMCS series, so the enlightenments could be switched on and off.
So we build a kernel — with something close to an allyesconfig — and we see something like this. As you can see, the numbers are very close, actually. And our enlightenments don't even help in this case; they make things a tiny bit worse. So why is that? The answer is that when you run such a workload, it's CPU-intensive. We don't do many VM exits, as I explained in the beginning; we're basically sitting on the same CPU and doing computation. So, as we'd expect, the speed is the same — we are still executing the same instruction set, natively. You don't actually need to optimize anything if your workload is purely CPU-intensive. Oh, and I have to mention that I was running this build in memory, so no real hard drive was used and the comparison stays fair. So if you mine Bitcoin — if you use your VM for mining Bitcoin, though why would you — it doesn't really matter where: you can do it in level 2 equally well, or equally badly, as in level 1.

So, what else? What's left, if you read the specification? There are a couple more features which we are not implementing in Linux yet. The first is the enlightened MSR bitmap. I can probably describe it like this: when you virtualize, you can create a page with intercept bits for all MSRs. An MSR is one of those special registers in the CPU: you read it and get some result, or you write to it to change some setting. And you can tell the CPU when you want to have a VM exit: when the guest reads particular MSRs, or when it writes particular MSRs. It's a memory page. And, again, the problem is exactly the same one we had with the normal VMCS: when our level 1 hypervisor edits the structure, we don't really know what was changed there. So we have to go through the whole page and check all the bits to see if something was changed. Maybe nothing was changed in there at all — but we don't know. So they, again, introduced an enlightenment in Hyper-V: level 1 can tell the level 0 hypervisor whether something was actually changed in this page, and whether it really needs to re-examine it before it resumes the guest. It's supposed to speed things up a little bit, but we don't have this in Linux yet.
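As an aside, this is roughly what a lookup in that page looks like; a simplified C sketch following the bitmap layout documented in the Intel SDM.

```c
/* Simplified sketch of an Intel VMX MSR-bitmap lookup. Layout (per
 * the Intel SDM): one 4 KiB page; bytes 0-1023 are the read bitmap
 * for "low" MSRs 0x00000000-0x00001fff, bytes 1024-2047 the read
 * bitmap for "high" MSRs 0xc0000000-0xc0001fff, and the upper 2 KiB
 * repeat both ranges for writes. A set bit means "VM exit on this
 * MSR access". */
#include <stdbool.h>
#include <stdint.h>

static bool msr_access_exits(const uint8_t bitmap[4096],
                             uint32_t msr, bool write)
{
    uint32_t base;

    if (msr <= 0x1fff) {
        base = 0;                 /* low MSR range */
    } else if (msr - 0xc0000000 <= 0x1fff) {
        base = 1024;              /* high MSR range */
        msr -= 0xc0000000;
    } else {
        return true;              /* MSRs outside both ranges always exit */
    }

    if (write)
        base += 2048;             /* write bitmaps live in the upper 2 KiB */

    return bitmap[base + msr / 8] & (1u << (msr % 8));
}
```

The enlightenment is then essentially a dirty flag on top of this page: if L1 says nothing has changed since the last run, L0 can skip rescanning all 4,096 bytes when it builds the merged bitmap for L2.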
The other interesting thing is TLB flushing, and it's interesting even without L2 guests. As you may know, we sometimes need to flush the TLB: for example, when we switch from kernel page tables to user space page tables, or when we unmap something, we flush the TLB so we won't pick up a stale entry from it. And how is that done on x86? On x86 we have no single instruction to invalidate the TLB on all CPUs at once. If you want to invalidate the TLB on other CPUs, you send them an IPI — basically just an interrupt, nothing fancier — and each CPU invalidates its own TLB. Our virtualized guests do exactly the same: they send IPIs to their virtual CPUs, and those execute the flush. But the problem is that some of those vCPUs may not be scheduled at that moment — some other guest may be running there — so we send the interrupt and it just stays pending, because the vCPU is not executing and we're not able to deliver it. Moreover, the sender is waiting for the job to be done. So this becomes really painful with overcommit, when you have more than one vCPU per physical CPU.

So how do the Hyper-V folks solve this? Their answer to all such issues is to come up with a paravirtual protocol whenever the hardware path doesn't perform. They introduced a hypercall that you can use instead of sending IPIs to all the CPUs: you basically ask the hypervisor to do the flush for you. And if some vCPU is not currently running — something different is scheduled there, not your guest — you don't even need to invalidate anything on it, because a side effect of switching vCPUs is a TLB flush anyway. So the hypervisor skips those, and it actually does the TLB flush only on the CPUs where your vCPUs are scheduled. A few releases ago we got support for this PV TLB flush protocol into Linux, so when you run Linux on Hyper-V, you already benefit from it.

But the feature they also have, and which I listed here, is direct TLB flush for level 2 guests. Because when you're running nested, the probability that your virtual CPU is actually being executed at a given moment is even lower than at level 1, right? You may have several level 1 guests, and inside each level 1 several level 2 guests, so each of them gets a tiny, tiny portion of real CPU. And they provide an option to do this TLB flush directly from L2 guests: the hypercall goes straight into level 0; level 0 checks which virtual CPUs of this guest are currently scheduled and does the TLB flush accordingly. So this is also supposed to bring some performance improvement for nested level 2 guests, but that work hasn't started yet. If you are interested, join in — I'll be happy to help you get going.

And that's it. So, if you have any questions...

Yes, you can. The question was what's needed to use nested virtualization on Azure. The answer is that they have certain instance types which support it — Azure is made of several generations of hardware, and the newer instance types have it. I know for sure that the biggest one, M128, supports it. I think the D version 3 instances support it, and maybe E version 3, something like that. I'm not sure about the other instance types.

Yeah, the question was whether this is written down anywhere. I asked Microsoft directly, but basically: you just try. You create an instance of whatever type you like and try to load KVM. There is no flag telling you whether nested virtualization is enabled or disabled — they decide that on their side. If the module loads, it works; if it tells you "sorry, no hardware support," it doesn't, and you go try another instance type. Another question?

Yeah, the question was how this compares with running KVM on KVM. And the answer is that with these three features implemented, we actually perform significantly better than KVM on KVM — nested KVM still has a lot of pure software overhead. In particular, Paolo told me that before Spectre/Meltdown, in the best setup he had, a nested transition cost him somewhere in the low tens of thousands of TSC cycles. With my series I saw 7,500, so that's about half. So there is huge room for optimization inside KVM as a level 0 hypervisor, and we may actually think about bringing a feature like this into KVM one way or another. So, that's it — take care.