Hi, I'm Jun Nakajima from Intel. Today I'm going to talk about implementation options for a KVM-based Type 1 or 1.5 hypervisor. Before that, I'd like to thank Chang Xiao and Anthony for their effort and contribution to this project; they implemented the POCs and also measured a lot of data. Thank you very much, Chang Xiao and Anthony. As usual, here is the agenda: I'll start from the motivation, show some implementation options, then the POC performance data, followed by our conclusion and next steps.

I presented this at the last KVM Forum, but at that time I pointed out the security risk of a guest on Linux KVM. Basically, KVM piggybacks on, or depends on, Linux as the host, and Linux, as you know, needs to run various workloads and is much larger than KVM itself or a hypervisor, so it has more attack surface, making the guest more exposed. Also, a user-space VMM like QEMU has full access to the guest memory and the guest CPU state, and the kernel has full access to any guest memory or CPU state. So that's the kind of security risk in the existing Linux KVM implementation.

So, the motivation for a Type 1 hypervisor. This is nothing new; for example, Xen is kind of a Type 1.5, and Hyper-V and other hypervisors have a similar architecture. The benefit of a Type 1 or 1.5 hypervisor is that it separates the hypervisor from Linux. Linux handles the more complex operations, for example I/O, which uses various device drivers from various parties, and process management and user-process handling, whereas the hypervisor is responsible for isolation. Also, if trusted, the hypervisor can create secure environments, like a trusted execution environment or a trusted VMM, on top of the hypervisor.

Now let's take a look at how we can convert KVM to Type 1.5. We take a standard Linux plus KVM system and insert a hypervisor as L0. Now KVM becomes L1, and the existing VMs become nested guests, so L2. But we still want to keep the I/O passthrough: the hypervisor itself basically doesn't have any I/O device drivers. And we call the host Dom0, like the Xen guys did. Essentially this is very similar to the Xen architecture; the difference is that Dom0 here is basically unmodified Linux, but there are more differences I'll talk about.

In terms of implementation, we have two extremes. One extreme is to use a minimally configured Linux plus KVM as the hypervisor. Since it's Linux, although it doesn't have I/O device drivers, it's basically an operating system: it has a scheduler, memory management, and so forth. It boots first and separately from Dom0. As Linux, it can run user processes, and it can run a user-level VMM for Dom0. We don't use QEMU per se for Dom0, mainly because we don't need to: I/O virtualization is not really required there, since I/O is passed through. For the other VMs, we would probably need to use QEMU. This is one implementation.

The other extreme is a lightweight hypervisor that is simply a deprivileged Linux for isolation, and it's just reactive: as long as the host Linux and the guest Linux behave well, it doesn't generate VM exits. It can be loaded by Linux at early boot time, as long as Linux was healthy at that time. And the common point is, again, the I/O passthrough.
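To make this "reactive" idea a bit more concrete, here is a minimal sketch of what the L0 exit handler of such a deprivileged-Linux hypervisor could look like. This is only an illustration, not the actual code: the exit-reason and field names follow the Linux VMX conventions, and the helpers are hypothetical.

```c
/*
 * Illustrative sketch only: the lightweight L0 hypervisor is "reactive".
 * It only sees the VM exits that the hardware forces (e.g. CPUID) or that
 * it explicitly asked for to protect itself; a well-behaved, deprivileged
 * host Linux never triggers anything else, so its normal code path stays
 * essentially the bare-metal one.
 */
static void l0_handle_exit(struct vcpu *vcpu)
{
        u32 reason = vmcs_read32(VM_EXIT_REASON) & 0xffff;

        switch (reason) {
        case EXIT_REASON_CPUID:                 /* unconditional exit under VMX */
                emulate_cpuid(vcpu);            /* hypothetical helper */
                advance_rip(vcpu);              /* skip the emulated instruction */
                break;
        case EXIT_REASON_EPT_VIOLATION:         /* only pages L0 chose to protect */
                handle_protected_access(vcpu);  /* hypothetical helper */
                break;
        default:
                /* a healthy deprivileged host Linux should not get here */
                warn_unexpected_exit(vcpu, reason);
                break;
        }
        /* VMRESUME back into the deprivileged host Linux */
}
```

The point is that a well-behaved host only ever triggers the exits the hypervisor asked for, so its day-to-day code path is unchanged.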
So let's talk about the pros and cons of each implementation option. The first one is the Linux plus KVM hypervisor. Again, the hypervisor here is a minimally configured Linux plus KVM, so it also has user-level support, i.e. the user-level VMM and QEMU. The benefit is that, as a hypervisor, it can run unmodified guests as L1, and since the hypervisor is Linux plus KVM, you can keep benefiting from Linux and KVM.

But there are some disadvantages to this approach. First of all, and I'll talk about this later, there is higher latency for Dom0 because of the double scheduling and the VM exits. It's also still big, so it may be an issue as a TCB; well, we may be able to handle that. And then, what about the guests? As long as you run a guest within Dom0, it's just Linux, a typical Linux system; but this L1 guest is outside Dom0, and there are some issues with handling those VMs. Also power management: essentially, who should manage the power for the CPUs and the platform? So let me talk about some scheduling and power management issues here.

As the hypervisor, this Linux needs to own the VM scheduling. To that end, we need to intercept HLT and emulate it for Dom0. But this causes inefficiency, especially for client use cases, because of the two-level scheduling: we now have scheduling at two levels, in the hypervisor Linux and also within the Dom0 Linux kernel. Because of that, we see unexpected latency in the VMs, especially in Dom0.

The other issue is how we are going to create VMs. We need to invoke a QEMU process on the hypervisor side to create an L1 VM, but from the Dom0 user's point of view, that user space is not accessible; we cannot really reach the hypervisor-level user space. So we need some secure communication from the user level in Dom0 to the hypervisor, i.e. to QEMU or some other process on the hypervisor side. QEMU itself is also a problem, because from the QEMU process's point of view there are no device drivers available in the hypervisor; only a memory file system will be available. So QEMU in the hypervisor can run a VM, but without virtio, so again it's limited. There are probably ways to allow that QEMU to access the real I/O devices, but we don't know how at this point; we would probably need something like a Xen or Hyper-V kind of solution here.

Now let's switch to the lightweight hypervisor. The pros are, like I said, the code path: if you look at the code path of Dom0, it's basically almost identical to bare metal, the same code as bare-metal Linux KVM. That means low overhead and low latency, and it's a small TCB. That's an advantage if you want to run one VM, for example, as an enclave or a secure environment for some purpose. On the cons side, we have a similar kind of problem: this lightweight hypervisor basically doesn't have any device drivers, and no user processes are available either, so it's not so easy to support virtual devices in the L1 VMs. You can run unmodified guests, but in that case the VMs run as L2, as guests of KVM, which sits at L1, i.e. nested, so there may be some performance concern with that.

There are some optimizations when running a KVM guest on top of the lightweight hypervisor. One is so-called optimized nesting; I'll talk about the details on the next page, but we can pass through most of the fields of the shadow VMCS, and we can also convert the shadow VMCS to a real VMCS very quickly by just flipping one bit. The other technique is that we can have the first-level KVM entry point inside the lightweight hypervisor: VM exits are then handled immediately inside the L0 hypervisor, and we quickly go back. For more complex handling we need to go back to KVM, so in that case we need to enter KVM in Dom0 for that part; but it's still faster as long as we can handle the exit within the L0 hypervisor. In that case, the L2 VM looks kind of like an L1 VM.
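As a rough sketch of that first-level entry point idea (again with hypothetical helper names, not the POC code), the structure could look like this:

```c
/*
 * Sketch of the "first-level KVM entry point" optimization.  When an L2
 * guest exits, L0 first tries to handle the exit itself and resume L2
 * directly; only the complex cases are reflected to the L1 KVM in Dom0,
 * which is what plain nested virtualization would do for every exit.
 */
static void l0_handle_l2_exit(struct vcpu *vcpu)
{
        u32 reason = vmcs_read32(VM_EXIT_REASON) & 0xffff;

        /* Fast path: simple exits L0 can emulate on behalf of the L1 KVM. */
        if (reason == EXIT_REASON_CPUID ||
            reason == EXIT_REASON_MSR_READ ||
            reason == EXIT_REASON_MSR_WRITE) {
                if (fast_path_emulate(vcpu, reason)) {  /* hypothetical helper */
                        resume_l2(vcpu);                /* straight back into L2 */
                        return;
                }
        }

        /*
         * Slow path: reflect the exit to the L1 KVM in Dom0 as a normal
         * nested VM exit; from here the flow is the usual nesting path.
         */
        reflect_exit_to_l1(vcpu);
}
```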
So now let me talk about more details on nested virtualization. For nested virtualization in KVM, we use so-called VMCS shadowing. The benefit of VMCS shadowing is that it doesn't cause a VM exit upon VMREAD or VMWRITE in the L1 VMM: if the hypervisor allows the guest VMM to directly access the VMCS, it doesn't need to generate a VM exit, it just accesses the shadow VMCS. The hypervisor can set bitmaps for the VMCS fields, separate bitmaps for the read and the write operations. As long as a field is not in the bitmap, the L1 VMM can simply do the VMREAD or VMWRITE operation without causing a VM exit to the L0 VMM. But sometimes the L0 VMM needs to intercept: even with VMCS shadowing, certain reads or writes of VMCS fields cause a VM exit. The biggest problem with VMCS shadowing, though, is that the shadow VMCS itself cannot be used for VM entry. So when we go back into the L2 VM, we need to copy or sync some of the VMCS fields to make a real VMCS for the VM entry back into L2. Now, the optimization we have doesn't require that copy or sync, because the shadow VMCS is identical to the real VMCS. We still need to intercept a limited set of VMCS fields accessed by the L1 VMM for security purposes, but the contents are basically the same, so we can just flip the bit.

Now let's talk about the POC. We extended VBH to create the lightweight hypervisor POC. The original VBH (Virtualization Based Hardening) simply deprivileges the Linux kernel to harden the kernel using hardware-based virtualization features; we pass through all I/O and also the APIC. In the POC, beyond VBH, we added simple nesting support; this only works for an L1 VMM where GPA is identical to HPA. Like I showed, we implemented the optimized VMCS shadowing and also a virtual EPT to ensure isolation. We also added a feature to run a simple L1 VMM in a trusted environment; we can run OP-TEE OS (please see the pointer below). And we are working on virtual IOMMU support.

Okay, now let's take a look at the performance data. We compare the L1 or L2 VM performance on the KVM-based hypervisor versus the lightweight hypervisor. The first measurement is a comparison of L2 VMs: L2 VMs on top of the Linux KVM hypervisor versus on the lightweight hypervisor. This result shows the improvements from flipping the shadow-VMCS indicator, the one bit. If you look at the VM-exit cost, initially KVM had this much, but now it's less than one-tenth; you see a consistent result in terms of VM-exit latency reduction, about 10x or more. Also, on VM entry, because of the fast switching, you see almost 10x or more than 10x improvement. So we have very good L2 performance compared with KVM's L2. Then, if we compare KVM L1 versus that L2 on the lightweight hypervisor, the result is that they are almost the same.
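Those VM-exit and VM-entry improvements come from the one-bit shadow-VMCS conversion mentioned above. Here is a minimal sketch of the idea, with hypothetical helper wrappers around the VMCLEAR/VMPTRLD instructions, not the POC code. Architecturally, bit 31 of the first 32-bit word of a VMCS region is the shadow-VMCS indicator, and a VMCS with that bit set cannot be used for VM entry; because the shadow VMCS here is kept identical to the real VMCS, L0 can reuse the same region instead of copying every field into a separate VMCS.

```c
/* Bit 31 of the first 32-bit word of a VMCS region: shadow-VMCS indicator. */
#define VMCS_SHADOW_INDICATOR   (1u << 31)

static void shadow_to_ordinary_vmcs(u64 vmcs_pa, u32 *vmcs_va)
{
        vmclear(vmcs_pa);                       /* make the region inactive first */
        vmcs_va[0] &= ~VMCS_SHADOW_INDICATOR;   /* flip the one bit */
        vmptrld(vmcs_pa);                       /* now usable for VM entry into L2 */
}
```

Setting the bit again turns the region back into a shadow VMCS when control returns to the L1 VMM.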
Now, we do have some regression in the I/O area at this point, but this is a first implementation, so we don't have any optimization for I/O yet; overall the performance looks good. We also don't have the other optimization yet, the first-level KVM entry point handling, so today the L2 VMs on the lightweight hypervisor run in a purely nested way.

From the POCs, we found a couple of things. We also have a POC for the Linux KVM hypervisor, and it actually has a structural impact, in the sense of resource management: scheduling, like I said the double scheduling, and also power management, since the Linux acting as the hypervisor needs to own power management. Although the hypervisor doesn't have any I/O devices, it still needs to handle VM management; but from the user's point of view, I mean the Dom0 user level, it is not allowed to manage those VMs directly, because the QEMU on Dom0 only handles the VMs inside Dom0. The virtio implementation would also be a bit tricky; we would need a solution similar to what Xen or Hyper-V provides, like XenBus and a front-end/back-end split. The bigger challenge is the redo of the performance and validation work if we switch to this option, because you have a different resource management structure, which is a large difference. In terms of performance and tuning, you cannot really get the same results as bare metal: once you have this hypervisor architecture, you need to redo the optimization and tuning. On the lightweight hypervisor side, we had some concern about nesting. At this point, the performance of L2 on the lightweight hypervisor and KVM L1 is almost the same except for I/O; I/O is about 90% of L1 KVM at this point, so we still need more optimization for I/O.

Here's our conclusion. The lightweight or reactive hypervisor approach is more suitable for making the existing Linux KVM more secure, i.e. a Type 1 or 1.5 VMM, because we can maintain the same code path as bare-metal Linux KVM, including scheduling, power management, and so forth, and it also provides low latency and low overhead. In addition, the VBH-based hypervisor can harden the Dom0 kernel and the guest kernels as well. KVM guests also run with minimal overhead, even though they run in an L2 environment. We also found an advantage when implementing a trusted execution environment, because of the small TCB.

So these are the next steps: we want to finish the VBH-based POC, especially completing the IOMMU virtualization, and optimize KVM guests more, especially in the I/O areas, in particular I/O write operations. We will also add the first-level KVM entry point in VBH, and after that we will share the source code. With that, this is the end of my presentation. Thank you.