Hello everyone, my name is Deng Liang, a software engineer at ByteDance. In this talk, I will introduce our work, KVM DeVirt, and present the technical details of how we extend KVM into a zero-overhead partitioning hypervisor.

The first part is our motivation. We observe in our data center that the number of cores in a single server is increasing rapidly; for example, there are already 384 cores on AMD Genoa and 224 cores on recent Intel Xeon servers. Running a single Linux kernel on this many cores introduces several problems. First, the Linux kernel hits many-core scalability bottlenecks, for example lock contention in the file system, the network stack, the process scheduler, memory management, and so on. Second, fault isolation: if one application crashes the kernel, the whole server crashes, and this gets worse as the number of applications grows. Third, it is hard to satisfy application-specific kernel requirements, such as kernel configurations and kernel boot parameters, with a single kernel.

An existing way to solve these problems is to use KVM virtualization to partition the server and run a separate guest kernel in each VM. However, the current virtualization overhead is non-trivial. The overhead is caused by additional VM exits due to timer and IPI virtualization, virtio notifications, some privileged instructions, and host interrupts. We also observe that posted interrupts, which were introduced by hardware vendors to eliminate VM exits for guest interrupts, still incur some overhead because of the complex hardware path designed for security. Moreover, additional address translations, namely stage-2 translation (EPT on Intel or NPT on AMD) and DMA remapping, also contribute to the overhead.

So we introduce KVM DeVirt, which runs bare-metal-like machines (BMs) for physical resource partitioning rather than security isolation, on both Intel and AMD platforms. It leverages a set of pass-through techniques to eliminate all VM exits after the guest kernel's initialization phase. The additional address translations are also eliminated by its memory de-virtualization and DMA de-virtualization techniques. As a result, a BM on a server with a single partition can achieve the same performance as the native host, and with multiple partitions its performance can even exceed the native host, since separate guest kernels resolve the contention problems. In the following, I will detail these techniques.

The first technique is interrupt pass-through. We do not use posted interrupts; instead, we pass the physical local APIC registers, including IRR, ISR, and EOI, directly through to the VM and clear the external-interrupt exiting control in the VMCS so that guest interrupts are delivered directly in non-root mode. However, with this configuration host interrupts would also be delivered to non-root mode and be lost. To prevent this, we configure host interrupts arriving at the guest cores as NMIs, so a VM exit is still triggered for them, and we re-trigger a self-IPI in the host NMI handler to solve the IRQ masking issue. To inject a virtual interrupt into the guest, we only need to send a physical self-IPI before VM entry, since the physical local APIC is passed through to the VM.

After introducing interrupt pass-through, the interrupt remapping logic for pass-through devices also changes. We directly fill the IRTE in the IOMMU hardware with the guest vector of the guest device interrupt and the APIC ID of the physical core where the VCPU runs, as sketched below.
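To make this concrete, here is a minimal sketch, in C, of how such an entry could be filled. The struct irte layout and the devirt_update_irte() helper are illustrative assumptions rather than the exact VT-d or AMD-Vi entry format; the point is only that the entry carries the guest's own vector and the physical APIC ID of the core where the target VCPU runs.

```c
#include <stdint.h>

/* Simplified, illustrative view of an interrupt remapping table entry;
 * real VT-d / AMD-Vi entries use a packed bit layout. */
struct irte {
    uint8_t  present;        /* entry is valid */
    uint8_t  dest_mode;      /* 0 = physical destination */
    uint8_t  trigger_mode;   /* 0 = edge-triggered */
    uint8_t  delivery_mode;  /* 0 = fixed (normal IRQ, not NMI) */
    uint8_t  vector;         /* vector allocated by the guest kernel */
    uint32_t dest_apic_id;   /* physical APIC ID of the core running the VCPU */
};

/* Hypothetical helper: program the entry so the device interrupt is
 * delivered straight to the guest core, with no host involvement. */
static void devirt_update_irte(struct irte *e, uint8_t guest_vector,
                               uint32_t phys_apic_id)
{
    e->present       = 1;
    e->dest_mode     = 0;
    e->trigger_mode  = 0;
    e->delivery_mode = 0;            /* deliver as a normal IRQ */
    e->vector        = guest_vector; /* guest vector, used unmodified */
    e->dest_apic_id  = phys_apic_id; /* core where the target VCPU runs */
    /* A real implementation would also invalidate the IOMMU's IRTE cache. */
}
```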
Thus, the interrupt of a pass-through device, which is configured as a normal IRQ, can be delivered directly to non-root mode on the core where the VM runs. When the VM changes the virtual-IRQ-to-VCPU bindings or the guest vectors, the VFIO module in the host updates the IRTE with the new values.

Next, we introduce the technique named IPI pass-through. The VM can directly access the physical ICR register in the local APIC, since we set the ICR register to pass-through mode in the VMCS on Intel or the VMCB on AMD. At the sending core, the VM first maps the virtual APIC ID of the target VCPU to the physical APIC ID and accesses the ICR directly to send the IPI. At the receiving core, the IPI, which is configured as a normal IRQ, is delivered directly to the VM without any VM exit; this process was detailed above in the interrupt pass-through technique.

The next technique is timer pass-through. The VM can directly access the TSC-deadline MSR, since we set it to pass-through mode in the VMCS or VMCB. In its LAPIC next-event path, the VM writes the timer deadline directly into the physical TSC-deadline MSR. Since the VM can only see the virtual TSC, it first subtracts the guest/host TSC offset from the timer deadline; the offset is mapped by KVM into the VM at VM startup (see the sketch at the end of this part). On LAPIC timer expiration, the timer interrupt, which is configured as a normal IRQ, is delivered directly to the VM without a VM exit.

The next technique is memory de-virtualization, which removes the use of stage-2 translation (EPT on Intel or NPT on AMD). When the BM updates its guest page tables, it first translates the GFNs in the update request to PFNs, so the guest page tables use PFNs directly and EPT is not needed. When the BM reads its guest page tables, it translates the PFNs back into GFNs. To support these translations between GFNs and PFNs inside the BM, KVM statically pins the BM's guest memory at VM startup and maps both the GFN-to-PFN and the PFN-to-GFN mappings into the BM. In addition, we use a hypercall in the guest page fault handler to emulate an MMIO trap to the host.

The next is DMA de-virtualization, which eliminates the use of IOMMU page tables. When the pass-through device driver invokes the DMA map function before issuing a DMA request, it relies on the GFN-to-PFN mappings to translate the GPA in the DMA request into an HPA. Thus DMA remapping is not required, and we can simply configure the IOMMU DMA-remapping hardware in pass-through mode. A small modification to the device driver is also required to ensure that the DMA map function is invoked at page-size granularity.

The last technique is virtio notification pass-through. When the virtio front-end in the BM sends a notification to the back-end, it accesses the physical ICR register directly to send an IPI to the host core without any VM exit. When the back-end sends a notification to the front-end in the BM, it also sends an IPI with the guest vector to the guest core, and the IPI is then delivered to the BM through the interrupt pass-through technique, again without any VM exit.

Other optimizations to remove VM exits include CPU isolation, nohz_full, handling CPUID inside the BM directly with dynamic binary rewriting, delaying some host IPIs to the next VM exit, and passing through the HLT and MWAIT instructions.
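As an illustration of the timer pass-through path described above, here is a minimal sketch of the guest-side deadline programming, assuming an x86-64 guest kernel context. The tsc_offset_from_kvm variable and the devirt_set_tsc_deadline() helper are hypothetical names; the only architectural facts used are that the guest TSC equals the host TSC plus the offset and that the TSC-deadline MSR (0x6E0) is passed through to the guest.

```c
#include <stdint.h>

#define MSR_IA32_TSC_DEADLINE 0x6e0

/* Guest/host TSC offset mapped into the guest by KVM at VM startup
 * (hypothetical symbol); guest_tsc = host_tsc + tsc_offset. */
static int64_t tsc_offset_from_kvm;

static inline void wrmsr64(uint32_t msr, uint64_t val)
{
    asm volatile("wrmsr" : : "c"(msr), "a"((uint32_t)val),
                            "d"((uint32_t)(val >> 32)));
}

/* Called from the guest's LAPIC next-event path with a deadline in guest TSC units. */
static void devirt_set_tsc_deadline(uint64_t guest_deadline)
{
    /* Convert the guest-TSC deadline into host-TSC units ... */
    uint64_t host_deadline = guest_deadline - (uint64_t)tsc_offset_from_kvm;

    /* ... and program the physical TSC-deadline MSR directly, with no VM exit. */
    wrmsr64(MSR_IA32_TSC_DEADLINE, host_deadline);
}
```

When the deadline expires, the timer interrupt is delivered directly to the guest through the interrupt pass-through path, so the whole timer path involves no VM exit.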
Finally, we present our performance results. Here is the IPI latency result. The left figure is for Intel: the IPI latency in the BM is nearly equal to that on the native host, while the IPI latency of the VM based on upstream KVM is much worse. The AMD result is on the right: the IPI latency in the BM is again nearly equal to the native host, and it is even much better than the VM with AVIC enabled.

Here is the timer latency result based on our micro-benchmark, which measures the latency from setting an immediately expiring deadline to the arrival of the timer interrupt. The left figure is for Intel: the timer latency in the BM is equal to that on the native host, while the result in the VM is much worse than the BM. The right figure is for AMD, and the result on AMD is similar.

Here is our micro-benchmark result for cache-line prefetch, which we think is representative of page-table walk overhead. The latency in the BM, 9.35, is nearly equal to the 9.32 on the native host. The latency of the VM with 4 KB EPT page size is worse, at 14.3. To our surprise, the latency of the VM with 1 GB EPT page size, 11.1, is also worse than the BM, even though the EPT TLB always hits in that configuration.

We also tested KVM DeVirt on a real-world application in our ByteDance data center; in the following, we refer to the application as XS for confidentiality. First, we compare KVM DeVirt to upstream KVM. In the table, we show the XS end-to-end latency improvement using the VM as the baseline. As we can see, the BM achieves a 20% to 30% performance improvement compared to the VM. With only the interrupt, IPI, and timer pass-through plus the other optimizations, the improvement is 8%; with only memory de-virtualization, it is 14%; with only DMA de-virtualization, it is 2%. The overall improvement is 20% to 30%.

Second, we compare KVM DeVirt to the native host. The left table shows the result of using KVM DeVirt to partition the server into a single BM and running an XS instance in it: the XS end-to-end latency is nearly equal to the native host. The right table shows the result of partitioning the server into 4 BMs and running an XS instance in each partition: the XS end-to-end latency is better than the native host in this case, with nearly 9% improvement. The improvement is mainly due to the reduction of kernel lock contention when separate guest kernels are used.

Lastly, we discuss the current status and future work of KVM DeVirt. For the current status, we support both Intel and AMD platforms, and we support both QEMU and Cloud Hypervisor as the VMM. For future work, we plan to push our host kernel and guest kernel patches upstream, and we also plan to support live migration as well as virtio-balloon and memory hot-plug for the BM.

That is the technical detail of our work, KVM DeVirt. To conclude, we have introduced our pass-through and de-virtualization techniques, which remove all virtualization overhead at runtime, and we have shown that partitioning with them can improve performance even compared to the native host. Thank you very much for listening; any questions are very much appreciated.