Hello, I'm Kenta Ishiguro. Today, I will present the problem of excessive vCPU spinning in VM-agnostic KVM and mitigations against it. I'd like to introduce myself briefly. I'm a third-year PhD student at Keio University in Japan. My research interests are the performance and security of hypervisors.

Over-subscribing virtual CPUs is often used in cloud environments, as it enables better hardware utilization by multiplexing virtual CPUs on physical CPUs. However, over-subscription can incur excessive spinning, which degrades guest VM performance. For example, lock-holder preemption is one of the well-known problems that incur excessive spinning. Let's consider an ordinary case of spin-lock synchronization. Two physical CPUs need to acquire a spin lock. If CPU 0 has already acquired the lock, CPU 1 fails to acquire it and spins until CPU 0 releases the lock. Since physical CPUs are always active, the busy wait is always short. The operating system assumes physical CPUs are always active, so spin-lock synchronization works efficiently. However, in a virtualized environment, this assumption does not always hold. If virtual CPU 0 is preempted by the hypervisor after it acquires a lock, virtual CPU 1 keeps spinning until the hypervisor reschedules virtual CPU 0. This is called the lock-holder preemption problem.

Excessive vCPU spinning can degrade guest VM performance: vCPUs waste their execution time executing pause-loop instructions for a long time. Ideally, the hypervisor would know which vCPU should be scheduled right now to avoid excessive vCPU spinning such as lock-holder preemption. However, this is hard due to the semantic gaps between KVM and the Linux scheduler and between KVM and guest VMs. Because of the semantic gap between KVM and the scheduler, boosting a target vCPU can be impeded by the Linux scheduler, which keeps fairness between vCPUs. The semantic gap between KVM and guest VMs makes it hard to build a comprehensive candidate set of vCPUs for boosting.

On Intel x86, KVM leverages a hardware feature to mitigate excessive spinning. The feature is called pause-loop exiting, PLE for short. PLE detects excessive spinning and transfers control to the hypervisor for scheduling. KVM's current strategy to suppress PLE events is basically to reschedule from the PLE'd vCPU to another preempted vCPU. This is performed with the Linux scheduler by leveraging the yield_to() function provided by the Linux CFS scheduler. To yield and boost vCPUs, KVM makes a request to the Linux scheduler. Before making the request, KVM selects a vCPU to boost, in round-robin order, from candidate vCPUs. This candidate selection is important for resolving the cause of the PLE event as soon as possible when KVM attempts to reschedule vCPUs on a PLE event.

In the worst case, a lot of PLE events occur in a short period of time. This is a KVM trace taken while running two 8-vCPU VMs simultaneously. VM exits occur repeatedly due to the pause instruction, meaning PLE events occur continuously, more than 100 times, even though the VM has only 8 vCPUs and they are boosted in round robin. These continuous PLE events occur in a short period of time, and all PLE events in the series occur at the same code location. In conclusion, in the worst case, PLE events occur continuously, in a short period of time, at the same code location. Such continuous runs account for a large number of PLE events. This phenomenon is not rare, and most PLE events come from continuous PLE events. So rescheduling vCPUs does not work as well as we expected.
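To make the spinning concrete, here is a minimal user-space sketch of the spin-lock synchronization described above. This is illustrative code, not guest-kernel or KVM code: __builtin_ia32_pause() (GCC/Clang on x86) emits the PAUSE instruction, and a waiter stuck in this loop while the lock holder is preempted is exactly the pattern that pause-loop exiting detects inside a guest.

```c
/* spin_pause.c: a test-and-set spin lock whose waiters busy-wait in a
 * pause loop, the instruction pattern that PLE hardware watches for.
 * Build (x86, GCC or Clang): cc -O2 -pthread spin_pause.c -o spin_pause */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock;     /* 0 = free, 1 = held */
static long counter;        /* protected by the lock */

static void spin_lock(void)
{
    /* Spin until we swap 0 -> 1. Each failed attempt executes PAUSE;
     * a long run of PAUSEs in a guest triggers a PLE exit. */
    while (atomic_exchange_explicit(&lock, 1, memory_order_acquire))
        __builtin_ia32_pause();
}

static void spin_unlock(void)
{
    atomic_store_explicit(&lock, 0, memory_order_release);
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        spin_lock();
        counter++;          /* critical section */
        spin_unlock();
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* expect 2000000 */
    return 0;
}
```

On bare metal the wait is bounded by the critical section, so the pause loop is harmless; preempt the holder, as a hypervisor does to a vCPU, and the waiter burns its whole timeslice spinning.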
The figure at the bottom shows the CDF of PLE occurrences for several benchmarks. In these experiments, two 8-vCPU VMs are running simultaneously on KVM. The benchmark under test runs in one VM, and a CPU-bound benchmark, Swaptions, runs in the other VM. The figure shows that runs of more than 100 continuous PLE events account for more than 20% of the total number of PLE events in these four benchmarks. These PLE events mainly come from two functions, native_queued_spin_lock_slowpath and smp_call_function_many, which means both spin locks and TLB shootdowns are causes of PLE events.

I have identified several problems that incur a large number of continuous PLE events. Lost opportunity and over-boost are due to the lack of a comprehensive approach to identifying the root cause of a PLE event; I introduce stricter boosting rules as mitigations against these problems. The other problem is scheduler mismatch, which is due to the semantic gap between KVM and the Linux scheduler: the Linux scheduler ignores boost requests from the hypervisor, so the PLE'd vCPU is rescheduled repeatedly before the boosted vCPU is scheduled, and continuous PLE events occur. I introduce the deboost mitigation against this problem. In the rest of this talk, I describe these problems and mitigations.

KVM's candidate vCPU selection for boosting has evolved as follows. The current version implements directed yield, which leverages yield_to() to yield the PLE'd vCPU and boost a target vCPU. The candidate selection has been enhanced with optimizations against lock-holder preemption: prioritizing least-recently PLE'd vCPUs, boosting only preempted vCPUs, and boosting only vCPUs in kernel mode. Introducing these optimizations results in missing boosts of vCPUs that are the cause of a PLE event due to TLB-shootdown synchronization. Boosting halted vCPUs when they receive an IPI alleviates the latency of TLB-shootdown synchronization.

The experimental results and the history of the candidate vCPU selection strategy give us three insights. First, spin locks still incur continuous PLE events. Second, vCPUs in user mode also need to acknowledge IPIs when a TLB shootdown happens. Third, there is no need to boost IPI receivers when lock-holder preemption happens. The first insight is due to the scheduler mismatch problem, which is described later. The second is called lost opportunity in this work. The third is called over-boost. To mitigate these problems, I introduce two new candidate selection rules. One is boosting IPI-receiver vCPUs even in user mode. The other is not boosting halted vCPUs if the PLE'd vCPU has not sent an IPI to them.

Next, I'll talk about the scheduler mismatch problem. KVM schedules vCPUs cooperatively with the Linux scheduler. However, the Linux scheduler does not distinguish vCPUs from other threads. KVM makes requests to the Linux CFS scheduler to boost vCPUs, but the Linux CFS scheduler always keeps fairness between vCPUs, which results in the Linux scheduler ignoring requests from KVM.

Let's see a case study of the scheduler mismatch problem. Suppose that KVM tries to yield vCPU 0 and boost vCPU 1, where vCPU 0 is waiting for vCPU 1 to release a lock. In the Linux CFS run queue, vCPU 0 has very high priority and vCPU 1 has low priority. The scheduler picks the highest-priority task, vCPU 0. Then, if the CPU time of vCPU 1 minus the CPU time of vCPU 0 is more than a threshold, the scheduler decides not to yield vCPU 0 and not to boost vCPU 1, because it considers scheduling vCPU 1 too unfair. vCPU 0 will trigger PLE events again and again because vCPU 1 still holds the lock. vCPU 0 consumes its CPU time executing the pause loop; eventually, vCPU 1 is boosted only after vCPU 0 has repeatedly wasted its execution time on PLE events.

I introduce the deboost mitigation against this scheduler mismatch problem. Deboost makes the scheduler no longer hesitate to boost another vCPU in place of the vCPU that exited due to PLE, by lowering the PLE'd vCPU's priority. As a result, vCPU 1 is boosted without wasting vCPU 0's CPU time.
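Here is a hedged sketch of the fairness test from the case study above. It is illustrative code, not the kernel's: the real decision comes out of CFS's vruntime bookkeeping around yield_to(), and the type, field name, and threshold below are assumptions made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for a CFS scheduling entity. */
struct vcpu_task {
    uint64_t cputime_ns;    /* CPU time consumed so far */
};

/* Made-up stand-in for the scheduler's fairness granularity. */
#define FAIRNESS_THRESHOLD_NS 1000000ULL    /* 1 ms */

/* Would the scheduler honor KVM's request to yield from the PLE'd
 * vCPU (curr) to the boost target? It refuses when the target has
 * already consumed more CPU time than curr by more than the
 * threshold, because running the target even sooner looks unfair. */
static bool would_honor_boost(const struct vcpu_task *curr,
                              const struct vcpu_task *target)
{
    return target->cputime_ns <= curr->cputime_ns + FAIRNESS_THRESHOLD_NS;
}
```

In the case study, vCPU 1 (the lock holder) is far ahead of vCPU 0 in consumed CPU time, so the request is dropped and vCPU 0 goes straight back to spinning.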
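Deboost attacks exactly that check. Below is a sketch of its effect, reusing the illustrative types above; the catch-up rule is an assumption for illustration, while the key property from the talk is real: deboost only penalizes the spinning vCPU and never raises the boosted vCPU's priority.

```c
/* Deboost, sketched: charge the PLE'd vCPU extra CPU time, i.e.,
 * lower its priority. The gap between it and the boost target
 * shrinks, so the fairness test above stops rejecting the yield. */
static void deboost(struct vcpu_task *curr, const struct vcpu_task *target)
{
    if (target->cputime_ns > curr->cputime_ns)
        curr->cputime_ns = target->cputime_ns;  /* catch up to the target */
}
```

After deboost(vcpu0, vcpu1), would_honor_boost(vcpu0, vcpu1) returns true, so vCPU 1 runs and releases the lock instead of vCPU 0 burning another timeslice in the pause loop. Because nothing is ever raised above its fair share, the co-running VM should not lose performance, which the evaluation confirms.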
I have implemented these two mitigations in KVM in Linux 5.6.0, modifying only 41 lines of code. The mitigations were evaluated on an 8-core server running two VMs with 8 vCPUs each. The mitigations reduce the number of PLE events in four benchmarks. In terms of performance improvement, application execution time is reduced by up to 40% and throughput is improved by up to 75%. I also evaluated fairness between VMs. The results show no performance degradation for the co-runner, because the mitigations do not raise the priority of the boosted vCPU.

In conclusion, over-subscribing vCPUs incurs excessive spinning. Pause-loop exiting is a hardware feature against it. Unfortunately, PLE alone does not fix the problem, due to the semantic gaps between KVM and guest VMs and between KVM and the Linux scheduler. The introduced mitigations improve application throughput by up to 75%. This problem is also being investigated by the KVM community; please see the links below, and please see our paper for more detailed experiments and analysis. Thank you for your attention.