Hello, I'm Wanpeng Li from Tencent Cloud. I'm a long-term active KVM contributor in the community. Today I will introduce some KVM latency and scalability optimizations. This is today's agenda: first, I will introduce the fast paths for IPI delivery and for the TSC deadline timer; then I will introduce how to boost vCPUs that are delivering interrupts.

ICR and TSC deadline MSR writes are the main MSR-write VM exits in cloud environments. Per our observation, multicast IPIs are not as common as unicast IPIs such as the reschedule vector and the call function single vector. We introduced a fast path handler that handles certain performance-critical MSRs at a very early stage of KVM's VM-exit handler. Normally, after an MSR-write VM exit, there are various guest state saves and host state loads, various condition checks, the enabling of host interrupts and preemption, and also expensive RCU operations; we can even be interrupted or preempted after host interrupts and preemption are enabled. This mechanism is specifically used for accelerating writes to the x2APIC ICR that attempt to send a virtual IPI with physical destination mode, fixed delivery mode, and a single target. It significantly reduces the latency of such virtual IPIs by sending the virtual IPI to the target vCPU at this very early stage of the exit handler, before host interrupts are enabled and before expensive operations such as acquiring the KVM SRCU lock. I will show a small sketch of this fast path in a moment. We observed a 30% latency reduction in the IPI benchmark and a 22.3% latency reduction in the KVM unit tests.

AMD SVM introduced hardware acceleration (AVIC) to boost virtual IPI performance. The source vCPU doesn't need to VM exit when sending unicast or multicast IPIs in most conditions; the virtual IPI can be sent to the target vCPU directly. We ran hackbench and the IPI benchmark on an AMD Rome server: two sockets, 96 cores, 192 threads. The VM has 180 vCPUs with the hardware acceleration exposed. We observed a 3% hackbench performance improvement, since it benefits single-target IPIs. However, we observed a 55.2% IPI benchmark performance drop for multicast IPIs. It is strange that the hardware acceleration performs worse than software emulation.

Now, I will introduce the virtual TSC deadline timer fast path. Both arming the timer and the timer firing incur VM exits, and KVM does various housekeeping tasks before emulation. We implemented a fast path for emulating writes to the TSC deadline MSR. Besides short-cutting various housekeeping tasks in the vCPU run loop, the fast path can also deliver the timer interrupt directly, without going through KVM_REQ_PENDING_TIMER, because it runs in vCPU context. We also implemented a fast path for the VMX preemption timer VM exit. That VM exit can be handled cleanly, so it can be performed with interrupts off, returning directly to the guest. We observed cyclictest latency reduced by 16.5%.

Next, I will introduce the mechanism to boost preempted vCPUs to mitigate the lock holder preemption issue. A vCPU that is spinning is detected by hardware (pause-loop exiting); while in the host, the spinning vCPU yields its time to a lock holder candidate vCPU selected by a heuristic algorithm. Now we want to boost not just lock holders, but also vCPUs that are delivering interrupts. Most smp_call_function_many() calls are synchronous, so the IPI target vCPUs are also good yield candidates. We boost vCPUs at wake-up and interrupt delivery time.
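To make the MSR-write fast path described earlier more concrete, here is a minimal sketch of the idea. It is a simplification rather than the exact upstream code: the helpers fastpath_x2apic_icr_write() and fastpath_tscdeadline_write() are placeholders for the real emulation routines (each returning non-zero if the write cannot be handled here), and error handling is omitted. The point is that the WRMSR exit is decoded and, for the two hot MSRs, handled while host interrupts are still disabled, before the generic exit handling, SRCU locking, and state switching run.

static fastpath_t handle_fastpath_msr_write(struct kvm_vcpu *vcpu)
{
	u32 msr = kvm_rcx_read(vcpu);       /* WRMSR index comes in ECX */
	u64 data = kvm_read_edx_eax(vcpu);  /* WRMSR value comes in EDX:EAX */

	switch (msr) {
	case APIC_BASE_MSR + (APIC_ICR >> 4):   /* x2APIC ICR */
		/*
		 * Only the common case is accelerated: fixed delivery mode,
		 * physical destination, a single target, no shorthand.
		 * Anything else falls back to the normal exit handler.
		 */
		if (fastpath_x2apic_icr_write(vcpu, data))
			return EXIT_FASTPATH_NONE;
		break;
	case MSR_IA32_TSC_DEADLINE:
		/*
		 * Re-arm the TSC deadline timer directly in vCPU context,
		 * so no KVM_REQ_PENDING_TIMER round trip is needed.
		 */
		if (fastpath_tscdeadline_write(vcpu, data))
			return EXIT_FASTPATH_NONE;
		break;
	default:
		return EXIT_FASTPATH_NONE;
	}

	kvm_skip_emulated_instruction(vcpu);
	return EXIT_FASTPATH_REENTER_GUEST;
}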
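As for boosting vCPUs that are delivering interrupts, the idea is that a vCPU which has just been kicked or woken up to receive an interrupt is a good candidate for the directed yield that a spinning vCPU performs on a pause-loop exit. A rough sketch follows; the "ready" flag and the helper names are assumptions of this sketch, not a definitive implementation.

/* Mark a vCPU as a boost candidate when an interrupt is delivered to it. */
static void kvm_mark_interrupt_target(struct kvm_vcpu *vcpu)
{
	WRITE_ONCE(vcpu->ready, true);   /* cleared again once the vCPU runs */
	kvm_vcpu_kick(vcpu);
}

/* Used when picking a directed-yield target on a pause-loop exit. */
static bool vcpu_is_boost_candidate(struct kvm_vcpu *vcpu)
{
	/* Runnable again, or has an interrupt waiting to be delivered. */
	return kvm_arch_vcpu_runnable(vcpu) || READ_ONCE(vcpu->ready);
}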
Next is boosting the queue-head vCPU. Due to the FIFO order of the queued spinlock algorithm, whenever the hypervisor preempts the next waiter, which has not yet acquired the lock, no other thread is allowed to acquire the lock even after it is released, until that next waiter is allowed to run. Overcommitment increases the likelihood that the queue-head vCPU has been preempted and is not actively spinning. Rescheduling the queue-head vCPU in time to acquire the lock can get better performance than just depending on lock stealing in the over-subscribed scenario. The lock holder vCPU yields to the queue-head vCPU when unlocking, to boost the queue-head vCPU that was involuntarily preempted, or the one that voluntarily halted after failing to acquire the lock after a short spin in the guest.

The last one is a new hypercall used to yield to the IPI target vCPUs when sending a call-function IPI to many CPUs: the sending vCPU yields if any of the IPI target vCPUs was preempted. We just select the first preempted target vCPU that we find, since the state of the target vCPUs can change underneath us, and to avoid race conditions. I will show a small guest-side sketch of this after the performance numbers.

Let's see the performance numbers. We tested this on an Intel Cascade Lake machine: two sockets, 48 cores, 96 threads. Each VM has 96 vCPUs; one VM is running the ebizzy benchmark and the other VMs are running CPU-bound workloads. We observed a 3.4% improvement with 1 VM, a 24% improvement with 2 VMs, and a 48.3% improvement with 3 VMs. smp_call_function_many() calls can result in call function interrupts on the TLB shootdown path; we disabled the paravirtual TLB shootdown feature in this testing, since the call function interrupt is not easy to trigger with user-space workloads.
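Going back to the new hypercall, here is a simplified guest-side sketch of the send path for the call-function IPI. It follows the description above (yield to the first preempted target found); it is an illustration of the approach rather than the exact code, and details such as the APIC ID lookup are assumptions of the sketch.

static void kvm_smp_send_call_func_ipi(const struct cpumask *mask)
{
	int cpu;

	/* Send the call-function IPI as usual. */
	native_send_call_func_ipi(mask);

	/*
	 * Most callers wait for the targets, so boost the first preempted
	 * target we find.  The targets' state can change underneath us
	 * anyway, so one best-effort yield is enough and avoids races.
	 */
	for_each_cpu(cpu, mask) {
		if (vcpu_is_preempted(cpu)) {
			kvm_hypercall1(KVM_HC_SCHED_YIELD,
				       per_cpu(x86_cpu_to_apicid, cpu));
			break;
		}
	}
}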