Hello, everyone. Welcome to my session. This is Gavin Shan, working for Red Hat as a software engineer. Today I'm going to talk about asynchronous page fault and SDEI virtualization support for ARM64.

Here's the overview. I'm going to talk about the motivation, why we need asynchronous page fault, and the current status of the feature, and then the general requirements of the feature. After that, I'm going to talk about the asynchronous page fault and SDEI virtualization implementation on ARM64. Then I present the performance numbers and come to the conclusions. Because the feature is still under development, I need some support from the community to review my code.

About the motivation: generally speaking, asynchronous page fault improves guest parallelism significantly by rescheduling other processes for execution in the guest while the host resolves the stage-2 page fault. Besides, in the scenario of live migration, the guest performance benefits from the feature as well. The feature may also be used for other purposes; for example, it is used by virtio-fs to relay faults from the host to the guest. The feature was introduced on x86 initially, around 2010, and virtio-fs uses the asynchronous page fault to relay faults from the host to the guest on x86. The feature is also supported on s390, which is one of IBM's architectures. But unfortunately it's not available on ARM64 yet, so we need to do something to support the feature for ARM64.

About the general requirements: there are two paths, the data path and the control path. The data path is driven by two notifications, which are page-not-present and page-ready. The page-not-present notification is sent from the host to the guest before the stage-2 page fault is to be resolved. When the guest receives the page-not-present notification, the guest starts to reschedule processes other than the faulting process for execution. In the meanwhile, the stage-2 page fault is being resolved on the host side. At a later point, when the stage-2 page fault has been resolved successfully on the host side, another notification, which is page-ready, is sent from the host to the guest. When the guest receives the second notification, it reschedules the previously faulting process for execution. There is also a data block in the data path, which is shared between the host and the guest. The shared data block is updated to distinguish the notifications. It is also used to identify the specific asynchronous page fault by a unique token. In the control path, the control data block is used to help finish the configuration and the migration.

These are the general requirements. So what are the particular requirements to support the feature on ARM64? I would like to compare the situations we have on x86 and on ARM64. First of all, on x86, the page fault exception (vector 14) is used to deliver the page-not-present notification. But the same mechanism is not available on ARM64, because we have very limited space in the ESR_EL1 system register, which is used to tell the root cause of the page fault. So SDEI, which is the Software Delegated Exception Interface, is leveraged to deliver the page-not-present notification on ARM64. Apart from that, an interrupt is still used to deliver the page-ready notification, which is quite similar to what we are doing on x86. The shared data block is updated on ARM64 to distinguish the notifications, which is exactly the same as on x86.
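To make the data path more concrete, here is a minimal sketch of what such a per-vCPU shared data block could look like. The field names and layout are my own assumptions, loosely modeled on the x86 interface; the actual ARM64 layout is defined by the patches under review.

```c
/*
 * Illustrative sketch of a per-vCPU shared data block for asynchronous
 * page faults. Names and layout are assumptions, not the real ARM64 uAPI.
 */
#include <stdint.h>

#define ASYNC_PF_NOT_PRESENT  (1U << 0)  /* page-not-present notification pending */
#define ASYNC_PF_READY        (1U << 1)  /* page-ready notification pending */

struct async_pf_shared_block {
	uint32_t reason;   /* distinguishes which notification is being delivered */
	uint32_t token;    /* identifies the specific asynchronous page fault */
	uint32_t enabled;  /* set once the guest has enabled the feature */
	uint8_t  pad[52];  /* keep the block cache-line sized */
};
```

The idea is that the host fills in the reason and the token when it injects a notification, and the guest's handler later uses the same token to match the page-ready notification with the process it put to sleep earlier.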
On ARM64, the control data block is accessed through SMCCC, the SMC Calling Convention, to configure the asynchronous page fault; on x86, MSRs are used for the same purpose. Besides, the control data block is also accessed by ioctl commands from user space in order to support migration on ARM64. On x86 we don't need this functionality, because the MSRs can be migrated naturally.

So let's take a look at how the asynchronous page fault works on ARM64. At the beginning, when QEMU starts, QEMU uses the ioctl command to configure the asynchronous page fault (step one). Two pieces of information need to be configured: the SDEI event number and the interrupt number. After that, the guest boots up, and the guest starts to use the SMCCC interface to enable the asynchronous page fault in step two. At a later point, one of the running processes in the guest triggers a stage-2 page fault and the guest traps to the host. At this time, the host sends the first notification, which is page-not-present, to the guest in step four. When the guest receives the page-not-present notification, the guest acknowledges the notification and starts to reschedule processes other than the faulting process for execution in step five. At a later point, when the stage-2 page fault is resolved successfully on the host in step six, another notification, page-ready, is sent to the guest in step seven. When the guest receives the second notification, the guest acknowledges it by rescheduling the previously faulting process for execution. At this point the asynchronous page fault is complete. Note that the control data block is updated when the asynchronous page fault is configured, while the shared data block is updated when the page-not-present and page-ready notifications are delivered and acknowledged.

We also need to support migration with asynchronous page fault. The migration is quite simple: use the ioctl command to retrieve the asynchronous page fault state from the source VM and restore it on the destination VM, which means we just need to migrate the control data block. The shared data block doesn't need to be migrated, because all pending asynchronous page faults are canceled before the migration starts.

As we mentioned before, SDEI is leveraged to deliver the page-not-present notification in order to support asynchronous page fault, so let me explain what SDEI is. SDEI is the abbreviation of Software Delegated Exception Interface, defined by the ARM DEN0054A specification; the specification can be downloaded from the link. Generally, it provides a mechanism for registering and servicing system events from the hypervisor. The interface is offered by the hypervisor to the guest OS, and the service is delivered as SDEI events. An SDEI event is identified by a unique event number. An SDEI event is delivered to the guest immediately, regardless of the guest's state; it's not maskable by disabling local interrupts (local_irq_disable()) or similar functions. In this regard, it's quite similar to the x86 NMI. Apart from that, SDEI events are classified into two types: shared and private events. A shared event is owned by multiple PEs and delivered to one of them, while a private event is only visible to and owned by one PE. For asynchronous page fault, a private SDEI event and an interrupt are used to deliver the page-not-present and the page-ready notifications respectively.
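As a rough illustration of the guest side, the Linux kernel already provides an SDEI client API (include/linux/arm_sdei.h) that wraps these calls. A guest driver could register and enable a private event for the page-not-present notification along these lines; the event number and the handler body below are hypothetical placeholders, since the real event number is configured by the host and discovered through the control data block.

```c
#include <linux/types.h>
#include <linux/arm_sdei.h>
#include <linux/printk.h>

/* Hypothetical private event number; in reality it is configured by QEMU
 * via ioctl and discovered by the guest through the control data block. */
#define APF_SDEI_EVENT_NUM	0x40200000

/* Invoked in an NMI-like context when the page-not-present event fires. */
static int apf_sdei_handler(u32 event, struct pt_regs *regs, void *arg)
{
	/* Here the guest would read the shared data block, record the token
	 * and ask the scheduler to run another task instead of the faulting
	 * one; the rescheduling itself has to be deferred out of this context. */
	pr_debug("async PF: page-not-present, event %u\n", event);
	return 0;
}

static int apf_sdei_init(void)
{
	int ret;

	ret = sdei_event_register(APF_SDEI_EVENT_NUM, apf_sdei_handler, NULL);
	if (ret)
		return ret;

	return sdei_event_enable(APF_SDEI_EVENT_NUM);
}
```

Because the SDEI handler behaves like an NMI handler, it should only do the minimum bookkeeping and leave the heavy lifting to normal process context.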
SDEI itself works on top of SMCCC, so I would like to explain what SMCCC is. SMCCC is an abbreviation of the SMC Calling Convention, defined by ARM DEN0028; that specification can be downloaded from the link as well. It basically defines a common calling mechanism to be used with the Secure Monitor Call (SMC) or Hypervisor Call (HVC) instructions. The HVC instruction is used to generate an exception which is handled by a hypervisor running at Exception Level 2. The arguments and the return values are passed in the general purpose registers X0 to X17. The service is identified by the argument carried in the general purpose register X0. The value carried in X0 is divided into several fields; one of them, called the function ID, identifies which function the guest wants the hypervisor to provide. SDEI falls into the category of SMC standard service calls.

So let's take a look at how SDEI works. The diagram on the left side shows the system-level behavior. When the guest boots up, or when the driver is going to be loaded in the guest, the guest retrieves the version from the host to check whether the version is valid or not. After that, the private and shared reset commands are issued to the host so that previously pending events can be reset or cleared. Then the guest issues the PE unmask command to the host in order to receive SDEI events from the host on the calling vCPU (PE). Eventually, when the guest needs to shut down, the guest issues the PE mask command to the host; the PE mask command stops the guest from receiving any further SDEI events from the host. So this is the system-level behavior.

For one particular SDEI event, as we can see from the diagram on the right side, the guest needs to issue "get info" to retrieve the specific information about the SDEI event, and then to register and enable the event. After that, the guest is able to receive the SDEI event from the host. When the guest doesn't need the event anymore, it issues the disable and unregister commands to the host.

So how is one particular SDEI event delivered? First of all, when the host receives the event, in step one, the host needs to save the calling context, including the general purpose registers X0 to X17, the PC and the PSTATE. Then the event is delivered to the guest, and the event handler, which was provided when the SDEI event was registered, is invoked. When the event handler is about to complete, one of two commands, either COMPLETE or COMPLETE_AND_RESUME, is issued to the host to tell the host that the SDEI event has been handled successfully. There is one difference between the two commands: COMPLETE_AND_RESUME schedules the pending interrupts immediately, while COMPLETE does not. So that's how SDEI works on ARM64, and that's how asynchronous page fault is supported on ARM64.

Now about the performance. The first scenario is the performance in a heavy-swapping environment. I have a test program written by myself; the test program writes to all the available memory, and in the meanwhile there may be one calculation thread running or not (see the sketch below). The VM configuration always has one vCPU and 1GB of memory, and the QEMU process has been put into a cgroup (v2) where a memory limitation of 512MB is enforced.
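Here is a minimal sketch of the kind of test program described above; it is an assumption about the methodology rather than the exact program used for the benchmarks. It touches more memory than the cgroup allows, which forces stage-2 faults and host-side swapping, while the optional thread does pure calculation, so the calculation count shows how much CPU time the guest can recover while faults are outstanding.

```c
/* Sketch of a swap-pressure benchmark: one writer touching more memory
 * than the cgroup limit, one optional pure-calculation thread. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MEM_SIZE   (768UL << 20)   /* more than the 512MB cgroup limit */
#define PAGE_SIZE  4096UL

static volatile int writer_done;
static volatile uint64_t calc_count;

static void *calc_thread(void *arg)
{
	/* Pure CPU work: count iterations while the writer is running. */
	while (!writer_done)
		calc_count++;
	return NULL;
}

int main(void)
{
	char *buf = malloc(MEM_SIZE);
	pthread_t tid;

	if (!buf)
		return 1;

	pthread_create(&tid, NULL, calc_thread, NULL);

	/* Touch every page so the host has to resolve stage-2 faults and swap. */
	for (unsigned long off = 0; off < MEM_SIZE; off += PAGE_SIZE)
		buf[off] = (char)off;

	writer_done = 1;
	pthread_join(tid, NULL);
	printf("calculation iterations: %llu\n", (unsigned long long)calc_count);
	free(buf);
	return 0;
}
```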
In the first scenario, shown in the top half of the benchmarks, the calculation thread is not running, and about 5% more time is needed to finish the job because of the overhead introduced by the asynchronous page fault: as we mentioned before, the asynchronous page fault needs to deliver two notifications, and the delivery of those two notifications incurs some overhead. In the bottom half of the benchmarks, the calculation thread is running, and you can see about 40% less time is needed to finish the job. So we can say the asynchronous page fault improves the guest performance in terms of parallelism.

Another scenario where we need to observe the performance is live migration. We still have the same configuration as before, one vCPU and 1GB of memory, but this time we don't enforce any memory limitation on the QEMU process through the cgroup. On the top side of the benchmarks, the calculation thread is not running, and about 3% less time is needed to finish the job, because the interactivity improved by the asynchronous page fault helps to decrease the page dirty rate and the number of post-copy requests. In the next slide I will provide more data about the page dirty rate and post-copy requests. The bottom part of the benchmarks is the scenario where the calculation thread is running. The time used to finish the job is almost the same, but more calculation capacity is offered when the asynchronous page fault is enabled; in terms of the calculation capacity improvement, it is about 41% to 68%.

Here we have more data about the performance in the scenario of live migration. In the left two columns, the calculation thread is not running. You can see that when the asynchronous page fault is enabled, the time needed to finish the job drops from 9.1 seconds to 8.8 seconds. In the right two columns, the calculation thread is running, and the time needed to finish the job is almost the same; but when the asynchronous page fault is enabled, more calculation capacity is provided: the calculation count increases from 1,684 million to 1,781 million. So the asynchronous page fault offers more calculation capacity during the live migration.

So eventually we come to the conclusions. Asynchronous page fault is significantly beneficial to the guest's parallelism, or interactivity: according to the benchmarks we had, there is a 40% improvement of parallelism in the heavy-swapping scenario. It is also beneficial to the guest's performance in the period of post-copy live migration: about 41% to 68% more calculation capacity is offered by the asynchronous page fault.

The feature is currently under development. All the code has been posted to the mailing list; you can find it from the links below. All the code has also been uploaded to GitHub, so you can check it out from GitHub as well. I really hope someone from the community can take a look at the code and start the review, so that the feature can be merged in time.

Thank you very much. This is today's session. If you have any more questions, feel free to contact me through email; my email can be found on the first slide. Thank you.