Hello, welcome to our session, which is called "Long Live Asynchronous Page Fault." A few words about us: we are both from Red Hat. My name is Vitaly, and I'm a KVM contributor. I mostly work on PV features in KVM, on Hyper-V emulation, Windows guest support, and nesting. Vivek? Hello, everyone. I am Vivek. I'm a kernel developer; I primarily work at Red Hat in the storage and file system team. Right now I'm working on virtiofs, and I have contributed to overlayfs, kexec, kdump, the blkio cgroup controller, and so on. Back to you, Vitaly.

Yeah, so what are we going to talk about today? Imagine the following situation. You have a guest running on KVM, and a memory access happens in the guest. From the guest's perspective, this memory is available, but on the host, at the same time, it is not present, for example because the page was swapped out. So what's going to happen in this situation? Under normal circumstances, KVM has to block the guest's vCPU and then handle the page fault condition, for example bring the page back from swap. When it's done, it can resume the guest vCPU, so the guest doesn't see that anything happened in this time. But resolving the condition may actually take time, because it involves storage, and at the same time your physical CPU may be idle, because, for example, you have nothing else to run on this physical CPU if you use vCPU pinning for your guests.

So what can be done? KVM uses two ideas. The first idea requires the guest to collaborate: we let the guest know that the page it just tried to access is missing on the host, but we are not blocking it, so it can switch to some other process and do some other work while the page is brought back from swap. The other idea doesn't require the guest to collaborate, and it is called a synthetic halt: it's the same as emulating a HLT instruction which the guest didn't actually execute, but we pretend the guest executed it.
The difference between this and blocking the vCPU is that when we do the synthetic halt, we keep interrupts enabled. So we hope that there is going to be a rescheduling interrupt in the guest, and the guest will automatically switch to some other process. In the meantime, the memory which wasn't available may become available, so when the guest switches back to the task, it can run normally.

Today we are mostly going to talk about the first idea, which is called asynchronous page fault. It was initially implemented in 2010. The main concept is that there are two events: the first is called "page not present" and the other is called "page ready". To deliver these events to the guest, the page fault exception is used. With every page which was not available, a token is associated, and this token gets passed through the CR2 register. There is also a shared memory structure, the so-called asynchronous page fault reason, which indicates that the page fault which has been injected is an asynchronous page fault and not a normal page fault, and which also tells us which event has been delivered: is it a "page not present" event or a "page ready" event.

So how does it all work? Imagine there is a userspace task trying to access some memory which is not available on the host. KVM injects this asynchronous page fault "page not present" event, which is injected as a page fault exception. We get into the page fault handler in the guest kernel, and the guest kernel analyzes the shared memory structure. It sees that the event is actually an asynchronous page fault event and not a normal page fault. It freezes the task which attempted the access, along with the token, and switches to some other task which hopefully won't cause any faults. So what happens when the page becomes available? KVM again injects a page fault event into the guest, and we get into the same page fault handler in the guest kernel.
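The dispatch just described can be sketched in C. This is a toy model, not the real kernel code: the field and constant names loosely follow Linux's `struct kvm_vcpu_pv_apf_data` and `KVM_PV_REASON_*` values, but treat the exact layout as illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of the per-vCPU shared structure KVM writes
 * before injecting an async #PF (names are illustrative). */
#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
#define KVM_PV_REASON_PAGE_READY       2

struct apf_shared {
    uint32_t reason;    /* 0 for a normal #PF, else the APF event */
};

enum pf_kind { PF_REAL = 0, PF_APF_NOT_PRESENT = 1, PF_APF_READY = 2 };

/* Guest #PF handler dispatch: a real page fault leaves reason == 0
 * and CR2 holds a faulting address; for an async PF, CR2 carries
 * the token instead. */
static enum pf_kind handle_page_fault(struct apf_shared *s, uint64_t cr2)
{
    uint32_t reason = s->reason;

    s->reason = 0;                        /* consume the event */

    switch (reason) {
    case KVM_PV_REASON_PAGE_NOT_PRESENT:
        /* cr2 is a token: freeze the current task and schedule away */
        return PF_APF_NOT_PRESENT;
    case KVM_PV_REASON_PAGE_READY:
        /* cr2 is a token: find and wake the previously frozen task */
        return PF_APF_READY;
    default:
        /* ordinary #PF: handle cr2 as a faulting address */
        return PF_REAL;
    }
}
```

Note that the handler must read both the shared `reason` field and CR2 before any other page fault can occur, which is exactly the assumption that turns out to be fragile later in the talk.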
It reads the shared memory, and now it sees that the event which has been delivered is a "page ready" event. It searches for the token in its internal structures, gets the task which was previously blocked, and unblocks it, so this task can now continue.

A couple of extensions were developed on top of this. The first is called "send always" mode, and the idea behind it is that we can deliver asynchronous page fault events not only for userspace memory accesses but also for kernel memory accesses, when the kernel itself is preemptible and can handle this. Another extension is needed when you want to run nested guests: if this asynchronous page fault event happens for an access while your vCPU is actually in guest mode, that is, your guest's vCPU is running a nested guest, we cannot just inject a page fault event, because it would get into your nested guest, which may not be prepared to handle it. So we inject it as a #PF VM exit, and the level-one hypervisor, your guest, should be prepared to handle it; it then does exactly the same as it would do for a #PF exception.

So it all kind of worked for ten years, since 2010, before people actually started looking at it more closely. The main assumption, which was well documented when this was implemented, is that the guest reads the shared memory structure and the CR2 register before any other page fault, even a real page fault, can happen; this was supposed to be guaranteed. But can it actually be guaranteed? A faulty scenario was presented by Andy Lutomirski last year, and the scenario is the following. Imagine we are injecting an asynchronous page fault event, and before we manage to read CR2 and the APF reason from the shared memory structure, an NMI happens. It is known that in the Linux kernel, NMI handlers can cause accesses to userspace memory, and these in their turn can cause real page fault events. So what's going to happen when this happens?
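The race can be made concrete with a tiny simulation. This is a deliberately simplified model, not KVM code: CR2 is a single architectural register, so a real #PF taken from an NMI handler overwrites the async-PF token before the original handler gets to read it.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the CR2-clobbering race. */
static uint64_t cr2;     /* models the single CR2 register */

/* KVM injects an async #PF: the token is placed in CR2. */
static void kvm_inject_async_pf(uint64_t token)
{
    cr2 = token;
}

/* A real page fault (e.g. taken because an NMI handler touched
 * user memory) unconditionally loads CR2 with the fault address. */
static void real_page_fault_from_nmi(uint64_t fault_addr)
{
    cr2 = fault_addr;
}

/* The interrupted async-PF handler finally reads its "token". */
static uint64_t apf_handler_reads_token(void)
{
    return cr2;
}
```

By the time the async-PF handler runs, the token is gone and the event cannot be matched to a frozen task, which is exactly the "nothing good" outcome described next.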
Well, the CR2 register will definitely get clobbered, because the real page fault will override it, and when we get back to the asynchronous page fault handler, we won't be able to handle the asynchronous page fault event. So what's going to happen? Well, nothing good.

So we started working on it. It's currently work in progress, but some things were already done in the last year. First, it was decided that "send always" mode for asynchronous page faults is not really robust enough, and also no existing well-known Linux distribution builds a fully preemptible kernel, so "send always" mode was deprecated. Second, we decided that the main problem is that the same page fault exception which is used for normal page faults is also used for asynchronous page faults. We decided that's not really a good idea, and as a first step, we switched from using the page fault exception to using normal interrupts to deliver "page ready" events, because these events are by design asynchronous. KVM fully switched to this mode in Linux 5.8, and the KVM guest side fully switched in Linux 5.9. So now the old way of delivering "page ready" events is not supported anymore, neither by KVM nor by the Linux guest. We also have plans to do something about "page not present" events, which is to switch them, for example, to the virtualization exception or the machine check exception, but both of them have disadvantages; the virtualization exception, for instance, is only available on Intel. There were also ideas to use an interrupt to deliver "page not present" events, but in this case we must be sure that it gets the highest priority and nothing else will get handled before it.

So, about using an interrupt for delivering asynchronous page fault "page ready" events: how does it work now? Again, a task was previously frozen because it tried to access some memory which was not available on the host, and this memory becomes available.
Instead of injecting a page fault exception, KVM now injects an interrupt, a normal APIC interrupt, into the guest. We eventually get into an interrupt handler in the guest kernel, and the guest kernel does the same thing as before: it searches through its internal structures for the task which was previously frozen with this token, finds it, and unblocks it, so the task can now run. We've also introduced an acknowledgement mechanism by which the guest kernel tells KVM that it has completed processing this asynchronous page fault event, so the next one can get delivered, either immediately or later. So that was it from me, and now Vivek is going to talk more about new usages for asynchronous page faults.

Okay, so: KVM page fault error handling. One question is, of course, we all talked about how page faults happen and how we can deliver them in a synchronous or an asynchronous manner, but what happens when we cannot resolve a page fault? What's the current behavior, and how can it be improved? Before we dive into it, I just want to address where exactly that can happen. I personally ran into these issues when I was running virtiofs with DAX enabled; that's where I started hitting a bunch of problems. Before I go further, I want to quickly mention, in case you have not seen it yet, that virtiofs is fairly new. It's a passthrough file system, something along the lines of what 9p is. It's FUSE-based, and it allows you to take one directory on the host and just export it into the guest. In this diagram, the shared directory, let's say /foo, is being shared; virtiofsd here is the daemon running on the host, which is the file server, and it uses the vhost-user protocol to communicate with the guest. Recently we added DAX support to virtiofs. This is still in a -next tree, and we are hoping that it will be merged in the merge cycle which is probably opening soon. What does this do? Typically, if you're accessing a file,
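The new guest-side "page ready" path can be sketched as a token lookup plus an acknowledgement. This is a simulation with illustrative names; the real kernel uses a hash of wait queues keyed by token, and the ACK is a write to the `MSR_KVM_ASYNC_PF_ACK` MSR.

```c
#include <assert.h>
#include <stdint.h>

#define NWAITERS 8

struct apf_waiter {
    uint32_t token;     /* 0 means "slot empty" in this toy model */
    int      woken;     /* stands in for waking a frozen task */
};

static struct apf_waiter waiters[NWAITERS];
static int last_msr_ack;             /* models the wrmsr-based ACK */

/* "page not present" path: remember the token of the frozen task. */
static void apf_freeze(uint32_t token)
{
    for (int i = 0; i < NWAITERS; i++) {
        if (!waiters[i].token) {
            waiters[i].token = token;
            waiters[i].woken = 0;
            return;
        }
    }
}

/* "page ready" interrupt handler: look up the token, wake the task,
 * then acknowledge so KVM may deliver the next event. */
static int apf_page_ready_irq(uint32_t token)
{
    for (int i = 0; i < NWAITERS; i++) {
        if (waiters[i].token == token) {
            waiters[i].woken = 1;
            waiters[i].token = 0;
            last_msr_ack = 1;        /* wrmsr(MSR_KVM_ASYNC_PF_ACK, 1) */
            return 0;
        }
    }
    return -1;                       /* unknown token */
}
```

The acknowledgement matters because page-ready events are now delivered through a single interrupt slot: KVM must know the previous event was consumed before writing the next one.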
let's say foo.txt, another copy of the contents of foo.txt gets created: whenever somebody is reading the file, we take it from the host and create a copy in the guest page cache, and that's how the guest process accesses the contents. But with DAX, we wanted the guest process to be able to directly map the host page cache and bypass the guest page cache completely. This has two advantages: it reduces our memory footprint, and on top of that it is probably faster; it also allows better sharing down the line, at least because there is a single copy shared between multiple clients. So in this case, the guest sends a request to virtiofsd to map a certain portion of the file, and then QEMU maps that offset and that page of the file into the QEMU address space, which then shows up as physical memory inside the guest, and using DAX we can directly map it into the address space of the guest process.

Now, what happens if a guest process mapped a file, let's say foo.txt, and then somebody else, say on the host, truncates that file, and after that the guest goes and does a load or a store to that particular page? The current behavior depends. This is an error: the guest tried to access a page which is not present anymore, because it was truncated, and it cannot be faulted back in. The question is, what do we do? If the fault is a synchronous page fault, say asynchronous page faults are not enabled, then currently we exit back to user space with an error, say EFAULT. But if asynchronous page faults are enabled, we kind of loop infinitely. This is what happens: when the guest accesses that page, we take a VM exit, and let's say we do a get_user_pages_remote() and it returns an error code, say EFAULT. But currently KVM doesn't pass that error code on, and upon the return it injects a "page ready" event into the guest. The guest thinks the page is ready and retries the instruction, we take a VM exit again, and the same loop continues: it injects
the "page ready" event, and the loop just continues infinitely. So basically, if asynchronous page faults are enabled and the host cannot resolve the page fault because the file got truncated, we just get into an infinite loop. That's the problem I'm facing, and I think it can happen with any resource which is file-backed and which the guest can directly load from and store to; for example, a file-backed NVDIMM device will face a similar issue. So we need to do something about it.

I think there are at least two problems to fix. The first is that we need to make the behavior uniform between synchronous page faults and asynchronous page faults. I posted a patch for that, so that even for asynchronous page faults we exit to user space with EFAULT if the page fault cannot be resolved. This patch has not been merged yet; there are still some concerns on it, and I need to address the points raised by other developers. The other, bigger problem is: can we do better? Instead of exiting to user space when KVM cannot resolve the fault, can we let the guest handle it? We just need to inject an error back into the guest saying this cannot be resolved, and then the guest can take it further. For example, if it was a guest process which tried to access the page, send a SIGBUS to that process instead of killing the whole guest; and if it was the guest kernel which was trying to access the page, we can do some exception table fixup magic to come out of that loop and return to user space with an appropriate error. That's the idea, and it would be nice if we could achieve it, but how to do it? The path is not clear. A few months back we had a discussion upstream, and various people have different ideas. One of the proposals, which I tried in a proof-of-concept patch, is that when the error happens, when we inject the "page ready" event into the guest, we also send the error back to the guest, so the guest knows whether the page is really ready or whether it's an error. Then
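The infinite-retry behavior and the proposed fix can be contrasted in a small host-side sketch. All names here are illustrative stand-ins, not actual KVM internals, and the old-behavior loop is bounded only so the simulation terminates.

```c
#include <assert.h>
#include <stdint.h>

#define EFAULT 14

/* Pretend the backing page was truncated away: resolution always
 * fails, the way get_user_pages_remote() would for this access. */
static int resolve_fault(uint64_t gpa)
{
    (void)gpa;
    return -EFAULT;
}

/* Old behavior: the error code is dropped, "page ready" is injected,
 * the guest retries, and the cycle repeats. Returns the number of
 * futile retries performed before we artificially give up. */
static int old_behavior(uint64_t gpa, int max_iters)
{
    int iters = 0;

    while (iters < max_iters) {
        (void)resolve_fault(gpa);   /* error silently discarded */
        iters++;                    /* inject "page ready"; guest retries */
    }
    return iters;                   /* never converges on its own */
}

/* Proposed behavior: propagate the error, so the VMM exits to
 * userspace with -EFAULT instead of looping. */
static int new_behavior(uint64_t gpa)
{
    int ret = resolve_fault(gpa);

    if (ret < 0)
        return ret;
    return 0;
}
```

The point of the first patch mentioned above is exactly the difference between these two functions: surfacing the error instead of swallowing it.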
the guest can do the error handling. But this did not work, for multiple reasons. One is, as Vitaly mentioned, that we have now blocked asynchronous page faults when the guest is running at privilege level 0. So if the guest kernel accesses a memory location, we will not even initiate the asynchronous page fault protocol, and that means this method is not going to work for that particular use case. The other problem is that this patch is too tightly coupled with the asynchronous page fault mechanism, while the problem exists even outside APF, so we should design something more generic which works for synchronous page faults as well as asynchronous page faults. So this patch set will never be merged; it's just a proof-of-concept thing.

Another idea, I think from Andy Lutomirski, was that maybe it is a good idea to use machine check exceptions: they do something similar already in the case of poisoned memory, so why not use that path and inject an MCE? I think there are a couple of issues, or at least concerns, there as well. Typically MCEs are raised synchronously only in the case of loads; stores are often silent failures, and only later, when the hardware detects asynchronously that something has gone wrong with the memory, is an MCE raised. So typically the store path is asynchronous and the load path is synchronous, and we want this to be synchronous in the case of both loads and stores. Maybe to solve this, because we are controlling everything, the hypervisor could inject the MCE synchronously on both loads and stores, even if real hardware doesn't do that. But I think the rest of the MCE code, the various copy_to_user() and copy_from_user() helpers and all the code built on them, has been written with the assumption that one can get an exception on a load but not necessarily on a store. So that particular code path and all those helpers would require changes as well. Another concern was that MCE handling is already pretty complicated, and there
was resistance from a few people who don't want any special-casing for this scenario in the MCE handler. Nobody has posted a patch, so this still remains an option; if somebody posts a good patch, hopefully it will be more acceptable.

The next one, as Vitaly referred to briefly, is the virtualization exception. There was another proposal: the platform has this exception, so can we make use of it to report the errors back? And not just errors: once we have a good race-free way to report things, we could even deliver the "page not present" events this way. But the concern here is that it is only available on Intel. What about the other architectures, AMD and ARM64? Do we emulate it there, or how do we find an equivalent replacement? So this is an idea; at this point in time we haven't posted patches yet.

And the last one: there was another idea, can we handle the problem at the virtiofs level itself and not even get into this situation? Say we implement some sort of file leases: when one client decides to truncate the file, it takes a write lease on the file, and that should ensure that all other clients which have the file mapped unmap it before the truncation happens; then we go ahead with the truncation and make sure we don't run into this error situation. Maybe it would work. First of all, we don't have the concept of file leasing in FUSE, so somebody will have to introduce it and work out the details, and hopefully we can make use of it. But there are still some issues. One of the most prominent is that even if we do it, it will continue to be racy, because, for example, if somebody truncates the file directly on the host, it doesn't go through a virtiofs client; that means the truncation is not going through virtiofs, so that particular user will not take the leases and will not participate in the protocol, and that will be racy. In the past we discussed that the virtiofs daemon could watch for such events, get a notification, and send the notification on to
the guest, saying a truncation just happened, so unmap anything you have in this file. But that is still racy, because in the meantime another user can access the file while you are being notified. So that's one concern. And not only that: it can probably be a stopgap solution, but virtiofs is only one use case which can run into this issue. There are others; an NVDIMM is another example: if I truncate the file which is backing the NVDIMM, we will run into the same thing. So to me it would be nicer to solve this in KVM instead, because that would address the whole variety of use cases.

So which proposal is most likely? I don't know. My hunch, at this point in time, is that I'm more inclined toward using the machine check exception, and if it works reasonably well, that might be the way to go; I don't know yet, but that seems to be my preference. So that's it from me, everybody. Thank you for listening, and if you have any questions, I'm glad to take them. Thank you.