Hello, my name is Claudio Imbrenda. I am one of the maintainers of KVM on s390 and of kvm-unit-tests on s390, and I'm here to talk about the challenges of asynchronous teardown. First I will explain what it is, why it's important, and how we want to solve it. Then I will quickly go over how we solved it, both for the reboot case, which is more s390-specific, and for the shutdown case, with some final remarks at the end.

So what is it that we are trying to solve? What is the problem? The problem, basically, is that when a big process terminates, it takes a long time for its memory to become available again. This is especially true for protected virtualization on s390, because of the way protected virtualization is implemented there. Rebooting is also problematic. For comparison, an AMD protected guest is encrypted in memory, so when the guest is not needed anymore the memory can just be left there, because who cares, it's encrypted. On s390 the memory is not encrypted; there is access control instead, so nobody can access the guest memory. But when the virtual machine dies, there is a long process in which the secure firmware erases each page and changes its security properties, and this adds a lot of overhead to the cleanup. What we want is to have QEMU terminate immediately and leave the teardown to an asynchronous process.

Here I have a little benchmark, done on this laptop, which is not a super powerful server, but it shows that this is a problem even for non-secure guests. I ran the test with different sizes, and cleanup proceeds at roughly 13 gigabytes per second. If we extrapolate, a big 16-terabyte guest would need around 20 minutes to shut down, which might be annoying: you have a big machine, you tell libvirt to shut it down, and then it takes 20 minutes before the machine is gone and you can start it again. That is why we would like something where you press shut down, it shuts down immediately, you can start it again immediately, and something in the background takes care of the cleanup.
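As a rough illustration, here is a minimal sketch of the kind of measurement meant here; it is not the exact benchmark from the talk, just one way to reproduce the effect by timing how long the kernel needs to tear down a large, fully faulted-in anonymous mapping, which approximates the page-by-page work normally done at process exit:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

int main(int argc, char **argv)
{
    size_t gib = argc > 1 ? strtoul(argv[1], NULL, 0) : 8;
    size_t len = gib << 30;
    struct timespec t0, t1;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memset(p, 0x5a, len);            /* fault in every page */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    munmap(p, len);                  /* page-by-page teardown happens here */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%zu GiB unmapped in %.2f s (%.1f GiB/s)\n", gib, s, gib / s);
    return 0;
}
```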
There are some potential issues, and one of them is an issue that we create exactly with this asynchronous teardown: resource allocation. If the previous VM is still being torn down in the background, it is still taking up memory, and if the same VM is started again and uses memory faster than the old one can be cleaned up, then we may end up using too much memory. Which brings us to the second item: how do we interact with the OOM killer, if we interact with it at all? Then there is proper accounting, and the complexity of the implementation.

Let's have a look at the difference between the reboot and the shutdown case, because on s390 rebooting a secure guest means that the secure guest is torn down completely and a new guest is started in the same memory. From KVM's point of view it is the same KVM guest, but from the firmware's point of view it is a different secure guest running in the same memory. In the shutdown case, by the time KVM gets control, the memory is already gone; in the reboot case the memory is still there, so maybe something can be done with it. At the moment this is an s390-specific problem, so I will not spend too much time on it, but it might become relevant for other architectures in the future, if they decide to do things in a similar way.

So, two possible solutions for the reboot case. I implemented both of them, and they both work, but each has pros and cons. The first is very simple: when a secure guest is destroyed, you just start a kernel thread in the background to clean up the memory. Then some work is needed to allow a second guest to run on top of the first one that is still being cleaned up, because the second guest might use memory in places that have not been cleaned up yet; some exceptions that normally would not happen now do happen and need to be handled, but that is not the end of the world. This is very nice because no user space changes are needed, it just works, and there are no common code changes, which is also nice: it's an s390 problem, so it has an s390 solution. But it leads to improper accounting of CPU time, because the cleanup is done in a kernel thread. The cleanup for the VM should be accounted to QEMU itself, or at least to the cgroup QEMU belongs to; this is quite important for accounting purposes.

So next we try user space: we create a new interface. QEMU destroys the guest with this new interface, but the guest is not actually destroyed, it is just set aside; the new guest can already be started on top of it, and then QEMU creates a new thread and does the teardown in the background. Basically, what we did in the kernel thread before we now do from user space; it is still done in kernel context, of course, but in the context of QEMU, so the accounting works correctly. We get proper accounting of CPU time, and we still need no common code changes, but it now requires user space changes. The preferred solution is obvious: we prefer correctness, so the idea is to do the reboot with the user space changes. Some of the patches have already been merged, some are still pending on the mailing lists and should hopefully be merged soon. QEMU changes are needed, luckily small ones, maybe 20-30 lines, but I will send them once the kernel changes are in; libvirt does not need any changes.
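To give an idea of what the user space side could look like, here is a hedged sketch of how QEMU might drive such a deferred-destroy interface, doing the heavy teardown in a background thread. The KVM_PV_ASYNC_CLEANUP_PREPARE and KVM_PV_ASYNC_CLEANUP_PERFORM subcommand names are taken from the patch series under discussion and should be treated as assumptions here, not as a merged API; KVM_S390_PV_COMMAND and struct kvm_pv_cmd are the existing s390 protected virtualization ioctl and argument struct.

```c
#include <pthread.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Background thread: performs the actual teardown of the old secure
 * guest.  It runs in QEMU's context, so the CPU time is accounted to
 * QEMU's cgroup instead of an anonymous kernel thread. */
static void *pv_async_cleanup(void *arg)
{
    int vm_fd = (int)(intptr_t)arg;
    struct kvm_pv_cmd cmd = { .cmd = KVM_PV_ASYNC_CLEANUP_PERFORM };

    ioctl(vm_fd, KVM_S390_PV_COMMAND, &cmd);   /* blocks until done */
    return NULL;
}

/* On reboot: set the dying secure guest aside so the new guest can be
 * created in the same memory immediately, then clean up in background. */
static void pv_reboot_async_teardown(int vm_fd)
{
    struct kvm_pv_cmd prep = { .cmd = KVM_PV_ASYNC_CLEANUP_PREPARE };
    pthread_t tid;

    ioctl(vm_fd, KVM_S390_PV_COMMAND, &prep);
    pthread_create(&tid, NULL, pv_async_cleanup, (void *)(intptr_t)vm_fd);
    pthread_detach(tid);
}
```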
That was the reboot case, which is highly s390-specific. Let's see what we can do for the shutdown, which is a similar problem but more generic. Again we have a kernel thread solution and a user thread solution, plus two more: mmput_async and clone. Let's see what these mean.

Shutdown with a kernel thread: this is quite convoluted. The idea is that when the process is being torn down, the arch hook that unmaps each PTE gets a flag that indicates whether this is a lazy-TLB operation or not, which means: is this the final unmap because the process is dying, or just a normal unmap? So we can detect the final unmap, and instead of just doing a normal unmap we unmap but also do a get_page on the page: basically we pin it in place and put it on a list of pages that need to be cleaned up later. At the end the memory mapping has been torn down, except that the pages are still there; they have not been freed. Then later, when KVM is torn down, we start a new kernel thread, and in this kernel thread we clean up, one by one, all the pages on the list. This is entirely arch-specific: no user space changes are needed, it lives entirely in the kernel, and it does not touch any common code.

On the other hand, there is quite a list of disadvantages. First of all: kernel thread, so improper CPU time accounting for cgroups again. There is a large impact on memory management, because with a huge guest we have a huge amount of pinned memory just sitting there, and this makes interaction with, for example, the OOM killer interesting, because that memory is not even mapped into any process, it is just pinned. The implementation itself is complex, because we have to go into the arch-specific code and do some pretty hacky stuff to put all these pages aside. And again, it is arch-specific, which is also a disadvantage, because we would like a lazy or asynchronous teardown that can work on every architecture; this is useful for other architectures as well.

So let's try to solve some of these problems and go with the user thread. This is somewhat like the reboot case: it is the same as the kernel thread, except that we make user space create a new process (actually we need a process here, not a thread). That process just issues an ioctl, which sleeps, and then everything proceeds as in the kernel thread case: we pin the pages, and when KVM is torn down, instead of starting a kernel thread it just wakes up the process that was sleeping in that ioctl, and that ioctl then proceeds to take the list and do the cleanup. The work is done in the context of that cleaner process, so we have proper accounting, and the cleaner process is unkillable until it is done, which is also what we want. So yes, we have proper accounting of CPU time, so we are happy; but it is still arch-specific, so maybe we are only half happy. It still has the same impact on memory management, which is maybe not something we want; it now does need user space changes, so it is not a free lunch; and there is still the same complex interaction with the OOM killer and the same complex implementation, because this is basically the same implementation as the previous one, except using a user space process instead of a kernel thread. And again, it is arch-specific. Why not make it arch-independent?

So the next attempt was using mmput_async. I discovered that in the kernel there is a function called mmput_async, which is the same as mmput but done asynchronously, and I thought: this is what I need. The idea is that in the core of the kernel, in exit_mm or whatever it is called, where the last mm reference is dropped with mmput, we do mmput_async instead, maybe behind an if with some condition, so that only the processes that have been marked for asynchronous teardown are torn down asynchronously. The infrastructure for it already exists, so it just works; it requires minimal changes in the kernel and it really is a simple implementation: as I said, instead of mmput it becomes "if something, then mmput_async, else mmput". Some arch-specific hooks are needed just to mark which mms we want to tear down asynchronously later, but that is little code. So: a simple implementation, quite architecture-independent, and no user space changes needed; it would just work. But there are two issues. Again we are doing the work in a kernel worker thread, which means we are not accounting CPU time properly, cgroups again, and some people have been quite loud about doing that correctly. Moreover, this means changing the core of the kernel, where mmput is called when a process is torn down, and some of the people in charge of memory management were not exactly happy about that, to put it mildly. They said: you can do it in user space, don't do it here.
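For reference, a minimal sketch of what this rejected variant amounts to in the exit path; mmput() and mmput_async() are real kernel functions, while the mm_wants_async_teardown() marker check is hypothetical, standing in for the arch-specific hook mentioned above:

```c
#include <linux/sched/mm.h>

/* Sketch of the change in the core exit path: processes marked for
 * asynchronous teardown drop their last mm reference via a workqueue
 * (mmput_async) instead of synchronously in the dying task (mmput). */
static void drop_mm_on_exit(struct mm_struct *mm)
{
    if (mm_wants_async_teardown(mm))   /* hypothetical arch-set marker */
        mmput_async(mm);               /* teardown runs in a kernel worker */
    else
        mmput(mm);                     /* normal synchronous teardown */
}
```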
So that's what I did: clone. Before shutdown, ideally at startup, a second cleaner process is cloned using the CLONE_VM flag, which starts a new process sharing the address space of the parent, but without being a thread; it is a separate process, not a thread. When the parent terminates, no memory cleanup is performed, because the memory is still in use by the child. The cleaner process just waits until the parent has completely terminated and then exits itself, and all the teardown falls on the child. So we have proper accounting of CPU time, because it is the child process that takes all the heat for the cleanup. It is actually quite a simple implementation, around 200 lines of code including comments; it is completely architecture-independent; and it is completely user space, no kernel changes needed. The disadvantages: we need user space changes, well, yes; and the cleaner process is killable. If you send a SIGKILL to the cleaner process, it dies, and if you kill it before the parent dies, then there is no more asynchronous teardown. That is the only real disadvantage of this solution, I would say. So of course that is the one all the effort has been concentrated on. There is still some discussion ongoing, but it seems the clone solution is the best one. No kernel changes are needed, and, interestingly, libvirt will not have any issues or complaints; it just works. The patches are out and have been discussed; I don't know when they will be merged, hopefully soon.

So this is the shutdown solution. I will now go into some more detail about what happens in this cleanup thread, actually process. First of all, it is an opt-in feature: you need a new command line option for QEMU so that, for backwards compatibility, this is not done by default. The new process is created with clone and CLONE_VM, and the cleaner process calls prctl with PR_SET_PDEATHSIG so that when the parent process, in this case QEMU, dies, the child receives SIGHUP; there is a signal handler for SIGHUP. Importantly, the cleaner process also needs to close all file descriptors, because otherwise libvirt is not happy. If available, close_range is used; if not, we open /proc/self/fd and close all the file descriptors one by one. close_range is better because I have been told that container people sometimes do strange things and /proc is not always mounted, but if close_range is not available, that is the next best thing. This matters because close_range was only introduced in kernel 5.something, and I know there are some still-supported distros, like RHEL, that are still on 4.something. So the cleaner process just waits for the signal, and only when the parent process has completely terminated can the cleaner process exit. Complete termination means that QEMU's PID and the parent PID of the cleaner process are no longer the same: before cloning, QEMU writes its PID somewhere in memory, which is of course readable by the cleaner process because they share that memory, and when QEMU's PID is no longer the parent PID of the cleaner process, it means the parent is completely gone and we can exit.
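Putting those pieces together, here is a hedged, self-contained sketch of such a cleaner process, assuming Linux-specific clone(2), close_range(2) (glibc 2.34+), and prctl(PR_SET_PDEATHSIG); the real QEMU code differs in details such as the /proc/self/fd fallback, which is omitted here:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/prctl.h>
#include <unistd.h>

static pid_t qemu_pid;   /* shared with the child thanks to CLONE_VM */

static void hup_handler(int sig) { (void)sig; }

static int cleaner_fn(void *arg)
{
    sigset_t hup, old;

    (void)arg;
    sigemptyset(&hup);
    sigaddset(&hup, SIGHUP);
    sigprocmask(SIG_BLOCK, &hup, &old);
    signal(SIGHUP, hup_handler);
    prctl(PR_SET_PDEATHSIG, SIGHUP);   /* get SIGHUP when the parent dies */
    close_range(0, ~0U, 0);            /* drop inherited fds for libvirt */

    /* Wait until the parent has completely terminated: once getppid()
     * no longer returns the saved QEMU PID, the parent is really gone. */
    while (getppid() == qemu_pid)
        sigsuspend(&old);              /* atomically wait for SIGHUP */

    /* Exiting drops the last user of the shared address space; the
     * whole teardown is charged to this process, in QEMU's cgroup. */
    _exit(0);
}

int start_async_teardown(void)
{
    static char stack[64 * 1024];      /* stack grows down on most arches */

    qemu_pid = getpid();
    /* CLONE_VM without CLONE_THREAD: a separate process that shares the
     * address space and therefore keeps it alive after QEMU exits. */
    return clone(cleaner_fn, stack + sizeof(stack), CLONE_VM, NULL);
}
```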
This is a generalized solution for asynchronous teardown of any process. It works for QEMU, but it could work for anything else: databases, anything. So maybe this should be put in a library for everybody's use; that is my question for you. And after asking you this question, I will ask if you have any questions. Yes?

The remark was: if you kill the whole cgroup, you might end up killing the cleaner process before it is done. Yes, that's true. But if you are killing the whole cgroup and then waiting for the whole cgroup to be gone, this would not help anyway, because the whole point is that the cleanup process runs in QEMU's cgroup; if you want to kill the whole cgroup and then wait until the cgroup is gone, then yes.

The next question was why we care about cleanup accounting in a cgroup at all, when the process is terminating anyway. The point is that when the process is terminating, the process is basically dead and is just being cleaned up; this cleanup still happens in the context of that process and is accounted to the cgroup of that process, and you might want it accounted properly. Yes, that's the current behavior, and that's how it should stay.

The next question was why the kernel takes so long to free the memory. The answer is that it is a lot of memory and you need to go page by page: unmap, put_page, now the page is free, maybe you need to zero it out, then put it back in the free pool, and so on. It is some amount of processing per page, and it needs to be done. For a small process of a few gigabytes it is fine, but when you get into the terabytes of memory, that is a lot, just by itself. And on s390, for protected guests, it is even worse because, as I said, a lot more cleanup per page is needed, done by the firmware and the hardware, which adds quite some overhead, so it is even slower. That was my reason for doing this, obviously, but it is also useful for the generic case: who wants to wait 20 minutes for a VM to terminate?

Was I measuring the pathological case with 4k pages? To be honest, I don't know. I just started a process on a normal, modern Linux distro, allocated 4, 8, 20 gigabytes of RAM, used all of it, freed it, and looked at how much time it took on this laptop. For protected virtualization on s390, for example, you can only have 4k pages at the moment, so that was my main use case, to be honest. Although, again, even if it is not 20 minutes, maybe it is 10 minutes; still, why wait?

Then there was a comment that with x86 page tables and 4k pages it is hilarious how badly you can make the kernel behave; if you bump up to even 2 megabytes, that is a factor of 512, and with 1 gigabyte pages the exact math escaped the commenter, but it is a lot, and it is astronomically different: it goes from a complete bottleneck on everything to a complete non-issue.

Okay, thank you.