All right, welcome everybody. My name is Andrea. I work in the Canonical kernel team, mostly focusing on cloud kernels and the kernel in general. In this talk I want to show you something I've been working on recently called opportunistic memory reclaim, and I'm going to show you how we can use this feature to improve hibernation performance.

First of all, a quick introduction about hibernation. What is hibernation? Hibernation is a feature that allows you to power down your system and restore it to its previous state when the system is powered back on. This is actually quite an old feature provided by the Linux kernel. In fact, it was originally designed as a feature for laptops, and it used to be a really hot topic back around 2004, when we didn't have the modern suspend-to-RAM technologies that we have nowadays. At that time, the idea was to implement a mechanism in the kernel that would allow dumping the entire content of your memory to persistent storage, so that you could power down your system, save your battery, and when the system was powered back on, the memory could be restored by reading the data back from the persistent storage. In this way, you could resume all your running applications, all your active sessions, and so on.

Now, as I was saying, this is quite an old feature, but is it still relevant to talk about hibernation nowadays? Interestingly enough, there are still some modern use cases for hibernation, and one of them is cloud computing. In fact, in late 2018 Amazon announced support for hibernating EC2 instances in their cloud. The question is: why would a cloud provider want to invest development time and effort to provide such a feature? It's interesting to notice that there are some useful use cases for it. One of them is the ability to pause your workload nicely when your budget is over. For example, in the Amazon cloud you can create instances that are called spot instances; you can see them as low-priority instances that run on temporarily unused resources. They are cheap, but if a high-priority instance comes in and reclaims the resources it paid for, we need to find a way to migrate the low-priority instances somewhere else. However, if we can't find enough resources to migrate them, the only alternative is to stop these instances, and in this case hibernation provides a pretty nice way to stop them, way better than simply shutting them down.

Another use case is the ability to pre-deploy your instances. Basically, you can deploy an instance, install all the required packages, start all the services that you want, and at that point, instead of adding the instance to production immediately, you can hibernate it and keep it in a frozen state. When it's really needed, you can just resume it from hibernation and immediately add it to production. This works because resuming from hibernation can usually be faster than a cold boot.

Now, in all these examples hibernation can be a winning solution, but performance is important. In fact, hibernation is a winning solution only if hibernating is fast, in the first case, and if resuming is fast, in the second case. Especially in the first case, I would like to highlight that it's actually a regular scheduling problem, where we have low-priority tasks that are interrupted by higher-priority tasks: it's the same problem.
So, in this scenario, hibernation is really a way to implement a context switch between a low-priority instance and a high-priority instance. And if this context switch is not fast enough, then clearly this is not a winning solution. So, how can we make hibernation faster? Well, first of all, let's try to understand how hibernation works; I've tried to summarize everything in this diagram. On the left we can see the main memory, and all the light blue boxes represent the chunks of allocated memory. When a hibernation event happens, the kernel needs to pack all these light blue boxes together and create what is called a hibernation image. Then the hibernation image is written to the swap device, and at this point the system can simply be powered off. On resume, we boot another instance of the kernel, which checks whether there's a valid hibernation signature in the swap device. If a valid hibernation signature is found, the kernel copies the hibernation image from the swap device back to memory, and all the blocks are put back at the right locations. At that point, the kernel performs a special jump operation into the previous instance of the kernel, and the system is resumed.

Now, I want to highlight a couple of things in this diagram. One is that in order to generate the hibernation image, the kernel needs to allocate memory. If there's not enough memory, we can either abort hibernation and resume normal execution, or we can try to free up some allocated memory. Obviously, it's really easy to free memory that already has a copy in the corresponding backing store, for instance clean page cache pages: this is memory that can be reclaimed immediately without incurring any additional I/O. Or we can decide, for example, to swap out some anonymous memory or to flush some dirty page cache pages. But in those last two cases, in order to free up memory, we need to do I/O. So yes, we can definitely drop memory, but it will cause more I/O. Another thing I want to highlight is that on the right side we don't have all the light blue boxes that we have on the left side. This is to show that not all the memory needs to be saved into the hibernation image: on resume, some caches will not be present, or some memory will not be present because it has been swapped out.

Now that we understand better how hibernation works, we can see what we can do to speed up performance. First of all, let's try to identify the main bottleneck of hibernation. Usually the main bottleneck is the I/O required to write the hibernation image to the swap device, and on resume the bottleneck is, again, the I/O required to load the hibernation image from the swap device back to memory. So if we are able to reduce this I/O, we can probably achieve a faster hibernation and resume. Now, how can we reduce the I/O? One way is compression: the hibernation image can be reduced by compressing it, so we will probably do less I/O and hibernation will be faster. Another way, as I was mentioning, is to drop some memory, again in order to reduce the size of the hibernation image. We can drop some clean page cache pages, for example; we can just drop caches that don't require additional I/O, and this reduces the hibernation image and speeds up hibernation time.
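Just to make these mechanics concrete, here is a minimal sketch (mine, not from the talk) of how dropping clean caches and triggering hibernation look from user space, using the standard /proc/sys/vm/drop_caches and /sys/power/state interfaces; it assumes swap and the resume device are already configured and that the script runs as root.

    # Minimal sketch: free clean caches (no extra I/O) and hibernate.
    # Uses only standard kernel interfaces; run as root with swap configured.

    def drop_clean_caches():
        # "1" drops clean page cache pages; dirty pages and anonymous
        # memory are not touched, so no additional I/O is generated.
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("1")

    def hibernate():
        # Writing "disk" asks the kernel to create the hibernation image,
        # write it to swap and power the machine off.
        with open("/sys/power/state", "w") as f:
            f.write("disk")

    if __name__ == "__main__":
        drop_clean_caches()
        hibernate()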
But if we need to drop some anonymous memory, for example, or to flush some dirty page cache pages, we are still generating I/O. So this doesn't actually improve hibernation performance: the hibernation image will be smaller, but in order to get a smaller hibernation image we still need to do I/O.

So here comes the idea of opportunistic memory reclaim. The generic idea is to provide user space with an interface that allows it to trigger memory reclaim in the kernel. We can use this interface to trigger memory reclaim in advance and prepare the system to be more responsive when needed. In the particular hibernation scenario, we can use opportunistic memory reclaim to drop or reclaim memory ahead of time, for example using some idle cycles in the system: if the system is idle or mostly idle, we can opportunistically trigger a memory reclaim and prepare the system to be hibernated. Then, if a hibernation event happens, most of the memory has already been swapped out, flushed or reclaimed, so hibernation will probably be faster. That is the particular usage of opportunistic memory reclaim that I have experimented with.

And how does it work? This is how I implemented it. There's a small Python script that periodically checks the idle percentage of the CPUs in the system. If the idle percentage is greater than a certain threshold for a certain amount of time, it triggers the opportunistic memory reclaim via the memory swap reclaim interface, which obviously requires a specific kernel patch, and the kernel starts to reclaim memory using the idle cycles of the system (a rough sketch of such a daemon is shown below). So that's the mechanism: detect when the system has been mostly idle for a certain amount of time, trigger artificial memory pressure so that the system starts to reclaim memory, and at that point, if hibernation happens, most of the memory is already saved to swap and we can hibernate faster.

The interface that I've been using is the cgroup memory controller. Actually, in the first patch that I posted to the kernel mailing list I was using a file under sysfs, in /sys/power, because that was a very hibernation-specific interface, since I initially used opportunistic memory reclaim only for hibernation. But then I realized this feature can be more generic and there can be benefits for other scenarios, which is why in the next versions I moved to an interface implemented in the cgroup memory controller. The main reason to use the cgroup memory controller is to be able to apply a fine-grained memory reclaim policy. For example, we can create multiple cgroups. This is just an example where I create two cgroups, one called foreground and the other called background. The idea is to move the latency-sensitive applications into the foreground cgroup and the latency-insensitive applications into the background cgroup. Then, when I want to trigger memory reclaim, I can reclaim memory only from the background cgroup, so I won't affect the performance of the foreground cgroup. This can be useful, for example, on a mobile device. Let's say you're playing a video game on a smartphone: you may want to move the tasks of the video game into the foreground cgroup and keep all the other tasks in the background cgroups, like the task that is periodically checking my email or the task that is periodically checking whether I have new messages on Facebook, or something like that.
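As a rough illustration of the daemon just described, here is a minimal sketch (mine, not the actual script from the talk) that samples the CPU idle percentage from /proc/stat and, when the system stays mostly idle, pokes the reclaim interface. The file name memory.swap.reclaim is a placeholder for the cgroup file added by the opportunistic memory reclaim patch, and the threshold, sample interval and number of idle periods are assumptions.

    # Minimal sketch of an idle-monitoring reclaim daemon (illustrative only).
    # RECLAIM_FILE is a placeholder for the cgroup file added by the
    # opportunistic memory reclaim patch; the real name/path may differ.
    import time

    RECLAIM_FILE = "/sys/fs/cgroup/memory.swap.reclaim"  # hypothetical
    IDLE_THRESHOLD = 90.0   # percent (assumption)
    IDLE_PERIODS = 6        # consecutive idle samples before reclaiming (assumption)
    SAMPLE_INTERVAL = 10.0  # seconds (assumption)

    def cpu_times():
        # First line of /proc/stat holds the aggregate CPU counters.
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]   # idle + iowait
        return idle, sum(fields)

    def idle_percent(prev, cur):
        d_idle = cur[0] - prev[0]
        d_total = cur[1] - prev[1]
        return 100.0 * d_idle / d_total if d_total else 100.0

    def trigger_reclaim():
        with open(RECLAIM_FILE, "w") as f:
            f.write("1")  # ask the kernel to start reclaiming memory now

    def main():
        prev = cpu_times()
        idle_count = 0
        while True:
            time.sleep(SAMPLE_INTERVAL)
            cur = cpu_times()
            if idle_percent(prev, cur) >= IDLE_THRESHOLD:
                idle_count += 1
            else:
                idle_count = 0
            if idle_count >= IDLE_PERIODS:
                trigger_reclaim()
                idle_count = 0
            prev = cur

    if __name__ == "__main__":
        main()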
And if you want to use opportunistic memory reclaim in this context, you can have the reclaiming task reclaim memory only from the background cgroups, so your video game won't be affected: its performance won't change and you won't notice any additional latency.

This is the test case that I've been using to show the benefits of opportunistic memory reclaim. I created a VM with 8 GB of RAM and an 8 GB swap file, and I explicitly set a maximum I/O bandwidth of 100 MB/s on the disk, because I wanted to show the benefits of this solution when we don't have super fast storage. This actually simulates pretty well what we usually have in a cloud environment: multiple VMs typically run in a shared environment on the same hypervisor, they also share the I/O bandwidth, and in some cases there are explicit I/O limits. The test I've been using is this one: I allocate 85% of memory, I wait for 60 seconds almost in idle, then I trigger hibernation and resume, measuring the time of both. This test case probably simulates pretty well what usually happens on one of those spot instances I mentioned earlier: spot instances, once deployed, typically start a bunch of services, which could be a large JVM application, a web server or any other services that allocate a bunch of memory; then they serve a bunch of requests, but after that they usually sit in an idle condition, and we can use this idle state to reclaim some memory, so that when we need to hibernate these instances, hibernation can be faster.

So, is it really working? Let's see some results. In this column we can see the results using a 5.9 mainline kernel; I repeated the test, so these are average times over 10 runs. This is the 5.9 mainline kernel, and this is the 5.9 mainline kernel with the opportunistic memory reclaim patch plus the small user-space script that monitors the CPU idle percentage and triggers memory reclaim. I ran the test both with image_size at its default value and with image_size=0, and as we can see the results are really promising: in the image_size default case, the mainline kernel took almost 50 seconds to hibernate, while with opportunistic memory reclaim in place it took only 3.4 seconds. So hibernation in this case is more than 10 times faster, and resume is also faster, more than two times faster. That's because in the opportunistic memory reclaim case the hibernation image is way smaller, since the memory has already been reclaimed, swapped out or flushed in advance.
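For reference, here is a minimal sketch (again mine, not the test script used for these numbers) of a stress allocator in the spirit of the "allocate 85% of memory, then idle" step of the test: it reads MemTotal from /proc/meminfo, allocates roughly 85% of it in chunks, touches the pages so they are really backed by anonymous memory, and then sits idle. The chunk size is an assumption.

    # Minimal sketch of a memory stress allocator (illustrative):
    # allocate ~85% of total RAM, touch it, then idle.
    import time

    TARGET_FRACTION = 0.85
    CHUNK = 64 * 1024 * 1024  # allocate in 64 MiB chunks (assumption)

    def mem_total_bytes():
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1]) * 1024  # value is in kB
        raise RuntimeError("MemTotal not found")

    def main():
        target = int(mem_total_bytes() * TARGET_FRACTION)
        chunks = []
        allocated = 0
        while allocated < target:
            size = min(CHUNK, target - allocated)
            buf = bytearray(size)
            for i in range(0, size, 4096):
                buf[i] = 1  # touch every page so it is really allocated
            chunks.append(buf)
            allocated += size
        print(f"allocated {allocated / 2**30:.1f} GiB, now idling")
        while True:
            time.sleep(60)  # sit idle, keeping the memory allocated

    if __name__ == "__main__":
        main()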
I repeated the tests also with image_size=0. Basically, image_size is a sysfs tunable in /sys/power where you can specify how aggressive the kernel should be at reclaiming memory when hibernation needs to be done; using a smaller value means we're asking the kernel to try to make the hibernation image smaller, and zero means making the hibernation image as small as possible. In this case we can see that we pay the price of trying to reduce the size of the hibernation image during hibernation, but then the kernel can really produce a smaller hibernation image, and we see the advantage on resume: the resume time with image_size=0 is reduced with respect to the image_size default on the mainline kernel. The hibernation time, however, is bigger: with the image_size default we needed 50 seconds to hibernate, with image_size=0 we now need 71 seconds. In the opportunistic memory reclaim cases, on the other hand, performance is pretty much the same in both configurations, and that's because we have already optimized the size of the hibernation image using the spare idle cycles of the system before the hibernation event happens.

I've also prepared a live demo to show you better how opportunistic memory reclaim works. All right, so let's switch to a console session. Here we can see a virtual machine running an Ubuntu Groovy distribution, and the kernel I'm using is a 5.9 kernel with the opportunistic memory reclaim patch applied. In the first example I'm not going to use opportunistic memory reclaim, because we want to measure the baseline of the hibernation performance. So the first thing I'm going to do is activate swap using a swap file, and then I'm going to start a memory stress test allocator. We can see that initially I'm not using very much memory, only 150 megabytes are allocated and the swap is not used at all: there are eight gigabytes available and no swap is used. Now I'm going to start a memory allocator that allocates 85 percent of the available memory, and we can see up here that the memory is being allocated. When we reach 85 percent of the memory, in my test case I would normally wait another 60 seconds in idle to simulate the workload of a typical spot instance, but this time, to speed up the demo a little bit, we can just hibernate, because nothing is going to happen in the system at this point: no one is triggering the opportunistic memory reclaim. So if I hibernate now, the system needs to swap out almost 6.5 gigabytes of memory, and after that the hibernation will complete. Now I'm triggering hibernation, and as we can see this task is stopped, meaning that the system is currently unusable, and we need to wait until all this memory is flushed to the swap device; then the system can be stopped. While we're waiting, I'm going to describe what I'm going to do next: the next step will be to activate OMR cpud, the user-space component that monitors the idle percentage of the CPUs and triggers the opportunistic memory reclaim. All right, we can see here on the right side that hibernation has almost completed, and when we see the prompt here it means that the system is fully hibernated. Now I can start the system again and hopefully my sessions will be resumed. We can see the resume taking a little bit of time, because again it's loading the hibernation image from the swap device back into memory.
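(As an aside, while we wait: a minimal sketch of how the image_size tunable discussed a moment ago can be set from user space. The value is in bytes, and writing 0 asks the kernel to make the hibernation image as small as possible; the snippet is illustrative and needs root.)

    # Minimal sketch: tune /sys/power/image_size (value in bytes).
    def set_image_size(size_bytes: int) -> None:
        with open("/sys/power/image_size", "w") as f:
            f.write(str(size_bytes))

    if __name__ == "__main__":
        set_image_size(0)  # 0 = make the hibernation image as small as possible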
Soon we should be able to see the spinner going again, meaning that the system has been fully resumed, and we should also see the memory counter up here being updated. Let's see. All right, the spinner is spinning again and the memory counter is updated, so the system has been fully resumed.

Now I'm going to repeat the same test, and just to make sure I'm starting from the same conditions, I'm actually going to reboot the system. All right, we're back into the system. Up here I'm going to start my watch that checks the available memory; down here I'm going to prepare, this time, OMR cpud, the user-space component that monitors the idle state of the CPUs and triggers memory reclaim if the system is mostly idle; then I'm going to prepare my memory stress test, and down here I'm going to prepare my interactive session to trigger hibernation. First, let's start OMR cpud, and then I'm going to start the memory allocator. As we can see, initially the system was fully idle at 100 percent; now one core has an idle percentage of zero percent, which is the memory stress test allocating all the memory up to 85 percent, and as we can see up here the memory is being allocated. The swap is currently not used, because memory reclaim has not been triggered yet. Now the memory allocator is done, the system has been mostly idle for a certain amount of time, and memory reclaim has been triggered, so we are using the idle time of the system to pre-emptively swap out some memory. If a hibernation event happens now, we don't have to swap out all the memory, because most of it has already been swapped out. We can see that, yes, this task is running, but it's just a spinner that is only doing prints; it's not very CPU intensive, it's actually using a small amount of CPU. If something else comes into the system and either refaults pages or starts using the CPUs, the triggered memory reclaim will just stop. Now we are getting close to having reclaimed all the memory. Remember, before we had to wait a long time to hibernate; now we have already prepared the system in advance, using the idle cycles of the system itself. So if I hibernate now, we can probably count to three and hibernation will be done. Let's see: one, two, three, done. You can see the prompt immediately right here, we have already hibernated, and I can resume. The resume should also be faster than before: as we can see, we have already loaded the hibernation image right here, and soon we should see the spinner spinning again. Exactly.

Another test that I would like to show you is something to highlight the advantage of using a memory cgroup interface. What I'm going to do first is reset the swap: to do that I usually run a swapoff followed by another swapon, which basically cleans up and re-initializes the swap. This time, like I was saying, I want to show you the advantage of using a memory cgroup interface. So I'm still running my memory stress test, but this time I'm also running a latency-sensitive application, a simple time-delta application: again a simple Python script that shows a little spinner and, each time it prints, also prints the delta time between one print and the next.
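Here is roughly what such a time-delta spinner looks like (a minimal sketch, not necessarily the exact script used in the demo): it prints a spinner once per second and reports the measured delta between prints, so any reclaim-induced stall shows up as extra milliseconds.

    # Minimal sketch of the latency-sensitive "time delta" spinner:
    # print once per second and report how long each interval really took.
    import itertools
    import time

    def main():
        prev = time.monotonic()
        for frame in itertools.cycle("|/-\\"):
            time.sleep(1.0)
            now = time.monotonic()
            delta_ms = (now - prev) * 1000.0
            prev = now
            # Anything noticeably above 1000 ms means the task was delayed.
            print(f"\r{frame} delta: {delta_ms:.0f} ms", end="", flush=True)

    if __name__ == "__main__":
        main()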
Since the system is not doing anything special at the moment, we can see that the latency is always just perfect: always 1000 milliseconds. Now I'm going to start OMR cpud using the reclaim file at the root of the memory cgroup hierarchy, which means that memory reclaim still happens system-wide: when memory reclaim is triggered, I'm going to reclaim memory from everyone in the system, the latency-sensitive task included. And as we can see, every time we trigger memory reclaim we see a little spike in the latency, like 3 extra milliseconds here; let's see if another memory reclaim happens... another 3 milliseconds, and so on. Now I can also start the memory stress test allocator again. While this guy is running, since it's also CPU intensive, memory reclaim won't be triggered, but as soon as we hit the target percentage, which is 85 percent, we will start to kick off some memory reclaim, and we will probably see other spikes in the latency-sensitive task. Probably not during the first memory reclaim, because there's a lot of memory allocated by the memory-intensive application... well, actually we have seen a 4 millisecond spike; this could also be caused by the I/O required to swap out the memory. Now let's wait, as before, for all the memory to be swapped out, so that we can see whether we also hit other spikes down here. While we're waiting, I'm going to tell you what I want to do next: the next test will be to create two separate cgroups, move the memory allocator stress test into one cgroup and the latency-sensitive application into another, and I'm going to call the first one the background cgroup and the second one the foreground cgroup (the setup is sketched below). Oh, let's see: now we have a pretty big latency spike, like 17 milliseconds, and here's one of 271 milliseconds. So we are hitting significant latency spikes down here while doing memory reclaim, and that's because we are doing a system-wide memory reclaim.

Now let's try again. I'm going to repeat the test, but this time, like I was saying, we are going to create the two cgroups, one called fg and the other called bg. I'm going to move this guy here into the background cgroup, because this is the shell session where the memory stress test, the memory allocator, will run, and this guy down here, the foreground task, the latency-sensitive task, will be moved into the foreground cgroup. Perfect. Just to make sure I'm running everything in the correct cgroups: right, this one is running in bg and this one is running in fg. Fine. Now I'm going to restart the two different benchmarks, the memory stress test and the latency-sensitive test, and I'm going to start OMR cpud again, but this time it will reclaim memory only from the background cgroup. So what I should see is that memory will still be reclaimed, but this time, theoretically, I shouldn't see any extra latency here, so I won't affect the performance of the latency-sensitive task at all. All right, memory reclaim has been triggered, and because this time we only reclaim memory from the background cgroup, we won't affect the performance of this guy running here. Memory is being reclaimed, so far so good, and I don't see any extra latency down here; if you remember, in the previous case we had already seen some spikes down here by this point. Let's wait until the end, when all the memory has been reclaimed. I think we are getting close to almost all the memory being reclaimed, but as we can see, this task doesn't notice any impact on performance.
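To recap the setup used in this last part of the demo, here is a minimal sketch of creating the two cgroups, moving the shell sessions into them, and reclaiming only from the background group. The reclaim file name is, as before, a placeholder for the interface added by the opportunistic memory reclaim patch, and the PIDs are just examples; it assumes cgroup v2 mounted at /sys/fs/cgroup.

    # Minimal sketch of the fg/bg cgroup setup (cgroup v2 assumed at
    # /sys/fs/cgroup); memory.swap.reclaim is a placeholder for the file
    # added by the opportunistic memory reclaim patch.
    import os

    CGROUP_ROOT = "/sys/fs/cgroup"

    def create_group(name):
        os.makedirs(os.path.join(CGROUP_ROOT, name), exist_ok=True)

    def move_pid(name, pid):
        # Writing a PID to cgroup.procs moves that task into the cgroup.
        with open(os.path.join(CGROUP_ROOT, name, "cgroup.procs"), "w") as f:
            f.write(str(pid))

    def reclaim_from(name):
        # Hypothetical per-cgroup reclaim trigger provided by the patch.
        with open(os.path.join(CGROUP_ROOT, name, "memory.swap.reclaim"), "w") as f:
            f.write("1")

    if __name__ == "__main__":
        create_group("fg")
        create_group("bg")
        move_pid("bg", 1234)  # PID of the memory stress test shell (example)
        move_pid("fg", 5678)  # PID of the latency-sensitive task (example)
        reclaim_from("bg")    # reclaim only from the background group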
So if you recall the example that I mentioned near the beginning of the presentation, a latency-sensitive application like a video game played on a mobile phone, we can see that its performance is not affected at all. All right, we have reclaimed all the memory, we keep triggering more memory reclaim, but performance is still not affected. So let's go back to the slides.

Conclusions. Opportunistic memory reclaim can definitely help to speed up hibernation and resume time, as we have seen, but hibernation is not the only scenario where this feature can be helpful. Being able to trigger memory reclaim in advance from user space can provide benefits in scenarios like the one we saw in the last example: if we want to improve system responsiveness during large allocation bursts, if we want to prepare the system to handle large allocations, if we want to prioritize the responsiveness of certain latency-sensitive applications over latency-insensitive ones, or even if we want to reduce the overall memory footprint of a system; there are cases where memory can be really expensive, and being able to trigger memory reclaim from user space can help reduce the memory footprint when it's needed.

Now, future work. The overall idea is still work in progress. In particular, we still need to figure out the ideal ABI that the kernel should provide to user space, to make this feature as generic as possible so that it can benefit many different contexts and not only hibernation. Of course, I know that Google is working on a similar solution: they are experimenting with a similar proactive memory reclaim technology, also based on memory cgroups. There's also one important thing to mention, which is that even with the mainline kernel as it is right now it would be possible to trigger an opportunistic memory reclaim. Specifically, there's a file in the memory cgroup filesystem, in cgroup v2, called memory.high, and you can set a limit in this file that represents the maximum threshold of allocated memory for a specific cgroup. So, theoretically, setting a very small value in this file would force the kernel to reclaim memory. However, the downside of this approach is that we need to react fast and raise the limit again soon enough, otherwise we risk stalling the entire cgroup, or even worse the entire system, if we don't respond quickly enough by re-increasing the limit once the memory has been reclaimed. The advantage, in my opinion, of the single-shot memory reclaim is that there's really no way to affect performance too badly: if the system starts to refault and request memory that has been reclaimed, it's just a one-shot reclaim, so the system can refault and reload memory from swap, or into the page cache from the corresponding files, and we can decide from user space whether we want to retry the reclaim or just give up, based on certain system statistics, like the idle time that I was using, or even other statistics.

I've also added a few references here if you want to learn more about this topic. I guess the slides will be available somewhere; if not, you can reach out to me and ask for any information. I think that's it, so thank you for listening. If you have any questions, feel free to ask.
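(For reference, a minimal sketch of the memory.high workaround described above: temporarily lower the limit on a cgroup to force reclaim, then restore it right away so the group is not left throttled. The cgroup path, target value and timing are assumptions, not part of the talk.)

    # Minimal sketch of the mainline workaround: force reclaim by lowering
    # memory.high on a cgroup, then restore the limit quickly so tasks in
    # the group are not left throttled. Path and target are examples.
    import time

    HIGH_FILE = "/sys/fs/cgroup/bg/memory.high"  # example cgroup
    TARGET = 256 * 1024 * 1024                   # shrink group to ~256 MiB (assumption)

    def force_reclaim():
        with open(HIGH_FILE) as f:
            old = f.read().strip()               # usually "max"
        with open(HIGH_FILE, "w") as f:
            f.write(str(TARGET))                 # kernel starts reclaiming
        time.sleep(1.0)                          # give reclaim a moment (assumption)
        with open(HIGH_FILE, "w") as f:
            f.write(old)                         # restore the limit quickly

    if __name__ == "__main__":
        force_reclaim()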