page cache page, or, if it's an anonymous page, write it out to the swap device. The way that's done is in the process page table: the kernel goes in and unmaps the page table entry that maps a particular page in the process's linear address space to a page frame on the system. Once that happens, the kernel can do a lot of interesting things with the page, and we'll show how transparent memory compression does that. But first I want to talk about swapping and the horrible reputation it's gotten, and rightfully so. Most people size their machines and their virtual machines so that RAM exceeds what their workload needs, in order to avoid any swap, so you really size your machines on a peak memory load basis, because you can't deal with the latency that swap introduces and the non-determinism that it introduces. So, understandably, a lot of people don't run swap; I typically don't run swap. Let's talk a little about why swapping is so bad. Swapping out pages is really not so bad, because most of the work done to unmap pages in Linux is done by kswapd, so it's not even done on the workload thread, and the write-out of the page to the swap device is done asynchronously: the block device layer queues it up and writes it out in a semi-efficient manner. So swapping out is not really the problem.
It's swapping in that's the problem, because when a process hits a page that the kernel has swapped out, that process is blocked until the page can be pulled in from the device, and there's just nothing the kernel can do about that. How long your process is blocked depends on the speed of your swap device, and on typical rotational media that's going to be on the order of 10 milliseconds for the seek and then the read. That is a lot of time for most CPU-intensive workloads. You can really think of it like this: you've got a CPU-bound load, and all of a sudden it runs out of memory and starts swapping, and that load becomes an I/O-bound load, because now it's waiting on its pages to become available, and that is no good. This is what that looks like. This is running the SPECjbb performance benchmark, and you can probably tell how many gigabytes of RAM the system has, right? It's 10, because at 10, when the heap size exceeds the memory size, you lose about 95% of your performance. In this particular case this is on a two-core system with SMT4, in Power terminology; that's like four-way hyper-threading. And during this time the CPUs are mostly idle: they're stuck in iowait while page faults are being serviced, and that's no good. And this is the swap activity during that same time.
So you see swap I/O skyrockets, and you reach an equilibrium point where your disk is servicing requests as fast as it can. Now you're in this equilibrium at max disk throughput, the whole system throttles, and it's no good. One thing I want to add here is that this is on rotational media. With SSDs this would look a little better: the throughput would be higher, and you don't have the seek latency associated with rotational media. But unfortunately, SSD technology right now doesn't really handle write-heavy loads very well; you'll burn out your SSD if you use it heavily as a swap device. So SSDs can solve some of these problems, but unless you want to burn up your SSD really quickly, I don't recommend using one as a heavy swap device. This is the case I was referring to earlier: I've actually talked to sysadmins who would rather their process be killed by the out-of-memory killer than have the severe latency and non-deterministic performance that swap I/O introduces. So what we really need is a way to smooth out this cliff that happens when you over-commit memory. Everything's running along just fine, and then all of a sudden it just drops off a cliff; zswap is a way to do that. Now, when I made this presentation, I wasn't sure how much everyone would know in depth about the memory manager and how it works, so I'll run through this quickly, just to get us all on the same page before I start talking about how zswap works. All memory in the system is managed in a unit called the page frame. On x86 that's usually 4K; on Power systems it's 4K or 64K; different architectures have different page sizes. But this is the basic unit in which the hardware understands memory, so the TLBs and all the memory hardware work with memory at this granularity. When the kernel starts, it creates something called a memory map, and it creates a page structure
that tracks what each page frame in the system is being used for. Oh, and one more thing on this: every allocated page frame in the system exists on one of two lists in the memory manager; one is the active list, one is the inactive list. The idea is to keep pages that are, as the names might imply, active, the ones that programs frequently reference, in memory all the time, while the inactive list holds the pages that are considered for reclaim when the system needs more memory. So when the system is low on page frames, it searches from the end of the inactive list, and the heuristic is least recently used: it assumes that if a page hasn't been accessed for a very long time, it's not likely to be accessed again in the near future. So it searches from the end of the inactive list to look for pages to reclaim. Pages come in all kinds, and the three that you'll most likely encounter are these. First, clean page cache pages, which happen on a read: if you do a copy command, for example, it's going to read a page into the page cache and then write it out somewhere on disk. Even after that command is complete, the page still exists in the page cache, because you might use it again. But if you don't use it again, and it gets to the end of the inactive list and the kernel needs memory, the kernel goes: hey, this page is here, it's clean, so I don't have to write it to disk. This is a cheap reclaim situation; I can just remove the page from the page cache and use it for something else, and it takes no disk activity to do that. The second kind you'll run into is a dirty page cache page. It's the same as before, except it's been written to, so now it is dirty in memory.
It has to be written out to disk before you can use that page frame for something else. And the third most common type of page you'll encounter is the anonymous user page. In user space, if you do a malloc, or any number of other things, it will allocate on the process heap, which is anonymous memory. These pages don't have any file backing, so the swap device acts as their file backing. When you want to swap a page out, you first need to write it to disk, because the memory has to be maintained in a persistent state: if you move it out of memory, you have to be able to get it back somehow when the process wants it back, or obviously bad things happen. So let's talk about anonymous page reclaim, since that's the part of memory that the compressed cache is going to be operating on. This is what I alluded to earlier: it uses a process called memory unmapping, which is the process of going through the page tables of all the processes on the system. This is a very high-level description, and the kernel does it in a much more efficient way, but it's essentially going through the page tables of all the processes that have this particular page frame mapped into their address space and breaking that link. And when it breaks that link, it puts information in the page table that lets the page fault handler know: hey, this page is not present in memory, but you can retrieve it from this location. And here's a graphical representation of that.
I'm a visual learner myself, so the diagrams help me. You've got your task_struct, your mm_struct, and then the page table that comes off of it. I annotated all of these: that's the page global directory, the page middle directory, and the page table entry. The last stage actually points to a page frame; you can see how that links into memory, and when you access that linear address in the process, it's going to look in that page frame. Then you have the mem_map down here, with a struct page on the inactive list that corresponds to that page frame and lets the kernel know what the page frame is being used for. Memory unmapping breaks that last link, between the page table entry and the page frame. After that happens, the kernel doesn't lose track of the page, because it's got the struct page and it's keeping track of those things, but from this point on, the hardware will take a fault if the process tries to access that memory location again. And once we break that link, the kernel can do a lot more with the page, because now it knows that no user process can access it, and that gives the kernel a certain freedom to do what it wants with that page. So this is the layout of a swap entry, and this is the information that is put in the page table entry so that, on a page fault, the kernel can find out where the page can be loaded from. It's very simple: there's a type and an offset. The type specifies which swap device the page can be found on, and the offset tells you the page offset within the device where the page has been stored.
On a page fault the kernel knows the swap device; it calculates the offset, converts that to disk blocks, goes and reads those disk blocks in, remaps the page tables to point to the new page frame, and then resumes the process. So that's page reclaim for anonymous memory, the crash course. On to zswap, which we can talk about now that I've laid that foundation. Zswap is a feature that hooks into the swap code, basically into the write and read paths of the swap code. When a swap-out is about to happen, and the kernel is going to write a page to the swap device, there is a hook there. And there's actually something I didn't mention here: the frontswap API. It's been in the kernel for probably six or seven releases, and it's a very thin glue layer that allows other kernel drivers to hook into the swap path at critical points, in the swap-out path and in the page fault resolution path, to capture swapped-out pages right before they're written out. So what zswap can do is intercept that page right before it's about to be written out to the device, and if the write to the frontswap backend, in this case zswap, succeeds, then the kernel will not proceed to write the page to the swap device. It says: something down there has the page, it's persistent, it's keeping track of it, I don't care about it anymore,
as long as I can get it back later. So that is how zswap avoids a disk write on a swap-out. Now, we don't care about the swap-out path so much, since we already established that it doesn't impact performance too much: kswapd is primarily the one that does it, and writes can be scheduled a lot better than reads can, for later processing. The thing we really care about is the swap-in time. Now that we've got the page in this compressed memory pool, when the fault occurs there is a hook in the swap read-page path that tries to look up that swap type and offset pair in the compressed cache, and if it finds the page, it can decompress it from memory and resolve the fault much more quickly than by doing a read through the block I/O layer, where who knows when that will come back. So that really cuts down on your swap-in time. The code for that is in mainline as of 3.11, and it's in the mm subtree at zswap.c. You have to stay with me here.
I got a sore throat this morning, so if I start fading, let me know, and if you can't understand what I'm saying, I can slow down. Another thing I want to touch on is zbud. Right now I've been talking about the compressed pool as this black box, and it's important to know how that black box operates. It's not that complex, but it does have some catches later, and you'll see why. Zbud is the name of the allocator that we use for the compressed pool, and there were two functional requirements for the pool that kind of worked against one another. One is that you want low fragmentation: you're doing all this work to compress memory, and if you can't store those compressed pages efficiently, that reduces the effective compression of what you're doing. At the same time, memory management folks don't like dead ends in the memory management path; they don't like memory that you can't do anything with. So if the compressed pool is full, zswap needs some way to drop the oldest pages out and go ahead and write them. The conflicting functional requirements, then, were high-density storage, but being able to reclaim a page from that storage quickly, because if you've got a lot of compressed pages in a particular compressed-pool page, you have to write out a lot of them before you can get that page frame back. The high-density storage versus quick reclaim of compressed-pool pages made for differing points of view when talking about this design at the memory management conference earlier this year. What zbud does is store only two compressed pages per page frame. It does this for simplicity reasons, and for fast reclaim: when you try to reclaim a page from the compressed pool, at most you have to decompress two pages and write them back. And that was something a lot of the kernel developers
really cared about. Unfortunately, that also caps the effective compression at 50%, so if you have workload pages that compress really, really well, that compression won't be fully realized with zbud. Now, there's another allocator out there that we were using beforehand, called zsmalloc, and it's currently in the kernel staging tree. It has higher density, but the flip side is that it takes a lot more work to reclaim a page frame when there are lots of compressed pages in it. Still working with the community on that one. The code for zbud got merged along with zswap, and it's in the mm tree at zbud.c. This is just a quick diagram of how the simple design of zbud helped with the mainlining process, because complex things going into the memory manager take a very, very long time to get in. The first buddy, which is the first compressed page that comes in, gets stored page-aligned at the beginning of the page frame; the next buddy that comes in gets stored, whatever you want to call it, right-justified at the end of the page. And that is to keep fragmentation from happening: with the two objects stored justified at the ends of the page frame, the free area in the middle never fragments, and zbud keeps lists of the slack space left in unbuddied pages. This, again, limits you to 50% compression, but as you'll see later, it still yields really nice performance benefits and I/O elimination. So, enabling zswap. Yes? Right, so what I didn't show on that slide is that a buddy can't start at just any offset within the page. Zbud divides the page frame up into chunks that are bigger than a cache line, and the buddies do start on those chunk boundaries. So, zswap: right now zswap has to be built into the kernel. It can't be loaded as a module, and the reason for that is that it depends
on symbols that are not exported from the kernel, in the swap code, and exporting those symbols would create a war that I don't want to fight. We're hoping to get it into loadable driver form later, because that is better, but right now it's built into the kernel. That brings up the question, for distros, that it either needs to be on or off by default, because if it's built in, it's going to be there regardless. The default is that it's off, and that is a reasonable choice: it's new code, it has hooks in the memory reclaim code, and it has just not been vetted on a wide selection of platforms and workloads and environments. Enabling it takes a kernel boot parameter, zswap.enabled=1, and that will enable it with the default compressor, which is LZO. LZO is a build requirement for zswap in the kernel, so if you select zswap in your kernel config, it will build in LZO as well, which is no problem, because almost all kernels build in LZO anyway: it is the default compressor for the Linux kernel image. You can specify another compressor; any compressor in the kernel's cryptographic API can be used, and deflate is just one of them; I give the example here. I guess I'll also mention, shameless plug, that the POWER7+ hardware has a compression accelerator in it whose driver is upstream, and in our case we use 842, which is the driver for that hardware compressor. There's only one tunable, and this was by design.
We wanted it to be, the hope is for it to be, an always-on, minimal-configuration type of thing, and so there's only one tunable, max_pool_percent, which dictates how much of RAM the compressed pool can occupy. This is just a safety measure until better heuristics can come in about what the size of the compressed pool should be; those heuristics don't exist right now, so it's a tunable, to keep the compressed pool from completely overrunning the system. There are a couple of places in sysfs and debugfs that are useful. /sys/module/zswap/parameters is where max_pool_percent lives; it can be set at boot time or changed at runtime. Statistics on zswap activity are held in debugfs under the zswap attribute, and frontswap is at that same location, /sys/kernel/debug/frontswap; it has higher-level metrics about what it's sending down to zswap. Both of those can give you insight into whether pages are being stored in the compressed cache.
You can see how many of them are being accepted, how many are being rejected, things like that, and why they're being rejected. Some pages can't be captured by the compressed pool because they can't be stored efficiently; sometimes zswap can't secure memory for the compressed pool. There are a number of ways it can fail, and these areas keep statistics on how many have failed and why. So, enough talk, let's do this. All right, this is a graph. The blue line you've already seen; that was the performance cliff from before. The red line is the default, software compression with LZO, and the top line is with the POWER7+ hardware accelerator. As you can see up there in the subtitle, I've set max_pool_percent to 40, so the compressed pool can occupy 40% of memory. With zbud, basically what that means is that you can over-commit memory by 40%, or run your memory load up to 140%, before you start seeing that drastic swap cliff. You can see that eventually you do overrun even the compressed pool, and then your performance converges on normal swap, but this does give you a nice, more gradual slope than the cliff you see in the blue line. Yes? So, on the compression ratio: the pages compressed to about a third of a page, but you lose some of that because of zbud; your effective maximum compression is limited to 50% right now. That's in the future work; we want to be able to realize the full compression. Yes? Yes, so the question was whether this could be remediated by using zsmalloc, right? Is that what you said? Right, instead of zbud. So the question was, do I have this graph with the zsmalloc allocator?
No, I don't, not with me. The graph does exist, though, and basically what it looks like is that this area right here, the area where the benefit is realized, extends much farther out, because the pool doesn't fill as quickly, since it's storing the pages more efficiently, so the area of benefit reaches out to 16, 17 gigabytes. And that's an important thing to point out, because the length of this area, where you realize the benefit, depends on a number of factors. One of them is how big you allow the pool to be. Another is how compressible your pages are: if your pages are not very compressible, this area will be shorter. With zsmalloc, for example, if you have highly compressible pages, and I've done it even with zero pages, and LZO compresses zero pages very, very well, you won't see the end of the benefit: you can run it out to 20 gigs and you'll still see zero swap I/O. So again, we're trying to get zsmalloc back in there, and with that, this graph looks a lot better. But this graph is really a side effect of this one. As you can see on the swap I/O graph, and you've seen the blue line before, there is little or no swap I/O all the way out to where the compressed pool gets filled. This is really the cause of the performance increase: you're not having to go to disk to read these things in or write them out; it's all just being compressed and decompressed in memory. And if you're on a SAN or shared storage or slow storage or anything like that, this is really good news. So, I've kind of already started covering this, but there are three regions of the graph, and I want to go over them real quick. In case one we're in the under-committed state, and we're not swapping any pages.
So you're seeing nominal performance, 100%, and this is what RAM and swap look like: you've got some free RAM available, and swap is completely idle, and if swap is idle, zswap is also idle. Case two, that's the region of effect for zswap. RAM is fully used; the system would be swapping, but instead it's compressing, so part of RAM is being used as the compressed pool, and during this time there's little or no swap I/O. Unless your pages don't compress well enough, or for some reason zswap can't secure memory for the compressed pool, those pages are getting captured in the cache. Case three is where the compressed pool has grown to max_pool_percent and can't grow anymore. In that case, the oldest pages in the compressed pool get decompressed and written out to swap. This is nice because the slot of the swap entry that you use as an index into the compressed pool is still reserved in the underlying swap device, so basically you're just resuming the write to disk that was going to happen before zswap intercepted it. There were a lot of policy decisions, a lot of academic exercises, in this area. Do you use a static, pre-allocated pool of RAM? We decided to go with a dynamic one, because if you have sized your workload properly, then you want it to operate fully uncompressed, right?
You only want memory to be eaten up by the compressed pool if it's actually needed, and, like I said, we're looking for an always-on solution; a static pool would require user intervention to say how big you want the pool ahead of time, and things like that. So we decided to go with a dynamic system. On the pool sizing, again, the heuristics weren't there, so we went with a user tunable. And then there's the pool-overflow action. Frontswap allows the backend to actually fail the write: when the kernel goes to write out a swap page and sends it off to the backend, the backend can say, no, I couldn't take it. That is an easy way to handle the pool-full situation: you send the page off to the backend, the backend goes, I'm out of room, and just rejects it, and then the page goes on to the swap device. The problem is, that means the less recently used pages sit in the compressed cache while more recently used swapped-out pages fall through to the swap device. You get an inverted LRU, which runs counter to everything in the memory manager. So we put the mechanism in there, and it's a little bit more complicated, to write back the oldest pages in the compressed cache on the back end, to reclaim pages that can be used for new things coming in on the front. So, the use cases. If you're an infrastructure-as-a-service user, you probably pay for the size of your instance in RAM and CPUs. So maybe you're buying one that's oversized for your workload, on the off chance that your workload goes over the amount of RAM that you have, because if you're running with swap, performance degrades, and if you're running swapless, the out-of-memory killer kills the workload. You can use zswap as a kind of backstop for those workloads: if one runs in half a gig of RAM normally, but every once in a while runs two or three hundred megs over that, then this can allow you to run that workload in a smaller instance and guard
against that really sharp penalty for swapping. As an IaaS provider, the other side of the coin, you can enable this at the hypervisor level. In the KVM case, and I'm not sure about the others, maybe it's the same thing, instance memory is considered anonymous user space memory by the hypervisor and can be swapped just the same as any other process's, so you can increase guest density on the machine that way. And then there are systems where you're either at your maximum memory capacity, or the system doesn't have upgradeable memory, and you'd like to keep running your workload on it, but you can't, because the workload's memory requirements now exceed what you have. Just a couple of gotchas. Because the swap entry is used as an identifier in the compressed pool, you can't actually store more pages in the compressed cache than can be written out to the swap device. So if your swap device is smaller, that is, if it won't hold as many pages as max_pool_percent of memory will, then you'll be limited by the swap device size, not by max_pool_percent. I'll quickly refer to these other projects. Zram is a driver that also works on swap pages, but it is the swap device, rather than a caching layer on top of one. It's basically a compressed RAM disk that you can swap on, swapon /dev/zram0. This is really popular in the embedded community, where they can't really have a swap device: they have flash memory, and they don't want to put a swap partition on it. That work is being headed up by Minchan Kim on the memory management list, if you care about looking into it. It works better for some embedded things, but it has that dead end that the memory management people really don't like: because it mimics a block device, once a page gets written in there, there's no way to get it out of there.
And so there's no way to reclaim that RAM. Zcache is another project, and it's for file cache, that is, page cache, compression. That work is being done by Bob Liu, also on the memory management list, and it's in staging. Zcache has gone through so many iterations, and what it does has changed so much over time, but basically what it is now is page cache compression. Page cache compression has a unique challenge compared to swap compression: if memory reclaim gets to one of those clean page cache pages I was talking about earlier, that used to be a cheap reclaim, and now you're compressing it and storing it. If you're not really confident that the page will be accessed again, you're actually regressing performance, because you compress the page, store it, and eventually it just needs to be freed, and it never got reread, but you went through all the trouble of compressing it. So the heuristics to determine whether a page cache page will be used again, and the confidence level associated with that, are just not there yet, and that is an area of future work. This is a new area; to my knowledge no other operating systems do this. Really, where we'd like to go with it is that compressed memory wouldn't just be an offshoot that hooks into the memory manager and does some things off to the side, but that eventually compressed memory would be a first-class citizen in the memory manager: it would just be another type of memory that the central memory manager manages. And maybe it could even do preemptive compression: not wait until swap has engaged, but say, oh, memory pressure is increasing, let me have a kernel thread compress some pages in the background and make room, and analyze memory load over time to see, okay, I see how this process is behaving.
Maybe if I compress some pages, I can keep it in RAM without the system getting into distress. So, the summary: zswap is in mainline. Go forth and play with it. It's off by default and listed as experimental, but you can enable it; just add it to the boot line. It allows you to do more aggressive memory sizing, more toward the average than the peak, and it acts as a safety net against completely trashing the performance of your workload in case you do hit swap. So, are there any questions? Yes? The question is about choosing the compression algorithm: if you use one that's slower but gives you better compression, versus using the fast one, from what I understand you end up having more effective RAM by compressing better, but you're spending more CPU time doing so. So are you trading CPU time against how long you last before you hit the cliff? On the slide I only compared two algorithms that seem to compress about the same, since the drop-off was in exactly the same place; have I tried something that compresses much better but slower, to see whether you win or not? So: I have run the test with deflate before, and if you use deflate you're going to get the same graph with zbud, and that's because zbud is capping the effective compression; it's not storing pages very efficiently, because, if you want to reclaim, it needs to do that quickly. Now, if you use zsmalloc, which has very dense packing properties, then yes, you would see the advantage of moving to a higher-compression algorithm, and like I said, the region of effect for the compressed cache would be much wider. And how much slower is it?
I mean, on your graph, is it like twice as slow, three times, ten times? Well, it depends on your CPU, right, something recent, from today. Typically what we've seen is that the CPUs go nearly idle whenever you're doing the swap stuff, but I'd have to say that I have not done extensive testing with other algorithms. It sounds like, if you're hitting swap on disk, by that time you could just switch the algorithm to something that compresses better. Yes, that's one way to do it, and that wasn't captured in my future-work slide, but there is an effort to make the compression algorithm pluggable and changeable at runtime, so you could be adaptive: you could compress a few pages with another algorithm, and if all those pages compressed really well, you could switch to it. A related question: you said that how much RAM you use for the compressed cache is, for now, something you just select. How much do you want to give it? If you give it 50%, of course, there's the overhead of going through all that; if you give it only 10%, you're not winning that much, because now you don't have enough RAM to do it. Have you found any sweet spot, or are you still tuning that? Yes, so that's part of the heuristics, right? It would be nice if we could figure that out dynamically. To restate the question: how do you find the sweet spot for the size of the compressed pool, and even if there's not an answer for everyone, have we found anything on our side for now? It depends on how much you anticipate your workload overrunning, right? If it runs at a gig most of the time, but every once in a while peaks up to one and a half gigs, then you want to do max_pool_percent 50, because that's 150% memory commit at 1.5 gigs, right?
So basically, for both answers, you need to find out how much memory you're missing, and then tune from there to make up for it, right? For now, right. Okay, thanks. All right, we've got time for one more question, then we have to wrap it up. Unfortunately we're out of time for questions, but I'll be outside afterwards, and anyone who has a question, I'll be happy to answer. Okay, sorry, one more: what are your plans for adding support for page migration? Okay, so there is actually a patch set on the list right now to do that. It requires a change in zbud, and I've actually reviewed, I think, version two or three now, and we're getting that worked out, so there is support for that coming in zbud. The thing is, it has to be done at the compressed-pool-manager level, because you basically have to create a layer of indirection that the zbud objects can be moved around under. Right now a zbud object maps directly to the physical page frame, but if the page frame can be moved out from underneath zbud, then you have to have an indirection layer to keep track of that. Yes, that's actually the structure being used. All right, let's give a round of applause for Seth Jennings for presenting today, and if you have any more questions, we'll be happy to answer them right afterwards. Thank you.