My name is Vitaly Wool; I've been working with embedded Linux for many years, in the last years primarily with Android. Today we're going to talk about allocators for compressed data, for compressed pages, within the Linux kernel: what these things are about and what they're needed for.

Before we actually pass over to the allocators, I'm going to take a step back and start with the main users of those allocators, and that is swapping, or paging: the techniques and their limitations in the Linux kernel, and the optimizations to these techniques. First of all, swapping. I will not spend much time on this; you probably know that this is a technique that uses secondary storage to save RAM, by pushing pages that are not really used, or have not been used for a significant while, out of RAM to the secondary storage. We basically trade performance for free memory, because reading and writing pages can be quite slow with a slow storage device; but still, if we need more RAM for our applications to run, that can be the only way out.

With that said, we want to do some performance optimizations (this is what we want pretty much always, and this is what I want to do, because I work in the performance area rather than, say, the power area). So what do we do to optimize swapping with respect to performance? We're going to try to use RAM to cache the pages that are being swapped out, to minimize IO operations, especially when the backing device is slow. But if we do it in a straightforward way, then we're probably going to lose the main benefit of swapping as such: saving RAM for the real memory-consuming applications that we couldn't run otherwise. If we keep this cache and there's no real saving, then it just wasn't worth it; the whole exercise didn't make any sense.

So then the idea comes to compress the pages that are being swapped out. They're not going to be used by the system, because the system thinks they're already out of RAM; so we can compress them and keep them compressed, either until we have too many of them, at which point we actually push them to storage, or, if the system decides it needs one of those pages, we decompress it on demand.

What's important to mention here is that the compressed chunks will be smaller than the page size, and conventional allocators do not work with that: they only allocate whole pages. So we cannot really make use of the fact that we have data objects smaller than a page if we're going to allocate a full page for each such object. And then comes such a thing as a compressed memory allocator, which, well, allocates memory; but the key thing is that such an allocator is designed to store small data objects, smaller than a page. That's how we get the real benefit from the pages being compressed.

Oh yeah, a question. Many people tell me that I do not really communicate with the audience well, and at some point I accepted that, so I'm talking to an imaginary audience. This is the question that the imaginary audience is going to ask: are these compressed memory allocators just a tool to store small data chunks? Well, at the bird's-eye level, the answer is yes.
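As a quick illustration of why sub-page objects matter, here is a small userspace sketch of mine (using zlib for simplicity; the kernel uses its own crypto API compressors such as LZO or LZ4 instead). It compresses a 4 KiB "page" and shows how much of a real page would sit idle if the compressed copy still had to live in a page of its own:

```c
/* Userspace illustration (not kernel code): compress one 4 KiB "page"
 * with zlib and see how much of a real page a compressed copy would
 * waste if it still had to occupy a page of its own.
 * Build with: cc page_demo.c -lz */
#include <stdio.h>
#include <zlib.h>

#define PAGE_SIZE 4096UL

int main(void)
{
    unsigned char page[PAGE_SIZE];
    unsigned char out[PAGE_SIZE * 2];   /* comfortably >= compressBound() */
    uLongf out_len = sizeof(out);
    unsigned long i;

    /* Fill the page with something moderately compressible. */
    for (i = 0; i < PAGE_SIZE; i++)
        page[i] = (unsigned char)(i % 37);

    if (compress(out, &out_len, page, PAGE_SIZE) != Z_OK)
        return 1;

    printf("compressed size: %lu bytes (%.0f%% of a page)\n",
           (unsigned long)out_len, 100.0 * (double)out_len / PAGE_SIZE);
    printf("wasted if stored in its own page: %lu bytes\n",
           PAGE_SIZE - (unsigned long)out_len);
    return 0;
}
```

With a typical 2x to 3x compression ratio, more than half of every such page would be wasted, and that wasted space is exactly what a compressed memory allocator reclaims.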
Now, let's come a little closer to the initial subject, the compressed data allocators and how they're used; and we're going to talk a little bit about swapping and compression the way they are implemented in the kernel. The first swapping compression implementation is called zswap. It's the first both historically and in terms of frequency of use in Linux in different situations. It is a write-back swap cache which allows for compression. So what it does is exactly what we've been talking about a few slides back: it compresses pages that are swapped out and moves those pages into a pool, and when the pool is full enough, it decompresses pages and pushes them to a storage device. And then, when they are needed, they're loaded directly from the storage device. So, in a sense, it's transparent to the swapping mechanism of the Linux kernel.

And then we come even closer to the initial subject, and we're going to take the allocator for zswap, which initially was zbud. This allocator is called that because it stores either one or two objects per page, and those objects are called buddies: up to two buddies per page, one in the beginning, right behind the header, and one at the end. If we take zbud in a little bit more detail, it divides the page into same-size chunks, places a header in the first chunk, and all the rest are available for the buddies. And then it creates unbuddied lists: for each possible number of free chunks there is an unbuddied list, holding all the pages that have exactly that number of free chunks. So if we want to allocate an object that requires N chunks, we take the first page of the relevant list. If this list is empty, we take the next one, N plus one, and so on. If everything is empty, we allocate a new page. (See the toy model below.)

zsmalloc came as an alternative to zbud, and it addresses the biggest problem of zbud: the actual compression ratio, the number of pages stored divided by the number of pages used, can be quite low, especially if we stumble upon the situation where there are many objects of size around 2K. I mean not the original pages, but the compressed objects, which are sized around 2K plus delta. They are bad for zbud because you cannot store two such objects in one zbud page, so there's a lot of space left free in such zbud pages, and the actual compression ratio can be quite close to one. zsmalloc, on the other hand, allows for dense allocation of objects within sets of physically non-contiguous pages, which gives a very high compression ratio, especially in the beginning.

The deficiency of that approach is that there will be holes as time goes by: released objects will create multiple holes within zsmalloc pages. So zsmalloc is exposed to internal fragmentation, which is very complicated to deal with, and what makes it especially complicated is the fact that objects may span across several physical pages. So if we can come up with an alternative that has a better compression ratio than zbud and does not distribute objects across several pages, so that every object stays within a page boundary, then we may actually be better off. And this is something that we'll be discussing further on.
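Here is the toy model of zbud's unbuddied bookkeeping mentioned above. It is a deliberately simplified sketch of mine, not the kernel code: real zbud works on struct page, takes locks, and handles re-filing and eviction, all of which is omitted here.

```c
/* A toy model (mine, not kernel code) of zbud's unbuddied lists:
 * pages are keyed by how many free chunks they have left, and an
 * allocation takes the first page from the best-fitting list. */
#include <stddef.h>

#define PAGE_SIZE   4096
#define CHUNK_SIZE  64
#define NCHUNKS     (PAGE_SIZE / CHUNK_SIZE)   /* 64 chunks per page */

struct toy_page {
    struct toy_page *next;      /* link in an unbuddied list */
    int first_chunks;           /* chunks used by the first buddy */
    int last_chunks;            /* chunks used by the last buddy */
};

/* unbuddied[n] holds pages with exactly n free chunks;
 * chunk 0 is reserved for the header, as in zbud itself. */
static struct toy_page *unbuddied[NCHUNKS];

static int free_chunks(const struct toy_page *p)
{
    return NCHUNKS - 1 - p->first_chunks - p->last_chunks;
}

/* Find a page for an object needing `need` chunks: start at the
 * exact-fit list and walk toward lists with more free space. */
static struct toy_page *toy_find(int need)
{
    for (int n = need; n < NCHUNKS; n++) {
        struct toy_page *p = unbuddied[n];
        if (p) {
            unbuddied[n] = p->next;   /* always take the first one */
            return p;
        }
    }
    return NULL;   /* all lists empty: caller allocates a new page */
}

/* After placing or freeing a buddy, re-file the page under its new
 * free-chunk count. */
static void toy_refile(struct toy_page *p)
{
    int n = free_chunks(p);

    p->next = unbuddied[n];
    unbuddied[n] = p;
}
```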
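To put some illustrative numbers (mine, not from the slides) on that 2K-plus-delta problem: with 64-byte chunks, a zbud page has 63 usable chunks, about 4032 bytes, so two 2100-byte objects cannot share a page. Every zbud page then carries exactly one compressed page, and the effective compression ratio is about 1.0. zsmalloc, which packs objects back to back across page boundaries, fits roughly 4096/2100, close to two such objects per page, for a ratio near 2.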
But as of now, as a side note, we need to mention that we actually got two compressed allocators which did pretty much the same thing in completely different ways. That actually called for unification, because it could have been, and it should have been, easy to configure zswap to use either zbud or zsmalloc, depending on the actual system demands. So, depending on what is actually better, you should be able to choose between zbud and zsmalloc, and that is a lot more complicated when they have incompatible APIs. So that called for unification, and a new API was introduced, called zpool. Both zsmalloc and zbud implemented that API, and zswap was changed to use that API and not call the zbud API functions directly. So now it's only a matter of changing the name of the compressed allocator backend for zswap to proceed with the compressed allocator that fits your taste. (There's a sketch of what that API looks like just after this part.)

Okay, now we get to questions from the imaginary audience. What happened next? That's a good question, because then came zram, which is, to a certain point, a lot more relevant to what we do here, because this is an embedded conference, and zram is actually something that was targeting embedded devices in the first place. It's a self-contained compressed block device, self-contained storage that doesn't need backing storage on a disk or flash, so you can use it directly for swapping, because it also compresses pages. And this is an exact match for embedded devices with small RAM, because you free up RAM by putting unused pages into the zram device: they occupy less space because they are compressed. And at the same time, you don't touch the embedded device's flash storage, which is usually slow, so you don't slow down the operation of the device, and you don't cause extra wear of the flash, which usually cannot be changed on an embedded device because it's soldered in. And it's all good, it's all great.

The only small problem is that zram didn't exactly play by the rules: it didn't use the zpool API; instead, it called the zsmalloc functions directly. So for zswap, almost from the beginning, it's been "you choose what fits you best", but with zram you can only use zsmalloc. And you may recall the problems that zsmalloc has, together with all the good things, of course.

So what we did first: there was a presentation in 2016, in San Diego, about the results of using zram over zbud, comparing zbud and zsmalloc as backends for zram. I would just refer you to that presentation, because the slides are up there on the net. But the bottom line was that it was worth a try. In most cases, especially for small devices, with not that big a number of threads, zbud actually operated better, smoother and faster. But with that said, it still wasn't a real match for an embedded device, because the compression ratio was quite low compared to the one achieved with zsmalloc. And with a compression ratio of about 1.5x, it just isn't worth it; all the fuss isn't worth it, there are not that many actual savings. So it was a nice thing to try, but it didn't work out that well. And here we have a graph showing pretty much the same thing: zsmalloc actually scales better, but in terms of IO operations it takes off slowly, while zbud takes off quite fast.
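Coming back to the zpool unification: this is roughly what the client side looks like, based on my reading of the zpool API in kernels of that era (around 4.8 to 4.14). Treat it as a sketch: the exact signatures have changed over time, and error handling is trimmed.

```c
/* A sketch (mine) of a zpool client, after the zpool API of kernels
 * around 4.8-4.14; signatures are approximate, not authoritative. */
#include <linux/zpool.h>
#include <linux/gfp.h>
#include <linux/string.h>

static int store_compressed(const char *backend,  /* "zbud", "zsmalloc", ... */
                            const void *src, size_t len)
{
    struct zpool *pool;
    unsigned long handle;
    void *dst;
    int ret;

    /* The backend is chosen purely by name: this is what lets zswap
     * switch allocators without knowing their internals. */
    pool = zpool_create_pool(backend, "demo", GFP_KERNEL, NULL);
    if (!pool)
        return -ENOMEM;

    ret = zpool_malloc(pool, len, GFP_KERNEL, &handle);
    if (ret == 0) {
        /* Objects are referenced by opaque handles and must be
         * mapped before use; the allocator controls the placement. */
        dst = zpool_map_handle(pool, handle, ZPOOL_MM_WO);
        memcpy(dst, src, len);
        zpool_unmap_handle(pool, handle);
        zpool_free(pool, handle);
    }

    zpool_destroy_pool(pool);
    return ret;
}
```

Mainline zswap exposes this choice as its zpool module parameter, which is the "changing the name of the backend" mentioned above.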
So if you have a really constrained system, it could actually be good to have, for zram, something that works like zbud but has a slightly better compression ratio. And now the imaginary audience goes on to the next question: what if we modify zbud to store three objects per page instead of two? And that is an excellent question, because this is what we did. So, z3fold, the new kid on the block (that's the 80s-90s school, I know). z3fold was created as a spin-off from zbud, with the exception that it didn't have its own API from the very beginning: it only implemented the zpool API. Well, other than that, three objects per page instead of two. And it can handle page-size allocations: zbud, which stores its header in the first chunk, would just refuse if the object to be allocated is of page size. We don't do that in z3fold; we create a headless page instead. And yeah, it's a new kid: the work started after that 2016 conference, and the first version came into the mainline in 4.8.

Provided that zram worked with the zpool interface and did not directly communicate with zsmalloc, as it does now, z3fold would turn out to be an okay match for both zswap and zram. It provides low-latency operation and a decent compression ratio, which is nice for an embedded device. And it supports eviction, which was inherited from zbud, while zsmalloc doesn't, so it's a better match for zswap. And, once again, it has a higher compression ratio than zbud, so for zram it's also a match. But then we get to the updated applicability matrix. Yeah, I mean, if we just take the existing mainline kernel, we cannot use either zbud or z3fold with zram, once again, because zram works with zsmalloc directly and does not use the relevant generic API. Which is not very nice. But we're not going to concentrate on the bad things; let's concentrate on the good things and on making them even better.

So we pass over to the fun part: comparisons. And the first thing to look at is the compression ratio under stress load, with the first version of z3fold which came into the mainline kernel. So, what do we see here? Well, you can see that zbud is way, way below. We can see that zsmalloc starts off pretty well, at about 3.5, sometimes closer to 4. But then it doesn't seem to stabilize, right? I mean, it goes on in waves, while z3fold seems to be quite stable, circulating around 2.5 to 2.7. And then we go to the random read/write throughput achieved by a multi-file random R/W test. And here we can see that, well, zsmalloc is definitely superior when it comes to many threads, but zbud and z3fold are somewhat better when the number of threads is low; they just don't seem to scale well. But that being said, they behave in a very similar way, which is not much of a surprise given that z3fold inherited a lot of code from zbud, so it inherited the behavior too.

The thing is, however, that we were aiming at performance which wasn't inferior to zsmalloc's. So we decided this situation was not acceptable, and we started profiling z3fold, concentrating on the two following patterns. The first one was using perf while running fio, a very nice testing tool for creating stress load and different kinds of IO patterns for block devices; this was used to identify bottlenecks under stress load, primarily. And we were also looking into how z3fold operation affects user experience, which was done by looking into perf results while the Android low memory killer was triggered, which normally coincides with a lot of page swapping.
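Before the profiling results, here is a toy model of the three-slot layout just described. It is my own simplification: it ignores the size checks and locking that the real z3fold performs, though the FIRST/MIDDLE/LAST naming does match the mainline code's buddy naming.

```c
/* A toy model (mine) of z3fold's slot choice. Real z3fold also
 * checks whether the object actually fits and takes locks. Note
 * that page-sized objects never reach this point: they get a
 * headless page, with no z3fold header at all. */
enum buddy { FIRST, MIDDLE, LAST };

struct toy_z3fold_page {
    int first_chunks;    /* 0 means the slot is free */
    int middle_chunks;
    int last_chunks;
};

/* Pick a slot for a sub-page object; -1 means the page is full and
 * the caller should take another page. */
static int pick_slot(const struct toy_z3fold_page *p)
{
    if (p->first_chunks == 0)
        return FIRST;     /* starts right after the header chunk */
    if (p->last_chunks == 0)
        return LAST;      /* ends exactly at the page boundary */
    if (p->middle_chunks == 0)
        return MIDDLE;    /* placed after the first object */
    return -1;
}
```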
And what was the result of the profiling? Well, you know, not much of a surprise: the main bottleneck was the big spinlock protecting all the unbuddied lists. So all the unbuddied lists are protected by the same spinlock, and this, of course, does not scale well. Well, in fact, it doesn't scale at all. Another thing to look into was the z3fold free function, because an internal page layout optimization was performed on every object free operation. And here's what that is. The first picture out there, the one on the top, shows a full z3fold page: the first object is there (the reddish one), the second is there, and the third is also there. So then, if we free the first object, we end up in a situation where we have two free spaces within the page. So we have internal fragmentation which, however, is easy to fix: we move the middle object to the beginning, and then we have one contiguous free space that we can use in a better way than the two separate ones. But it's a relatively costly operation to do in the free function, because it may slow down things on the hot path. So the idea appeared to implement asynchronous handling of such situations.

But first came the idea of using per-page locks. We cannot avoid having the big lock, but we can limit its usage to protecting only the list operations, while the operations which are internal to the page can be protected by an in-page spinlock. So we ended up with a set of patches, and then we have a graph which looks like this, where the lighter line is for the updated z3fold implementation. Unfortunately, it didn't help that much.

Then came the asynchronous page compaction, and it got into the mainline basically very recently. The idea there is, as I've said, that we just take the compaction out of z3fold free and schedule it, so we're taking it off the hot path; that's one thing. And the other thing is that it may save time on compaction if several objects are freed at the same, or almost the same, time, because we do it just once instead of doing it every time after an object is released. (A toy version of this compaction step follows below.) But obviously we still needed some extra improvement.

Right. Okay. Then came the idea of lockless lists: implementing the unbuddied lists using llists, since we're fighting with the spinlock. Can we do it like this? The thing is that processing z3fold allocations could be a lot simpler with llists, and a lot faster; but the problem is that deleting an entry from an unbuddied list would be a lot more complicated, because there's no analog of list_del for llists, and if we want to simulate it one way or another, we're going to end up using the usual mutual exclusion mechanisms anyway. So for the unbuddied lists it just didn't work. That doesn't mean that at some point we won't return to the idea of using llists for something else within the implementation, but for the unbuddied lists, unfortunately, it wasn't a match.

And then came the idea of per-CPU unbuddied lists. And here's why: as with zbud, we basically only need one entry per unbuddied list, because we always take the first one. So, as long as we can maintain such lists on every CPU, we can limit the search to the local CPU, save time, and also kill the need for the spinlock, because we only deal with the local lists; only new page creation still takes the lock temporarily. (A sketch of this follows after the compaction example below.)
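Here is the promised toy version of the in-page compaction step, extending the toy z3fold page from the earlier sketch with an actual data area. It mirrors the logic described above in plain C of mine; the real z3fold operates on struct page under the page lock and, after the change just discussed, runs this from a scheduled work item rather than in the free path itself.

```c
#include <string.h>

#define PAGE_SIZE  4096
#define CHUNK_SIZE 64

/* Toy version (mine) of the in-page compaction described above:
 * when the first slot has been freed, slide the middle object down
 * so the free space becomes one contiguous run. */
struct toy_z3fold_page {
    unsigned char data[PAGE_SIZE];
    int first_chunks;     /* 0 means the slot is free */
    int middle_chunks;
    int start_middle;     /* chunk index where the middle object begins */
};

static void toy_compact(struct toy_z3fold_page *p)
{
    if (p->first_chunks == 0 && p->middle_chunks != 0) {
        /* Move the middle object right behind the header (chunk 1);
         * it thereby becomes the new "first" object. */
        memmove(p->data + 1 * CHUNK_SIZE,
                p->data + (size_t)p->start_middle * CHUNK_SIZE,
                (size_t)p->middle_chunks * CHUNK_SIZE);
        p->first_chunks = p->middle_chunks;
        p->middle_chunks = 0;
        p->start_middle = 0;
    }
}
```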
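And here is a sketch of the per-CPU unbuddied lookup, in the spirit of the change that went into 4.14; simplified and of my own writing. In particular, it pretends that only the local CPU ever touches these lists, while the real z3fold still has to cope with frees coming from other CPUs.

```c
/* A sketch (mine) of a per-CPU unbuddied lookup, not the actual
 * z3fold code: the common-case search touches only local data and
 * therefore needs no pool-wide spinlock. */
#include <linux/percpu.h>
#include <linux/list.h>
#include <linux/spinlock.h>

#define NCHUNKS 64

struct toy_pool {
    struct list_head __percpu *unbuddied;  /* NCHUNKS lists per CPU */
    spinlock_t lock;                       /* still taken for page creation */
};

static struct list_head *toy_find_unbuddied(struct toy_pool *pool, int need)
{
    /* get_cpu_ptr() disables preemption, pinning us to this CPU's
     * array for the duration of the search. */
    struct list_head *lists = get_cpu_ptr(pool->unbuddied);
    struct list_head *entry = NULL;
    int n;

    for (n = need; n < NCHUNKS; n++) {
        if (!list_empty(&lists[n])) {
            entry = lists[n].next;     /* always take the first one */
            list_del_init(entry);
            break;
        }
    }
    put_cpu_ptr(pool->unbuddied);
    return entry;  /* NULL: fall back to creating a page under pool->lock */
}
```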
So yeah, we implemented this. We also had to check whether it had an adverse effect on the compression ratio, because the page selection does get worse: we only search the local CPU's lists. This came into 4.14-rc1, so you can try it out and see how it works for you; for us it actually worked quite okay. So, the darkest line is the new z3fold. We lost a little bit on the operation speed when the number of threads is not that big, obviously due to the resources spent on maintaining the per-CPU lists, but the scalability increased a lot.

So, z3fold evolution on fio random R/W: we increased the scalability at not such a big cost. And then we compare it with the two other rivals, the older kids, and we can see that, apart from a small number of threads, the performance of z3fold is actually better than that of the two others. So we're there, at least partially. And then, again, we need to check whether the compression ratio got worse or not. And yeah, the darker line is the new one, the lighter line is the old one; we don't see that much difference, so we're happy and we can jump to conclusions.

And yeah, the conclusion is, first of all, I'm going to be bold and say that z3fold delivers. It has better performance than zsmalloc; it has a comparable (well, worse, but not that much worse) compression ratio; it has better real-time capability; and it's a good fit for both zswap and zram.

We still need to do some more, of course. For instance, with the introduction of the per-CPU unbuddied lists, the z3fold header, which we put into the first chunk of the page, actually became bigger than one chunk, and now it normally requires two chunks; so maybe it makes sense to look into ways to optimize the header so that it fits into one chunk again. The other thing to do is to try making z3fold pages movable, to reduce the fragmentation within the kernel which is external to z3fold. So it's not the internal fragmentation that is fought against by z3fold; it's the external fragmentation that kernel compaction should be mitigating, but that is hampered because the pages are not movable. So we need to think about how to make them movable in the best way. But anyway, I believe that for a thing that has existed in the mainline for a year, it's quite a bit of progress.

And with that said, I want to thank the people who helped me with z3fold one way or the other. The first credit goes to Seth Jennings, whose zbud code I used heavily. Then Dan Streetman provided tons of comments, which were always very valuable, and basically he was the one who helped z3fold take off and run. And yeah, I want to thank a student whom I mentored; he ran a lot of performance measurements and was always helpful in pointing at things that we needed to optimize. I would also like to thank my wife, who provided a comfortable environment for me to work in, and maybe even my dog, because several optimization ideas came when I was walking with him. And of course, I want to thank you all for listening and for your attention. And I think we still have some time for questions, so I'd really like to take some questions and answer them from here. Yes, please.

It decompresses and pushes, so that is transparent to the system, to the swapping system: the swapping system does not expect to find a compressed page on the disk.
If there is a typo, I will fix it, thank you. Any other questions, please?

I know people report problems running zswap every now and then, because there are many setups, many configurations. For the zram one: I mean, the one that is in the recent kernels is not really used in any commercial product yet; but the one that was there in earlier kernels, before 4.8 I think, is used in several mobile devices which are selling quite well in the market.

Sorry? Well, the system does not crash; I mean, Android does not crash due to z3fold, so it is good enough. zram over z3fold is very well tested; zswap over z3fold may still have some problems we're not aware of.

Do we have slots? That is actually a very good question, thank you. We do not have slots as such. We bind the first object right after the header; we bind the last, the third, object to the end; and the middle object should start at the chunk following the one where the first object ends. But this is not a strict requirement; it is just how the compaction works. There are no requirements on the size of any object that we are allocating memory for: it can be one chunk, it can be many chunks, it can be 3K plus. And it can never be worse than zbud. In a pathological situation, when all the objects are slightly bigger than 2K, we will have the same problem as zbud; but in real life we have never ever come across this scenario, and the flexibility is better with z3fold. Usually there are objects that tend to compress badly and there are objects that compress in a very good way, so in real life we pretty much always ended up with one big, badly compressed object plus two small ones. Does that answer your question? You're welcome.

Any other questions? Okay, thank you very much. You're very welcome to ask me afterwards if you have any questions.