So, the next topic is swap abstraction, or native ZSwap. What is the problem here? The problem is that ZSwap, as currently implemented, is just an in-memory compressed cache for swap. It is very conveniently placed at the very last step before swapping a page out or in, so that the rest of the mm does not have to be aware of it. That is very convenient from an implementation point of view, but it also has a few problems. First, you cannot use ZSwap without having an actual swap file on disk. Second, if you do have ZSwap and a backing swap file, you end up wasting capacity, because every time a page lands in ZSwap, its slot on the swap disk still has to be reserved; that is swap capacity you have to keep around but never actually use. Third, if you are swapping out a page and it ends up in ZSwap, you execute code that you do not really need: you go through the swap slot and cluster management in the swap file code to figure out what to do, but then you do nothing with it and just compress the page. Fourth, reclaim is unaware of ZSwap. Reclaim today decides to swap a page out without knowing whether it will be compressed or go to a swap file, and in the longer term, when you factor in memory tiering and ZSwap versus a faster far-memory tier, it would be nice if reclaim knew which is which so it can make informed decisions about where to put pages.

So who cares about all this? Google cares, obviously, because we have been using ZSwap without backing swap files in our fleet for over a decade now, so we can say it is a valid use case from our side. There is some interest from Chrome OS and Android; I do not know if it will turn into an actual use case, but there is interest. Meta is interested as far as I know, especially regarding the capacity lost on swap files when ZSwap is used. There was also a mention on the mailing list of setups similar to Google's, where there is a swap file but it is never meant to actually be used; it is only there to allow ZSwap to be used.

The proposal changed after a lot of discussion on the mailing list. The simplest thing we can do in the short term is to introduce a simple indirection layer in the form of this xarray. Instead of putting the swap entry directly in the page tables or the shmem page cache, we put an index into this xarray there, and we also use it to index the swap cache. The xarray then tells us which swap file we are using, and for ZSwap we use what I would call a virtual swap file, basically swap infrastructure that represents ZSwap while keeping the implementation as close as possible to what we have today. Writeback would then move outside of ZSwap, which is already an ongoing effort, and become just moving entries from one swap file to another. So this is the proposed short-term solution. I am not very fond of it; I would rather go to the medium-term solution, which I think might as well do the nicer thing.
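As a rough illustration of that short-term indirection idea, here is a small kernel-style sketch; it is not taken from any posted patches, and names like swap_id_map and swap_id_alloc are made up for this example. The assumption is that everything that stores a swp_entry_t today (page tables, the shmem page cache, the swap cache index) would instead store a small ID allocated from an xarray:

/*
 * Sketch only. The ID stored in the PTE or shmem page cache is resolved
 * through this xarray to the real swp_entry_t, which may belong to a swap
 * file on disk or to the "virtual" swap file that stands in for ZSwap.
 * Writeback from ZSwap to disk then only has to update the xarray entry;
 * page tables and the shmem page cache are left alone.
 */
#include <linux/xarray.h>
#include <linux/swap.h>

static DEFINE_XARRAY_ALLOC(swap_id_map);

/* Reserve an ID for a newly allocated swap slot (real or virtual). */
static int swap_id_alloc(swp_entry_t entry, u32 *id)
{
	/*
	 * swp_entry_t is an unsigned long under the hood, so it can be stored
	 * as an xarray value entry (ignoring the tag bit the xarray reserves).
	 * GFP flags and error handling are simplified for the sketch.
	 */
	return xa_alloc(&swap_id_map, id, xa_mk_value(entry.val),
			xa_limit_31b, GFP_KERNEL);
}

/* Resolve the ID found in a PTE or in the shmem page cache. */
static swp_entry_t swap_id_lookup(u32 id)
{
	void *entry = xa_load(&swap_id_map, id);

	/* Sketch: a real implementation would handle a missing ID. */
	return (swp_entry_t){ .val = entry ? xa_to_value(entry) : 0 };
}

/* Writeback: retarget the ID from the ZSwap "virtual" slot to a slot on disk. */
static void swap_id_writeback(u32 id, swp_entry_t disk_entry)
{
	xa_store(&swap_id_map, id, xa_mk_value(disk_entry.val), GFP_KERNEL);
}

A real series would of course also have to deal with locking, freeing IDs on swap-in, and the swap cache, but the shape of the indirection is the same.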
So fundamentally we have the same thing, but instead of using a hack like a virtual swap file, or separate swap infrastructure for ZSwap, we actually abstract away the swap operations. We have read page, write page, duplicate, free, all the things we do today with swap files, which are already implemented for swap files, and then we also implement the missing ones for ZSwap. ZSwap already has the equivalent of read page, write page, and alloc; we just need to implement things like swap counting in ZSwap. This xarray then contains an encoded swap entry: using the lower bits of the pointer, like Matthew likes to do, it is either a swap entry or a pointer to a ZSwap entry. We go through the xarray and we know that this page is in a swap file, or that this page is in ZSwap, and we act on it accordingly. When writeback happens, we just update the xarray; the page tables remain the same, the shmem page cache remains the same, and everyone is happy. This also enables other optimizations in the future, like swapoff: you no longer need to walk the page tables to do swapoff, you can just change things in this xarray directly. You also have the option to drop the swap entry once you swap a page in, because the page tables no longer hold the actual swap entry, only an index into this xarray. So this is the medium-term idea that I am more in favor of.

A more long-term idea, which was actually the original proposal, is... can you actually go back? Sorry. Yeah. There is already a red-black tree inside ZSwap that maps swap entries to ZSwap entries, so this xarray is basically pulling that outside of ZSwap and generalizing it to other swap files. The writeback logic also already lives in ZSwap, so we would be pulling that out of ZSwap as well. So this generalizes some of the things ZSwap does today to all swap files and lets ZSwap handle only what it should handle, which is compressing memory. Next slide, please.

An even longer-term idea is to go one step further with the abstraction: instead of the xarray holding a swap entry or a ZSwap entry pointer, it holds a swap descriptor. The swap descriptor can hold the same encoded entry, which can be a swap slot in a swap file, a ZSwap entry, or whatever we need in the future. We can also store the swap cache directly in it, as just a folio pointer, nothing more than that, and we can put the swap count in there too. What this buys us is a common place to implement swap counting, the swap cache, and so on, independent of the underlying implementation of the swapping backend. It could be ZSwap, a swap file, or something else in the future; it does not matter. The core swapping code acts on a swap descriptor that is independent of the actual implementation. This buys us a cleaner abstraction and the potential for a lot of code cleanups: we could get rid of things like the swap address spaces used for the swap cache, and the current swap counting code, which is fairly complex. But obviously there are problems that come with this, in terms of memory overhead: we have to pay the price of a swap descriptor for every swapped-out page, which is about 0.6 to 0.8% of the size of the memory that is actually in swap.
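As a purely illustrative sketch of that swap descriptor, where the struct name and fields are invented for the example rather than taken from any posted series, the xarray would map the index stored in the page tables and the shmem page cache to something like this, and the core swap code would only ever touch the descriptor:

struct zswap_entry;	/* opaque here: ZSwap's own per-object metadata */

/* Sketch of a possible per-swapped-out-folio descriptor. */
struct swp_desc {
	union {
		swp_entry_t slot;		/* slot in a swap file on disk */
		struct zswap_entry *zentry;	/* compressed copy in ZSwap */
	};
	struct folio *folio;		/* swap cache: the folio, if (still) in memory */
	unsigned int swap_count;	/* would replace the per-type swap_map counting */
	unsigned int flags;		/* e.g. which union member is valid, writeback state */
};

The medium-term variant described above avoids this per-entry struct by storing the union directly in the xarray and using the low bits of the stored word to tell a swap slot from a ZSwap entry pointer, which is why its overhead is only the xarray nodes themselves.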
There is also cluster readahead for rotating disks: we have a swap entry and we want to swap in the next ten swap entries, and given only those entries, we would need to get back to their swap descriptors to operate on them, so we would need a reverse mapping, which is even worse. And generally this would be a larger surgery on the code; you would be changing the swap counting implementation and the swap cache implementation. So this is something we can keep in the back of our minds for the long term, I guess. Chris has another long-term idea, but before we move on, are there any questions? Matthew?

So the cluster readahead thing that we do for spinning media, that just needs to die.

I fully agree.

And I think folios are how it dies, because we need to get to the point where we are not writing out individual pages, we are writing out entire folios. That naturally gives us contiguous-logical being the same thing as contiguous-physical, and at that point we do not need this reverse-mapping trick; we can just do logical readahead. Honestly, that is what we should be doing anyway, everything gets more efficient, and we can delete a whole bunch of crufty code that was introduced for spinning rust.

Yeah. And I would just add that people with rotating or very slow disks avoid using swap at all costs, so we are probably optimizing for somebody who is not really interested in this at all.

I just want to note that cluster readahead is not currently used only for rotating disks; it is also used for shmem in general, even on non-rotating disks. So we may need to move some code around to get around that.

Isn't that still basically legacy, where nobody has done the optimization to get rid of it? Because tmpfs should be using larger allocations, right?

So I am saying that shmem should get the same treatment. Yeah, I fully agree; I am just saying that today we are not doing it only for rotating disks. But yeah. Any other questions before we move on to the other long-term approach? Yes, please.

Actually, it is not about the problem you are addressing, but I have a question about ZSwap in relation to CXL memory. I think the underlying ideas of ZSwap and CXL memory run counter to each other: ZSwap assumes that CPU is relatively plentiful but memory is limited, while with CXL it is different and memory becomes more plentiful. We understand that Google is using ZSwap, but if you adopt CXL memory, then you have more memory, and maybe swap does not have to be used. What if you could use the CXL memory without compression?

You are basically saying that if we do have a reliable CXL far-memory solution, do we need ZSwap at all?

Yeah. Actually, we are doing some work on CXL memory, and we made a so-called CXL-Swap, which swaps to CXL memory but without compression. So, do you think this makes sense for your use case?

This is a question we are actively looking into. I do not think we currently have an answer, as far as I know; if anyone in the room has a better answer, please correct me. It is something we are looking into, but I do not think we have a definitive answer of yes, this will work for us and we do not need ZSwap anymore, or no, this will not work for us. I do not think we have the answer at hand.

Just a comment: I think it is a good idea, but on the other hand, the one benefit of CXL is cache-line access. If we reduce it to being a swap device, then how is that different from other RDMA-based solutions, right?
So I think we are not fully utilizing CXL's potential if we only use it as swap. You had a point? No, no, no. Thank you.

Right, so, yeah, that is an open question. But like we said, I think there are more applications for CXL, and we also do not have an answer today for what we want to do when we have a reliable far-memory solution.

And I would just say that you do not have to use swap, right? But if people are planning to use swap as a form of reclaim, it is probably good to look into some way to make it scale better long-term for a new world, rather than for the rotating disks it is heavily optimized for today. So I think something needs to be done in that direction.

Right, I fully agree. My main concern was that someone would say, no, we still need cluster readahead. If no one cares, then, yeah, by all means.

One more thing I would like to add is that we are not sure whether CXL memory will be available for client devices, and our client devices, namely Android and Chrome OS, rely heavily on this model with ZRAM. So if CXL memory becomes available for those devices, that would be great.

Yeah, one question I have before I hand over to Chris and to the room. With something like this, for ZSwap there is no extra overhead; we can actually save a bit of memory. For swap files on disk, we pay the extra overhead of having the xarray in the way, which is 8 bytes per entry in the best case, I am guessing, and with the swap descriptor it is even more, something like 20 or 30 bytes per swap entry, which is the 0.6 to 0.8%. So how bad do people think this is, in terms of overhead, for the sake of ZSwap? Does anyone have thoughts?

It seems like a small enough overhead that I am not willing to squawk about it. The only concern is that if you are doing memory allocation in the swap path, you need to be very careful. But I am sure you know that.

Yeah, yeah. There have been ideas; if we use a slab cache for them, then slab already has caching for it, and we were also talking about whether there is a way to, I do not want to divert the talk, but since this allocation is in the reclaim path, we would want to prefetch more eagerly. But I do not want to go there.

Yes, that is what swap over NFS, or NBD in general, is all about, and that is really tricky code. And regarding the overhead, we used to have that kind of swap tracking in memcg v1, once upon a time, and the overhead...

We still do.

We don't have that. Which version? Which kernel version?

Yeah, I mean, it was an old thing, but even back then the overhead was considered quite high by some people, so, for example, SLES did not have that code enabled by default. But times are changing, so maybe that overhead can be digested much better these days; I don't know. And is that overhead really per page, or per folio?

That is a very good question. It would be per folio, as far as I can tell: the swap cache should be per folio, and the swap count, if we are swapping out the whole folio, should be per folio as well. So, yeah, once Matthew is done with this, the overhead will be lower. Here is one more spoiler for tomorrow's talk about anonymous folios. Anyway, I will hand over to Chris now.

Hi. The other idea I have is basically a more VFS-like implementation for swap, which allows individual swap devices to have their own implementation of how they manage the free slots, how they manage the swap count, et cetera.
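To make the VFS-like idea a bit more concrete, here is a rough sketch of a per-backend operations table, loosely analogous to how filesystems provide their own address_space_operations; the struct and member names are invented for this example and do not come from any posted patches:

/* Each swap backend supplies its own implementation of these operations. */
struct swap_backend_ops {
	int	(*alloc_slot)(struct swap_info_struct *si, swp_entry_t *entry);
	void	(*free_slot)(struct swap_info_struct *si, swp_entry_t entry);
	int	(*read_folio)(struct swap_info_struct *si, swp_entry_t entry,
			      struct folio *folio);
	int	(*write_folio)(struct swap_info_struct *si, swp_entry_t entry,
			      struct folio *folio);
	int	(*dup)(struct swap_info_struct *si, swp_entry_t entry);
};

A disk-backed swap file would implement these with its existing slot and cluster management and bio submission, while a ZSwap backend would compress into its pools and do its own accounting; only the ZSwap backend would then need any extra indirection, which is the point made next.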
And then we can basically assign one such backend to ZSwap, and the other swap files do not have to pay for this xarray in front of them, because if you turn that on, it impacts every swap file. The idea is mostly: can we narrow this down so that it only impacts ZSwap and not the others, and let the others keep their current behavior, if somebody prefers that?

I am just worried that somebody will say: since we made this change, my per-swap-entry memory usage goes up after I swap something out. And we know that in these use cases the machine will constantly have some swap; Google tries to aim for around 20% of memory in swap, and Android is roughly the same, around 20%. So how much memory does that correspond to? The little bit we would end up saving by avoiding this change might actually matter for some people. Or at least I am not brave enough to assume that this memory does not really matter; I want to retain the possibility of getting it back with some kind of option. And pretty much that is it. Any questions?

Isn't that a question that Google can answer itself? Ask the Android people if 0.6% is too much for them. Don't you guys have the data?

Yeah, well, we have a formula to calculate how much memory translates into how much engineering time, and I have not done the calculation, but I am sure it will not be zero engineering, basically.

But I mean, to decide between the VFS approach and the other one, you are asking the room whether these overheads are too much to pay for that direction, so we should go in this direction instead. Is that the question you are asking?

Yeah: provide an option so that we do not have to pay that per-swapped-out-page cost for entries that otherwise stay resident in memory; after you swap pages out, you still have to keep some metadata around. Another idea is that maybe you can do a second-level thing: some of this per-entry metadata gets written through to the SSD, and then, kind of like a file system where you load the inode first and then load the corresponding block, the same can be done for swap. You basically do two block reads in order to locate the page, and then you can save some of the memory that you would otherwise always keep in the system.

So, I remember people, including me, complaining about the swap code for a long time, and it has always been: can we just rewrite it? Rewrite it in a VFS-like style. If this last proposal actually takes us closer to that than the other one, then I would most certainly prefer that one.

Yeah. I mean, you would have to rewrite everything again; we would still be rewriting everything, apparently. Maybe that is a good thing. Yeah, rewriting is job security.

One thing I would like to point out is that, going through the slides and ignoring the first proposal, this is just about how far people are willing to take it. This one is minimal overhead but the least abstraction possible; this one is the clean abstraction but with more overhead, obviously; and this one is a way to remedy the overhead of the previous ones. So it is just about how much overhead and how much complexity the community is willing to pay for this. It is also important to point out that in this case, if ZSwap is not enabled, this xarray can just disappear.
If ZSwap is not configured or not enabled, then the swap index can just be the swap entry; we do not need to pay the overhead if ZSwap is not used on the system. If it is all swap files, we can just do what we do today and use the swap entries directly.

Yeah, but if you come to LSF and show people the pony, David and others are going to say: we want the pony. If you show that doing all this work gets us the best thing, that is what everybody wants. No one is going to tell you to put in some technical debt instead and let you get away with that kind of thing. If you show us the best thing at the end, we are going to ask for the best thing.

Right, but the more you want, the more you pay; well, you have to do the work. Right, but you also have to pay in terms of code complexity, right? Nothing is free. So if you want to decrease the overhead, you have to pay in code complexity and maintenance as well.

As it looks, if you go to the VFS style, in the end it will be less code complexity. There will be some transition, but in the end it will be written the way it should have been. So that is the pony.

Right. But yeah, I guess we should have thought about engineering time before we came here; definitely something to think about for the next talk. And how much are you willing to contribute towards it? Can we switch? Okay, I guess if no one else has more questions, we can give everyone five more minutes of break. Anyone? Calling once, calling twice. Okay, five more minutes.