So, first of all, thanks for having me, and special thanks to Michael for extending my session from half an hour to an hour; I'm going to talk to you for at least an hour about the multi-generational LRU. My name is Yu Zhao, I'm from Google, and I've been working on the multi-generational LRU for about a couple of years. So let's start. I'm going to touch on some fundamental stuff first; probably not that interesting, but I think it's important. The fundamental role of RAM: Turing machines do not need RAM. The role of RAM, basically, is to cache information for immediate use. It has the same role as the CPU cache, just at lower speed and lower cost. An example: a smartphone without RAM, if it could work at all, would still be faster than the early mainframes. Here is a picture from Wikipedia; this is a Turing machine, purely mechanical. So why do we care about RAM efficiency? Our observations are that the speed gap between RAM and storage is likely to remain large, that RAM prices are unlikely to drop drastically in the foreseeable future, and that the memory used by complex workloads is going to keep growing. Our prediction, of course, is that RAM will continue to play a major role in performance. Here I have a distribution that illustrates our current memory utilization dilemma. For a given workload, at different times it uses different amounts of memory: today it may use 100 MB, tomorrow probably 200 MB. The y axis is the probability of a given memory usage, and the x axis is the usage itself, against the configured capacity. For such a workload we definitely have to over-provision memory: if on average it requires 200 MB, we probably have to configure 300 MB or even more. The green part is the profit margin. In the yellow part we hit a wall and get long latencies, and when we reach the red part we get OOM chaos and livelocks. That's why we care about RAM efficiency: more efficiency means a higher profit margin. So, two fundamental problems of RAM. The first problem, or challenge, is which objects to cache; that's why we need an LRU, to decide which objects to cache. The second fundamental problem is how to cache more objects; basically we have external and internal limitations, and that's also why we use zram and compaction to reduce internal fragmentation and external fragmentation. Here I have two diagrams. The one on the top illustrates how the LRU works: basically we want to keep the reddish objects, the hot ones, in RAM, and the bluish objects, the cold ones, on disk or slower storage. The picture on the bottom shows fragmentation. Look at this single page: user space requests a 32-byte allocation, but the kernel can't hand out just 32 bytes; it has to give the program a whole 4K page, so the unused remainder is wasted. That's internal fragmentation. If you look at the entire RAM, different allocations have different orders and different hotness and coldness; that's external fragmentation, and it's also why, when we try to allocate huge pages, the allocation sometimes fails.
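Just to make the internal-fragmentation arithmetic concrete, here is a minimal user-space sketch (mine, not from the talk), using the 32-byte request from the example above:

    #include <stdio.h>

    #define PAGE_SIZE 4096UL

    /* Internal fragmentation: the kernel can only map whole pages, so a
     * tiny request still pins a full page and the remainder is wasted. */
    int main(void)
    {
        unsigned long request = 32;        /* bytes actually needed */
        unsigned long backing = PAGE_SIZE; /* smallest mappable unit */

        printf("requested %lu B, backed by %lu B, wasted %lu B (%.1f%%)\n",
               request, backing, backing - request,
               100.0 * (backing - request) / backing);
        return 0;
    }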
So MGLRU: there are three major goals, simplicity, flexibility, and performance. OK, probably I should take a step back first. What exactly is MGLRU? In very simple, probably crude, words: we partition memory into more buckets. With the active/inactive LRU we have two buckets; the bluish colors mean cold memory and the reddish colors mean hot memory, so here on the left we have one reddish-purple bucket and one bluish-purple bucket. That's how the active/inactive LRU generally works. On the right side we have five buckets: we partition the memory into more segments, and each bucket contains pages of a different hotness or coldness. That's basically how MGLRU works. OK, questions so far? OK, cool. Now we're going to look at MGLRU internals. There are three components: generations, page table walks, and feedback loops. This big picture shows how MGLRU generally works. Think of it as a clock with two hands. The first hand, the red one, we call aging. This hand goes through the mm lists shown here to gather the accessed bits from page tables; those are the page table walks. The other hand, the black one, I call eviction. This one walks the lists of physical pages to evict pages, and at the same time, for each page, it also walks the rmap to check the accessed bit again. So basically, we check the accessed bit at least twice. So where are the generations? Here we have grouped physical pages into these three dotted squares; those are actually the generations. The first generation, the oldest one, was born at T0. The second one was previously the youngest, but now it's in the middle; it was born at T2. And the youngest generation was born at T3. For each physical page we have a generation number counter in the page flags, or folio flags; it's called folio now, so I should say struct folio, not struct page here. This number indicates which generation the page belongs to. On the aging path, if a page has been accessed, we update this generation number to the youngest: basically we tag the page, saying this is a hot page, don't evict it. During aging we don't actually sort the page, we only tag it, so after tagging, the page sits in the wrong bucket. On the eviction path, we then sort it. We say: OK, this bucket is number 1, but the generation counter of this page says it actually belongs to generation 2, so we move it to bucket number 2. That's when we sort the page. And after we've sorted all the pages that have been accessed, the pages left over are the cold pages, and those are the ones we evict. That's generally how it works. Questions?
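Here is a toy user-space model of that tag-then-sort flow, under my own simplifications (singly linked buckets, small fixed generation indices); the kernel's implementation differs in many details:

    #include <stddef.h>

    #define NR_GENS 4

    struct toy_page {
        int gen;                /* generation tag, like the counter in folio flags */
        struct toy_page *next;
    };

    struct toy_lruvec {
        int youngest, oldest;              /* the two clock hands */
        struct toy_page *bucket[NR_GENS];  /* one list per generation */
    };

    static void bucket_add(struct toy_lruvec *v, struct toy_page *p)
    {
        p->next = v->bucket[p->gen];
        v->bucket[p->gen] = p;
    }

    /* Aging: a page found accessed via its page table is only retagged;
     * no list movement happens here. */
    static void age_page(struct toy_lruvec *v, struct toy_page *p)
    {
        p->gen = v->youngest;
    }

    /* Eviction: pop pages off the oldest bucket; a page whose tag was
     * bumped in the meantime is sorted into its tagged bucket instead of
     * being evicted. Returns a cold page to reclaim, or NULL. */
    static struct toy_page *evict_one(struct toy_lruvec *v)
    {
        struct toy_page *p;

        while ((p = v->bucket[v->oldest]) != NULL) {
            v->bucket[v->oldest] = p->next;
            if (p->gen == v->oldest)
                return p;       /* still cold: reclaim it */
            bucket_add(v, p);   /* hot: the deferred sort */
        }
        return NULL;
    }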
OK, so here is a question: why do we want to walk page tables instead of just the rmap? Because when we walk the rmap, we walk each page on the LRU list, and those are physical pages; they are not necessarily mapped by the same process, and usually they are not mapped in the same VMA. Say we have two pages on the LRU list from two different processes. To walk the rmap of the first page, we have to go into the page tables of process A. The next page is mapped in a different process, so we have to switch to another set of page tables. That costs a lot of cache misses, because these pages are not mapped consecutively within a single virtual address space. So that's why we want to walk page tables. But when we walk page tables, we can see a lot of empty page tables, right? Page tables that are not freed but have nothing interesting in them. How do we solve this problem? That's the next slide: we use Bloom filters to remember dense, hot page tables. If we have a page table with only a few pages mapped in it, we definitely don't want to walk it, so we don't store it in the Bloom filter. If we have another page table that maps a few pages from NUMA node A, a few pages from NUMA node B, and so on and so forth, we also don't want to add that page table to the Bloom filter. Why? Because the reclaim domain is per node, per memcg, right? When we reclaim, we are only concerned with a single node at a given time, so if a page table contains a lot of misplaced pages, adding it would mean scanning a lot of pages that are not eligible for reclaim. And why do we have two Bloom filters here? I'm going to explain that later, OK? Why is this a feedback loop? Let's say the first time we scan this page table, we decide it's not that interesting, so we don't insert it into the Bloom filter. Then how do we make sure that, after the process that owns this page table has mapped many pages into it, has updated the entries in it, we don't miss this page table? We have to find a way to notify the page table walker that the owner of this page table has updated it: previously it was sparse and uninteresting, but now it has become dense and hot, so you'd better walk it. We make this discovery when we walk the rmap, because eventually we have to walk the rmap anyway, since we need to unmap the page; that's something we can't escape. So when we walk the rmap, we say: this page table looks very interesting, it has a lot of relevant pages; next time you do a page table walk, you should just cover it. The rmap walk then adds a pointer to this page table into the Bloom filter. When we walk page tables again, we test whether the pointer is in the Bloom filter; if it is, the page table walk covers that page table. So that's basically why it's a feedback loop. The second feedback loop is the PID controller. In addition to aging we also have eviction, right? After we evict a page, we want to know whether it refaults. If it does refault, what does that tell us? It means we probably made a wrong decision, so we need to self-correct. How do we self-correct? That's where the PID controller fits in. There are some details I'm going to skip, because I think we might not have enough time, but if you have questions, please feel free to stop me at any time.
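Here is a minimal sketch of that feedback loop, keyed on page-table addresses; the hash functions, sizes, and names are my own inventions, and my guess at the "why two filters" question left open above is that one filter can be cleared and refilled per aging pass while the other is consulted, so stale entries expire:

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOOM_BITS 32768u

    struct toy_bloom {
        uint8_t bits[BLOOM_BITS / 8];
    };

    static uint32_t hash1(uintptr_t pt) { return (uint32_t)(pt * 2654435761u) % BLOOM_BITS; }
    static uint32_t hash2(uintptr_t pt) { return (uint32_t)((pt >> 4) * 0x9E3779B9u) % BLOOM_BITS; }

    static void set_bit_(struct toy_bloom *b, uint32_t h)  { b->bits[h / 8] |= (uint8_t)(1u << (h % 8)); }
    static bool test_bit_(struct toy_bloom *b, uint32_t h) { return b->bits[h / 8] & (1u << (h % 8)); }

    /* rmap walk: this page table turned out dense and node-local, so
     * remember it for the next page table walk. */
    static void bloom_remember(struct toy_bloom *b, uintptr_t pt)
    {
        set_bit_(b, hash1(pt));
        set_bit_(b, hash2(pt));
    }

    /* page table walk: worth descending into this page table? A false
     * positive only costs an extra scan; a false negative cannot happen
     * for anything that was inserted. */
    static bool bloom_maybe_dense(struct toy_bloom *b, uintptr_t pt)
    {
        return test_bit_(b, hash1(pt)) && test_bit_(b, hash2(pt));
    }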
So now I'm going to move on to the next topic. One of our goals is simplicity; apparently I misspoke about simplicity, since I put "simplified" here. OK, go ahead.
Yeah, I had a question; can you go back one more slide? In the bottom left, the promotion. It's just a question of understanding. If you have, say, five generations, and you're aging a page that's in the oldest generation, and you see it's referenced, do you promote it one generation up, or do you put it all the way into the youngest?
So that depends on how the page was accessed. If it was accessed through page tables, it will be promoted to the youngest generation: we skip everything between the current generation and the youngest and go directly to the youngest. If it was accessed through a file descriptor, it will only be moved to the next generation. This is based on the cost of evicting the page: evicting a mapped page has a higher cost, because we have to unmap it, flush the TLB, and do all that, right? Whereas if a page is not mapped and it's clean, we can simply drop it. Also, the accessed bit itself is not that precise. When we see the accessed bit set, the page could have been accessed a thousand times, even a million times. But for a page accessed through a file descriptor, one access is one access, because we have a function called mark_page_accessed(): every time you go through fopen and fread, the read path calls mark_page_accessed(). That's definitely one access, not more. Yes, yes, right.
So I think yesterday we talked about the XFS writepage part, right? Filesystem people don't like memory management, or page reclaim, calling into writepage. Basically, when memory pressure kicks in and we see a dirty page for the first time, we mark it with PG_reclaim; we say, at this point there are probably other clean pages we could drop, so don't worry about it. The next time we see a dirty page marked with PG_reclaim, kswapd will try to write it, and that triggers a callback into the filesystem, ->writepage I think, not ->writepages. And this is out-of-band writing, right? The filesystem will think: why do you want me to write a single page? That's a lot of work, and it's a lot faster to write a bunch of pages together. With MGLRU, we don't actually write any pages from reclaim. If we see that a page is dirty, we simply mark it and move it to the next generation. Writeback is then triggered when the dirty thresholds are hit: we have the dirty ratio to control how many pages may be dirty and how many must be cleaned, right? And MGLRU helps maintain a healthy dirty ratio by calling set_page_dirty(): when it walks page tables and sees the dirty bit set, it just calls set_page_dirty(). This way, even if writeback doesn't happen in time, we can still maintain a healthy dirty ratio. So basically, reclaim walks the rmap to clear the dirty bit; but instead of clearing the dirty bit, we don't actually clear it, and when we see it, we just call set_page_dirty().
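A sketch of that asymmetric promotion policy, with invented names and a simple integer generation numbering (youngest = largest); it only models the decision just described, nothing more:

    /* How far to promote an accessed page, per the policy above. */
    enum access_kind {
        ACCESS_PAGE_TABLE,      /* found via a set accessed bit */
        ACCESS_FILE_DESCRIPTOR, /* found via mark_page_accessed() */
    };

    static int promote(int gen, int youngest, enum access_kind how)
    {
        switch (how) {
        case ACCESS_PAGE_TABLE:
            /* Unmapping is expensive (unmap, TLB flush, refault),
             * and one set accessed bit may stand for a million
             * accesses: jump straight to the youngest generation. */
            return youngest;
        case ACCESS_FILE_DESCRIPTOR:
            /* A buffered read is exactly one access, and a clean
             * unmapped page is cheap to drop: move up one step. */
            return gen + 1 < youngest ? gen + 1 : youngest;
        }
        return gen;
    }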
Yes? Go ahead.
Yeah, about the generations. Where I'm coming from is how garbage collection uses generations, right? Generational garbage collection algorithms: what they do is usually based on the idea that the longer something has been in use, the more likely it is to continue being used. So everything starts in the oldest generation, and when we find it referenced, we promote it up by one generation, and if it's referenced again, we promote it another step. So the longer it's been around, the less likely it becomes to be reclaimed, right? But if mapped pages start out at the youngest, and whenever they get referenced in the lower generations you promote them all the way back to the youngest, what do the generations buy you? Say one configuration uses four generations and the other uses five: if everything is always promoted to the youngest, what does that extra generation buy?
Yeah, very good question, and I'll answer it right after I cover one more thing. In terms of garbage collection, we actually reuse a similar idea here, which is: when we run aging, everything slows down, because we have to walk page tables, and there's a lot of work to do. But compared with suffering a little for a long time, we suffer a lot more for a very short period of time: we batch a lot of work, finish it in a very short period, and then everything is good to go and you can reclaim. So, back to your question, which is really a very good one, and it's something I actually want to explain: why do we want generations at all? The first goal here, again, is simplicity. So why do we promote a page accessed through a page table directly to the youngest generation, and then what's the point of having multiple generations? I'm not sure if you noticed, but when I posted the patch set, I called it the multi-generational LRU framework. Why did I call it a framework? The ultimate goal is to allow everybody to extend it. This is based on my personal belief that one size fits all does not work anymore. We have servers, data centers, VMs, laptops, desktop computers, and smartphones, right? Even for a single platform, say smartphones, memory configurations range from 16 GB down to maybe 2 GB. So personally, I don't think one size fits all is going to work. So let me finish answering Johannes's question. We have generations, and with BPF, users can customize which generation a page goes into. What you suggested definitely makes a lot of sense for a lot of workloads, and I do intend to make that work, but if we put it into the existing kernel, that means additional work. So what I'm going to do is give you a BPF interface. It contains a PID, the process ID, and a virtual address; if you know the VMA layout, you know which VMA this page is from. Plus the type of the page, whether it's an anon page or a file page; the type of the fault, whether it's a refault or a fresh fault; and some additional information, probably not that trivial. Based on this information, you can choose which generation the page goes into. With this, we can actually emulate all the different LRU algorithms I know of; that's the ultimate goal. This idea isn't mine; I didn't come up with it. I learned it from networking congestion control. They added something for TCP: when a link gets congested, the algorithm decides the backoff time, when to retry. That's congestion control. They have seven or eight different congestion control algorithms, heuristics, in kernel space, implemented in C, and now they are moving those heuristics into user space, into BPF. We can do this in two ways. The first one is BPF struct_ops, which is more complex. The easier way is just a hook on a function: I have this function, you can attach a BPF program to override it, and whatever LRU algorithm you come up with is your private IP; you don't have to share it with me, and you can maintain your competitive edge. That's the ultimate goal.
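Since no such interface exists yet, here is a purely hypothetical sketch of the hook's shape, with invented names, carrying the fields just listed; the example policy shows it could express the GC-style ladder from the question above:

    #include <stdbool.h>

    /* Hypothetical context for a per-fault policy hook; every name here
     * is invented for illustration. */
    struct lru_gen_hook_ctx {
        int pid;                /* faulting process */
        unsigned long addr;     /* virtual address, identifies the VMA */
        bool is_anon;           /* anon page vs file page */
        bool is_refault;        /* refault vs fresh fault */
        int youngest;           /* number of the youngest generation */
    };

    /* A pluggable policy returns the generation a page should start in.
     * This toy version emulates a GC-style ladder: pages start old and
     * climb one rung per refault. */
    static int pick_generation(const struct lru_gen_hook_ctx *ctx, int cur_gen)
    {
        if (ctx->is_refault)
            return cur_gen < ctx->youngest ? cur_gen + 1 : ctx->youngest;
        return 0;               /* fresh faults start in the oldest */
    }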
OK, just so I understand: the patches as they are now introduce the generations, and the unmapped file pages use fine-grained generation control.
Right.
So the policy for mapped pages doesn't use them in a fine-grained way, but there are BPF hooks for future extension. OK, I think I might have missed that when I was reading the patches.
No, actually I didn't mention it on the mailing list; there are a lot of other things I didn't mention on the mailing list.
That would be really useful to have, because I was thinking: if pages are promoted right away to the youngest, why all the generations? Having that in the changelogs or in the documentation would be really useful.
OK. I was trying to limit our discussion to a manageable scope; that's why I didn't talk about this. But now I think that was a mistake on my part; I should have explained it better. More questions? OK, I'm going to move on to the next topic. I think I did touch on this a little bit during our last meeting: THP internal fragmentation detection. Also, I think yesterday Yang asked a question, or probably suggested something, about proactive scanning to detect hardware failures, poisoned pages; sorry, I don't remember your exact question. This is a similar idea. How much time do I have? OK, I have exactly half an hour. When we use huge pages, we usually have two problems. One is external fragmentation preventing us from allocating huge pages. The second one is internal fragmentation. We have THP set to never, madvise, or always, right? If your user space doesn't support madvise, the only options left are never and always. A lot of user space applications, a lot of workloads, for example memcached and Redis, suggest that users turn THP off. Why? First, because THPs are not that easy to allocate, because of external fragmentation. The second problem, which is actually worse, is the internal fragmentation after we allocate the huge page: a workload might use only 4K within a 2 MB huge page, only 4K of the whole 2 MB region, so we basically waste 511 4K pages. And that's not the worst part. The worst part is that if that one 4K page is hot, we only have one accessed bit, in the PMD entry pointing to this huge page, so the accessed bit in that PMD entry will always be set. We consider the entire huge page hot, it will always be at the head of the LRU list, and we will never reclaim it; we've basically wasted 511 4K pages. This extension is going to solve that problem. When we start off, we don't map a compound page with a PMD; instead we map the compound page with PTEs. That means we have 512 consecutive PTEs pointing to this compound page. It is still a compound page, OK? Not 512 separate 4K pages. Then we watch. If the workload uses most of the space within this compound page, good: we just remove the PTE table and swap in a PMD entry, and it becomes a true huge page. If not, say the workload uses only one 4K page but that page is hot, then when the system is under memory pressure, MGLRU will look at this PTE table, at the 512 PTEs, and say: actually only one page is hot, and the rest are not just cold, they have never been touched, because the dirty bits are zero. Then we can just split this compound page, free the 511 unused 4K pages, and keep only the hot 4K page in memory.
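A sketch of that decision, with made-up helpers and thresholds; the accessed/dirty bit positions happen to match x86 PTEs, but treat the whole thing as an illustration of the idea rather than the planned implementation:

    #include <stdbool.h>

    #define PTES_PER_PMD 512

    struct pte_stats {
        int accessed, dirty;
    };

    /* Count how much of the PTE-mapped compound page was ever used. */
    static void scan_ptes(const unsigned long *ptes, struct pte_stats *s)
    {
        enum { PTE_ACCESSED = 1ul << 5, PTE_DIRTY = 1ul << 6 };

        for (int i = 0; i < PTES_PER_PMD; i++) {
            s->accessed += !!(ptes[i] & PTE_ACCESSED);
            s->dirty    += !!(ptes[i] & PTE_DIRTY);
        }
    }

    enum thp_action { THP_COLLAPSE, THP_SPLIT, THP_LEAVE };

    static enum thp_action judge(const struct pte_stats *s, bool mem_pressure)
    {
        if (s->dirty > PTES_PER_PMD / 2)
            return THP_COLLAPSE;    /* mostly used: swap in a PMD */
        if (mem_pressure && s->accessed <= 1)
            return THP_SPLIT;       /* keep the hot 4K, free the rest */
        return THP_LEAVE;           /* not enough signal yet */
    }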
Does that make sense?
Yeah, I just want to ask for clarification. Is this something that you envision happening, or are we talking about code that already exists and that you would like to get merged with the rest of the thing?
It won't all happen within two years, that's for sure, and it's very likely to happen within four years; so basically between two and four years.
OK. I think it makes some sense to discuss those things, but one of the things we should definitely spend some time on is what you've got currently and aim to get merged. I'm not really sure where the general consensus within the group is on how we should proceed. It's really great that you have these extensions, so we can see how much potential there is in there; that's definitely valuable. But can you reserve something like 15 minutes to discuss what we've got right now as well?
OK, so let me go back one page. The first extension is going to happen, if, let's say, MGLRU lands in 5.19, within three months. The second one is going to happen sometime next year, early next year. When I said two years, I meant that everything being done within two years is very unlikely. We're going to start posting more patches really soon, but whether those patches get merged, that's within the two-to-four-year timeframe. So this is the last extension we have here. Yes, this is the last one, and it's about VM ballooning. I think David might be interested in this part. For a multi-VM workload, if I have many VMs and we want to overcommit the host, we want to know which VM to take memory from and which VM to give memory to. The multi-generational LRU gives us a histogram, basically what I showed here before, the one on the right side. Based on this histogram we can decide: in this example, VM1 has more cold memory (again, the bluish colors mean cold, the reddish colors mean hot), and VM2 has more hot memory, so we don't want to take memory away from VM2. So that's the extension. Basically, this is similar to free page reporting and free page hinting, but instead of reporting free pages, we report the working set. This is working set reporting; a toy shape for such a report is sketched below.
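This is an invented struct, not any proposed ABI, just to fix ideas: one bucket per generation, anon and file kept apart so the host can tell cold application memory from cold page cache, delivered over whatever guest-to-host channel is available:

    #include <stdint.h>

    #define WS_BUCKETS 4

    struct ws_bucket {
        uint64_t nr_pages;      /* pages currently in this generation */
        uint64_t birth_ms;      /* when this generation was created */
    };

    /* Per-VM working-set report; all names are illustrative. */
    struct ws_report {
        uint32_t vm_id;
        struct ws_bucket anon[WS_BUCKETS];  /* index 0 = youngest/hottest */
        struct ws_bucket file[WS_BUCKETS];  /* page cache reported separately */
    };

    /* Host-side heuristic sketch: the VM whose oldest buckets dominate is
     * the better candidate to take memory from. */
    static uint64_t coldest_pages(const struct ws_report *r)
    {
        return r->anon[WS_BUCKETS - 1].nr_pages +
               r->file[WS_BUCKETS - 1].nr_pages;
    }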
With this kind of information, when we have multiple VMs, we can decide how to balance memory, how to share memory among them.
Yeah, in general, whenever we talk about working set size estimation, we are making guesstimates. And the real issue is that if you take a look at a VM at a certain point in time, or at any process really, you might think you have an idea of what's happening, but that could change at any point in time. It could be that a VM is idling for 500 days, and just when you decide to shrink it, somebody starts up a huge application and you essentially trigger out-of-memory conditions. That is actually why I, I wouldn't say detest... no, detest is the right word: I detest auto-ballooning, because you think you know what you're doing in the hypervisor, but you actually have no clue what your guest is doing. And the nice thing, for example, about free page hinting is that your virtual machine still has all of its memory available, so if you run into the scenario where the VM or the OS suddenly goes crazy on memory, it can still make use of that memory. Whereas if you inflate the balloon, it usually takes some time to detect that you can deflate it, and if you have to do that from a random context, you're usually in trouble. So it's a nice idea for working set size estimation, and it might, for example, help to configure cgroups in a certain way, with swappiness or something like that. But for shrinking VMs that way, I'm usually a little bit more pessimistic that it works reliably. For working set size estimation, though, I think it will be a very good tool.
OK, thanks. So this is an agnostic tool, whatever technique you use to shrink VMs; I just used ballooning as an example. It could be virtio-pmem or virtio-mem, anything that works for you. The working set reporting comes from the guest: the guest has to communicate this information, for instance through the virtio-balloon device. We need a channel to communicate it back to the host; that's the only requirement. Of course, it doesn't have to be virtio-balloon.
Wouldn't it also work somehow to do it in the hypervisor, looking at the physical pages and how they age?
That also works, but in the hypervisor we don't really know the page types, right? We don't know whether the cold pages are actually from the page cache. That information is lost if we try to guess the working set from outside the VM.
Yeah, right, you're right. My question would be whether the guest operating system, for example when you shrink the VM, is able to rebalance. Although you don't have all the information at hand, it might be good enough for certain working set size estimations where you simply care about the maximum size of your VM. It all gets more problematic with memory tiering; that's what I learned. Today I learned that everything gets more complicated with memory tiering.
It's just an option, right? And to be honest with you, right now I don't know how well or how badly it's going to work; but that's the direction we want to investigate. That's why this is extension three right now: it has the lowest priority, and the first one, of course, has the highest priority. More questions?
Yeah, so I do not have questions regarding those extensions or the implementation, because I'm not sure there is much to discuss; there has been discussion on the mailing list, and we as a group have shown some, let's say, concerns. So I would like to focus on where we go from here. You've got your code, and it seems that, long term, there is quite some nice potential for extending that implementation in ways that are currently not really possible with our existing LRU implementation, or at least that's my perception. On the other hand, our LRU implementation has been tweaked and proven over many years, so I do not think it's possible to simply replace one with the other.
Right, totally agree.
Yeah, so right now you've got that opt-in kind of approach, and I'm really wondering how other people in the room feel. Say we go and get it merged, because that's the only way to find out how this implementation really works in large-scale deployments, right? How should we go about that? One approach might be to simply merge it and have it opt-in, for people to enable it and play with it. Another option might be to merge it and enable it by default and let it go through fire: expose it to users and tell them that if it doesn't work well, it should really be looked into and debugged and probably fixed, or maybe we just conclude that this kind of algorithm simply doesn't work well with that workload, and you're back to our original implementation. I can see that both approaches have some advantages and disadvantages. And maintaining two different implementations of the LRU will definitely be a maintenance cost. Those are all things that we should really keep in mind and...
Totally agree, yes.
So is there something we can agree on here? I can see Mel has a hand raised. Do you want to say something before we...
Oh God.
We can hear you.
Oh, okay, thanks for that.
Mel, you have disappeared.
Yes.
You're back.
Hello.
Hey. I think that if this is merged, it should be enabled by default in as many places as possible. And a major limitation on that at the moment is MAXSMP: it requires MAXSMP to be disabled. Now, MAXSMP is enabled on at least openSUSE and SLE, and according to the kernel sources it's enabled on Fedora and RHEL; maybe somebody in the room can confirm Debian, I think later Debians enable MAXSMP. So even if it were merged, even if it were on by default, most major distributions, or at least some major distributions, would not be able to even run it. And it will be important that that limitation be removed.
I'm guessing the MAXSMP limitation is about the number of page-flag bits consumed by NODES_SHIFT.
Right. And somehow, I think I added that dependency a long time ago after I got a report of a build error, and afterwards I couldn't reproduce it anymore. So I'm probably going to remove it; I have already removed it in my repo, and I'm going to post the next version, so that part will be gone. Hopefully it's not going to cause any build errors, and if not, then I guess the problem is solved. If there's still a problem, I do have a plan to address the issue: basically, we can free up some page flags by... oh, who's talking?
Yeah, I think I shared that plan with you. And I think we can remove PG_swapbacked, right? Basically, an anon page is always swap-backed, and...
Oh, they're not, because you can MADV_FREE them and then we clear PG_swapbacked.
Yeah, sorry, sorry.
Then they go on the file LRU. Basically, shmem is always swap-backed; for anon pages, the swap-backed flag can be cleared by what's called lazy free, MADV_FREE, yes, right. But we still can... let me think. I think I have a solution for that part, but I just can't remember it.
Another possibility would be simply to drop the maximum NODES_SHIFT if MGLRU is enabled, because the current value under MAXSMP allows for 1,024 NUMA nodes. And maybe the room can answer: does anyone realistically know of a machine that has 1,024 NUMA nodes?
There was no microphone; it sounded like "a pumpkin", and I don't know what a pumpkin computer is.
You could design one. I mean, the problem is that now every single performance class gets a proximity domain, gets a NUMA ID, but yeah, 1,024 sounds degenerate.
OK, now I remember how to remove PG_swapbacked. For an anon page marked as lazy-free, we can use an additional bit in page->mapping: this is anon and it's marked lazy-free. We use another bit, because we have spare bits in page->mapping. So we can move the information there, and then we can remove PG_swapbacked. And I think we would be using three bits.
We are using two bits right now, because on 32-bit you can only use the last two bits; getting another bit only works on 64-bit, as far as I know, due to the alignment of the mapping.
Are we talking about page->mapping? I don't think that's true, David, because we kmalloc all the things that we point to with page->mapping, right? And the minimum kmalloc alignment is 8 bytes, even on 32-bit. I think a better bet is just going to be to wait for separately allocated folios.
Page bits are precious; we've seen this.
I love your optimism. There are some page flags we can get rid of; I've been looking and thinking hard. One is PG_private: the filesystems can simply check whether the private field is NULL. I need a bit of cooperation from some filesystem maintainers, but I think people can understand that this is for the good of all of us; freeing up a page flag would be a good thing.
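For reference, this is what "spare bits in page->mapping" means; the first two flag values below match the mainline kernel, while the lazy-free bit is only the hypothetical third bit being discussed:

    /* The mapping pointer is at least 8-byte aligned (kmalloc minimum),
     * so its low bits are free to carry flags. */
    #define PAGE_MAPPING_ANON      0x1ul  /* points to an anon_vma, not an address_space */
    #define PAGE_MAPPING_MOVABLE   0x2ul
    #define PAGE_MAPPING_LAZYFREE  0x4ul  /* hypothetical: would replace PG_swapbacked */

    #define PAGE_MAPPING_FLAGS \
        (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE | PAGE_MAPPING_LAZYFREE)

    static inline void *mapping_ptr(unsigned long mapping)
    {
        return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
    }

    static inline int mapping_anon_lazyfree(unsigned long mapping)
    {
        return (mapping & (PAGE_MAPPING_ANON | PAGE_MAPPING_LAZYFREE)) ==
               (PAGE_MAPPING_ANON | PAGE_MAPPING_LAZYFREE);
    }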
Yeah, we've gotten a little off into the weeds here. The context of the discussion is MGLRU, rather than saving page bits in general, and we could get stuck on that for the next hour. I just think that the MAXSMP limitation right now is a strong limiter, because major distributions cannot enable this. I know there are other big distributions out there, but anyone who has MAXSMP set at the moment can't use MGLRU, and that's going to be a bit of a maintenance headache. If that means NODES_SHIFT needs to drop to make space, then so be it. But it should be at least a prerequisite, because at least my own preliminary evaluation shows that its performance is not bad, but it is better on UMA machines than it is on NUMA machines. So there's likely to be some fallout from this. And if we go through a couple of years of development with nobody actually running the thing, significant effort will be put into maintaining two separate implementations, only to find out, when someone tries to actually use it, that it has to be thrown out again.
I agree with all that, Mel; that's a large part of my thinking. I mean, if we look forward into the future, if this thing is going to be a success, it's going to follow a well-known path. We'll put it in; we can enable it at config time or at run time; and nobody will use it. And then various engineers at various organizations will turn it on: let's give this a whiz with our usual workloads. Hey, it works as promised. And they'll start allocating resources to it, and it'll get better and better and better, and eventually people will start switching to it. I'll come back to that later. But what makes that process so much harder for a newcomer, what makes it so much harder to get those people up to speed, is the combination of the code itself and its development history. I don't know who wrote it; I don't know who to contact. We don't have access to all that institutional knowledge if it's been years of development. I'm sure you're a fine chap, but it's only you. Combine that with the fact that the code base itself is inscrutable: I tried to read it, and all I saw was a whole bunch of C statements. That does not tell the whole story. So I implore you to put at least two person-weeks into the documentation. And when you document this code, don't look at it from your point of view; look at it from the point of view of Mel, or Johannes, or even somebody old and stupid like me, and tell me a story to help get me up to speed over this two-year transitional period. Now, I share Mel's concern that we'll reach the point where it's wonderful for most workloads, but there are some things the legacy LRU is still better at. We've still got SLAB and SLUB; we were supposed to delete one of those ten years ago, but one still has advantages over the other, so we've never been able to flip the kill switch. So I think we should have an objective of deleting the old one, which means making MGLRU superior in all respects. I understand it hasn't seen a lot of use on big iron; I don't think the prod kernel people are really using it in anger yet. So if Google could do that, that would make things more confident. But also, I'm being a bit hypocritical here. We don't have this problem with filesystems: we encourage people to choose whatever filesystem serves their purpose best, and we're proud of it. But for some reason we have the opposite attitude in core MM. Why don't we have 15 LRUs, and you choose whichever one is best for you? Linus has declared that we're never going to have a pluggable CPU scheduler; you could say the same thing of filesystems, but we don't. I think, unless people set fire to me, I could see us merging this in the next RC cycle with a big warning on it, which will give people time to play with it. But there's still that issue of how we help engineers ramp up. We're taking a group of people who have been working on the existing stuff for decades, essentially turning them into new hires, and throwing them a new code base that doesn't have code comments, without the benefit of the guy in the next cubicle who's been working on it for a long time. That's the problem.
Right. I agree with most of your points. So, OK, let me start from my point of view. My first problem is a chicken-and-egg problem, regarding how we... I do want to find more people to use it. At least I started internally at Google. I went to the Android team and asked: could you try this, collect some data, so we can show upstream that this is something promising and keep the ball rolling? And I'll let Suren answer how many times he asked me, "when will this be upstream?" So they won't try it until it's upstream. And from the upstream point of view, why would I merge something that nobody has actually used?
If it's not proven, it shouldn't be upstream, right? That's one thing. So I also talked to the prod kernel team; I think Michel was there, so he can also tell you how many times I asked the Google data center team to take this, try it, and give me some feedback. I think their initial response was: OK, this is never going to be upstream, so we're not going to try it; they just brushed it off. So that's the problem I have too: it's upstream first. Everybody wants it to be in upstream, and then they'll take it.
Right. You have a very good point.
I have an answer to that. So then I thought: what am I going to do? Am I just going to let it die? What I did is, I talked to my management. I said, OK, let's allocate some funds to hire a third-party company, a third-party independent lab, to verify this, and they can post data to show its benefits. So that's where the performance numbers actually came from. They're not from Google, unfortunately, but they are independent. So that's the... OK, go ahead.
Yeah, so I think that while I do agree that hearing the history behind this is really an important part, I guess we will not get the full story, and I can live with that. So I guess the primary question is this: if we get this merged, and it seems that nobody is objecting to having it enabled by default, then we essentially declare the next release a trial by fire for reclaim, and you might see quite a lot of bug reports. And you cannot really expect the existing reclaim community to handle both implementations. So that would be a lot of, potentially a lot of, work on your end. Are you prepared for that? Are you prepared for the case where we see a lot of bug reports and they are not getting addressed, because you are overloaded or there's not enough manpower to handle it, so that the whole thing gets reverted and you maybe get another chance later? That's probably the most important question from me.
I'm not prepared for that, definitely not. That's why I added a kill switch; that's why there is a runtime switch to turn it on and off.
No, I'm not saying that you really have to fix all of them in a week's time. I'm saying that if there are bug reports, they will keep you busy for a long time. Are you able, or are you willing, to devote that time? Because if the answer is no, then I do not see how we can get this merged.
There are tricks we can play. I can enable it by default in linux-next for one day, then disable it the next day, and then not read my email for three days.
I don't think you are going to get reasonable metrics on LRU problems in one day.
Mel, could you repeat that? It was probably lost in the noise.
You're not going to get reasonable metrics on LRU efficiency in one day of enabling it, unless it's an absolute abject failure. But I ran it on five separate machines with a variety of workloads and it didn't crash, so I don't think it's going to be an outright functional failure. The key thing for any LRU is how well it preserves the hot working set, which is much more subtle and takes a longer amount of time to detect if there is an issue. But for me, any merging of this where it is not enabled by default on most major distributions, due to MAXSMP, is a hard stop, because it's going to take years to rattle the issues out, and if the primary author is not available to deal with the issues that come up, that's a bit of a problem. So I think the question all comes down to whether Google is going to continue the investment in this area.
The answer is definitely yes, because I haven't finished my story. Now, after Andrew has taken the patch set, the Android team said yes, we're going to use it, and the prod kernel team, the data center team at Google, said yes, we are all in. So I'm going to go back and ask for headcount and a budget plan for the next many years. So back to your question, let me finish this, back to Michael's question: would somebody at Google be prepared to maintain this for the long term? Yes. But what would the community model be? My role is to start it and show the benefits to everybody, and if you are not convinced, I try again, and again, and again, until you are convinced this is something worth trying. Then you try it, and I move on to the next company or the next person. So my next step: memcached doesn't have a backing company, so the next ones are Postgres, which is backed by EDB; MariaDB and MySQL, which are two different companies; and also Databricks, which owns Apache Spark. I'm going to work with them and ask them to add something to their manuals, like: MGLRU has shown promising results when running their applications, and if you have a newer kernel, you should turn it on. That's my next step. I have also been working with many companies in private; actually, I have reached almost all of you. I think David Rientjes sent you emails a couple of years ago to invite you to join a conference with us, and I'm going to continue to push on that front. After we have built enough momentum, then probably I can just sit back and take a break, and the rest of you can carry it on. I think Andrew mentioned: oh, OK, you just put all of us out of a job. No. I know you have been here for a long time, and I have been working in this industry for a long time as well, so I'm not thinking about the next three months or the next three years; I'm thinking about the next five years, the next decade.
You didn't put us out of a job.
Of course, I'm still...
You gave them all new jobs, thank you very much.
I want to disagree with you, Mel. I think putting it in, default off, has considerable value. It certainly sounds like it's going to make it much easier for Google to put resources behind it, and it makes it much easier for other organizations to play with it and decide whether they wish to ramp additional resources into it if they see promise.
There are basically no good choices there. If it's not enabled by default, particularly early in its lifetime once it has been merged, the vast majority of people are not going to end up running it, because at the end of the day, evaluating an LRU is difficult: its primary job is to identify the hot working set.
A three-year cutover would be a success. I mean, look at how THP went in: six months after it went in, it would hit an ENOMEM corner case and collapse in a heap, but people worked on it, it got better and better, engineers ramped up on it, and now it's big time.
People would notice data integrity issues far faster than page aging regressions; identifying changes in paging behavior takes longer. Typically, of the LRU-type bugs we have dealt with in the past, the vast majority have been desktop- and interactivity-related, where the system gets into a swap storm, kswapd is pegged at 100% CPU, or the desktop is jittering, or video playback stops working
when I'm doing a kernel compile in the background, or something similar to this. And for that, it would have to be enabled for interactive users early on. On the enterprise side, page reclaim is not that big of an issue until it absolutely is one, when you hit a corner case at run time; but for normal desktop users, major problems with LRU bugs tend to show up very quickly. So I would push heavily on enabling it by default, and in particular on removing the MAXSMP limitation, so that SLE, openSUSE, Fedora, RHEL, and at least Debian enable it, or at least play with it, or at least play with it insofar as to complain about it, get overwhelmed by bugs, and disable it by default, which is useful in itself. What you don't want is the case where five years of development go into MGLRU and then it turns out no one has enabled it, because they can't, because of MAXSMP or some weird corner case. Don't get me wrong, I'm not opposed to this being merged; I just think it should be enabled by default as quickly as possible. THP was enabled by default at the very beginning, and it set everything on fire, and it took about three years to actually sort it out; but if it hadn't been enabled by default, I don't think it would ever have been sorted out.
So I'm not sure I understand correctly: what's the benefit of enabling this by default as soon as possible, instead of working at a steady pace?
Because it becomes a Heisenberg issue: you don't observe a problem until you measure it, and if it's disabled by default, it's not measured.
OK, I'll give you an example; this is how I usually think. At Google, when we roll out a new feature, we have a million users, or even tens of millions of users, right? So what we do is: we release a new binary with the feature initially turned off, and then we turn it on for 1% of machines, then 5%, then 10%; if everything goes well, we reach 100% within a year or so. And if something goes wrong, we just roll back, go back to the beginning, fix it, and try the process again. So that's the thinking process I'm used to, and personally I would prefer that MGLRU follow this process: by working with those third-party companies, the user space application providers, we try it with all the major workloads we know consume large amounts of memory, we start with that list and go one by one, and we tackle the problems. That's how I, at least at the moment, see the best way to move forward.
I completely agree, I completely agree that this is the way to bring it up in a fleet, when you're a company that has data centers and warehouses distributed across massive geographical locations; that applies to Google, and it applies to companies like Meta. But it doesn't help any of the major distributions, which are responsible for shipping a general OS, decide when and how to do the switchover if there isn't a large amount of backing data. Just because it works for a fleet of machines in warehouses does not necessarily mean it works in the general case.
If you reduce the number of generations to two, how much difference is there between MGLRU and the legacy, the existing, LRU, sorry?
By design we can't just reduce the number of generations to two; it has to use four, because it distributes pages across all four generations. We can extend it beyond four generations, but we can't reduce it below that; right now the minimum is hard-coded at four.
Go ahead.
Just on the enablement: what was interesting, I think, on the
mailing list is that there were actually quite a few people who said: if this is going to be available in the kernel, I'm going to turn it on. I think one of them was the Arch Linux kernel maintainer. So I think I disagree a little bit with "if we don't enable it right now, nobody will use it". There are different distributions; some of them are not as eager to try new stuff as others. I think there will be some takers: if we merge it default-off and keep it like this for two or three releases, I think people will use it and experiment with it. It's not a guarantee, but you might be able to space it out a little bit, so you don't have a flag day where everything's on fire; you might get some early reports. I don't think that's entirely unlikely.
I just want to agree with Johannes, and to disagree with Mel about THP being enabled by default being the right thing to do. Documentation lives forever, it turns out, and you can still find documentation published by a prominent database vendor, whom I shall not name but with whom I've been fighting very hard internally to get them to stop doing that, which says you must turn off THP because it absolutely destroys database performance. Yes, it did, ten years ago. Now it doesn't; we fixed it; delete this documentation. And it turns out that finding the documentation owner and getting it changed is hard. It doesn't die off the web, and it doesn't die off forums, and you find sysadmins helping each other by telling each other these things, and it's like: yeah, that was true when you learned it, but have you benchmarked it recently? Do you have any idea whether it's still true? It's not true. So yeah: merge it, off by default, until we've got the bugs out of it, and then enable it.
Even with that caveat, on that particular point, even if it's disabled by default, it should be possible to enable it at run time; that should still be available.
So I think this all comes down to a benefit-risk analysis, or risk-benefit analysis. Personally, I don't really mind turning it on by default, but then it would become other people's problem: if we turn it on by default and we break something, OK, I broke something, but it's the users who suffer, and personally I don't want that to happen. So I do hope that one day we can switch to MGLRU by default, and it could happen sometime next year, but I think we might not want to turn it on immediately for everybody; that sounds kind of risky to me.
Yep. Yes, right.
So I think if we merge it default-off, there should be a very tangible timeframe for enabling it. I wouldn't say "in a couple of years we'll switch it to default-on"; I think it should be a couple of kernel versions at most. You'll get a lot of early adopters, and those tend to be... well, you have use cases that you care more about than others; those early adopters will be people who run laptops and desktops at home, and you're going to be inundated with bugs that you might not necessarily be excited about, like "I turned swappiness to zero, why is it swapping?", that kind of stuff. I think you can take the heat off a little bit by dealing with those up front and spacing it out a little. But yeah, if there's not a concrete plan to enable it in a reasonable timeframe, I think it's not going to work.
All right, I'm calling time, guys, because we've got to wrap it up.
Great, thank you.