Hello, I am T.J. Mercer. I work on Android, and I want to turn on memcgs for Android. I quickly ran into a problem when I tried to do that. So this talk is about one of those problems, and about brainstorming ways we can address it with Yosry and Chris here, who work on the prod kernel and have the same problem elsewhere. So what's the problem? Zombie memcgs. The scenario: you have a memcg with some processes in it. It allocates some anonymous memory and gets charged for it; memory.current is correct, the world is happy. The same cgroup allocates some shareable memory and gets charged. Memory that can be shared can be a lot of different things; it could be shmem. It still gets charged for that memory, and everything is still fine. Then that memory actually is shared with another cgroup. The problem is that only one of the cgroups gets charged for the memory: the original owner. There is only one owner of a page, so memcg A gets charged for it. Memcg B gets to use the memory but doesn't get charged for it, which is kind of weird: shared memory, one owner. Then all the processes in memcg A die. The memcg gets offlined, but it does not go away. It stays around, because the shared memory that is still in use by B is still charged to memcg A. That's the main problem. The issue is that these zombies can accumulate, a lot: thousands of memcgs. I ran into this an hour after I turned memcgs on for the first time on Android, and there are similar issues in the prod kernel, where these guys work. That consumes kernel memory, which is not great; we're wasting kernel memory tracking charges to cgroups that don't exist anymore. It also makes any kernel operation that iterates across memcgs less efficient, and reclaim is the big one: we iterate through all of them on every reclaim event.
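The single-owner charging behavior described above can be captured in a toy Python model. This is purely illustrative, not kernel code; the names (`Memcg`, `Page`, `charge`) are invented for this sketch, and `current` stands in for `memory.current`.

```python
# Toy model of single-owner page charging, showing how a shared page
# keeps an offlined memcg alive as a "zombie". Illustrative only.

class Memcg:
    def __init__(self, name):
        self.name = name
        self.current = 0      # analogous to memory.current
        self.online = True

    def offline(self):
        self.online = False

    @property
    def is_zombie(self):
        # A dead memcg cannot be freed while pages are still charged to it.
        return not self.online and self.current > 0

class Page:
    def __init__(self):
        self.memcg = None     # single owner: one memcg pointer per page

def charge(page, memcg, nr_bytes=4096):
    # Only the first toucher is ever charged.
    if page.memcg is None:
        page.memcg = memcg
        memcg.current += nr_bytes

a, b = Memcg("A"), Memcg("B")
shared = Page()
charge(shared, a)   # A touches the shared page first and gets charged
charge(shared, b)   # B uses it too, but the charge stays with A
a.offline()         # A's processes die; the memcg is offlined but not freed
```

The point of the sketch is the last line: nothing in the model can drop A's charge while B still uses the page, so A lingers as a zombie.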
Here are some things we thought we could do to address the problem. First: proactively reclaim from the cgroup that's about to be offlined. That doesn't really work if the memory is not reclaimable; the cgroup will stay around until the memory does become reclaimable. And if reclaiming from the cgroup before it's offlined results in pages being swapped out, that actually makes things worse, because the cgroup will stay around until those pages are swapped back in, and who knows when that's going to happen. The other issue is that you can only attempt this once before you offline the cgroup, and it can fail, so it might not capture everything. Not a great solution. Another thing we could try is reparenting the charge. Cgroups form a hierarchy, so why not just move the charge up to the parent and get rid of the originally owning cgroup entirely? That's also kind of weird: the parent doesn't have any more claim to the memory than the child does, so in terms of correctly accounting who gets charged for what memory, it's still not ideal. It also affects the lruvec of the parent: you're mixing zombie pages with non-zombie pages, so you have to scan more pages in the parent, and reclaim regresses there. And reparenting can happen multiple times, all the way up to the root cgroup, getting worse each time. Not a great situation. The other issue with this is that the same workload can result in different memory charges: run the same workload twice in two different cgroups and, because of the non-deterministic way the memory gets charged, the actual page counters are not the same each run.
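The reparenting objection can be made concrete with a small sketch. Again this is a toy model with invented names, not the kernel's actual reparenting code; real reparenting (Roman Gushchin's infrastructure for kernel objects) is considerably more involved.

```python
# Toy sketch of reparenting: on offline, move the remaining charge (and
# pages) up one level. Shows how zombie pages pollute the parent's LRU
# and how the process can cascade toward the root. Illustrative only.

class Memcg:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.current = 0
        self.pages = []   # stand-in for the memcg's lruvec

def charge(memcg, page, nr=4096):
    memcg.current += nr
    memcg.pages.append(page)

def reparent(memcg):
    # Move every remaining charge and page up to the parent.
    parent = memcg.parent
    parent.current += memcg.current
    parent.pages.extend(memcg.pages)   # zombie pages now mix into parent's LRU
    memcg.current = 0
    memcg.pages = []

root = Memcg("root")
parent = Memcg("parent", root)
child = Memcg("child", parent)
charge(parent, "p1")
charge(child, "c1")
reparent(child)    # child can now be freed...
reparent(parent)   # ...but the charge can cascade all the way to the root
```

After both calls, the root is scanning pages that no live descendant owns, which is exactly the reclaim regression described above.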
So I think the fundamental issue is that struct page can only reference one memcg owner, when in fact the page can be referenced by arbitrarily many users. The memcg_data field of struct page, or struct folio, holds that info. Any time a page is shared between cgroups, for any reason, the charge can outlive the owning cgroup; that's what keeps the zombies alive. We've been brainstorming some ways to address this with Chris and Yosry. One idea: instead of reparenting the charge, just moving it up the hierarchy, move it to some other cgroup that has a valid claim to the memory. In terms of accounting correctly that's a little better, but which other cgroup? That's hard to find. One answer is the first one that accesses the memory after its current cgroup becomes a zombie; that comes later. Then there's a longer-term idea: instead of associating a page with just one memcg, a way to associate a page with multiple memcgs. That's the idea behind the longer-term solution. Okay, so I guess you can take over for recharging.

Hey everyone. I will try to go through these quickly so that we have room for discussion at the end. As T.J. said, one idea is to recharge the memory after a memcg has gone offline: go through all the pages charged to the memcg after it's offlined and try to recharge them to the rightful owners of the memory. So, just to enumerate the types of memory we can have charged to the memcg, we can have kernel pages, these, oh...

Why do you focus so much on offlining the memcg? Doesn't this problem originate when the shared VMA became unmapped?

As long as the memcg is still alive, the pages are charged to it.
Well, I know, but if you're going to do recharging, why not do it when the VMA that originally got the pages charged to the memcg becomes unmapped? That's the logical point at which the memcg no longer has a claim on those pages.

It doesn't have to be a VMA; there doesn't have to be a mapping at all. It could be just page cache pages.

Well, but if they're not mapped into your process, then... shared files, right. Exactly. And also there are resources that do not belong to anybody, like tmpfs, which sits around even without any process having it mapped or even open. Tmpfs, shmem, page cache.

So, enumerating the types of memory we may run into: we may have kernel pages. These are fine; they're currently being reparented, so we don't have to worry about them for now. We may have LRU pages, which I like to divide into mapped and unmapped pages. Unmapped pages may be page cache pages, as you said: file pages, shmem, all that. And they may be anonymous pages in the swap cache. I will deliberately ignore that case for now, because they're not very common and they aren't really shared; they're not an artifact of the same problem. They only exist if we just swapped the page out or just swapped it in, and they should go away when the process dies, so I will ignore them for now. Assuming this covers everything, what can we do about those pages? I like to define this as a toolkit, things we can do. The simplest and most aggressive thing you can do is just evict the pages. For page cache this is really nice: you write back the pages, you uncharge them, and the next time someone tries to access them, they're refaulted, reallocated, and recharged. Very simple, just reclaim, but it is intrusive: if it's hot memory, you incur a fault the next time someone accesses it.
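The "just evict" tool can be sketched as follows: drop the cached page and its charge, and let the next access refault and recharge it. A toy model with invented names (`evict`, `access`), not kernel code.

```python
# Toy sketch of the "just evict" tool from the toolkit: write back and
# uncharge a page cache page; the next access refaults it and the charge
# lands on whoever touched it. Illustrative model only.

class Memcg:
    def __init__(self, name):
        self.name = name
        self.current = 0

PAGE = 4096

def evict(page_cache, key):
    # Drop the cached page and return the charge to its current owner.
    owner = page_cache.pop(key)
    owner.current -= PAGE

def access(page_cache, key, accessor):
    # On a miss, the page is refaulted and charged to the accessor.
    if key not in page_cache:
        page_cache[key] = accessor      # refault: new owner
        accessor.current += PAGE        # recharge at fault time
    return page_cache[key]

a, b = Memcg("A"), Memcg("B")
cache = {}
access(cache, "file:0", a)   # A faults the page in and is charged
evict(cache, "file:0")       # reclaim: write back + uncharge from A
access(cache, "file:0", b)   # B refaults it; the charge now lands on B
```

The cost hidden in this sketch is the second `access` call: if the page was hot, B pays a page fault that would not otherwise have happened.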
Also, for swap-backed pages, if you swap out a page from an offline memcg, you basically reparent it: the swap gets charged to the parent, and when the page is refaulted, the charge goes to the parent. So that's effectively delayed reparenting. And if the pages are pinned, you can't evict them to begin with. The second thing you can do is recharge to someone mapping the page. Assuming the page is mapped, you can walk the rmap, find a process mapping it, find its memcg, and try to charge it there. The benefit is that this is a rightful owner: they're mapping the page, they're using the page. The obvious problem is that, at some random point in time, a memcg gets charged for memory it may have touched three hours ago, and in the worst case you can even cause an OOM kill there. So this is disruptive. You do not have to evict the page, so there's no page-fault latency on the next access, but you may hurt the memcg. Also, if the page is mapped by multiple memcgs, you need to make a choice about which one to charge, and that's a little complicated. The third thing you can do, which is close to what Matthew suggested, is what I like to call two-step recharging. If the page is not mapped, you don't have a way of finding out who's currently using the memory, but you can flag the page, and the next time someone accesses it, you charge it to them. The reason I call it two-step is that, to have the zombies disappear at once, we can uncharge from the zombie itself, leave the page charged at the parent, and flag the page; the next time it is accessed, we move the charge. Recharging to the parent should be straightforward: the parent is already charged through the hierarchy, so you just drop the reference to the zombie memcg so that it dies as soon as possible.
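Two-step recharging can be sketched as a toy model: step one uncharges the zombie (leaving the hierarchical charge at the parent), step two moves the charge to the next accessor. All names here are invented for illustration; real hierarchical charging is done via page_counter, not a loop like this.

```python
# Toy sketch of "two-step recharging". Step 1 runs at offline time and
# lets the zombie die; step 2 runs on the next access. Illustrative only.

class Memcg:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.current = 0

PAGE = 4096

class Page:
    def __init__(self, memcg, nr=PAGE):
        self.memcg = memcg
        self.nr = nr
        self.pending_recharge = False

def hierarchy_charge(memcg, nr):
    while memcg:                       # charges propagate up the hierarchy
        memcg.current += nr
        memcg = memcg.parent

def hierarchy_uncharge(memcg, nr):
    while memcg:
        memcg.current -= nr
        memcg = memcg.parent

def uncharge_step(page):
    # Step 1: uncharge the zombie only; ancestors stay charged.
    page.memcg.current -= page.nr
    page.memcg = page.memcg.parent     # drop the reference keeping it alive
    page.pending_recharge = True

def access(page, accessor):
    # Step 2: on the next access, move the charge to the accessor.
    if page.pending_recharge:
        hierarchy_uncharge(page.memcg, page.nr)
        page.memcg = accessor
        hierarchy_charge(accessor, page.nr)
        page.pending_recharge = False

root = Memcg("root")
zombie = Memcg("zombie", root)
b = Memcg("B", root)
page = Page(zombie)
hierarchy_charge(zombie, PAGE)   # initial charge: zombie and root
uncharge_step(page)              # zombie drops to zero and can be freed
access(page, b)                  # B touches the page and picks up the charge
```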
This is complicated because there is work that needs to be done on the access path. The other two approaches can be done asynchronously, but this one requires work when someone is reading, writing, or mapping the memory. So a proposed workflow would be as follows. When a memcg is offlined, you queue an asynchronous worker that iterates the LRUs and does the following. If a page is unmapped and file-backed, you can argue, as the current reclaim heuristics do, that unmapped file pages are not very important, so we can make an argument to just evict those pages. It doesn't have to be this way; this is just a proposed workflow. If the page is swap-backed, we can do what I just called deferred recharging: mark the pages so that the next time they're accessed, we recharge them to the process accessing them. This is applicable to shmem pages. If the page is mapped, we can either walk the rmap at once and do the disruptive recharge that may result in an OOM kill, or we can do a similar thing where we flag the page, and the next time it's mapped or accessed we do the recharge then, so that it's more deterministic when the recharge happens. Of course, the problem is that if the page is mapped and you want to recharge on the next access, you have to do something like NUMA hinting faults, where you protect the mapping so that the next access takes a page fault, and you recharge then. And then there's the other problem: if a cgroup dies and it had some memory already swapped out, those swap entries will keep references to the offline memcg indefinitely until they're swapped back in, and they will be recharged at fault time.
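The decision logic of the proposed offline worker can be summarized in a few lines. The categories and action names below mirror the talk; the function itself is illustrative pseudologic with invented names, not kernel code.

```python
# Toy classifier for the proposed offline-worker workflow: given one page
# of an offlined memcg, pick the action described in the talk.

def offline_action(page):
    """Return the proposed handling for one page of an offlined memcg."""
    if page.get("kernel"):
        return "reparent"            # kernel pages: already handled today
    if not page["mapped"]:
        if page["file_backed"]:
            return "evict"           # unmapped file pages: just reclaim them
        return "deferred_recharge"   # swap-backed (e.g. shmem): flag the page
    # Mapped pages: either recharge eagerly via the rmap (may OOM someone),
    # or flag and recharge on the next fault, NUMA-hinting-fault style.
    return "recharge_via_rmap_or_flag"

lru = [
    {"mapped": False, "file_backed": True},
    {"mapped": False, "file_backed": False},
    {"mapped": True,  "file_backed": False},
    {"kernel": True,  "mapped": False, "file_backed": False},
]
actions = [offline_action(p) for p in lru]
```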
This is a tangential problem, but it's related; it also causes zombie memcgs. It can be addressed completely separately by having an asynchronous offline worker that loops through those swap entries and recharges them, but I'd like to focus more on the recharging part. So that's the idea of the short-term solution, I guess. Any questions before we move on?

We used to recharge in the past. It was a problem; we had to drop that. We have built quite a lot of assumptions that a page's memcg doesn't change over its lifetime, and this will be hard. I think one part of the problem is that sharing resources across resource boundaries like cgroups is not a great idea in the first place. So are there ways to avoid that in the first place? And another problem is that offlining cgroups is probably just too easy to do without very good reasons. We have seen cases where services are restarted using essentially the same environment, yet their cgroup is removed just to be recreated in the very same place. So a part of the working set is in the dead memcg while a new one has been created. Maybe we just want to make it harder to remove those. Would a lot of things break if we just refused to remove a memcg if there is memory that cannot easily be dropped? That would be one idea: avoid the problem rather than trying to fix something that is really hard to fix. Because there are resources that are not reclaimable without data corruption, like tmpfs and probably many more. You can have a file descriptor sent over a socket, and then memory is bound to a file descriptor that is still open.

Can I comment? I don't know if you guys can hear me. Yes, we can. Okay, so previously we used to have recharging, but it was actually uncharge and charge.
I think the reparenting is different; that can still be done with the whole infrastructure Roman built. That's one. Second, not sharing between cgroups is, in today's economy, kind of not really possible. Everyone is trying to save memory by sharing more memory and reducing cost, so we do need it. So rather than avoiding the problem, I would say we can aim for what we can achieve easily in the short term; that's beneficial. Just one more comment on what Yosry said about mapped memory: as you already mentioned, for recharging you can do something like NUMA-style faulting, or you can unmap all those pages, and whoever takes the next fault gets the charge. So I think we have to decide whether the short term, which I think is achievable, is good enough, or whether we want to go further. That's all.

Exactly as Shakeel said. I actually wrote something up to have a control or config that would prevent people from removing memcgs that have memory charged, but sometimes it's not something you can control. For example, if you have shared libraries with page cache pages that are hot and actively used, or worse, if you have a tmpfs file and you don't have swap, and you happen to be the first one to write something into that tmpfs file, there's no way to get rid of the charge today. You're stuck with it virtually forever until someone decides to delete the file. If you have a log file in tmpfs, for example, and you write just one byte to it, and then no one deletes that log file, you're stuck with this charge virtually forever. The offline memcg will stay there, and there's no way to remove it unless you truncate the tmpfs file.
So I do agree it would be nice if we could have some separation and prevent sharing as much as we can, but as multiple people have noticed on the mailing list over the last year or so, a lot of people run into this. It's not a problem where we can tell user space "stop doing X" and it will be solved, unless everyone keeps their own copy of their libraries and does their own logging, and like Shakeel said, that's not very efficient.

I do understand that reasoning. The main question is whether we want to see that something is not really working well, or whether we want to hide it. Right now we just hide it, because those memcgs are offline and it's not easy to find that they actually exist. Essentially, that's a resource leak, right? Somebody forgetting a file, a log file, whatever it is, is a resource leak; that's memory consumed on behalf of somebody. So either we face it and consider it everybody's fault, like the root memcg owner's, or we drop the owner of that memory and hope that somebody will touch it again and get caught and get charged for it, but then we're hiding that there is resource consumption in the first place. This problem has been there for ages, and I don't think we have found anything that wouldn't break something else. I don't think there is an easy way out, and that's why I'm saying maybe we should just refuse to remove that cgroup, so that at least an admin can look and say: okay, this is something I cannot remove, so there's probably sticky memory somewhere, and maybe I need to take that imperative kind of action and remove that file, because I know what I'm doing. I don't know. I simply do not see a solution; it's just a question of how honest we can be about the problem and how visible we make it.
But just visibility is not enough. I'll give an example from our workload environments: if people fail to remove the memcgs because some pages are charged to them, they're just going to schedule new jobs and create new memcgs. At some point they'll hit the limit on the number of memcgs on the machine, and they won't be able to schedule any more jobs. The fact that we didn't allow them to delete the memcgs isn't going to help them; they're going to come to us and say, "This is not getting deleted, and I don't know which file in the system is responsible or what exactly is happening, because I don't control its access and lifetime," and we're going to have to, I don't know, crash the machine so we can find out which pages are charged to them and tell them what to do next time.

Right, but those leaking resources stay behind, so even if you don't have any hard stop, eventually you OOM the whole machine and panic anyway, because those unreclaimable resources are eating up memory. There is simply no way out of that without some intervention.

Just a little bit of brainstorming, but it seems like you need a parent for a shared thing, and maybe one exists, or maybe you need to create a new one. That's a vague statement, but if you've got several processes all sharing memory, and some of them might die, you can either create an artificial parent, use an existing parent (which you don't like), or nominate a parent from among the ones that are sharing it. The essence of your problem is that right now it's a sibling society, and that's no good. You've got to have something that owns the memory.

Yeah, that is the long-term solution, which is exactly why I'm going to give the mic to Chris, because he's going to talk about exactly that.
Yeah, it's a very good segue to our next slide: what if we do the brave thing and try to model shared memory usage? What kind of solution can we come up with, and what kind of problems will we run into? I've had this idea for only a short period of time, and it's still in development, some of it not fully hashed out, but the basic idea is that we want to assign an owner for the shared resource, which removes the asymmetric part of the problem. The shared memory's LRU will point to that shared-memcg owner, the smemcg, and the shared resource will have the same lifetime as this owner, so we won't have the situation where the process goes away but the shared memory is still lingering around. We can also potentially track each memcg's shared usage with a borrow counter, but that's fairly complicated; I'll get to that in the last part of the slides. We'll take care of the easy cases first. The advantage of this model is that there is no charge movement: the charge always goes to the shared resource, the smemcg, and if we model it correctly and account for all the shared resources that can be used by the other memcgs, there will be no zombies; basically, there will not be any zombie left. The simpler case is where we don't track the detailed usage, how much each memcg partially uses; we only say that a shared resource either belongs to a memcg or it doesn't, a membership test. The big change is that the memcg's memory accounting becomes a list of counters instead of one counter, and for the shared resource you have the option to say: this shared resource should be counted toward these memcgs' memory pressure. The big black line pointing to it on the slide is case B, and the memcg-offline case, case A, is similar.
A memcg first touches the memory but doesn't get charged for it. This is actually a very common case at Google: user-space packet routing, where one memcg passes packets to other memcgs to process them. A is the one doing the switching and B is the one doing the actual processing; A is done very quickly, and B will probably hold the packet for a longer time while processing it. The desired result is that B feels the memory pressure when it holds the packet for too long without releasing it. The user will have the option to choose which memcgs include this shared memory. Keep in mind that parents will need to deduplicate: if a shared object is already in the list, it should only be included once. But because there are relatively few distinct types of shared resources, it should be a very short list to maintain the deduplication over. Then I'll get to the last, really scary part: what if we want to track the actual portion of the usage? I haven't figured out a really good, scalable way to do it. Theoretically you can do it with the borrow-counter concept: one memcg still owns the resource, and any reference, any mapping by others is considered borrowed usage, and you can count that portion toward their memory pressure using the same architecture. But it requires a per-smemcg, per-memcg borrow counter to keep track of it, and the sets get more complicated when they merge into the parent, if the shared resources have different sets that merge there. So it's a very complicated problem, but for Google's use case the first model, plain membership, is good enough to take care of it. Okay, thank you.
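The membership variant of the smemcg model can be sketched as follows. This is a speculative toy model of the idea as presented; every name here (`Smemcg`, `include`, `pressure_total`, `parent_total`) is invented for illustration, and the real design would live in the kernel's charging paths.

```python
# Toy sketch of the long-term "smemcg" model, membership variant: shared
# memory is charged once to a shared-memcg owner, and ordinary memcgs opt
# in as members so the shared usage counts toward their pressure without
# any charge movement. Illustrative only.

class Smemcg:
    def __init__(self, name):
        self.name = name
        self.current = 0          # the single, stable charge

class Memcg:
    def __init__(self, name):
        self.name = name
        self.current = 0          # private memory
        self.shared = set()       # membership set of Smemcg references

    def include(self, smemcg):
        self.shared.add(smemcg)

    def pressure_total(self):
        # Private usage plus every shared resource this memcg is a member of.
        return self.current + sum(s.current for s in self.shared)

def parent_total(parent, children):
    # Parents must deduplicate: a shared resource counted by two children
    # is still included only once at the parent.
    merged = set().union(*(c.shared for c in children)) | parent.shared
    return (parent.current + sum(c.current for c in children)
            + sum(s.current for s in merged))

packets = Smemcg("packet-pool")
packets.current = 1 << 20         # 1 MiB of shared packet buffers

router, worker = Memcg("A-router"), Memcg("B-worker")
worker.include(packets)           # only B opts in, so only B feels the pressure
router.current = 4096
```

In the packet-routing example, A (the router) opts out and B (the worker) opts in, so B's pressure includes the pool while A's does not, and no charge ever moves when A's memcg goes away.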