The last time I gave a talk was, I think, at a previous LSF/MM in Puerto Rico, and at that time I was thinking mostly about the memory footprint side of the problem, but now I'm trying to cover more. It's not a secret that on any modern system we create and destroy a ton of cgroups, and systemd in particular likes to do it very often: you run something, it crashes ten minutes later, systemd deletes the old cgroup and creates a new one, and this can be repeated over and over. Obviously, there are costs to creating and destroying kernel objects. The cost of creation is fairly low, so I don't think we've ever had any issues there, but the cost of destruction can be really brutal. What we do is just postpone it to a better time: we use a reference counting model, and we say that at some point in the bright future all the references will be gone and we'll be able to release the cgroup. The obvious problem is that this usually takes a long time, and in many cases it never happens, so instead of paying for destroying a cgroup, we pay for having a lot of dying cgroups in the system. That doesn't come for free; there are problems and issues. First, there's CPU overhead: whenever we reclaim memory, we walk the cgroup tree, and dying cgroups are no different from live cgroups in this sense, so if we have thousands of dying cgroups, reclaim becomes more CPU-costly. Second, there's the memory footprint. Memory cgroups are large objects: on modern servers, depending on the number of CPUs and NUMA nodes, it's somewhere in the hundreds of kilobytes per memory cgroup, so if you have thousands of them, you're wasting several hundred megabytes, if not gigabytes, of memory. Those are the first two problems, which I've focused on a lot in recent years, but now I realize there are new problems. One is user experience.
Let's say you have system.slice, and it's very large, say 10 gigabytes. But if you go over all the existing, living cgroups in it, they might add up to something like 100 megabytes, and a user may ask: where's my memory? Where are the 10 gigabytes? The user has basically no visibility into it. We have some statistics at the parent level, but it's not enough. And actually the worst part is that these living cgroups may look like they're taking 100 megabytes, but that's not true either, because they might be reusing, for free, all the page cache and slab memory that was created by dying cgroups. So for the living cgroups things aren't working well either. I'd frame it as a memory sharing issue. It's not a secret that we were never good at handling memory shared between different cgroups: whoever creates a page gets charged for it, and everybody else uses it for free. With dying cgroups we have the same problem, except the sharing happens not between living cgroups but between different generations of the same workload. And as I said, this sharing breaks many basic features of memory cgroups. A significant part of the used memory, actually the hot working set, belongs to a previous memory cgroup that was already deleted. You can't really trust the size of your newest generation, your limits don't work, your protection doesn't work well, and your statistics aren't accurate at all. There has been a lot of work in recent years on the dying-cgroup problem. This is a short list, and I could have missed something, but I think the first big part was slab reparenting, which I presented at a previous LSFMM. Mostly, dying cgroups were pinned by remaining slab objects, so that helped a lot.
Then we did the complete rework of slab accounting, and at that time we introduced the obj_cgroup API (the name is courtesy of Johannes). I initially thought of it as something like a smart pointer in C++ terms: it's just a pointer with a reference counter, which can be used instead of a memory cgroup pointer. The idea is that it's a small object, so it's much better to accumulate obj_cgroups than memory cgroups. It can also be atomically switched to point at the parent cgroup, so we get reparenting basically for free. Later this API was reused by Muchun to convert the accounting of non-slab objects, and I used it for CPU accounting, so there are at least several users now. Recently I worked on a cgroup writeback cleanup: the problem was that whenever an inode got dirty, it got associated with a writeback structure, which held references to the original memory and block cgroups, and if nobody else touched that inode for writing, the reference stayed there forever. That's done now. Another change: Johannes switched memory cgroup statistics to rstat, the mechanism created by Tejun, initially for the CPU controller. That solved the problem of statistics accuracy, because previously reading memcg statistics was O(number of children in the cgroup tree), including dying children: if you had 2,000 dying cgroups in a tree, all the statistics reads suffered. And recently there was a big patch set from Muchun which optimized the allocation of lruvec statistics. So what's missing and what's going on? I think right now the biggest question is what to do with the page cache, because page cache is usually what is left behind when a cgroup is deleted and the workload is stopped, and it's often shared between generations of the same workload.
There's the recent patch set by Muchun, posted a few months ago, which reused the obj_cgroup API to reparent the page cache. I'm actually looking for opinions here, because I'm not sure it's the best idea, and I wonder what other people think. So let me lay out my thoughts. Basically, yes, we can use the obj_cgroup API to reparent the page cache, and it would solve the problem, but it adds some complexity because of the indirection, and the code doesn't look nice. The real question is: do we really want to reparent all the page cache, or should we do something smarter? For example, should we use lruvecs as the intermediate object? Do we really need each charged page to hold a reference to a cgroup? Maybe not. I'm kind of leaning towards that idea. Another question: if a page is being activated and we know it belongs to an already-deleted cgroup, we know for sure who the new user is, so maybe we should recharge it to a different cgroup. It's an open question; we usually don't like to recharge anything, but the current situation isn't great either. There was also a patch from Waiman to release per-CPU memory as soon as we stop using it. It's not a new idea; I sent a similar patch a few years ago, but again it adds complexity, because now when you access memcg statistics you need to do more magic there. And it feels like it's not a solution, just a bandage that makes the memory footprint less severe. We've discussed different ideas about what to do with the page cache left behind: maybe we should mark all these pages with some flag, and then whoever uses them next should pay for them. The question is how to do that without adding a lot of overhead on hot paths.
And maybe, surprisingly, the most promising solution is coming from user space: the idea of simply not deleting and creating new cgroups every time. Instead, we can add another layer in the cgroup tree, but it won't be a full layer, just a pids-controller layer. Whenever something is restarted, we create a new sub-cgroup, but the memory controller stays enabled only at the higher level. This might happen soon in systemd, but it's also questionable, because now we delegate to user space the question of whether the new stuff and the old stuff are really the same thing. Especially if you use memory protection: say you had a very big cgroup with memory protection, then you restart something different under the same name, and now you're protecting all the stale page cache. That's pretty much all I have. I'm really curious what you think about page cache reparenting and what we want to do here.

I think systemd is even changing the way it operates: if there's a cron job with a service or something associated with it, instead of creating a new cgroup for every run, it can now reuse the same cgroup, because it's a known job and it doesn't need a new cgroup for every instance of that job. But I think there will still be cases where you do one-off executions, right?

Yeah, exactly. It's still not avoidable, so I think it's great that they're fixing it, and every time you don't have to delete and recreate cgroups, that's nice, but I think it still makes sense to fix the pile-up on the kernel side. Yeah, absolutely. It won't fix all the cases, but it can make the problem smaller for most users.

I'm curious about the lruvec direction: you would only have to move the entire lruvec instead of iterating over pages? Yeah.
Wouldn't that mean that every time you have a page and you need to identify its cgroup membership, you'd have to go through the lruvec? Yes. And that's not always there, right? Yeah, there are big open questions there. Okay.

But in general, it would be really nice if we could get away from the idea that every charged object holds a reference to its cgroup. Instead, if a page is on an LRU list, it's protected by being on that lruvec, and if we did all the cleanup and recharging work during the destruction of the cgroup, we could literally release the cgroup immediately afterwards, as the end stage of that process. That would be really nice. Another problem right now is that a cgroup is protected by a reference counter, usually with some huge value, and it's absolutely impossible to see what's going on in there: we're leaking cgroups in the normal case, and if something is wrong we may be leaking them a little bit faster, but it's really hard to detect.

That's why I kind of liked the reparenting approach: it would make it clearer whether there's an active leak, since the common case does immediate garbage collection. Yeah, but then you have the same problem for the obj_cgroup counter. Yeah, that's true. I don't know; I think I like the obj_cgroup-based reparenting the best. It doesn't get in the way of still thinking about things like LRU batching, and it seems to me it would fix the problem most directly with infrastructure that's already there. Yeah, I have mixed feelings; it's not the prettiest. Yeah, it's not the prettiest. And it won't solve the sharing problem, so it only addresses the garbage collection part. Right, right. Cool, thank you.
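For reference, the two-layer layout proposed earlier (memory controller enabled only at the long-lived service level, with a cheap fresh child per run) could look roughly like this. This is a hypothetical sketch: on a real system the root would be /sys/fs/cgroup and the writes would need privileges, so here a temp directory stands in for the cgroup filesystem and all the names are made up.

```shell
# Mock cgroupfs root so the structure is visible without privileges.
CGROUP_ROOT=$(mktemp -d)
mkdir -p "$CGROUP_ROOT/system.slice/myservice"

# Enable memory (and pids) for children of system.slice, so "myservice"
# itself is a memory cgroup that survives restarts...
echo "+memory +pids" > "$CGROUP_ROOT/system.slice/cgroup.subtree_control"

# ...but below "myservice" enable only pids: per-run children never
# become memory cgroups, so deleting them is cheap.
echo "+pids" > "$CGROUP_ROOT/system.slice/myservice/cgroup.subtree_control"

# Each restart creates and destroys only a small pids-level cgroup,
# while all memory stays charged to the long-lived "myservice" level.
mkdir "$CGROUP_ROOT/system.slice/myservice/run-1"
```

The trade-off discussed in the talk applies: the kernel no longer sees a delete/create cycle, but user space now decides whether successive runs really are "the same" workload.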