Cool, so we have 23 minutes before lunch. I'm hoping that if we go a little bit over, it's OK. But I don't know, is it a hard stop? OK, all right. So this talk is about lazy RCU and memory. I'm excited to present. About me: I'm an RCU co-maintainer. There are about four or five of us, and we work on different areas of RCU. I've worked on several RCU features over the years, related to memory ordering, lazy RCU, the RCU list APIs, and many bug fixes. There's an interesting backstory about how I got into this, so if you're interested in hearing it, let me know and I can tell you. It's been about seven to eight years since I started working on this stuff. I also work at Google on performance and power in the Chrome OS team; I was working on Android before that. And my website has all the information about me if you're interested in checking it out. The goals of this talk are to make everyone aware of the lazy RCU feature, which maybe will help your use case, hopefully, and also to ask for feedback from the MM community on different issues that we have. We can make it interactive, but because we are already short on time, let's keep the less important questions for the end. So we'll start with the background of lazy RCU. The observation I had about two years ago is that when a system is mostly idle, most RCU callbacks don't need to run right away. They can be delayed for seconds, or as long as needed. And what I observed was that some callbacks tend to trickle: every few tens or hundreds of milliseconds, you have callbacks constantly trickling in from lightweight activity. This tends to disturb the CPUs that would like to be idle but are constantly getting woken up, and it confuses the SoC's power management features as well. The SoC thinks that the system is active.
And so it doesn't go into the SoC-wide deep idle state. That's what we observed. And this behavior is independent of memory: even if you have tons of memory on the system, it will still happen. So we're kind of doing work when we don't need to, and that was the point: how do we make this a little more lazy? Some example workloads we noticed are Android logging and Chrome OS video playback. These are lightweight activities that were doing a lot of RCU. In the video playback case, there are graphics buffers that use the open and close syscalls, and those trigger RCU as well. So I wrote a tool just to see what's going on in the system from an RCU standpoint. It's based on BPF; it refreshes every five seconds and shows the callbacks that are queued and executed. This is how I got an idea of how bad this is, and whether we're doing work that we maybe don't need to. So how does this work? The main idea is that we need to avoid going to RCU. I came up with different revisions of making this work, and every version kept breaking some corner case. Finally, I settled on an approach after discovering that RCU already has a per-CPU list called the bypass list. So let's talk about what the bypass list was already used for. Say we have a system with two CPUs, and consider that time is flowing as shown by the arrows. Each CPU has its own callback list: when you request RCU's services, you queue a callback, and it gets saved in that list. The thing is, in some configurations of RCU, it is possible that callbacks are queued on one CPU and invoked on another. To synchronize this, there is locking, and that can add a lot of lock contention. This can cause performance overhead due to frequent callback queuing and invocation. And so there's a list called the bypass list that was added, and that's a per-CPU list as well.
Basically, if a lot of callbacks are getting queued onto the main list, that triggers the bypass, where new callbacks will now go into the bypass list instead. That relieves the lock contention on the main list. Obviously, we can't keep callbacks in the bypass list forever, so eventually the bypass callbacks have to be flushed back into the main callback list. There are different situations where we have to do this flushing. One is a timer, which goes off once enough time has passed and we want those callbacks to make progress. There's rcu_barrier(), which makes sure that all RCU callbacks queued before the barrier are executed before the barrier completes. And then there's the case where the bypass list is too long: we don't want to keep filling it up, or we might eventually run out of memory, so there's a check where if it's too long, we flush it as well. So my idea was: let's just use the same bypass list for the lazy callbacks. Doing this, I observed that on these low-CPU-utilization use cases we were saving 10% to 20% power compared to without the feature. The whole RCU activity was causing issues, and doing it this way helps us save power. As for the changes from the bypass mechanism: the timer now goes off only after 10 seconds if the callbacks are lazy, and we could reuse all the other code in the bypass mechanism. Also, if any non-lazy RCU callback is queued, we flush all the lazy ones: if we're triggering RCU anyway, we might as well take care of the lazy callbacks too. So that's another situation where we do this flushing. OK, so far hopefully it's clear how it works. Any questions so far? Comments? All right. So now the memory reclaim part.
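To recap the mechanism before the reclaim discussion, here is a minimal user-space sketch of the queuing and flush rules just described. This is not the kernel code (which uses `rcu_segcblist` and related structures); all names, the list representation, and the `BYPASS_LIMIT` threshold are invented for illustration.

```c
#include <assert.h>

/* Toy model of one CPU's RCU callback lists.  The real kernel tracks
 * actual callback objects; here we only count them. */
#define BYPASS_LIMIT 4   /* flush when the bypass list grows past this */

struct cpu_cblists {
    int main_len;    /* callbacks waiting on the main list */
    int bypass_len;  /* lazy callbacks parked on the bypass list */
};

/* Move everything from the bypass list back to the main list. */
static void flush_bypass(struct cpu_cblists *cl)
{
    cl->main_len += cl->bypass_len;
    cl->bypass_len = 0;
}

/* Queue one callback; lazy ones sit on the bypass list until a
 * flush condition fires. */
static void queue_cb(struct cpu_cblists *cl, int lazy)
{
    if (lazy) {
        cl->bypass_len++;
        if (cl->bypass_len > BYPASS_LIMIT)  /* list too long: flush */
            flush_bypass(cl);
    } else {
        /* A non-lazy callback triggers RCU anyway, so take the
         * parked lazy callbacks along with it. */
        flush_bypass(cl);
        cl->main_len++;
    }
}

/* The timer (10 seconds in the lazy case) also flushes. */
static void timer_fired(struct cpu_cblists *cl)
{
    flush_bypass(cl);
}
```

The point of the model is that lazy callbacks never start a grace period on their own; they only make progress when something else (timer, overflow, or a non-lazy callback) forces a flush.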
In the current implementation, if there's stuff in the bypass list, we have a shrinker that is used to see if the system wants memory back, and if that's the case, we flush all the lazy callbacks. This is what the shrinker implementation looks like. The count function goes through all the CPUs, adds up the number of lazy callbacks on each CPU, and returns the count. And the scan function looks like that: the main thing is we're told how many objects need to be freed, we go over each CPU's list, and as soon as we run out of callbacks to move forward, we break out of the loop. So it's pretty simple. And this mechanism is terrible for detecting memory pressure. There are many issues with it, and this is where I want your input. First of all, RCU doesn't really know how much memory is stored per object. All we're dealing with is functions: we have to run a function after a grace period. Also, in the traditional sense, and correct me if I'm wrong, shrinker users typically cache objects for performance reasons because they might need them later, so when the shrinker gets called, they can just say, OK, we can't cache them anymore. But with RCU, it's more like garbage collection. The other thing is that we're going through each CPU and flushing all the lazy callbacks, so we may often scan more, free more objects, or move more callbacks forward than requested. And the scan doesn't really free memory immediately; it triggers grace periods. So if a grace period takes very long, we're kind of lying to the system that we returned these objects when the grace period hasn't even completed. So the shrinker is a terrible mechanism. We're mainly using it for detecting memory pressure, and there's no better way that we've found of doing that. And the whole shrinker batching mechanism can actually make RCU do more work.
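The count/scan shape just described can be sketched as a user-space model. This is a hypothetical simplification, not the kernel's shrinker code: an array stands in for the per-CPU lazy callback counts, and `NR_CPUS` is an invented constant. It shows the two problems mentioned above: flushing is all-or-nothing per CPU, so the scan can overshoot the request, and "scanned" callbacks are merely moved forward, not actually freed until a grace period later elapses.

```c
#include <assert.h>

/* Toy model of the lazy-RCU shrinker's count/scan pair. */
#define NR_CPUS 4

/* count_objects(): sum the lazy callbacks across all CPUs. */
static unsigned long lazy_count(const long lazy[NR_CPUS])
{
    unsigned long count = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        count += lazy[cpu];
    return count;
}

/* scan_objects(): flush whole per-CPU lists until the request is met.
 * Because each flush takes a full CPU's list, we may move more
 * callbacks than nr_to_scan asked for, and nothing is freed here;
 * the flush only queues the callbacks behind a future grace period. */
static unsigned long lazy_scan(long lazy[NR_CPUS], unsigned long nr_to_scan)
{
    unsigned long moved = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        moved += lazy[cpu];
        lazy[cpu] = 0;               /* flush this CPU's bypass list */
        if (moved >= nr_to_scan)
            break;                   /* request satisfied: stop early */
    }
    return moved;
}
```

For example, with per-CPU counts {5, 3, 7, 2}, a request for 6 objects flushes two whole CPUs and reports 8 "freed", even though zero bytes have actually been returned yet.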
Again, correct me if I'm wrong, but the batching in the shrinker exists because the shrinker wants to do a certain amount of work at a time: instead of doing a lot of work at once, we can split it into batches and do a little at a time. But for RCU, that might actually mean more work, because for every batch we might trigger a new grace period. So it's not ideal. So what can we do that is better than the shrinker? Is VM pressure a better signal? Can we use that? This is an old topic that I think has come up at many conferences. One thing that's not really clear to me: you try to delay callbacks as much as possible, but that means there might be a lot of memory sitting in callbacks waiting to be freed, right? And you don't want the system to go and struggle because of that. So I don't think the shrinker is something that will help you all that much, because we proactively reclaim for reasons that are not even remotely close to being out of memory. So maybe something that would help is just to hook into the page allocator path where we start struggling. For example, we have several stages of hot and slow paths, and we have that should_reclaim_retry thing that checks how hard we should really try, because that might be an atomic allocation. So maybe that would be a better entry point for you, I'm not sure. And then I think what you want to do is just flush everything, right? You don't want to be particularly picky about a subset of CPUs, or is there something that would help there? Yeah, I think so. Typically the work RCU has to do if you flush everything is about the same; it's just that we might execute more callbacks than we'd like. We do have the option of only flushing part of it, right? Like I was showing in the scan function, we break out of the scan loop: the shrinker gives us a count, and we only try to satisfy the number we're given.
Yeah, but once you're doing that work, it doesn't make sense to do just a subset of it, only to be woken up later because the memory pressure keeps adding up. Also, does it help to distinguish between the kswapd and direct reclaim cases? Because kswapd is not local, so maybe you don't want to process callbacks that are distant from that NUMA node. That's not really clear to me. Yeah, we don't do any NUMA considerations at all. Callbacks that are queued on one CPU's list are not necessarily freeing up memory for that particular node or that CPU. OK, so then doing anything per-CPU-specific might be completely misleading for the memory reclaim process. So maybe just try to flush everything once there is a need for memory. OK, I'll look into that. Thank you. Joel? The other option we were thinking about is, if the system is struggling all the time, we just have some heuristic that turns this off completely. So we don't do lazy at all if things are really bad; some kind of exponential backoff heuristic. Yeah, just keep in mind that there are workloads that are reclaiming all the time. Like a page cache, streaming I/O kind of workload that is just putting memory in and reclaiming it from the tail, because that memory will not be used again. That can make reclaim happen all the time. Also, there are people doing proactive reclaim. OK, I see. Yeah, OK, I'll look into that. Joel, one question. So you are flushing the RCU callbacks because you suspect that some of them are like kfree_rcu or something like that, which basically holds on to memory? Yeah, they might be pinning memory that we need to give back to the system. Right, so maybe somehow annotate those RCU callbacks which are actually going to free memory. We have no idea how people use it. But you have an API, right?
kfree_rcu, for example, is an API where it usually frees the memory. OK, kfree_rcu, yeah, I'll be talking about that. But people use the regular call_rcu API to free memory as well. Maybe just provide some interface to tell you that this RCU callback is actually going to free memory. But I mean, don't 99% of RCU callbacks end up freeing memory? No, some of them are just wakeups, and people use it for all kinds of use cases. Yeah, they use it in locking as well, where they switch from a slow path to a fast path. There's all kinds of crazy stuff, but you're right, most of the use cases are for freeing memory, and that's why this makes sense. So we had to find those few use cases that don't free memory and use a new API for those, and everybody else becomes lazy. That turned out to be much better than converting everyone. And then there are other ideas, like if we could somehow get feedback from the memory subsystem that executing callbacks did something for memory, maybe we could use such a signal in some way, although that's probably very difficult to implement. Oh, the other thing I wanted to ask you about is the shrinker scan function. We'll look into the allocator and all that, but for the shrinker itself, since we're not really freeing objects, should we just return zero? Because I'm told some shrinkers, even though they're doing something about memory, just return zero, because they're not really freeing a certain number of objects. Would that make the shrinker mechanism behave better in some way? No? So it looks like we should just get rid of the shrinker completely and do something better. All right, so the last part is kfree_rcu. kfree_rcu is similar to lazy RCU, but it's for pointers. With kfree_rcu, we know for sure that we're freeing memory, because the API gets a pointer. And this API can be used for both slab-allocated objects and vmalloc objects. And we also have support now for calling this API.
We had to rename it to something else so that people wouldn't shoot themselves in the foot. But there's an option where your object doesn't even need an rcu_head anymore, and we'll take care of allocating memory to do the tracking and so on. This is very useful for objects that are really small, where you don't want to waste space on the rcu_head; you can just call this API. But it's important to keep in mind that it may sleep. So this whole kfree_rcu mechanism: I worked on the initial patch, I think, three years ago or so, and then Uladzislau Rezki (Vlad) took over. He did a great job; he built it out and improved it. So thank you, Vlad. And how does this work? In a simplified diagram, we have a free list and a busy list. The free list is basically a list of pages. Every time you call kfree_rcu, we take the pointer to the object and put it in a page, and when we run out of pointer slots in a page, we go to the next page in the list. We have some pre-allocation logic that allocates these pages and uses them as needed. And we have timers that flush the free list to the busy list. Then we wait for a grace period, and after the grace period, we call kfree_bulk. That's the big advantage of this: we can pass the page of pointers to kfree_bulk to free the objects, which is supposed to be more optimal than calling kfree on each pointer. So that's basically how it works. The advantage is nice cache locality: we don't need to walk a chain of objects, because linked lists have terrible cache locality. We can avoid that because we're dealing with a page of pointers, and we can use the kfree_bulk APIs, as I said. The disadvantages of this scheme: that kfree_rcu might sleep is not that great. Why? It may allocate while we are freeing. It's kind of weird that we have to allocate when we're trying to free memory.
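The page-of-pointers batching described above can be sketched in user space. This is a heavily simplified model, not the kernel's kvfree_rcu code: `PTRS_PER_PAGE`, the struct names, and `fake_kfree_bulk` (standing in for the real kfree_bulk()) are all invented, and the grace-period wait between filling a page and draining it is not modeled.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of batching pointers into a page for bulk freeing. */
#define PTRS_PER_PAGE 8

struct ptr_page {
    void *ptrs[PTRS_PER_PAGE];
    int nr;                       /* pointers recorded so far */
};

/* Stand-in for kfree_bulk(): free an array of objects in one pass,
 * with good cache locality (no pointer chasing through a list). */
static int bulk_freed;
static void fake_kfree_bulk(size_t nr, void **ptrs)
{
    for (size_t i = 0; i < nr; i++)
        free(ptrs[i]);
    bulk_freed += (int)nr;
}

/* Record a pointer; when a page fills up, the real code moves it to
 * the busy list and grabs the next pre-allocated page. */
static int record_ptr(struct ptr_page *pg, void *p)
{
    if (pg->nr == PTRS_PER_PAGE)
        return 0;                 /* page full: caller takes next page */
    pg->ptrs[pg->nr++] = p;
    return 1;
}

/* After a grace period has elapsed: free the whole page at once. */
static void drain_page(struct ptr_page *pg)
{
    fake_kfree_bulk((size_t)pg->nr, pg->ptrs);
    pg->nr = 0;
}
```

The design point is that a dense array of pointers gives the allocator one contiguous batch to free, instead of one cache-missing callback per object.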
And we are careful: if the allocation fails, we call synchronize_rcu(). We don't sleep in the allocator; we instead call synchronize_rcu(), and that's horrible, but it's only supposed to happen in very extreme cases. The shrinker here has similar disadvantages to the ones we mentioned, but at least in the kfree_rcu case we know we're freeing memory; in the regular call_rcu case we have no idea. So maybe we can do something better here. I was actually doing research and thinking it would be great if, at least for slab objects, we could integrate RCU into the slab allocator itself. I was living in a fantasy world for a few days while I was researching this, and I was very disappointed that it's much harder than I thought. In an ideal world, we could have the slab defer starting grace periods until they're really needed. This is what FreeBSD does, by the way: it has integrated its version of RCU into its slab allocator. And the nice thing about that is, by the time the slab decides it needs to do something about RCU-freed objects, a grace period might have already passed, so it doesn't even need to go to RCU. We have mechanisms now in RCU, APIs where you can query RCU about whether a grace period has passed between point A and point B. But I was looking into SLUB, and I was pretty disappointed that SLUB actually uses explicit free lists, where it modifies the object, putting a pointer at the beginning that points to the next free object. And how on earth can we make that work: if we RCU-free an object and then modify it while there's a reader still accessing it, there's no way that works. So, any big ideas? But even if that weren't the case, we wouldn't be able to just put the objects onto the free list, because then somebody might allocate them before the grace period passes. So we would still have to manage that somehow. That's true.
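The grace-period query APIs mentioned above do exist in the kernel (get_state_synchronize_rcu() and poll_state_synchronize_rcu()). Here is a toy user-space model of the idea, assuming a simple monotonic counter in place of the kernel's grace-period sequence; the names and counter semantics are simplified for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* A counter stands in for the global count of completed grace periods. */
static unsigned long gp_seq;

/* Snapshot at "point A": a full grace period after this point means
 * the counter must advance past its current value. */
static unsigned long get_state(void)
{
    return gp_seq + 1;
}

/* At "point B": has a full grace period elapsed since the snapshot? */
static bool poll_state(unsigned long snap)
{
    return gp_seq >= snap;
}

/* Stand-in for a grace period completing elsewhere in the system. */
static void grace_period_elapsed(void)
{
    gp_seq++;
}
```

This is how a hypothetical slab integration could cheaply notice that some RCU-freed objects are already safe to reuse, without ever starting a new grace period itself.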
So the idea was, if the slab, whatever it was doing, had some metadata for every object, then we could store some information there saying that this object is not ready yet. But unfortunately, it looks hard to do. I mean, we can't touch the object, right? So SLAB actually doesn't have this problem, because with SLAB you have an array that points to the free objects. Yeah, but if you put it into the array again, somebody might allocate it right away. Unless, yeah, the array could have some metadata. Yeah, you could have some metadata. I think for the SLAB_TYPESAFE_BY_RCU type-safe caches, we actually do add some extra header for this, if I remember correctly. But those are caches created in advance with this flag. If we started to do it in all the kmalloc caches, suddenly they wouldn't be so nicely aligned. And I'm responsible for making this alignment a documented property, so if we added this header, it would blow up the usage because of the alignment. And we don't want to do this for objects that are not RCU-freed; we shouldn't affect use cases that don't need it, right? That's the other hard part. Yeah, but there might be another reason why we might consider adding the arrays to slab, which I will present later today. So maybe this would be another use case too. Oh, wow, OK, wonderful. But then, if you have an array: the nice thing about this architecture of the slab is that you can have any number of objects, right? There's no limit; well, you have the page as a limit. But with SLAB, if I understand, because it's an array, you might run out of slots, and then you have to flush it. Yeah, OK, cool. Let's talk more about that. If we could pull this off, we wouldn't need a separate shrinker; the slab would decide about shrinking. But as Michal was saying, maybe we can just add support in the memory subsystem for doing that. Yeah, there's no need.
These are all the advantages I already mentioned of kfree_rcu. But oh, I'm sorry, this is the advantage of the slab integration, if we were to do it. We wouldn't need a separate mechanism; there's a lot of code in kfree_rcu, and we wouldn't need the pages and the filling of them with pointers and all that. You could just tell the slab: hey, this is a free request, but by the way, it's an RCU free request. That would get rid of all this code we have on top of the slab. And it might be faster, because maybe the slab could just drop a whole page or something if it was full of objects that were RCU-freed, I don't know. There would also probably be no need for an rcu_head, because we're telling the slab that this is an object we need RCU-freed, and the slab will take care of freeing it, whereas right now in the kfree_rcu code we have an rcu_head in every object. Disadvantages: it increases the slab complexity. And it does not apply to freeing vmalloc objects, because the kfree_rcu code is actually kvfree_rcu: there are macros that define the same API more than once, and so that mechanism can also free vmalloc objects. So, oh, yeah, you mean? OK, OK, thank you. Maybe there are other disadvantages, I don't know, but cool. So yeah, thanks a lot. The lazy RCU feature, I'm hoping it will be useful for other people. There are people on the cloud side, in data centers and so on, who are also interested in saving power. This is available from the 6.2 kernel onwards, so maybe it can help others as well. Just a question: do I need to invoke RCU in some special way to use the lazy feature, or what's the API? So yeah, at the last LPC, we got very good feedback from Thomas Gleixner saying: don't introduce new APIs and confuse kernel developers. So we made this work completely within the existing call_rcu mechanism. The laziness kicks in automatically, without any changes.
But we had to add a new API for all the people who are not freeing memory and want their callbacks to run really quickly. That's called call_rcu_hurry. I didn't choose that name; I was thinking of "expedited", but that means something else. Yeah. Could you go back to the memory reclaim slide, please? I got a request from some GPU folks, and they find the current slab shrinker API insufficient. What they want is for the VM to export two generation numbers: one the youngest generation number, the other the oldest one. You're not on the page. Which page exactly? Memory reclaim. Memory reclaim, that's the one, sorry. So basically, I think they have a similar problem: they can't compare GPU memory objects with host memory in terms of hotness and coldness. What they want to do is put recent GPU memory into the youngest generation, and when the host starts reclaiming, the host will notify the GPU driver: OK, I'm reclaiming this generation, and everything older than this generation should be removed. This way, they wouldn't over-reclaim. Also, they could sort their internal objects using the VM's scan generation numbers. So I wonder whether that would be helpful for your use case. Yeah, the thing is, we don't care about hotness or coldness, because like Michal was saying, it's better to flush everything, right? Because we're not really caching in the sense of objects. You don't want to flush everything unnecessarily, right? Right. So if, like I say, the memory pressure is not that bad, you probably want to delay it. Yeah, right, OK. Thank you. Thank you, everyone. All right.