All right. User-space control of memory management: that's what I'll be talking about today. It's not a heavy or detailed presentation. It's mostly about soliciting input on a subject that, as was remarked in an earlier session, is probably going to be of some importance in the coming years: page placement in the face of memory tiering. So let's get started.
The involvement of user space in kernel decisions is, of course, a very old question. Where exactly the line of control between kernel and user space lies has gone back and forth a lot over the decades. I won't mention the word microkernel here. Recently, we've seen a bit of a move back to user space in some areas, like networking, or at least handing things off to probes, because of the flexibility of BPF. There's also been a demand for orchestrator-like interfaces that can manipulate other processes, like process_madvise, and there was a recent proposal for pidfd_set_mempolicy. So where exactly does that line lie, and what line exactly is useful?
What I'm going to do here is outline some prototyping that I have done, based on our discussions at Google, where we do have some experience with trying to push things out to user space a bit more. And then I'd like to get opinions and input: if I were to finish this and send it upstream as an RFC, how would you react? Would you like it? Would you say I'm totally barking up the wrong tree? That's interesting to me and to my co-developers.
I should also say I'm not really looking to replace existing mechanisms at this point. The idea here is more to restructure the code so that user-space actors who think they can do better than the kernel can attempt to do so. So the idea is to give user space another rope and see what it does with it.
So, a quick list of existing mechanisms that you're all familiar with. It's not complete, but it goes basically from less involved to more involved. There are the VM parameters that control various VM thresholds, swappiness, what have you. On a slightly smaller level, there are the cgroup limits, and related to that there is proactive reclaim. Then we have madvise, which has grown quite a long list of options, which in itself may be an argument that there should be something more generic that you can control from user space. The madvise options fall basically into two categories: setting a hint inside the kernel, or doing a one-shot action such as MADV_DONTNEED and some of its friends. Then there's mempolicy and mbind, dealing with NUMA. And without a doubt the most involved one, the one that actually diverts the code path to user space, is userfaultfd, which I won't be talking about. I did see a comment on the mailing list that userfaultfd is definitely important when you're talking about user space, and I agree; it's just not what I've been looking at in this context. If you want to discuss an interesting development that came out of userfaultfd usage, I would encourage you to attend James's HGM hugetlbfs talk that will follow shortly. I'm going to focus mainly on the madvise and mempolicy space here.
So the motivation here is, of course, how to squeeze the most performance out of applications in an environment where you have increasingly complex system architectures, such as memory tiering. What we have at Google is often containerized workloads: they have divergent patterns, there's multi-tenancy, and there's memory tiering, so far memory and swap, maybe, and then also zswap.
We have had some previous success with pushing things out to user space for far memory, where we had a far-memory daemon that, based on detailed information from the kernel, would make decisions on memory migration. That has been documented in some of the papers that we have written over the years. So what I'm going to do here is describe a general idea that I think may be useful when exploring this space and that I've been prototyping, and then hopefully get some feedback to see if people think this is a good direction to move in.
The idea is that you define a general hinting and control structure inside the kernel. It has to be easily accessible in all contexts, and we'll get to those contexts on a later slide. You pass that control information to BPF probes attached to tracepoints at strategic points in the kernel, where you want to make a memory-management decision, usually but not always about page placement. User space can then steer the probes via BPF map manipulation. The BPF maps will be mostly read-only as far as the probes are concerned: user space manipulates them, the probes read them. So the probe sees what user space wants, looks at the control information it gets from the kernel, comes to a decision, and passes its verdict to the kernel via a bit of writable memory that is passed along with the call. The kernel then takes the ball and runs with it.
So what is there currently in this space? We've got madvise, which sometimes does a one-shot operation but other times sets a VMA flag. And there's mempolicy, which has a mempolicy struct. It's accessible via the VMA, or via a special lookup for shared memory, where things are a little bit more complicated.
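The control flow described above (user space fills a map, a probe reads it together with the kernel's control information, and writes a verdict back) can be sketched as a plain-Python simulation. This is not real BPF and not the prototype's code: `ControlInfo`, `preferred_node_map`, `placement_probe` and `kernel_pick_node` are hypothetical names invented for illustration, and page placement by NUMA node is only one possible decision.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the control structure handed to the probe.
@dataclass
class ControlInfo:
    pid: int
    default_node: int        # what the kernel would choose on its own
    verdict_node: int = -1   # writable slot for the probe; -1 = no opinion

# Stand-in for a BPF map: written by user space, only read by the probe.
preferred_node_map = {}

def placement_probe(ctl):
    """Models a probe attached at a page-placement decision point."""
    node = preferred_node_map.get(ctl.pid)
    if node is not None:
        ctl.verdict_node = node  # pass the verdict back via writable memory

def kernel_pick_node(ctl):
    """Models the kernel side: call the probe, then act on its verdict."""
    placement_probe(ctl)
    return ctl.verdict_node if ctl.verdict_node >= 0 else ctl.default_node

# "User space" steers the probe by updating the map.
preferred_node_map[1234] = 1

steered = kernel_pick_node(ControlInfo(pid=1234, default_node=0))
untouched = kernel_pick_node(ControlInfo(pid=5678, default_node=0))
print(steered, untouched)
```

In this model a process without a map entry simply gets the kernel default, which mirrors the stated goal of not replacing existing mechanisms.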
Again, I'm not necessarily looking to replace these, just seeing what would happen if I implemented a new structure that tried to combine them in a way that BPF probes could act on. The structure won't be allocated or attached if the new framework is not used, so I'm not looking to use more memory or extend existing structures much, not at this point anyway.
The contents of the control structure could even be opaque to the kernel. It's not always clear where that boundary should be: if the structure is only meant for consumption by the BPF probe, then the kernel doesn't necessarily need to understand what's in it, as long as the BPF probe does. That would lead to a very simple interface to set things up, like tagging a VMA with an opaque value. It does lead to namespacing issues, though. For example, one of our earlier ideas was: we've got the anonymous VMA name for anonymous memory now, so why don't we try encoding hinting information into that? We quickly realized that several other teams within Google were already starting to use the anonymous VMA name, so we'd be stepping on whatever they had put in there. That means we would have to define a company-wide namespace for the string that is stored in there, and any BPF probe trying to use the information would have to do string parsing to get the right hinting information out of it. That's not great. So you are looking for an interface where you can say: I'm only going to use this for this particular purpose.
So, context. Some contexts in the kernel are a natural fit for accessing some sort of control structure. For example, if you're in a context where the VMA is available, you already have what you need, and the probe call is usually fine within those contexts.
You've got the VMA, so you just take a pointer out of it to, say, your control structure, and you have all the information you want. Other contexts are more problematic. In the reclaim path, you just have a page or folio list, and you don't know exactly what each page belongs to. You can figure it out, but it gets more complicated. There are a couple of options: you do an rmap-like lookup or, maybe worse, a page extension. There's another problem in that context. If you're walking through page lists and you call the BPF probe for each page, that's not optimal, especially not in a context like direct reclaim, where tail latency is a problem and you're going to add all these BPF probe calls. So there are some performance issues to consider there.
So, given that idea, what have we done so far? It's prototyping, mainly to see whether there are a couple of interesting things that I can do with some basic infrastructure, and whether I can do what mempolicy and madvise are already doing. Of course, if you can't even do what the existing infrastructure is already doing, then you might as well give up right now. So here are a couple of things that I implemented.
I modified the MGLRU access-bit scanner, the one that creates new, younger generations, as a proof of concept. Basically, it makes accesses by certain processes count more than those by other processes, so they essentially have a nice value, so to speak, comparable to the scheduler nice value, that keeps their pages artificially younger and less likely to be pushed down and eventually end up in swap. The practical value of that is questionable, but it's a nice proof of concept. It was straightforward to implement, and since it's a separate scanner, overhead is not that much of a concern.
The scan is done by walking mms and then VMAs, so the context is fine: you've got everything that you need. That was an easy one.
The second one is compressibility hints. As you may know, at Google we use zswap as end storage, not as front storage but as the actual end storage, for cold pages where we'd like to save memory. You store them compressed and you save some memory by getting rid of the original uncompressed page. Compressibility hints can be useful because you may end up wasting time trying to compress pages that just don't compress well enough; there's a certain ratio that you need for this to be useful at all. So sometimes you end up examining a bunch of pages that don't actually compress well, and you end up giving up. Now suppose the application had provided hints on those pages. For example, some applications have cold pages that they compress themselves, so they could provide a compressibility hint saying: don't even bother with these, you don't want to use zswap here, try something else. That's functionally not that hard to implement. I have not measured the overhead yet with the BPF probe active. I haven't done that for any of these; I still need to.
The initial NUMA allocation was also not that hard to do. I had to shuffle the mempolicy code around just a little bit so that I could always trickle the node up through the right code path from the BPF probe, but that also works. Basically, for certain processes in a BPF map, I faked it and said: you are only going to allocate from this particular node. At the same time I also set a memory policy for them, and it worked: the probe overrode that decision every time.
And lastly, a very, very simple case: madvise MADV_HUGEPAGE or MADV_NOHUGEPAGE. Essentially that was just changing the flag checks into a BPF call. This is probably the most boring one.
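The compressibility hint described above can be illustrated with a small user-space simulation. This is a sketch under assumptions, not the prototype: `zlib` stands in for whatever compressor zswap is configured with, and `MAX_RATIO`, `try_compress` and the hint flag are hypothetical names. It shows only the point made in the talk, that a hint lets the compressor be skipped for pages that are known not to compress well.

```python
import os
import zlib

PAGE_SIZE = 4096
# Hypothetical threshold: keep the compressed copy only if it is at most
# half the size of the original page.
MAX_RATIO = 0.5

def try_compress(page, hinted_incompressible):
    """Return (stored, compression_attempted) for one cold page."""
    if hinted_incompressible:
        # The application said "don't bother": skip the compressor entirely.
        return (False, False)
    compressed = zlib.compress(page)
    return (len(compressed) <= PAGE_SIZE * MAX_RATIO, True)

zero_page = bytes(PAGE_SIZE)         # compresses extremely well
random_page = os.urandom(PAGE_SIZE)  # effectively incompressible

print(try_compress(zero_page, False))    # compressed and stored
print(try_compress(random_page, False))  # work done, then given up
print(try_compress(random_page, True))   # hint avoids the wasted attempt
```

The second and third calls both end with the page not being stored; the difference is that the hinted call never runs the compressor, which is the wasted time the hint is meant to avoid.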
So I think this is an interesting direction to take, even if there are performance concerns. At the very least, I think it's an interesting vehicle to test how feasible it is to do some things in user space, or to try out certain placement policies or what have you in user space, maybe by putting a library on top of this that manipulates the maps. So my question to you is: what do you think? Does this make sense? Have you done something similar? So, yeah, tell me. I can't hear anything at the moment, so either there are no questions or I have been cut off from audio.
Yeah, I have kind of a general question. Why was BPF chosen for this? Is it just for prototyping, to prove those ideas, or is this the final solution that you are looking for?
Not necessarily the final solution; it's just that it's the most flexible one, essentially. It combines flexibility with still a certain amount of performance, since probes do get run directly in the kernel; there's no transition to user space there. I'm not wedded to that particular solution, but it seemed like a good starting point.
Right, so let me rephrase. Are you looking for flexibility in the final solution, or is the flexibility that you have with BPF needed for prototyping only? In other words, do you look at those controls as something where you would want user space to have the flexibility to modify the logic, so that two different customers might modify it in different ways, or are you looking at controls which are generic enough that everybody can use them the same way?
I see. I think having a BPF solution available that makes use of this framework is a good thing. I'm not sure whether there should maybe be other ways to use it, but having BPF hooks, I think, is a good thing. Yes.
Yeah, I guess one pushback you can run into is, of course, that this establishes certain API guarantees that are kind of tricky to maintain forever, especially when they are pretty fine-grained. I can understand that if you do not have cast-in-stone entry points for BPF, that would be easier to sell. But as long as you are trying to add fixed tracepoints, that might be a roadblock that is hard to get over without showing that there is absolutely no other way to solve your problem by existing means.
Yeah, that's interesting. I would imagine you're running into fixed-ABI issues one way or the other. If you're using BPF, you could use kfuncs, but then you're exposed to changing kernel interfaces. So, I don't know. Do you think there would be a better way to get a stable interface?
Yeah, I mean, if you define a proper user interface, then the bar is obviously quite high already, but you establish a certain well-defined operation rather than just an entry point where you do whatever you like. I guess the difference might not be all that big, but I'm a little bit afraid that making well-defined entry points into low-level memory-management functionality would be much harder than starting from a user-space entry point, which is more use-case defined, rather than "here we do aging, do whatever you like with that".
Sure, I see what you mean. So you're saying: keep the control structure and the hinting information in the kernel, but just have kernel code act directly on the information that you attached, and don't leave it to a BPF probe. Is that what you're saying?
Yeah, essentially yes.
Okay. Yeah, I understand that. This was definitely a point of discussion that we had; for now we've settled on the BPF method.
But you're right. It's a good argument, so we'll see where we land there.
I think you'd have had more interest in this idea if it had come up before process_madvise landed. They seem to occupy much the same space, in terms of "I can control what that process does" rather than a process controlling its own destiny. It doesn't necessarily feel entirely dead on arrival to me, and I'm curious to see where it goes. I wouldn't give up on it just yet, but it needs to demonstrate something for me to get excited about it.
Yeah, the next step here is for us to develop this further and to actually show applications that can use this framework and get tangible results. That will make people more interested. If we cannot get those tangible results, then obviously we are no longer interested in it either. So, yeah, I see what you mean. As for process_madvise, that is similar in a way, in that it means one process controlling memory-management decisions made in another process's address space. But this would allow a wide variety of policies in a wide variety of contexts. But I absolutely take your point: you've got to show a use case and then some results, or it won't go anywhere.
Frank, I might have missed this, but is there somewhere in your prototype where you populate these structures with the characteristics of the different NUMA nodes or memory regions? You talked about near and far memory earlier. Is there a concept where user space can look at the structure and go: okay, this region is near, this region is far, this region is whatever, when it's making these placement decisions?
Currently, if you want to set the structure up... excuse me. So, for the prototypes, it was essentially very, very simple.
I simply assumed a static node system. I populated the structure by default with the memory tiering information, and then user space provided the preferences it had for the nodes. So the kernel put in the defaults, and the BPF maps, as read by the BPF probe, provided the overrides, so to speak. So usually something will be initialized by default, although currently in the prototype, if you don't have a BPF probe attached, the structure is not even there; it doesn't get used at all, so then you don't need to do any association. But yes, there are some defaults that sit there; there isn't really anything special going on.
I see we're almost at the end anyway. What I'm certainly interested in is what others' views are in a memory tiering world. Remarks have been made that it's impossible for the kernel to get this completely right and that user space has to offer something more, so I look forward to a discussion about the ideas that other people have in this space. I'm going to explore this path for a little while longer, and then hopefully you'll see me come back to the mailing list in a couple of months saying: this is what I did, and these were the exciting results that I got. What do you think now?
I'd just add that doing those little experiments with what you have might be really interesting input into the problem definition. So you can say: okay, I can achieve this by doing this; let's discuss how to do that properly with respect to future maintenance of that feature. And so I really like the BPF approach of probing the problem.
I'm just not really seeing that this would be the final destination for that particular feature. But maybe there are policies that are so hard to define in the general case that, in the end, we might just land on the side of, let's say, high-level and well-described entry points where you can change the behavior. We discussed that a couple of years back with respect to OOM killer decisions, where it's really hard to define what your best strategy is, and in the end we just gave up on that. But that might be one example of something that can be outsourced, because you just have to make some decision, and it really doesn't matter what kind of decision you make: you just tell us what to do and we do it. That is really hard to do from user space, because you are in such constrained conditions that it's hard to even read some statistics. So, yeah, I would just conclude from my point of view that playing with BPF is really interesting, but a final solution would really need careful thought so that we don't export too much and tie ourselves into a hard-to-develop situation.
I take your point. Thanks.
Okay, it seems that there are no other questions in the room. So, unless you have any last few words, thank you.
No, thank you.