Good afternoon, or evening, or night in Europe. So regarding the tiering side of things, this started out with reading the TMO paper and how Meta determines how much memory a certain workload needs based on its working set size — whether it has enough memory to run without stalling. Basically, the idea of proactive reclaim on the memcg side. I think Google is using a similar technique, and about three months ago they proposed the idea of not only doing it per-memcg but also doing it system-wide. It makes sense not only for demotion: you get better-sorted LRUs, so you can make better decisions when you do come under memory pressure. One of Dave's ideas was introducing a per-node reclaim knob where you specify the amount of memory to reclaim, and I wanted to see what the room thought of that in general — system-wide proactive reclaim. One of the interfaces proposed was extending the debugfs knob that the multi-gen LRU brings; another was writing to the root cgroup to do system-wide reclaim. In my opinion, the most flexible alternative right now for system-wide — and I'm not a memcg person — is to pass a node mask and then start simple, like the memcg interface: don't differentiate between types of memory, just set the node mask, set the amount of memory to reclaim, and go from there. I know there are a lot of dragons in reclaim, so I just wanted to get a feel for this.

Okay, I've got a mic. One thing is, I don't think going through debugfs is the way, because we really do not want to make debugfs any kind of API, officially or unofficially — because unofficially means officially. With respect to what the interface looks like — a node mask, versus sysfs where we have a per-node structure, trigger reclaim on a particular node, and do that in parallel on multiple nodes, versus a single mask provided through a single interface — I don't have a strong opinion. It feels slightly better to have it per node and then do things in parallel, because then you can define your reclaim target much better. What does it actually mean to say I want to reclaim 100 pages from this node mask? Should I prefer some nodes over others? So that would be my thinking: use the global sysfs node structure — we have that — have a trigger for each node, and express the node mask by running several processes in parallel. That sounds like the easiest interface to me, and probably the hardest one to screw up.

I would add that for memcg we just added an interface where you can request a number of pages to reclaim on a per-memcg basis. We haven't added the node mask yet because there are still open questions about how we generally treat tiered memory when that comes into play. For memcg we just do round-robin reclaim, and that might actually be enough; it's not clear yet. You're demoting from the top tier to the second tier, and you're also reclaiming from the second tier to backing storage; how much you need to control the amount of pressure applied per node is not entirely clear. So I wonder if it might make more sense to stick with the same approach we use for memcg, where you start out with a number of pages, it's applied round-robin to all the nodes in the system, and we can add a node mask later.
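For concreteness, here is a minimal sketch of what the per-node-trigger shape could look like from user space. The per-node file /sys/devices/system/node/nodeN/reclaim is hypothetical — no such knob exists upstream; only the semantics (write a byte count, the kernel tries to reclaim that much) are borrowed from the per-memcg proposal:

```c
/* Sketch: trigger proactive reclaim on several NUMA nodes in parallel.
 * Assumes a HYPOTHETICAL per-node knob at
 * /sys/devices/system/node/node%d/reclaim that accepts a byte count,
 * mirroring the semantics of the per-memcg reclaim file. No such
 * per-node file exists upstream; this only illustrates the
 * "one trigger per node, run them in parallel" interface idea. */
#include <pthread.h>
#include <stdio.h>

static void *reclaim_node(void *arg)
{
	int nid = (int)(long)arg;
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/reclaim", nid);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return NULL;
	}
	/* Ask the kernel to try to reclaim 512 MiB from this node. */
	fprintf(f, "%lu", 512UL << 20);
	fclose(f);
	return NULL;
}

int main(void)
{
	int nids[] = { 0, 1 };	/* the "node mask", one thread per node */
	pthread_t tid[2];

	for (int i = 0; i < 2; i++)
		pthread_create(&tid[i], NULL, reclaim_node,
			       (void *)(long)nids[i]);
	for (int i = 0; i < 2; i++)
		pthread_join(tid[i], NULL);
	return 0;
}
```

The appeal of this shape is that "reclaim X from this node mask" decomposes into independent per-node targets, so user space, not the kernel, decides how the work is weighted across nodes.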
If we determined we needed node control, we would add it to the memcg interface and the global interface at the same time, with the same semantics, just so it's easier.

That was another point I wanted to raise: are both interfaces going to be kept up to par feature-wise? As things get more complex, do we want the memcg interface to stay up to par with the system-wide interface?

I don't see anywhere they would fundamentally diverge. For us, there isn't really any "global" in our systems, because everything is compartmentalized into cgroups. You have system management software, you have the workload itself, and you have supporting software for the workload; there's no such thing as global. So anything somebody might want to apply on a global basis we would probably also want to apply on a per-cgroup basis. I would expect them to match up all the way through.

And I guess it's also too late — well, not too late, but not really the idea — to have one single interface for both memcg and system-wide, just passing different values to the same file. I see that the memcg side is pretty much upstream now; it's not there yet, but it should be in the next release. I was also hopeful that, given folks liked proactive reclaim at the memcg level, it translates — it makes a lot of sense to do it system-wide too.

I think the system-wide interface has value, because there are two use cases I don't think I mentioned on the list. The first is VM migration: before we migrate a VM, we want to free as many pages as possible so we don't have to migrate those free pages, and that requires system-wide proactive reclaim. There's something called free page reporting: the guest kernel reports free pages to the VMM, and the VMM then knows those pages are free and doesn't need to worry about migrating them. That's the first use case. The second, which is really rare, is suspending to disk. Part of that is done by firmware; the firmware knows how many pages are in use and how many are free, so if we have a lot of free pages, suspending to disk is a lot faster. That's all, thank you.

Yeah. I don't think adding a global interface is controversial, is it, if I'm reading the room right? It's just about how to structure it.

Probably the first one, I think — sorry, what's that company again, Canonical, right? Canonical, yes. They are interested in the first use case. The second use case, I guess nobody really cares about, because nowadays nobody really suspends to disk.

There are? What? No? Why not?

We're not doing suspend to disk; we're still suspending to RAM. It's different: RAM has to stay powered, so if you unplug the battery while the system is suspended to RAM, you lose everything. That's one advantage of suspending to disk. But power management is getting better and better. And — sorry, I probably shouldn't show you guys this — this MacBook M1, even suspended to RAM, you can leave it for a month, it can stay open, it's still powered, and you still get everything back. That's probably off-topic, but that's where we want to get the kernel to; at least from the Chrome OS side, that's our real competitor. Sorry, thanks.
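To make the "same semantics in both places" idea concrete, here is a minimal sketch. The memory.reclaim write is the real cgroup v2 per-memcg interface discussed above (write a size; the kernel tries to reclaim that much, round-robin across nodes); the cgroup path and the global /proc/sys/vm/reclaim knob are invented for illustration — no such global file exists:

```c
/* Sketch: proactive reclaim via cgroup v2's memory.reclaim, the
 * per-memcg interface discussed above. The cgroup path
 * "/sys/fs/cgroup/workload" is made up for the example, and the
 * global knob written at the end is HYPOTHETICAL: it shows what a
 * system-wide interface with matching semantics might look like. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void try_reclaim(const char *path, const char *req)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return;
	}
	/* The write fails if less than the requested amount could be
	 * reclaimed. */
	if (write(fd, req, strlen(req)) < 0)
		perror("write");
	close(fd);
}

int main(void)
{
	/* Real (cgroup v2): reclaim ~1G from one workload's memcg. */
	try_reclaim("/sys/fs/cgroup/workload/memory.reclaim", "1G");

	/* Hypothetical global knob with the same semantics, as argued
	 * for in the discussion; the path is invented. */
	try_reclaim("/proc/sys/vm/reclaim", "1G");
	return 0;
}
```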
Regarding that, I think there is also one more VM-migration use case, when you have certain devices passed through. What you do is actually suspend the virtual machine to disk, so that the VM saves all the information about the pass-through devices and how to initialize them, and once the VM comes back up it re-initializes the pass-through devices. They wanted to speed up the suspend phase too, and they were also looking into reclaim, because obviously the faster you can suspend the VM, the faster you can then migrate the VM itself. So there are multiple use cases on the VM side.

My motivation is demotion, but it's nice that even without tiering this is still beneficial.

I was just thinking about the per-node reclaim, and again, from the GPU use case I can imagine that perhaps in the future it could be useful for us, but at the moment that's a bit of an imagined use case; we haven't really identified one on the systems we have. So I'd be okay with leaving that for a future improvement, but I'd certainly like to see it kept on the table.

Okay, yeah, fantastic — I thought there were going to be a lot more people against it. The other thing I wanted to talk about is testing, across all the tiering work that's going on. Hardware availability is obviously one of the main problems, but I'm finding that a lot of ideas are being tossed around without any proper numbers. For example, hot page selection: I like the patch set, but it comes with only one benchmark, and that concerns me because I'm seeing that become the standard for this stuff. So what can we do to improve the situation, so that we can just throw workloads at a patch set and get better information out of it? I've been hacking at mmtests, expanding it so that I can know more or less when a workload is consuming enough memory to spill out of DRAM and into whatever node sits below it. Right now that's doable, but it's very experimental; it does help determine whether tiering is actually working. My idea is to add more tests to mmtests that trigger that kind of behavior, without just blindly consuming memory and ending up swapping all the time, which is what would traditionally happen. There's also — and this is actively debated — the question of representing, of exporting to user space, the kernel's view of the different tiers. I don't see that, for example, adding a procfs or sysfs file for that sort of thing is so bad: one, because it's going to be there anyway, and two, because it would help this sort of testing, letting us make more inferences about the memory we're using. Does anybody have anything negative to say about that, or better ideas for improving testing with tiering? Because I don't think we can make any decisions, here or in the future, without proper benchmarks to see where we are.

We have some POWER systems with coherent GPUs that we could potentially test this on — as in, I could potentially test this on them. They're not generally available, but if you need specific benchmarks or something like that run on those systems, that's something I'd be prepared to help with.

That's what I have.
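As an illustration of the sizing problem — knowing when a workload is big enough to spill out of the top tier — here is a minimal sketch of the kind of helper such a test could use. Treating node 0 as the top (DRAM) tier and the 1.2x overcommit factor are both assumptions:

```c
/* Sketch: size a test allocation to spill out of the top memory tier.
 * Reads MemFree for node 0 (assumed here to be the top/DRAM tier) from
 * /sys/devices/system/node/node0/meminfo, allocates 1.2x that, and
 * touches every page so the excess must be demoted (or swapped, on an
 * untiered system). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static unsigned long node0_free_kb(void)
{
	FILE *f = fopen("/sys/devices/system/node/node0/meminfo", "r");
	char line[256];
	unsigned long kb = 0;

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		/* Lines look like: "Node 0 MemFree:  12345678 kB" */
		char *p = strstr(line, "MemFree:");

		if (p) {
			sscanf(p, "MemFree: %lu", &kb);
			break;
		}
	}
	fclose(f);
	return kb;
}

int main(void)
{
	size_t sz = node0_free_kb() * 1024ULL * 12 / 10;  /* 1.2x top tier */
	char *buf = malloc(sz);

	if (!buf) {
		fprintf(stderr, "allocation of %zu bytes failed\n", sz);
		return 1;
	}
	/* Touch each page; pages beyond DRAM capacity spill downward. */
	for (size_t off = 0; off < sz; off += 4096)
		buf[off] = 1;
	printf("touched %zu MiB\n", sz >> 20);
	free(buf);
	return 0;
}
```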
I was going to say, you know, I've heard the idea of off-lining the CPUs on a NUMA node, say on a two-socket system, for this testing. Why don't we just put that out there: when we have one of these patch series, show the numbers with that setup and say this is the first step? I don't know if you think that's a reasonable thing to do.

I do, and we can just start by using PMEM, and we can get more advanced as hardware becomes available. But we need something to test tiering better; that's it. All right, thank you.
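For reference, the CPU-offlining trick can be done entirely through the real sysfs CPU hotplug interface. A minimal sketch, assuming node 1 is the socket being turned into a CPU-less, memory-only node and that it runs as root:

```c
/* Sketch: offline every CPU on NUMA node 1 so it becomes a CPU-less,
 * memory-only node, approximating a second memory tier for testing.
 * Uses the real CPU hotplug sysfs files; needs root. That node 1 is
 * the node to strip of CPUs is an assumption. */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	DIR *d = opendir("/sys/devices/system/node/node1");
	struct dirent *de;

	if (!d) {
		perror("node1");
		return 1;
	}
	while ((de = readdir(d))) {
		unsigned int cpu;
		char path[128];
		FILE *f;

		/* CPUs on the node appear as entries named "cpuN";
		 * "cpumap"/"cpulist" fail the numeric parse and are
		 * skipped. */
		if (sscanf(de->d_name, "cpu%u", &cpu) != 1)
			continue;
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%u/online", cpu);
		f = fopen(path, "w");
		if (!f)
			continue;	/* e.g. a non-hot-pluggable CPU */
		fputs("0", f);
		fclose(f);
		printf("offlined cpu%u\n", cpu);
	}
	closedir(d);
	return 0;
}
```

The premise, per the discussion, is that a CPU-less node approximates a slower tier without needing special hardware, so numbers gathered this way can accompany a patch series even when real tiered systems are scarce.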