So the overall idea here is to look at the different kernel components and see whether we should start exploring the possibility of improving them using hardware counters. What I intend to cover in the next few slides is some of the experiments I have been doing with hardware counters on our PPC platforms. I have slides later giving some details about what these counters are and how they look. The primary motive is to share with the rest of the community what I have learned, and then to discuss whether we should start making changes in the Linux kernel so that some of these areas can accommodate hardware counters.

These are the three top-level areas where I ended up spending my time: multi-generational LRU (MGLRU), transparent huge pages, and page promotion.

With MGLRU, I posted an RFC that goes into the high-level details of how I envision MGLRU using hardware counters. At a high level, MGLRU consists of multiple generations, so the idea is: can we use hardware counters to classify pages into the right generation? A primary challenge with hardware access counts is that most of these architectures give you the ability to know how many accesses have happened to a page, but with respect to MGLRU what we are really interested in is the relative hotness of a page, so that we can place it in the right generation. It's not about absolute hotness; it is about how hot a page is compared to other pages in the system. One of the big challenges I ran into while looking at MGLRU is how to estimate the relative hotness of a page, its relative access count, using hardware counters.

There are a few challenges with that, and quite a lot can be improved using better approximation algorithms, but here is what I ended up doing: for each lruvec (per memcg, per NUMA node), you look at a set of pages in the youngest and the oldest generations and come up with an estimate of the maximum and the minimum access counts. It's an approximation, because going through all the pages in the lruvec would be really, really heavy. Once you have the approximate maximum and minimum access counts, then whenever you look at a page, you use that max and min to classify it into the right generation. If you have four generations, the range between min and max can be split into four groups, and a page with access count y gets placed into one of those four groups, that is, one of the four generations. We could also use different mechanisms like k-means clustering, which takes a sample of access counts, creates four clusters, and then classifies each page you later look at into one of those clusters. There are side effects to using more complex, compute-intensive work in this classification; I have a slide on what I learned by trying the simple classification and a complex classification like k-means.
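As a rough illustration of the simple min/max bucketing described above (not the RFC code; classify_gen() and its arguments are names made up for this example), the classification could look something like this:

```c
/*
 * Hypothetical sketch of min/max-based generation classification.
 * min_count and max_count are the approximated access counts sampled
 * from the oldest and youngest generations; nr_gens is the number of
 * MGLRU generations (typically 4). Returns 0 for the coldest bucket
 * and nr_gens - 1 for the hottest.
 */
static unsigned int classify_gen(unsigned long count,
				 unsigned long min_count,
				 unsigned long max_count,
				 unsigned int nr_gens)
{
	if (count <= min_count)
		return 0;
	if (count >= max_count)
		return nr_gens - 1;

	/* Linearly map (min_count, max_count) onto the buckets. */
	return ((count - min_count) * nr_gens) / (max_count - min_count);
}
```

Since count is strictly below max_count in the final expression, the result stays below nr_gens. The appeal of this scheme is that it is O(1) per page, which matters because, as discussed below, the classification ends up running under the lruvec lock.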
Another interesting detail about MGLRU is that it has a sorting phase during reclaim, sort_folio(), where MGLRU looks at the generation stored in the page flags to move folios across generations. Those page-flag generation values are populated in a couple of places. While handling the folio_referenced() check for a page, the look-around logic examines nearby PTEs, gets the PFNs and their hotness, converts that to a generation, and stores it in the folio flags. Similarly, for architectures like x86 that support a hardware-maintained accessed bit, there is a mechanism that walks all the mm_structs in a memcg, scans the page tables, looks at the accessed bit, classifies those pages into the youngest generation, and stores the youngest-generation value in the page flags. The generation values stored in the page flags are then used to move folios into different generations during the sorting phase.

What I ended up doing in my changes was, instead of using the page-flag generation, to use the access-counter-based classification I described to place folios in the right generation. There are a few advantages to doing that. One is that you can completely skip the page table walk, because we are not looking at the accessed bit to derive generations. If you don't need the page-table look-around during rmap walks, that cuts some of the cost of the rmap walk; and if you decide you don't want to store generations in page flags at all, you can avoid that and save those bits. There are approximately six bits used in the page flags that could be saved this way.

Having said that, I also want to look at some numbers, and I want to make sure we understand that optimizing reclaim is really, really tricky: it's heavily optimized already, and different workloads behave differently. One of the things I tried was a test with MongoDB. As you can see, when I used hardware counters, the performance dropped; even though the graph shows what looks like a large drop, it's a 1.3% drop. There is extensive reclaim, as can be seen on the next slide: there is a heavy workingset refault count. Even though the refault count decreased by around 2.8%, performance wasn't helped much. This was with a RAM-disk swap device. When I went to NVMe, the reduction in workingset refaults (which says we are able to classify pages much better using the hardware access counts, because we are not refaulting pages the way we do with the accessed bit) mattered more: with an NVMe swap disk, the cost of making a wrong decision is higher, and hence I see some improvement in performance. You could always argue that's in the noise range; it's 0.2% on MongoDB. But on the refault side, I saw a 7% reduction in refaults with the NVMe swap disk.

I also did another experiment, with memcached. As I said, one of the things I ended up doing was using hardware counters in the sorting phase; but we could also kill the folio_referenced() lookup completely, stop looking at the page-table accessed bit at all, so that the entire rmap walk can be avoided.
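To make the sorting-phase change concrete, here is a minimal sketch of the idea. This is not the actual mm/vmscan.c sort_folio(), whose signature differs across kernel versions; hw_counter_read(), lruvec_min_count(), lruvec_max_count() and lru_gen_move_folio() are illustrative placeholders, and classify_gen() is the helper from the earlier sketch:

```c
/*
 * Sketch: during the MGLRU sorting phase, place a folio according to
 * its hardware access count instead of the generation cached in the
 * folio flags, so no accessed-bit harvesting (and no rmap look-around)
 * is needed to feed the sort.
 */
static bool sort_folio_hw(struct lruvec *lruvec, struct folio *folio)
{
	unsigned long count = hw_counter_read(folio_pfn(folio));
	unsigned int gen = classify_gen(count,
					lruvec_min_count(lruvec),
					lruvec_max_count(lruvec),
					MAX_NR_GENS);

	if (gen > 0) {
		/* Hot enough: move to a younger generation, keep it. */
		lru_gen_move_folio(lruvec, folio, gen);
		return true;
	}

	/* Cold: leave it in the oldest generation for eviction. */
	return false;
}
```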
When I tried that, killing the rmap work entirely, performance improved a lot: there's a 6% improvement in memcached. But there were some interesting observations. The red box is the advanced classification, the k-means clustering, and I also have a flame graph that shows why performance dropped there. Here is the interesting thing that happened: when I completely removed the folio_referenced() rmap work and depended entirely on the hardware access count, the refaults actually increased dramatically. I am yet to find out why. The performance also improved dramatically, but the refaults increased dramatically too. My thinking at this point (I haven't collected traces to confirm this) is that it's probably the initial stage, where the hotness information is not yet good enough to classify pages; and because we are looking only at the hotness, we consider all the pages reclaimable and reclaim them, hence the large increase in the refault rate. That's my theory at this point, but I really don't know; otherwise, the performance improvement is not explained, because we get a 6% improvement if I completely remove the rmap work.

So that's the MGLRU side. My observation is that optimizing multi-generational LRU is really, really tricky because of the cost involved in estimating relative hotness. This is the flame-graph view of MGLRU. As you can see, there is folio_referenced(), the rmap work; that's one big chunk, and then there is pageout(), the other big chunk. Isolating folios is not that big a chunk, but folio_referenced() is one of the big chunks we have. If I use the hardware counters, you can see this bar is where we are reading them, so that does come at a cost: we spend some time reading the hardware counters. But we are also able to reduce the number of hot pages in the oldest generation, and the folio_referenced() rmap work shrinks considerably, because sorting now classifies pages better and there are no longer many hot pages in the oldest generation. On the other hand, I am now doing extra work while holding the lruvec lock, so you can see the lruvec lock starting to appear: increased contention on the lruvec lock. That's one of the challenges with doing this in MGLRU; if you try to do too much, the lock contention keeps going up. As I said, when I tried the advanced classification, the iterative, compute-heavy k-means clustering, my lock contention actually shot up. The rmap work completely disappeared, because I'm not using the accessed bit at all, so ideally this should have bought a lot of performance; but since I'm doing a lot of work while holding the lruvec lock, the contention increased. So those are some of the learnings I have on MGLRU and hardware counters; we should probably look at whether this is really worth doing or not.
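For reference, the "advanced classification" mentioned above, k-means over a sample of access counts, could look like this one-dimensional version (my own sketch, not the patch code); it also makes clear why running it while holding the lruvec lock hurts:

```c
#define NR_CLUSTERS	4	/* one cluster per MGLRU generation */
#define MAX_ITER	8

/*
 * Minimal 1-D k-means over sampled access counts. Produces cluster
 * centers in roughly ascending hotness; a page is later classified
 * to the nearest center. Note the repeated passes over the samples,
 * which is the compute cost discussed above.
 */
static void kmeans_1d(const unsigned long *samples, int n,
		      unsigned long centers[NR_CLUSTERS])
{
	unsigned long long sum[NR_CLUSTERS];
	unsigned long lo = samples[0], hi = samples[0];
	int cnt[NR_CLUSTERS];
	int iter, i, c;

	/* Seed the centers evenly across the observed value range. */
	for (i = 1; i < n; i++) {
		if (samples[i] < lo)
			lo = samples[i];
		if (samples[i] > hi)
			hi = samples[i];
	}
	for (c = 0; c < NR_CLUSTERS; c++)
		centers[c] = lo + (hi - lo) * c / (NR_CLUSTERS - 1);

	for (iter = 0; iter < MAX_ITER; iter++) {
		memset(sum, 0, sizeof(sum));
		memset(cnt, 0, sizeof(cnt));

		/* Assign each sample to its nearest center. */
		for (i = 0; i < n; i++) {
			int best = 0;

			for (c = 1; c < NR_CLUSTERS; c++) {
				unsigned long s = samples[i];
				unsigned long db = centers[best] > s ?
					centers[best] - s : s - centers[best];
				unsigned long dc = centers[c] > s ?
					centers[c] - s : s - centers[c];

				if (dc < db)
					best = c;
			}
			sum[best] += samples[i];
			cnt[best]++;
		}
		/* Move each center to the mean of its members. */
		for (c = 0; c < NR_CLUSTERS; c++)
			if (cnt[c])
				centers[c] = sum[c] / cnt[c];
	}
}
```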
Quickly moving to the other area I wanted to look at: this is one area where I actually found a lot of benefit. One of the big challenges with transparent huge pages is utilization. We have a single accessed bit in the page table entry tracking access to the whole 2MB mapping, but there can be access patterns where the entire 2MB is not accessed uniformly: there can be a lot of hidden cold pages inside a huge page, cold subpages hiding behind a hot mapping.

So the logic here is: when we are doing the rmap walk, we look at the page-table accessed bit and find that the 2MB page is hot; we then look at a utilization metric built from hardware counters and decide whether the folio should be reactivated, kept, or reclaimed. There are three options: reclaim it, put it back as active, or keep it so that it is given one more chance in the same oldest generation. We can use hardware counters to make this decision: the counters can count accesses at a configurable page size, at least on the PPC64 architecture, so we can count accesses to the 64K chunks within a 2MB range. We can then measure the sparseness of the accesses using something like the Gini index, which is commonly used to measure how unevenly a range of numbers is distributed, and use that to determine how well the huge page is actually utilized.

This picture gives a high-level view of what I'm talking about: a lot of cold subpages, very few hot ones, and a single accessed bit. Whenever those two hot subpages are touched, for all practical purposes the current kernel will consider the entire 2MB page a well-utilized hot page.
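As a rough sketch of the sparseness metric (my own illustration of the Gini idea, not the patch code; the 64K counting granularity follows the PPC64 description above):

```c
#define SUBPAGES 32	/* one 2MB region counted at 64K granularity */

/*
 * Gini coefficient, scaled to 0..100, of the per-64K access counts of
 * one 2MB region. 0 means perfectly uniform access; values near 100
 * mean a few subpages receive nearly all accesses, i.e. the huge page
 * likely hides cold subpages and is a candidate for splitting.
 */
static unsigned int access_gini_x100(unsigned long c[SUBPAGES])
{
	unsigned long long total = 0, weighted = 0;
	int i, j;

	/* Insertion-sort the counts in ascending order. */
	for (i = 1; i < SUBPAGES; i++) {
		unsigned long key = c[i];

		for (j = i - 1; j >= 0 && c[j] > key; j--)
			c[j + 1] = c[j];
		c[j + 1] = key;
	}

	for (i = 0; i < SUBPAGES; i++) {
		total += c[i];
		weighted += (unsigned long long)(i + 1) * c[i];
	}
	if (!total)
		return 0;	/* never accessed at all */

	/* G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n */
	return (unsigned int)(200ULL * weighted / (SUBPAGES * total) -
			      100ULL * (SUBPAGES + 1) / SUBPAGES);
}
```

A 2MB folio whose PMD accessed bit is set but whose Gini value is high would then be a candidate for splitting, with the cold subpages left for reclaim, rather than being treated as uniformly hot.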
For testing, it's really hard to find a workload that hits this specific condition; at least I haven't been able to replicate it with MongoDB- or memcached-type workloads. So we ended up writing a microbenchmark. I have a few processes writing to every page of each 2MB huge page, and a few processes writing to only four pages of every 2MB huge page. In the configuration I have, the write-every-page workload touches about 8GB, and the write-few-pages workload touches its four pages per 2MB less frequently. The write-few-pages threads run really fast, whereas the write-every-page thread touches every 64K chunk, so by the time it comes around to the next huge page some time has passed; I also don't run it that fast, and there is some wait time in between.

The numbers here are really good. I spent some time trying to explain this chart; it's slightly overloaded. The top X axis is the time taken when using hardware counters: as you can see, the entire test finishes in 70 seconds. The bottom X axis is the time taken without hardware counters: around 800 seconds. The reason there is no further swap-out with hardware counters is that we are able to identify the hidden cold pages and split those huge pages; once split, the cold subpages are swapped out and never swapped back in, because they really are cold and we never access them again, and hence the application doesn't thrash and there is no further reclaim at all. That's essentially what this slide is trying to convey. The red line is without hardware counters and the green line is with them: without hardware counters there is 19GB of huge-page usage (on the previous slide you can see the workload is still consuming around 26GB total, of which 19GB is huge pages), whereas with hardware counters the huge pages present in the system come to 7GB, which roughly maps to the 8GB of the write-every-page workload. It essentially says that using hardware counters we are able to detect hidden cold pages really, really well, and we complete the microbenchmark in about 70 seconds compared to 700-800 seconds without them. So I do believe this is one area where we can definitely find a use for hardware counters.

I also want to spend some time on page promotion; I think we also have some slides covering the same thing. One of the big challenges with page promotion is the cost of NUMA hint faults. I also believe one of the interesting things page promotion needs to do is relate the hotness of a page to the DRAM node it is being moved to. Essentially, what we want for page promotion is to pick pages from the youngest generation on the slow memory node (the hottest pages there) and compare their hotness with the hotness of the oldest generation on the DRAM node we are moving to. There's no point in moving a hot page to a DRAM node if that movement displaces a page that is hotter. So we need a relative comparison between the hotness of a page on one NUMA node and the hotness of a page on another NUMA node, and the lruvec doesn't help you achieve that: the current accessed-bit-based LRU only gives you the relative hotness of pages within a single lruvec. Using hardware counters, we can definitely compare the access count of the hottest pages on a slow memory node with the hotness of the oldest-generation pages on the DRAM node, and do the promotion only if the slow-memory page is hotter.

Unfortunately, I have not been able to run a test on this, because I don't have hardware that simulates it; the device-memory setups we have use DRAM as the backend, and I have been waiting to get a low-latency memory device that would let me evaluate the performance implications. But coding it is pretty simple. We could offload it to something like kswapd; we could look at a kpromoted that does this periodically, comparing hotness between nodes, as in the sketch below. I will come back to this if there's time.
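A minimal sketch of that cross-node comparison follows. All the helpers here (hw_counter_read(), oldest_gen_min_count()) are illustrative placeholders, and "kpromoted" itself is only a floated idea, not an existing daemon:

```c
/*
 * Promote a slow-tier page only if it is hotter than the coldest
 * page it might displace on the target DRAM node; otherwise the
 * migration cost buys nothing.
 */
static bool should_promote(unsigned long slow_pfn, int dram_nid)
{
	/* Hardware access count of the candidate slow-memory page. */
	unsigned long page_count = hw_counter_read(slow_pfn);
	/* Coldest access count in the DRAM node's oldest generation. */
	unsigned long dram_floor = oldest_gen_min_count(dram_nid);

	return page_count > dram_floor;
}
```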
During the mailing-list discussion, a few other things came up. Can DAMON be the one place where all these hardware counters get integrated, so that we avoid the rest of the kernel getting polluted with all the different mechanics? Different architectures are going to have different hardware-counter mechanisms. One of the things we actually implemented is a DAMON operations set, a hardware-counter-based paddr, and I have some evaluation later. But one of the big challenges I found with DAMON is the ability to evaluate how good it is: there's nothing like "here is the configuration, run DAMON and then MySQL or memcached or MongoDB with YCSB". It's not easy to evaluate DAMON because there's a lot of configuration that DAMON needs.

What I ended up doing, with the changes I made (essentially a paddr DAMON operations set where, instead of using the accessed bit, we use hardware counters; very simple, we switched straight over to hardware counters), was some evaluation. I used the DDoop test of DAMON, and there's a lot of variance once you use hardware counters, probably because of sampling and other details. But overall, for this test, memory usage is the criterion we care about: the idea is proactive reclaim, reducing the amount of memory used by the workload while making sure the time to execute doesn't change. We were able to reduce memory usage by around 12% by switching to hardware counters. So even if DAMON may not be usable for a generic workload like memcached, there are probably ways we can use it, and I think there is value in having DAMON operations exploit hardware counters, as this example shows.
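As a very rough sketch of what that hardware-counter-backed operations set could look like: the damon_for_each_* iterators and the callback shape follow DAMON's in-kernel API as I understand it, but hw_counter_read_region() is a placeholder and the registration details (a real ops set would need its own ops id) are glossed over:

```c
#include <linux/damon.h>

/* Illustrative: sum of hardware access counts over [start, end). */
unsigned int hw_counter_read_region(unsigned long start, unsigned long end);

/*
 * Sketch of a check_accesses() callback that fills in each region's
 * nr_accesses from hardware counters instead of the accessed bit.
 */
static unsigned int hw_check_accesses(struct damon_ctx *ctx)
{
	struct damon_target *t;
	struct damon_region *r;
	unsigned int max_nr_accesses = 0;

	damon_for_each_target(t, ctx) {
		damon_for_each_region(r, t) {
			r->nr_accesses = hw_counter_read_region(r->ar.start,
								r->ar.end);
			if (r->nr_accesses > max_nr_accesses)
				max_nr_accesses = r->nr_accesses;
		}
	}
	return max_nr_accesses;
}
```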
Since proactive reclaim was also the thing I looked at with the DDoop test, I also want to throw out the idea: is there value in exposing this to user space, something like an out-of-memory daemon? Do we want counters there so that these user-space applications can make intelligent decisions based on access counts?

That's mostly what I had. This is the last slide, the legal statement: these are my views and not IBM's. If there is interest, I could spend some time on the hardware counters themselves; otherwise, if there are any questions, I can take them. And I think there is one other set of slides from you, if you want to go through them. You want to speak?

Sure, thanks. I just want to take a few moments to expand one aspect of what Aneesh just mentioned: promotion. Promotion is one of the key challenges for memory tiering. As we can see, we now have multiple ways to identify hot pages on far memory nodes: page faults, traditional accessed-bit-based scanning, the new access counters, and the performance-monitoring units, PEBS and AMD IBS and so on. There could also be new hardware features coming down the line. Can you move to the next slide? Thank you.

So the question is: how can we have a unified framework to help us do promotion? Right now, the existing infrastructure is NUMA balancing, which just uses page faults; but can we have a more general framework that leverages all these hardware capabilities? Here is a thought I'm putting out: we can have an abstraction layer that harvests hot pages from the far memory nodes; this engine can make use of different back-ends, and it provides a list of hot pages so that a dedicated daemon can do the promotion, the same idea as the kpromoted Aneesh just mentioned. The benefit is that we can do this outside of the synchronous page-fault handler. Next slide.

We have actually done some early implementation of this, but only in user space, because we wanted to leverage PEBS sampling, which is much easier to program from user space, and at the same time use accessed-bit-based scanning, both kinds of signals, to identify the hot pages. The way we do it is to stream all these page-access events to user space through BPF and its buffers, which is not very cheap; user space then makes the decision, identifies the hot pages, and tells the kernel which pages to migrate. This actually works pretty well. We have more details on the implementation in our paper at ASPLOS this year. To be clear, the hardware we used is Intel Optane as the lower-tier memory, and we move pages back and forth between that and the DRAM.

The question, on the next slide, is: can we bring this idea into the kernel, so that we get a better-performing implementation that other people benefit from as well? One idea is that we could extend NUMA balancing for this, but one concern is that hardware counters like the access counters are not really organized around VMAs; they are more PFN-based, so NUMA balancing may not be the best fit. So initially extending MGLRU to use the access counters might be a good idea: in particular, we can leverage the generations, especially the youngest generations and their timestamps, to help rank the hot pages. Then we can have background kernel threads doing this kind of asynchronous promotion, and even apply policies specific to a particular job. So that's what I have regarding hardware-based promotion; I wonder whether there's any feedback on this.

Yeah, I think overall what we are looking for is: should we start getting these changes into the kernel somewhere or other? Or is it too early? Should it all be done in DAMON, or should it be completely in user space?

One of the observations the last time we talked about hardware assistance was that user space might already want to use those counters itself, and we didn't want people to upgrade their kernel and suddenly find they can't run their analysis tools anymore because the kernel has consumed all the performance counters. What do you think about that?

That's true if it is performance counters, but it's not true if there are dedicated counters like we have in PPC64. If there is interest, I can walk through the counter. It's a region where there is a 64-bit counter per PFN, and the count value gets increased based on the hotness and decreased based on time.

Okay, so this is really for a new class of counters; there's no perf legacy for these counters?

Exactly, it's a memory region. You can index into that region by PFN: every PFN gets a 64-bit counter, and you can go there and look at the access count. The count decays based on time, so if you are not touching the page for some time, the count value decreases. It's completely independent of the PMU and other things.
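From that description, reading such a counter might look roughly like the following: a sketch under my own assumptions about the layout, where the base pointer, the mapping, and the names are all illustrative rather than a real PPC64 interface:

```c
/*
 * Illustrative: a mapped region holding one 64-bit counter per PFN.
 * Hardware increments a slot on access to the corresponding page and
 * decays it over time, so a low value means the page has gone cold.
 */
static u64 __iomem *hw_counter_base;

static u64 hw_counter_read(unsigned long pfn)
{
	/* Each PFN indexes one 64-bit slot in the counter region. */
	return readq(hw_counter_base + pfn);
}
```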
And I believe that's one nice direction: we should probably encourage other hardware vendors to also start looking at the possibility of having such counters, because as I showed with THP utilization, there's definitely value in doing this, and I do believe that page promotion will also see really good results with counters like that.

Because your slide had PEBS and other things on it, so I guess what we're saying is...

Correct, correct, yeah.

...maybe we do the dedicated stuff, and figure out the PEBS and other perf-complex stuff as a separate discussion.

Yeah, and I want to clarify that we used PEBS because that's the only thing available. PEBS is actually not really ideal for page promotion, but it does help on current hardware, particularly because, at least at that time, Intel let us filter the accesses down to only the Optane devices, which gives us very targeted accesses to monitor. But if there are better counters, like this IBM access counter, that are all in hardware and not shared with other use cases, that would be even better. So we are definitely looking for this kind of new capability from other vendors.

Hello. Yeah, thank you very much, Aneesh, for sharing the hardware-counter-based DAMON operations set implementation and test results. I think such a hardware-counter-based DAMON operations set could be a very good idea; it's actually what I wanted to implement on my own when I got a time slot. I always wanted to do it, but I just didn't have the time. So I think it could be a very good approach for DAMON, and I'm looking forward to the RFC patch set, or a patch set, for it.

Sure, I can start sending them across. But one big challenge with DAMON is the ability to use it for a generic workload like memcached or MongoDB: the inability to come up with default values that are useful for doing some kind of proactive reclaim or promotion. I mean, one of the things I was looking at is: can DAMON do promotion? Can DAMON be the vehicle through which we implement promotion?

Yeah, that's also a very important part, and we have some ideas for that, including feedback-based auto-tuning, auto-tuning of the monitoring parameters and the scheme parameters, and adding DAMON actions for promotion and demotion of memory. There are some ongoing ideas; I think we can continue the discussion after this conference. But having the hardware-counter-based operations set could anyway be very useful by default: even if we don't have good default parameters or auto-tuning-like features, it could benefit profiling uses of DAMON, I think.

At what point does something that's kind of proven out in the DAMON space move over to the generic MM? Does that question make sense? I'm kind of DAMON-ignorant, but is this something we'd want everybody to use all the time, versus having to say, oh no, you need this special DAMON setup, and for your workload you need these settings? At what point do things become DAMON plugins versus default kernel behavior? Is that a consideration?
Yeah, that's also to be discussed. To get to that stage, I think we would need some more discussion about the default config and default behaviors, and whether to enable DAMON by default or not. But at this point, I think we can start from the...

What Dan is also getting at is: if you use DAMON for relative hotness detection and things like that, you're building an infrastructure parallel to our LRU; it's not making use of the LRU. For example, the proactive reclaim that Johannes did is memcg reclaim: it uses the memcg LRU for reclaiming, and just hints "go and reclaim some amount of memory" so that it can do a better working-set estimate. That's not really true with DAMON, because DAMON reclaims independent of the position of a page within the LRU; it has its own way of estimating things. So that's a confusion I had: if I'm doing page promotion, should it be a kpromoted, or should it be a DAMON plugin? Which way should we go?

Yeah, I feel like, regarding integration, a sticking point here is that now we have MGLRU, which would also use the hardware counters to sort pages into the different generations. DAMON is essentially doing a similar thing in the same space. So if you want to merge them, this is essentially the same problem we need to solve between the two.

Okay, I think we're out of time for this session.

Oh, I've got one more question.

Thanks for the floor. Okay, the next speaker is coming up. Thank you.