Okay. Thank you for joining my talk. This talk covers the updates and future plans of DAMON. My name is SeongJae Park. First of all, a notice: the views expressed herein are those of the speaker and do not reflect the views of his employers, namely Amazon. The download link for these slides is available in the reply to the original talk proposal mail, which was also sent to the mailing list and is linked from the LSF/MM/BPF schedule Google doc. As I said, my name is SeongJae Park, and you can just call me SJ or whatever is easier for you to pronounce. I'm currently working as a kernel development engineer at AWS. I'm primarily interested in memory management and parallel programming, and I currently maintain DAMON.

So, an overview of the talk. I will recap DAMON development up to LSF/MM/BPF 2022, because this is my first time giving a DAMON talk at LSF/MM/BPF. Then I will cover the updates in DAMON development made between LSF/MM/BPF 2022 and 2023. Then I will share the short-term plans, which I am aiming to do in 2023, and some longer-term plans that are not yet scoped. And finally, some discussion.

First of all, the stack of DAMON-related components looked like this as of May 4th, 2022, which was LSF/MM/BPF 2022. I will go very quickly, without the details. What DAMON primarily provides is access frequency monitoring of all the memory regions you have. It monitors the access frequency of each memory region and tells you which address range is accessed how frequently. For this, we first need to define the address space that we will monitor. Then we need some operations for checking whether a region has been accessed or not.
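As a rough illustration of the monitoring idea described above, the following Python sketch counts how often each address range is seen as accessed. It is only a toy model, not DAMON's actual implementation: the names `Region`, `monitor`, and `fake_check` are hypothetical, and the access check is a stand-in for whatever address-space-specific primitive a real operations set would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    start: int            # inclusive start address
    end: int              # exclusive end address
    nr_accesses: int = 0  # number of checks that observed an access


def monitor(regions: List[Region],
            was_accessed: Callable[[int, int], bool],
            nr_samples: int) -> List[Region]:
    """Check every region nr_samples times and count the positive checks."""
    for region in regions:
        region.nr_accesses = 0
    for _ in range(nr_samples):
        for region in regions:
            if was_accessed(region.start, region.end):
                region.nr_accesses += 1
    return regions


# Hypothetical access check: pretend only [0x2000, 0x3000) is being accessed.
def fake_check(start: int, end: int) -> bool:
    return start < 0x3000 and end > 0x2000


regions = [Region(0x1000, 0x2000), Region(0x2000, 0x3000), Region(0x3000, 0x4000)]
result = monitor(regions, fake_check, nr_samples=20)
for r in result:
    print(f"{r.start:#x}-{r.end:#x}: accessed in {r.nr_accesses} of 20 checks")
```

The output is the "from which address to which address, how frequently accessed" view mentioned in the talk. The access check is passed in as a callback so that the counting logic stays independent of the address space.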
Such operations can be very address-space specific, and they could use different primitives in the kernel or hardware. Therefore we separated that layer using the DAMON operations set registration interface, and we call the common operations that DAMON depends on the DAMON operations set. In this sense, we have made two DAMON operations sets for different address spaces, namely the physical address space and the virtual address space. Then, because monitoring overhead is one of the biggest problems in access monitoring, we also implemented core logic for trading off monitoring overhead against accuracy with a best-effort mechanism. By using DAMON, users were able to get good-quality monitoring results with low overhead, and therefore good enough information, and a way, for doing data-access-aware memory management optimization or system optimization.

However, because implementing such things in user space could be complicated in some ways and also inefficient in some ways, we decided to implement a new main feature of DAMON called DAMOS. DAMOS stands for Data Access Monitoring-based Operation Schemes, and it receives a memory action and a memory access pattern of interest. The feature then finds every memory region that has the pattern of interest and applies the action to the regions it found. In this context, the action could be page out, promote, demote, hinting to use transparent huge pages, or whatever. That also means we need a very good, optimal access pattern setting, and because finding that could be very difficult for some companies or individuals, we further implemented a kind of safeguard feature called quotas and prioritization.
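To make the "access pattern plus action" idea concrete, here is a minimal Python sketch of how a scheme could select regions from monitoring results. This is an illustrative model only: the `Scheme` fields (size range, access frequency range, minimum age) and the function `apply_scheme` are assumptions made for this example, and the action is just a label rather than a real memory operation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    start: int
    end: int
    nr_accesses: int  # monitored access frequency
    age: int          # intervals for which the frequency stayed similar


@dataclass
class Scheme:
    min_size: int
    max_size: int
    min_nr_accesses: int
    max_nr_accesses: int
    min_age: int
    action: str  # e.g. "pageout"; only a label in this sketch

    def matches(self, r: Region) -> bool:
        """Return True if the region has the access pattern of interest."""
        size = r.end - r.start
        return (self.min_size <= size <= self.max_size
                and self.min_nr_accesses <= r.nr_accesses <= self.max_nr_accesses
                and r.age >= self.min_age)


def apply_scheme(scheme: Scheme, regions: List[Region]) -> List[Tuple[str, int, int]]:
    """Return (action, start, end) for every region matching the scheme."""
    return [(scheme.action, r.start, r.end) for r in regions if scheme.matches(r)]


# Hypothetical scheme: page out regions of at least one page that were
# not accessed at all for at least 5 intervals.
cold_pageout = Scheme(min_size=4096, max_size=2**40,
                      min_nr_accesses=0, max_nr_accesses=0,
                      min_age=5, action="pageout")

regions = [
    Region(0x1000, 0x3000, nr_accesses=0, age=10),   # cold and old: matches
    Region(0x3000, 0x4000, nr_accesses=15, age=10),  # hot: no match
    Region(0x4000, 0x5000, nr_accesses=0, age=2),    # cold but too young
]
todo = apply_scheme(cold_pageout, regions)
print(todo)
```

Separating `matches` from `apply_scheme` mirrors the split in the talk between describing the pattern of interest and applying the action to the regions found.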
Using quotas, you can let DAMOS use only a limited amount of time, or apply the action to only a limited amount of memory. Within that limit, DAMOS prioritizes memory regions based on the monitored access pattern: for example, for the page-out action we prioritize cold pages, and for using THP we prioritize more frequently accessed regions. In addition to that, we implemented a watermarks feature, which can entirely turn DAMOS, and even DAMON, on and off depending on the current status of the system. For example, it can monitor the free memory ratio of the system and turn on a proactive memory reclamation DAMOS scheme if free memory is quite low, and vice versa.

Then we also implemented the DAMON application programming interface. Because DAMON lives in kernel space and is implemented as a framework, the main users of DAMON are DAMON API users in kernel space, that is, other kernel components or modules. Currently, all DAMON API users are ones that DAMON developers have developed anyway.
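The quota, prioritization, and watermark mechanisms described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the kernel's logic: `watermark_allows` uses a single hypothetical threshold on the free memory ratio, and `pick_for_pageout` skips regions that do not fit in the remaining size quota instead of splitting them.

```python
from typing import List, Tuple

Region = Tuple[int, int, int]  # (start, end, nr_accesses)


def watermark_allows(free_ratio: float, activate_below: float) -> bool:
    """Activate the reclamation scheme only when free memory is low."""
    return free_ratio < activate_below


def pick_for_pageout(regions: List[Region], size_quota: int) -> List[Region]:
    """Pick the coldest regions first, without exceeding the size quota."""
    picked = []
    remaining = size_quota
    # Prioritization: lower access frequency means higher page-out priority.
    for start, end, nr_accesses in sorted(regions, key=lambda r: r[2]):
        size = end - start
        if size > remaining:
            continue  # simplification: skip instead of splitting the region
        picked.append((start, end, nr_accesses))
        remaining -= size
    return picked


regions = [
    (0, 4096, 5),
    (4096, 12288, 0),
    (12288, 16384, 1),
    (16384, 20480, 9),
]

picked = []
# Hypothetical numbers: 5% of memory is free, the scheme activates below 10%.
if watermark_allows(free_ratio=0.05, activate_below=0.10):
    picked = pick_for_pageout(regions, size_quota=12288)
print(picked)
```

With a 12 KiB quota, only the two coldest regions are selected; the hotter regions are left alone even though they are also candidates. For an action like THP hinting, the same structure would apply with the sort order reversed.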
We first developed DAMON API user modules that provide a general-purpose user-space API, namely DAMON sysfs and DAMON debugfs, which provide user-space interfaces as an ABI using sysfs and debugfs. With the general-purpose user-space API, user space is able to implement general-purpose DAMON applications on its own. But we also thought that there could be not only general-purpose uses of DAMON but also special-purpose cases. For example, just for doing proactive reclamation, modifying, tuning, and configuring all the DAMON parameters could be too much. Therefore we started implementing special-purpose modules. At that time we had implemented one kernel module only for DAMON-based proactive memory reclamation, which provides kernel parameters as a simpler interface. Based on that, we also implemented our user-space tool DAMO, which stands for Data Access Monitoring Operator, and Alibaba has implemented their own user-space tool called DATOP. Those are all the existing user-space tools I currently know of, but hopefully there are more; if you know of any, please let me know.

Anyway, that was what DAMON was one year ago. To recap, DAMON was merged in 5.15 and supported the virtual address space, and we implemented the DAMON debugfs user interface in 5.15. Then, with 5.16, we merged the DAMON operations set for the physical address space, DAMOS, and all the features I just described.

Then, the updates in DAMON development since 2022. I'd like to introduce what I have done in the last year. The first feature developed in the last year is online tuning. Before that, DAMON didn't support online tuning of its parameters, so if you wanted to change any of the parameters, you had to entirely turn DAMON off and then
turn it on again, start monitoring again, and apply the schemes again. So we implemented the online tuning feature by allowing DAMON sysfs and DAMON reclaim users to update parameters online. This made user-space-driven, feedback-based auto-tuning possible, because with this feature the parameters can be tuned online based on findings at runtime.

Then we implemented more DAMOS actions, called LRU prioritization and LRU de-prioritization, which mark the pages in the given memory region as active and inactive. With this, we also implemented a special-purpose module called LRU sort, which finds hot memory regions and marks them as active, and vice versa. With this we were able to reduce the memory pressure stall information (PSI) time by about 20 percent with some experimental workloads. Of course, it could be different in the real world.

We also implemented a sysfs-based feature called DAMOS tried regions, which allows users to get all the detailed information about the regions that DAMOS has found from the monitoring results. That is, if you ask DAMOS to find regions having a specific access pattern and do something, and you ask DAMOS to let you know the tried regions, then DAMOS exports all the detailed information about the regions it found that have the access pattern of interest via sysfs. Using this, users can debug DAMOS schemes in great detail. Also, because DAMOS supports not only actions that are disruptive or make some real change, but also a statistics-like action that makes no real changes, users can do a kind of access-pattern-based query, that is, efficient retrieval of monitoring results.

Then we implemented a feature called DAMOS filters. It allows users to describe what kinds of pages should be filtered in or out from the schemes. Currently we support the anonymous
page type and cgroup types. Using this, users can further describe whether they want a given DAMOS scheme to be applied only to anonymous pages, or only to non-anonymous pages, or only to pages of specific cgroups, or to all pages except those of specific cgroups, and any combination of these is possible. So as of today, the stack of DAMON-related components looks like this. We have put three more components on the stack within the last year.

Now the future plans in the short term, which I want to do hopefully within 2023. First, I'd like to implement write-only monitoring. For this, we want to make yet another DAMON operations set for a special address space, which will count only writes. This is expected to be useful for devices with a limited total number of writes, like flash devices, and for live migration target selection: for example, we can find a VM guest that is not writing much and select it as a live migration candidate. We can also think about applying kernel same-page merging to memory that does not have many writes. This was actually suggested by the community, not by ourselves. You might think this can be implemented by adding a new operations set that uses soft-dirty PTEs as its primitive, and actually a patch set doing exactly that was once posted. I gave some feedback on the patch set, and then, to my understanding, it was postponed due to a soft-dirty PTE issue at that time. I'm unsure if the issue is still going on; if someone knows, please let me know.

The second project that I want to do in 2023, and this is actually what I'm currently most interested in, is feedback-based auto-tuning. This was motivated by Meta's work, Transparent Memory Offloading, and by our own experiments with auto-tuning of DAMOS. The problem is that when using DAMOS schemes, having the optimal aggressiveness of the scheme is very important, because
if we fail to set the optimal aggressiveness, the scheme can overwork or underwork, and therefore could cause unexpected overhead or fail to make enough of the expected changes. We already have the aggressiveness control mechanism for DAMOS, called quotas, that I explained before. However, finding the optimal aggressiveness and then translating that finding into exact quotas is difficult. Therefore our idea is to let DAMOS receive feedback about its current aggressiveness and then adaptively tune the aggressiveness based on that feedback. Even once that is implemented, we will still respect the pre-existing quotas, namely the time quota and the size quota, as hard limits, so users of the existing DAMOS interface will not see a difference. In the future, we can also think about supporting DAMOS self-collectable metrics as feedback, for example pressure stall information values, the number of page faults, and other events that we can already get from inside the kernel. In that way, I believe this can result in a nearly self-tuning DAMOS system; in other words, maybe one step toward the kernel that just works.

Next, we also want to implement a DAMON reclaim control interface that can be used via the virtio-balloon interface with free page reporting. One of the widely expected DAMON reclaim usages is running it with the free page reporting feature inside guest VMs on a memory-overcommitted host, that is, memory overcommit management based on voluntary participation. Currently, guest users have to voluntarily enable and tune DAMON reclaim themselves, and again, finding the optimal tuning is not that easy for everyone, especially for individuals or small companies. It's also still a question how, even though it is a voluntary system, we would persuade our guest users to turn DAMON reclaim on on their own. And I think that because free page reporting
already has a way to be enabled from the host side, if the guest side also agrees, I believe that applying the same rules as free page reporting to the interface for DAMON reclaim could make sense in some way. I also think there could be some other general, already existing interface that DAMON could use rather than reinventing the wheel. I'd like to get some feedback from you on that if you can.

One other project that I want to do in 2023 is integrating the DAMON user-space tool, which is currently out of tree and maintained on GitHub, into the tree. I believe this could be beneficial for the kernel community and the users, because integrating it into the tree may help users understand the API usage of every new feature in the kernel, keep the two in sync, and therefore avoid breaking the ABI and tooling. I'm wondering if there are objections or other thoughts about this; if you have any, I'd like to hear them. So if you have any comments, please just interrupt me. One of the questions I have is: the out-of-tree version currently supports all existing kernels; should the in-tree version be different? Yes, please.

We went through this a little bit with the NVDIMM tools and the CXL tools. Those are out of tree, and on moving them in tree, people were not agreeing with that. I mean, you have to support all versions anyway, and the kernel shouldn't break old user space. So if you've ever released a tool, you can't break the ABI. If you break an existing release of your tool, then the kernel is broken. So I don't think you'd have any more ABI flexibility being in tree versus out of tree.

Yeah, that also makes sense.

It would have been convenient in some places, but I think the downsides made us stay out of tree, and we've been okay there.

So the downside, you think, is the size of the tree, or...?

No, it's that moving in tree causes more problems than it solves. Yeah, and
being out of tree forces you to have some discipline about versioning and ABIs, and you can't accidentally end up tying your kernel version to your tooling. I know perf does this; I know perf is the counterexample.

Anybody else have different thoughts?

We'd need to see a pretty strong argument for moving in tree, and I'm not sure I'm seeing it here.

And also, I wonder how distributors of kernels are going to package something if it's inside the kernel tree. It seems to be an odd place to get their source code from. I assume they've solved that with perf, though.

I have a couple of questions. How much of all of this is being used in production by organizations, and with what result? How useful is it proving? Are people saving lots and lots of money using it, et cetera?

So, to the best of my knowledge, some production systems are currently using DAMON reclaim for saving memory, and according to their reports they were seeing about 20 percent memory savings. There are also production people who are using the features for experiments and profiling.

So it sounds like it's early days yet. More adoption?

I don't think it's quite early days. I mean, yes, it's early days in terms of how many ways we can still go further. However, it's already widely used in production, and I don't think it's too immature to be used in production, at least in some cases.

Okay, go ahead.

So you had this point about write tracking, write-only tracking or something like that, a couple of slides earlier. I was wondering which guarantees you need. For example, if someone has a page pinned and writes to it, you wouldn't be able to track it using the page tables if you're not careful. Do you need to track each and every
write? Does it have to be 100 percent correct, or could there be false cases where you miss a write?

At this point we are aiming at using the sampling mechanism of DAMON, and therefore it will not be able to track all the pages.

So essentially what you would want to do here is make a page read-only, map it read-only, and when you get a write fault you would somehow realize that and use that?

I just want to reuse the soft-dirty PTE mechanism inside DAMON.

Okay, so that would essentially be doing that. Okay, thanks.

Just the quick usual reminder that that doesn't work too well for the device cases, where devices are migrating memory to and from a device, say on the PCI bus.

I'm sorry, I didn't get it.

Oh, so if you have a device that's using memory, and it's migrating memory from system RAM over to device memory across the PCI bus, then the whole soft-dirty thing is broken for that case. Because if you get a page fault, the device driver will pick up the CPU fault, and then that's an invalidate. If it doesn't hook into the page fault, it'll at least get the MMU notifier callback, right? So then it says, oh well, invalidate, so it invalidates everything on the device as well. Then the device, which was happily using the page on its own side, gets its page table entry unnecessarily yanked out from under it, and then it refaults. So you've just destroyed performance on the device, and it's a very heavyweight operation to restore; that is not as fast as on the CPU. So that whole mechanism is completely oblivious to the device case.

Good point. I think I should be aware of that while doing the implementation, and I will keep it in mind.

I do not have a question so much as an ask. At least for me, the bar to understand what DAMON is capable of is really high, because it's really complex memory management outside of memory management. I'm not sure I'm the only one, but I would really appreciate it if
there was at least something, not documentation as such, but something that gets all the loose ends together and puts together a coherent story: for example, a use case where this can be useful, where you just do tuning X, Y, Z, and this is what you get as a result. That might even be in the documentation in the kernel, or you could help a certain internet page come up with an article or blog post. That would be hugely appreciated, at least by me.

Yeah, actually we have a demonstration page for DAMON, which has some demo videos, animated GIFs, and some tutorials. Maybe you can use that, but I will try to improve it and advertise it further to everyone.

I wonder why DAMON is not implemented as a separate module but is built statically into the kernel.

There is no reason to implement it as a static thing, except that it uses some kernel functions that are not exported to modules. But at the moment it's implemented as a static part of the kernel.

If there are no more questions, let me proceed. I have about a couple of minutes... so, already done. Yeah, so that's it. Thank you very much.