So today I will present about DAMON, DAMOS, and damo. The title is a little bit weird, but anyway, this is an introduction to my small project called DAMON. Before starting: you can go to damonitor.github.io, which is the public website for this project. From there you can get an introduction and all the resources about it, the news, demonstration videos, and so on. These slides are also already uploaded to the Open Source Summit event website, so you can download them right now. My name is SeongJae Park; you can just call me SJ, or whatever is easiest for you to pronounce. I'm currently working at AWS and maintaining a small kernel subsystem called DAMON.

First of all, why access awareness matters. There are memory devices having different characteristics, such as registers, caches, DRAM, flash, disk, and tape, and they differ in capacity, latency, bandwidth, power efficiency, and cost. The faster ones tend to be more expensive and more power consuming. Modern systems therefore use them in a hierarchical manner for better efficiency and cost: an L1 cache on top, then an L2 cache, an L3 cache, DRAM, SSD, HDD, tape, and so on. In my opinion this hierarchy will only become more complicated with the new devices evolving now. For example, we can imagine zswap-like software-defined memory being put between DRAM and SSD, CXL memory-like devices being put somewhere before or after that, and also fabric-attached devices and network-based devices somewhere else in this pyramid-like hierarchy.

The second thing I am very interested in is the cost and importance of memory. Modern workloads are becoming more and more data intensive: machine learning workloads, cloud workloads, and big data workloads need much, much more data, and those are very important nowadays. Unfortunately, the price of DRAM has not dropped dramatically. For example, in a recently published paper authored by Meta, they report the high cost of DRAM in their data centers: about 33% of the cost and 37% of the power consumption of their data centers are due to DRAM. Given these trends, the use of devices with different characteristics and the increasing cost of DRAM, efficient memory management is critical for cost. So maybe the year of efficiency will not end with 2023, though I really hope it will be finished soon.

That said, efficient memory management sounds very simple, right? Keep important data close, and critical data closer, to the fastest memory devices. And what data is important? That is also a very simple question: the data that will be accessed frequently in the near future is what should be important. And how do we know that? Not by fortune telling or intuition, because those could be unreliable. I think it is better to make data-driven decisions: we can monitor the current data access pattern and predict the future from it.

There are some examples of such access-aware optimizations. The first one is access-aware transparent huge page (THP) collapse and split. The background is that using huge pages can increase performance by reducing TLB misses. However, it also increases the memory footprint due to huge-page-internal fragmentation, and using THP for only hot data can reduce the memory footprint while preserving the performance gain. There was an academic research work that implemented this, and it was accepted to a top-tier conference called OSDI.
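For context, the THP controls a stock kernel exposes today are global or per-mapping rather than hotness-aware. A quick sketch, using real sysfs and procfs paths from the kernel's THP admin documentation ($PID is a placeholder for whatever process you care about):

```
# System-wide THP policy knob (values: always / madvise / never):
cat /sys/kernel/mm/transparent_hugepage/enabled
# How much of a process is already backed by anonymous huge pages:
grep AnonHugePages /proc/$PID/smaps_rollup
```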
The second example is proactive reclamation. The Linux kernel currently reclaims memory only when memory pressure is found; in other words, it works in a reactive manner. Reactive memory reclamation under memory pressure can incur latency spikes, and it keeps an unnecessarily big memory footprint until the pressure comes. The idea was that proactively finding and reclaiming such cold pages can reduce those latency spikes and reduce the memory footprint in the normal case. There were two works based on this idea: the first was done by Google, and Meta has also made a work based on this called Transparent Memory Offloading. Both works were accepted to another top-tier conference called ASPLOS, and they are not just academic works but are currently being used in their fleets.

The idea sounds simple, but we found that there is a devil, or rather a daemon, in the details. So here comes the daemon. Unfortunately, monitoring data accesses is not an easy problem, because with the access monitoring techniques usually used before DAMON, the overhead is typically high, and it can grow arbitrarily as the size of the memory to monitor increases. For example, the overhead could be somewhat acceptable if the memory to monitor is only about one, eight, or ten gigabytes. But what if the size of the memory becomes terabytes, or even exabytes? Such high overhead can also affect the accuracy of the monitoring results themselves; I sometimes feel that something like the uncertainty principle applies here.

The Linux kernel has made a great trade-off for low overhead here: the mechanism called LRU, though it is not a real LRU. And even a real LRU might not be ideal in some situations. The current mechanism does not use fine-grained access tracking but just a coarse hotness classification: essentially, whether a page was accessed at least once or not, and whether it is active or inactive. There has also been a good improvement called MGLRU. That said, the access checks still happen reactively, on memory-pressure-like events, and I believe this is the biggest problem here: without periodic and frequent memory-pressure-like events, the accuracy of the mechanism could be bad in some cases. I'm not blaming the design decisions of the current mechanism; I believe it was a really great idea at the time, and it has worked very well so far. However, it could become more challenging given the trends I introduced at the beginning.

So here comes DAMON: what is it, and what does it provide to us? Conceptually, DAMON does periodic access checks of given memory areas. Using this, it can inform users of when which memory area has been how frequently accessed, so we can know which memory areas have been frequently accessed for, say, the last one minute, two minutes, or ten seconds. Why is it special? Because it is equipped with a simple but effective best-effort overhead/accuracy trade-off logic, called region-based sampling and adaptive regions adjustment. I don't want to go through the details of the mechanism today because it would only be time consuming, though I have prepared backup slides; if you want, please ask at the Q&A time. Briefly, the mechanism lets users set the minimum accuracy and the maximum overhead they can accept, and then DAMON makes a best effort for the lowest overhead and maximum accuracy under the user-provided constraints.
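On recent kernels those two constraints map to DAMON's monitoring attributes. A sketch via the DAMON sysfs interface, with the layout as in Documentation/admin-guide/mm/damon/usage.rst (details may differ across kernel versions):

```
cd /sys/kernel/mm/damon/admin/kdamonds
echo 1 > nr_kdamonds                 # create one DAMON worker
echo 1 > 0/contexts/nr_contexts      # ...with one monitoring context
cd 0/contexts/0/monitoring_attrs
echo 5000   > intervals/sample_us    # check one page per region every 5 ms
echo 100000 > intervals/aggr_us      # aggregate per-region counts every 100 ms
echo 10     > nr_regions/min         # lower bound on regions: the accuracy floor
echo 1000   > nr_regions/max         # upper bound on regions: the overhead ceiling
```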
This is one of the reasons I'm not introducing the details here: it is obviously an important part of DAMON, but it is not DAMON itself. There could be better future mechanisms for this purpose, and it could be replaced or left unused if needed in the future, though it is mandatory at the moment. Nevertheless, this is not magic, just a trade-off: if you set the minimum accuracy and maximum overhead too aggressively, it could become unexpectedly heavy. And how well does the best effort actually work? Maybe only the data can say.

So we have evaluated DAMON's overhead and accuracy. First, the overhead. We believe DAMON is lightweight, because a prior, pre-DAMON, page-granularity approach consumed 100% of a single CPU's time to scan accesses to 512 GB of memory every two minutes; that means it can scan accesses to about 4.26 GB per second. That was just the research from the first approach, and they have made great improvements since then, I should mention that. Compared to that, on a production setup, DAMON was able to scan accesses to 68 GB of memory every 50 milliseconds while consuming less than 1% of a single CPU's time; this means it can scan accesses to about 1.372 TB per second. Also note that DAMON has an upper-bound overhead that can be set regardless of the memory size. So this is not a fair comparison at all; I'm just giving the numbers. Please don't take them too seriously, they are just illustrative.

And we believe DAMON is accurate, because it shows reasonable monitoring results with some realistic benchmarks, including PARSEC3 and SPLASH-2X, as shown here. We also cannot formally prove it true or false, but at least it looks plausible, right? Also, on the production setup used for the overhead evaluation, which utilizes about 70 GB of memory, we were able to find a 7 GB working set and a 4 KB hottest region in the production workload. Identifying the hot memory regions from DAMON's results with our human eyes, and then modifying the program to mlock() those regions to protect them under memory pressure, achieved up to about a 2.5 times speedup under some artificial memory pressure. So we believe DAMON is both lightweight and accurate.

So, time for a demonstration. First, I will show you how you can use DAMON to record data access monitoring results and then visualize them, and how much CPU time and overhead it incurs while doing so. I will first show the CPU consumption, memory consumption, and command of DAMON's worker thread, which is called kdamond. Because DAMON is not running yet, this will just give us an error message, so let's forward the error messages to /dev/null. What this command will do is execute the given command and then monitor the data access pattern of the resulting process. Then we can show... hmm, we cannot show ps. I'm currently connected to a virtual machine via the network, and the network is not working right now.
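For reference, what that first demo runs is roughly the following, assuming damo installed from PyPI and a DAMON-enabled kernel; the workload path is a placeholder:

```
# Watch DAMON's kernel worker thread (CPU%, RSS, command):
watch -n 1 'ps -o %cpu,rss,comm -C kdamond.0' &
# Start the workload and record its data access pattern:
sudo damo record "./my_workload" 2>/dev/null
```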
Yeah, so in that ps output the first column is the CPU usage of the process, the second should be the RSS, and then the command. But this is not working at the moment, so let's hope the network starts working again and go back to the slides; I will have yet another demonstration session later.

Can I ask some questions? Yes, please. So the point that I'm confused about: is this based on physical address ranges or virtual address ranges? You can select: it supports the virtual address space of specific processes and also the entire physical address space, so you can use either, at your choice. In the case of this demonstration, though it was just the one command, it was for the virtual address space of the process: it was expected to monitor the data access pattern of the process started by the command and then show a heat map visualization of the access pattern. But I think we cannot show it at the moment; you can go home, install a DAMON-enabled kernel, and try it on your own. So you can choose between the virtual address range of a specific process or the full physical address range; and if we choose the physical address range, does DAMON use something like the reverse-mapping infrastructure of Linux? Yes, it uses rmap to figure out which processes the pages are mapped into, then follows the page tables and checks the PTEs internally, at the moment.

So let's revisit the demonstration after the network gets somewhat better, hopefully. What we can say is that DAMON allows access-aware system operations: because DAMON is working and giving us an accurate access pattern with low overhead, a number of access-aware analyses of the access patterns of your system and workloads, and optimizations based on them, become possible. We can provide the monitored working set sizes to users and then let users configure their systems, for example, install more DRAM or remove some unnecessary DRAM. And if you are the programmer of a user-space application, you can debug and optimize wrongly implemented data access logic in your program. For example, we can visualize the working set size of the workload, how the working set size changes over time, and which memory regions have been how frequently accessed over time. And by using great tracing features like ftrace, maybe we can also record the call stack and show how the stack trace changes over time. Then, say there are three memory-bursting periods: we can see which memory areas were frequently accessed and which functions were called at the time, and then start looking into those functions and make some optimizations if needed.

There can also be system-level access-aware memory management, and this is something the Linux kernel can further help with, in my opinion. Therefore we have continued the work with DAMOS, which stands for DAMON-based Operation Schemes. Using DAMON we can do DAMON-based optimizations, and a common form of such an optimization would be getting and analyzing the monitoring results using DAMON, and then prioritizing or de-prioritizing some memory regions based on the analysis. For example, we can page out cold regions, apply transparent huge pages to hot regions, and so on. In other words, we keep the important data close and the critical data closer, as mentioned at the beginning.
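As a user-space flow, that is roughly the following (damo subcommands as demonstrated in this talk; the madvise step is an illustration of the manual feedback, not a damo feature):

```
# 1. Record the access pattern of the workload:
sudo damo record "./my_workload"
# 2. Dump the raw monitoring results: address ranges, access
#    frequencies, and ages, one region per line:
damo report raw
# 3. Feed the findings back by hand, e.g. call madvise(MADV_PAGEOUT)
#    on the cold ranges from inside the program, or process_madvise()
#    from a privileged helper.
```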
So it's all done, right? However, this flow may contain some repetitive and inefficient steps, in our opinion. The analysis and the prioritization could be repetitive across different applications. And user-space management could be inefficient, because transferring the monitoring results from kernel space to user space, and then transferring the memory management decisions from user space back to kernel space, could be inefficient in some cases. From here, one of my colleagues asked me a question: can't DAMON do that directly inside the kernel instead? And therefore we have implemented DAMOS.

DAMOS is a feature of DAMON for offloading this optimization effort to DAMON. Using this feature, users can simply specify the access pattern of their interest and the memory management action they want to apply to the regions showing the pattern. Then DAMON finds regions of that pattern and applies the action to them. Therefore we don't need code, just a request specification. For example, as this slide shows, we can ask DAMON to find memory regions that have not been accessed at all for at least two minutes, and then page them out.

We have evaluated DAMOS's effectiveness by implementing, with DAMOS, the main ideas of the two state-of-the-art access-aware optimization works that I introduced at the beginning of this talk. In the case of the access-aware THP collapse and split, we implemented it using DAMOS, again very simply, and our version removed 76% of the THP memory footprint increase while preserving 51% of the THP speedup. And by implementing the proactive reclamation idea using DAMOS, we were able to reduce the resident set size by 93% and the system memory footprint by 23% while incurring only 1.2% runtime overhead, in the best case. Nevertheless, please note that these results are from benchmarks; they are said to be realistic benchmarks, but still benchmarks, so please be aware of that. For more details, we have a link here, and you can download the slides, so please use the link if you want. I'd also like to mention that reasonable DAMOS effectiveness here also means reasonable DAMON accuracy.

And now, a second time for demonstration: a demonstration of DAMOS. Let's hope it works. It works! Maybe connecting again might work. No. Let's turn the Wi-Fi off and on again. Ubuntu... my company supports only Ubuntu on our company laptops, unfortunately. Anyway, it looks like it is working now. Okay, let's try this again.

Before the DAMOS part, let's start from recording. And from here, because this is a demonstration of not only DAMON but also DAMOS, I will monitor the memory usage and CPU usage of the test program too. We will show the CPU usage, RSS, and command of the test program I will use for this demonstration, which is called masim. I also forgot the sleep; and let's forward the error messages to /dev/null. So, starting from recording: what it does is start the process, then monitor its data access pattern and record the results. As you can see, DAMON is consuming only about 1.2 to 1.3% of a single CPU's time now. From here I should also mention that the record feature of damo is implemented using the tracepoint feature of the Linux kernel. What the masim process is currently doing is making an artificial data access pattern, which is described in this file called stairs.cfg.
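Roughly, this part of the demo boils down to the following (masim is my small artificial memory access simulator; the invocation and config path are assumed for illustration):

```
# Watch the test program's CPU usage and RSS alongside kdamond:
watch -n 1 'ps -o %cpu,rss,comm -C masim,kdamond.0' &
# Run masim under DAMON and record the monitoring results:
sudo damo record "./masim ./configs/stairs.cfg" 2>/dev/null
```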
This configuration describes a program having ten memory regions of 10 MB each, and accessing those regions one by one: the first memory region for five seconds, then the second region for five seconds, then the third region, and so on. So it has ten phases here, and now it has finished. Let's see how the results were recorded. Actually, it would be possible to show them in real time using the monitor command, but that is not working now. As you can see, this command shows the monitoring results in a heat map format like this on the terminal, and you can also change the colors. It shows the access pattern we were expecting: the ten memory regions, each accessed one by one. We can also show the results in the raw form. That visualization shows the monitored access pattern in the raw way, that is, all the information the monitoring results contain: from where to where has been how frequently accessed, and for how long.

That was the demonstration of DAMON. Now let me show you the demonstration of DAMOS; thank you for waiting. DAMOS can be used with damo options like --damos_action pageout, an access rate of 0 to 0, and a --damos_age of five seconds, with the masim command as the target. Before starting this: as you could see on the second screen, masim had ten memory regions, each 10 MB in size, and therefore the RSS was about 100 MB. But because it accesses the regions only one at a time, keeping all of them resident can be somewhat inefficient. So what I will do now is start the masim program again and ask DAMOS to find memory regions that are not accessed at all; those two numbers are the minimum and maximum number of accesses. (And here I also spot one mistake.) And find the regions having an age of at least five seconds; here, age means how long the memory region has maintained its current data access pattern. So this means: find the memory regions that have not been accessed at all for five seconds or longer, and as soon as they are found, I am asking it to page them out.
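Cleaned up, the request looks roughly like this; the flag names are the ones mentioned in this demo, but the exact syntax differs across damo versions, so treat it as a sketch and check `damo start --help`:

```
# Start masim under a DAMOS scheme: page out regions that have shown
# zero accesses for at least five seconds.
sudo damo start --damos_action pageout \
                --damos_access_rate 0 0 \
                --damos_age 5s max \
                "./masim ./configs/stairs.cfg"
```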
Then, for this case, I have a backup... dash dash damos access rate... oh, "access" is correct. Okay, please let me have a cheat sheet. Ah, yeah: access rate 0 0... damos age 5s... It was working last night, but, well. Let me try just one final time. Yeah! Finally. A different error, but it runs. So what it is doing is starting masim again and then applying the DAMOS scheme. You can see it started with about 100 MB of RSS, and it is currently running the initial phase, which accesses all of the 100 MB of memory regions. And now DAMOS has started to reclaim those regions, because the program has moved on to the phases that access only one of the regions at a time. You can now see the RSS has dropped from 100 MB to about 20 MB. You can also see the CPU usage of DAMON itself has slightly increased: it was about 1.1 to 1.3% before, when we were doing only monitoring, but it has increased to about 1.6 or 1.7%, mainly because it needs to use some CPU time for reclaiming the cold memory. So that was the demonstration; thank you very much for being patient and bearing with me.

I'm sorry for interrupting; I would like a high-level understanding of the difference between the first and second demonstrations. The first one is working truly in user space, with only a little help from the kernel, is that right? No, those are running in kernel space: the monitoring runs in kernel space, and DAMOS also runs in kernel space. Maybe one difference would be the recording: in the first demonstration we were doing recording, and for the recording we use a tracepoint, which the user-space tool records via perf record, so that one is in fact a combination of user space and kernel space. But yeah, I was just lazy. And one other question: you showed that the trace can be done in a process context; can it be done not only for a process context but also for a memory-device context? It seems like it shows the heat map coordinates in a process context, but the data could be interpreted in a memory-device context, using the physical address ranges. Yes, exactly. Okay, thanks. Yeah, thank you.

So, we have also implemented some more DAMOS features for production use, namely DAMOS quotas. A scheme's target access pattern should be fine-tuned to be appropriately aggressive. If it is tuned too aggressively, for example if in the previous demonstration we had set the minimum age not to five seconds but to one or two seconds, it would have reclaimed the hot memory that is frequently being accessed; then the DAMON overhead can be high, and it can impact the user-space workload even more. And if it is not aggressive enough, we will see no effective change: the memory usage would not have been reduced in the previous example if we had set the minimum age not to five seconds but to two minutes, because the program finishes within a minute. Finding such optimal values might be doable for some big companies, but DAMON is not only for big companies; it is for everyone. Therefore we have implemented a feature called DAMOS quotas, which allows users to set upper limits on the aggressiveness of the schemes in a more intuitive way. It lets users limit the time and the bytes for applying the scheme's action per a given time window. Under the limit, DAMOS prioritizes regions using their access pattern, and applies the action to higher-priority regions first. It also allows users to set their own priority weights for each of the access pattern elements, namely the access frequency, the size of the region, and the age of the region. For example, in this example case, we are asking DAMOS to apply the scheme's action for at most 100 milliseconds per second and to at most 100 megabytes per second, and to treat the access frequency and the age as equally important while ignoring the size of the regions.

We have also implemented yet another feature for fine user control, called filters. We made this because some users may know some characteristics of their workloads better than the kernel does. For example, they may know the list of latency-critical processes on their system, and they may know that those access anonymous pages frequently and that this is performance critical. For such cases, the DAMOS filters feature allows users to filter the scheme's targets by the type of the pages; currently we support anonymous pages, and pages belonging to specific cgroups. Using this, users can apply a scheme to only anonymous pages, or non-anonymous pages, or pages of specific cgroups, or all memory except pages of specific cgroups, and these can be combined in any way. For example, in this example case, we are asking DAMOS not to apply the scheme to anonymous pages of a cgroup called latency-critical.
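Via sysfs, that kind of tuning looks roughly like this (directory layout per the DAMON usage documentation; scheme 0 of context 0 is assumed to be set up already, and the exact files vary by kernel version):

```
cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0
# Quotas: at most 100 ms of action time and 100 MiB of action size
# per 1 s window:
echo 100       > quotas/ms
echo 104857600 > quotas/bytes
echo 1000      > quotas/reset_interval_ms
# Prioritization weights: frequency and age equal, region size ignored:
echo 0   > quotas/weights/sz_permil
echo 500 > quotas/weights/nr_accesses_permil
echo 500 > quotas/weights/age_permil
# A filter excluding anonymous pages from the action:
echo 1    > filters/nr_filters
echo anon > filters/0/type
echo Y    > filters/0/matching
```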
One question we got from users here is: is DAMON, or DAMOS, meant for fine user-space control, or for a kernel that just works? There could be some users having enough capacity to get enough information about their systems and workloads, who therefore need a way for fine control of the kernel; for them, features like DAMOS filters can be used. Nevertheless, there could also be users having no such capacity, who therefore cannot get such detailed information, but just need a kernel that somehow works by itself; for them, the kernel should do its best on their behalf without detailed requests. So, to answer the question, we are trying to help both parties, and more features for both parties will be developed.

And the last component: damo. DAMON itself provides only a kernel API for other kernel components, and therefore it cannot be used by users directly. For that reason, there is a kernel module, the DAMON sysfs interface, which creates pseudo files on sysfs, hooks the I/O to those files, and then controls DAMON using the DAMON kernel API in response to the I/O. That means we have a DAMON ABI. And damo is a human-friendly user-space tool for DAMON, developed by the DAMON maintainer, me, so maybe we can call it the official tool. It provides a human-friendly user interface for input and for access pattern visualization; it is written in Python and currently available on PyPI. Nevertheless, damo is not necessarily the only DAMON user-space tool, because anyone can write their own using the ABI; for example, Alibaba has developed their own tool called datop and made it available on GitHub. Inside the damo tool there is a Python module called _damon.py, which implements the core logic for DAMON ABI control, and the damo commands, including the ones I have demonstrated such as record and start, are implemented using _damon.py as a library. So by referring to the implementations of the record and start commands, you can see example usages of it, and you could also implement your own DAMON user-space tool using it as a library. I also considered having yet another demo for damo, but you have effectively already seen it, so we can skip that.
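In any case, since damo is on PyPI, trying it out needs only a DAMON-enabled kernel plus something like the following (the target command is a placeholder):

```
sudo pip3 install damo          # the official DAMON user-space tool
sudo damo record "sleep 10"     # any command works as a target
damo report raw                 # or e.g. `damo report wss`
```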
So, putting it all together, we get DAMON, DAMOS, and damo, and these construct an access-aware Linux system, which allows cutting DRAM costs and boosting memory-intensive workloads' performance. Please note that this is not an official name, just a temporary one, good for making Amazon Linux people confused, because they use the same acronym.

On availability: all the access-aware Linux system components are open source and upstreamed. DAMON and DAMOS were merged in 5.15 and 5.16 respectively, and damo is available on PyPI. I believe there could be some people who cannot use 5.15 or later kernels; for them we have some options. For example, we have backported the basic features onto the Amazon Linux 5.4 kernel, and all new DAMON and DAMOS features are being, and will continue to be, backported onto the Amazon Linux 2 5.10 kernel; the Android 5.10 kernel also has a backported version of DAMON and DAMOS. And there is a rumor that some customers are asking some distros to backport DAMON into their kernels.

Use cases and collaborations. First of all, please note that these are just rumors and clues collected from private and public conversations; there is no central, well-managed channel, so these could have a lot of false positives and negatives. What I can say at the moment is that some people from companies including Alibaba, AMD, AWS (of course), DigitalOcean, Google, Huawei, and others seem to be using it or evaluating it. We have found that the Android common kernel has ported and enabled the feature, and some academic and industry folks are researching DAMON-based memory management. As for collaborations, a number of people, not only inside AWS but also outside, are collaborating on DAMON development: in 2022, last year, 39 Amazon-external people contributed 83 patches to DAMON. We are trying to continue the communication in several ways. Of course, we have the DAMON-dedicated development mailing list, to which you can send any questions or patches, and we have a bi-weekly virtual meeting for community meetups. I'm also trying to present DAMON like this at several conferences, trying to convince both kernel developers and user-space developers; I'm unsure whether I have succeeded today or not, but I'm striving to. And we have some occasional or regular private meetings on demand, when someone asks for one, because even with the bi-weekly community meetup there can be topics that deserve a dedicated slot.

Just two more slides. The DAMON community is currently waiting for your voice, because DAMON, DAMOS, and damo are still under active development and the journey has just begun: it was merged in 5.15 in 2021, so it is quite young yet. What I'm saying is that the interfaces may not fit your use case, and there could be features lacking for your use case. Nevertheless, this doesn't mean the system is unstable; please don't forget that. And please don't silently wait for someone to implement what you need instead of you: please make your voice heard. Report your use cases, your challenges, and the benefits you are getting; ask questions and make requests for your use case. You can also prioritize some future works by showing your interest and sharing your expected usage, by testing and sharing the results, and finally by sending patches.

Conclusion: we have introduced DAMON, the data access monitoring subsystem; DAMOS, which makes data-access-aware system optimization possible with no code; and finally the user-space tool called damo. Some people are currently using DAMON, DAMOS, and damo and getting some fun out of them, and I hope you will also help us deliver more fun to the world. Thank you very much.

Great talk, thank you. Maybe you answered part of my question already. You mentioned CXL memory, and for the CXL memory case there is an interesting, slightly opposite problem: what data can be placed in the slower memory, in CXL memory for example, without affecting the performance of the application? Can we answer this question by using this infrastructure? Can you please repeat the last part? Can we answer the question of what data can be placed in CXL memory without affecting application performance by using this infrastructure? So, what I am currently thinking is this: I'm not directly working on the CXL use case because I don't have a CXL device; nevertheless, what I heard from some people who are working on CXL-based memory tiering using DAMON is that they were just trying to find the cold pages and place them on the slow memory, while finding the hot memory and placing it on DRAM.
Yeah, I see, but I believe there could still be some problem here, because there are dependencies between data: for example, even if we identify a cold page, other variables can depend on it, and finally we need to move the data or do something, so I think there is no easy answer. Yeah, I think that's where finer tuning of the schemes becomes important: finding the really cold pages, really cold, and finding the really hot pages, really hot. And for that, I am thinking that finding such optimal tuning values could be doable for big companies, but not that easy for small companies or individuals like me, and therefore we are also trying to find a way to help with that, such as auto-tuning of the parameters. So that's the basic idea, and what we are currently thinking about is getting some feedback, from users or from system metrics that DAMOS can read by itself. For example, in the case of Meta's Transparent Memory Offloading approach, they were using the memory pressure stall information as the feedback data; I believe a similar approach could be used for DAMOS, and I believe that could be used for memory tiering. And here, another possible metric that could be used as a source of the feedback is not only the pressure stall information but also the amount of free memory in DRAM and the amount of free memory in CXL memory. Based on that, we would be able to ask DAMOS to utilize the CXL memory and DRAM based not only on a specific access pattern, but on an aimed final system state: that is, keep the 20% hottest memory in DRAM, place the 60% coldest memory in CXL, and proactively reclaim some percentage, in such a way. If that gets really, optimally implemented, I believe that could be a good way of utilizing tiered memory. Okay, thank you. Thank you.

So, I think I have used too much time for the demonstrations, and I will also be around in the hall. If you have any questions, please ask me, or send me an email, or use the web page; there is the documentation, and the mailing list is always open for you, and we have the bi-weekly chat over a cup of beer or tea. So please feel free to reach out to me. Thank you very much.