So this talk introduces a humble and simple Python program called damo, the user-space tool for the Data Access MONitor, DAMON, which can be used for profiling the access patterns of your systems and making profile-guided optimizations that need no code, just simple configurations. First, a notice: as always, the views expressed here are mine. This talk is based on version 1.9.9 of damo and version 6.6-rc2 of the Linux kernel, and of course some details can change in future versions. So let me introduce myself first. My name is SeongJae, a Korean name, but just call me SJ when speaking in English. I'm currently working at AWS, maintaining the Linux kernel DAMON subsystem and its user-space tool, damo, which is the topic of today. And also, I'm now a certified pilgrim, because I arrived in Spain last month and walked the Camino de Santiago for about 500 kilometers. The overview of this talk: I want to share why I think access awareness matters, then introduce you to damo, the data access monitoring tool, and then deep dive into how it does data access monitoring and data access monitoring-based system operations. Then I'd like to make a call for participation in the DAMON community, because DAMON is a community-driven development, and then make a conclusion and have a Q&A if we have some time. So first, why does access awareness matter? In the world, there are so many memory devices of different types. For example, there are registers, caches, DRAM, flash, plain hard disks, tape, and so on. They all have different characteristics, such as capacity, latency, bandwidth, power efficiency, and price. And there are some trends, though: the faster ones tend to be small, expensive, and power consuming. That means they emit heat, and even carbon.
So our usual reaction to this situation has been making hierarchical combinations of these devices for more efficient utilization of the system. For example, putting the L1 cache on top, then the L2 and L3 caches, DRAM, SSD, HDD, tape, and so on. And this hierarchy is only expected to become more and more complicated, because more peculiar memory devices are upcoming. For example, zram- or zswap-like software-defined memory devices can be put between DRAM and SSD. We can also think about CXL memory, fabric-attached storage, and fast network-based devices. So the picture would be somewhat like this, and I'd like to note that this picture is taken from Meta's paper called Transparent Page Placement. There are also some more trends. Modern workloads such as cloud, machine learning, and big data are continuously becoming more data intensive, while the DRAM price has not dropped that dramatically. Meta also reports that the cost of DRAM in their data centers is quite significant: they say about 33% of the cost and 37% of the power consumption goes just to memory. Given these trends, we can think that hierarchical memory system operation will be critical for system efficiency. Nevertheless, the idea of efficient access-aware memory system operation can be quite simple: just keep the important data items closer to the fast memory devices. But how can we define the importance of the data items? Which data items are the important ones? That would depend on the specific workload and environment. But normally, maybe we can simply say that the data items that will be accessed frequently in the near future are the important ones, because what we care about here is the speed, capacity, and cost of the memory devices. That is, efficient memory management means access-aware memory management and system operations.
As examples of access-aware memory management optimizations, I'd like to introduce two previous works. The first one is access-aware transparent huge page collapse and split. The background is that transparent huge pages reduce the TLB miss ratio and therefore improve performance, but they also increase the memory footprint due to the internal fragmentation of the huge pages. And the idea is simply to use huge pages for only hot data and regular 4 KiB pages for cold data. The work showed some noticeable results and was accepted to a top-tier conference called OSDI. The second example is proactive reclamation. The current default memory reclamation mechanism of Linux is reactive to memory pressure events, and therefore it can incur an unnecessarily big memory footprint and sometimes latency spikes, because it starts to work quite late. The idea here is to just find cold pages and proactively reclaim them even before memory pressure happens. This work has been done by both Google and Meta. In the case of Google, they were using a machine learning-based approach, while Meta was using an adaptive approach based on memory pressure information. Both works were accepted to a top-tier conference called ASPLOS, and both companies are really using this on their production fleets. So the idea is simple, and there are proven examples. But there are still challenges. First of all, how will we know which data items will be frequently accessed in the near future? We don't know the future, right? Maybe we can do some monitoring-based prediction, and that could be a reasonable way, but there could be an overhead problem, because checking the accesses to every page would be quite challenging. And even if we solved the data access monitoring problem, implementing each optimization from scratch would be difficult, dangerous, and repetitive.
Fortunately, we now have one available solution called damo, which is a Python-written user-space tool, and I will introduce it now. Briefly speaking, damo is a user-space tool for data access monitoring and monitoring-based system operations. It requires a kernel feature, and it supports all kernels that have that feature: that is, 5.15 or later upstream kernels and some distro kernels that backported the feature. The linux-next kernel, the Android kernels, and some of the other distro kernels also have the feature, and therefore damo can work on those kernels. It is currently available on PyPI and in a few Linux distros' package systems; you can check the availability on Repology. The packaging for the Linux distros has been done not by me but by great voluntary contributors, so I'd like to say here that I appreciate them: Mitchell and Koka Kiwi - I'm not sure if that's a real name, though; I haven't met them. So, what's the basic usage of damo? I will now start introducing the usage of damo and more of its internal mechanisms. First of all, damo provides subcommand-based usage, so you can just use the help option to show the available commands and basic usage. The commands can be categorized into four categories. The first category is control of the data access monitoring, namely start, tune, and stop. It also supports commands for snapshots of monitoring results and data access monitoring status, commands for record-based monitoring results profiling, and commands for convenient use and debugging of damo itself. Here, we will focus more on the damo control commands, namely start and tune. So let me start with the live demo. Firstly, let's see how the commands for recording and visualization-based offline profiling can be used.
This can be used in cases where you have a program but don't fully understand its access pattern, or you doubt whether it will really have the access pattern you expected; you can use damo to profile it. Here, we will use a program called masim. It is a simple program that receives a specification of a specific access pattern and then just generates the accesses as configured. The input is very simple. It first defines the memory regions and their sizes - here, 25 memory regions in total, each having a size of about 10 MiB. Then it describes the memory access phases. It has a first phase of 3,000 milliseconds, and in this phase the program will continuously access the 3rd, 7th, 9th, 11th, 13th, 17th, 19th, and 24th regions. Then we have the next phase, and the next, and the next - five phases in total. This is a very simple configuration, right? However, because there are 25 regions, it is not that easy for a human to predict what the resulting access pattern will be, at least not with my tiny human brain. So we can use damo to show what access pattern it will make. We can do that with the damo record command, which receives the command line of the workload that it will monitor. This command makes damo execute the given command and then record its access pattern. You can see that masim is making the access pattern as described: it is running the five phases, the workload has finished, and the data access pattern has been saved. Then you can view the recorded access pattern with damo report, in a visualized way - for example with the heatmap option; there was a typo, as always. And this shows the access pattern of the program in this way: the access pattern of masim turns out to draw "Linux" typography.
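That recording-and-reporting flow can be sketched as a couple of shell commands (hedged: this assumes damo is installed and we run as root, and the masim command line is a stand-in for the one used in the demo):

```shell
# Sketch of the offline-recording workflow: run the workload under damo,
# then render the recorded access pattern as a heatmap.
if command -v damo >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    damo record "./masim ./config"   # execute the workload, record its accesses
    damo report heatmap              # visualize the recorded pattern over time
    ran=yes
else
    echo "skipping: needs damo and root"
    ran=no
fi
```

The record file can be reported repeatedly with different report types afterwards, without re-running the workload.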
We also have a collection of this kind of access pattern visualization for realistic benchmark workloads; you can visit the site to see them. And you can make this kind of visualization for yourself, just for fun, or for profiling, debugging, and profile-guided optimizations. In more detail, maybe you can use this feature for damo-based offline profiling in this way. For example, you can record the stack traces using perf-like tools and record the access pattern using damo, and then visualize how the call stack changes over time alongside how the access pattern changes over time: when which address range was accessed, and what the total working set size was at that time. Then you will be able to know which function is accessing how much memory, and which data object is responsible for the memory usage. And then maybe you can just free some unnecessary memory objects earlier, or make some system calls to protect them. One example of damo-based offline profiling-guided optimization would be something like this: we can find the hot objects of each workload using the profiling approach that I just mentioned, and then just call the mlock() system call on them to avoid swap-out of those pages. We did this a few years ago, and it showed up to a 2.5 times speedup under memory pressure. We shared this at Kernel Summit 2019 and also published it as an academic paper in the Middleware 2019 industry track. The second live demo is a snapshot of the access pattern and the daemon status. So the "Linux" typography access pattern looks funny, right? But because it's a little bit complicated, I will use a simpler access pattern for this demonstration. It has just 10 memory regions, each about 10 MiB in size. It has a first phase that accesses all the regions for 10 seconds, and then in the following phases it accesses the first region, the second region, the third region, and so on, one by one, for five seconds each.
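The perf-plus-damo recipe above might look like this in practice (a hypothetical workflow sketch: `./workload` is a placeholder, and exact report options vary across damo versions):

```shell
# Hedged sketch: record call stacks with perf and the access pattern with
# damo for the same workload, then inspect both along a shared time axis.
if command -v damo >/dev/null 2>&1 && command -v perf >/dev/null 2>&1 \
        && [ "$(id -u)" -eq 0 ]; then
    perf record -g -- ./workload   # call stacks over time (perf.data)
    damo record "./workload"       # access pattern over time (damon.data)
    damo report heatmap            # compare this against 'perf report' output
    ran=yes
else
    echo "skipping: needs perf, damo and root"
    ran=no
fi
```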
So we can start damo to just monitor the workload like we did before. Then it starts masim and starts monitoring it. Actually, I wanted to make it silent, but I forgot that, so let me kill it first and ensure damo is stopped. What I actually wanted to do was repeat the access pattern five times, because the program will finish after about 60 seconds, and make it quiet. Then damo starts the masim command and starts monitoring it. You can see the online status of damo using the damo status command: it shows that damo is running, monitoring the process with PID 4320. And you can see a data access pattern snapshot using damo show. It shows that the workload is currently accessing a 4 KiB region with a 100% access rate and an about 9.5 MiB region with a 100% access rate. That is what we expected: we have 10 memory regions of 10 MiB each, and we access those one by one. So this is how we can use damo to show a snapshot of the regions. The show command also allows you to sort regions by specific access pattern characteristics, such as age and access rate. Here, the age means the time for which the region has maintained its current access pattern. In this way, we can see which regions are hot and have maintained the hot access pattern for a long time, sorted at the bottom. We can see some 4 KiB regions being accessed with a 100% access rate - I guess those would be something on the stack or the heap - and also the about 10 MiB memory region that we expect to be hot. It also supports filtering the regions by parts of the access pattern. For example, if we are interested in only hot memory regions, we can just filter out the regions that are not actively accessed, in this way: we are showing the snapshot of the access pattern for only the memory regions having a 5% or higher access rate. And then we can use damo to make data access-aware optimizations without code, with just a simple command line.
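The live-monitoring commands from this demo can be sketched as follows (hedged: flag names follow recent damo versions and may differ in 1.9.x; the masim command line is a stand-in):

```shell
# Start monitoring, inspect status and snapshots, then stop.
if command -v damo >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    damo start "./masim ./config"     # start the workload and monitor it
    damo status                       # is DAMON running, and monitoring whom?
    damo show                         # snapshot of the current regions
    damo show --sort_regions_by age   # long-stable regions sorted last
    damo stop
    ran=yes
else
    echo "skipping: needs damo and root"
    ran=no
fi
```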
So currently, masim is running, and it shows a virtual memory size of about 100 MiB and a resident set size of about 100 MiB. But we know that not all of the 100 MiB is needed for masim, because it accesses only the 10 MiB regions one by one. We can find the memory regions of masim that are not accessed that frequently and then reclaim them using the damo tune command. The damo tune command is for updating the parameters of damo, including the DAMON-based operation schemes. So we ask it to page out memory regions having an access rate of 0% for at least three seconds. Then it keeps monitoring - the monitoring was already ongoing - finds memory regions that have not been accessed for more than three seconds, and reclaims them. So the resident set size decreases to about 10 MiB, like this. It increases to about 20 MiB at phase changes, because at those moments there are two 10 MiB memory regions that look frequently accessed. So that was the demonstration, and thankfully, this time, I believe that I didn't cause too many accidents. As for where this kind of damo-based access-aware system optimization can be used: we have applied such optimizations in some environments. Firstly, we applied the proactive reclamation in a large serverless production environment, and it reduced memory overhead by up to 90% with up to 2% CPU usage and negligible performance overhead; the result was published in the proceedings of an academic conference called HPDC. We also implemented the access-aware THP collapse and split idea using DAMON-based operation schemes, and found that it can reduce up to 76% of the THP memory overhead while keeping 51% of the performance gain of THP. The detailed results are available on the website. Nevertheless, I'd like to note that these are just cases for specific environments and workloads.
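The tune invocation used in this demo could look roughly like this (hedged: the DAMOS flag spellings follow recent damo versions; check `damo tune --help` for yours):

```shell
# Ask DAMON to page out regions that showed a 0% access rate
# for at least three seconds.
if command -v damo >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    damo tune --damos_action pageout \
              --damos_access_rate 0% 0% \
              --damos_age 3s max
    ran=yes
else
    echo "skipping: needs damo and root"
    ran=no
fi
```

Because tune updates an already-running DAMON instance, the monitoring that was started earlier keeps going; only the scheme is added.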
The results for your workload, your system, and your environment might be different; we are just showing the best cases here. So, how is damo implemented? Actually, I was lying: it doesn't do the data access monitoring and data access monitoring-based system operations on its own, but just delegates the low-level work to the Linux kernel - more specifically, to the data access monitoring subsystem of the Linux kernel, called DAMON. DAMON is the required kernel feature I mentioned. damo itself implements only the communication with DAMON, the user interface for end users, the visualization of the monitoring results, and some more ad hoc features for convenience. It can be called the reference DAMON user-space tool implementation, because it is currently maintained together with DAMON, and therefore it is able to support all DAMON features available in kernels, including the DAMON development kernel trees. DAMON has been merged in the mainline kernel since 5.15, and that's why I was saying that damo can support upstream kernels of 5.15 or later. This gives us a picture of the DAMON subsystem stack: we have damo in user space, and we have DAMON inside the kernel. We don't yet know how many components are in between and under DAMON, though, so let's deep dive further to see them. Firstly, how do damo and DAMON communicate? damo provides the user interface for controlling DAMON, namely the start and tune commands. Those commands receive DAMON parameters from users and pass them to DAMON, and the two commands share the same command line options: the damo start command line options and the damo tune command line options are the very same. The command line options can be categorized into two categories. The first category is the partial DAMON parameters set of command line options. These can specify only a subset of the DAMON parameters, not all of them. That's what we showed in the live demo.
damo provides default parameter values for the unspecified parameters. This is very easy to use and would be best for beginners; that is, it is convenient. Nevertheless, it is restrictive: it has some limitations for full utilization of DAMON. The second category is the full DAMON parameters set of command line options. It receives the full DAMON parameters via a JSON-format input. This could be useful for experts who are more familiar with the DAMON parameters. And how does damo pass the user's request to DAMON? It uses DAMON's API. The DAMON core part is a framework, and therefore it provides its API only to other kernel modules and subsystems. We have a kernel module named the DAMON sysfs interface, which implements a general-purpose DAMON ABI on sysfs. The interface is simple enough to be used with just shell commands, using echo and cat. Nevertheless, there are many files there, for flexibility, and therefore we don't encourage people to manually use the interface on their own; rather, we encourage people to use user-space tools. damo uses this interface to pass the DAMON parameters to DAMON. There are also some other DAMON kernel modules that implement more special-purpose, simpler ABIs for specific use cases - for example, for proactive reclamation and for sorting pages on the LRU lists. And you can also make your own DAMON user-space tool: though damo is the reference DAMON user-space tool implementation, it is just one implementation of such a tool. That is, we of course strive for it to be useful in many general cases and to provide the full features of DAMON, but damo is not necessarily the only one that fits all needs, nor optimal, because it is written as a humble Python program. Anyone can write their own DAMON user-space tool using the DAMON interface, like damo does. Alibaba has done that: they implemented their own DAMON user-space tool, called datop.
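For the curious, a minimal manual walk over the DAMON sysfs interface described above looks like this (paths as in ~6.x kernels, root required; shown only for illustration, since the talk recommends using user-space tools instead):

```shell
# Poke the DAMON sysfs ABI directly with echo/cat.
SYSFS=/sys/kernel/mm/damon/admin
if [ -d "$SYSFS" ] && [ "$(id -u)" -eq 0 ]; then
    echo 1 > "$SYSFS/kdamonds/nr_kdamonds"             # one worker thread
    echo 1 > "$SYSFS/kdamonds/0/contexts/nr_contexts"  # one monitoring context
    echo vaddr > "$SYSFS/kdamonds/0/contexts/0/operations"  # virtual addr ops
    cat "$SYSFS/kdamonds/0/state"                      # "on" or "off"
    ran=yes
else
    echo "skipping: DAMON sysfs interface not available, or not root"
    ran=no
fi
```

Many more files (targets, intervals, schemes) sit under the same hierarchy, which is exactly why driving them by hand is discouraged.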
And also, you can consider using damo's core modules as a library. Currently, damo separates its code into core modules and subcommand modules, and the subcommands are just using the core modules as a library. Therefore, if you don't care about Python's performance but just need some additional features, maybe you can use the damo source code as a library. Of course, the programming interface would not be that stable - like the Linux kernel's internal APIs - so maybe the best option would be making your own damo subcommands and upstreaming them if possible. So now we have shown how damo and DAMON are connected. Let's further see what DAMON parameters damo sends to DAMON for the monitoring. Before that, about the full parameters input: writing the JSON input from scratch is boring, because there are so many parameters. There are helper commands for making the full JSON input in a more handy way: damo fmt_json receives the partial DAMON parameters options, and then generates the JSON input for the whole DAMON parameters set. So we will go through the JSON input one part at a time. First of all, the kdamonds and the DAMON contexts. The JSON input has a kdamonds list - a list of kdamond entries, where a kdamond is a DAMON worker thread. Each kdamond has a list of contexts. The DAMON context is the data structure for a monitoring request and its running status; that is, it contains both the request and the results. Each kdamond should have at least one DAMON context, because it makes no sense to run a worker thread without a request. Currently, only a single context per kdamond is supported. We will support multiple contexts in the future, but this is the current state. Nevertheless, you can run multiple kdamonds if you need to scale out for more CPU resources or for various monitoring requests.
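The kdamonds/contexts structure just described could be sketched as JSON like this (the field names are my illustration of the shape that `damo fmt_json` produces, not an exact schema):

```shell
# Write an illustrative full-parameters JSON and check it is well formed.
# Field names are assumptions modeled on the structure described above.
cat > /tmp/damon_params.json <<'EOF'
{
    "kdamonds": [
        {
            "contexts": [
                {
                    "ops": "vaddr",
                    "targets": [ { "pid": 1234, "regions": [] } ],
                    "intervals": {
                        "sample_us": 5000,
                        "aggr_us": 100000,
                        "ops_update_us": 1000000
                    },
                    "nr_regions": { "min": 10, "max": 1000 },
                    "schemes": []
                }
            ]
        }
    ]
}
EOF
python3 -m json.tool /tmp/damon_params.json >/dev/null && echo "well-formed JSON"
```

In practice you would let `damo fmt_json` generate this file and only edit the parts you care about.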
And then we have the operations set. This is a set of implementations of the primitive-level operations that DAMON needs for the monitoring; it provides DAMON the ability to check whether a given address of a specific address space has been accessed or not. Currently, we have two major operations sets: vaddr, for monitoring virtual address spaces using the page table Accessed bits, and paddr, for monitoring the physical address space, also using the Accessed bits, via reverse mapping. You can select the operations set as you need, based on your use case. The operations sets can also be easily extended, because they are separated from the DAMON core layer by an operations set registration interface layer. And of course, as I just said, we can imagine future extensions of the operations sets - for example, an operations set for monitoring the physical address space using AMD IBS or some other hardware features, or some other software features; for example, maybe we can think about using the LRU lists and the pages' positions on the LRU lists as one of the data access information sources. So now we can see the bottom of the stack. DAMON then receives the monitoring targets. For monitoring an access pattern, we would of course need to know which address space, and which address ranges in it, we want to monitor the accesses to. Therefore, DAMON receives virtual address space or physical address space ranges, and the PID of the process for the virtual address space case. And then - this might be a slightly dense part; let me try. DAMON also receives three time intervals, called the sampling interval, the aggregation interval, and the operations set update interval. Every sampling interval, DAMON checks whether there was any access to each sub-region of the target monitoring regions, and if it finds there was an access to the sub-region, it increases a counter of the region, called nr_accesses.
And then, for every aggregation interval - which is usually set to be much bigger than the sampling interval - DAMON checks whether the nr_accesses counter has significantly changed from its value at the last aggregation interval. If it has changed significantly, DAMON resets another counter of the region, called age; if it has not significantly changed, it just increases the age counter. In this way, we can know how frequently each memory region has been accessed, and how long that access rate has continued. This is called the access frequency monitoring mechanism of DAMON, and using it, we can get the kind of heatmap information we saw. From here, I intentionally didn't answer some of the questions that can be raised. So, what are the sub-regions? The regions are defined as address ranges of pages that have a similar access rate. By that definition, we would not need to check the access of every page of a monitoring region - checking just any one page of the region is enough. Therefore, DAMON just randomly selects one sampling page per region and checks the accesses to only that page. As a result, the monitoring overhead becomes dependent on the number of the regions, not on the number of pages or the size of the monitoring target address space. This is called region-based access sampling. But there is a remaining question here, right? The monitoring accuracy would be very bad if the regions are not well identified as defined - so how will we identify regions matching the definition? For this, we randomly split each region into two for every aggregation interval. If we continued only this, ending with each region being the same size as a page, then we would get perfect accuracy; nevertheless, it would have too high a monitoring overhead. That is, unnecessary split operations just increase the monitoring overhead.
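To make the nr_accesses/age bookkeeping concrete, here is a toy shell model of the two counters for a single region (the "significant change" threshold here is purely illustrative, not DAMON's actual heuristic):

```shell
# Toy model: each loop iteration is one aggregation interval, and nr is the
# nr_accesses count observed in that interval. age resets on a "significant"
# change (here: a jump bigger than ~25% of the two values' mean), else grows.
prev=0; age=0
for nr in 0 0 18 19 20 19 2 1; do
    diff=$((nr - prev))
    [ "$diff" -lt 0 ] && diff=$((-diff))
    threshold=$(( (prev + nr) / 4 ))
    if [ "$diff" -gt "$threshold" ]; then age=0; else age=$((age + 1)); fi
    echo "nr_accesses=$nr age=$age"
    prev=$nr
done
```

Running it, age climbs while the access rate stays near 19-20 and drops back to zero when the region turns cold, which is exactly the "how long has this pattern held" signal damo show exposes as the region's age.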
To handle this, we also merge adjacent regions that have similar access rates, for every aggregation interval. In this way, DAMON keeps identifying regions more aligned with the definition while reverting unnecessary splits, and therefore the monitoring overhead is bounded by the complexity of the access pattern: we have only access pattern-dependent monitoring overhead. This is called the adaptive regions adjustment mechanism of DAMON. And there is another monitoring interval, called the ops update interval. This is just for ensuring the operations set has some time to make its own updates. For example, the vaddr ops updates the target monitoring address ranges to cover all the newly mapped regions, because the running program can map and unmap parts of its virtual address space. So we've got two more components on the stack. Using the adaptive regions adjustment, we are now sure that the monitoring overhead will depend only on the real access pattern. But if the real access pattern is mean - that is, if the system has too many regions with very different access patterns - then the number of regions, and therefore the overhead, could grow arbitrarily. To handle that case, we provide two more parameters, called the min and max numbers of regions. Using these, DAMON avoids doing a merge or split operation if doing it would make the number of monitoring regions lower than or higher than the limits. In this way, we can set a minimum accuracy of the monitoring results and a maximum overhead of the monitoring. This is a part of the adaptive regions adjustment. From here, I should admit that this can be an effective limit, but we also recently found, again, that these are not that easy knobs to tune - how much CPU usage, or how bad an accuracy, a given min and max number of regions translates to.
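As a sketch, the three intervals and the region-count bounds can be set explicitly from the command line (hedged: these flag names follow recent damo versions; 1.9.x may spell them differently, so check the help output):

```shell
# Sample every 5 ms, aggregate every 100 ms, ops-update every 1 s,
# and keep the number of regions between 10 and 1000.
if command -v damo >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    damo start --monitoring_intervals 5ms 100ms 1s \
               --monitoring_nr_regions_range 10 1000 \
               "./masim ./config"
    ran=yes
else
    echo "skipping: needs damo and root"
    ran=no
fi
```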
That is still an open question, and it makes DAMON not that easy to tune, so maybe some future improvement is needed here. With this, we have covered all the DAMON parameters for access monitoring. We still need to go further, to the parameters for access monitoring-based memory management, which are called DAMON-based Operation Schemes - DAMOS for short. This is yet another DAMON core feature, for no-code access-aware memory management. It allows users to specify which access pattern of memory regions they are interested in - for example, cold memory regions or hot memory regions - and what system operation action they want applied to the found regions of that access pattern. DAMON then finds the regions of the specified access pattern and applies the system operation action to each of them. Therefore, the users don't need to write kernel code on their own, but can just ask DAMON to find the regions and apply the specific actions. It receives two core parameters. The first one is the action: the system operation action that we want to apply. Currently supported are a page out action, for paging out the pages of the regions as soon as they are found; actions for letting the kernel use huge pages for the memory region, or not use huge pages for the region; actions for prioritizing or deprioritizing the pages on the LRU lists; and a special action called stat, which does nothing but count statistics. The second important parameter is, of course, the access pattern. The access pattern is constructed of three ranges, one each for the size of the region, the access rate, and the age of the region. That is, you can specify how big the regions should be, and what range of access rate they should have for how long, to have the specific system operation applied.
For example, you can ask DAMON to find regions that have not been accessed for two minutes and page them out, or to find regions that have been accessed for more than 50% of the time for more than five seconds and apply huge pages to them, and more such things. That's the core of DAMOS. And we have some more features for production usage of DAMOS, called quotas. The access pattern-based specification can be useful and effective, but it's also hard to estimate how aggressively DAMOS will work. That is, if we fail at estimating the amount of regions of the specific access pattern, then DAMOS might try to apply the action to too-huge memory regions and therefore consume too much CPU. For that reason, we can add a quota to a DAMOS scheme: we can limit the amount of CPU time that DAMOS may use, or the amount of memory that DAMOS will apply the action to. We also have a feature called watermarks, which activates and deactivates a DAMOS scheme based on a system metric. For example, we can activate proactive reclamation if the free memory rate of the system is lower than 30%, but stop the proactive reclamation once the free memory rate is higher than 50%. And finally, filters. Filters are for more fine-grained aiming of a DAMOS scheme's targets. Using this feature, we can filter in or filter out pages of a specific type: for example, anonymous pages or file-backed pages, pages that belong to specific cgroups, pages in a specific address range, or pages belonging to a specific monitoring target. In this way we can, for example, proactively reclaim only the file-backed pages of a specific NUMA node or specific cgroups. With this, we have covered everything - but the talk is not finished yet. We have just one more minute, and this is the main thing that I really want to say to you: DAMON is a community-driven development project, and we have so many community members here.
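Putting the pattern, a quota, watermarks, and a filter together, a single DAMOS scheme could be sketched in JSON like this (field names are assumptions modeled on the concepts above, not an exact schema):

```shell
# Illustrative DAMOS scheme: page out regions idle for >= 2 minutes, capped
# by a CPU-time/size quota, gated by free-memory watermarks, and restricted
# to file-backed pages (i.e. filtering anonymous pages out).
cat > /tmp/damos_scheme.json <<'EOF'
{
    "action": "pageout",
    "access_pattern": {
        "sz_bytes":    { "min": 0,    "max": "max" },
        "nr_accesses": { "min": "0%", "max": "0%" },
        "age":         { "min": "2m", "max": "max" }
    },
    "quotas": { "time_ms": 10, "sz_bytes": 104857600, "reset_interval_ms": 1000 },
    "watermarks": { "metric": "free_mem_rate", "high": "50%", "mid": "30%", "low": "5%" },
    "filters": [ { "type": "anon", "matching": false } ]
}
EOF
python3 -m json.tool /tmp/damos_scheme.json >/dev/null && echo "well-formed JSON"
```

Such a scheme would go into the "schemes" list of a context in the full parameters JSON.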
I call anyone interested in DAMON a member - some people would say, "no, I'm not a member," but I just think of them as members. We have people from companies, including many companies that have shown interest and notified me that they are trying to use DAMON in their production systems. Amazon Linux has backported an initial version of DAMON into its kernel, and is also continuously porting the latest versions of DAMON to the Amazon Linux 5.10 kernel. Android has also backported and enabled DAMON from 5.10, and Hokus has recently published some research on DAMON-based proactive reclamation, and there is more interest from academia and industry. And we are really doing this in collaboration: we collaborate on DAMON development with a number of AWS-internal and -external people, called the DAMON community. Last year, we got 83 patches for DAMON from 39 Amazon-external people. We try to communicate in several ways, including, of course, the DAMON-dedicated mailing list; a virtual bi-weekly community meetup is ongoing; and I have been trying to present DAMON at as many conferences as possible since 2019. This year, I'm planning to make four talks; this is the third one, and I hope the final one that I submitted gets accepted, please. And also, if you are not that comfortable with open discussions on the mailing list, you can reach out to me anytime and ask for a private meeting, of course, if you need to. We are doing this with many people all the time, so please don't hesitate.
So the DAMON community is waiting for your voice, because the DAMON subsystem is still rapidly evolving, and therefore it might not perfectly fit your specific use case. But please don't just give up on it, or wait for someone else to learn your requirements and implement them - please make your voice heard. You can make your voice heard by reporting your use cases and challenges, asking questions, requesting development of new features for you, showing your interest in shared future works, sharing your test results, and finally, of course, sending your patches - whatever, please. So, finally, the conclusion. The conclusion of this talk is that more data access-aware system operations are needed, in my opinion, and damo can be one of the solutions; it provides the feature using the Linux kernel subsystem called DAMON, and some people are already having fun with it. I hope you will also participate and have some fun with it. So, that's it. I guess we are already three minutes over; nevertheless, there should be seven more minutes until the next speaker. So, is there any question? Yeah, that would be good manners toward the next speaker. Thank you very much. Thank you very much for joining this session and listening to my talk, and, yeah, I guess that's it. Thank you very much.