I've been working as a kernel developer for many years, and on the Linux kernel for the last few years. I've worked in many different areas of the kernel, and lately I've been working in the memory management area. So today I'm going to talk about how the kernel manages free pages and reclaims pages from the various processes, so that free pages are available to other processes that need them. As I go through this presentation, I'm going to introduce a number of concepts. If there are questions, please feel free to ask them; I'll be happy to clarify. I'm going to talk about a problem that we have been encountering from time to time, especially with our customers, the Oracle customers. I'll describe what the problem is, then we'll talk through how it can be resolved, and then I'll show you how it works based upon the data I have gathered. So with that, let's get into what the problem looks like. On a Linux system, the system has a certain amount of physical memory, and the kernel manages this memory in the form of pages. Each system has a base page size. The kernel uses the base page size to allocate pages, and it can also aggregate contiguous pages to create higher-order pages. So order-0 pages are base-page-sized, order-1 pages are a group of two, and so on. That's the standard buddy page management system. As the kernel manages all of this memory, it allocates pages to the processes that need them, and when a process releases a page, the page goes back into the free pool. Now, if you look at a system running a number of processes, the user space processes may not necessarily use all of the memory on the system. So what the kernel does with any memory that is not in use by the kernel or by user space is try to make use of it to maximize performance on the overall system: it uses all that extra memory as buffer cache. We'll talk about that.
So let's say we are looking at a system that has a large amount of memory. We are running a workload on it, and the workload does not by itself allocate a large amount of memory, but it does tend to do a whole lot of IO — it's reading from and writing to the disk continuously. Based upon what we have seen, you are running a workload like this, and then all of a sudden you start seeing messages that the kernel is running out of pages of a certain order — order-3 pages, or order-4 pages. Or you try to start a new program and it just takes a very long time, because in the background the program is trying to get the memory it needs to start running. The system has lots of memory, so intuitively we shouldn't be running out of order-3 pages. So let's say you encounter this situation. The next thing you do is take a look at what's happening with the memory on the system. The first thing to do is look at the free memory, so you run free -h, which shows you what the memory situation is on the system currently. The data on this slide is from my test system, where I ran a workload and gathered all of this data, so this is from a real system. Even though the system has a fairly large amount of memory — in my case, 722 gigabytes — what I can see is that used memory is only 2.7 GB, yet my free memory is only three and a half GB. That's a fairly small amount of memory. To explore further, I take a look at /proc/buddyinfo, because that shows me where all the free pages are in terms of how they're being managed by the buddy system — /proc/buddyinfo shows me the number of free pages of each order. When I do that, again I can see that there are very few high-order pages, only lower-order pages, and that most of the memory seems to be not free. So, on a system that has 722 gigabytes of memory, I'm using only 2.7 gigabytes.
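To make that inspection step concrete, here is a small sketch — not part of the talk's tooling, and the function names are mine — that parses /proc/buddyinfo-style output and totals the free memory across orders, assuming a 4 KB base page:

```python
PAGE_KB = 4  # base page size on x86

def parse_buddyinfo(text):
    """Parse /proc/buddyinfo-style lines into (node, zone, [counts per order])."""
    zones = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        # Format: Node 0, zone Normal  <order-0 count> ... <order-10 count>
        node = parts[1].rstrip(",")
        zone = parts[3]
        counts = [int(c) for c in parts[4:]]
        zones.append((node, zone, counts))
    return zones

def total_free_kb(zones):
    """Sum free memory: an order-n block is 2**n contiguous base pages."""
    return sum(count * (1 << order) * PAGE_KB
               for _, _, counts in zones
               for order, count in enumerate(counts))

# On a live system you would feed it open("/proc/buddyinfo").read() instead.
sample = "Node 0, zone   Normal   10   5   2   1   0   0   0   0   0   0   0\n"
zones = parse_buddyinfo(sample)
print(zones[0][0], zones[0][1], total_free_kb(zones), "KB free")
```

A shrinking count in the higher-order columns, while the order-0 column stays large, is exactly the fragmentation symptom described above.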
Yet I'm seeing only three and a half gigabytes of memory free, which explains why launching a new program is taking a long time — there's no free memory available to run it. So what's going on on the system? What's happening really is that, as I said earlier, any memory that is not in use by user space or any of the kernel subsystems, the kernel tries to use to improve performance on the system: it uses that memory as a buffer cache. What the cache is for is any frequently used data on the system. It can be cached in memory — especially data that's coming from the hard disk, because disk access is expensive. Any data you can cache in memory and hand to user space the next time it asks for it gives you a significant performance boost. So that's what the kernel does. It uses up all the spare memory to cache not just file data, but many of the kernel structures as well that are being accessed frequently. What that means is that a lot of the memory being used as buffer cache can possibly be reclaimed, because the data in the buffer cache can go back to the disk if it came from the disk. If the data was not modified at all, you can simply throw out the in-memory copy, because the original data still resides on the disk. If the data has been modified — the page is dirty — you can write it back to the disk. Either way, we have a place where we can store the data away and then reclaim that page. So that's what the kernel does as memory demand increases on the system: whatever memory it is using for buffer cache, it will keep releasing those pages and putting them back in the free pool to make them available to processes. So if we look at that previous scenario — let me go back there — you can see in the output of free -h that buffer cache is using 716 gigabytes of memory. So that's where all of the memory has gone.
We need to be able to reclaim pages from there and make that memory available for user space and other processes to use. So how does this work? What the kernel does is scan the buffer cache to see which pages can be reclaimed. This scan is done by a kernel thread, kswapd. It wakes up whenever there is memory pressure — it wakes up based upon a number of conditions — and when it wakes up, it does a scan to see which pages it can reclaim. After reclaiming any pages it possibly can, it will then wake up another kernel thread, kcompactd. What kcompactd does is look at all the free pages and check which ones are contiguous with each other. If pages are contiguous, it coalesces them into higher-order pages. So if there are two physically contiguous pages, and each is a base page, it can coalesce them into an order-1 page; if four of them are contiguous, they can become an order-2 page, and so on. kcompactd will try to create the highest-order pages it can, and as many of them as possible. So the question is: what causes kswapd to wake up and start reclaiming pages? The answer is the watermarks. In the kernel we have watermarks, and kswapd acts whenever free memory reaches one of these watermarks. Okay, so let's take a look at that. This picture shows the number of free pages over a period of time. If we start out, say, over here at time zero, the system is running, pages are being allocated by processes, pages are being allocated for IO, and we are just moving along. Free pages keep dropping until we reach a certain point that is called the high watermark. The kernel maintains three free-page watermarks: high, low, and min. When the number of free pages reaches the high watermark, we know we are starting to run low. The system continues to consume pages until we finally reach the low watermark.
That's when kswapd wakes up, because now we are starting to run rather low on pages, and kswapd will start reclaiming pages. While it is reclaiming pages, user space and possibly other IO are continuing to allocate pages, and you may be in a situation where the rate of reclamation is being outpaced by the rate of allocation. What happens then is that the number of free pages continues to drop, and as it continues to drop, we finally reach the point where we have hit the min watermark. This is where things start to go from bad to worse. When we hit the min watermark, all allocation is stalled. Whether it's the kernel or user space asking for a page, it doesn't matter — it's going to have to wait. When a process tries to allocate a page, the kernel is going to go synchronously look for pages it can reclaim, and until it can reclaim enough pages to satisfy that allocation request, the user space process goes into a stall. There are certain allocations that will still be allowed to go through even though we have hit the min watermark, and those are the GFP_ATOMIC allocations, typically for kernel-critical data structures. So from the user space point of view, for the most part, once we hit that min watermark, we start seeing allocation stalls. And if the request was for a higher-order page — which happens typically when you are doing IO; if you're doing IO from the disk, the IO subsystem is potentially going to ask for order-2, order-3, order-4 pages — and we don't have any pages of that order, and we cannot satisfy the request by breaking up a higher-order page, then we go into synchronous compaction, and now we have a compaction stall. The user space process is stalled while compaction happens in real time. Until we have compacted enough pages to create a page of the right order, the process is going to be stuck. Finally, we start to get back above the min watermark.
We start to get more free pages and more higher-order pages, and the number of free pages starts to rise. kswapd is continuing to do its work in the background, continuing to reclaim pages, so now we are starting to catch up. The number of free pages keeps going up; we hit the low watermark, and then we hit the high watermark. Once the number of free pages hits the high watermark, that's when kswapd goes back to sleep. Now we know we have enough memory, and we can continue operating until we drop back to the low watermark. So, all this makes sense: these three watermarks are critical to when page reclamation happens, when allocation and compaction stalls happen, and how the system behaves. Since these watermarks determine how the kernel is going to operate in terms of reclamation, it's important to set the right values for them. Now, how are these values set? The kernel starts by computing the min watermark, and the min watermark is really set up so that we don't have a situation where the kernel cannot allocate pages for its own critical data structures, because then we are really in trouble. So the kernel computes a min watermark, which it calls min, and you can see it from user space because it shows up under /proc/sys/vm. From there, it computes the low and high watermarks by adding an offset. The offset is computed in the routine setup_per_zone_wmarks(). The kernel takes the offset it has computed, adds it to the min watermark to compute the low watermark, and then adds the same offset to the low watermark to compute the high watermark. And that offset is the larger of two values: a fourth of the min watermark, or the watermark_scale_factor fraction of the zone's pages. Whichever of the two is larger is the offset, and we just add that on.
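As a sketch of that arithmetic — this mirrors what setup_per_zone_wmarks() does per zone, with values in pages, but the function and parameter names here are mine, not the kernel's:

```python
def zone_watermarks(managed_pages, min_wmark, watermark_scale_factor=10):
    """Derive the low/high watermarks from the min watermark.

    The offset is the larger of a fourth of the min watermark and
    watermark_scale_factor/10000 of the zone's managed pages.
    """
    offset = max(min_wmark // 4,
                 managed_pages * watermark_scale_factor // 10000)
    low = min_wmark + offset    # kswapd wakes below this
    high = low + offset         # kswapd sleeps above this
    return low, high

# Example: a zone of 10M pages (~40 GB at 4 KB/page) with a 100k-page
# min watermark. With the default scale factor the min/4 term dominates.
low, high = zone_watermarks(10_000_000, 100_000)
print(low, high)
```

This also shows why raising watermark_scale_factor widens both gaps at once: once the scale-factor term exceeds min/4, it alone sets the distance from min to low and from low to high.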
The watermark_scale_factor is another tunable that is visible from user space, under /proc/sys/vm. So what that means is the user does have some say in where those watermarks get set: by changing watermark_scale_factor, you can affect where the low and high watermarks end up, and you can affect where the min watermark ends up by simply writing to /proc/sys/vm/min_free_kbytes. Okay, so about the min watermark: the kernel computes a value, but then it caps it. If the value the kernel computes is higher than the cap, it just caps it there, and that cap was 64 megabytes for a very long time. The value was set sometime around 2004 or 2005 — that's as far back as the git history goes — and it was set to 64 MB, which probably was okay at that time. But now we have systems with much, much larger amounts of memory than we had back then, so the min watermark is starting to be fairly low for the system. Finally, in 2020, there was a patch to raise that cap to 256 MB, which is a whole lot better than before, but it's still a fixed cap, irrespective of the amount of memory on the system. So, since we can play with the min watermark from user space — and in turn affect when reclamation starts and how long reclamation runs, because reclamation starts when we hit the low watermark and runs until we hit the high watermark again — we can change min_free_kbytes, which sets the min watermark, and adjust those values. And for the Oracle database workloads, that's what we have had to do. The systems that we work with are typically fairly large systems with lots of memory, running fairly large databases, and we are affected by that very low default value of min_free_kbytes fairly often.
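For reference, the kernel's default min_free_kbytes scales with the square root of low memory. Here is a sketch of that calculation with the post-2020 256 MB cap; the function name and exact clamping bounds are my paraphrase of the kernel's boot-time computation, not its literal code:

```python
from math import isqrt

def default_min_free_kbytes(lowmem_kbytes, cap_kbytes=256 * 1024):
    """Approximate the kernel default: sqrt(lowmem_kbytes * 16), clamped."""
    val = isqrt(lowmem_kbytes * 16)
    return max(128, min(val, cap_kbytes))

# 16 GB of low memory yields a 16 MB min watermark; very large boxes
# hit the fixed cap no matter how much memory they have.
print(default_min_free_kbytes(16 * 1024 * 1024))        # 16 GB -> 16384 KB
print(default_min_free_kbytes(8 * 1024 * 1024 * 1024))  # 8 TB -> capped
```

The square-root growth is exactly why the cap bites on big machines: the uncapped value only quadruples when memory grows sixteenfold, and even that is then flattened at 256 MB.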
So the recommendation from tech support was: when you configure your system, change the value of min_free_kbytes based upon the size of memory on the system. What has happened is we have come up with a formula for what min_free_kbytes should be based upon the size of memory, and that number has been revised over the years, because as we get to newer kernels and newer workloads, our memory requirements change — not just on the user space side, but even on the kernel side — and we find that, okay, the watermark is again very low, so the recommendation gets revised. This has happened repeatedly over a number of years. Where we are today is the recommendation in the table here: on a system with 16 gigabytes of memory, we suggest setting min_free_kbytes to 82 MB. Okay, not a big deal. But it's not at all out of the ordinary for some of these database systems to have two terabytes of memory, and when you look at a system with two terabytes of memory, our formula gives 10.24 gigabytes, which is starting to get fairly significant. And there's no guarantee that we won't have to revise it yet again, because workloads change, a new kernel comes up, we find a different behavior, and it's: okay, change the watermark again. We cannot keep doing this. It's not a solution that solves the problem forever; we have to recompute it constantly. We need something better — a solution that is sustainable. So the first problem was, of course, that the cap on min_free_kbytes was way too low, but at least that problem is gone as of 2020. Now we have 256 MB, so at least we start with somewhat more of a rational number. But we are still looking at workloads and trying to tune the watermarks. So let's look at manually tuning the watermarks so that we can respond to what the workload is doing.
Most systems don't have very consistent activity over the entire course of the day, or even the entire course of the week. You will have periods of very high activity, where we are allocating lots of pages, and then the system calms down again, we are not allocating as many pages, and things are relatively stable. When we are in a high-activity period and the workload is allocating lots and lots of pages, we need to be able to scan the buffer cache very aggressively and make free pages available, so that when the workload comes asking for free memory, we can hand it the memory immediately and not force it into an allocation stall or a compaction stall — because an allocation stall or a compaction stall is jitter on the system. It's unpredictable. These are things customers do not like. You have queries that run in a certain number of milliseconds, but then suddenly one hits a stall and now we are talking hundreds of milliseconds to complete a query. It's not predictable, so we try to mitigate that by making free memory available so that allocations succeed right away. So we raise the watermarks. Well, that's great while the system is under pressure — we are reclaiming more aggressively — but when the system quiets down, we don't want to leave the watermarks high, because when we raise the watermarks we make fewer pages available to the buffer cache, and we know buffer cache is a good thing for the system in terms of performance. So let's make as much memory available as possible to the buffer cache: when the system is calm, we do want to hand most of the memory back to the buffer cache. What I'm getting at is that the watermarks cannot really be static. The watermarks need to change based upon what's happening on the system, what the demand is from the workload. So we have a problem: not only are the default watermarks not right,
we also know that you can't just set them and forget about them — you have to keep tuning them based upon what's happening on the system. Khaled, I have a question on the previous slide. Do you remember which kernel release that change to the cap went into? The cap — I'm thinking it was 5.9. I'm not 100% sure, I'll have to double-check, but I think it was 5.9. Thank you. Okay, so what we are talking about now is that watermarks cannot be static. They need to be not just adaptive, they also need to be proactive. Because once the system is in a state where it's having problems finding free pages, or it has run out of free pages, if you react at that point it's already late — which is what the kernel does today: when it hits the low watermark, that's when kswapd goes into reclaiming pages. We don't want to get to a point where we are already under memory pressure before we start taking action. We want to be able to change the watermarks, or at least get kswapd to start reclaiming pages early on, well before we get to the point where we don't have free memory available anymore. So, reactive behavior from the kernel is not going to work too well, which is why we often end up in a situation where we hit the low watermark and kswapd is running — you can even see it on your system: kswapd will periodically hit 90% or 100% CPU usage as it scans the buffer cache, trying very hard to reclaim free pages, while the system is already under severe pressure and the number of free pages continues to drop. How can we avoid this situation? Can we foresee, in any way possible, that the system is approaching a point where it's going to start running out of free pages, so we can take action now, before we get there? That's the problem I set out to solve. And we can solve that problem by modeling the system behavior.
What if we could take a look at what the system overall has been doing for a certain period of time, and then project it forward in time? Assuming the system continues to behave the same way, we can project forward and say: we have this many free pages now, but based upon how the system has been allocating pages, with the same behavior, this many seconds in the future we are going to run out of free pages. That's not very difficult to do, because mathematical models exist today to do just that. So how would we start looking at the system? We take a look at how many free pages exist on the system, and just sample it periodically. Based upon how many pages are available at any given time, we can start to see a trend — we can see that the system is allocating pages at a fairly high rate, or a moderate rate, or a very low rate, or that the system is actually gaining free pages because user space is releasing the pages it had been using. So we can see a trend line, and that trend line can be projected forward in time to see where the system might be in the near future. We can take this trend line and turn it into a mathematical formula. We can express it as y = ax + b, where a is the slope of the line, and using this formula we can project the system forward to any point in time. In y = ax + b, think of y as the number of free pages and x as time; a is the slope and b is a constant. These are the values we need to compute to come up with the formula, and then we keep evaluating the formula on an ongoing basis, so we know when page exhaustion is coming up in the future. Not only that, we can also watch how the system is behaving in terms of page reclamation — looking at how many pages were in the buffer cache, and how many pages have been freed up.
We can see the current rate of scanning by kswapd, and we can do the same thing for kcompactd — we can see the current rate of compaction by monitoring the number of higher-order pages it is creating. So once we know the allocation pattern, the reclamation rate, and the compaction rate for the system, with all of this information we can project forward in time and say: we know that in the next 20 seconds, or 500 seconds, the system is going to start running out of order-5 pages, so we'd better take action now. Or: the system is going to start running out of order-7 pages, but it's going to happen three hours in the future — okay, no need to take action now. Or we look at the data and it says the system will run out of order-3 pages in the next 200 seconds, and based upon the current rate of reclamation we are not going to recover; slowly the number of order-3 pages is going to keep declining. Then we need to take action, because based upon the allocation rate and the reclamation rate, we see a problem coming. So if we can predict forward and get this foresight into what the state of the system will be in the future, we can adjust the watermarks: get kswapd to run a little earlier, and get it to run a little longer. We start reclaiming pages so that by the time we would have hit free page exhaustion, we have already freed enough pages that we avoid that whole situation. So let's take a look at how we can do that. The method I'm using for this is the method of least squares, which is fairly straightforward. All it says is: you take all of your data and you simply plot it. You start plotting the number of free pages versus time. I just look at how many free pages exist on the system now, then a few seconds later, then a few seconds later, and I keep plotting all of those dots on the graph.
Once I have these dots on the graph, what the method of least squares says is: find a line that passes close to all of these points while minimizing the sum of the squared distances to them. That line will then very closely predict the trend on the system. So that's what we do. Once we have plotted our points, we simply fit a line through them, and this line gives us the system behavior. The line can then be represented with the formula y = ax + b, which allows us to compute future behavior fairly easily. Okay, so now we have an approach: how we can model the system behavior and how we can project that behavior forward in time. So let's see how we put this whole system together. What we need first is data. This comes from the number of free pages as reported in /proc/vmstat or /proc/buddyinfo. We gather the current number of free pages, and once we have enough data points, we can start fitting a trend line. One thing to keep in mind is that system behavior is not static — it is going to change from time to time. So we need to recompute the trend line. Great, we can project it forward, but soon thereafter the system behavior is going to change, so we need to recompute this line and say: this is how the system has been behaving for the last so many seconds. What I have done is create a sliding window. I gather a certain number of data points, fill the window up, and create a trend line. Then at the next sampling period, I move the sliding window over, plug in the new data I got, create a new trend line, and simply keep doing that. This way the trend line tracks the system behavior, and it stays close to what the system is doing at the moment. Once I have my trend line, I can start computing the memory exhaustion points, not just for each zone, but for the overall system as well.
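Here is a minimal sketch of that sliding-window least-squares fit and the exhaustion projection. The class and method names are mine for illustration, not memoptimizer's actual code:

```python
from collections import deque

class TrendLine:
    """Sliding-window least-squares fit of free pages (y) versus time (x),
    giving y = a*x + b over the most recent samples."""

    def __init__(self, window=8):
        self.samples = deque(maxlen=window)  # old samples slide out

    def add(self, t, free_pages):
        self.samples.append((t, free_pages))

    def fit(self):
        """Return (a, b) for y = a*x + b, or None if underdetermined."""
        n = len(self.samples)
        if n < 2:
            return None
        sx = sum(t for t, _ in self.samples)
        sy = sum(y for _, y in self.samples)
        sxx = sum(t * t for t, _ in self.samples)
        sxy = sum(t * y for t, y in self.samples)
        denom = n * sxx - sx * sx
        if denom == 0:
            return None
        a = (n * sxy - sx * sy) / denom
        b = (sy - a * sx) / n
        return a, b

    def seconds_to_exhaustion(self, now):
        """Project when free pages hit zero; None if the trend is flat/rising."""
        fit = self.fit()
        if fit is None:
            return None
        a, b = fit
        if a >= 0:  # not losing pages, no exhaustion ahead
            return None
        return -(a * now + b) / a
```

Refitting on every new sample is cheap — a handful of sums over eight points — which is what makes it practical to keep the model continuously up to date.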
Okay, so now we know how to model the system behavior and how to compute when page exhaustion is going to happen. The next step is: what do we do in response? I'm going to look at what the critical condition is that might happen in the future. If it is simply free page exhaustion, I'm going to try to force an earlier reclamation and a longer reclamation by adjusting the watermarks. But what if I'm not running out of free pages, but rather running out of higher-order pages — I've got lots of base pages that are free, but I don't have enough higher-order pages? To deal with that, what I have to do is force compaction on the system. And since I'm creating a continuously updated model of the system, there's a possibility that I'll encounter a period where the system is in high-activity mode, allocating and freeing lots of pages, but then enters a slower period. I need to be able to see that as well: when the number of free pages starts to go up, I need to be able to back off on the watermarks so that pages can be freed up to be used by the buffer cache. Khaled, we have a question on your linear function in the chat box. Would you like to address that now? Okay, so the question is: is the linear function always a good approximation? I would expect that in interesting scenarios it could be something much more complex and often unpredictable; it is probably good only in situations where the load is changing smoothly enough and nothing really critical is happening. And that is true. There are many ways to model system behavior, and a linear equation is a simpler one. There are many, many mathematical models that would allow you to fit a more complex curve to the system behavior. My approach is to start with a simpler system and go to complexity only if it is needed. I keep my sliding window small enough.
So I'm responding to more of the instantaneous behavior of the system, and a linear model is fairly good at predicting that. If you make your sliding window very large, and you fit a straight line to all those points, you'll see many more squares occupied in the space between the line and the dots on the graph — at that point the linear model starts to get more and more inaccurate. So what I've done is work with a smaller sliding window. And of course, if actual real-life data shows me that a different mathematical model can give better results, I definitely would be open to implementing something like that. For now, starting with a linear model made sense: it's easier to implement, it keeps the algorithm simple, and it works, as I'll show later from the data. Okay. Any other questions? Okay, let's keep going then. So, when I started looking at this, the initial idea was to implement this algorithm in the kernel. It would become part of kswapd: every time kswapd wakes up, it could run this modeling algorithm, and if it sees that we are potentially going to run out of, say, order-3 pages, kswapd could run longer and just make one more pass through the buffer cache to try to reclaim more pages. I could use that to get it to reclaim more pages. And then from kswapd I could also kick off compaction: if we have reclaimed enough pages, or it looks like we are going to run out of higher-order pages, kswapd could wake up kcompactd and try to create as many higher-order pages as possible. So with that idea, we launched this as an external mentorship project in the summer of 2019. Working with the mentees assigned to that project, I created a patch; we implemented the whole algorithm, tested it on a kernel, and sent the patch out to lkml for feedback. There was a good discussion about it.
One of the important pieces of feedback that came back, besides the comments on the algorithm itself, was that this really is a policy we are implementing — wouldn't it be better implemented outside the kernel, in an external daemon? It kind of makes sense, because if you think about it, changing anything in the kernel takes more work. It requires a customer to update the kernel to pick up new functionality, whereas if you implement it outside the kernel, you can just update a user space package. So it can work very well from that point of view, but of course it has its own downsides. If this external daemon is affecting the behavior of the kernel in a critical area like memory management, you are not going to get the best behavior out of the kernel unless you install the external daemon as well. So there are pros and cons to both approaches. For now, based upon the feedback, we decided: okay, we'll go ahead and implement it outside the kernel, keep refining it, see what kind of performance we get, and maybe at some point it will make sense to put it back into the kernel. So that's what we did. Khaled, we do have a question before you switch to the daemon part. There is a question in the Q&A box that might be better answered now, I think. Let's take a quick look at it. Okay, so the question is: can you quickly explain higher-order pages as opposed to normal-order pages, from a sysadmin perspective? Okay, so the way the kernel manages memory on the system, it breaks all the memory down into a manageable size called a page. Each architecture has a base page size — for instance, on the Intel architecture it's 4 KB pages. So the kernel will take the entire system memory, break it down into 4 KB pages, and keep track of each page.
But when we look at some of the operations in the kernel — for instance, say you want to do IO where you are going to DMA a whole chunk of data from a device into memory, or you are moving a whole chunk of data off of InfiniBand using RDS into memory — you are going to need potentially more than 4 KB of free space, and all of that space has to be contiguous. Also, user space will often ask for memory, and we want to keep it contiguous, because then we can minimize the number of page table entries we have to maintain for those physically contiguous pages. So what the buddy system implemented in the kernel does is manage pages in terms of order: an order-0 page is a single base page, and an order-1 page is two base pages that are physically contiguous. The order is really the exponent you apply to base 2 — so order 2 becomes four contiguous physical pages, and so on, all the way up to order 10. The idea is to keep contiguous memory managed as contiguous memory, so that when someone asks for an amount of memory bigger than a base page, we know exactly where to go and grab a set of pages. So essentially that's the difference between the base order, which is an order-0 page, and higher-order pages. And of course, what that means is if we have a whole bunch of, say, order-4 pages but zero order-3 pages, and someone wants an order-3 page, we can just take an order-4 page and break it down into order-3 pages. That's why the kernel will try its best to compact pages into the highest possible order: those are versatile pages that can be broken down into smaller orders. Whereas if you have order-3 pages but no order-4 pages and someone asks for an order-4 page, you can't glue together multiple order-3 pages to get an order-4 page, because those pages will not be contiguous.
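The order arithmetic is simple to state in code — a quick illustration, assuming a 4 KB base page (the helper names are mine):

```python
PAGE_KB = 4  # base page size on x86

def order_size_kb(order):
    """An order-n allocation is 2**n contiguous base pages."""
    return (1 << order) * PAGE_KB

def split(order):
    """Splitting one order-n block yields two order-(n-1) buddies."""
    if order == 0:
        raise ValueError("order-0 pages cannot be split")
    return (order - 1, order - 1)

for o in (0, 1, 4, 10):
    print(f"order {o}: {order_size_kb(o)} KB")
# Splitting always works downward; the reverse does not, because two
# separate order-3 blocks are not guaranteed to be physically adjacent.
```

So an order-10 block on x86 is 4 MB of physically contiguous memory, and any higher-order block can satisfy any lower-order request by splitting — which is the asymmetry the answer above describes.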
So, when we moved from implementing this algorithm in the kernel to implementing it in user space, we created the memoptimizer daemon. In the memoptimizer daemon we implemented this mathematical algorithm to compute the trend line, and we are using a window size of eight. That gives a small enough window that the data is manageable, yet it has been shown to be pretty good at predicting the instantaneous system behavior. So we created this project and launched it as an open source project out there on GitHub. Feel free to download it, feel free to make changes; I welcome any changes, any contributions. You can find it on GitHub. And since this is a daemon, we added some configurability to it, using standard locations for a config file. You can take a look at the config file and tweak the behavior of the daemon by changing values in it. One of the other things we have done is that the daemon uses the syslog facility to log what it is doing, and there are multiple levels of logging. So you can set the logging level based upon what you want to see, and since it uses the syslog facility, you can control where it gets logged and all of that. If you are running this on your system and really want to see everything it is doing, go up to the maximum verbosity level of five. At that point you can see all the computations it is making: it will compute the reclamation rate and compaction rate, and whenever it determines that we are about to run out of free pages of a certain order, it will print out the logic it used, what it saw, and why it thinks we are going to run out. Then it will also log what action it is taking. So, the two things we need to implement something like this are: where are we going to get our data from — this is the data we are plotting and creating a trend line from — and what are the control knobs we can tweak in response to what we are seeing on the system.
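The trend-line computation described here can be sketched as an ordinary least-squares slope over a sliding window of eight samples. This is an illustration of the idea, not memoptimizer's actual code, and the sample values are made up:

```python
from collections import deque

WINDOW = 8  # the window size the talk mentions

def slope(samples):
    """Least-squares slope of evenly spaced samples (pages per interval)."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

window = deque(maxlen=WINDOW)
for free_pages in [900, 870, 850, 810, 780, 760, 720, 700]:
    window.append(free_pages)

trend = None
if len(window) == WINDOW:       # wait until the window fills up
    trend = slope(list(window)) # negative slope: free pages are falling
```

A negative slope means free pages of that order are being consumed faster than they are being produced, which is the signal the daemon acts on.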
So, the data sources we are using — there are three right now: /proc/vmstat, /proc/buddyinfo, and /proc/zoneinfo. /proc/vmstat gives us information on how page reclamation is working and how the cache pages are currently being used. /proc/buddyinfo gives us information on how many pages of various orders are currently on the system. And /proc/zoneinfo gives us the per-zone watermarks; we need to know what the current watermark is before we try to tweak it. Once we have that information and can start plotting a trend line, then come the control knobs when the trend points to memory pressure. memoptimizer uses two knobs to change system behavior. One is /proc/sys/vm/watermark_scale_factor, and this is the watermark scale factor I mentioned before. The watermark scale factor affects the size of the gap between the min watermark and the low watermark, and between the low watermark and the high watermark. So by changing the watermark scale factor, we can raise the low watermark and high watermark at the same time. And then there is another file under /sys/devices/system/node, a per-node file called compact. If you have a NUMA system, you will have multiple nodes, and writing a one to this file forces compaction on that NUMA node immediately. "Khalid, there is another question. Sorry, is this a good time?" Oh, of course. "It's in the question and answer box." Sure. So the question: we have a use case where we need to set dirty_background_ratio very high to avoid IO; can memoptimizer work for this workload? Okay, good question. I'll talk about the other things we are thinking of doing with memoptimizer, and actually tuning dirty_background_ratio is on that list of things to look at. So that's something I'm already looking at. And I'll talk about some of the other things we are doing with memoptimizer, because it turns out we can do a lot with this. So, good question, and yes, definitely something we are looking at.
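As a rough illustration of the first data source: /proc/buddyinfo has one line per node and zone, with a column of free block counts per order. The parser below is a sketch, not memoptimizer's code, and the sample line is made up rather than taken from the presenter's test system:

```python
# Sketch of parsing /proc/buddyinfo; sample line is illustrative only.
SAMPLE = (
    "Node 0, zone   Normal   4381  2207  1064   472   203    94"
    "    41    17     6     2     1\n"
)

def parse_buddyinfo(text):
    """Return {(node, zone): [free block counts for orders 0..10]}."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        node = int(parts[1].rstrip(","))
        zone = parts[3]
        counts[(node, zone)] = [int(p) for p in parts[4:]]
    return counts

free = parse_buddyinfo(SAMPLE)[(0, "Normal")]
# Total free base pages in this zone: each order-n block is 2**n pages.
total_pages = sum(n * 2**order for order, n in enumerate(free))
```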
So, what memoptimizer does: it starts out, takes a look at /proc/buddyinfo, and gets the number of free pages of each order for each node. It takes that data, puts it into the sliding window, and computes a trend line. Of course, when it starts up, it has to wait until it fills up its sliding window, so it takes eight sampling periods before we have a trend line. Once it has a trend line, it keeps recomputing it periodically. It computes a trend line for each page order, and it does that for each zone as well. So we know what's happening on the system, not just overall but also at the zone level. And we also know what the watermarks are on each of the zones. So it creates a trend line, and then it uses the data from /proc/vmstat, which reports how many pages were reclaimed. Since we are sampling the system periodically, we can compute the current rate of reclamation by looking at what we saw last and what we see now; the difference over the sampling interval gives us the rate of reclamation. Same thing with the compaction rate: we look at /proc/buddyinfo and see how the number of higher order pages has changed over the time we have been sampling the system. When it is time for memoptimizer to force reclamation, it scales up the watermarks using watermark_scale_factor, or it forces a compaction by writing to that compact file. So now we have our model, and we have the actions it's going to take in response to what the model tells us. Now the question is, how do we know all of this is working? So I needed to come up with a workload that would let me see this in action. I arrived at that workload by looking at some customer workloads and distilling down the behavior that results in these allocation and compaction stalls. What I found was a workload that does a lot of IO, and in the mix there is also IO that creates lots of files and then deletes them as well.
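The rate computation described here is just the delta between successive samples of a cumulative counter, divided by the sampling interval. A sketch with made-up numbers:

```python
# Sketch: reclamation rate from two successive samples of a cumulative
# /proc/vmstat counter (e.g. the pgsteal_* counters only ever increase).
def rate(prev_count, curr_count, interval_secs):
    """Pages reclaimed per second between two samples."""
    return (curr_count - prev_count) / interval_secs

# Two samples of a cumulative reclaim counter taken 15 seconds apart:
reclaim_rate = rate(1_204_000, 1_504_000, 15.0)
```

The same delta-over-interval idea applied to the per-order counts in /proc/buddyinfo gives the compaction rate.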
So I put together this workload that does a lot of IO plus creates lots of files and deletes them, and I started to see the system running into allocation and compaction stalls. The workload I defined: I have a set of SSDs on my test system, and I run nine parallel dd's to the SSDs. At the same time, I do a kernel compile with make -j 60. This is a 96 processor system with 768 gigabytes of memory. With this, I can monitor the system to see how many stalls we are seeing. The goal behind memoptimizer is to monitor the system and reduce the number of stalls by making free pages available in advance of the system requiring them. The test load is the parallel dd's combined with a make -j 60 of the kernel, over and over again, and the metric is the number of stalls we are seeing. If I see a change in the number of stalls when I run this workload with memoptimizer versus without memoptimizer, then I know I'm seeing some good results, possibly. "Do we have another question here?" "No, no, that's the same question." Okay, so what did the data show us? I ran this workload for roughly 140 minutes, because I found the system became kind of stable over that period of time, and I ran this test with four kernels. At Oracle we have the Oracle kernel, which is the Unbreakable Enterprise Kernel (UEK), and there are multiple releases that are currently being maintained and supported. So I started with UEK4, then UEK5 and UEK6, and then the current upstream 5.14 kernel. Now, without memoptimizer running, just the Oracle kernel as it is, I ran this workload and simply looked at the number of stalls reported by /proc/vmstat at the end of the run. As you can see, the numbers here are fairly high. The number stays high even with 4.14, but it does drop significantly with the 5.4 and then the 5.14 kernel, because a number of changes went into the reclamation algorithm between the 5.0 and 5.2 kernels.
Those changes did make a significant difference to how effective reclamation in the kernel is, and you're seeing the effect of that. Once I had the data on how the system behaves without memoptimizer doing its thing, I added memoptimizer to the system, ran the exact same test again, and looked at the number of allocation and compaction stalls. The number of stalls went down significantly with memoptimizer running. As you can see, on the 4.12 kernel we went from about 5,500 to 625 or so. On 4.14, same thing, a significant reduction from 3,200 to about 42. Even on the 5.4 kernel there was a significant reduction: we went from 212 to 1. And on 5.14 it goes from 190 to 1, and 1 and 0 are in my mind about the same, because this is within the margin of noise. "Khalid, I do have a question. How do you define a stall? How long does it last? Do you average it out in terms of responsiveness? Do you take responsiveness into account?" So I'm not looking at the length of each stall; I'm only looking at the number of stalls that happen. In the mm subsystem, any time a process enters a compaction or allocation stall, a counter is incremented, and the counter is reported by /proc/vmstat. So this is just the number of times we ran into a stall. Now, how long each stall was would be interesting data to see, but that requires more instrumentation, which I haven't done, because I'm afraid that by instrumenting to measure the length of each stall I might change the behavior of the system. So for now I'm focusing primarily on how often we are seeing the stalls. "Thank you." Okay, so now we have seen what memoptimizer can do in terms of improving system performance. We have been improving the algorithm and adding more functionality. The thing that came out of this was that memoptimizer is sitting outside the kernel, and it is actually looking at the system behavior and modeling it.
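For reference, the stall counters being discussed come from /proc/vmstat. A hedged sketch of pulling them out — the counter names are the real vmstat names, but the values below are illustrative only:

```python
# Sketch: extracting allocation/compaction stall counters from
# /proc/vmstat text. Sample values are illustrative, not measured.
SAMPLE_VMSTAT = """\
allocstall_dma 0
allocstall_normal 1287
allocstall_movable 425
compact_stall 423
"""

def stall_counts(text):
    wanted = {"allocstall_dma", "allocstall_normal",
              "allocstall_movable", "compact_stall"}
    counts = {}
    for line in text.splitlines():
        name, value = line.split()
        if name in wanted:
            counts[name] = int(value)
    return counts

counts = stall_counts(SAMPLE_VMSTAT)
total_stalls = sum(counts.values())
```

Comparing this total at the end of two otherwise identical runs, with and without the daemon, is the metric used throughout the talk.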
Which means it can potentially do more things besides just changing the watermarks. To start off, I have added more capabilities to it. We have got a large number of system tunables, and any sysadmin knows there is an extremely large number of tunables on a system these days. It can be hard to figure out how to set some of them, especially when some of those tunables can change from system to system based upon the configuration of the system. So most admins have gotten into the habit of defining a sysctl.conf file to change all of these tunables whenever the system starts up; they make a one-time change by using sysctl.conf. Well, memoptimizer can do the same thing too, because when it starts up it starts looking at the system configuration, and very quickly it starts to build a system behavior model. So why not have it look at all of this data and, based upon that, come up with intelligent defaults for lots of these tunables? So I have modularized the code so memoptimizer can do that: when it starts up, it can compute and set a tunable. But not just that; since it is looking at the system continuously, it can also see changes happening on the system. So, for instance, to start with I added a trigger that keeps an eye on how many huge pages are allocated on the system. Any time the number of huge pages changes, we know there might be tunables that need changing, so how about we use that as a trigger? Then I can add actions to these triggers, so that whenever the number of huge pages changes, memoptimizer goes out and adjusts those tunables. So there are more things it can do, and we have been looking at all of this. We are also looking at how we can improve the algorithm. Today, if I look at the actions memoptimizer takes, whenever it needs to create higher order pages it has to take a fairly heavy action.
It's a little hard on the system, because by writing a one to that compact file, it asks the kernel to go out and do a compaction right away. If we could reduce that load on the system, that would be a good thing. One of the features that was added to the kernel recently is the file /proc/sys/vm/compaction_proactiveness, which tweaks how proactively the kernel will look for higher order pages. It's somewhat limited, because it looks for specific order pages, but it's something I can potentially use to make the impact of compaction a little less heavy handed than it is right now. There are many more tunables, not just under /proc/sys/vm, that we could possibly be looking at. I've listed just a few here: watermark_boost_factor, swappiness, min_free_kbytes; dirty_background_ratio is another one. Then vfs_cache_pressure, that's another one I'm looking at. What should govern the values of these tunables? The system behavior should drive the values. And what is the connection: by looking at the system behavior, how should that govern the value of a tunable? Should it go up, should it go down, should it be 10% of this other value, should it be five times this other value? Those are the things I'm working on, and I'm looking for the tunables that have the maximum effect on the workloads people are running. "There is another question or comment in the question and answer box, and we also have a raised hand. Kenneth, if you can unmute and ask your question. Probably we don't have a question here, maybe just a comment. Thank you." Okay. Another thing I'm looking at: currently I'm modeling the entire system behavior, and somewhere in the back of my mind I'm thinking that maybe looking at cgroups might be beneficial. That's just a thought; I'm not sure exactly what I want to do with it, but it's on my mind.
There are other sources of data that might be useful as well. DAMON is one subsystem that went into the kernel recently, and it gathers data; PSI (pressure stall information) is another one that gathers pressure information. So maybe I could use that information to determine how the system is doing, how the system is behaving. These are just possible future ideas; I'm open to other ideas as well. If you have ideas or want to contribute code, go to GitHub, clone the repo, and feel free to send me a patch. Okay, so, any other questions at this time? What I'm going to do next is show you a live system that's running memoptimizer — that's my test system — to give an idea of how the system behaves under this workload. So all of the discussion we have had so far will make a little more sense, and you can see actual data. Let's move on to looking at the system. Okay, so, can everyone see this window and read the text? Do I need to make it bigger? "Yes, it is readable." Okay, better now. Let me iconify these so they are not in the way. "Control shift plus, that should make it bigger as well." Okay, that doesn't work with this terminal, but I can make it even bigger. That's more readable. Okay, so let's do that. Yes, so this system is currently running a workload, and the workload is just what I had described. I've got a bunch of dd's running on it. It's going to be a little slow because it's under tremendous memory pressure. I'm not running memoptimizer on it right now, because I wanted to show what's happening on the system. So I've got all these dd's running; they are writing to SSDs. I've got three SSDs, and there are three dd's writing to each SSD. And there's also a kernel compile happening in the background. On this system, if we look at the state of free memory, we have got 722 gig total, and currently 75 gig free; 643 gig is in use for buffer cache. And here's the data I'm using to see what kind of stalls I'm seeing.
When I started this workload, I started right after rebooting the system, so the number of total stalls was zero. At this point we have got 1287 plus 425 plus 423 stalls on the system. And you can see this behavior in top: most of the memory is tied up in buffer cache, and the 75 gig of free memory we just had is already down to 53 gig. This free memory continues to dwindle until we hit the low watermark. At that point reclamation happens and free memory starts to go up again, but we would have hit allocation and compaction stalls in the meantime. These are the data sources I'm looking at, as well as /proc/buddyinfo, which shows me the pages of the various orders. Right now the system is not in too bad a shape: we are out of pages up here, but on node one, at least down here, we have some pages available, and on node zero we have some pages available here. But as we keep doing IO and allocating pages, these numbers will continue to go down. And just in the time we've been talking, free memory is down to 39 gig. As for the number of stalls, we haven't hit a stall yet: still 1287, 425, 423. But as this number continues to go down, we are going to start hitting stalls very soon. On this system we are running with the default watermark scale factor, which is set to 10; that in turn determines the watermarks we have. If you go into /proc/zoneinfo, after the per-node data we start seeing information on what the watermarks are. This is the DMA zone; let's get to the Normal zone. If we look at the Normal zone: here's the min watermark for this zone, here's the low watermark, and here's the high watermark, in numbers of pages. This is the value we are going to try to tweak. So, when I launch memoptimizer — memoptimizer registers itself as a systemd service, so you can start and stop it using systemd. We start it, and this system, just because it's under tremendous pressure, which you can see from this low amount of free memory, is slow to respond.
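The watermark lines being pointed at in /proc/zoneinfo look roughly like the sample below. The values are made up, and the parser is a small sketch rather than memoptimizer's code:

```python
# Sketch: extracting a zone's min/low/high watermarks (in pages) from
# /proc/zoneinfo text. Sample values are illustrative only.
SAMPLE_ZONE = """\
Node 0, zone   Normal
  pages free     1048576
        min      173408
        low      216760
        high     260112
"""

def watermarks(text):
    marks = {}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] in ("min", "low", "high"):
            marks[parts[0]] = int(parts[1])
    return marks

wm = watermarks(SAMPLE_ZONE)
# watermark_scale_factor widens these gaps, raising low and high together:
gap = wm["low"] - wm["min"]
```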
"So what was that variable you are going to tweak, Khalid?" This is /proc/sys/vm/watermark_scale_factor. "Okay, we can change that." So this value goes from 10 up to 1000. And it looks like the system is under so much pressure that I can't even start the service. Typically I would start this service when the system boots up, while everything is calm and quiet. One more time, just in case it starts up... okay, this time it has started, and it will start logging. It takes a little bit of time, about two minutes, for it to gather enough data to start modeling the system. So let me bring up another window. This is the compilation happening in the background. So we are down to about three gigabytes of free memory, and memoptimizer is now starting to look at how the system memory is doing. Once it has enough data points to compute a trend line, that's when it starts to take action. The first action it should take, once it sees that the number of free pages is continuously going down, is to raise the watermarks. It tries to compute when we are going to run out of free memory by looking at the current reclamation rate and current allocation rate, and then it raises the watermarks in proportion to how far out that event is, the point when we are going to start running out of memory. So it will raise the watermarks a little bit, and if things don't improve, it might raise them further. And the same thing happens on the other side as well: when the pressure on the system starts to ease off, it will drop the watermark scale factor by 10%, which reduces the watermarks. If the system continues to stay in a stable state, or the number of free pages continues to go up, it will keep dropping the watermark scale factor until we are back down to 10.
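The raise-and-decay behavior just described can be sketched as a small control step. The scaling formula below is purely illustrative; memoptimizer's actual heuristic differs, and the clamping range is the 10–1000 range the sysctl accepts:

```python
# Illustrative raise/decay step for watermark_scale_factor; not the
# daemon's real formula.
MIN_WSF, MAX_WSF = 10, 1000

def next_wsf(wsf, trend, time_to_exhaustion, time_to_catch_up):
    """trend < 0 means free pages are falling (times in ms)."""
    if trend < 0 and time_to_catch_up > time_to_exhaustion:
        # Reclamation won't keep up: raise the watermarks, more if urgent.
        urgency = time_to_catch_up / time_to_exhaustion
        return min(MAX_WSF, wsf + int(wsf * urgency))
    # Pressure easing or stable: decay 10% back toward the default of 10.
    return max(MIN_WSF, int(wsf * 0.9))

raised = next_wsf(10, trend=-29.0, time_to_exhaustion=190, time_to_catch_up=216)
decayed = next_wsf(100, trend=5.0, time_to_exhaustion=0, time_to_catch_up=0)
```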
The range over which memoptimizer will vary the watermark scale factor can be tweaked by setting its aggressiveness. It supports three levels of aggressiveness; at the most aggressive, it will let the watermark scale factor go all the way up to 900, which is the highest it will set it. The less aggressive levels also sample the system less frequently: the sampling period is something like 15 seconds at the most aggressive level, which is level three, then 30 seconds and 60 seconds for the lower levels. And the same applies to the cap: the levels will let the watermark scale factor go up to 900, 700, and 500 respectively. "In fact, we have a question in the Q&A: has memoptimizer been used in production yet, or do you know of any production use today?" So, we are just starting to deploy it on customer systems, so it's not on a customer system yet. We have been testing it internally; we have test systems that run workloads very similar to what some of our major customers are running, the same kind of behavior, and on those systems we have seen improvements in the number of stalls. So, as you can see, now that memoptimizer has built its memory model, it's starting to take action: it's running out of order three pages here on node zero, so it did a compaction there. And here is what its logic was: how many pages are currently available, what the consumption rate is, and so on. Based upon all of that, it triggered compaction on node zero. It will keep doing this every time it updates its system model, like it just did right now: it will compute which node is running out of pages — order three, higher order, or base pages — and then either do reclamation or compaction. Okay, so that makes sense. I also have some data that I gathered from my test system that shows how memoptimizer is affecting the behavior of the system and how it is getting these results; it's a little easier to see the impact it's having by looking at the actual data.
Let me close these out. "There is one more question in the question and answer box: do you have a rough estimate of how big the impact of memoptimizer is on system resources, in terms of CPU maybe?" So the memoptimizer daemon itself is very lightweight; its impact is in the noise region. Where the true impact comes in is when it forces a compaction. The kernel scans the free pages to see which ones can be compacted, and depending upon the state of the system, that can consume resources, because to compact pages the kernel may have to migrate pages. It may take pages from one part of memory and move them elsewhere, so that a page can be made contiguous with another free page and the two can be combined. Compaction can be a somewhat heavy handed process, and how heavy depends pretty much on the state of the system. But even on my test system, where you can see it's so busy that the response time is slow, I don't see significant CPU resource usage by kcompactd or kswapd. Once in a while I'll see kswapd pop up, consuming about 80% of a CPU, but it typically doesn't last more than a second or so, and most of the time kswapd is not even in the top 50 processes — I keep top running to monitor who is consuming the resources. So far the observation has been that it's fairly lightweight. Okay, so let me start with this chart. I have written a script — sorry, let me close this, it's distracting — I've written a script that runs continuously on the system and gathers data: it captures snapshots of /proc/buddyinfo and a whole bunch of other /proc files. Then I have another script that sorts through that data and creates graphs that allow me to see visually what's happening on the system. So this is just the information from /proc/buddyinfo, plotted over a period of time.
These are the numbers of free pages currently available of each order; each order has a different color. And this is how things proceed on the system. This is when the workload I described was running. The workload is just moving along, and as you can see, the number of free pages is very low. Then at some point we hit the low watermark and kswapd kicks in. kswapd goes into reclamation and wakes up kcompactd, which does compaction, and all of a sudden we get a whole bunch of free pages; these are all the pages that were reclaimed from buffer cache. So now we have a good number of order 0, order 1, and so on pages, but the workload is still running. We consume those pages, and we are back down to a very low number of free pages until we hit the low watermark again. Then kswapd runs again, and you can see how the system behaves in response to the number of free pages going down. So how does this behavior change when we run memoptimizer? I have another chart from the data I gathered running the same workload but with memoptimizer running; I separated them out and put them side by side. This is what happens when memoptimizer is running in the background. As you can see, memoptimizer is constantly trying to keep the system ahead of the workload. We do hit low numbers of free pages, but it's not as severe as we had over here, and the number of free pages available on the system is a little smoother. Now, of course, the number of free pages is going to keep dropping, but what memoptimizer is doing is looking at the system behavior and anticipating what the number of pages required might be. So instead of trying to create the highest possible number of free pages, it's trying to create just enough free pages that the system will not run into stalls. And you can see how memoptimizer does this.
For instance, if we look at this data here, it was right around timestamp 21 that the number of free pages went up suddenly. Let me bring up another piece of information: I also captured the log I was showing earlier from memoptimizer, so we can see what memoptimizer did that resulted in a change of behavior on the system. So, here. First, let's take a look at, say, this timestamp. At this timestamp, memoptimizer decided that reclamation is recommended, because we have a high memory consumption rate. That is at timestamp 1314; we can go down here and find the timestamps when my other script captured the data. So 1313 is where it captured data, and then again at 1315. So between these two lines, this one and this one, is where memoptimizer took action, and the action it took: it decided reclamation is recommended, and based upon the current consumption and reclamation rates, it figured that the time to go below the high watermark is 190 milliseconds, while at the current rate the time to catch up is going to be 216 milliseconds. So we'd better start working now. So it raised the watermark scale factor all the way from 10 to 490. And here's the effect of that action, right here: we have some free pages, but we are starting to run fairly low on some of these orders. The watermark scale factor was changed, and that woke up kswapd and kcompactd, and all of a sudden you can see the effect: the number of free pages went up and we created more higher order pages. As we go along, you can see those pages get consumed again, and then somewhere around timestamp 30 memoptimizer did something again, and you can correlate it with an action logged by memoptimizer: 1355 is right about here. We raised the watermarks, and then we raised the watermarks again here.
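The decision in that log entry can be reconstructed as a simple comparison of two times. The input numbers below are made up, chosen only so the result roughly reproduces the 190 ms versus 216 ms comparison from the log:

```python
# Sketch of the "should we act now?" comparison; inputs are illustrative.
def exhaustion_ms(free_pages, high_wm, consume_rate):
    """ms until free pages drop below the high watermark at current rate.
    consume_rate is in pages per ms."""
    return (free_pages - high_wm) / consume_rate

def catch_up_ms(pages_needed, reclaim_rate):
    """ms for reclamation at its current rate to produce pages_needed."""
    return pages_needed / reclaim_rate

t_exhaust = exhaustion_ms(free_pages=60_000, high_wm=41_000, consume_rate=100.0)
t_catch = catch_up_ms(pages_needed=26_000, reclaim_rate=120.0)

# If reclamation cannot catch up before we cross the watermark, raise the
# watermarks now rather than waiting for kswapd's normal trigger.
act_now = t_catch > t_exhaust
```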
So, this just shows that the actions memoptimizer is taking do result in changes to the system behavior, and because we increase the number of free pages available, we end up with fewer allocation and compaction stalls. I've got data from some of the other files as well, and they all show similar behavior. For instance, I captured fragmentation: this chart shows what fragmentation looked like when memoptimizer was not running. A value of one is, of course, really bad fragmentation; you want these numbers as low as possible. If you look at the same chart for node zero from the run with memoptimizer, the chart smooths out quite a bit in terms of fragmentation: there are not as many peaks, and overall the fragmentation is lower than it was without memoptimizer. Okay, so everything makes sense. "I have one question. GitHub shows me there is at least one instance of a deadlock situation with min_free_kbytes being too small. Have you ever run into a deadlock because of this in your workloads, and would memoptimizer help avoid deadlocks?" So, I can't say I have seen a deadlock, but I have seen allocation failures. We have a workload where we are able to reproduce a higher order allocation failure very often, and in that situation memoptimizer does help. It probably would help with that deadlock as well, but I'll need to create a workload that can reproduce it. "Thank you." Okay, so if there are no questions, I'll hand it back to Megan to wrap this up. "Wonderful. Well, thank you so much, Khalid, for your time today, and thank you everyone for joining us. As a reminder, this recording will be on the Linux Foundation YouTube page later today, and a copy of the presentation slides will be available on the website. I hope you're able to join us for future mentorship sessions. Thank you so much, and have a wonderful day."