Okay, so for years I've given these memory management talks about the complex things we are currently doing in the Linux kernel, and it seems attendance was getting less and less because this stuff gets more and more complicated. So for the last year I've been giving this talk at conferences around the world, just to get back to the basics of memory management. I always thought it was self-evident what's going on there, but it seems a lot of people need to know the basics. So I'll try to give you an introduction to memory management in half an hour, which is a pretty complex matter, so we are only going to touch on some of the basics here. So how do I get to the next slide? I want to talk a bit about memory and processes, real versus virtual memory and how paging works, how you can figure out what kind of memory is in your system, how you monitor how a process is using memory, and how you can configure the system to seem to have more memory than there really is; that's overcommit. And I'll give you an overview of where to find more information so you can play around with these things. Then, if we have some time, which I don't think we will, we can go a little bit into processor cache use and how to optimize the performance of your code.

The first thing to know is that the kernel doesn't work with linear memory. Memory is separated into pages. You can think of it as a file cabinet with all these different pages. The kernel refers to every page through a frame number and then an offset into the page. If you have four gigabytes of memory, that results in one million of these 4K pages (4K is the usual size). The memory management unit in the hardware can do some tricks with these 4K units that we are getting to soon. These 4K units are called pages, and each frame has a page frame number, from zero to n depending on your memory size; this is called the physical page frame number.
So if you want to refer to physical memory, you specify the page frame number and an offset into the page. Maybe you are in frame 12 at offset 10; there is your data. From these physical pages, the MMU provides a way to create virtual memory that is process specific, and we are going to get into that in a minute. The common size for page frames is 4K on Intel and has been that way since the 70s or 80s. But right now we have systems with terabytes of memory, and that results in an extremely high number of these page frames, and it continues to be a problem to manage them. Other architectures like IBM and ARM have much larger page sizes and don't suffer from all these problems.

So if you look at the diagram up there, you have the physical memory, and you refer to physical memory with a page and the offset. In between there is a page table. The page table allows you to create a logical address, a virtual address, for each process. So you have one to n physical frame numbers, and now you also have one to n frame numbers for each process, and the page table allows you to reassign them. Virtual frame number 2 may be physical frame number 6 in the hardware. And since these page tables are specific to each process, different processes can use the same address but refer to different sections of physical memory. That allows the operating system to isolate the processes, make sure they don't interfere with one another, and it also provides security in the system. When a process starts up, these page tables are empty. As soon as the process runs and begins to reference memory, the processor looks at the page table and asks: which physical frame number belongs to this address? But the page table is empty; there's nothing there. We have a major page fault. This means the operating system now needs to provide the data that should be there.
This can be, for example, a load from disk. We have mapped an executable or a data file from disk to this memory location, but we haven't populated it, because we do on-demand paging. This means that if you access a new page of your binary, the system will find the page on disk, read it into memory, and then fill in the page table so that the data is available for the process. This is a very laborious process, and it's going on behind the scenes continuously. It provides a dynamic way of providing the memory that you need. You think it's all memory, but it's really not there; you start with nothing, and only what you reference is actually in there. So if you have code segments that are never touched, they will not be in memory, they will only be on disk. That means the virtual size of a process may be much larger than the physical size, because you've never touched certain sections and the system has never provided physical memory for them. So despite having maybe four gigabytes of address space for your binary, you may only use a couple of megabytes. There can be a vast distinction between the amount of memory you are actually using and the memory that you think you're using. And that affects the ability of the operating system to run various binaries concurrently on your system: if you never touch most of it, the system can run multiple of these four-gigabyte binaries even though you only have four gigabytes of memory.

So each process has this kind of memory map. In the memory map you have the binary, where the instructions are that you execute. There is the stack: when you do a function call, it saves the prior information and grows down the stack. There's the heap, where you do dynamic memory allocation. And there may also be shared libraries and things like that. The operating system manages that stuff for you and provides the address map, and this address map is specific to each of your processes.
It also means that pages can be shared between processes. Let's say two different processes use the same shared library or the same binary; glibc is an example, it's used by almost every process in the system. The system only needs one physical page frame with the content of the binary, and it can then be linked into the address spaces of all the processes that are using it. This is another way to save memory and allow system optimization. Usually, if there is a shared mapping, for example of a binary, all these mappings are read-only, and the processor can enforce that. If you try to do a write operation to a read-only page, the operating system gets control and can check what's going on: does this process actually have the right to write to this page? If it does not, the kernel can create an error and abort the process. On the other hand, maybe the process does have the right to write, but another process must still see the old version. Then the kernel can allocate a new page, copy the contents, and allow modifications to the new page. That is called copy-on-write, and it happens frequently. For example, when a process forks, meaning it creates another copy of itself, the copy is not really made; a new page table is simply set up for the new process, and all the mappings are mapped read-only. The new process can now run, and as soon as it tries to write to something, the kernel will provide additional memory for it. This again follows the concept of on-demand memory, which minimizes the amount of memory your process actually uses and lets you run much more on a system than the memory you physically have would suggest.

Then there are special situations. For example, some pages you may want to swap out. Let's say the system is getting overloaded, and now the system has to remove mappings because there's not enough physical memory.
The system will look at which physical pages have been used the least, or haven't been used for a long time. The kernel will then invalidate those mappings, free the pages, and use them for different purposes, on the assumption that you won't come back to those pages soon. If you do come back to such a page, it will give you the same content in a different physical page. It can do that without a problem for read-only pages backed by files on disk: it can just invalidate the mapping and give the page to another process, because it can reread the data from disk. But let's say you have a heap page, dynamically allocated memory that you have changed. The kernel cannot invalidate that without losing data. In that case, the kernel reserves a spot for the page on disk in the swap area, writes the content of the physical page out, and then removes the mapping and gives the memory to another process. That is called swapping. The kernel keeps a record in the page table of where it put the data. So if the process touches that dynamically allocated memory again, the kernel will stop the process, read the page back from swap space, and then the process can run again. This means that if you go into swap because you don't have enough memory, the performance of the system will gradually degrade as you overuse memory. The more you overuse memory, the more this activity comes into play, and the slower your system as a whole will work. So it is always a judgment call how much of that you allow, and we'll get to that later.

Then we have the zero page. Lots of memory in the system is simply zeros. So we have created one page that we all know is always zero, and we write-protect it. Whenever we need zeros, we map that page into the various processes. This means that if you map four gigabytes of zeroed memory, you use one four-kilobyte page. It doesn't require much memory at all.
Only when you actually modify a page and write something non-zero into it will the system actually allocate memory. A lot of people get very confused by this: okay, I'm creating a hundred processes and they all allocate five gigabytes of memory, how can this be, I only have four gigabytes? Because, yes, each one only uses one page and it's all zeros. When you start writing to those five gigabytes of memory, then the system will actually allocate, and then it will slowly grind to a halt, crawl along, and no longer work.

So the classical thing is read behavior. Typically, let's say you want to access data from a file that you have mapped into memory. After the mmap, the memory range has no page table entries. When you first touch it and try to read data from that memory location, the system gets a page fault, and since the data is not in memory, it's a major page fault. Now the system starts to read from disk and fills in the page table, and once the data is there, it returns to your program and the program continues. That is a very slow process. So what we usually do is not read one single page; we read a lot of pages, on the assumption that you will read more and we'll need them anyway, so we don't have to go back to the disk, maybe move the head, which is very expensive and causes a lot of slowness. This technique is called readahead. When you do a single read, the system reads ahead, for example the next 512 kilobytes, just in case you want to read on. Even though IO devices are very fast these days, this overhead is usually worth it. As time progressed, the readahead size increased; it used to be much, much lower in the 80s and 90s. But we have legacy behavior here that we have to deal with.

Then write behavior. If you modify a memory location of a mapped file, the system knows: this file has been modified, but it's not on disk yet.
So now the system has to manage the consistency of the data on disk. The system tracks all the pages that you have written to; these are called the dirty pages, and the dirty pages have to be cleaned. This works via a process that scans for dirty pages, and if it finds a series of neighboring pages that are dirty, it creates a write request to disk. The pages are then written to disk in the background, and at that point the pages are clean again. This means that if you simply switch off the system by pressing the power button, you will lose data unless it has already been written back to disk. That's why there is a command called sync. The sync command makes sure that all dirty pages are written to disk, so that you have consistent content on the disk.

The pages can be in memory and can also be used by other processes. Let's say you've written to a page; the kernel puts it into a thing called the page cache. If other programs now use the same page, they will use the copy from memory, not the copy on disk. That is an optimization. So for the time being, you may have the illusion that the page has been changed on disk, although it's only changed in memory. This is also a cause for confusion, because people don't see that there must be a point when the page is actually cleaned out to disk. Only if you do an explicit flush or a sync command are you guaranteed that the content is actually on disk and that a power failure will not cause you to lose data. This is something that has been discussed repeatedly in database design and database reliability; making databases crash-proof requires some kind of syncing behavior.

And we already discussed the two different types of pages. There are pages that reflect data from disk, where we can just invalidate the mapping and reuse the page because we have a copy on disk, and then there are anonymous pages.
Anonymous pages are pages that don't have a backing file; that's why they're called anonymous. The heap allocations for dynamic data are an example. For those, we must reserve space on a special area called the swap device to be able to write them out and preserve them in case we need the memory for some other purpose.

Then, how do you get information about what's going on in the system? There's a special file in the proc filesystem, /proc/meminfo, which shows you the current state of affairs with memory. There are various other commands as well, like numactl, free, top, and dmesg. They all rely on the same data that is in /proc/meminfo or in the /sys filesystem, where you find a lot of directories and other information that give you details on system memory and system behavior. So if you look at the various things listed here, you see some of the stuff we just talked about. MemTotal is how much physical RAM you actually have in your system. Then how much of that memory is actually available right now. Cached means how many of these pages in memory reflect data from disk; those pages can be invalidated without too much cost. Then SwapCached: how many pages in memory already have a spot on the swap device, so they could be evicted quickly. Zero kilobytes here means swap has never been used; the system did not even assign a block on the disk for any anonymous memory, because we never got that far. Then we have the active and the inactive lists. The system keeps these 4K pages on lists: pages that were recently used are on the active list, and pages that haven't been used for a while are on the inactive list. When the system gets short on memory, it will remove pages on the inactive list from memory and give that memory to another process that needs it for a different purpose. The pages move back and forth between the active and the inactive list.
If a page is on the inactive list and you make a memory access to it, it's moved to the active list. We have separate active and inactive lists for file-backed pages and for anonymous pages. We may also have some memory that is unevictable or mlocked: there's a way to tell the system that certain pages must never leave memory. If you set that, the kernel can do nothing about those pages, and if you do too much of that, the system cannot do efficient paging anymore and may actually fail. Then there's swap: we have about two gigabytes of total swap here, and all of it is free; it has never been used.

Then there are the dirty pages. There are 48 kilobytes of pages in memory right now that have been modified and are not on disk yet. If this number is non-zero, you shouldn't switch off the system, because something will be lost. So if you want to make sure you shut the system down cleanly, you can look at that number, type sync, and check that it's zero; then you could switch it off in an emergency. If you do a regular shutdown from the operating system, it will write back all the dirty pages first. Writeback means pages that have been identified as dirty and for which a command has been given to the device to write them back, but the device has not completed that action yet. That usually happens during a large copy operation: the dirty pages turn into writeback pages and stay there for a while until the device has succeeded in writing all the data back to disk.

Continuing over there, we have the anonymous pages and the mapped pages. Then there's various administrative stuff for the kernel: the slab pages, the kernel stack, and the memory used for page tables. Look, there are about, what is it, 11 megabytes of page tables here. These are just the pages used to map the virtual addresses to the physical addresses.
This can be significant depending on how large your address ranges are. If you have a process a couple of terabytes in size, it will use a significant amount of page tables, and on 32-bit systems with the physical address extension this could actually cause out-of-memory issues. Yep.

Okay, then we have some other tools, like numactl. If you have a bigger system with two different NUMA nodes, you can inquire here what the topology is and how many pages are free. Top shows you a list of the processes that are most active and consuming the most resources in your system; you can see what's happening there, and maybe why your system is slow. And dmesg shows you the operating system messages. At the beginning of the dmesg output you can also see how the system detects the memory in the machine and how it creates physical memory from the blocks of memory that it finds at boot.

Okay, then there's a way to inspect how a process uses memory. I told you about top, which is a commonly used tool, but you can also go directly into the proc filesystem to figure out what's going on with a certain process. If you know the PID, here for example of the sshd daemon process, you can go into its directory and look at the status file. There are a lot of other files in there as well, which give you more details on what's actually going on, but I think this is the most important one, at least for memory management. You can see VmPeak, the largest virtual memory size this process has used so far, and then the current size of the virtual memory. You may have mapped a huge amount of memory and then unmapped it again, so the process shrank; here it hasn't shrunk so far. Then: how many of the pages have been locked in memory, how many of them have been pinned, and how many pages in memory are actually being used.
So we have a process here that's 65 megabytes in virtual size, but it only uses six megabytes of actual physical memory. Because of all these measures that we talked about, the zero page, the deduplication and all the other things, this saves quite a huge amount of memory. Then: how many of these pages are anonymous, meaning dynamically allocated objects, and how many of these pages reflect files on disk, mostly binaries. Then we have the stack size, the executable size, the libraries, and how much is used for the PTEs, which is another measure of how many page-table pages are in use. And then there's VmSwap, which is pretty important. If you see a process getting slow, check VmSwap: if there's actually activity going on there, the process is slowed down because of insufficient memory, and the operating system may have taken pages away from it. There are other commands like ps and top that just inspect these files and present the data in a nice form to you, and there are gazillions of tools that copy part of these operations and use the same files to present this information in ways that may be convenient for one or the other use case.

Then there's a very important tool called ulimit that allows you to specify limits on the amount of memory a process can use. You can specify, for example, how much virtual memory a process may use; here it says virtual memory is unlimited, but you can limit that and say the process cannot map more than a certain amount. You can set various other limits: the max memory size, the amount of locked memory, all sorts of things. You can change scheduling characteristics, the amount of CPU time it can use, and so on. This gives you pretty detailed control over the resource limits of a process, because you don't want a process to hog all of memory.
You don't want a process to hog all of your CPU time either, because you want the system to maybe do other things as well.

Then the overcommit configuration. I told you that the operating system can get by with much less physical memory than the virtual memory it hands out. Here you can configure what the system should do when there is more virtual than physical memory. That's the overcommit_memory setting. If you set it to zero, the system makes a simple heuristic guess about how much overall allocation is reasonable, and if that reasonable amount gets exhausted, it will stop your program with an out-of-memory error. This is the default configuration on most systems, and it won't let you slow down the system too much. If you set it to one, the system will overcommit without limit: you can allocate as much virtual memory as you want. The system may come to a crawl, because it will start swapping and everything else, but it will let you go as far as you want. And with two, there's a specific ratio you configure for how much virtual memory the system should allow to be used: the limit is the swap space plus the physical memory multiplied by the overcommit ratio. This gives you a detailed specification of how far the system should go with overcommit. The typical configuration is zero, so the operating system just guesses. If you get out-of-memory kills without any logical reason, set it to one and see whether the overcommit heuristic is the problem, or whether it's just the system slowing down. My experience is that setting one is actually the best, because these days we rarely run into these problems; we rarely overuse memory.

And then, if you want more knobs to play with, here are some.
They are found in /proc/sys/vm, and you can find the documentation in the kernel's admin guide or online. You can play around with all these things, and some of the important ones I have highlighted here, for example the dirty-page knobs, the stuff that controls when your dirty pages are written back to disk. There's dirty_expire_centisecs, which specifies how long the system maximally waits before dirty data hits the disk. Let's say you don't want any data to stay dirty in memory for more than two seconds before it's written back; you can set that here. That makes a lot of sense, because if your system keeps the stuff in memory while idle and then crashes, that's bad; but if the system has been idle for at least two seconds, you know everything has been written back to disk.

Then there's dirty_background_ratio. This sets the percentage of memory at which the system begins to run a background process that scans for dirty pages and slowly writes them back on its own. So if you do a copy operation or something like that, the data will stay in memory; if it's not too much, the system will finish the copy operation very fast, and then while you do something else it will slowly write the data out to disk. That's usually about five to ten percent of system memory. Then there's dirty_ratio. This is the percentage at which the system will start to throttle your program and focus on writeback first, because there are so many dirty pages that it is unsafe to keep operating this way. Say 40, 50, or 60 percent of memory is dirty now; the system cannot free those pages, it must first do something with them, so the flexibility of the system is restricted. At that point the system will no longer allow you to allocate memory or write more data without first writing back to disk. So you do a write operation and your program will stop.
The system will then only do writeback until less than dirty_ratio of memory is dirty. Okay, you can find more details on that in the admin guide here, and there are some links where you can find more information, and then some man pages. It's pretty important to get familiar with these if you want to tune the system and figure out more about it; without being able to find the proper documentation, it's difficult to make much progress here. I can only point you to things; it's half an hour for the whole virtual memory subsystem, right? So if I can get you to read these things, I've had some success. And what time is it? Yeah, well, I think we can just stop. You can reach me here if you want to talk and have questions later. I'm usually available pretty regularly, and I post most of my stuff and deal with open source patches and all sorts of things with that email address. So we have five minutes left. Any questions? Anything you want more detail on?

[Question about mmap.] mmap provides a virtual mapping. It basically reserves virtual memory, doesn't read any data, and just tells the operating system: this information can be found at this position on disk. Only once you actually access it is physical memory provided for the things you have mmapped. So the mmap might succeed, but when you try to write to the mmapped range, the system may run out of memory.

[Question about protection.] Yes, the page tables provide read and write protection. The operating system can set up a page in such a way that any write attempt causes a fault, and then the operating system will analyze the cause of the fault and decide whether it's okay or not. If pages are shared, then we have multiple page tables referring to the same physical page, and each of the different page tables can have different access permissions for each of the processes.
So it might be that one process can actually write to the page while the other one cannot. Yes. No, when you access the page you will get a page fault, and the operating system will know that a violation of its access restrictions has occurred, and it will then take the proper measures to remedy the situation. That is used extensively, also for user-space locks, for example. We can track when a lock was changed from user space and then react and wake up the waiting processes; we don't have to constantly poll anything in user space. If nobody changes the lock, the process continues sleeping, but if somebody writes to the lock and changes it, we can say: okay, everybody, go again. So this is used in many ways by the OS for multiple purposes.

All physical processors use the same page tables for the processes. They're all synchronized by the OS, which provides consistent semantics of memory handling and provides security. The operating system assures you that it can track these things, make sure the right thing happens, and notify you if you want it to. There are various mechanisms in the operating system: you can subscribe to changes in memory and actually get a notification in user space if something happens to a certain data structure.

Another question up here. Yes. [Question about fsync and fwrite versus mmap.] With fsync and fwrite, you have the data normally in anonymous memory, in a dynamically allocated structure, and you give a command to the operating system to write the stuff back. The kernel creates some internal structures and writes it back; it's not exposed to you. The advantage of using mmap is that you don't need to use a system call; it's all handled dynamically. If the data is in memory, no fault occurs and it's very fast; there's no delay at all.
If the data is not there, only then will the kernel create the overhead of reading the stuff from disk. Whereas with the fwrite or fread system calls, you always have a context switch and you always have the OS involved. So a lot of high-speed applications rely on mmap, because you can avoid the OS overhead. And there are special flags to mmap where you can say: I'm mapping this thing now, but please populate all of it right now, so you can guarantee that the following operations will not cause any page faults. With that, you have a secure way to avoid OS overhead by paying the price up front: you do the initialization when you have time, and then you get into the critical section.

[Question about DMA.] Everything uses DMA. There's no operation that I know of, or any devices, that would not use DMA these days. The only exception I remember recently is maybe the early printk serial debugging console, which uses polling; most everything else is memory-mapped IO with DMA. Network devices, storage devices, whatever else, they all use DMA consistently, and the OS manages the DMA operations for you. The kernel issues the command to the driver: please use DMA to write this back. A driver is free to say it can't do DMA and poll the device instead, one byte at a time; I think one of the old drivers was like that. But modern devices don't operate like that; it's unthinkable, way too slow these days, to do anything like polling.

[Question about measuring memory footprint.] You want to find the memory footprint of a process. This is one of the standard ways to do it, but it's not perfect. Yes, there are various more accurate ways to do this on the web.
I think about two years ago, changes went into the kernel adding special functionality to measure the memory footprint of a process. This here only gives you, of course, the maximum that the process currently uses, but yes, that's it. [Question about the TLB.] Well, yes, that is an optimization to avoid the page table lookups, right? The page table lookups are expensive, and if you can cache the translations in a TLB, you can avoid them. Yeah, we're done. Okay, thanks for coming.