Good morning, and welcome to this session, a tutorial about Linux memory management and how it relates to containers. My name is Gerlof Langeveld. I am a senior trainer at AT Computing, a company in the Netherlands that provides all kinds of open-source training about Linux itself, Kubernetes, Docker and many other subjects. I am also the creator and maintainer of ATOP, a performance monitor that you can find in the repositories of most Linux distributions. I will use ATOP now and then during this session to demonstrate what we are talking about at that moment. Well, we have a session of one and a half hours, and the first part of this session is about memory management in Linux in general. We will have a look at the memory consumers of your RAM memory. We will talk about the kernel and the slab caches. We will see processes as consumers of memory, tmpfs, the page cache. We will see with demand paging how processes get into memory when you start a new process. And we will see what happens when memory gets too full: then you get page scanning, you get swapping. In relation to swapping there is also a parameter, swappiness, which will be handled as well. When memory gets too full and swap gets too full, you get memory stress, and then you get the out-of-memory killing mechanism; we will see the details about that, too. So the first part is about memory management in general, not related to containers yet. We will start in a simplified way in the first slides, and later on we go more into the details of memory management. After that, we will have a look at how we can guarantee memory for processes and how we can limit the memory usage of processes. For that we will have a look at cgroups version 2, which is also used by container implementations. So we will see the relation with containers in the last part of this talk. There is a memory exerciser called use-mem, a small program that I wrote myself for the performance analysis training. You can clone it from this repository. There you will find the source code, which is in C, use-mem.c. You will also find the makefile, so if you have a C compiler on your own system, you can generate the executable yourself to do some experiments, maybe during the talk. There is also a statically linked version of use-mem in that repository, so if you don't have a compiler on your system, you can use the statically linked version, which has a slightly different name than use-mem, the normal name. I think the statically linked version still has to be made executable, but after that you can run it. Be careful with this program, because if you allocate huge parts of memory, you can bring down the system you are working on. Let me show you this use-mem program. If you run it without parameters, it shows you the usage. With use-mem, the only mandatory parameter is a virtual size of memory that you want to be allocated. By default that will be allocated via the famous malloc call, but optionally you can also specify a physical size. The physical size means that the allocated memory will also be written to, and writing to it takes care that the memory is really created physically; allocating it virtually is not enough in that case. Instead of using malloc, you can also have the memory created via a memory map. And you can also create shared memory, with the capital S flag: System V shared memory, or POSIX shared memory. To give an idea of the difference between allocating memory virtually and really touching it physically, a minimal sketch follows below.
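This is not the actual use-mem source, just a minimal sketch of the core idea, assuming that "touching" means one write per page: malloc only reserves virtual space, and only writing to the pages makes them physically resident.

    /* Minimal sketch (not the real use-mem): allocate a number of
     * megabytes virtually, and optionally touch every page so that
     * the memory really becomes resident. */
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        long   mb    = (argc > 1) ? atol(argv[1]) : 1;
        int    touch = (argc > 2);             /* any 2nd arg: touch pages */
        size_t size  = (size_t)mb * 1024 * 1024;
        long   pagesize = sysconf(_SC_PAGESIZE);

        char *area = malloc(size);             /* grows the virtual size   */
        if (area == NULL)
            return 1;

        if (touch)                             /* grows the resident size  */
            for (size_t i = 0; i < size; i += pagesize)
                area[i] = 1;                   /* one write per page       */

        pause();                               /* keep it for inspection   */
        return 0;
    }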
Besides these, there are many other flags that you can use to exercise all kinds of kernel features related to the memory allocation of processes, but we are not going to use those special features. There is one parameter that I want to emphasize, and that is the -r parameter. With that you can specify a repetition interval as a number of seconds, and then the memory size that you specified will be allocated again every so many seconds. So you can simulate a memory leak in the process with that. This is the use-mem program that I will use now and then for demonstrations. Okay, let's first have a look at a simplified explanation of memory management. When you boot a Linux system, the kernel will be loaded, and that is the static part of the kernel: the famous file /boot/vmlinuz, with a certain version number after it. The 'z' in vmlinuz refers to the fact that it is compressed; during the boot it will be decompressed and it will be stored in memory. But that is only the static part of the kernel. The kernel is also going to allocate data: when a process is created later on, a process administration has to be built; when that process opens a file, a file administration has to be built in the kernel. So the kernel will also create more dynamic data, and that will be allocated via the so-called slab. The slab contains slab caches for all kinds of sizes of data structures that the kernel wants to allocate. So the kernel also grows dynamically, and at a certain moment it can discard those structures again and it will shrink again. These slab caches will not be stable in size: they grow and shrink all the time. But everything related to the kernel is memory resident; pages of the kernel will never be swapped out. Well, during the boot phase, at the end of the boot phase, process 1 will be created, the very first process, and this process is the initial process that we know as systemd nowadays. Via unit files it will create daemon processes, the SSH daemon and all kinds of other processes. So systemd is the ancestor of all the user processes, not of the kernel processes, which are usually created by process 2, but the ancestor of all the user processes, including the interactive shell that we will run later on. And of course, the executable file will be used to load such a process into memory; we will see more about that in detail later on. Suppose that your system is up and running. Then it might be that your processes are running, but there is still a lump of free memory, which is really unused. Well, that unused memory will be used in a useful way by creating a cache in that memory, and that means that all data which is read from and written to the file system will be kept in memory as much as possible, since disk devices are usually relatively slow and memory is relatively fast. So for all the file system data that we can keep in memory, we use that free memory as a so-called page cache. Of course, if I start more processes, that page cache has to shrink again, and if processes exit or release memory, the page cache can expand again; that happens dynamically. What you can see in this picture is that there is still a lump of memory really free, and that is a kind of stock of memory that can be used whenever you start a new process.
We can immediately give that memory to a new process, and when existing processes expand, by mallocing or creating shared memory, that free memory can be used as well. So, let's have a look at how that looks on the system. What I will do is run ATOP. ATOP by default runs with intervals of 10 seconds; for demonstration purposes it's better to shrink the interval a bit, to, let's say, 4 seconds. What we see here in ATOP is the CPU usage, all the CPUs together, but also the individual CPUs, on the lines marked 'cpu' in lowercase characters. But we also see memory information, and if my system is connected to a network, I will also see network activity, and I can see the activity of my disks. What is important at this moment is of course the memory, this line. That is similar information, some of it, to what you can also see with top, or with other commands like free and so on. So, what we see here is that my laptop has a bit less than 16 gigabytes, of which 13 gigabytes is still entirely free. But here I can also see the page cache size, which is about half a gigabyte at this moment, because apparently not so many disk accesses have been done so far. What we can see here as well is the slab, so we can also see the dynamic memory used by the kernel at this moment. So, let's have a look at this page cache. It's only half a gig at this moment. What I can do now is start a program, just grep. Most of my example command lines I put in a makefile, which is by the way also in the git repository, in the subdirectory demo, if you are interested in that. What I want to do is this command: grep -r, recursive, for some pattern (it doesn't matter which) in my Downloads directory, where I have a lot of files at this moment. Once I start this command, a lot of file data is read, and that will be stored in the page cache, so I will see my page cache expand. So I started the command, and if I have a look here, I can see the page cache was half a gigabyte, and in the meantime it is three gigabytes, four and a half gigabytes, six gigabytes. And you can see in the meantime that my disks are very busy, or rather my one disk, on which I have the Downloads subdirectory. In the end, it seems that my cache has grown to 10 gigabytes, as we see here. And it looks stable now, so that probably means that grep has finished in the meantime. But let's also have a look at the slab. The slab is about 150 megabytes at this moment. It hasn't grown because of the grep, because grep only opened one inode, or maybe two inodes; there are only a couple of huge files in my Downloads directory. For the rest, grep is reading data, and that fills my page cache, not the slab. But if you access a lot of inodes, if you open a lot of individual files, every inode will be kept in the slab by the kernel. Even if you close the file again, you never know whether the file will be opened again in the near future, so the kernel tries to keep such an inode in the slab cache. So if I go back to my other window, what I can do here is 'make find'. That will run the find command, searching from the root directory for all kinds of files in my file system with a modification time of zero days. Well, that will access and open a lot of inodes.
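By the way, the page cache and slab counters that atop shows here can also be read directly from /proc/meminfo; a minimal sketch, assuming the usual "Cached:" and "Slab:" field names:

    /* Sketch: print the page cache and slab sizes, straight from
     * /proc/meminfo (the same counters that atop and free report). */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *fp = fopen("/proc/meminfo", "r");
        char line[256];

        if (fp == NULL)
            return 1;

        while (fgets(line, sizeof line, fp) != NULL)
            if (strncmp(line, "Cached:", 7) == 0 ||
                strncmp(line, "Slab:",   5) == 0)
                fputs(line, stdout);           /* values are in kB */

        fclose(fp);
        return 0;
    }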
Back in atop, what we can see is that the slab cache, which was 150 megabytes, is now increasing; the kernel is allocating all kinds of additional data. But still, we have some memory free here, as we can see. If I access more file data, the page cache can probably still expand, because we don't need that much free memory. Okay, so we have a page cache. Suppose now that more processes are started, and a process like process four is expanding; then of course the page cache has to shrink again. The page cache will shrink at a certain moment to a certain minimal value, and when the page cache is minimal, processes, or parts of processes, have to be removed from memory to get free memory again. It's all about this free memory here at the bottom: if it really is used by new processes, it has to be refilled with free memory. And if the page cache cannot deliver more memory, then we have to take memory from processes that are currently running, and then the swap space comes into play: parts of processes will be written to the swap device. At a certain moment we can reach the situation that swap is full as well. And then we have a problem, because then we might get a certain deadlock: memory is full, we want to get rid of a part of a process, but swap is full as well; we cannot get free swap to get parts of processes back into memory; memory is full, and so on. So then we get the famous OOM killing, which means out-of-memory killing: the kernel, on its own initiative, will kill one of the processes. We will see more details about that later on during this talk. So if we have a look at which components use physical memory in our RAM: we have seen the kernel, the static part of the kernel; we have seen the slab caches, and the size of the slab; but also processes, of course, use memory at a certain moment. What we can see here on this slide is that atop, in its newest version, also has a kind of pseudo-graphical representation of the most important hardware resources: the processor, with the busy time of the processor; the activity on disk; the activity on the network interfaces, how many packets are going in and out. But we can also see memory here, more or less pseudo-graphically represented. This is, by the way, new in atop 2.9, which is not in all the distributions yet; I think EPEL doesn't have it yet, but most other distributions have it. You can start atop with the -B flag, capital B, for this representation, but you can also press the capital 'B' key inside atop itself. So if I press the B while atop is running, I can see that this is the situation at the moment. We have the processes and the kernel all together; we have the slab; I'll come back to tmpfs; we have the page cache; and we still have 3 gigabytes free, as we saw earlier. And I can see my swap device, which is about 8 gigabytes and still entirely free. So what kind of consumers can we have and see in memory and in swap? Well, we already discussed the kernel and the processes. Another consumer of memory is tmpfs. If I have a look here, I can see that there are various file systems that are based on tmpfs. And tmpfs means that such a file system doesn't consume any space on disk: it is kept entirely in memory, and if memory gets too full, it will even be swapped out. So tmpfs is a non-persistent file system in that sense.
We can see that one of those file systems is /dev/shm, which is a tmpfs-based file system. You can see that the maximum size of such a file system is usually half of memory. Remember, my memory was about 15.2 gigabytes, and half of memory is really the default size of a tmpfs; you can limit that, but that is the default, as you can see. That also means that if I write a lot of data to that file system, it might even introduce swapping in my system. So what I do here is run a command from the makefile again, and that will be the command dd; I will put it at the top of my screen. I do a dd, copying /dev/zero to /dev/shm/big, and I create a file of four gigabytes here. So when running this command (it's finished already), I can now see that tmpfs is really consuming a lot of space in RAM. And it has caused the page cache, of which tmpfs is in fact a part, to shrink. Okay, another consumer of memory is shared memory, and shared memory in fact belongs to processes. Shared memory can be created by one process and shared by other processes, so a lump of shared memory can be part of several processes in a read-write fashion: one of the processes can modify things, and the other processes connected to that same piece of shared memory can use that information or even modify it themselves. We know shared memory in two flavors: System V shared memory, which has its own system calls to create it, and POSIX shared memory, which is in fact based on tmpfs again. I will have a look at System V shared memory, which I think is used most often. I can create a lump of shared memory via my use-mem command as well. So, having a look at my command line again: I do a use-mem -S, with the capital S, which is System V shared memory, and I create a lump of shared memory of four gigabytes. That is virtual; it will not really create it in memory. But I also take care that it is written to, and by writing to that memory, it will really be created in memory. And I will run this in the background. Going back to my other window, I can see that shared memory now also takes a part of my memory. There is still no swapping going on, but all the time the page cache has shrunk, making space for my new memory allocations. Okay, what I can do next is create some other pieces of memory, and that is what I can do with a normal malloc. So I do a use-mem of, let's say, five gigabytes, and also create it physically, fill it with data. I run such a process in the background and see what happens here in memory. What you can see now is that memory fills up; the page cache has shrunk to a minimum; still, there is no swapping going on. By the way, what you see here in atop, apart from the memory consumption and the swap consumption, is events, and there you can see whether page scanning and swapping out are going on. These are colored green, which means it's okay. You can even see whether OOM kills have taken place. But we are on the edge of filling our memory; no swapping yet. What I can do now is run that last command again, creating another five gigabytes of memory, and then we will see that we reach a point where swapping is going on and all kinds of data are moved to the swap space. That can be processes, or parts of processes, that are moved to swap space, and tmpfs and shared memory as well.
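As an aside: creating such a lump of System V shared memory, as use-mem -S does, boils down to something like the following minimal sketch. It is an illustration rather than the real use-mem code, and the size here is arbitrary.

    /* Sketch: create System V shared memory, attach it, and touch every
     * page so that it really becomes resident. */
    #include <unistd.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        size_t size = 64UL * 1024 * 1024;          /* 64 MiB, arbitrary   */
        long   pagesize = sysconf(_SC_PAGESIZE);

        int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
        if (id == -1)
            return 1;

        char *area = shmat(id, NULL, 0);           /* attach to our space */
        if (area == (char *)-1)
            return 1;

        shmctl(id, IPC_RMID, NULL);                /* destroy on detach   */

        for (size_t i = 0; i < size; i += pagesize)
            area[i] = 1;                           /* make pages resident */

        pause();                                   /* keep for inspection */
        return 0;
    }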
So if I create another process of five gigabytes, then I'm really getting to a critical point, because I don't have that much space left in swap. You can see that all kinds of things are coloring red now. And you could also see (well, I missed it unfortunately, because of the very small interval of four seconds) that 'oomkills' was red for a moment, and it said that one of the processes has been OOM-killed. After OOM killing has taken place, that block stays orange for a while, so even if you missed that event, which is of course a very terrible event, you can still see that OOM killing has happened recently. You can also see that shared memory is swapped out, apart from processes, and in this black part you can see that tmpfs is swapped out as well; it's not in memory anymore. We have some free memory again, the page cache is almost entirely eliminated, and we still have some shared memory left in memory. Okay, what I want to do now is kill all the use-mem processes. One of them has already been killed, as you have seen: that was the OOM-killed one, killed with signal 9, a SIGKILL, by the kernel. But I will also kill the others, which makes my memory rather clean again. Not entirely: parts of processes might still be swapped out, and tmpfs might still be swapped out; in fact, tmpfs is swapped out. What I can do here is run this command from the makefile: I do a cat of my big file in tmpfs again, and that takes care that my tmpfs will be swapped in again. So if I run this command, we can see tmpfs slowly filling, and we can see the used space on the swap device slowly decreasing until it stabilizes; it's still busy, and you can see that reads on my disk are still going on, reads from the swap device. All right, let's have a look at the slides again. What we have seen so far is a simplified view of memory management; now we will dive into a bit more detail. In the simplified view it looked like every process is loaded as one lump into memory, which is not the case in practice. If we have a look at our RAM memory, it is subdivided into equally sized chunks, and these chunks are called pages. The page size is in principle defined by the hardware, by the CPU and the MMU. It is important to know what the page size is, because a lot of memory consumption is reported, in all kinds of counters, as a number of pages, and to know how many gigabytes or megabytes that is, it's good to know the page size. You can figure out the page size of your system with the command getconf PAGE_SIZE, and you will see the page size, which is most often 4K, specifically on AMD and Intel processors. So we have 4K pages, let's assume. There are other architectures that have, for instance, 16K pages, but let's assume 4K pages. A simple command like ps already reports certain things as numbers of pages. ps -l shows you a value called SZ, and that is the virtual size of a process; with the y flag added, you also get the resident size. I'll come back to virtual size and resident size later in my talk, but for now you can see two values here that are related to memory sizes, the virtual and the resident size. One of them is by default in kilobytes, and the other one is by default in pages, in the same output line, which has always been a bit strange to me.
It is like this: the RSS, the resident size, is in kilobytes, and the SZ value is in pages, so it has to be multiplied by 4 to get kilobytes as well. So beware that you know the page size. The picture that we saw before, with all the running processes in different colors, we can now see again. If we look here, we can see that all the processes that are running use individual pages, and they don't have to be contiguous: you can see all kinds of green pages and orange pages and so on, from different processes, mixed up in memory, and a process consists of various individual pages. Here you can see pages marked with an S: those are slab cache pages, belonging to the kernel. You can see pages marked with a C: those are page cache pages, belonging to the page cache. You can see the yellow pages with an F inside, which are free, and all the other ones, without a character, are process pages. In the same way, we can consider the swap space as a collection of pages on disk that have been swapped out so far. A swap space is not a file system: you format a swap space with the mkswap command, and it is just formatted as a collection of pages that can be swapped out. The executable file can also be considered as a collection of pages. There we can see various pages that are code pages, containing the instructions to be executed, and various pages that are data pages, containing the static variables of the process. And now we will see how such a program is loaded after all. So let's have a look at loading a program; for those of you who are familiar with the system calls, that is in fact the exec system call, not the fork system call. With exec, a new executable is loaded into the process, and such a process starts at that moment with an empty set of pages: there are no pages loaded at the moment that you start a new program. The only thing the kernel does is set up the page tables for the process. The page tables are used by the MMU, the memory management unit, which is integrated in the CPU; so we are talking about hardware. The MMU has to know where it can find the physical pages of a certain process, and the MMU knows that by having a set of page tables. Well, those page table entries are all marked for the MMU as 'page not present': no pages are present for the MMU at the moment that you start a new program. Then the process is started, and it refers to the address of the first instruction, in a code page. But that code page is not present in memory. And then the MMU, the hardware, will generate a trap, a page fault trap, that says: hey, you are referring to a non-existing page. The kernel reacts on that trap by finding a free page in memory and loading that specific page, assume a code page here, into that free page; which is of course not a free page anymore afterwards. Then it sets up the page table entry for the MMU: it records that the page exists now, with a reference to the physical location of that page. Then the process is restarted and retries the access to that page, and now it will succeed. Well, suppose that the first instruction that is executed now refers to a static variable in a data page.
Then again, that reference to that address will cause a page fault trap by the MMU, and the kernel will take that specific data page from the executable, find a free page in memory, load it into that free page, and set up the page table entry again. This is what we call load on demand, or demand paging. When you start a new program, only those pages that are really referenced by the process will be loaded into memory; pages that are not referenced during this run stay on disk and are never loaded into memory. And that is the difference between the virtual size of a process and the resident size of a process. The virtual size of a process is the worst-case size: suppose that you would touch every page in your executable, and also every page in the mallocked areas, in shared memory and so on, then your process could become as big as the virtual size. But in practice, you won't touch all the pages during every run of your application, so usually the resident size of a process is smaller than its virtual size. That is in fact what we see with the command ps -ly. There we can see the resident size, and if we look at bash here, my shell, I can see, multiplying SZ by 4, that the virtual size of the shell is 220 megabytes worst case, while it in fact only uses 6 megabytes of resident memory. And that is what is important: the resident use of your processes in memory, not the virtual use. Of course you can also see these kinds of counters in top, as the virtual and resident size, and in atop, and so on: per process you can see what the real resident size and the virtual size are. So if we have a look at a process: a process always consists of code, instructions to be executed, of course. The code pages are shareable. If I start the same program twice, or 10 times the same executable, the kernel will notice that, and it will load the code pages only once; they will be shared by all those processes. So if 10 users run the vim program at the same time, the code pages of vim are in memory only once. But per process you get your own data pages, because they are modifiable of course, your own stack pages, and your own heap pages, the heap being what is created by doing malloc in your program. So those are the mandatory parts that we always see for every process. But a process can optionally also use shared libraries, and the idea of a shared library is this: suppose that you have two different executable files; of course they will not share code, being two different files, but still parts of the code might be the same. Every executable uses C functions and needs the C library routines. So what you can do is take the code that they have in common out of the normal executables and put it in a new file, which is called the shared library, and then take care that these executables refer to the path name of this shared library in the file system. In the shared library we have shareable code, and the corresponding data that is needed by that code. So one of the executables can be vim and the other one can be nano: even if some people are using vim and somebody else is using nano, they can still use the same shared library, and they have code in common from that shared library. The code of the shared library will be in memory only once; the corresponding data will be per process, of course, because every process makes its own modifications again.
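You can observe demand paging from inside a program, by the way: /proc/self/statm reports the current virtual and resident size of the process in pages. A minimal sketch, again assuming that "touching" means one write per page:

    /* Sketch: watch the virtual and resident size (in pages) grow, by
     * reading /proc/self/statm before and after malloc and touching. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void show(const char *when)
    {
        FILE *fp = fopen("/proc/self/statm", "r");
        long vsize, rss;

        if (fp && fscanf(fp, "%ld %ld", &vsize, &rss) == 2)
            printf("%-15s virtual %6ld pages, resident %6ld pages\n",
                   when, vsize, rss);
        if (fp)
            fclose(fp);
    }

    int main(void)
    {
        size_t size = 64UL * 1024 * 1024;       /* 64 MiB, arbitrary  */
        long pagesize = sysconf(_SC_PAGESIZE);

        show("at start:");
        char *area = malloc(size);
        show("after malloc:");                  /* virtual size grew  */

        for (size_t i = 0; area && i < size; i += pagesize)
            area[i] = 1;
        show("after touching:");                /* resident size grew */

        free(area);
        return 0;
    }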
Furthermore, a process can create shared memory, which we talked about earlier, for inter-process communication purposes, and a process can use memory-mapped files. It's a bit out of scope, but just in short: you can access a file in the conventional way by opening the file with the open call and after that doing read and write system calls. But instead, you can also open the file and then do a memory map to put the data of the file into the address space of your process. The memory map returns the start address of that mapped data, and then you can simply inspect and modify the data in the file by address manipulation. That is the idea of memory-mapped files, which are optionally also part of the process space. So what we have seen is that a process has a virtual size, and the virtual process size is the worst-case size, as I mentioned: it could become the resident size if the process touched every page, but the resident set size is only what is in memory for the process at this moment. Notice that both sizes include the shared pages, all of the shared pages. So you cannot simply add the resident sizes of all your processes together and then know what your processes consume, because you would count too much: for every process, all the shared pages are included in its resident size. That is why there is also a third size of a process, which is called the proportional set size, and it is rather similar to the resident size; however, the shared pages are divided by the number of sharing processes. For example, if a process shares 100 pages with one other process, only 50 of those pages count towards its proportional set size. If you then add all the process consumptions together, you get a more realistic view of the total space consumed in memory by the processes. That proportional set size is, for instance, shown by atop. If we go here, I can switch back to text mode with the capital B again and see all the details, and I can press the 'm' key for memory; then I see all the memory details of my processes. Here you can see the virtual size and the resident size; of course the resident size is always smaller than the virtual size. But you can also see the proportional size. Well, the proportional size is not filled in at the moment, because it is a lot of work, a lot of CPU consumption, for atop to calculate that proportional size. It will only do that if you press the capital R key in atop, which you can also give as a flag. But there is another issue: this can only be calculated if you run with root privileges, which I didn't do. So I start atop again with sudo to run it in a privileged way, press the capital R key, and when I press the 'm' again for the memory details, I can now also see the proportional set size. So in fact the proportional set size is always smaller than the resident size, and the resident size is always smaller than the virtual size.
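Outside atop you can read these same per-process numbers from /proc as well; a minimal sketch, assuming a kernel that provides /proc/<pid>/smaps_rollup, where the Rss: and Pss: totals live:

    /* Sketch: print the resident size and the proportional set size
     * of our own process, from /proc/self/smaps_rollup. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *fp = fopen("/proc/self/smaps_rollup", "r");
        char line[256];

        if (fp == NULL)
            return 1;

        while (fgets(line, sizeof line, fp) != NULL)
            if (strncmp(line, "Rss:", 4) == 0 ||
                strncmp(line, "Pss:", 4) == 0)
                fputs(line, stdout);            /* values are in kB */

        fclose(fp);
        return 0;
    }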
All right, let's have a look at what happens if memory gets too full. We have seen in the simplified version that if memory gets too full, first of all the page cache has to shrink, but we also want to get rid of pages of processes and swap them out. So how does that really work under the hood? Well, if we have a look at the physical memory of your system, we can see that physical memory is subdivided into so-called nodes. If you have a simple laptop or just a small system, then you probably don't use NUMA; that means that you just have one lump of memory, one node. But if you look at larger systems, you will probably have multiple nodes: you will probably use NUMA, non-uniform memory architecture, which means that the total memory size is subdivided into various physical chunks. So here we see a simplified example of a NUMA system with two nodes: part of my memory is on node 0 and part of my memory is on node 1, and both nodes together form the total memory space of this system. You can see that each node also has various CPUs connected to it. These CPUs can access the memory in the same node very fast, but they can even access memory in the other node; that is done via an interconnect, which is a slower path with a higher latency than accessing the memory in your own node, but it is possible to access memory in other nodes. So memory is subdivided into nodes, and for Linux memory management, nodes are subdivided into zones again. What we see here is that the first zone in memory is the so-called DMA zone, and that is the first 16 megabytes of memory. That is still rather precious memory if you are still using ISA controllers: ISA controllers only have 24 bits to address with, so they can only do DMA, direct memory access, in the first 16 megabytes of physical memory. So that is precious memory, and it is a separate zone. Then we have another zone, the DMA32 zone, which runs from 16 megabytes to 4 gigabytes, addressable with 32 bits: 32-bit controllers that want to do DMA have to have their buffers there. And the rest of your memory is in fact Normal zone. That is also what we can see here: in the first node we have the DMA zone and the DMA32 zone, the rest of the first node is Normal zone, and the other node is Normal zone entirely. Now, about that free memory which is always kept free, even while the page cache expands all the time: that amount of free memory is configurable, and it is defined by the kernel parameter /proc/sys/vm/min_free_kbytes. The default value is determined by the total memory that you have in your system, so it is defined during the boot phase by the kernel itself, but you can overwrite this file if you want to have a stock of more free memory, or less free memory, or whatever.
Well, per zone, three threshold values are defined. First of all the 'min' threshold value: the min threshold per zone is the proportional part, for this zone, of that min_free_kbytes. Let's have a look at an example on the next slide, watching the file /proc/zoneinfo. Via such a file we can look into the kernel administration, and in that kernel administration we can really see these three threshold values. If I grep for a couple of terms in /proc/zoneinfo (I don't need all the other things there), what I can see here is the DMA zone, the DMA32 zone and the Normal zone. By the way, this is a normal laptop with 32 gigabytes of memory, and it only has these three zones; you can see that all the zones are in node 0, that is mentioned here. If we look at 'spanned' for the three zones, that is the size of the zone. So, 4096 pages, times 4K, that is 16 megabytes, the first zone. The number of pages here for the DMA32 zone, if you calculate it times 4K, is 4 gigabytes minus 16 megabytes, and this is the rest of the pages spanned for the Normal zone, up to 32 gigabytes. You can also see per zone how many pages are free: free pages are maintained per zone, and when you use the free command, or top, or atop, and you see the free memory at that moment, all these free sizes per zone have been added together, and that is given to you as the free memory of the entire system; but it is maintained per zone. What you can also see here is that first threshold value that I mentioned: how many pages are to be kept free, proportionally, by the DMA zone, how many by the DMA32 zone, and how many by the Normal zone. If you add all these 'min' pages together, as we can see here, we come to 16,895 pages, times 4K, and that is really the value of min_free_kbytes that has been configured here. So that has to be delivered proportionally by the three zones. That is the min value, but per zone we can also see two other values, which are 'low' and 'high'. They are also calculated at boot time: low is the min value of the zone plus a certain factor times min, and that factor is usually around a half, and high is min plus twice that factor times min. So let's have a look again: min is 8 for the DMA zone, about half more is low, and about half more again is high; the factor here is 0.47, so it is not exactly 50 percent more. We can see that these thresholds are calculated in the same way for the other zones: low is about 50 percent more than min, and high is about twice min. Okay, going back here: these thresholds are important to know for the moment we are going to refill the free pages. Suppose that the number of free pages in the zone is still high at this moment, a lot of free pages in this zone. At a certain moment, pages in the zone are going to be used, by the page cache or by processes, and we see the number of free pages in the zone drop. When the free pages in the zone reach the 'low' value of the zone, a thread in the kernel called kswapd is activated, and it takes care that occupied pages in the zone become free pages again. And it will not release just one page; if it freed just enough to get above the threshold, we would get thrashing, as it is called. Instead it immediately releases a lot of pages and makes them free again, until the 'high' value of the zone is reached again.
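These watermarks are easy to inspect programmatically as well; a small sketch that prints the zone headers and their min/low/high lines from /proc/zoneinfo:

    /* Sketch: print every zone and its min/low/high watermarks
     * (in pages) from /proc/zoneinfo. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *fp = fopen("/proc/zoneinfo", "r");
        char line[256];

        if (fp == NULL)
            return 1;

        while (fgets(line, sizeof line, fp) != NULL) {
            char *p = line;
            while (*p == ' ')                      /* skip indentation */
                p++;
            if (strncmp(line, "Node", 4) == 0 ||   /* zone header line */
                strncmp(p, "min",  3) == 0 ||
                strncmp(p, "low",  3) == 0 ||
                strncmp(p, "high", 4) == 0)
                fputs(line, stdout);
        }

        fclose(fp);
        return 0;
    }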
So kswapd stops releasing pages once the high value is reached: then we have enough free pages again. Of course, these free pages will be used again in time, the number of free pages drops below 'low' again, and kswapd takes care that pages are released until we reach the 'high' value of the zone again. This may also explain why you sometimes see heavy swapping while on the system level there is still a lot of free memory: it might be that one of the zones has a shortage of free pages, and that is why swapping goes on, specifically in that zone and probably not in the other zones. The kernel tries to preserve the pages in the first two zones, in principle: pages will be spread rather equally, but pages in the first two zones are more precious. Right, so that is the moment that the system starts freeing up pages again. How the system does that, we will see on the next slide, but let's first go back to this slide. We see here all the physical pages in the nodes, and of course the kernel has a small piece of administration for each physical page, which says whether the page is free at this moment, or what it is in use for, to which process the page belongs, and so on. So for every physical page, the kernel maintains a small piece of administration. If we have a look here, we can see that these pieces of administration of the physical pages hang in certain lists. Let's first have a look at the left part of this picture. What we see in the left part are the so-called anonymous pages, and anonymous pages are more or less the pages of processes: the data pages and the stack pages of processes, but also shared memory pages and tmpfs pages; they are all here in the left part of this picture. Well, suppose that you are going to start a new process. That new process needs free pages; they are taken from this collection of free pages, and those pages are filled, as we have seen, by referring to them. Once a page is filled, it is put in this list, the active anonymous pages list. This is a so-called LRU list, a least-recently-used list: the page is put at the top of that list, and when later processes also use pages, the pages of my earlier process slowly travel to the bottom of this active list. When such a page is at the bottom, it might be transferred to the inactive anonymous pages list, and those are pages of which we think that they are probably not in use anymore by the process; the process is still running, but this page might not be in use anymore and might be a future candidate to be swapped out if it stays unused. This is an LRU list as well, and there too, a page slowly travels down in time. But suppose that page is touched by the process: at the bottom of this list, before releasing the page, we notice that the page has been used again by the process, and then it is moved to the active list again and put on top, so that it again has some time to travel to the bottom of this list and go to the inactive list; and if it is referenced again, it takes that cycle once more. But if such a page hasn't been used by the process for a while, it can be released as a free page. What we see in the right part are the pages that more or less belong to the page cache, the blue pages.
Also for the page cache we have an active file pages list and an inactive file pages list. Well, suppose that one of the processes opens a file and reads some data from that file. In the page cache you only want to keep the popular data, the data blocks that are accessed more frequently and seem to be popular. So at the moment that I touch or read a piece of data from a file for the first time, a free page is taken and that page is filled with the data from the file system; but we don't know yet whether it is a popular page, whether it will be used again in the near future. So on its first reference, this page is moved to the inactive file pages list, which is also an LRU list, where it travels down in time. If it hasn't been used or read by another process in that time, it is moved to the free pages list again; it didn't seem to be a popular page. But when, while traveling down, it is read again, by another process or by the same process, then it seems to be popular, and on its second reference it is moved to the active file pages list. So in this way, we keep the most popular data pages in the page cache all the time. Well, notice that this is a pull mechanism: when the number of free pages drops below the 'low' value, as we have seen before, we need more free pages again, and then we pull from the inactive anonymous pages list and from the inactive file pages list to refill the free pages in this zone. That also takes care that pages are moved from the active lists to the inactive lists at that point. So this is how the free pages in a zone are refilled, and we have these active and inactive lists per zone. But the question is, of course: if I am running out of free pages in a zone, how many pages of processes will be swapped out, versus how many pages will be released from the page cache? That is what we see on the next slide: the balancing of anonymous pages against page cache pages. That is more or less controlled by the swappiness parameter, /proc/sys/vm/swappiness, which is a value between 0 and 200. The lower this value, the more aggressively we take pages from the page cache and the less swapping we do of process pages; and the higher this value, the more process pages we swap and the more we leave the page cache alone. If we look at the code in the kernel that decides how we get our free pages again: if no swap space is available at all in my system, not even a swap device configured, then of course we can only shrink the page cache; we cannot swap out anything. If swappiness is really zero, we will also only shrink the page cache when we need free pages. And if the page cache has reached a minimal size, we will only take process pages: we cannot shrink the page cache any further, so we will swap out process pages. But these first three points are in fact exceptions. If we look at the normal situation, we take some pages from the page cache and some other pages from the processes, so we do swapping, and the balance is determined by the swappiness parameter. We get an anonymous priority, which is the swappiness value, on most systems 60 by default, and a file priority, which is 200 minus swappiness, by default 140 in that case. So that will be the relation, 60 against 140, between how many pages we take from the processes versus how many pages we take from the page cache.
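As a trivial worked example of that balance, under the assumption that the kernel really weighs the two sides with exactly these priorities:

    /* Sketch: the anonymous/file balance as described above, derived
     * from the current swappiness setting. */
    #include <stdio.h>

    int main(void)
    {
        int swappiness = 60;                   /* default on most systems */
        FILE *fp = fopen("/proc/sys/vm/swappiness", "r");

        if (fp) {
            fscanf(fp, "%d", &swappiness);
            fclose(fp);
        }

        int anon_prio = swappiness;            /* weight for process pages */
        int file_prio = 200 - swappiness;      /* weight for page cache    */

        printf("anon priority %d : file priority %d\n", anon_prio, file_prio);
        return 0;
    }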
But then, finally, we can reach the point that memory is full and swap is full, and then we have the situation that the kernel cannot do anything else than kill a process. That is what we call OOM killing, and the kernel searches for the process with the so-called highest OOM score. This mechanism has been modified over time, but in the modern kernels you can see that, in principle, the process with the highest physical memory consumption will be killed. The OOM score is calculated by taking the usage of the process, its real memory usage and swap usage versus the total memory and swap space, as a per-mille value, adding a thousand to that, and adding an OOM score adjustment to artificially lower or increase the OOM score of a process. That formula defines the OOM score for every process, and the process with the highest OOM score at that moment will be killed. We can also look at the OOM score, so we can already predict beforehand which process will be killed when we run out of memory: you can see that under the /proc directory, in the directory of a certain PID, in the file oom_score. And you can also modify the OOM score adjustment that is in the formula, a value which by default is mostly zero: you can give it a negative value to protect the process against OOM killing, or you can increase this value to make the process more of a candidate for OOM killing. So let's have a look. When I run this command, I grep, with just a beginning-of-line as search pattern, in the oom_score files of all the processes underneath /proc. Of course I could have done a cat, but cat doesn't show the file name, and I want to know the file name; that is what grep shows me. If I have a look here, I can see the OOM scores of all the processes that are currently running, and you can see that a lot of them have the value 666. Why is that? Well, suppose that the per-mille usage of a process is less than one per mille: then this is zero, plus a thousand, plus an OOM score adjustment of zero, and then times two-thirds; that ends up at 666, so that is a value you see quite often. If I run this next command, I sort the output of that grep on the value at the end, so I can also predict which process will be the first candidate to be killed by OOM killing. You can see it is process 3066, and if I have a look at it: oh, it's my viewer, I didn't expect that. So artificially, with the OOM score adjustment, you can give your process a higher or a lower OOM score. This is also what you can specify, for instance, in systemd service files: if a very important process has to be started, you can already give it a negative OOM score adjustment there.
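That grep-and-sort trick can also be done programmatically; a minimal sketch that scans /proc for the process that currently has the highest oom_score, the first candidate for the OOM killer:

    /* Sketch: find the most likely victim of the OOM killer by
     * scanning /proc/<pid>/oom_score of all processes. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <ctype.h>
    #include <dirent.h>

    int main(void)
    {
        DIR *proc = opendir("/proc");
        struct dirent *de;
        long maxscore = -1, maxpid = -1;

        if (proc == NULL)
            return 1;

        while ((de = readdir(proc)) != NULL) {
            if (!isdigit((unsigned char)de->d_name[0]))
                continue;                      /* only PID directories */

            char path[64];
            long score;
            snprintf(path, sizeof path, "/proc/%s/oom_score", de->d_name);

            FILE *fp = fopen(path, "r");
            if (fp == NULL)
                continue;                      /* process already gone */
            if (fscanf(fp, "%ld", &score) == 1 && score > maxscore) {
                maxscore = score;
                maxpid   = atol(de->d_name);
            }
            fclose(fp);
        }
        closedir(proc);

        printf("pid %ld has the highest oom_score: %ld\n", maxpid, maxscore);
        return 0;
    }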
All right, so far for the general part of memory management. What I want to have a look at now is how this applies to containers. Containers use cgroups, control groups, and cgroups are a mechanism to subdivide the capacity of a certain resource, which can be CPU capacity, or memory capacity in this case: you make slices of that total capacity and then assign certain processes to those slices. What you can do with cgroups is put a limitation on the memory usage of all the processes in a certain cgroup. Normally, processes can use all the memory there is, and by that introduce swapping and push other processes out of memory; but you can also say: the processes assigned to this cgroup can only use at maximum, say, 500 megabytes. And you can also take care that a process is not entirely swapped out: you can have a guarantee on memory consumption and say, this is the lower value, the lowest limit, this is what I want to assign to this cgroup anyhow. Cgroups are usually managed by systemd, and they are implemented via a pseudo file system, similar to /proc, not a real file system. The root directory of this pseudo file system is /sys/fs/cgroup, and initially all the processes are in that root cgroup: in this directory they are all assigned to that cgroup, and initially that cgroup has all the capacity. But we can make subdivisions; let's have a look at an example of how we can make our own cgroups. If I go to the cgroup directory with the cd command, I can see with ls that there are a lot of files, which are not really files but pseudo files that I can manipulate and look at. One of the files is cgroup.procs, and if I do a cat on that file, I can see all the PIDs of the processes that are currently assigned to this cgroup. Well, what I can do in this root directory is make my own subdirectory, just by doing a mkdir, and if I go to that new subdirectory, it is by magic already filled with all kinds of files again. There as well I find the file cgroup.procs, which holds the PIDs of all the processes assigned to this cgroup, but I can also see all kinds of other files that I can use to manipulate the memory usage of all the processes assigned to this cgroup. If I have a look at which processes are assigned to my new cgroup, I can see that the file is empty, and with memory.current I can also see the current physical memory consumption of all the processes in this subgroup; well, it's still zero. What I can do now is assign a PID, by echoing a PID into that cgroup.procs file; and you know, $$ is the PID of my running shell, so I assign my shell's PID to this file and make my shell a member of this cgroup. My shell was a member of the root cgroup, but it is taken out of there when you assign it to another cgroup underneath. And all the descendants of my shell will inherit the connection to this cgroup: if I now start a big application from my shell, its PID will automatically be assigned to this cgroup as well, being a child of this shell. So if I have a look at memory.current now, at the current usage, I can see the memory consumption of that big application, and maybe more; we'll see. That memory.current, which you find per cgroup, contains the memory consumption of all processes in the cgroup, and also the memory consumption these processes cause inside the kernel, not only the user space itself: also what these processes have allocated in the page cache and in the slab caches of the kernel. Except the memory that a process had already claimed before it was assigned to the cgroup; that is not in memory.current. So going back to my example, we see the same example here again: at the moment that I echo the PID of my shell, the allocations that my shell did before are not in memory.current; only the new allocations of the shell from now on will be counted, and my newly started process will be in memory.current entirely.
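What the shell does here with echo and cat, a program can do just as well; a minimal sketch, assuming that a cgroup v2 subdirectory /sys/fs/cgroup/mygroup already exists (a hypothetical name) and that we have the privileges to write there:

    /* Sketch: move ourselves into an existing cgroup v2 group and
     * read back the current memory consumption of that group. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *grp = "/sys/fs/cgroup/mygroup";   /* hypothetical */
        char path[128], buf[64];
        FILE *fp;

        /* join: write our own PID into cgroup.procs */
        snprintf(path, sizeof path, "%s/cgroup.procs", grp);
        fp = fopen(path, "w");
        if (fp == NULL)
            return 1;
        fprintf(fp, "%d\n", getpid());
        fclose(fp);

        /* read: current physical memory usage of the whole group */
        snprintf(path, sizeof path, "%s/memory.current", grp);
        fp = fopen(path, "r");
        if (fp && fgets(buf, sizeof buf, fp) != NULL)
            printf("memory.current: %s", buf);        /* in bytes */
        if (fp)
            fclose(fp);
        return 0;
    }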
With ps you can also see to which cgroup each process is assigned, and you can see that all these processes are now assigned to the same cgroup. Okay, with cgroups you can also give processes a guaranteed amount of memory to use physically, memory that will not be swapped out. By default, processes can be swapped entirely, but I can give a certain memory guarantee for important processes, and that works as follows. I go to my top directory and I make a new subdirectory again, vips. In there I find a value, memory.low, which is the minimum value that I want to guarantee for the processes in this cgroup, and I can echo, for instance, 500 megabytes into that memory.low. With that I have assigned that memory; it will not be allocated immediately, but it means that if I later on start processes in this cgroup, they will get their pages by demand paging, they might at a certain moment go over 500 megabytes, and they might also be swapped, but they will never be pushed back below 500 megabytes again, because that is the guarantee. So now I can start my VIP server, a very important server, and it gets this PID, 13000-something, and I can echo that PID into cgroup.procs, and with that I have assigned it to this cgroup. However, all the memory that has been allocated by my VIP server before doing this echo is not counted. Therefore the approach that we see at the right side of the slide is better: first assign my shell to the cgroup, and then start my VIP server via my shell, or even exec my VIP server to get rid of the shell. So with that I can give a guaranteed amount of memory to my processes. But I can also set a limitation, on the other hand. Suppose that you have a process with a memory leak, and it cannot be solved for the moment; then you can still say, okay, let's put a maximum memory size on that process or process group, and then run that process. At the moment that it reaches that value, it will only swap its own pages out, and it will not push other processes out of memory, which would by default be the case. Here again, I create a new subdirectory, leakers, and I go to that subdirectory. You can see that the maximum memory that can be used by all the processes is 'max', everything there is; but I can put a limitation here: echo 100 megabytes into memory.max, and that takes care that the processes cannot use more than 100 megabytes of physical memory. Now I can assign my shell again, echoing the PID of my shell into cgroup.procs, and I can see my current memory usage, which at this moment is only that of the shell, of course. You can even see, next to the memory usage in memory.current, how much swap space has been used so far, in memory.swap.current: no swap space yet for this cgroup. Suppose now that I start my leaking application; then I can again look at memory.current and memory.swap.current, and they increase. If I look a while later, I can see that my memory has reached 100 megabytes, and it cannot go over that, and I can see that it is now filling the swap. Even swap space can be limited for a cgroup, by echoing a certain value into memory.swap.max, which means: if the processes reach the memory maximum, they will be swapped out, and if they reach the memory.swap.max as well, they will be OOM-killed, even if there is plenty of memory and plenty of swap in the system.
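To watch such a leaking group run into its limits, as in this demo, a loop like the following will do; it assumes the hypothetical /sys/fs/cgroup/leakers path from the demo:

    /* Sketch: poll memory.current and memory.swap.current of a cgroup,
     * to watch it hit memory.max and spill over into swap. */
    #include <stdio.h>
    #include <unistd.h>

    static long readval(const char *path)
    {
        FILE *fp = fopen(path, "r");
        long val = -1;

        if (fp) {
            fscanf(fp, "%ld", &val);
            fclose(fp);
        }
        return val;                            /* bytes, or -1 on error */
    }

    int main(void)
    {
        for (;;) {
            printf("mem %10ld  swap %10ld\n",
                   readval("/sys/fs/cgroup/leakers/memory.current"),
                   readval("/sys/fs/cgroup/leakers/memory.swap.current"));
            sleep(2);                          /* poll every 2 seconds  */
        }
    }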
Which brings us to containers. As we know, containers are isolated ecosystems to run your applications in, and every process running in a container is in fact a native process of the host it runs on. That is typically different from virtual machines: every process running in a container is, for the host underneath, a native process. So that means every process in a container will be treated as any other process on the host, using the page cache and the slab caches and so on. However, every container has its own mini file system, which is different from the host file system. That mini file system comes from the image, and it can be modified by the application as well; in that mini file system we find the executables and the shared libraries and so on for this particular container. Well, as you know, at the moment that you start a container based on an image, an additional layer is created on top of the image layers, and that layer holds all the modifications done in the container by the application. So we have a static mini file system from the image, and on top of that we have a layer that holds all the modifications. That means, if we have a look at code sharing: suppose that you start more containers on your host based on a certain image, then we still have code sharing. If the executable file is in the image and you run multiple containers with that image, we still have code sharing. But suppose that inside the container you dynamically build your executable: then the executable is in this top layer, and if you run it later on in your container, it will not be shared with other containers, even if those other containers are based on the same image. So code sharing is, in that sense, different from normal code sharing, where all the processes run with the host file system. Furthermore, if we have a look at Docker and Podman specifically: with the commands docker run and podman run you can give various parameters, and some of them relate to the memory management topics that I covered. At the moment that you start a container with docker run, each container will be a new cgroup underneath the directory /sys/fs/cgroup, in the machine slice or the user slice. In particular, if you use podman and you run a container as a normal user, you get your cgroup underneath your own user slice, with your own user ID; if you use docker, or run podman as root, you usually get your new cgroup underneath the machine slice. However, in all cases you can specify with docker run or podman run the parameter --memory-reservation, for instance 150 megabytes. What will be done is that, in this new cgroup for your container, memory.low is set, and that gives the application in the container, or the applications in the container, a guarantee of memory to be used: it will not be pushed below that, it will not be swapped below that. There is another parameter that you can use, which is --memory, and that is a limitation, a max: if I put 300m in there, that will end up in the cgroup's memory.max that we have seen before. Furthermore, you can also use the parameter --memory-swap, which you can set, for instance, to 500m, and that is the memory and the swap together that can be used by the application in the container. So what will really be done is that memory.swap.max is set, not to this --memory-swap value itself, but to this value minus the --memory value. So in this case, memory is 300 and memory-swap is 500, which means that I get a limitation of 300 megabytes in memory and 200 megabytes on swap.
If you don't use --memory-swap as a parameter but you do use --memory, the default for memory-swap is twice the value that you specify for --memory.

I will give you a small demo of that. What I can do is run the command podman run --memory=50m, and I use an image here that I prepared, in which I have the use-mem command. So I run the use-mem command in the container; the command overruling the image default is use-mem: allocate 20 megabytes, create it physically as well, and repeat that every two seconds. So it is increasing all the time, and the maximum is 50 megabytes of memory, but that also implies a maximum of 50 megabytes of swap. If both are filled, my process will be OOM-killed. So if I run this command, it will do a number of allocations of 20 megabytes until the fifth one; then memory is reaching its limit and swap is reaching its limit, so then it will be OOM-killed.

Furthermore, with docker run and podman run you can also specify the OOM score adjustment (--oom-score-adj), and, well, that's obvious, we talked about that: you can make your process more sensitive to being killed by the OOM killer, or less sensitive. There are other parameters that are not supported anymore with cgroups version 2, --memory-swappiness and --oom-kill-disable; those are just for systems with cgroups version 1.

Finally I want to have a look at Kubernetes. At the moment that you create a new pod, the pod scheduler is going to search for a suitable worker node, and one of the decisions is based on the resource definitions of this pod. What you can see in the pod's manifest file that I have here on the slide is that I have a pod with two containers, and that you can specify resources at container level. There, again, you can specify limits, for memory and for cpu, and you can specify requests, which is in fact the guarantee; there I can also ask for a certain amount of memory. And you can see that you can do that differently per container.

At the moment that your pod is placed on a worker node, it also gets a quality of service, which is determined by the scheduler. That can be Guaranteed, and it will be Guaranteed when each container in the pod has a specification of the resources, the limits and the requests, and these specifications are equal: here the limits and the requests are equal, and for the other container the limits and the requests are equal as well. If that condition is met, your pod will be scheduled with Guaranteed quality of service; we will see what that means later on. The quality of service of the pod can also be Burstable, and that happens if you specify at least a memory or cpu limit for one of the containers, but the limits and the requests do not have the same values, or you didn't specify these values for all the containers in your pod. If you didn't specify anything about resources at all in your pod, you get the quality of service BestEffort.

Depending on that quality of service, on the worker node where your pod is scheduled, you will be placed in one of the cgroup subdirectories of that worker node: there is a special cgroup for kubepods-burstable and kubepods-besteffort, and there is a special cgroup for Guaranteed quality of service. On the worker node, if the pod is started, all these resource definitions are passed to the container runtime, like containerd.
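To make that concrete, here is a minimal sketch of such a two-container manifest; the names, images and values are illustrative, not the exact slide contents.

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: qos-demo
    spec:
      containers:
      - name: app1
        image: registry.example/app1        # placeholder image
        resources:
          limits:
            memory: "200Mi"
            cpu: "500m"
          requests:
            memory: "200Mi"                 # equal to the limits for every
            cpu: "500m"                     # container: QoS class Guaranteed
      - name: app2
        image: registry.example/app2        # placeholder image
        resources:
          limits:
            memory: "100Mi"
            cpu: "250m"
          requests:
            memory: "100Mi"
            cpu: "250m"
    EOF

Make the requests lower than the limits, or leave them out for one of the containers, and the pod becomes Burstable; leave out the resources completely and it becomes BestEffort.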
Containerd on that worker node will then create a subdirectory, a cgroup, for the pod as a whole, and underneath it subdirectories per container, and there it will again apply these limits and requests. The resources.limits.memory that we see here, the 200 megabytes for instance, will be used again as memory.max in the cgroup. If you are going to exceed this later on, you will get OOM killing, because if you try to exceed your limit, swapping would have to be done, but usually worker nodes don't have swap, and therefore, if the application exceeds the limit, you immediately get OOM killing of the process in your container.

If we look at the request value, you would maybe expect that memory.low in the cgroup will be set from that, but that is not done, because that is also a problem. Suppose that you set a request value of 500 megabytes; your application is running and is going to exceed the 500 megabytes. At a certain moment it becomes very crowded in memory, also because of other containers and other processes; then your application has to be pushed back to its limit, to its guarantee, and at that moment swapping should be done, but if your worker node doesn't have a swap device, you cannot get the process back to its guaranteed value again.

So what we see with resources.requests.memory is that it may change the oom_score_adj value of the process. For Guaranteed quality of service the oom_score_adj will be set to a very low, negative value, -997, which makes such a process almost protected against OOM killing. Burstable processes, however, will get an oom_score_adj value that is calculated with a certain formula: 1000 minus the per-mille value of the requested memory of this application relative to the total memory of the worker node.

A real example of that: suppose that I have Burstable quality of service for my pod. If I have a look at kubectl describe of my worker-1 node, I can see the total memory of my node, about 1.5 gigabytes. If I go to that node (I'm here on worker-1), I can have a look at my process running in a container there; that process runs with PID 8224 in this example, and I can see an oom_score_adj of 863, a high value, which makes this process very sensitive to being OOM-killed. And you can see, according to the formula, this is 1000 minus 1000 times 200 (because 200 megabytes is what I specified for the request in this example) divided by the total memory size of the worker node, and that comes to this value of 863, giving this process a very high OOM score and making it rather sensitive to OOM killing, in contrast to Guaranteed, which has a very low value. And BestEffort: if you do not specify anything about requests and limits, the oom_score_adj will always be 1000, making such a process even more sensitive to OOM killing. By the way, you can see the quality of service if you do a kubectl describe of your pod: it shows which quality of service has been assigned to the pod, based on what you specified for resource consumption.

Finally: suppose that you run pods with BestEffort, then they are rather sensitive to OOM killing. Burstable quality of service is also sensitive to OOM killing, especially if you combine it with a lot of natively running processes on the worker node, processes that are not started by Kubernetes but are just native daemons running on the worker node itself; that can also introduce out-of-memory killing of containerized processes.
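The demo that follows uses a pod along these lines; this is a reconstruction, in which the image name and the exact use-mem arguments are assumptions on my side.

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: usemem-demo
    spec:
      containers:
      - name: usemem
        image: registry.example/usemem      # placeholder for the prepared image
        # Assumed invocation: allocate 20 MiB physically and repeat every
        # 5 seconds; check the usage message of use-mem for the exact syntax.
        command: ["use-mem", "-r", "5", "20m"]
        resources:
          limits:
            memory: "50Mi"
            cpu: "250m"
          requests:
            memory: "50Mi"                  # equal to the limits: Guaranteed
            cpu: "250m"
    EOF

With the default restartPolicy of Always, the repeated OOM kills are what drives the pod into CrashLoopBackOff, as shown next.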
If you make your resources.limits.memory too tight, even with Guaranteed quality of service, it might be that your processes get OOM-killed. That is for instance what happens in this example pod: I use my prepared image again, I run the use-mem command and allocate 20 megabytes all the time, with a repetition of every 5 seconds, and you can see that I have specified limits and requests with the same values, so this is Guaranteed quality of service. But still, at a certain moment, after doing that 20 megabyte allocation a number of times, I will reach my limit of 50 megabytes. If I run this pod and I do a kubectl get, I can see my status being modified all the time: after creating this container and running it, it will be OOM-killed after, well, 2 or 3 times 5 seconds, then it will be restarted by Kubernetes, and it will be OOM-killed again, and eventually end in a CrashLoopBackOff. So even if you are using Guaranteed quality of service, take care that you know what the memory need of your application in the container is. You can of course find that out by just running your application and then having a look, for instance with atop, at the resident set size, to see how much space, how many pages, have been paged in for your process.

All right, it's time, but still there might be some questions? Otherwise... yes? Sorry, which file system? The cgroup file system? Ah, okay, the question is whether, if I go to /sys/fs/cgroup, I have to be root to do modifications there. No, normally not, no. It's okay, we can have a look afterwards if you like, just afterwards. Okay, I want to close down. Thanks for listening and have a good day.