Okay, I guess we are good to start. Thanks for coming to the last session of KubeCon, glad to see you here. My name is Antti, and this is joint work with Ying, who unfortunately couldn't come here today. We are going to talk about saving by swapping.

This is the problem we are addressing in this talk: given the memory we have, we would like to run more workloads on the nodes we already have, without adding new nodes to the cluster, and swap is here to help. The idea of swap is of course that some seldom accessed, hopefully seldom accessed, data is written from memory to somewhere else, so that the memory can be used by someone else. When the owner of the data needs it again, the operating system fetches it back into memory (this is called a page fault) and the owner can continue.

Here is a playful calculation. If you have ten workloads running in your Kubernetes cluster, each of them requesting one gigabyte of memory but actively using only 900 megabytes of it, then by swapping out that idle 100 megabytes per workload you could fit an eleventh workload on the node. That doesn't sound like much, but put it at scale: if you implemented this in, say, a hundred data centers, on every server there, you could switch off something like twenty coal power plants worldwide. The scale makes it matter. And of course it also matters whenever you simply don't need to start a new node in your cluster.

So if swap is this good, why is it not on by default? For a long time, kubelet refused to start if it noticed that there was swap on the node. Nowadays you can tell kubelet: yes, there is swap, I know, I actually put it there, don't care, just run. The reason for the old behavior is that swap brings a performance cost, because page faults are costly, and it brings unpredictability, because you don't know when the system will start to slow down. Kubernetes was designed to run workloads that are latency sensitive and should respond in time, and swapping makes that difficult.
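As a concrete starting point, here is a minimal sketch of telling kubelet to tolerate and use swap. This assumes a recent Kubernetes version (NodeSwap is beta and on by default since v1.28; older versions need the feature gate enabled explicitly), and the config file path is a common default rather than a universal one:

```
# A minimal sketch, assuming Kubernetes v1.28+ where NodeSwap is available.
# The kubelet config file path below is a common default, not universal.
cat <<'EOF' >>/var/lib/kubelet/config.yaml
failSwapOn: false            # start even though the node has swap
memorySwap:
  swapBehavior: LimitedSwap  # let only Burstable QoS pods use swap
EOF
systemctl restart kubelet
```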
The point of this talk is not to think of swap as one switch that is either on or off, but to take different points of view on swapping, so that you can make swap work for you and get the benefits without suffering too much from the costs. The questions are: what to swap, how much to swap, where to swap, when to swap, and how to isolate the effects of swapping. Finally, we will look at where to put the control logic of this whole thing, so that we can implement the kind of swapping that suits us.

First, what to swap? In Kubernetes, swapping is currently based on QoS classes: guaranteed containers are not swapped, burstable ones are, and best-effort probably not. But we could go into more detail than that, and I think we should if we want this to work nicely with all workloads. Maybe we swap only certain pods that are marked as swappable, or even certain containers. We can go as low-level as swapping only certain processes in certain containers, if they take a lot of memory and we know they can be swapped without hurting performance too much.

The second question: how much to swap? This boils down to estimating the working set size of processes. Working set size means the amount of memory that a process actively needs in order to run smoothly. It is a different thing from the resident set size that you can find in the process status, for instance, which is the amount of memory the process currently has in RAM. The resident set size is often bigger than what the process actually needs, but if your system is already swapping a lot, it may not be big enough either. I will demonstrate in this presentation how you can measure the working set size of your processes and containers.

Next question: where to swap? zram is a block device in RAM, and anything you write to that block device gets compressed. Because it is a block device, you can also put swap on it. So everything you swap out gets compressed but still stays in memory, which makes it pretty fast to bring back when it is needed on a page fault. zswap is another, somewhat similar thing: it also compresses memory that is being swapped out and keeps it in memory, but only for caching purposes. In the zswap case there can also be a backing device, so if there is not enough memory for storing the cached swapped-out data, it gets written in a controlled manner to the backing device. In the latest Linux kernels, zswap can use hardware-accelerated compression, so you don't even need to run the compression on the CPU, and the same is enabled for zram.

Then there are of course physical block devices, like non-volatile memory devices and SSDs. And one option that is not actually swapping, but that I would like to mention here as well, is adding memory to the system. You can add memory that is cheaper than the memory you probably have: DDR5 memory is pretty expensive compared to CXL.mem modules, which are cheaper and can be bigger, but are also slower. Instead of swapping, you can go through your memory, see what is not that actively used, and move it to a physically different, slower memory tier. That lets you continue without swapping at all, keeping the most needed data in fast memory and the rest in somewhat slower memory, which is not that slow after all, unless you are doing something very performance critical.
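To make the zram option concrete, here is a rough sketch of setting up compressed swap in RAM. The device size and the chosen compression algorithm are illustrative assumptions; what is available depends on your kernel:

```
# A rough sketch: create a zram block device and use it as high-priority swap.
modprobe zram
echo lz4 > /sys/block/zram0/comp_algorithm   # pick a compression algorithm
echo 4G  > /sys/block/zram0/disksize         # uncompressed device capacity
mkswap /dev/zram0
swapon --priority 100 /dev/zram0             # prefer zram over slower swap devices
```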
Next question, the fourth one: when to swap? You can leave the swapping completely to the operating system, and it will start swapping when there is memory pressure. The good thing is that the overall performance is pretty good: everything runs fast as long as there is memory. But when memory pressure hits, everyone hurts, and this is probably one of the reasons why Kubernetes originally had swap disabled.

In cgroups v2 there are controls for saying that this particular container should start to swap out. There is container-level memory pressure through the memory.high control: when that limit is reached, the operating system starts swapping, but it swaps only that container. This already gives you a better limit on how much the heat spreads to other workloads.

Going further, one thing I am demonstrating here is a new project, memtierd, whose first release we made just before this conference. It is a userspace daemon, and among the things you can do with it is track the memory of workloads and define your own limits: if some memory has sat there for half an hour without a single access, swap it out, even though there is no memory pressure at all. You can do this kind of swapping in the background and push the moment when memory pressure occurs as far into the future as possible. Another nice thing is that you can control the bandwidth, that is, how many megabytes per second you are swapping out, so it is less disruptive for the other workloads.

Finally, how to isolate the effects of swapping? I already mentioned limiting the swap-out bandwidth: when you swap very early in the run, you have time for that kind of thing. You can also limit the swap-in bandwidth. If some container has been swapped out a lot, then when it really needs its memory back, you might want to limit its swap-in bandwidth so that it does not consume all the bandwidth of the backing device, which would again interfere with other containers. This is good for low-priority containers: the others can keep running while this one recovers from swap. This IO is configurable in cgroups v2, and in v1 as well; it is one of the OCI-level parameters of the container.

Speaking of the OCI level, it is now a good time to bring in the question of where we should put the control logic of swapping. Both container runtimes, CRI-O and containerd, implement an NRI server nowadays. It can be switched on in the container runtime configuration of both runtimes, and it will be on by default in containerd 2.0. The NRI server enables plugins (any process in the system, but that includes Kubernetes-managed containers) to connect to the runtime and react to all pod and container lifecycle events. When reacting, they can make adjustments to the OCI-level parameters, so you can write even memory.high, memory.swap.max, and the block IO parameters.

One example of an NRI plugin is the nri-memtierd plugin, which runs the memtierd userspace daemon inside the same container where the plugin itself lives. When NRI plugins are launched, they register with the container runtime and tell it which lifecycle events they are interested in; in this case, starting a new container. Because of that, the runtime tells the plugin whenever it is starting a new container, and the plugin can react. The nri-memtierd plugin reacts by launching a new memtierd process that will watch that container. And because we are giving working set size estimation as our example, this memtierd process starts dumping the memory access data it can find for the processes running in the workload we are tracking, that is, in the tracked container.

So how does this look in practice? On the first line we add the nri-plugins repository to Helm, then we install the nri-memtierd plugin from that repository. There is an extra parameter here: as I mentioned, the container runtimes don't have NRI enabled by default yet, but when you install the plugin you can patch the runtime configuration so that it gets switched on and the NRI plugins can actually connect to the runtime. So: add the repo, install the plugin, and you can already see it running in its own pod. It is a DaemonSet that runs on every node in your cluster.

Here is then an example workload that I am going to track in this example. Never mind the details; the one thing I want to emphasize here is the annotation, which says that the class this pod, and in this case its single container, belongs to is the track-working-set-size class. That is the name of a class.
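Reconstructed as commands, the demo steps look roughly like this. The repository URL, the chart value for patching the runtime config, and the annotation key follow the containers/nri-plugins documentation as I recall it, so treat them as assumptions that may differ between versions:

```
# Add the nri-plugins Helm repository and install the nri-memtierd plugin.
# patchRuntimeConfig asks the chart to enable NRI in containerd/CRI-O.
helm repo add nri-plugins https://containers.github.io/nri-plugins
helm install --namespace kube-system nri-memtierd nri-plugins/nri-memtierd \
  --set nri.patchRuntimeConfig=true

# An example workload: the annotation assigns the container to a memtierd class.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: wss-demo
  annotations:
    class.memtierd.nri.io: track-working-set-size
spec:
  containers:
  - name: workload
    image: busybox
    command: ["sh", "-c", "while true; do sleep 1; done"]
EOF
```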
Then we just launch it. And what does the nri-memtierd plugin configuration look like? There is the class called track-working-set-size, and associated with that class there is a memtierd config, which is the actual configuration given to the memtierd process that is launched when a new container belonging to this class gets started.

From that configuration, let me raise a couple of things. Here we say that we are interested in memory age, that is, the time since the last access to the memory, and we are using the idle page tracker. Idle page tracking is a Linux kernel interface that enables userspace processes to see which memory pages have been accessed since the last reset of the idle bit. There are some parameters, maybe most importantly the scan interval: we scan all the memory of the processes found in this cgroup every 20 seconds. As I mentioned, we could also go deeper into the container and track only a single process; that is possible by configuring a pidwatcher filter, so that we would be filtering the processes found in this cgroup. And finally there is a routine being executed: a memtierd command-line command that dumps accessed memory into age intervals, from 0 to 1 minute, from 1 minute to 30 minutes, from 30 minutes to 2 hours, from 2 hours to 1 day, and then the rest.

How does the result look? Right after the tracking has started, all the memory that this process had at that point (not that much memory) shows up as most recently accessed. But once it has been running for some time, we can see that about 70% of the memory of the process has been accessed somewhere between 1 minute and half an hour ago. This already gives some idea of how this particular workload, this process, uses its memory, and it works as a guideline for thinking about how to swap it.

Then, how to swap it? We use the same plugin again, with a memtierd process there to say: swap out this process, this process ID, and these address ranges from it. Here is a different class now, not just tracking but a swap-half-an-hour-idle-memory class. We add the option that allows swapping, we add an idle duration, meaning that if a page has not been accessed at all within that duration, it will be swapped out, and we also configure the mover. The mover receives the address ranges that should be going to swap; it acts once every 50 milliseconds and uses a bandwidth of 25 megabytes per second to do the swapping.

Of course, we don't need to use memtierd to do the swapping; we can use the cgroups controls that I already mentioned. There is another NRI plugin for that, nri-memory-qos. It is an extremely simple plugin: if you are interested in what NRI plugins can do and how they can control cgroups, you should look into this one, because it is so small. It tells the runtime that it is interested in the create-container event, not the start-container event like the nri-memtierd plugin, and when a new container is being created, it responds by giving memory.high and memory.swap.max values. Where does it get those values? Again, there is a class-based annotation: you can annotate your pod, or a container inside the pod, to say how much of that container's memory may be swapped. These values are then calculated from the memory limit of the container and from the class.
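One way to picture what such a plugin sets up is the raw cgroups v2 writes below. The memory.high and memory.swap.max semantics are standard cgroups v2; the 20% ratio and the cgroup path are illustrative assumptions, not the plugin's actual defaults:

```
# A sketch of the resulting cgroups v2 settings for a container with a
# 1 GiB memory limit in a class that allows 20% of it to be swapped out.
# The cgroup path below is an illustrative placeholder.
CG=/sys/fs/cgroup/kubepods.slice/kubepods-pod-example.slice/container-example.scope
LIMIT=$((1024*1024*1024))
echo $((LIMIT * 8 / 10)) > "$CG/memory.high"     # reclaim kicks in above 80% of the limit
echo $((LIMIT * 2 / 10)) > "$CG/memory.swap.max" # cap swap usage at 20% of the limit
```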
Okay, a lot of stuff, a lot of questions, but time to wrap this up. What to swap: being more precise helps you mitigate the effects and isolate who will actually suffer from the swapping and who will be mostly unaffected. How much to swap: working set size estimation is a good way to understand that amount; memtierd offers something for it, and there is also MGLRU in the Linux kernel, which is good for this too. We don't have an integration with it, but it is something to look at if you are interested. Where to swap: compressing into RAM is actually a pretty interesting scenario, possibly even hardware accelerated. When to swap: we can wait for memory pressure at the OS level or at the container level, or we can swap beforehand, trying to avoid memory pressure and keep the node's behavior predictable in that way. Isolating the effects of swapping is something you can do by limiting swap-in and swap-out bandwidth with block device controls and with your own daemon's controls. And for the control logic, I would suggest looking at the NRI plugins and memtierd projects.

Just a word on the dependencies: you need recent container runtime versions, but NRI has been in both containerd and CRI-O for a while now, so it is mature and good to use. Idle page tracking, however, is not enabled in every kernel by default. Ubuntu LTS releases have it, but some others don't, so there are a couple of kernel configuration options that need to be enabled to get idle page tracking working.
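If you want to check whether your kernel has it, a minimal sketch (assuming your distribution ships its kernel config under /boot):

```
# Check whether the running kernel was built with idle page tracking.
grep CONFIG_IDLE_PAGE_TRACKING "/boot/config-$(uname -r)"
# The interface itself: this file exists only when the feature is enabled.
ls -l /sys/kernel/mm/page_idle/bitmap
```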
If there is room for improvement here, it is at least this annotation. It is nice for hacking, but if you would really like to schedule your pods to nodes that actually support some class, then the annotation is definitely not enough. There is a good KEP, QoS-class resources, that would solve this problem, and many other problems as well. We hope to get that, and if you see it as valuable, we would be glad to see your support there too. Thank you so much for your attention.

Questions?

Q: (testing the microphone) Okay, it works. First, thank you for the talk. When we have an application that needs to access memory that has been swapped out, and we limit the bandwidth it can use to bring that memory back, it will affect latency to users, possibly breaking SLAs and so on.

A: Yes, that is the thing. I guess that is why swapping is disabled by default: to keep the SLAs, to keep the behavior predictable. But when you try to push even more workloads onto your node, it boils down to the selection of what to swap, and how much to swap, and all these things, in order to still be able to keep the SLA.

Q: Which means that if I want to know how much I can swap, I should also be able to evaluate how much bandwidth I need for swapping, right? Am I able to track drastic changes in memory access, especially when it comes to bandwidth?

A: I think that, in the end, these kinds of things need measuring. What the good values are is very workload dependent, and for the bandwidth it also depends on the device you are swapping to. If you are compressing into memory, you possibly don't need much bandwidth limiting at all, but when you go to slow devices, it really makes a difference. The real good values depend on your hardware environment and on whatever is running there on your node.

Q: Thank you for a great presentation. My question relates to the Kubernetes scheduler. Could you share your thoughts about integrating information from memtierd into the scheduling process? For example, we may know a lot about how an application uses memory. We have stats showing, say, that this application over-requests memory and that it is okay for it to use swap, and we could reschedule our pods using this information to mitigate the risk of high latencies.

A: That is a very interesting point, thanks for bringing it up. I haven't actually looked into the scheduler side much, so I didn't even know there is such a thing. Being able to provide more metrics, more information about memory usage, to the scheduler would be very interesting, and I think it would make this more usable. It is a bit hackish and at an early stage of development at the moment, so it would be nice to provide users, and Kubernetes, a way to get access to that data.

Okay, you can go home now, the conference is over.