Hello, everyone. Thank you for coming to my session. I am Dixita, alias Dixie, and I work for Google. My co-speaker, who unfortunately couldn't make it to the conference for personal reasons, is Antti Kervinen, and he works for Intel. This talk is about advanced memory management techniques, Memory QoS, and some of the NRI plugins that you can use for managing memory.

What is Memory QoS? The Memory QoS feature is currently in alpha; it has been alpha since 1.27. With cgroup v2 exposing new memory controllers like memory.high and memory.min, we thought we could configure these values to provide better memory guarantees. Basically, we wanted to map memory.min to the requested memory, and to use memory.high for throttling workloads so that the kernel has enough chances to reclaim memory before the containers reach their maximum limit. All of this sounded very promising in theory, but when we actually performed the tests, we figured out that our comprehension of memory.high did not align with the recommended guidelines.

Before we delve into what went wrong, and into the key takeaways and our learnings from working on the Memory QoS feature, I would like to do a quick refresher on cgroups and on what happens when you create a pod from a pod spec in Kubernetes.

So what are cgroups? cgroups are basically a mechanism that allows processes to be organized hierarchically and resources to be distributed across that hierarchy in a controlled and configurable manner. cgroups have two components: the core and the controllers. The core takes responsibility for organizing the processes hierarchically, and the controllers take responsibility for distributing resources across those processes in a controlled manner.

This is what a general Kubernetes node with cgroup v2 configured would look like. It will have three slices: kubepods.slice, system.slice, and user.slice.
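To make that layout concrete, here is a rough sketch of the cgroup v2 hierarchy on such a node, assuming the systemd cgroup driver and containerd; exact names vary by distro and runtime, so treat this as illustrative:

```
/sys/fs/cgroup/
|-- system.slice/                        # system services: containerd, kubelet, ...
|-- user.slice/                          # user sessions
`-- kubepods.slice/                      # all Kubernetes workloads
    |-- kubepods-pod<uid>.slice/         # Guaranteed pods sit directly here
    |-- kubepods-burstable.slice/
    |   `-- kubepods-burstable-pod<uid>.slice/
    |       `-- cri-containerd-<id>.scope/   # per-container cgroup, created by the runtime
    `-- kubepods-besteffort.slice/
        `-- kubepods-besteffort-pod<uid>.slice/
```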
The system slice holds everything corresponding to system services like containerd, the kubelet, and so on, and kubepods.slice in turn contains the cgroup files, basically a slice for each QoS class that Kubernetes supports. When you create a pod, the kubelet takes care of creating the cgroup for the pod, and it places those files inside the directory of the QoS class to which the pod belongs: best-effort, guaranteed, or burstable. The path of the pod cgroup is then passed to the container runtime, which takes care of creating the cgroups for the containers and also manages them.

So this is what the workflow looks like when you create a pod. The pod spec is passed to the kubelet, which takes some of the resource-related parameters from it and passes them on to the container runtime over the CRI gRPC protocol. The container runtime passes that information to the underlying container runtime shim, which in turn passes it to the underlying low-level container runtime like runc, which is actually responsible for creating the containers and configuring their resource values.

Say this is what your pod spec looks like, with requests and limits configured. The scheduler uses the CPU request from it to find a node that can fit and run your workload. That information is passed from the kubelet to the container runtime, and the container runtime maps the requested CPU to the cpu.weight cgroup parameter. If you also have limits specified, the scheduler completely ignores the CPU limit, but the container runtime maps it to cpu.max.

Moving on to the memory request and limit: the Kubernetes scheduler uses the memory request, just like the CPU request, to find a node that can run your workload, but here the container runtime ignores the requested memory value. So we saw that there might be an opportunity to provide better memory guarantees with cgroup v2 controllers like memory.min, if we were able to map the requested memory to it. As for the memory limit, the scheduler ignores it, and the container runtime maps it to memory.max.

Now, moving on to the particular cgroup v2 memory knobs that we explored in Memory QoS. First, memory.max, which is similar to memory.limit_in_bytes in cgroup v1. This can be used to specify the maximum amount of memory beyond which you want the OOM killer to be triggered: when the usage goes beyond this level, the OOM killer is invoked.

The new controller, memory.min, which we explored as part of Memory QoS, is what we wanted to map the requested memory to, to make sure that whatever memory is requested in the pod is always available for the pod to run. The kernel docs say that this is hard memory protection: if the usage is below the memory.min value, the memory won't be reclaimed under any conditions.

The next controller that we explored as part of Memory QoS is memory.high. As per the kernel documentation, if you set memory.high for a process and its memory usage reaches memory.high, the memory will be throttled beyond that point. So we thought this could be helpful: it would give the kernel enough chances to reclaim memory before the workload eventually gets OOM killed.
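As a quick illustration of these knobs, on a cgroup v2 node you can read them straight out of the pod's cgroup directory. The path below is a hypothetical burstable pod under the systemd driver; substitute your own pod UID:

```bash
POD=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice

cat $POD/memory.max     # hard limit; going beyond it triggers the OOM killer
cat $POD/memory.min     # hard protection; usage below this is never reclaimed
cat $POD/memory.high    # throttling threshold; reclaim kicks in above this
cat $POD/memory.events  # counters for events like "high" and "oom_kill"
```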
So this sounded like a happy picture. We had found the controllers that could help us guarantee better memory. We also explored memory.events. This is just a readable file that keeps track of how many times an OOM kill was triggered and how many times memory.high was reached, that is, how many times the memory was throttled.

Like I said, this sounded like a happy picture, but it was only when we started experimenting with these values that we found out our interpretation of these controllers was wrong.

So what are the possible side effects of setting memory.min in our use case? If there are processes that are actually using less memory than the requested memory, setting memory.min would reserve extra memory that is not usable by other processes. This can lead to more OOM kills, because that memory becomes unreclaimable in each cgroup: when the system is under memory pressure, the amount of memory reserved in each cgroup cannot be reclaimed, and the OOM killer is invoked.

I want to walk through a hypothetical example to explain that. Assume that this node doesn't require any memory reservation to run system or Kubernetes workloads. Say the node has 1 GB of memory, and there is a container one that requests 500 MB of memory and whose usage stays below that requested value, say at 300 MB. Then there is another container which also requests 500 MB, and at a certain point its usage bumps up from 500 MB to 600 MB. At this point container one is using 300 MB and container two is using 600 MB, which is still less than 1 GB. But the OOM killer gets invoked, because container one reserves the extra 200 MB of memory that it is not even using, and so there is simply no memory available for container two to reach 600 MB of usage.

So, at least for our use case, my takeaway is that it might be a little aggressive to set memory.min, and it could lead to more OOM kills if the user is not responsible in setting the requested memory value, which is more or less the likely case.
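If you want to poke at this protection behavior outside Kubernetes, a minimal sketch with raw cgroup v2, run as root on a host with about 1 GB of memory, might look like this; the cgroup names are made up, and the comments restate the scenario from the slide:

```bash
# Two sibling cgroups mirroring container one and container two.
mkdir /sys/fs/cgroup/c1 /sys/fs/cgroup/c2
echo 500M > /sys/fs/cgroup/c1/memory.min   # protect c1's requested 500 MB
echo 500M > /sys/fs/cgroup/c2/memory.min   # protect c2's requested 500 MB

# c1 uses only 300 MB; in the scenario above, its protected memory cannot
# be reclaimed, so when c2 grows toward 600 MB the kernel has nothing left
# to reclaim and invokes the OOM killer instead.
grep oom_kill /sys/fs/cgroup/c2/memory.events
```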
Now, what are the possible side effects of setting memory.high? Like I said, memory.high can be used to throttle the memory once the usage reaches that level, and beyond that point the kernel triggers reclaim. What we observed was that if there is a process that consumes all the memory that was recovered during the reclaim, it can just stay in a live lock: the process consumes memory, the usage reaches memory.high, reclaim is triggered, the process consumes the recovered memory again, and it stays in that loop. The process can be stuck indefinitely in this state.

This is when we reached out to some kernel folks and got their feedback, to understand what memory.high can really be used for. They recommended that memory.high be used in a feedback loop, where an external process tries to alleviate the heavy memory reclaim pressure that kicks in when memory.high levels are reached. So it didn't fit our use case either, and we unfortunately had to stall the effort to promote the feature to beta.

I want to give a demo of how that live-lock scenario looks.
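Since the slide itself isn't visible in a transcript, here is a hedged reconstruction of the demo pod spec from its description: a Python loop allocating roughly 1 MB per iteration and sleeping 3 seconds, with a 10 Mi request and a 20 Mi limit. The pod name and image are assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-high-demo        # hypothetical name
spec:
  containers:
  - name: allocator
    image: python:3.11-alpine   # any Python image works
    command: ["python", "-c"]
    args:
    - |
      import time
      chunks = []
      while True:
          chunks.append(bytearray(1024 * 1024))   # hold on to ~1 MB more
          print(f"allocated {len(chunks)} MB", flush=True)
          time.sleep(3)
    resources:
      requests:
        memory: 10Mi   # with Memory QoS enabled, memory.high lands near 19 MB
      limits:
        memory: 20Mi   # mapped to memory.max
```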
In the pod spec, there is a Python command line that runs a script. It tries to allocate almost one megabyte of memory in every run, and then sleeps for three seconds. The request and the limit are set: the requested memory is 10 megabytes and the limit is 20 megabytes. I go ahead and apply the pod spec, and the pod is created. We get the pod UID from it, because that's what we can use to get the path of the respective pod cgroup. So I get the pod UID here, and I continue to watch the pod to see what happens. In the other tab, I cat the memory.high file of that cgroup, to see the value this demo has configured. You can see the memory.high value that is set is almost 19 megabytes: the request is 10, the limit is 20, and where we want to throttle is when the usage reaches 19 megabytes. And there is that other file we spoke about, memory.events, which keeps track of the number of times the different events were triggered.

Now I'm getting the particular container and watching how its stats look. Right now the usage is just 10 megabytes, which is fine, and it continues to make progress. I'm also printing the logs from the container in the other tab. It says that it's still progressing, it's allocating memory, and it continues. But when the memory reaches the throttling limit, which was 19, let's see what happens. I'm opening the memory.events file, which keeps track of the number of times the memory was throttled. You can see the memory is getting throttled, and in this tab, after a point of time, the workload is stuck. Nothing happens. It just continues to get throttled, you get no signal from it, and it basically takes forever.

So my key takeaway from this was that it's just better to be OOM killed than to be throttled forever, waiting without any signal and without knowing what to do. Another takeaway is that if we have a KEP that requires very deep expertise in some area, and this particular KEP involved very deep knowledge of the kernel, it might be better to involve the subject matter experts in the KEP approval process early on, so that we have insight into whether the design is sound.

So that is all the bad that happened with this particular feature. But we realized that even though memory.high could not be used to throttle the memory in a way that fits our case, it can still be used for other use cases, if you have an external process.

For one use case, you can still throttle the workloads and have a liveness probe act as the external process: when memory.high is reached, the probe can take care of pulling the process out of the live-lock scenario, and it would just restart the process rather than invoking the OOM killer. The next use case is to vertically scale the pods, using memory.high as a signal: when the usage reaches memory.high, throttle the workloads for some time, have some threshold values there, and another process can then scale the pod as needed. The third use case, which we are also going to dive deeper into, is that memory.high can be used as a signal to swap out the container memory: say the usage reaches memory.high, that can trigger an external process to swap out some memory, and then the workload can make progress. Today Kubernetes does not have full-fledged support for swap, but that is still work in progress.
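Going back to the first use case: assuming the container can see its own cgroup files (the default private cgroup namespace on cgroup v2), a liveness probe could fail once the "high" counter in memory.events keeps climbing, so the kubelet restarts the container instead of leaving it live-locked. This is a rough sketch, and the threshold is made up:

```yaml
livenessProbe:
  exec:
    command:
    - sh
    - -c
    # Fail the probe once the container has been throttled at memory.high
    # more than ~1000 times, i.e. it is likely stuck in the reclaim loop.
    - |
      high=$(awk '/^high/ {print $2}' /sys/fs/cgroup/memory.events)
      [ "${high:-0}" -lt 1000 ]
  periodSeconds: 10
  failureThreshold: 3
```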
So for the demo, I have a plugin, which is a little advanced. It's called the memory-qos NRI plugin, and it can be used to swap out the memory. What are NRI plugins? An NRI plugin is basically a Kubernetes workload that intercepts the communication between containerd and runc, and then you can make whatever sort of configuration modifications you want. For our particular use case, we are using the NRI memory-qos plugin. It lets you specify memory.high and memory.swap.max in the pod annotations, and once you have those values you can say that, okay, after the process has used, say, 90 percent of its memory, we want to swap out 20 percent of it. All of that is configurable using this plugin. My co-speaker has worked on this plugin and has a demo using it to swap out memory, so I'm going to play the recording.

Now that you know how NRI plugins hook into this stack, let's take a look at the NRI memory-qos plugin in particular. You can find this plugin in the nri-plugins project, and you can easily install it using Helm. The plugin can be configured with a ConfigMap, where you can introduce memory QoS classes. You define class names, like silver and bronze here, and for each class you define a swap limit ratio for its workloads. The swap limit ratio is basically the percentage of the data that should be swapped out when the memory consumption of the workload gets close to its limit. In the silver case we say that 20 percent of the memory should be swapped out when the limit is approached, and for bronze workloads we want to swap out half of the memory when getting close to the memory limit.

In addition to these classes, some of the unified fields in cgroup v2 can be modified directly from workload specifications: with this configuration we allow direct values to be written to memory.swap.max and memory.high in cgroups. And how do we then convey this information in workloads? Here is an example of a workload where we define, in pod annotations, that by default all containers in this pod belong to the silver class, and that for container b specifically the memory QoS class is actually bronze. Finally, we specify that container a should have these exact values in memory.swap.max and memory.high: memory.swap.max should be zero, that is, we do not allow any of container a's memory to be swapped out, and memory.high should be max, that is, there is no high memory threshold for this container.
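A sketch of what that configuration and those annotations might look like; I'm reproducing the key names from my reading of the nri-plugins documentation, so verify them against the project's README before relying on them:

```yaml
# Plugin configuration (abridged): memory QoS classes and their swap ratios.
classes:
- name: silver
  swaplimitratio: 0.2      # swap out 20% as usage approaches the limit
- name: bronze
  swaplimitratio: 0.5      # swap out 50%
unifiedannotations:        # cgroup v2 fields workloads may set directly
- memory.swap.max
- memory.high
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-annotations-demo
  annotations:
    class.memory-qos.nri.io: silver                      # default for all containers
    class.memory-qos.nri.io/container.cb: bronze         # override for container cb
    memory.swap.max.memory-qos.nri.io/container.ca: "0"  # ca is never swapped
    memory.high.memory-qos.nri.io/container.ca: "max"    # ca has no high threshold
spec:
  containers:
  - name: ca
    image: busybox
    command: ["sleep", "3600"]
  - name: cb
    image: busybox
    command: ["sleep", "3600"]
```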
Okay, let's see how it looks in practice. In this cluster I now have the NRI memory-qos plugin running, but there are no workloads yet. Next I'm creating a test pod, and now that it is created, let's check that it has started up successfully. There it is. Let's take a look at the workload specification. Here we can see some annotations: all the containers in this pod belong to the silver memory QoS class by default; there's an exception for container c0-low-prio, which belongs to the bronze class; and there's a container c2-no-swap, for which we say that it has no memory.high threshold and should never be swapped. And here are the definitions of the containers, so that you can see what they are actually doing: each container runs dd with 80 megabytes of memory allocated for data from /dev/zero, and a memory limit of 100 megabytes.

The system now shows that 40 megabytes of swap have been used. If we look at how the memory of each dd process is doing, by reading VmSize and VmSwap from /proc/<pid>/status, we find one dd process that hasn't been swapped at all. Then there is a dd process where roughly five megabytes have been swapped, meaning roughly 80 megabytes of its data is kept in RAM. That sounds like the silver-class dd, because it can keep 80 percent of its data in RAM, and 20 percent should be swapped when it reaches the limit, which is 100 megabytes. Finally, there's one dd where roughly 35 megabytes of the data have been swapped out, meaning roughly 45 megabytes are kept in RAM. That sounds like the bronze-class workload, because, as you probably remember, in the NRI memory-qos configuration we defined the swap limit ratio of bronze-class workloads as 50 percent: at most 50 megabytes in RAM.

So far we have looked at how cgroup controls, that is memory.high and memory.swap.max, can be used from NRI plugins. Now for a slightly more advanced look at how memory can be managed in more detail. For this example I'm using memtierd. It is a user-space daemon that is able to watch, track, swap out, and move the memory of other processes running on Linux. In short, memtierd keeps watching for new processes created inside a cgroup and then tracks the memory usage of those processes, meaning which parts of the memory are active and which parts are idle. This happens with standard Linux kernel trackers like idle page tracking, soft-dirty, and DAMON. Based on the tracker information and the policy configuration, the memtierd policy can then make decisions: move data to another NUMA node, or swap out data, which is then handled by a mover at whatever bandwidth you wish.

Now the question is how to integrate this kind of user-space daemon into Kubernetes nodes, and for that purpose we have also written the NRI memtierd plugin. It registers to NRI in the container runtime and starts listening to StartContainer messages. Such a message means that CreateContainer has already been called successfully and there is a container in the system, meaning the cgroup directories and such are already in place, and the kubelet is now telling the runtime to actually launch a process there. When the NRI memtierd plugin gets the message that we are starting a new container, it creates a sub-process, a memtierd process, which starts watching the cgroup directory where the processes of the new workload are being created. Based on the data it gets from the tracker and from the configuration, it then decides which memory of that workload should be swapped out or moved.

You can find the NRI memtierd plugin in the nri-plugins project too, and install it using Helm, similarly to the NRI memory-qos plugin. A quick look at the configuration, on both the NRI plugin side and the workload side: this is again the same pattern that we used earlier. We can define classes for workloads, for instance a swap-idle-data class, and give it a memtierd configuration, which you can find on GitHub. That configuration basically says that if a container is not using the memory it has, then memtierd is free to swap it out, even without any memory pressure, which frees up more memory for the system.
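To give a feel for that pattern, here is a loose sketch of such a class definition. I'm approximating the schema from memory, so treat every key as an assumption and consult the nri-plugins and memtierd repositories for the real format:

```yaml
# NRI memtierd plugin: tie a QoS class to a memtierd policy configuration.
classes:
- name: swap-idle-data
  memtierdconfig: |
    policy:
      name: age              # age policy: act on pages that stay idle
      config: |
        intervalms: 10000    # how often to scan
        swapoutms: 10000     # pages idle at least this long may be swapped out
        tracker:
          name: idlepage     # one of the standard kernel trackers
        mover:
          intervalms: 1000
          bandwidth: 50      # MB/s budget for moving/swapping pages
```

A workload would then opt in with an annotation along the lines of `class.memtierd.nri.io: swap-idle-data` (again, an assumption to double-check against the docs).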
Okay, so here we are. We have now demonstrated managing memory both with cgroup controls and with this kind of custom user-space daemon. Now, how can this be improved, and how can you participate in this memory management work? I wish to raise two Kubernetes enhancement proposals that are important from this QoS and NRI plugins perspective.

First of all, in these examples we used pod annotations for QoS, saying that we have QoS classes: bronze, silver, swap-idle-data. But do these classes even exist? And on the other hand, how does the scheduler know which Kubernetes nodes actually offer this or that class, and how many, so to speak, gold and platinum workloads you can schedule on a node? All of this is already addressed in the QoS-class resources KEP, which actually makes QoS a first-class citizen in Kubernetes. Instead of annotations, with that KEP we would be able to say that there are QoS resources associated with a container, for instance a resource called memory-qos whose class is bronze.

The second KEP is about resource information. Currently, not all the information in the pod and container specifications is passed to the container runtime through the CRI protocol. This KEP changes that, so the kubelet no longer leaves out important information that could be useful to the container runtime, or to the NRI plugins that connect to it. And that's it. Thank you for your attention. I hope this presentation gave you some fresh new ideas.

There is one more slide; I hope I'm able to move to it. So, like we spoke about, there are design issues in KEP-2570, which is about Memory QoS. There are a couple of ideas that people came up with during the conference itself: for example, when memory.high is reached, the kubelet could act as the external process; it could reset memory.high at that point, and then eventually the process gets OOM killed. So I wanted to see whether anyone else has other ideas around this. You can reach out to SIG Node, and maybe we can redesign this KEP and make it work. Yeah, that's all. Thank you.