Hello, everyone. Thank you for coming to our session: cgroup v2 is coming soon to a cluster near you. Thank you for coming right after lunch. My name is David Porter. I'm from Google. I work on the GKE node team, and I work in upstream SIG Node. This is Mrunal Patel. Hey, folks. I'm Mrunal Patel. I work for Red Hat on container runtimes. I'm a maintainer of the OCI runtime spec, runc, and CRI-O, and I also work on SIG Node upstream. So first off, we're going to talk about resource management. What is resource management really in terms of Kubernetes? Here's a 10,000-foot overview. Clusters consist of nodes, and nodes have resources: CPUs, memory, disks, GPUs. Resource management is about managing these resources. Nodes advertise their availability to the Kubernetes scheduler. Typically you have some amount of memory on your node, say 32 GB. You want to reserve some for the system, like the kernel and the system services running natively on the node, and some for the kubelet and the container runtime. Say you take away 4 GB for each; you have 24 GB remaining, and that's what's advertised to the scheduler as allocatable. So here's an example of a pod that has resources set: it has requests and limits. The scheduler looks at requests when it schedules pods onto nodes; when it finds a node that can satisfy the request, it schedules the pod there. The limit is what a pod cannot exceed.
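As an illustration of the kind of pod being described, here's a minimal sketch of such a spec; the pod name, image, and values are made up for this example and not from the slides:

```bash
# Illustrative pod with requests and limits. The scheduler places the pod based
# on the requests; the limits become cgroup settings enforced on the node.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    resources:
      requests:
        cpu: 250m        # minimum CPU the scheduler reserves for this container
        memory: 64Mi
      limits:
        cpu: 500m        # ceiling enforced via cgroups
        memory: 128Mi
EOF
```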
So what are some of the requirements for resource management? Pods should not be able to hurt each other; they should stay within their limits. They should get consistent performance based on what they requested. We should be able to prevent infinite loops, fork bombs, memory leaks, and node lockups, and we should allocate the right amount of resources for pods. We also want to ensure that doing all this management doesn't itself consume a lot of resources: ultimately we want to run as many pods on a node as possible, so we don't want a lot of overhead from the system components. So how do we do this? We do it with something called cgroups in the Linux kernel. It's a way to group a set of processes hierarchically, and then you have a set of controllers, like CPU and memory, that allow you to put limits on those processes. Cgroups are controlled through a pseudo filesystem called the cgroupfs: you write to files to set limits, and you read other files to monitor and get statistics back, like how much memory a group is using, how many processes are running, and so on. Here's the history of cgroups. The first version was introduced by Google in the Linux kernel in 2006; it was initially called process containers. It didn't cover everything, and over time a lot of different folks contributed other controllers. Then v2 development started in 2016. One of the goals of v2 was to simplify things: features had been added organically, so the controllers weren't working well with each other and weren't unified. v2 is an attempt to simplify that and also provide more features and more stability. Fedora moved to v2 in 2019, Docker and runc grew support for it in 2020, and by 2021 most distros were enabling cgroup v2 by default. Also, v1 is considered legacy at this point, so kernel fixes in this area are mainly going into v2, and systemd is planning to remove v1 support by the end of 2023. So the big message here is that v2 is real and it's coming. If you aren't already testing it, you should be planning to, to make sure it works well for you and to give us feedback on any issues you find. Here's an overview: this slide shows that all the popular distributions used with Kubernetes today and all the container runtimes, the higher-level ones like containerd, CRI-O, and Docker and the lower-level ones like runc and crun, have support for cgroup v2. And in 1.25 we finally went GA with cgroup v2, so Kubernetes supports it now. So what's new in v2? The first thing is a single unified hierarchy; I'll go over that on the next slide. Then there are some additional improvements. I mentioned how the controllers were added one after the other; with v2, work has been done so they work well together. One example is page cache writeback: the memory and IO controllers work together to properly account which processes are charged for it. On the memory side, userland memory, TCP socket buffers, and kernel memory such as inodes and dentries are tracked together under the memory controller. Also on the memory side, we have way more knobs than before, so we have more control over when the kernel starts throttling your memory allocations instead of just hitting a limit and getting OOM killed. Then there's something new called pressure stall information, which lets us monitor how much CPU, memory, or IO pressure there is in a particular cgroup. And finally, there's better delegation support, which allows us to run rootless containers; for example, Podman uses this for its rootless support. So here's an example of the v1 versus v2 hierarchy. On the left you see cgroup v1: the CPU and memory controllers are mounted separately under /sys/fs/cgroup, and you have to go and add your process to each of them separately. It's more flexible, but it's clunky, and in practice most of the time you end up putting them under the same hierarchy anyway. On the right you see that with v2 there is a single hierarchy of cgroups, and all the settings for a particular cgroup live under a single directory. So that's the unified hierarchy.
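To make that layout comparison concrete, here's a small shell sketch of what you'd see on a node; the paths assume the default /sys/fs/cgroup mount point:

```bash
# cgroup v1: each controller has its own hierarchy under /sys/fs/cgroup,
# so a process has to be added to cpu/ and memory/ separately.
ls /sys/fs/cgroup              # on v1: cpu/ cpuacct/ memory/ pids/ ... (one dir per controller)

# cgroup v2: one unified hierarchy; all the knobs for a cgroup live in one directory.
stat -fc %T /sys/fs/cgroup     # prints "cgroup2fs" on a v2 node, "tmpfs" on a typical v1 node
ls /sys/fs/cgroup              # on v2: cgroup.controllers  cpu.max  memory.max  io.max ...
cat /sys/fs/cgroup/cgroup.controllers   # controllers available in the unified hierarchy
```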
And with that, I will hand it off to David to cover some more details here. Thanks, Mrunal, for explaining some of the new features in cgroup v2. So I'm going to talk a little bit about how the kubelet and Kubernetes actually make use of cgroups to provide resource management. The important thing to understand here is that there are two components that interface with cgroups on a node: the kubelet and the container runtime. The way it works is that the kubelet creates a cgroup for each pod: when you start a pod, the kubelet creates a cgroup to house all the containers within that pod. Next, the container runtime creates a cgroup for each container. The kubelet passes the path of the pod's cgroup to the container runtime, and the container runtime owns the container cgroups within each pod. The other thing is that the kubelet manages not just the pod cgroups but the whole cgroup hierarchy, including the QoS classes. Kubernetes has a concept of QoS class: there are guaranteed, burstable, and best-effort pods, and these map to different levels of the cgroup hierarchy. Depending on the QoS class of the pod, the pod's cgroup is placed under the corresponding top-level QoS cgroup. Now I want to talk a little bit about how we actually set these cgroup values: how do they get from your pod spec into the kernel and into the cgroupfs? It all starts with the pod spec. You create your pod, the kubelet observes it, and it has some requests and limits set. The next step is that the kubelet creates the pod cgroup. Then, for the containers, it talks to the container runtime over the CRI protocol, over gRPC. The CRI protocol has definitions for all of these different values, so we convert the values from the pod spec into the CRI representation. The next step is that the CRI implementation talks to the underlying container runtime; sometimes it's the same component, sometimes it's a separate sub-component, and it converts the CRI representation into a real container. We create an OCI JSON specification. OCI is the container standard, and the OCI spec defines this config.json file, which has explicit fields for memory and CPU resources, plus a new unified field that was added to set cgroup v2 properties directly. Once we have the OCI JSON spec, the next step is to pass it to a lower-level container runtime that can run OCI containers. This is usually runc, which is kind of the standard. Then, depending on something called the cgroup driver that's being used (I'll go into what that means in a bit), and usually if you're using cgroup v2 you should be using the systemd cgroup driver, the OCI runtime, runc here, will talk to systemd on the machine to create something called a systemd scope unit, so the container is managed by systemd. systemd understands all these properties around CPU, memory, and other resource controls. runc will also talk to the Linux kernel directly, so systemd and runc both talk to cgroup v2 in the kernel to finally set the actual values for each resource requirement. So that's how we get from the pod spec to cgroup v2.
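As a rough illustration of the OCI step in that pipeline, this is a sketch of what the resources section of a container's config.json might look like; the bundle path shown is typical for containerd but varies by runtime, and the numbers are purely illustrative:

```bash
# Peek at the resources section of a container's OCI config.json.
# <container-id> is a placeholder; the bundle path depends on your runtime.
jq '.linux.resources' /run/containerd/io.containerd.runtime.v2.task/k8s.io/<container-id>/config.json
# Roughly what you might see for the example pod above:
# {
#   "memory": { "limit": 134217728 },
#   "cpu":    { "shares": 256, "quota": 50000, "period": 100000 },
#   "unified": { "memory.high": "107374182" }   # raw cgroup v2 keys go in the "unified" map
# }
```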
So let's talk a little bit about which properties are actually set and what they do. The main resources we manage right now are CPU and memory. First, let's talk about the CPU request. When you set a CPU request, what are you actually doing? You're declaring the minimum amount of CPU your container needs. The scheduler looks at the CPU request, and before it schedules the pod to a node it checks that the node has that much CPU available. This gives you the guarantee that you'll always get your CPU request: even if the node is 100% busy and everything is using CPU, you'll always get the CPU you requested. The kubelet and the scheduler never overcommit on CPU requests. To actually implement the CPU request, we use a Linux kernel feature called CPU shares: it's cpu.shares in cgroup v1 and cpu.weight in cgroup v2, but the concept is the same. The idea is that each container gets some amount of CPU shares, which is an arbitrary unit, and when the kernel does CPU scheduling it sums up all the CPU shares in a given cgroup hierarchy and looks at the ratio of shares in one cgroup to another. That ratio is how much CPU one cgroup gets compared to another. In this simple example we have one CPU and three containers: one has 1,024 shares and the other two have 512 each, so the first one gets 50% of the CPU and the other two get 25% each. If you had more CPUs, this would be summed across all the CPUs on the system. So that's the CPU request. Now let's talk about CPU limits. CPU limits are implemented with something called CFS bandwidth control in the Linux kernel, and unlike CPU requests, the scheduler completely ignores CPU limits when scheduling; they are only enforced in cgroups. The way to think about CPU limits is that they are a ceiling for CPU: you can never use more CPU than the limit. In fact, if you try to use more CPU than your limit, you will be throttled by CFS bandwidth control in the kernel, and the important thing to know is that you'll be throttled even if there are spare CPU cycles available. The way this works is with two settings called CPU quota and CPU period; in cgroup v2 they're combined into cpu.max, but the same properties apply. The period is a unit of time, and the default everyone uses is 100 milliseconds. Then you have some amount of quota, which is the amount of CPU time you can use within each wall-clock period. So you have a 100 millisecond period and some amount of quota, and you can use that quota within each 100 milliseconds. If you use up all the quota before the 100 milliseconds are over, you get throttled and have to wait until the period expires; then in the next 100 milliseconds you can use that quota again. So it's kind of a bank that constantly gets refilled with quota: in each period you can use that quota, and if you use more than you're allowed, you get throttled. So that's the idea behind CPU limits.
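Here's a small sketch of how the request and limit show up in cgroup v2 files; the paths are placeholders, and the values correspond to a hypothetical 500m limit with the default 100ms period:

```bash
# The CPU request becomes cpu.weight (v2's replacement for v1 cpu.shares); the kernel
# divides CPU proportionally to the weights when the node is fully busy.
cat /sys/fs/cgroup/<pod-cgroup>/<container>.scope/cpu.weight

# cgroup v2 expresses the CFS quota and period as a single file, cpu.max: "<quota> <period>".
# A 500m CPU limit would come out roughly as 50ms of CPU time per 100ms period;
# the literal string "max" in the quota field means no limit.
echo "50000 100000" > /sys/fs/cgroup/<pod-cgroup>/<container>.scope/cpu.max

# cpu.stat reports how often the group ran out of quota and got throttled.
cat /sys/fs/cgroup/<pod-cgroup>/<container>.scope/cpu.stat
# usage_usec ... nr_periods ... nr_throttled ... throttled_usec ...
```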
So with CPU limits, there's something I want to address, which is what I call the CPU limits debate. If you look online, there's a debate going on. You'll see tweets like this that say, for example, never set a CPU limit, and other tweets asking, the debate's raging, should you set CPU limits, yes or no? There are articles saying you should keep using CPU limits in Kubernetes, and other articles saying, for the love of God, stop using CPU limits. Really confusing. So you might be asking yourself, all right, what's going on here? What should I be doing with CPU limits? Let me try to give you my take on it. The first thing, which I don't think there's any debate about, is to always set a CPU request. CPU requests are used for scheduling, and you need to set one to declare the minimum amount of CPU you need. If you don't set it, you become a best-effort QoS pod and you're basically not guaranteed any amount of CPU. So always set a CPU request; we can all agree on that. Now, about CPU limits. Like all things, I can't give you a definitive answer; there are trade-offs, so let me try to explain them. The main con of CPU limits is also the whole point of CPU limits: you can't use any spare CPU cycles. If there are spare CPU cycles on the node and you've set a CPU limit and hit that limit, you can't use that spare CPU, which effectively means you're throwing away unused CPU: you have CPU available, but your pods can't use it. And if your pod is bursty and suddenly gets more traffic or has more work to do, it won't be able to use those spare cycles. If you start to measure and graph this, you might see that you're introducing artificial throttling for your application, and your P99 latency, for example, may increase. So that's the con. But the pro of setting CPU limits is that you're not relying on those spare CPU cycles. If you're constantly relying on spare cycles and constantly hitting the throttling limit, that probably means you set your CPU request too low. The problem is that those spare cycles are unpredictable: there's no guarantee you get them. If other pods scheduled later use up a lot of CPU, you won't be able to use that CPU anymore, because somebody else is using it. So you're relying on unpredictable CPU, and that's the issue. If you do set a CPU limit, you get more predictable behavior: you can land in the guaranteed QoS tier, and you ensure that you're not depending on unpredictable CPU cycles that may not be there. The other scenario where limits are useful is a multi-tenant environment: if you have multiple teams scheduling pods on a node and you want to do some sort of chargeback and ensure one team can't use more than some amount of CPU, limits are useful there as well. So that's CPU. Memory is actually simpler to understand. The memory request is only used for scheduling; in cgroup v1 we don't set it in cgroups at all. In v2 that will change, and I'll talk about that in a second. For the memory limit, we have two knobs: memory.max in cgroup v2 and memory.limit_in_bytes in v1. It's very simple: the value comes from your pod spec, from your container, and if you go over that limit, you get OOM killed. Really simple. And the general recommendation is to set your memory request equal to your limit. The reason is that with CPU you can overcommit, because it's a compressible resource, but memory is not compressible: you can't really overcommit on memory. So the recommendation is to always set memory request equal to limit; that way you don't impact other pods on the system by using more memory than you requested. So the other item that I touched on earlier and want to explain a little more is cgroup drivers. This is a somewhat misunderstood concept, so I want to spend a moment on it.
As I mentioned earlier, there are two components that interact with cgroups on the system: the kubelet and the container runtime. The kubelet owns the pod cgroups, and the container runtime owns the container cgroups. When you interface with the cgroup subsystem, there are actually two APIs you can use. One is the cgroupfs driver, where you talk directly to the kernel and set values in the cgroup filesystem. The second option is the systemd driver, which means that instead of talking directly to the cgroup filesystem, you first talk to systemd. systemd has the concepts of slices and scopes, which are abstractions over cgroups, and systemd then goes ahead and sets the values in the cgroupfs for you. With cgroup v2, one of the expectations is that only one process manages cgroups at any given level, and since systemd owns that responsibility and is the default baked into pretty much every distro, we strongly recommend using the systemd cgroup driver on both the kubelet and the container runtime when you're using cgroup v2. This is something you need to configure in the kubelet and the container runtime, and the two have to match; we really do recommend using the systemd cgroup driver, as I mentioned.
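As a sketch of what that matching configuration can look like, here are the kubelet and containerd settings; the file paths are common defaults and may differ on your distro, and the values shown are the ones being recommended above:

```bash
# kubelet: the cgroup driver is set in the KubeletConfiguration file.
cat /var/lib/kubelet/config.yaml
# apiVersion: kubelet.config.k8s.io/v1beta1
# kind: KubeletConfiguration
# cgroupDriver: systemd

# containerd: tell the runc shim to let systemd manage the cgroups.
cat /etc/containerd/config.toml
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
#   SystemdCgroup = true
```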
The other item we want to talk about is monitoring. Cgroups give us resource management and throttling, but they also let us export metrics. The way this works is that there's a project called cAdvisor; I'm a maintainer of it, actually. cAdvisor is responsible for scraping those metrics from the cgroupfs and getting them to the kubelet, and then other systems pick up that information and export it to Prometheus; that's how you see it in your Grafana and other dashboards. The way it actually works is that the kubelet depends on cAdvisor as a library and links it in. We had to make some changes to cAdvisor to ensure it works with cgroup v2; that was done in v0.43, and it's included in the latest kubelets. There's also some other work I wanted to mention here: we're moving a lot of this metric collection away from cAdvisor and into the container runtime. That will make it uniform across different container runtimes and remove the dependency on cAdvisor for stats. That work is ongoing. The other big effort as part of graduating cgroup v2 to GA was testing it and making sure it works well. SIG Node has a whole bunch of tests that we run against the kubelet and the broader ecosystem, and as part of this we wanted to ensure that all the features work well with cgroup v2. So we added new test jobs in the open-source CI to run all the variants of the different tests on cgroup v2; you can see those highlighted here. There are conformance tests, serial node tests, cluster tests, all types of tests, and we're running these jobs continuously, including against the latest containerd, so we're getting coverage of containerd, runc, and the latest kubelet. We also worked with the community in general to gather feedback, to make sure the different container runtimes were working well and that cgroup v2 would be adopted in the broader community as part of this effort. So, a couple of things to be aware of as you start to migrate to cgroup v2 with Kubernetes. One thing you should do is use one of the latest Linux distros that enables cgroup v2 by default; as the slide earlier showed, basically every distro these days defaults to cgroup v2. You also need the kernel to be 5.8+, and most of those distros already ship even newer versions. You should use an up-to-date CRI runtime; the latest containerd and CRI-O both support cgroup v2. The other big thing I mentioned earlier: make sure you're using the systemd cgroup driver on both the kubelet and the container runtime; that's configuration you need to set. SIG Node really doesn't support using the cgroupfs driver with cgroup v2. It was commonly used on cgroup v1, but we don't support it for v2 and we don't actually test it, so please just don't use it. And then also, for hosted Kubernetes offerings, if you're using one like GKE, AKS, or EKS, you should work with your vendor to understand how they're adopting cgroup v2, and you can chat with me about how GKE is doing that.
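If it helps, here's a rough pre-migration sanity check along the lines of the checklist above; the commands assume shell access to the node, and exact paths may differ per distro:

```bash
uname -r                                # want a 5.8+ kernel, per the talk's recommendation
stat -fc %T /sys/fs/cgroup              # "cgroup2fs" means the distro booted in unified (v2) mode
containerd --version || crio --version  # confirm an up-to-date CRI runtime
grep cgroupDriver /var/lib/kubelet/config.yaml   # should say "systemd" (path may vary)
```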
The other thing you should be aware of is that this is a big change, so you should test your apps and make sure they work well with cgroup v2. From the work that we did, most applications don't really have cgroup dependencies; it's quite rare, but some applications do. The most common case is third-party monitoring and security agents. Those often scrape the cgroup filesystem directly to collect metrics and do other low-level things, so they may depend on cgroups, and because the cgroup v2 API has changed due to the unified hierarchy and some of the other things, they need to be upgraded. A lot of those vendors already have versions that support cgroup v2, but you have to make sure you're using those versions, so work with your vendor to understand which versions have cgroup v2 support and make sure you're using them. A couple of other things: some of the popular projects like cAdvisor, if you're running it as a standalone daemonset, should be upgraded to ensure it supports cgroup v2; the version is listed on the slide. There's another popular project called automaxprocs by Uber: if you're using Go, it automatically sets GOMAXPROCS based on the cgroup CPU settings, and it was also upgraded to support cgroup v2 in the version shown. The other thing to be aware of is that some language runtimes also depend on cgroups. Java, for example: when the JVM starts up, it looks at the cgroup filesystem to understand how much CPU and memory is available, so it uses cgroups for that. If you're using Java, make sure to upgrade to 11.0.16+ or JDK 15+; they backported the cgroup v2 support, and using those versions will ensure Java applications work well too. So that's the idea behind cgroup v2. Hopefully, once you adopt it, you'll get a lot of those new improvements in low-level accounting and resource management that we mentioned earlier, and hopefully your applications will work fine; that's the goal, you shouldn't see too many big changes. But the really cool thing about cgroup v2 is some of the opportunities it provides on top. We have many opportunities to improve resource management in general using cgroup v2, and I want to talk about that a little bit. One of the opportunities is to improve how we manage memory. This is a feature that's already alpha in Kubernetes, it relies on cgroup v2, and it's called Memory QoS, Memory Quality of Service. Going back: with cgroup v1, the problem is that the kernel really only gives us one knob for memory, the memory limit. You hit the memory limit, you get OOM killed, and that's the end of the story. But with cgroup v2, we have much more control over memory; we have four knobs, actually: min, low, high, and max, and the first three act as soft memory limits. In the bottom right here, we have a little diagram to explain how they work, but basically memory.min is a guarantee to the kernel: please never reclaim this amount of memory, this is the minimum amount I need. memory.low is best effort: under significant memory pressure the kernel may reclaim it, but it will usually try not to. memory.high is a limit, but not a hard one: as soon as you hit it, your application gets throttled and the kernel starts reclaiming memory, but you won't be OOM killed. And memory.max is just like the limit we were talking about earlier: if you go over it, you get OOM killed. So the idea here is that you're already setting a memory request in your pod spec, but we're not actually using that number in cgroups at all today. The first idea is to map the memory request to memory.min, which ensures a minimum amount of memory for your application. The other idea is that you're setting a memory limit, and we want the behavior that as you approach your memory limit you start to get throttled rather than OOM killed, so we set something for memory.high. The way we do that is to take your memory limit and multiply it by a throttling factor; the default is 0.8, and we set that as memory.high. So as you approach your memory limit you hit memory.high, get throttled, and the kernel tries to reclaim memory; if your usage keeps increasing, you hit memory.max and get OOM killed. The result is hopefully less frequent OOM kills and better behavior as you approach the memory limit.
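Here's a sketch of the resulting cgroup values, assuming a container with a 64Mi request, a 128Mi limit, and the default 0.8 throttling factor; the paths and numbers are illustrative:

```bash
# What the Memory QoS alpha feature would roughly write for that container:
cat /sys/fs/cgroup/<pod-cgroup>/<container>.scope/memory.min    # 67108864    <- memory request (64Mi)
cat /sys/fs/cgroup/<pod-cgroup>/<container>.scope/memory.high   # ~107374182  <- 0.8 * memory limit
cat /sys/fs/cgroup/<pod-cgroup>/<container>.scope/memory.max    # 134217728   <- memory limit (hard OOM)
```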
Some of the other work we wanted to mention is PSI, the pressure stall metrics. This is coming down the pipe and we want to integrate the kubelet with it. It will allow us to understand what resource shortages we have and improve eviction: we can detect resource shortages for CPU, memory, and IO, and this will improve node stability. The other thing we want to talk about is disk throttling. The kubelet has really good resource control for CPU, memory, and ephemeral storage, but disk, specifically disk IO, has been a resource we haven't accounted for. Cgroup v2 has a new IO controller that helps manage IO, and we want the ability to limit the IO of pods so that pods also get some amount of IO guarantees and can't impact the node. The last thing we want to talk about is oomd. systemd has this relatively new component called oomd, which is a user-space OOM killer that uses PSI metrics. The way it works right now is that the kubelet sets an OOM score adjustment and the kernel does the OOM killing, but the kernel has very little visibility into the pods that are running; it has no idea about Kubernetes, pod priority, what pods even are, anything like that. But if we can move OOM killing into user space, the kubelet can make these decisions, and the kubelet is a lot better informed: it can take into account things like pod QoS, pod priority, and so on. So we want to take OOM killing out of the kernel and put it where the kubelet can drive it, so the kubelet can do that OOM killing with a lot more information, and PSI metrics will help with that. So that's some of the future work we're planning to do; please join us in SIG Node if you're interested in any of these areas. Now I want to do a quick demo video around cgroup v2 and some of the concepts I covered, so let me make this full screen. The first thing we're going to do is create a cluster with cgroup v2. I'm using a cluster on GKE, and GKE has a node config feature where you can specify that you want a cluster with cgroup v2 enabled. So here we're specifying cgroup v2, and we're going to create a 1.25 cluster on GKE with that node config. We can see we have the cluster created here. The next step is we're going to have a little workload. Well, first we're going to examine the nodes that we have: I created a one-node cluster using the latest 1.25 build of Kubernetes, with Container-Optimized OS, which we use on GKE, running containerd on the 5.15 kernel, so this is the latest COS version. The next step is to deploy a workload. I have a very simple busybox workload; it doesn't do anything, it just sleeps. The important thing is that I'm specifying requests and limits for CPU and memory, and the limits are higher than the requests, so this is going to be a burstable pod. I take this pod and deploy it on my cluster, just your standard kubectl apply. Cool. Then I do a get pods and see it's running. So cool, the pod's running. The next step is I'm going to SSH into the node to examine what's actually going on there. All right, so we're on the node. The first thing I do is run the stat command against the cgroup filesystem, and you can see we get back cgroup2fs; this is the way to check that the node is actually using cgroup v2. So cgroup v2 is being used on this node as we specified. The next thing I want to do is show how the kubelet's pod cgroup hierarchy is set up, so I'm going to run the tree command on the kubelet's kubepods.slice.
And you can see how the kubelet is managing the different cgroups. At the top level we have kubepods.slice. The way this works is that we're using the systemd cgroup driver, so systemd has slices, and systemd creates the cgroups underneath. Below that we have a best-effort slice, a burstable slice, and the guaranteed pods; these are the different QoS classes. And within each QoS class we have a slice for each pod cgroup; each of these is one of the pod cgroups we're seeing here. We can go one level deeper to see what's inside the pod cgroups: within each pod cgroup there's a .scope unit. This is the cgroup that was created by the container runtime for the actual container; each container gets its own cgroup created underneath the pod-level cgroup, so you basically have one cgroup per container. Because we're using the systemd cgroup driver, we can also ask systemd about this. If we ask systemd for all the slices that exist (a systemd slice is basically analogous to a cgroup), it lists all the slices on the system, and those are all the pod cgroups the kubelet created. We can also ask systemd for all the scope units, which are created by the container runtime for each container; if we list units of type scope, you can see (I'm using containerd here) that there's a scope unit for every single container. Next, we deployed that busybox sleep workload earlier, so I want to see how its cgroup settings are set up. First I run ps and grab the PID that the sleep command is running under. So here's the PID. Then I use procfs to see which cgroup this process is in; here's the full path, and it ends with .scope, so that's the container cgroup. I'm going to save it into an environment variable called container_cgroup just so I can play with it, since it's a long path. Now we can look at how the actual resource settings are set, starting with CPU. The first thing is cpu.weight: this comes from your CPU request, and you can see it's set here, converted from the CPU request in your pod spec. Then there's cpu.max, which is the CFS quota and period set by the CPU limit. All of those settings are set by the container runtime. So that's CPU. For memory, I also want to demonstrate how this is set up, and I also enabled that Memory QoS feature I mentioned earlier that sets the soft memory limits. So we have memory.max set: that's the hard memory limit on the container, just like before. But we also have the soft memory limits set now: memory.high, which is computed, as described earlier, as the throttling factor times the memory limit and kicks in as you approach the limit, and memory.min, which comes from your memory request. So we're actually setting soft memory limits here as well. That's the idea: just to give you a sense of how cgroups are actually working on a real node.
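For reference, the node inspection in the demo roughly corresponds to commands like these; the cgroup paths, unit names, and the pgrep pattern are illustrative and will differ on a real node:

```bash
stat -fc %T /sys/fs/cgroup                         # cgroup2fs => the node is on cgroup v2
sudo tree -d -L 2 /sys/fs/cgroup/kubepods.slice    # QoS slices, then one slice per pod
systemctl list-units --type=slice | grep kubepods  # systemd's view of the pod cgroups
systemctl list-units --type=scope | grep containerd   # one scope per container (containerd naming)

PID=$(pgrep -f 'sleep' | head -1)                  # find the demo container's process
cat /proc/$PID/cgroup                              # full cgroup path, ending in ...scope
CG=/sys/fs/cgroup$(cut -d: -f3 /proc/$PID/cgroup)  # save the container cgroup path
cat $CG/cpu.weight $CG/cpu.max                     # CPU request and limit as set by the runtime
cat $CG/memory.min $CG/memory.high $CG/memory.max  # soft and hard memory limits
```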
Cool. So that's our presentation. I want to give a big thank you to everyone who worked on this; it was a big effort in SIG Node, so a big shout-out and thank you to everyone in SIG Node who helped work on it. I want to thank the container runtime community; container runtimes are super critical here. Shout-out to Giuseppe, who worked on this early on and really pioneered a lot of the early cgroup v2 work across the container runtime space, and to the containerd maintainers, the CRI-O maintainers, and Moby/Docker, who all helped a lot to get cgroup v2 support started. systemd is a critical element of cgroup v2 as well, so thanks to the systemd maintainers for adding cgroup v2 support and continuously iterating on it. And of course, none of this would be possible without the Linux kernel adding cgroup v2 and all the work that went into cgroup v2 in general. We have a couple of resources here. We GA'd cgroup v2 in 1.25, so there's a blog post you can read for more information. There are Kubernetes docs about cgroups and some of the details of cgroup drivers and so forth. And if you're interested in more, the kernel docs are a great resource, as are a couple of other KubeCon talks: there was another KubeCon talk earlier this week about cgroup v2, and there's another talk from 2020 by Giuseppe that also goes into more detail about cgroup v2. That's a good resource. So with that, thank you for coming to our presentation; we really hope you can start using cgroup v2, and please let us know any feedback. Thank you. With that soft memory limit, is there any way for something like a Java garbage collector to react to it, or get a push notification? What action can a programming language take to help mitigate it so that you don't end up hitting the hard limit? So actually in the cgroup system there's a file that you can listen on and get events as you're crossing these thresholds. So the application could open an FD on that file, watch for the notification, and react dynamically. I'm not aware of any language doing that right now, but that's a great thing to explore. Yeah, the language could integrate with that. Right now the main thing you'll get is that the kernel will reclaim memory from the JDK or from the application, and hopefully the JDK is aware of that and reacts appropriately. Yeah, so right now it's more static tuning: it looks at the values and decides what the GC settings are. What you're talking about is more dynamic: how do I react when I get a notification from the kernel that I crossed the low or min threshold, or I'm nearing the high one?
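The event file being referred to here is presumably memory.events in cgroup v2; a quick sketch of what it exposes, with illustrative counts:

```bash
# memory.events counts how often the cgroup crossed each memory threshold; an application
# or sidecar can watch this file (e.g. via poll/inotify) and react before memory.max is hit.
cat /sys/fs/cgroup/<pod-cgroup>/<container>.scope/memory.events
# low 0
# high 12        <- times usage crossed memory.high (throttling/reclaim kicked in)
# max 0
# oom 0
# oom_kill 0
```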
You talked about disk IO. What about network IO? Yeah, that's also definitely an interesting area. I think it's still pretty early, but we also want to isolate network IO. There's been some work in the community around that, and that's definitely an area we want to explore as well. I think Kubernetes has really good support for CPU and memory, and some of the other resources we definitely need to improve on, so that's something we want to explore for sure. On the networking side, the details have changed a bit: with v2, I think the expectation is that you attach an eBPF program, look up the cgroup it's associated with, and then do the throttling there. So it also depends on your SDN plugin provider and so on. And all the network TCP socket memory will actually be accounted under the main memory counter in cgroup v2, so the memory is accounted there, but accounting for actual network usage itself is something we still need to add. So I might have missed this conversation in SIG Node, but I know a lot of the stakeholders in SIG Node, like Google and Red Hat, use systemd heavily, and you mentioned that everyone was really pushing for the systemd cgroup driver. Does there exist a story for distributions that don't ship systemd? I think for those distributions, we see that the majority of distributions are using systemd, and we don't see as much usage elsewhere; that's why we concentrated on keeping things simple. But if folks are really interested in using the cgroupfs driver, we really encourage them to show up to SIG Node, raise their hands, and help get that support fully baked and working. I have a question about the soft memory limits. You mentioned before that if an application exceeds the soft memory limit, it gets throttled. Can you elaborate on what kinds of applications you're targeting with that? I mean, if the application isn't being scheduled, it's not going to release the memory, so in what situation is this best used? So what the kernel can do is, if you're not actively using some of your memory, flush it out to disk. That also interacts with swap, which is also an ongoing effort, so it tries to swap memory out if it can, and some memory can be reclaimed by the kernel anyway because it's shared or it's cached. Those are the kinds of things it tries to do. Yeah, file-backed pages, like executable pages, can be reclaimed even when swap is not enabled. But there's also an effort to make swap available, and then this will work better. And the oomd integration will also need swap, so that oomd has time to react to these changes and actually make decisions; otherwise, if things move too fast without swap, you end up getting OOM killed by the kernel OOM killer anyway. Incredibly exciting; over here. I noted that COS M97 should have cgroup v2 by default. We've noticed some situations where you're running, say, two pods in a COS environment and memory pressure against one can suffocate another. I'm guessing this being enabled might help in that situation? Yeah, so it kind of depends on the QoS class that you're using. If it's a burstable pod, there might not necessarily be a memory limit set, or it might only be set at the kubelet's top-level cgroup, and depending on that, memory pressure can impact two different pods. But if you do use a guaranteed pod, or you set a memory limit, then you should get full isolation of memory usage between two pods, and of course the Memory QoS feature will help with that as well through the soft memory limits. Yeah, a question here. I remember that with a single-threaded app like Python, where it can only take advantage of one CPU anyway, with cgroup v1, if I don't set a limit, the nature of the application automatically limits it to one CPU; but if I add a limit of, hey, limit yourself to one CPU, just the act of adding a limit can lower performance. And I was wondering if cgroup v2 helps in that scenario. So that's kind of dependent on the Python interpreter.
It kind of depends on whether it's hitting that CPU limit in the first place. I don't know if Python can actually be configured the way the JDK can, or Go with GOMAXPROCS, where you can tell it how much CPU it has to allocate underneath for you. But basically, if you set a CPU limit of one and the app always uses less than one, you shouldn't get throttled in the first place; if you set a lower CPU limit and it's always using more CPU than that, then you'll get throttled. So cgroup v2 will not necessarily help with that unless the interpreter actually integrates with it, reads those values, and tunes itself appropriately. And that's not really a cgroup v1 versus cgroup v2 thing; it's general cgroup behavior. Hi, thank you for the sharing. On your slide you mentioned there is a conversion that converts memory max to memory high, but that's a global conversion. Do you consider maybe supporting a per-pod conversion? Meaning, if I don't set it in my pod spec, fall back to the global conversion, but if I do set it in my pod spec, prefer that value. The reason I ask is that it actually caused an incident in our production when we rolled out cgroup v2: for most applications it was okay, but some applications don't want to be throttled, because they're very latency sensitive, and when we rolled it out to those applications their latency became very high. It took us at least an hour to figure out it was caused by cgroup v2, because at the beginning we didn't see this issue when we rolled cgroup v2 out to only one or two clusters. That's the reason I ask this question. So are you mostly focused on memory, then, like when the pod doesn't set memory requests and limits, or more on the CPU side? I didn't fully understand. Both? Okay. So I'll talk about memory. For memory, with the Memory QoS feature, one thing I didn't mention is that it doesn't just set the memory.min and memory.high settings at the pod level; it also sets them higher up, at the QoS level. It also looks at node allocatable and sets memory settings there, which ensures that even if you're not setting any memory requests or limits, if you approach node allocatable you still get that behavior at the top-level cgroup. For CPU, I don't think we set limit settings at any top level; we set CPU shares based on node allocatable at the top level. That ensures that if there's some amount of CPU available on the system, all the pods share those CPU shares at the top level, but we don't set any CPU limit settings there. Does that help answer your question at all, or maybe I didn't fully get it? So I have a question that's a bit related. You mentioned a 0.8 throttling factor. Is that configurable, in the pod spec or at the node level, or is it hard-coded? How would you go about configuring it? So I don't believe that's currently configurable. The feature's still in alpha, actually, so this is actually where it would be awesome to get your feedback: try it out and see if that works for you.
I think 0.8 was an estimate that we figured works for most folks, because we didn't want to ask people to also set a memory high value in the pod spec; it's additional info that we don't think is super useful. But if making it configurable would be useful, that would be great feedback. Is it possible to set which cores on the CPU are used? Are you asking if we can pin specific cores? Well, we have a use case where we want different NUMA nodes to be used; we don't want workloads to be scheduled on the same NUMA node, so we're specifying, say, cores 1, 3, 5 to be used for certain pods. Is that possible to set with cgroup v2 via a pod spec? So cgroup v2 also has cpusets, similar to v1, and that should work similarly. Can you set it through a pod spec? No, you can't specify which CPUs you want to use in the pod spec; you can only specify the number of CPUs, and then the CPU manager or something else will work with the container runtime to actually pick the specific CPUs that your pod ends up using. OK. Thank you for this. I'm curious about the oomd killer in user space. Is it possible to... I would love it if monitoring had a lower priority for being OOM killed, like the monitoring namespace, so my other workloads would have a higher probability of being OOM killed and my monitoring is kept up all the time. Do you know if that's something you're considering? I think that's what we're hoping: when we move to an oomd-like model, we can actually look at the priority set for your pod, or other signals like the QoS class, and make smarter decisions than the kernel OOM killer takes today. That's our goal, and we still have to do all that work, so you're welcome to join SIG Node and give input on your use cases. All right, folks, we're way over time, so maybe we can take one last question here. We have time. Yeah, I've noticed that the burstability of CPU seems to have an effect on the OOM kill score. If I have my request and limit for memory set equal, I would kind of expect the quality of service for the memory aspect in the OOM kill score to reflect that of guaranteed. Do you know if that's being addressed at all in v2, and do you know why it behaves like that today? Yeah, so you're totally correct. When you have a burstable pod, we look at the ratio between the container's memory request and the node's memory, and that's how we compute the OOM score adjustment that's set on the pod's containers. So it looks at the memory request ratio, not the rate, because the idea is we want a different OOM score for burstable pods compared to guaranteed pods. If I don't set the CPU limit equal to the request, then I get an unexpected OOM score. Okay, well, let's look at that; that sounds like a possible bug, so I might need to investigate it. Cool, that's it. Thank you so much. If you have any further questions, please come up. Thank you all.