Thank you for coming. My name is Michal, and this is Zbyszek. We both work at Red Hat on the Plumbers team, mostly on systemd, and today we will be talking about cgroups v2, the unified cgroup hierarchy.

This talk was prompted by recent developments. cgroups v2 has been in development for a long time, and people have wanted to switch to it for many years. At the end of last year it finally happened, at least in our corner of the Linux world: systemd switched its default to cgroups v2. By default I mean that the compile-time default setting is version 2, and it can be overridden both at compile time and at runtime. At the same time, Fedora decided to drive the switch to cgroups v2, and since Fedora 31 we are running on version 2. Version 2 is just nicer, so let's talk about it.

We didn't really know whether the whole subject of control groups would be familiar to everyone. So, who in the room doesn't know what cgroups are at all? Okay, just one hand, so everybody here is a cgroups expert, which is very cool. Still, it's probably better to introduce cgroups a little bit. Linux cgroups are a kernel subsystem with two main purposes: process tracking and resource distribution.

A bit of terminology. A cgroup as a unit is an internal kernel concept: it associates a set of tasks with a set of parameters for one or more controllers. A controller is an entity that schedules a particular resource, for example the CPU controller or the memory controller. Cgroups are arranged in a hierarchy, which is a tree of cgroups, and every task, every process on the system, is in exactly one cgroup.

One more note: the interface exposed to you as a system administrator or user of the system is a file system. The kernel presents cgroups to you as a file system that you mount somewhere, usually at /sys/fs/cgroup, which is how systemd sets it up at boot. If you do ls /sys/fs/cgroup you will see all the cgroups that you have, or you can use a nifty tool called systemd-cgls, which prints the hierarchy, the tree structure of the cgroups as they are arranged on the system (there is a small example below).

Now the history. It all started quite a while ago with something called "task control groups", which is not very nice to pronounce, so now we say cgroups. This was kernel 2.6.24. The development of cgroups and cgroup controllers was rather rapid: there were many ideas, and many controllers were implemented. It pretty quickly became clear that the original design needed to be reworked, and in 2012 this rework, version 2, was announced. The initial work was pretty quick, and I remember that in 2013, when people were talking about this, it seemed like we would switch to version 2 any day. Well, it didn't really happen: there was always the next thing that was missing, and without it we couldn't move forward. The major blocker in the beginning was the fact that in version 1 we have a tree of threads, while in version 2 we have a tree of processes, with some details. People couldn't agree whether this is a good thing or a bad thing: it simplifies things, but some people really want to schedule resources at the thread level.
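To make the file-system interface concrete, here is a quick look (a sketch; paths as systemd sets them up on a cgroups v2 system, output abbreviated):

    # List the top-level cgroups that systemd created
    ls /sys/fs/cgroup
    # -> cgroup.controllers  init.scope  system.slice  user.slice  ...

    # Every process is in exactly one cgroup; show ours
    cat /proc/self/cgroup
    # -> 0::/user.slice/user-1000.slice/session-2.scope

    # Pretty-print the whole tree
    systemd-cgls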
This meant that the CPU controller, which interacts with the scheduler, couldn't be merged. The discussion continued, and finally cgroups v2 kind of gave in and added support for threads, so the CPU controller was merged. Then another thing popped up, and so on. Finally, in late 2019 and early 2020, we were almost there.

So what drove the rework? This is a slide from the control groups maintainer, Tejun Heo, from ten years ago: "a random walk". It is a nice summary of how version 1 was developed. Different people had different ideas, and there was no single person or group watching the overall design, so there was no design. People implemented what was possible to implement at the time, so the interfaces we got usually reflect the kernel naming and the kernel-internal details specific to any given subsystem, and then there were some design changes on top. People didn't yet know how this would be used, so the design allows for nearly infinite flexibility: in particular, you can do resource management of multiple resources in completely orthogonal ways. That is nice if you have some very specific use case, but it also means that understanding it all is much more complicated than it has to be.

We talk about the hierarchy of cgroups all the time, but in version 1, while there is a hierarchy of groups, a controller does not have to be hierarchical. Various controllers treat the tree of groups essentially as a random set of groups, and you can have a controller that assigns more resources to the children of a given node than to the node itself. The resource limits we had were often not particularly useful. And because of the design with separate hierarchies, there is no possible cooperation between controllers. An example: we have a controller that limits the amount of memory. If we set a limit on the amount of memory, sometimes things need to be swapped out, and swapping things out uses IO, so the memory limit and the IO limit are tied together. But if those two controllers live in separate hierarchies, there is no way for us to tie them together: the IO caused by memory pressure in one hierarchy cannot be charged to the right processes in the other hierarchy. And if, while doing that IO, we use CPU, for example for compression or some other operation, that too is a resource consumed by the first group, and with separate hierarchies we cannot tie that together either.

The thread question was a very divisive issue, but generally it is better for users to operate on processes, because for most resources, like memory, the split into individual threads is not interesting. And, very importantly, the lack of hierarchical behavior and some implementation details meant that delegation was not secure; it was simply not possible. We'll talk about delegation later.

To show you an example, this is the memory controller in cgroups version 1. We had a limit. Then people realized it would actually be good to account for kernel memory, so a second limit was added; the first one couldn't really be changed, because of backwards compatibility. And then, well, what about TCP buffers? They weren't counted in either of those, so we got a third limit. And the process continues, right? And on top of that we have a bunch of knobs like move_charge_at_immigrate, swappiness, and use_hierarchy.
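To make that accretion concrete, this is roughly what it looks like on a version 1 system (a sketch; it assumes a v1 memory controller mount and an existing group called "mygroup", both hypothetical here):

    # cgroups v1: three separate memory limits, accreted over time
    cd /sys/fs/cgroup/memory/mygroup
    echo 512M > memory.limit_in_bytes           # the original limit, user memory
    echo 64M  > memory.kmem.limit_in_bytes      # kernel memory, added later
    echo 16M  > memory.kmem.tcp.limit_in_bytes  # TCP buffers, added later still
    cat memory.use_hierarchy                    # one of the extra behavior knobs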
They probably do important things. And if we look at different controllers, "shares" and "weight" are the same concept in two different controllers; they just happen to be named differently and use slightly different ranges of numbers. So in version 2 the idea is to design the naming and this kind of interface convention up front, and every controller has to conform to it. It generally doesn't make any difference for the kernel, but it's much nicer to use.

So, version 2. We have a single hierarchy. Maybe let me compare two slides: this is version 1, and this is version 2. What is important, and maybe not obvious, is that in version 1 each controller uses the whole tree: first there is the controller, then there is the tree, and we cannot have a controller in some tree that goes only halfway down. So for every controller we essentially need a different tree if we want different depths. In version 2 we have one hierarchy, and controllers apply from the root down to a certain level. The reason we want to cut off at a certain level is not just to reduce complexity and configuration: accounting resources causes resource usage of its own, so we want to be able to disable some controllers partway through the tree to make the whole thing a bit faster.

The interface is more consistent, some of the controllers have been thrown out, and everything is supposed to be hierarchical. Resource accounting is not trivial, and over those ten years better ways to account resources have been worked out. In version 2 the idea is to expose high-level knobs without the little details that reflect the internal counter structure. And the limits are soft, in the sense that they do not cause the process to be killed immediately but slow it down if the limit is breached. We will return to this later.

To quickly summarize the state: most controllers have been ported one to one, and apart from some naming differences there is not that much change there. For CPU, in version 1 the accounting of CPU usage and the limits on CPU usage could in principle be two separate controllers, which does not really make much sense, so they were almost always mounted together; in version 2 there is just one controller for that. Oh, and the patches for the hugetlb controller have been merged to linux-next, so they are expected to land within two releases, I believe.

Another thing that was blocking the switch to version 2 was the lack of the freezer controller. The old freezer is a classic case of a controller that is not hierarchical: it just uses the tree as a set of groups of processes. It has been replaced by the cgroup.freeze attribute, which can be written on a cgroup and causes the cgroup and its children to be frozen. And a bunch of stuff that was not really controlling resources but doing filtering has been replaced by eBPF filters. The devices controller blocks processes from accessing certain devices, and the networking controllers would interact with iptables to change routing and priority of packets. As you can see, the pattern is generally to replace such things with an eBPF program that can be attached to the cgroup and does the same job in a more flexible way.
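Here is a minimal sketch of the version 2 interface just described (it assumes root privileges, a v2 mount at /sys/fs/cgroup, and a hypothetical group "demo"; on a systemd system you would normally let systemd manage the top of the tree):

    # Which controllers are available at the root of the hierarchy
    cat /sys/fs/cgroup/cgroup.controllers
    # -> cpuset cpu io memory pids ...

    # Enable the cpu and memory controllers for child cgroups
    echo "+cpu +memory" > /sys/fs/cgroup/cgroup.subtree_control

    # The freezer is now a per-cgroup attribute: freeze "demo" and
    # everything below it, then thaw it again
    mkdir /sys/fs/cgroup/demo
    echo 1 > /sys/fs/cgroup/demo/cgroup.freeze
    echo 0 > /sys/fs/cgroup/demo/cgroup.freeze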
So, delegation. Now we have a basic understanding of the major differences between v1 and v2: we know that v2 is much nicer, and also that v2 allows us to delegate to less privileged or unprivileged code in a controlled and secure way. Delegation is a concept in cgroups where a cgroup manager, usually systemd, gives up control of part of the cgroup hierarchy to a different process. We have the cgroup tree, and a subtree of it is controlled by a different entity than systemd: that could be, for example, libvirt or some container manager.

The way this works: as I mentioned, cgroups are exposed to you as a file system. I think we have this on the next slide. Yes. This is the cgroup tree, the output of systemd-cgls. The nodes in the tree correspond to directories in the cgroup file system, and processes run in those cgroups. For the purposes of delegation we can think of a cgroup as a directory, basically. We see here that this cgroup, which corresponds to this directory, was delegated to a process that runs as user zbyszek. User zbyszek now has access to these control knobs, cgroup.procs, cgroup.threads, and cgroup.subtree_control, and the directory itself has been chowned. So user zbyszek can create sub-cgroups in that directory, and we can subdivide the resources given to this parent cgroup any way we like: create further sub-cgroups, control them, and delegate resources further.

By the way, why are we actually speaking of delegation? We picked this topic because it comes up over and over. People who write system software, for example the libvirt people and people who write container managers, have to deal with this concept of delegation, and we think it is sometimes misunderstood. Especially the last point on the slide tends to be misunderstood quite often: the cutoff, where the control of systemd ends and where you can take control, is not at the directory level. As we've seen, some of the files here are still owned by root, and hence they are the territory of systemd. This is because if you were able to write to those files, the changes could have an effect on sibling cgroups at the same level of the cgroup tree. You are not allowed to do that; you can only subdivide the resources that this cgroup has.

And that's pretty much it when it comes to delegation. Hopefully, if you are writing some system software, it is now a bit clearer what delegation means and where the responsibility lies: the cgroups that systemd creates for you are still the territory of systemd, and you can create your own cgroups inside the delegated subtree and control those. It should be mentioned that this is fairly clear in the case where the delegatee is less privileged and runs as a different user, but it is quite usual for everything to still be owned by root, for example with root controlling the hierarchy, systemd at the top and libvirt as the delegatee, and then it is less obvious. The rules are the same, just not enforced. That was actually a bug we discovered in libvirt recently: libvirt was assuming that it had full control of its cgroup, which is not true; most of it is still systemd territory.
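Here is a minimal sketch of how a delegated subtree is requested from systemd (the unit name, user, and binary are all hypothetical; more on the Delegate= option in a moment):

    # demo-manager.service: a unit whose cgroup subtree is handed over
    # to the service itself (see systemd.resource-control(5))
    [Unit]
    Description=Container manager with a delegated cgroup subtree

    [Service]
    User=zbyszek
    # Delegate all supported controllers to the service...
    Delegate=yes
    # ...or, since systemd 236, only a chosen subset:
    #Delegate=cpu memory
    ExecStart=/usr/local/bin/my-manager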
Also, one thing to mention: you don't have to delegate all the controllers; you can delegate control of only certain controllers. You do delegation with systemd simply by setting the unit property Delegate=yes, but since systemd 236, I think, you can also say Delegate= with a list of controllers, and that basically means systemd will set up delegation but enable only those controllers in the delegated subtree. Delegation can also be nested. Think of a container running a systemd --user instance: that is an example of how we get into a state of nested delegation, where we have a delegated subtree and part of that subtree is again delegated to someone else. And resources are divided hierarchically.

Okay, so that was the concept of delegation. Now there is another bit that is a little weird, but I think we should talk about it, and that's the threaded mode for cgroups. What does it mean? As Zbyszek mentioned, v1 operated on threads and v2 operates on processes, except there is a big asterisk, and that's the threaded mode. There are certain controllers, like the cpu and cpuset controllers in cgroups v2, that the kernel documentation describes as threaded controllers. If you have the CPU controller enabled in a tree, you can turn some leaf cgroups into threaded cgroups, and you do that by echoing the string "threaded" into a control file called cgroup.type; that turns the cgroup into a threaded cgroup. What does that mean? It means you can then create sub-cgroups, which you also have to turn into threaded cgroups, and then you can hierarchically distribute CPU resources not to processes but to threads (there is a short sketch of this below). This was also a major blocker for merging the CPU controller for some time. So for the most part you work with processes, except sometimes you can work with threads, because there are real use cases: libvirt is a major user of this, and it manages QEMU vCPU threads, putting different vCPU threads into different threaded cgroups. It is worth mentioning that this is apparently the only user, at least according to codesearch.debian.net. So yes, this was a major blocker for a long time and there is only one user, so maybe it shouldn't have been so major in the first place, but this is what we have.

Okay, so we have delegation and the threaded mode, and now I will talk about one of the main reasons why v2 is so much nicer: the resource distribution models. In v2 we finally have sane documentation and a common understanding of how we distribute resources along the tree and what it actually means. In v1 the interface exposed a lot of internal details to you as an end user; there were a lot of knobs, and it was unclear what echoing some magical number into a certain file actually meant. This is much clearer with cgroups v2. We have weights, limits, protections, and allocations, and I will talk about each of them in turn.
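Before we go through those, here is the promised threaded-mode sketch (hypothetical paths; $CG stands for a delegated cgroup with the cpu controller enabled, and $TID for a thread of a process living in the threaded domain):

    # Turn a leaf cgroup into a threaded one
    mkdir "$CG/vcpus"
    echo threaded > "$CG/vcpus/cgroup.type"
    echo "+cpu" > "$CG/vcpus/cgroup.subtree_control"

    # Children must be marked threaded as well
    mkdir "$CG/vcpus/vcpu0"
    echo threaded > "$CG/vcpus/vcpu0/cgroup.type"

    # Place a single thread (not a whole process) and give it a CPU weight
    echo "$TID" > "$CG/vcpus/vcpu0/cgroup.threads"
    echo 200 > "$CG/vcpus/vcpu0/cpu.weight"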
So, weights. In v1 we had shares and weights; in v2 we finally have just one thing, weights. If you give some CPU weight to a certain cgroup, that cgroup gets a fraction of the resource proportional to the weight; that is basically what it means. Then we have limits. Limits are maybe easier to understand, because these are usually hard limits. memory.max is a good example: if you go above that hard limit, the OOM killer is invoked inside the cgroup, so the cgroup can consume only up to the configured amount of the resource. Then we have protections. If I am not mistaken, we didn't have any protections in cgroups v1; this is an entirely new concept in v2. We say that a cgroup is protected up to some configured amount of a resource: if the usage of the resource is below that amount, all is fine; if the protection cannot be honored, the kernel starts to do something that depends on the type of the protection. I will give a couple of examples on the next slide. And then we have allocations, which we had even in v1: the exclusive allocation of some finite resource, for example the real-time budget. With all of these, overcommit is allowed, except for allocations: there you have a fixed amount of a certain resource and you have to divide it somehow; there cannot be any overcommit.

This next slide just shows that we have limits and protections, hard limits and soft limits, and different types of protections; you can admire my graphical skills. The limits are the easier part. We set a limit, and there is a hard limit and a soft limit. The soft limit is actually the main limit you want to set: it says that the given service, the given control group, should use the resource up to this level, and then all is okay. If it goes above, we kind of nudge it back towards being below the soft limit, for example by slowing it down or taking the resource away from it, but nothing too bad happens. The hard limit is a safety limit: above it, in the case of memory, the OOM killer is invoked. Then we have protections. How they are implemented is specific to each resource, but in the case of memory, if the amount of the resource promised to the protected cgroup cannot be given to it, things happen to other control groups, to other processes: those other processes are punished so that our protected control group can get the resource that was assigned to it. For example, if we are below the low protection level, the kernel will reclaim memory from other processes and skip our cgroup, but reclaim from our group is still possible; and if we are below the min protection, no reclaim whatsoever can happen, so that is almost like an exclusive allocation of the resource to the specific cgroup.

Yes, thank you. So for the memory cgroup we have a couple of limits and protections, for example memory.min and memory.low. Zbyszek already explained what they do, so I'll just say that the difference between min and low is that one is hard and one is soft.
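As a quick aside, here is a minimal sketch of the three flavors side by side ("demo" is a hypothetical cgroup and the numbers are arbitrary):

    cd /sys/fs/cgroup/demo

    # Weight: proportional share under contention (default 100, range 1-10000)
    echo 200 > cpu.weight

    # Limit: hard cap; exceeding it invokes the OOM killer inside the group
    echo 1G > memory.max

    # Protection: while usage stays below this, reclaim targets other groups first
    echo 256M > memory.low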
So memory.min is actually the hard memory protection: it means that if the memory usage of the given cgroup is below that amount, the kernel will never try to take pages away from the processes of that cgroup. With memory.low, if there is basically no memory left to reclaim from unprotected cgroups, the reclaim algorithm will also target processes in this protected cgroup. Then we have memory.high, which is the memory throttle limit: if our memory usage goes above it, we are put under heavy reclaim pressure, so the kernel will try to swap our pages out to disk and free the memory that we are using above the limit. And memory.max we already talked about: we basically get OOM-killed when we go above it.

So that is memory. For block IO, for the IO controller, we also have some protections and some limits. We can set weights; this is an analog of the weight in the CPU controller. It should be noted that this is work-conserving, which basically means that if there is enough of the resource for everyone, the kernel doesn't do anything; only if there is resource contention will the kernel actually look at the configuration and try to distribute the resource according to our specification. Which ones are work-conserving? io.weight; CPU as well, yes; and in a sense io.latency too: if you are below the latency target nothing happens, and only if you are above the target will the kernel try to punish the processes in unprotected cgroups so that your latency goes back below the target. With io.max you can set absolute per-device bandwidth limits, configured in bytes per second or in IOPS; it is quite a flexible interface. And io.latency is the per-device latency target protection: you are saying that IOs going to a certain block device shouldn't wait more than a certain amount of time, and if that happens, the kernel will try to do something so it stops happening, slowing down IO from unprotected cgroups so you get back below your target.

One more note: these are the low-level control files in the cgroup hierarchy, and systemd has higher-level options for all of them. Look at the man page systemd.resource-control and you will find the high-level options and the high-level systemd APIs for all of them (a couple of examples follow below). In general, just use systemd and you can forget about all these low-level details, unless you really are a person who writes system software like libvirt. It should also be mentioned that the knobs in systemd are mapped to both versions of the hierarchy, so systemd smooths over the transition.
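A short sketch of those high-level options (a hypothetical unit "myservice.service"; see systemd.resource-control(5) for the full list):

    # The same concepts as the low-level files, set through systemd
    systemctl set-property myservice.service MemoryMax=1G MemoryLow=256M
    systemctl set-property myservice.service CPUWeight=200 IOWeight=150

systemd writes these into the appropriate memory.*, cpu.*, and io.* files on a v2 system, or into their v1 counterparts on a legacy one.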
Now let's talk about the current state. Most software doesn't care; systemd cares, and some select programs care, but most software doesn't, except for stuff like Kubernetes and Docker. Generally the support for version 2 is rather spotty, which is a bit strange, because it is not a big surprise that this is happening. The nice thing is that the stack from Red Hat, Podman and the associated tools, works nicely with version 2, and libvirt also got support, directly as a response to the change in Fedora. Java, snapd, crun, CRI-O, a bunch of things work. And Docker, Kubernetes, and some other stuff are version 1 only at this time. This is a slide from Philippa's presentation at All Systems Go, and I did some corrections here, because stuff is changing to green quickly; you can see what changed over the last year. But still, Docker and some of the stuff here on the right will not run nicely on version 2. As far as I know, Docker even refuses to start at all. But progress is happening: pull requests are being sent to various repositories, and things will hopefully improve.

The switch in Fedora happened a few months ago, and it has been surprisingly smooth; there has been no major blow-up. This thing had been in preparation for, let's say, five years, and suddenly we did the switch, and some container people are even happy. If you run Docker you need to set a kernel command-line parameter (systemd.unified_cgroup_hierarchy=0, which switches back to the old hierarchy), but other than that, things seem to be going okay.

And that's more or less it. To summarize: we have a much nicer hierarchy system with safe delegation and a consistent interface, higher-level knobs, soft limits, and so on. Things that don't need to be a controller and don't need to be tied to the control group hierarchy have been replaced by eBPF filters, which gives people more flexibility. And, we didn't talk about it, but better monitoring tools are being developed, in particular the PSI interface, which gives feedback about actual resource contention at the cgroup level.

And basically the last slide: if you want to have a look at some of the materials we used while preparing this presentation, there is the new kernel documentation; this is the link to the HTML version, which is very nice if you want to know all the low-level details. Then, regarding delegation and how all of this works with systemd, there is an upstream page, systemd.io/CGROUP_DELEGATION: if you are writing some system software, a container manager or whatever, have a look there. And there are a bunch of other presentations about this topic online.

So, questions? We have a few minutes if anybody has one.

So the question was that the control files of the IO controller are all focused on specific block devices, and whether there is some work going on to also affect the file-system layer with cgroup settings. I don't know; I am not aware of any. But if you use the high-level options that systemd provides, systemd can at least translate paths: it can resolve file-system paths to block devices, so you don't have to know or even care about major and minor numbers and the other low-level details of the kernel interface; systemd will do the translation for you. But that's probably not what you were asking about.

Oh, there is a comment from Bernard. So it's an interface to the IO elevator, a way to tune the IO elevator, which basically means it only applies to devices that actually have an IO elevator, and file systems do not have their own. And nowadays storage is complex: if you have, I don't know, a RAID device, and stuff in LUKS, these all appear as block devices, but you would never actually configure the resource settings on the higher-level fake block devices; it always needs to be propagated down to the actual physical hardware, which is what has these IO elevators. And systemd will help you with this.
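A tiny sketch of what that path resolution looks like in practice (the unit name and path are hypothetical):

    # systemd accepts a file-system path and resolves it down the storage
    # stack to the backing block device that has an IO scheduler
    systemctl set-property myservice.service "IODeviceWeight=/var/lib/data 200"
    systemctl set-property myservice.service "IOWriteBandwidthMax=/var/lib/data 10M"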
As Michal said, we will figure it out: if you specify a path in a file system, systemd is smart enough, to some degree, to trace this down the stack. It will figure out, oh, is this on LUKS, is it on dm-crypt, dm-verity, whatnot, figure out what the backing device is, then figure out, oh, this is a partition, and so on, until it finds a device that actually has an IO elevator. It's not perfect, because it doesn't cover RAID and things like that, but for the most part you can ignore the fact that the kernel does nothing about file systems, because you can still specify a file-system path.

There was a question: has there been any work on GPU cgroups? Can you explain what you mean? I don't think so.

Next question: is there a connection between cgroups and what I would call the traditional tools, like nice and ulimit and so on? Those are per-process controls, and of course a process can fork and then it gets the same set of resources again, right? Cgroups are nicer in this regard, because we assign resources to the group, and forking does not change the resource limits.

Another question: when you set those latency limits and things like that, is there something that checks the complete picture? Because you can set up impossible-to-satisfy constraints when you ask that every cgroup keep its latency below something. Is there something that monitors this and tells you, okay, I am above the limit, I cannot satisfy this anymore? As far as I know, there is nothing that would do this for you. The advice in the kernel documentation is to run your workload, have a look at io.stat, the cgroup control knob that reports statistics, maybe look at PSI, the pressure stall information, and figure out what sensible latency targets are for your storage and your workload (see the sketch below). So there is nothing that does this work for you, as far as I know, and yes, you can set things in nonsensical ways, so you should experiment, benchmark, and try to figure out what sane values are. Also, except for some of the hard allocations, it is totally okay to have a nonsensical division in the sense of overcommitting the resource; this is explicitly okay.

One more question: if you nice a process inside a cgroup, will the cgroup scheduler still schedule this process with a 50% share of the time, and what can be done about this? Suppose that in one terminal you run some compilation with nice -n 20, and the cgroup system ignores this priority. Well, I haven't tried it recently, but a few years ago I ran a compilation with nice -n 20 and a different workload in a different terminal, and both the compiler and the other workload got 50% of the CPU, so it ignored the nice value completely. The way you should see it is this: the cgroup tree gives you a tree structure, and you should think of the processes as still sitting at the leaves, at the very bottom of this tree. What you configure as weights on the cgroups, you configure as nice values on the processes, and it is as if every process had its own little cgroup at the very bottom of the tree with the nice value as its weight, basically. So if you nice some process, it shouldn't have an effect on the rest of the system; it should only have an effect on its immediate sibling processes inside its own cgroup.
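To make the earlier pointer to io.stat and PSI concrete, this is where those statistics live (a sketch; "demo" stands for whichever cgroup you are inspecting):

    # Pressure stall information, system-wide and per cgroup
    cat /proc/pressure/io
    # -> some avg10=0.00 avg60=0.00 avg300=0.00 total=...
    cat /sys/fs/cgroup/demo/io.pressure

    # Per-device IO statistics for the cgroup: rbytes, wbytes, rios, wios, ...
    cat /sys/fs/cgroup/demo/io.stat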
That all said, resource isolation on Linux is not as good as it should be, right? We are working towards the goal that if you have a misbehaving workload in some cgroup, one that really, really misbehaves, takes a lot of memory, takes a lot of CPU, the other workloads are not affected; currently, unfortunately, we are not at that point yet. But there is work going on, with things like oomd, from various parts of the user base; the embedded and desktop people want this too, so that we finally get to the point where we can guarantee resource isolation, as far as that is possible with modern hardware. If you have a misbehaving workload, it shouldn't affect the rest, and if it misbehaves too much for the system, takes too much CPU, then eventually we kill it so that the other stuff can continue unaffected. Things like the pressure stall information should be seen, I guess, as one piece of the puzzle on the way to this kind of isolation. But no, we are not there yet. It is a limitation of the kernel, and people, in particular the Facebook people, are working towards this goal, so maybe we will have it soon enough; we are getting closer and closer. Okay, so thank you. Thank you.