Good morning, ladies and gentlemen. My name is Raymond Long. I'm an engineer in the Red Hat core kernel group. Today, my presentation is about cgroup v2. First of all, I want to introduce you to cgroups. In the Linux kernel, cgroup is shorthand for control group, which is just a collection of processes that are bound by the same criteria and associated with a given set of resources that you want to control. Cgroups give you a mechanism by which you can structure all the processes in the system hierarchically, like a tree, and define how much of each resource you want to allocate to each group of processes, in a way that is controllable and easier to manage. Control groups, together with namespaces, form the basis of what we can do with containers. Containers basically require two essential kernel technologies: one is cgroups, the other is namespaces. Without these two kernel technologies, you can't have a container at all. There are many kernel resources that can be controlled by cgroups. You can control how much CPU you want each group of processes to have, how much memory is allocated to each set of processes, and the amount of block I/O, the bandwidth, the relative importance, et cetera. Which devices are associated with each cgroup? You can control that too. And also things on the network side, like bandwidth, priority, et cetera. You can also control how many process IDs you want to allocate to a set of processes. If you don't limit the number of PIDs, one program can try to spawn as many processes as possible and use up the whole PID space. Then all the other applications will starve for PIDs, and they can't start or they have problems working. So it's a form of denial-of-service attack. So sometimes you want to limit how many PIDs can be allocated to each set of processes.
So how do you use control groups? From the kernel's perspective, the process hierarchy is presented as a directory tree in the cgroup virtual file system. Each directory in the tree has a set of cgroup control files that can be read or written. You can write values into those control files to associate resource limits with the group of processes in that cgroup. You can also read data out for state information, as well as statistical counts that you want to collect, like how much memory is currently consumed by a given set of processes. Every process running in the system is associated with one of the directories in the hierarchy. The default is the top level of the hierarchy: if you've done nothing, all the processes are in the top-level directory, and you have to explicitly move them to a subdirectory, where you can set resource limits and control how much they consume. If you look at the control files in each directory, you can see a set of control files that start with the prefix cgroup. They are associated with the cgroup core. Then each controller has its own set of control files. In this case, I'm showing cpuset. You have cpuset.cpu_exclusive. This is a boolean file that tells you whether you want all the CPUs within that cgroup to be exclusive to that cgroup and not used by other cgroups. If so, you set it to one. Otherwise, you just leave it at zero. Et cetera. The files listed here are associated with cgroup v1. Cgroup v2 is somewhat different, and I will talk about that in a later slide. So how can you use cgroups? There are several ways. When you want to play around with cgroups, you can manually create a subdirectory in the virtual file system to create a new cgroup.
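The manual approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names, the cgroup name `demo`, and the PID 1234 are hypothetical, and the demo runs against a scratch directory rather than a real cgroup mount (commonly /sys/fs/cgroup). On a real system the kernel creates the control files when the directory is made, the writes need sufficient privileges, and v1 hierarchies also offer a `tasks` file, which can be selected through the `procs_file` parameter.

```python
import tempfile
from pathlib import Path


def create_cgroup(parent: Path, name: str) -> Path:
    """Create a child cgroup by making a directory under `parent`."""
    path = parent / name
    path.mkdir(exist_ok=True)
    return path


def move_process(cgroup: Path, pid: int, procs_file: str = "cgroup.procs") -> None:
    """Move a process by writing its PID into the cgroup's process file."""
    (cgroup / procs_file).write_text(f"{pid}\n")


def list_processes(cgroup: Path, procs_file: str = "cgroup.procs") -> list[int]:
    """Return the PIDs listed in the cgroup's process file."""
    return [int(tok) for tok in (cgroup / procs_file).read_text().split()]


if __name__ == "__main__":
    # Scratch directory stands in for the cgroup mount, so no root is needed.
    with tempfile.TemporaryDirectory() as scratch:
        demo = create_cgroup(Path(scratch), "demo")  # hypothetical cgroup name
        move_process(demo, 1234)                     # hypothetical PID
        print(list_processes(demo))
```

Passing the parent directory as a parameter keeps the sketch testable without touching the live hierarchy; pointing it at a real mount is left to the reader.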
You then manually move a process from the root cgroup to another cgroup by just echoing the process ID into that cgroup's control file. People can also use tools like cgcreate, cgexec, et cetera to manage cgroups. But in the majority of cases, you are using some middleware layer like Docker, LXC, or systemd that manages cgroups for you. So you don't have to worry about how to use cgroups. You just define your policy choices, and the system software manages that automatically for you. You don't have to worry about the details of how cgroups work and how you are going to manage them. OK, in terms of internals, cgroups consist of two major parts. There's a core that is responsible for hierarchically organizing the processes and associating each process with a cgroup. And then there are the cgroup controllers, which is where the action is. There are many different controllers. Each controller controls one type of resource, and typically you can manage that resource along the hierarchy, so you can set different limits for different cgroups. But there are also some controllers that are not really trying to manage resources. They are more like utility controllers that perform certain functions that may not be directly associated with the resource consumption of the processes. They do things like accounting and control. So, the currently available controllers, at least for cgroup v1. There is blkio, for block I/O. Then there is cpu, which controls how much CPU time the processes in the cgroup get. There is also cpuacct, CPU accounting, which just gives you accounting information, like how much CPU time has been used by the processes in the cgroup, et cetera. So cpuacct is not really a resource controller. It's more for accounting purposes.
And then you have cpuset, which controls the CPU as well as the memory node affinity of the processes. Then you have the devices cgroup, which controls which devices can be used by each cgroup. Freezer is a control-type controller: you can use the freezer to freeze the processes, and then you can do something with them. When you freeze a process, you basically put it to sleep, and it's not going to execute anymore until you unfreeze it at a later time. While it's frozen, you can do something with it, like migrate it to some other place or to another system, et cetera, and then resume it there. Then there is the hugetlb cgroup, which basically allows you to manage how many hugetlb pages are associated with each cgroup. Then there is the memory controller, which monitors and controls how much memory the processes in each cgroup can consume in total. net_cls and net_prio are used for networking purposes. Then there's the perf_event controller. Again, it's a control type of controller. It does not actually manage a resource; it's used to associate a specific cgroup with the perf tool. When you use the perf tool, one of the options lets you select which processes to monitor, and you can select all the processes within a cgroup to do instrumentation on them. This is where perf_event comes in. Then you have the pids cgroup, for how many PIDs you want to allocate to the processes within the cgroup, as well as the rdma cgroup, controlling resources related to remote DMA. OK, cgroup controllers and hierarchy. With the original v1 cgroup, each controller can have its own process hierarchy. If you look here, this is under the /sys/fs/cgroup directory, and you can see that there's a bunch of subdirectories. Each of the subdirectories is associated with one type of controller. So you have one subdirectory for blkio, cpu, cpuset, devices, et cetera.
If you look into each of the subdirectories, you will see a bunch of files that are related to that cgroup: the ones associated with the cgroup core, as well as notify_on_release, release_agent, and tasks. If you want to move a task from one cgroup to another, you just echo the process ID into the tasks file and it will move the task to that particular cgroup. And cgroup.procs will show you what processes are in that cgroup. You can see that there are a lot of processes in the root directory because, by default, all processes go into the root cgroup unless you move them to one of the subdirectories underneath it. OK. On flexibility: you can also combine two or more controllers into the same hierarchy. In the case of v1, we usually combine the cpu and cpuacct controllers into one hierarchy. But because of that flexibility, you can combine different things, and so different distros may arrange things in slightly different ways. So there's no standard or hard rule on the best way to do things in v1, and it depends on what middleware you use to manage cgroups. In our case, we use systemd, which manages the cgroups for you. But other distros may use other middleware tools for a similar purpose, and they may do things in a somewhat different way. Also, there are cases where one controller may want to cooperate with another controller to manage a certain type of resource. In v1, you just can't really do that, because the two controllers may be in two separate hierarchies with completely different structures. A process may be in a completely different position in one hierarchy than in the other. So you just can't coordinate between two different controllers and do things in a meaningful way. And this is where cgroup v2 comes in. In v1, you can have separate hierarchies. In v2, there's only one hierarchy, which we call the unified hierarchy.
All the controllers are in the same hierarchy. Internally, in the kernel community, we call the hierarchies used by v1 controllers the legacy hierarchy, while the hierarchy used by v2 controllers is called either the default or the unified hierarchy. So how can we use the unified hierarchy? A controller can be either in v1 or in v2. It can't be in v1 and v2 simultaneously, because a controller only knows how to manage one hierarchy. You can't have one controller involved in more than one hierarchy, or it gets confused. By default, any v2-capable controller that is not bound to v1 will be attached to v2. You can use the cgroup_no_v1 option on the kernel command line to specify which controllers should not be bound to v1. In that case, they will bind to v2. One new feature that is in v2 but not in v1 is the concept of delegation, where a less privileged user is allowed to manage cgroups in a certain limited way, like moving processes from one cgroup to another. Usually, in the case of v1, you have to be root to do that. But in v2, if you use the right option, a normal user can also move processes from one cgroup to another. Unlike in the legacy hierarchy, a controller in the unified hierarchy is not enabled by default, except in the root directory: they are all enabled in the root. But underneath the root, you have to explicitly enable each controller at each level of the directory hierarchy. The way to enable it is to use the cgroup.subtree_control file. For instance, if you want to enable the cpu controller for all the child cgroups underneath the current directory, you echo +cpu into the cgroup.subtree_control file. If you want to disable it, just echo -cpu, meaning remove the cpu controller from all the child cgroups. If a controller is not enabled in a cgroup, the settings at the nearest ancestor cgroup where it is enabled are what control those resources.
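The enable, disable, and nearest-ancestor rules just described can be illustrated with a small in-memory model. This is a toy sketch, not kernel code: the class, its method names, and the cgroup names `apps` and `web` are hypothetical, and the model deliberately ignores other v2 rules such as the restriction on processes in non-leaf cgroups.

```python
class Cgroup:
    """Toy model of one directory in the unified hierarchy."""

    def __init__(self, name, parent=None, available=()):
        self.name = name
        self.parent = parent
        self.children = []
        self.subtree_control = set()      # what this cgroup hands to its children
        self._available = set(available)  # only meaningful for the root
        if parent is not None:
            parent.children.append(self)

    @property
    def controllers(self):
        """Controllers enabled in this cgroup (models cgroup.controllers)."""
        if self.parent is None:
            return set(self._available)
        return set(self.parent.subtree_control)

    def write_subtree_control(self, token):
        """Apply one '+name' or '-name' token, like an echo into the file."""
        sign, name = token[:1], token[1:]
        if sign == "+":
            if name not in self.controllers:
                raise ValueError(f"{name} is not enabled in {self.name}")
            self.subtree_control.add(name)
        elif sign == "-":
            if any(name in c.subtree_control for c in self.children):
                raise ValueError(f"{name} is still enabled below {self.name}")
            self.subtree_control.discard(name)
        else:
            raise ValueError("token must start with '+' or '-'")

    def governed_by(self, name):
        """Nearest cgroup, self or ancestor, whose settings apply for `name`."""
        node = self
        while node.parent is not None and name not in node.controllers:
            node = node.parent
        return node


if __name__ == "__main__":
    root = Cgroup("root", available={"cpu", "memory"})
    apps = Cgroup("apps", root)   # hypothetical cgroup names
    web = Cgroup("web", apps)
    root.write_subtree_control("+cpu")
    print(web.governed_by("cpu").name)
    apps.write_subtree_control("+cpu")
    print(web.governed_by("cpu").name)
```

In the demo, `web` is first governed by the settings in `apps`, because only the root has handed cpu down; once `apps` enables cpu for its own children, `web` gets its own settings.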
For instance, in the unified hierarchy, you have the root directory, and then there's a directory A. Underneath it, there's a directory B and a directory C, et cetera. Under root, you can also have another directory D. In the root, cpu is enabled by default. When you echo +cpu into the root's subtree_control file, both A and D will have cpu enabled, but not B and C. To enable it in B and C, you have to go to directory A and echo +cpu into the subtree_control file within A, and so on. Yes, there's an echo here; cpu needs to be enabled here as well. So in this particular case, because B and C don't have the cpu controller enabled, all the processes underneath B and C are grouped together with A and managed by the same controller settings in directory A. It's as if those directories don't exist at all for that particular controller. You can have other controllers enabled in B and C, and in that case the processes within B and C are managed by the settings defined in the control files in that directory. But if a controller isn't enabled, then the settings of the nearest ancestor control the processes within the subdirectories underneath it. OK, one major difference between cgroup v1 and v2 is that in v1, tasks are managed at the thread level. So you can have individual threads put in different cgroups. In v2, the default is that you manage tasks at the process level. So you can have one process and all its associated threads in one cgroup or another, but you can't have some of the threads in one cgroup and the other threads in another cgroup. That is not allowed in cgroup v2. That is fine for most controllers, except for the cpu controller, which requires thread-level control. About three years ago, people started to work on the cgroup v2 cpu controller.
They had some disagreement on how to manage threads, and that led to development being stalled for almost 18 months while they argued with each other about the best way to do thread-level management. This is where the cgroup thread mode comes in. Finally, we reached consensus on how to do it, with the new thread mode support. In that case, only some cgroup controllers are regarded as thread-mode enabled. They can be used when thread mode is enabled, but the others cannot. An example of a threaded controller is the cpu controller. An example of a non-threaded controller, which we call a domain controller, is memory. For memory, all the threads within the same process share the same virtual address space, so you can't have different controls for different threads. It just doesn't make sense. The way to use thread mode is through a special control file called cgroup.type. In the diagram here, if you want to use thread mode, you echo the string threaded into the cgroup.type file. Once you do that, this becomes what we call a threaded cgroup. And the parent of a threaded cgroup is in a special state, which we call the threaded domain. But a sibling at the same level can remain a domain cgroup, so it's not threaded. Only this one is. So this threaded domain can have children where some of them are threaded while the others remain non-threaded. A threaded cgroup and all its descendants are what we call a threaded subtree. Once a cgroup is threaded, when you create a subdirectory under it, that subdirectory will be threaded automatically. But if you have an existing tree, where the cgroup already has existing children, and you change the cgroup in the middle into a threaded cgroup, then its children are not threaded by default.
Such a child stays in an invalid state, in which you can't use it, until you either change it to threaded or remove thread mode from its parent. There are also some behavioral differences between the cgroup v1 and v2 controllers. The v1 controllers were developed independently, at different times in the development process, so they have somewhat different naming and usage conventions, and one v1 controller may look very different from another. When people were working on v2, one of the design objectives was to make the naming and usage conventions more consistent. I will show you some of the control files that are now available with v2. They look more or less similar in terms of naming and semantics. Another goal of designing v2 was to trim out features that are seldom used or not that useful. So if you look at the v2 controllers, they have fewer controls than what we currently have in v1. It doesn't mean those features won't get into v2 eventually. It's just that we need a justification in order to add those features to v2. People don't want to do it just for the sake of it, because it's available in v1. They want a good reason to do it. As long as people can provide the justification, I believe you can request that features that are currently in v1 be enabled in v2 as well. About cgroup migration: cgroup v1 is still used predominantly in most distributions because cgroup v2 isn't complete yet. There are still some controllers under development. cpuset, for example, just recently went into the 5.0 kernel. The freezer controller for v2 is currently being discussed upstream, and most likely it will go into 5.1, I believe. As for the hugetlb controller, it hasn't been decided whether we need to have it available in v2 at the moment. It depends on whether there is a requirement for it or not.
Of the v1 controllers, the devices controller and the net_cls and net_prio controllers will not be supported in v2. Instead, they should be managed by using eBPF programs attached to each cgroup. One of the features newly added in cgroup v2 is that you can attach an eBPF program to each cgroup. So if you want to manage networking or manage devices, you use an eBPF program to do it instead of explicitly having a controller in cgroup for that. So the unified hierarchy has a number of advantages, but it also has some drawbacks. The primary disadvantage is that, because you only have one hierarchy, if you want different combinations of different controller settings, you end up creating a lot more subdirectories, a lot more cgroups that you need to manage. This proliferation of cgroups is the result of having only one single hierarchy. Another issue is that you have to individually enable each controller within each cgroup, and you do it layer by layer. Let's go back to this slide. Say you have a controller like the cpu controller enabled here. Let's say you don't want the cpu controller enabled here, in the middle, but you need it in this child cgroup. You can't do that. You have to enable the cpu controller in the middle cgroup before you can enable it in the cgroup underneath. So along the whole tree, you have to enable it on the way down from the root before you can get to the leaf. It doesn't force it back? Trying to enable it in the child wouldn't force it back up? No. You have to enable it here before you can enable it there. If it's not enabled in the middle cgroup, you can't enable it in the child cgroup. So if the writer was unprivileged, in an unshared user namespace, and they tried to do it, there's no way they can ask for that to be enabled? If this is not enabled here, then you can't enable it anywhere in the subtree.
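The layer-by-layer rule discussed above can be made concrete with a small helper that lists, in top-down order, the writes needed before a controller becomes available in a given leaf cgroup. This is a sketch under assumptions: the function name is hypothetical, the mount point /sys/fs/cgroup and the path `a/b/c` are placeholders, and the helper only computes the list without touching any files.

```python
from pathlib import PurePosixPath


def subtree_control_writes(mount: str, leaf: str, controller: str):
    """List the (file, token) writes needed, from the root down, so that
    `controller` is enabled in the cgroup at `leaf` (relative to `mount`)."""
    current = PurePosixPath(mount)
    writes = []
    # Every ancestor of the leaf, starting at the root, must hand the
    # controller down through its own cgroup.subtree_control file.
    for part in PurePosixPath(leaf.strip("/")).parts:
        writes.append((str(current / "cgroup.subtree_control"), f"+{controller}"))
        current = current / part
    return writes


if __name__ == "__main__":
    for path, token in subtree_control_writes("/sys/fs/cgroup", "a/b/c", "cpu"):
        print(f"echo {token} > {path}")
```

For a leaf three levels down, this yields three writes: one in the root, one in `a`, and one in `a/b`. No intermediate level can be skipped, which is the constraint raised in the exchange above.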
In the case of the cpu controller, there is some performance cost associated with deep levels of nesting. Say you have four layers in the hierarchy. Each additional layer adds cost to the cpu controller. There are two main reasons for the performance cost. First of all, there is more management overhead with a deeper level of nesting. Also, for each new cgroup with the controller enabled, you have to keep some statistical counts of how much of the resource is consumed. They are shared by all the processes within the cgroup, and when many of them try to update the counts simultaneously, it creates a cacheline contention problem that can slow down performance. One of the proposed patches that I sent upstream is what we call bypass mode. It's still a proposed patch right now, and I have to discuss with the other upstream developers whether it's good to do it or not. Currently, in v2, the state of a controller is either on or off. With the patch, we add a bypass mode. That means the controller is off in the middle cgroup, but once it's in bypass mode, you allow its children to have it enabled. In that way, you can skip some of the middle levels if you don't need the controller there. Would that be for a production use case? Like, you would do this as a production use case? Or is that for break glass, like you need to stop it from doing something temporarily? The primary reason for that is the cpu controller, because of the overhead associated with additional levels of hierarchy. You can have the controller not enabled in the middle layers, but then at the leaf, where you really need it, you can enable it. So effectively you only have two levels instead of maybe four or five levels. Now I'm going to talk about each of the major controllers in cgroup v2. The cgroup v2 core is responsible for managing all the processes within the cgroup hierarchy.
Within the core, you have the cgroup.controllers file, which is a read-only file listing all the controllers that are enabled in that particular cgroup. Then there's the cgroup.events file, which is used by management daemons to know that changes have been made in the cgroup so that they can act accordingly. Then there's cgroup.max.depth, to control how many levels of hierarchy you are allowed to have underneath. By default, it's max, which means there's no limit. But you can limit the number of levels by writing a numerical value into max.depth. cgroup.max.descendants is the maximum number of descendant cgroups you can create underneath. It is for control purposes, so you can't have too many cgroups created underneath if you don't want that. Then cgroup.procs is a file listing the PIDs of all the processes that belong to that cgroup. The way to move a process from one cgroup to another is to echo the PID into the cgroup.procs file. Then there's cgroup.stat, which shows you the number of visible and dying descendant cgroups that you have. cgroup.subtree_control I talked about previously; it's how controllers get enabled at the child cgroup level. And cgroup.threads is used in threaded mode, where you can move one thread to another cgroup and leave all the other threads behind in the original cgroup. This is a new control file that was created for thread mode. And cgroup.type is for thread mode management, as I talked about previously. As for the cpu controller, you use it to control how much CPU time is allocated to the processes in the cgroup. The v1 cpuacct controller is integrated into the cpu controller in v2, so there's no separate cpuacct controller anymore. It's all in the cpu controller now. cpu.max is a two-value file. It is for bandwidth control. It lets you control how much CPU time the cgroup is allowed to run within a given period.
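The bandwidth file described above, cpu.max in cgroup v2, holds two values: a quota, which may be the literal word max, and a period, both in microseconds. A minimal sketch of how a tool might parse it is shown here; the function names are hypothetical and the sample strings are illustrative inputs, not output captured from a real system.

```python
def parse_cpu_max(text: str):
    """Parse 'QUOTA PERIOD' as read from cpu.max.

    Returns (quota, period) in microseconds; quota is None for 'max',
    meaning no bandwidth limit.
    """
    quota_s, period_s = text.split()
    quota = None if quota_s == "max" else int(quota_s)
    return quota, int(period_s)


def cpu_limit_fraction(quota, period):
    """CPU time allowed per period as a fraction of one CPU.

    None means unlimited; values above 1.0 mean more than one CPU's worth.
    """
    return None if quota is None else quota / period


if __name__ == "__main__":
    for sample in ("max 100000", "50000 100000"):
        quota, period = parse_cpu_max(sample)
        print(sample, "->", cpu_limit_fraction(quota, period))
```

With a quota of 50000 and a period of 100000, the cgroup may run for half of each 100-millisecond period before it is throttled until the next period starts.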
These numbers are in microseconds. Here, when it says max, it means that there's no limit within a 100-millisecond period. Within each 100 milliseconds, it tracks how much CPU time the processes within the cgroup consume. If you put a numerical value here instead of max, then when you exceed the limit, the processes are put to sleep and not allowed to run until the next period starts. The cpu.weight file is for controlling CPU contention between different cgroups. The one with a higher weight will have more CPU time allocated to it. Say you have two cgroups. One has a weight of 200 and one has 100. That means, of the CPU time available, two thirds will go to the cgroup with weight 200 and one third will go to the cgroup with weight 100. cpu.weight.nice is another way of controlling the weight, where you use the same scale as the nice command that we have on the command line. OK, the cpuset controller. For cpuset, you can control which CPUs are associated with each cgroup. But requesting a list of CPUs doesn't mean that you will get them all. Depending on which CPUs are currently available in the parent, the CPUs you request may not all be granted to you. The actual CPUs that you are allowed to run on at the moment are in the cpuset.cpus.effective file. Similarly for memory, which memory nodes you are allowed to use is controlled by cpuset.mems and cpuset.mems.effective. And there's a new feature in v2 called cpuset.cpus.partition, which is used for real-time processes. In real-time management, some processes may want to have an exclusive set of CPUs that are used by those real-time processes only, and you don't want other processes to interfere with them. So we created the partition feature for cpuset to do exactly that. Then we have the memory controller. You can see that the naming conventions are more consistent here. You have min, which is the minimum amount of memory that you're guaranteed to have.
Reclaim won't happen if memory consumption is below that limit. low is the soft memory protection. Then you have high, which is the soft limit where memory reclaim will happen: if you exceed that limit, the kernel will try to reclaim some of the memory. Then you have max, which is the absolute limit that you can't go above. If you try to go higher than that, and the kernel can't reclaim the memory back, the OOM killer will be invoked to kill some of the processes. I won't go into the rest; I'm running out of time. Then there's the io controller. Again, you can see that the naming conventions are very consistent. You have weight, which is used to control how much share one cgroup gets versus another cgroup. Then there are the maximum bytes per second and I/O operations per second that you can have for processes within the cgroup. Within the io controller, there are also some sub-controllers: one for controlling writeback, and the other for controlling I/O latency. And cgroup namespaces, the missing piece. A cgroup namespace provides a mechanism to virtualize the view of the /proc/PID/cgroup file and the cgroup mounts. Cgroup v2 also supports a special nsdelegate mount option, which makes cgroup namespaces delegation boundaries. If you have this option enabled when you mount cgroup v2 and you use cgroup namespaces, then processes that have the right privileges within that particular namespace will be allowed to manage the cgroup files and move processes from one cgroup to another. I was supposed to give you a demo, but I think I will just show some of it here. So just give me a moment. And for RHEL: for RHEL 7, we are not going to support cgroup v2 because the change is just too dramatic and it would probably break a lot of things. So we are not planning to support cgroup v2 in RHEL 7. In RHEL 8, we are planning to support it. For the initial RHEL 8 release, the timing was an issue.
It does not have all the required controllers available yet, so it will be tech preview only. We are hoping to fully support cgroup v2 starting from RHEL 8.1, depending on the qualification work we do to make sure that everything works; then we will support it. Looking forward, we are going to see cgroup v2 adoption gradually increase over time. At the same time, cgroup v1 will stay around for a rather long period of time because, as I showed you in the presentation, there are some differences between v1 and v2, and not all applications can be easily moved from v1 to v2. So we will allow v1 and v2 to coexist. For one particular system, you can choose to use either v1 or v2, or in some cases a combination of both. We will see how things go and what to do next. We are expecting to have more features available with v2, as long as there is a solid requirement for them. At the current moment, the feature set is quite minimal. It's the core ones that we typically use. If you need a new feature, you have to raise your voice and request that it be made available in v2. And most distros, I think, are going to support cgroup v1 and v2 in some way for a rather long period of time. OK. That's the end of my presentation. Any questions? Toolchain? For this eBPF, you are asking for eBPF to be used with cgroups? Well, because the devices and net I/O controls are being moved to eBPF, currently that requires a full toolchain. I'm not going to put GCC on my firewall. So is there effort going on toward not requiring a toolchain on those nodes? I see. I don't know yet. eBPF is changing rapidly, so it's hard to say whether we need a complete toolchain or not. Currently, eBPF, I think, is better supported by Clang than by GCC. GCC is supported. Well, I'm just using GCC as an example. I don't really want that on some of the hosts. Yeah, I understand your point.
In most cases, I think the distro should provide the eBPF programs for you instead of users writing the eBPF themselves. We are thinking that, in the most likely scenario, we would provide a set of eBPF programs that you can just load and use instead of doing it on your own. Any other questions? OK, if not, thank you for your presence. Hopefully, I will come back maybe next year and see you. Bye. OK.