All right. Well, I want to thank you all for coming here. My name is Mike Anderson. Here are my particulars. My company is the PTR Group. I've been a speaker here at Embedded Linux Conference and many of the other OSS shows for about 20 years now. The PTR Group is a consulting house, and we do a fair amount of work with different customers, mostly in the embedded space, although we also build space robots for NASA. We handle all the software for the satellite refueling mission Restore-L at NASA and for RSGS at DARPA. So we've been in the business for quite a while. I was one of the founders, and I sold the company at the beginning of 2018. Then in February of this year, the company that bought us got bought. So I go from a 24-person consulting shop to a 40,000-person company. If you want, for instance, a nuclear submarine or an aircraft carrier, hit me up; I can put you in contact with the right guys. That's what Huntington Ingalls does. All right, so let's move on here. We're going to do a quick overview of the cgroup interface. We'll then talk about enabling and using cgroups and what we can actually control with cgroups. I'll show you a couple of example usages, and we'll finish with a quick summary. This is a very short session, only about 35 minutes, so we'll have to move through things pretty quickly. There's not a whole lot of depth I can really get into in that short time, but we'll do the best we can. All right. So what is a cgroup? It stands for "control group." It was actually added to the kernel in the 2.6.24 timeframe. So it's been in the kernel a long time, but it didn't get really significant enhancement until the 3.15, 3.16 kernel timeframe.
So it is a mechanism that's focused on hierarchically organizing resource usage across multiple applications. And it applies to any schedulable entity. So this means it works on containers, it works on VMs, it works on threads, it works on processes. Any schedulable entity inside the kernel can be subject to cgroups. Now, this becomes particularly interesting when we start dealing with multi-tenancy inside large VM environments, where I need to be able to strictly limit the amount of resources assigned to a particular tenant or to a particular set of VMs. We'll get into the details of that in a moment. So exactly what does it do? Well, a lot of times we're interested in placing restrictions on applications. If we're doing safety-critical applications, we need to make sure that no particular application hogs all the CPU, or hogs all of the memory, or consumes all the I/O bandwidth or all the network bandwidth. We need to be able to control that, and that's exactly what cgroups does. Now, as a mechanism, it manifests itself as a virtual file system, a lot like procfs, mounted by default under /sys/fs/cgroup. Depending on which version of the kernel you have, it may be automatically mounted, or you may have to mount it manually; I'll show you how to do that in a moment. But the key thing is this: let's assume I have a couple of processes, and I want to guarantee that no single process consumes all of the CPU, that both processes get exactly 50% of the CPU. With cgroups, I can do that. With no other mechanism, if I just let them run, I'm not going to be able to control them that way. More importantly, I can also do things like lock them to specific CPU cores. So I can do CPU pinning and a few other things. We'll talk about all the stuff we can do with cgroups in a moment.
There are two parts to cgroup. Now, is it "cgroup" or is it "cgroups"? It's "cgroup" if you're referring to the components in the kernel; it's "cgroups" if you have groups of cgroup. So it's kind of weird the way they define it, but that's okay. As far as the core is concerned, it's really responsible for establishing that hierarchy inside the kernel. Now, the core is integrated with eBPF. So if you're into the Berkeley Packet Filter and all the things that BPF can do, you'll find that cgroups is integrated quite nicely into that whole environment and can actually be controlled through it and provide additional information for you. The other piece of cgroup is the controller hierarchy. The controllers themselves are the things we're going to focus most of our effort on, because those are the things that manifest down in user space. The cgroup hierarchy is manifest under /sys/fs/cgroup by default, but you can mount it in other places. You can have multiple mounts of the same cgroup interface in multiple locations in the file system. One reason we might do that is a multi-tenancy environment where individual companies have their own separate disk areas. I may mount a cgroup interface inside a tenant's particular disk area because I want to control that particular VM or set of containers separately from other VMs or sets of containers. So for the normal mechanism, we'd use mount: `mount -t cgroup2 none` and then whatever mount point we wanted. Now, if there's a cgroup v2, it implies there's a cgroup v1 someplace, and there is, in fact, a cgroup v1. Cgroup v1 has now been deprecated, but you'll still find a lot of systems that use it. You can actually mount cgroup v1 and cgroup v2 simultaneously on the same system. That makes for some really confusing times; I don't recommend you do that. Just use cgroup v2.
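As a minimal sketch of that mount step (the mount point /mnt/cgroup2 is an arbitrary choice for illustration; this requires root, and on most systemd-based distros the hierarchy is already mounted at /sys/fs/cgroup):

```shell
# Mount the cgroup v2 unified hierarchy at a mount point of our choosing.
mkdir -p /mnt/cgroup2
mount -t cgroup2 none /mnt/cgroup2

# The controllers available in this hierarchy show up as a virtual file.
cat /mnt/cgroup2/cgroup.controllers
```

Because it's a virtual file system like procfs, unmounting it later throws nothing away; the kernel-side hierarchy is simply no longer visible at that path.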
The major reason they came out with cgroup v2 was to make it more compatible with Kubernetes and other similar sorts of mechanisms. They saw that the cgroup v1 hierarchies are kind of weird and get a little difficult to manage, so they simplified things considerably: okay, let's just use cgroup v2 and clear a lot of this stuff up. Now, not all distributions enable cgroups, and even if they do, they may not necessarily enable everything inside of cgroup. So here's a quick little snapshot out of my 5.2.1 kernel configuration. Obviously, it's an eye chart; you can't really see it very well. But there is an option that says "Control Group support," and when you enable it, a whole bunch of sub-options show up, which are all the individual cgroup mechanisms, the controllers themselves. We'll talk about what the controllers are, but just understand that if you go to use cgroups in your distribution, you may not necessarily have the cgroup mechanism turned on in your kernel, which may mean you have to go back and rebuild the kernel. Now, most of the major distros, Ubuntu, Red Hat, SUSE, have cgroup turned on. Whether they turn on everything in cgroup remains to be seen. You have to look and see which ones are which, but at least in most of the major distros cgroup will likely be turned on. Now, there are some user space packages that you also want: in particular libcgroup, cgroup-bin, cgroup-tools, and cgroupfs-mount. That's what they're called in the Debian world; if you're using one of the Red Hat or SUSE distributions, the names may differ. These are all user space tools focused on being able to manipulate and control cgroups.
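Before rebuilding anything, a couple of quick checks will tell you what your running kernel already provides (the config file path varies by distro, and some kernels expose /proc/config.gz instead):

```shell
# Which cgroup options was the running kernel built with?
grep CGROUP "/boot/config-$(uname -r)"

# Which controllers does the kernel know about, and are they enabled?
# (This file exists on both v1 and v2 systems.)
cat /proc/cgroups
```

If the controller you want is missing from /proc/cgroups, no amount of user space tooling will bring it back; that's when a kernel rebuild is on the table.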
So, once I have the cgroup file system mounted, I can look at the hierarchy, and you'll notice there's a whole bunch of controllers put together there. We'll go through most of these functions here: blkio, the CPU sets, and the funky one, freezer. Oh, that's an interesting one; we'll talk about that one. We'll go through most of these just so that you understand exactly what they're supposed to be doing. And remember, this is a virtual file system. So what we're going to do is create a cgroup, add processes to the cgroup, and then, once we've added them, associate these controllers with them. So, what sorts of things can we actually control? Well, the first one is blkio. We're interested in making sure that no particular application takes the lion's share of all the disk I/O. So with blkio, we limit based on block I/O performance: we want to make sure that this particular application gets no more than, say, 35% of the disk I/O. Part of the motivation, if you've never been in the service carrier or service provider business, is these things called SLAs. An SLA is a service level agreement. That's a contract between the service provider and the customer, where the service provider guarantees the customer, through the SLA, certain percentages of performance. So they will guarantee that you will get X CPU cycles per second or X amount of disk I/O per second. And in order for them to guarantee that those numbers are met, they'll typically use cgroups to enforce it. This also gives us the ability to do some accounting. So if somebody comes along and says, "Hey, you're not meeting my SLA," and the service provider can't show they met the SLA, the customer gets money back. And no service provider wants to do that.
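As a hedged sketch of that kind of blkio enforcement (the group name `tenantA`, the PID, and the 10 MB/s cap are made-up values; this assumes the v1 blkio controller mounted at the usual place and root privileges):

```shell
# Create a blkio group and cap reads from /dev/sda (device major:minor 8:0)
# at 10 MB/s = 10485760 bytes/s for every process in the group.
mkdir /sys/fs/cgroup/blkio/tenantA
echo "8:0 10485760" > /sys/fs/cgroup/blkio/tenantA/blkio.throttle.read_bps_device

# Put a process (hypothetical PID) under the limit.
echo 4242 > /sys/fs/cgroup/blkio/tenantA/cgroup.procs
```

The per-device form is what makes this useful for SLAs: each tenant's cap can be stated in bytes per second per device and audited from the same files.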
So they're going to use whatever tools they can get in order to demonstrate to the customer that they've actually met the details of the SLA. The first one there is blkio; we just talked about that one. The next one is cpu. The cpu controller deals with what percentage of the CPU a particular application gets. Now, each application gets by default 1,024 units of CPU share, and we can adjust the units accordingly to limit the amount of CPU time a particular process gets. I'll show you an example of that in just a moment. Then cpusets. Cpusets is an interesting one, because if you're familiar with the hardware, we have the L1 and L2 caches, potentially a tertiary cache if we're dealing with x86, and then we have physical memory that we're pulling applications out of. So when I get ready to fetch something, if it's in the L1 cache, I can usually fetch it at the speed of the processor; I'm not losing any time if I'm running out of the L1 cache. However, the L1 cache is limited in size. Even on the big x86 platforms, 32 KB of L1 cache is a big cache, because the L1 cache is oftentimes implemented as something like a 64-way set-associative memory. It's extremely expensive from a silicon real estate perspective on the die. So we'll have 32 KB of L1 cache, but maybe 6 MB of L2 cache, and the L2 is slower than the L1. So if I have an L1 cache miss and it's in the L2, it's probably going to cost me anywhere from 5 to 10 clock cycles to pull something out of the L2. If I have an L2 cache miss and I have to go to the tertiary cache, it's going to cost me 20 to 25 clock cycles.
But if I have to go all the way out to memory to fetch that instruction or that piece of data, it's going to cost me anywhere from 200 to 300 clock cycles. Now, understand that a processor like the x86 is superscalar, as a lot of processors are these days, which means it executes more than one instruction per cycle. So, just for the math, let's assume I'm running 10 instructions per cycle. If I have to reach out to physical memory and it costs me 200 clock cycles to retrieve something, that's 200 clock cycles times 10 instructions per cycle: 2,000 instructions' worth of work lost for one fetch out of physical memory. And I will never get that back. So how does that relate to cpusets? One of the things we want to do with cpusets is lock processes to particular processor cores. They call it CPU pinning, and when you pin a process to a processor core, you keep it warm in that core's cache. When you allow an application to migrate from one processor core to another, it comes into a cold cache, and therefore the first several thousand accesses will be cache misses that force you out either to the L3 or to RAM. So allowing an application to move back and forth is a significant problem if you're trying to guarantee performance. What the cpuset cgroup does is allow us to specify a collection of processor cores that we're going to allow this application to run on. And by setting that collection, we can say: all right, I want to make sure that everything that's health and status monitoring runs on processor cores one and two, and everything that's customer-facing runs on the other processor cores.
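A minimal sketch of that pinning with the v1 cpuset controller (the group name `health_mon` and the PID are made up for illustration; note that cpuset requires a memory node to be set before any task can join):

```shell
# Create a cpuset restricted to cores 1-2 on NUMA memory node 0.
mkdir /sys/fs/cgroup/cpuset/health_mon
echo 1-2 > /sys/fs/cgroup/cpuset/health_mon/cpuset.cpus
echo 0   > /sys/fs/cgroup/cpuset/health_mon/cpuset.mems

# Move the monitoring process (hypothetical PID) into the set;
# from now on the scheduler will only run it on cores 1 and 2.
echo 4242 > /sys/fs/cgroup/cpuset/health_mon/cgroup.procs
```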
And I may in fact lock particular customers to particular processor cores, again to optimize the use of the cache and guarantee that I can meet my SLA. So those are things we would typically do through cgroups. There's another mechanism to do this: taskset. The taskset command in Linux allows me to lock applications to particular CPU cores from the command line. So you can do this with cpusets, from the command line with taskset, or from within a GUI like virt-manager. Cpuacct just gives me CPU accounting information. This might be something we want in order to prove to the customer that we've met their SLA. We have devices, which allows me to set which devices a particular environment or particular process can access. This is important. How many of you have ever heard of network function virtualization? Anybody? A couple of you. All right, network function virtualization has to do with how I can take, say, an Ethernet card and dedicate five of the 32 channels off the card to a particular VM or to a particular container; I can do this with containers as well. There's some low-level hardware support for this, called SR-IOV, single-root I/O virtualization. Intel also has its own version called VMDq, part of Intel's VT-c virtualization in the vPro architecture. But oftentimes what we want is to be able to say to a particular application or container or VM: no, I'm sorry, you cannot open the modem; or: the smart card reader is dedicated to you so that nobody else can get it. That's what the devices controller does for us. The freezer, we'll come back to that one in a second. Then hugetlb, for huge-page translation look-aside buffers. This is an option we can turn on in the kernel.
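The taskset alternative mentioned above looks like this from the command line (the binary name and PID are placeholders):

```shell
# Launch a program pinned to cores 1 and 2.
taskset -c 1,2 ./my_app

# Re-pin an already-running process (hypothetical PID 4242).
taskset -cp 1,2 4242
```

Unlike a cpuset, taskset only sets the affinity of one process at a time; it doesn't create a named group that other processes can later be dropped into.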
You've got a machine with 64 GB of RAM. How do you take advantage of those 64 GB? One of the mechanisms is huge pages. If you turn on hugetlb support, some of the virtualization layers will automatically take advantage of it, and if you have a virtualization layer that understands huge pages, you'll get about a 10 to 15% improvement in performance for that particular VM. So it's a significant piece, and the hugetlb controller allows me to control which applications have access to the huge-page facility. Memory, that one's pretty straightforward: that is how much memory I'm going to give to a particular application, and it covers user space memory, kernel space memory, and swap usage. I can track it, I can limit it, I can account for it; all of that can be done through the memory controller. Net_cls is the network classification interface. This one is particularly interesting for those of you who have quality-of-service requirements. The Linux quality-of-service code has both classless and classful queuing disciplines, and with net_cls we can put tags on network traffic that force it down particular paths inside the QoS code. So I can guarantee that video, for instance, is treated one way, whereas bulk transfers like FTP are handled a different way. I can do traffic shaping, traffic scheduling, and traffic policing using net_cls. One that's similar and somewhat related, because it also has to do with networking, is net_prio. The net_prio controller says: well, I want to prioritize this user's application over this other user's application as far as network traffic is concerned. So if both of them are generating network traffic, I'm going to take this one first and then deal with that one.
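As a hedged sketch of the net_cls tagging path (the interface, group name, rate, and class IDs are made-up values; this assumes the v1 net_cls controller and the `tc` tool from iproute2, run as root):

```shell
# Tag traffic from the "video" group with tc class 1:1.
# net_cls.classid packs major:minor as 0xMMMMmmmm, so 1:1 == 0x00010001.
mkdir /sys/fs/cgroup/net_cls/video
echo 0x00010001 > /sys/fs/cgroup/net_cls/video/net_cls.classid
echo 4242 > /sys/fs/cgroup/net_cls/video/cgroup.procs   # hypothetical PID

# A classful qdisc plus a cgroup filter that steers tagged packets to class 1:1.
tc qdisc add dev eth0 root handle 1: htb default 30
tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit
tc filter add dev eth0 parent 1: handle 1: cgroup
```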
So I'm going to be able to set priorities on traffic. Now, I mentioned the freezer; let's talk specifically about what freezer does. Freezer is like putting your laptop to sleep. With freezer I can say to all the processes that are part of this cgroup: go to sleep now. It shuts them off. I can then move them to another processor and wake them back up again. So you can do process checkpoint/restart kinds of things with the freezer interface, and some of the virtualization layers actually take advantage of it. KVM, for instance, knows what to do with the freezer interface when it gets ready to migrate VMs from one place to another. Perf_event gives me a per-CPU mechanism to monitor the threads in a particular cgroup. The pids interface allows me to keep track of parents and children across forks: when you fork something off, who does it belong to? Rdma is the regulation of the remote DMA mechanism; if I'm going to be doing high-speed RDMA, this is the way I can control that. And then the unified hierarchy lets me split out control not by process but by user. So I can get to the point where I go: okay, say user A has four processes and user B has 20 processes. In the grand scheme of things, user A is getting one fifth of the CPU that user B is getting, because user B has more processes all looking for CPU time. What I'd like to be able to say is that user A and user B are equally important: give user A's four applications 50% of the CPU and user B's 20 applications the other 50%. So I can do balancing at that level as well, which again is really handy for service level agreements where I have multi-tenancy to deal with.
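The freezer behavior described above can be sketched like this with the v1 freezer controller (the group name `batch` and the PID are illustrative; requires root):

```shell
# Create a freezer group and put a process in it (hypothetical PID).
mkdir /sys/fs/cgroup/freezer/batch
echo 4242 > /sys/fs/cgroup/freezer/batch/cgroup.procs

# Stop every task in the group, like suspending a laptop...
echo FROZEN > /sys/fs/cgroup/freezer/batch/freezer.state

# ...do whatever you need (snapshot, migrate), then resume them.
echo THAWED > /sys/fs/cgroup/freezer/batch/freezer.state
```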
So kind of the good and bad about the cgroup interface is that there are a lot of knobs you can twist, so many adjustments that it's sometimes mind-boggling. Here's an example from just one of the controllers, the memory controller. You'll notice that we have kernel memory, we have memory usage limits in user space, we have maximum usage in bytes, we have swap, and we have control for NUMA interfaces; we'll talk about that in a moment, as it's becoming an increasingly important interface in a lot of the larger systems. And I happen to have LXC and LXD installed on my platform, so when I took this screenshot we saw a couple of extra entries specific to LXC and LXD. So here's an example of the manual cgroup interface. I've got this test script, and all it does is print a message and go to sleep, forever. What I would like to do is create a cgroup. This is doing it by hand; we'll see in just a minute how you can do it with tools as well. So we do a mkdir inside the cgroup hierarchy: in the memory controller, we create a new group called foo. It's just a directory, but now we're going to add things to it. We echo 50 megabytes into the memory.limit_in_bytes file. This sets the maximum memory limit for the foo group: anything in the foo group is subject to that limit. If you then cat that file, it turns out that these limits always get rounded up to the next multiple of the page size. On x86 those pages are 4096 bytes, and if you look at 50 million divided into 4096-byte pages, you get these numbers. All right, so let's create an application, launch it, and get its process ID.
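Put together, the manual steps look like this (the group name `foo` and the 50 MB limit come from the slides; the paths assume the v1 memory controller at its usual mount point, run as root):

```shell
# Create the group and set a 50 MB (decimal) limit.
mkdir /sys/fs/cgroup/memory/foo
echo 50000000 > /sys/fs/cgroup/memory/foo/memory.limit_in_bytes

# Reading the limit back shows it rounded UP to a whole number of 4096-byte
# pages: 50000000 / 4096 = 12207.03..., so the kernel stores
# 12208 * 4096 = 50003968 bytes.
cat /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
```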
Now, we need its process ID because the PID is the major mechanism we use to add things to a cgroup. So now I know the process ID is 31344, and I echo 31344 into /sys/fs/cgroup/memory/foo/cgroup.procs. That adds it to that particular cgroup. We can then take a look using the ps command: show me the cgroups associated with process ID 31344. And it shows that for the memory controller, the process is in the foo group I created; that's the only controller that's actually specific to foo. So now, while the application is running, I can look at how much memory it's actually using. I take a look at its current usage in bytes and see that it's using only about 262 KB of RAM. So I can limit it, I can monitor it, but I can also control the way it dies. Let's assume I have a memory leak. I can set up the OOM killer in such a way that it understands that this application, when it exhausts its memory limit, is supposed to be killed and restarted automatically. Obviously I've got a memory leak; I need to report that and fix it. But I also need to keep the system running while the DevOps folks are out there trying to figure out what's wrong. So I can use the cgroup interface to control that and make it happen. Now, here's an example of doing this with libcgroup. Libcgroup comes with the cgroup-tools, and these include applications like cgcreate, cgexec, and cgget. We'll see how those all work. Cgcreate allows me to create a group: I create a group called groupA in the cpu controller, and do the same thing for groupB. Now, this is the cpu controller, so this controls what percentage of CPU time each of these applications gets whenever it's a member of that particular group. So I'm going to use the cgget command.
I get the number of CPU shares for groupA, and it comes back as 1024, which is the full allotment of CPU shares for that particular application; that's the default. We have the same thing for groupB. So now I'd like to test this out. Let's create a couple of applications. We do a cgexec into groupA and just run dd, grabbing lots of nulls and sending them off to the bit bucket. We're just burning CPU time; I deliberately wanted nothing that would involve any disk I/O or network I/O or anything else, so I'm measuring pure CPU performance. Do the same thing for groupB. Now, because both of those were given the default 1024 shares, you'll notice they each end up eating 50% of the CPU. If that's not what I want, I can go in and adjust it. So I set cpu.shares for groupA to 768, three quarters of the full value, and for groupB to 256, one quarter of the full value. And sure enough, when we look at the percentage of CPU time being used, I've now got 75% of the CPU going to groupA and 25% going to groupB. The extra fraction after the decimal point had to do with the startup code and a few other things, so it added up to a little more than it should have, but once it reaches steady state it settles down and locks in at 25%. So we effectively have a mechanism here that allows us to control what's going on in our system. We can control memory, network I/O, disk I/O, the assignment of CPU cores or groups of CPU cores. We can do all of that through the cgroup interface. Where this really starts to become important is when we're using either containers or VMs in a multi-tenant environment.
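The whole experiment as a script (a sketch using the libcgroup cgroup-tools; the group names are from the talk, and it assumes the v1 cpu controller and root privileges):

```shell
# Create two groups under the cpu controller.
cgcreate -g cpu:/groupA
cgcreate -g cpu:/groupB

# Pure CPU burners: no disk or network I/O involved.
cgexec -g cpu:groupA dd if=/dev/zero of=/dev/null &
cgexec -g cpu:groupB dd if=/dev/zero of=/dev/null &

cgget -r cpu.shares groupA     # default: 1024, so the two burners split 50/50

# Re-weight: shares are relative, so 768 vs 256 yields a 75% / 25% split.
cgset -r cpu.shares=768 groupA
cgset -r cpu.shares=256 groupB
```

The split follows from the ratio, not the absolute numbers: 768 / (768 + 256) = 75%, so 3 and 1 would give the same result; 1024 is just the conventional full-CPU unit.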
If we're just running on-prem and we're just running our own code, then it's not really all that important. But when you're a service provider and you have to guarantee service to particular customers, or you're writing a mission-critical application that has to have a particular percentage of the overall CPU, then this starts to become important. Some of the container solutions like LXC and LXD have an automatic interface to cgroups; we saw that in the screenshot. With others, you have to do it by hand. Kubernetes also has mechanisms to go in and manipulate this. So we can either do it manually or allow things to happen automatically depending on the policies we set. Now, the cgroup mechanism ties in with namespaces. If we have a particular namespace defined, we can add the namespace to a cgroup, and everything that's a member of the namespace becomes controlled by that cgroup, by whatever controllers we attached to it. We can have multiple controllers associated with a particular group, so I can limit not just memory or CPU time; I can limit everything on a per-group basis. So if I have a namespace with applications that are members of it, I can drop the whole namespace into the cgroup interface and control the namespace from there. That's another capability we have with this. Namespaces and cgroups are not the same thing, but they are closely related, and we can take advantage of both interfaces if we need them for whatever we're trying to do. So I've got a couple of minutes for questions. Anybody got any questions? Yes. Yeah, so in fact, this was just a snapshot of the top two processes, and it has to do with whether I've allocated them to specific CPU cores.
So if you have multiple applications running on the same processor, and you typically do, then they're going to get in your way. I can guarantee that these particular groups are well-behaved, but anything else running on the platform may not necessarily be. This is where the cpuset cgroup comes into play, because now I'm going to assign cgroups in such a way that the important ones are locked to their own CPU core and nobody else plays on that core. The scheduler will continue to run and schedule things as it normally would, but it will not allow anything that's not a member of this particular cgroup to execute on those CPU cores. So I can restrict access to particular resources using this mechanism, taking advantage of the cpuset facility. Well, there's another thing you have to do; it gets a little more complicated than that. I wish it were that simple, but you do basically have to tell the scheduler: don't schedule anything else on these particular CPU cores. So it's a bit more involved, but that's essentially what you're doing: reserving those cores. And because of the way the cpuset mechanism works inside a virtual machine, say VMware, each virtual CPU is a process. So I can set my cgroup restrictions on a per-process basis for the virtual CPUs and control everything that happens inside the VM. They do the same thing for containers; it's just a little more visible when you're controlling the virtual CPUs. Yeah. Yes. So, children do in fact inherit the cgroup settings from their parents. If you're in a hierarchy, you cannot subvert the hierarchy. You can't have a child process down in a cgroup that only gets 25% of the CPU suddenly say, "I want 75% of the CPU." You can't subvert the hierarchy; that's locked in. Yeah, go ahead.
But they will only get 25% of the CPU. The whole group will only get 25% of the CPU. So yes, you had a question back there. Yes. Yes, you can. You can have it not only tell you that something doesn't look right, but also reach in and start controlling things. And this can all be done from a script, by the way. So we can set it up so that it just starts up and runs this way, and we don't have to sit there and interact with it in any way. Yeah, your question. Mm-hmm. Yes, you can migrate it to another cgroup. So yes, if you're the system administrator and you know where the hell you're moving it to, then you can in fact do that. But then it's no longer under the control of that cgroup; it's moved someplace else. Well, it inherits when you normally spawn it, but yes, you can in fact subvert it. It's like all of Linux: there's always a way around it, right? Yeah. Yes, that's inside the cgroup-tools; there's a mechanism to do that. And if you're using things like virt-manager, virt-manager will automatically be able to do it as well. There was a question over here. Yeah. The question is: are the CPU shares relative, out of 1,024? Yes, they are. That's a full CPU. That's correct. Yep. And it's real powers of two, not bogus powers of two like 1,000. Yeah. So 1,024 shares is unlimited CPU time, and all we're doing is restricting down from unlimited. And the same thing applies to bandwidth, disk I/O, network I/O: 1,024 is unlimited, and you're basically restricting it from there. Yep, that's correct. Yeah, we've got another one here. Yeah. No. No. If I restricted it to 75%, then it wouldn't matter what else was running; it would still only get 75%. Yes. Then it would have bumped them to different CPU cores. And this, again, is where we get into the cpuset stuff. Because what's going to happen here, you have to understand how the Completely Fair Scheduler works.
The Completely Fair Scheduler tries to be completely fair. It's implemented around a red-black tree that keeps track of the CPU time you should have gotten in a perfect multitasking environment. So it's actually tracking what CPU time you should have received, and the task on the left-most side of the red-black tree, the one with the gravest need, is the one that gets dispatched next. So when we're tinkering around with these kinds of things, we have to understand how the Completely Fair Scheduler works. And of course, we can completely subvert that by moving a task to a real-time priority, above priority zero. You can also still mess around with the soft priorities, the nice values from -20 to +19. So you can still do that as well. Yeah. By default, it's overall in the system. Mine's a hex-core; if I look at it across the system, it's across all 12 logical cores, because I've got cores plus hyper-threads. If you did that in the... well, what happens in the cpuset is that you actually set a flag for each CPU core you want it to run on. Yes. Okay. I appreciate it. Thank you very much.