Hi, everyone. How are we doing this afternoon? Excellent. Thanks for showing up at 4:30 on a Wednesday. I know the parties are going to get started in another hour or so, so I really appreciate you sticking around for the late afternoon talks. This afternoon we're going to be talking about cgroup v2, and taking a little journey through how Adobe came across this new development in Kubernetes.

Before we get started: my name's Tony Goslin. I'm a senior cloud engineer with the Ethos team at Adobe. And this is my co-presenter. I'm Mike Tujeron, a lead cloud engineer, also at Adobe on the Ethos team. You can see all of our contact information below.

Excellent. Before we start, a little bit about who the Ethos team is. Ethos is how Adobe does containerized workloads. At its base it's a kubeadm-based Kubernetes install, and beyond Kubernetes it's also a layer of tools, processes, and practices on top of that: Container as a Service, Platform as a Service, ingress, CI/CD, security controls. You name it, we do it. We have a whole suite of tools available for our developers. We run across a number of clouds and a number of nodes, and we have the prerequisite brag sheet there so you can all ooh and ah at the workload sizes we're running. The philosophy of Ethos is that we try to make the container mistakes once, for everyone, so our developers can focus on deploying really great applications. And are there any members of either of our families here? Just in case there are, I want to emphasize: we have nothing to do with how often Acrobat updates. It's not our team, that's somebody else.

So how did this talk come to be? I'd love to sit here and tell you that Tuj and I got together and did a careful study of Linux internals and picked out a really great topic, or that we had a meditation on Kubernetes releases. Neither of those things happened.
What really happened was: we knew that two is greater than one. We're operators, Kubernetes operators, and we normally focus at the API level. Give me an API, let me deploy what I want. We don't really think about what's underneath Kubernetes all the time; we want to deal with the fun stuff, the YAML. Everyone loves YAML. Today we're not talking about YAML.

So here's what happened. Almost a year ago, we were going through a normal upgrade of our clusters, upgrading the underlying operating system. We use Kinvolk's Flatcar Container Linux. How many people are familiar with Flatcar? Excellent, some Flatcar fans here. So we went ahead and upgraded to version 2969.0.0, and this particular version had a new note in the release notes: now defaulting to cgroup v2. That sounds great. Two is greater than one, I love new software, let's enable it. So we went with it.

And the first thing we found out was that our cluster doesn't work. That's no good. So we did a little research, and come to find out we actually have to make the kubelet and our container runtime aware that we're running this new version of cgroups. Okay, great. Once we figured that out, our cluster starts. Fantastic. So we're going through with our testing, and then we find out that Metrics Server is missing some key metrics, and in some cases it's not starting. That's not good. So we dive in a little deeper and find out that cgroups are actually the source of some of those metrics, and we were finding things like horizontal pod autoscaling weren't working. Metrics Server relies on cAdvisor, and the version of cAdvisor that came with our version of Kubernetes, which at the time was 1.20, wasn't aware of cgroup v2. Not a problem. We won't use that one; we'll run a newer version of cAdvisor as a DaemonSet. We're very smart. So we ran 0.43.
And it works, kind of. Some of the metrics now work; some do not. At this point our telemetry team says: guys, no, we're not doing this. You've broken our metrics, we're flying blind, this is definitely not a way we can operate. And we were dumbfounded. Two is greater than one, we're upgrading software, why is this not working? We looked through logs. We had seen cgroup v2 mentioned in the 1.19 release blog, so it should work with Kubernetes 1.20. So we stopped by SIG Node and asked: what did we do wrong? And they sat us down and let us know that it's a lot more than just setting a config flag and starting your cluster. There are considerations you have to take into account before you enable this new feature. And that's the thrust of what we want to talk about today: what do you need to consider before you can just flip the switch on cgroup v2?

So what are cgroups? Cgroups are control groups. Control groups were added to the Linux kernel starting in January of 2008, and they're accessible via a virtual file system, like so much else in the Linux kernel. The main function of cgroups is to help you control and account for resources. This is where you can, at a process level, set limits on how much of a resource a process can use, as well as find out its current usage. So we get that accounting and resource management out of the system. Without cgroups, there are no containers and there is no Kubernetes. To be very clear, this is a feature that all of us use every day, whether we know it or not, and it's part and parcel of how these systems function. So it's a system we have to give real thought to when we're making changes to it: how it functions, and what is compatible with it. Where in particular might you see this in Kubernetes?
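As a small, hedged sketch of what "accessible via a virtual file system" means in practice: in cgroup v2, limits and accounting are just files you read and write under `/sys/fs/cgroup`. The file contents below are illustrative samples, hardcoded so the parsing logic is self-contained.

```python
# Minimal sketch of cgroup v2 limits and accounting as plain files. On a
# real host these would be reads from /sys/fs/cgroup/<group>/memory.max and
# memory.current; here we parse hardcoded sample contents.

def parse_memory_limit(memory_max):
    """memory.max holds either a byte count or the literal 'max' (no limit)."""
    value = memory_max.strip()
    return None if value == "max" else int(value)

def memory_usage_ratio(memory_current, memory_max):
    """Fraction of the memory limit currently in use, or None if unlimited."""
    limit = parse_memory_limit(memory_max)
    if limit is None:
        return None
    return int(memory_current.strip()) / limit

# Sample file contents as they might appear in a container's cgroup:
print(parse_memory_limit("max"))                      # None: no limit set
print(memory_usage_ratio("536870912", "1073741824"))  # 0.5: half the limit
```

Setting a limit is the mirror image: writing a byte count into `memory.max` for the group.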
The two most common examples: when you set resources and limits on a container, that actually flows down to the cgroup level; and the metrics you get for resource usage for a particular pod or container also come from cgroups. The system is interrogating down through the levels, and that's where it pulls those metrics. Other things use cgroups either directly or indirectly by relying on those metrics; again, the Horizontal Pod Autoscaler is a good example.

So cgroup v2: what changed? Cgroup v2 was introduced to the kernel as a non-experimental feature in 4.5, back in March of 2016, so it's been stable for about seven years now. The single biggest change was the introduction of a single, and what I call strict, hierarchy. When cgroups were first written, they were a feature built for maximum flexibility. The kernel developers at the time wanted to give people writing software as many ways as possible to manage resources around their processes, and in the process they created a system that was really hard to manage.

For example, and hopefully this is big enough for you to see, I tried to blow this diagram up as big as possible: at the top here we have a cgroup v1 hierarchy. At the top of the hierarchy are controllers, one controller per resource. In this particular example there's a controller for block I/O, one for memory, and one for process IDs. Underneath that you can see the software, with groups under each controller: group A and group B. In this particular example, one process is actually spread out across multiple groups. There's a group controlling block I/O for A, but also a separate group for A controlling memory limits. They're two different groups.
Now, if you want to control things purely from a resource perspective, resources first, that's a fine way to organize. What it turned out in practice is that most people manage resources at the process level. They want to restrict in a coordinated fashion per process: group CPU and memory usage together, group block I/O and network together, everything controlled and tied to one PID. With this cgroup v1 setup, what you could actually wind up doing is attaching particular threads of a process to different groups. So managing these resources became incredibly difficult. People wrote some really convoluted code to do coordinated management from a process perspective on top of a tree that was oriented around resources.

So when cgroup v2 was written, the model was flipped on its head. Now processes are at the top of the tree. It's a single hierarchy: the process is at the top, a subgroup says which resources are being managed, and another subgroup is where the limits are actually set. So now you're able to manage things at the process level in a very tightly coordinated fashion. There are also some rules, some flexibility that was taken away, that make it a lot easier to manage resources. For example, all of a process's threads can only belong to one group. Amazing. That was probably the single biggest feature of cgroup v2. It makes it safer to coordinate resource management all in one group, makes things much more thread-safe, and takes a lot of the convoluted programming around resource management and builds it directly into the cgroup system, instead of having to abstract it out and build it in a layer above. Some other cool features came out of cgroup v2 as well.
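One way to picture the flip is with a toy model (purely illustrative, not any real kernel interface): in v1 a process can appear in a different group per controller, while in v2 each process belongs to exactly one node of a single unified tree.

```python
# Illustrative model (not a kernel API) of the v1 -> v2 hierarchy flip.
# v1: one tree per controller, so a process has one membership per resource.
# v2: one unified tree, so a process has exactly one cgroup.

v1_membership = {
    # pid -> {controller: group}; the same pid sits in different groups
    1234: {"blkio": "/blkio/groupA", "memory": "/memory/groupB"},
}

v2_membership = {
    # pid -> single cgroup path; controllers are enabled along that path
    1234: "/app/groupA",
}

def distinct_groups(pid):
    """Distinct group memberships for one pid under each model."""
    v1 = set(v1_membership[pid].values())
    v2 = {v2_membership[pid]}
    return len(v1), len(v2)

# v1 makes you juggle two groups to manage one process; v2 gives you one
# place to set every limit for it.
print(distinct_groups(1234))  # (2, 1)
```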
eBPF-based device control: in cgroup v1, devices were often controlled via static files; cgroup v2 was built with ties into eBPF to control devices. There's the exposure of pressure stall information: beyond point-in-time usage metrics for resources, we now have information about usage over time. Is a process having spiky usage? Did it suddenly consume 50% more memory and get close to its limit, or did that happen over a longer period of time? That kind of information can drive decisions about how you deal with resources and prioritization for processes. Tuj will get into how that plays into Kubernetes as well. And finally, cgroup v2 enables something really awesome: resource limitation for rootless containers. In cgroup v1, cgroups had to be managed with root permissions. So with cgroup v1 in Kubernetes, you could run a rootless container, but you couldn't limit its resources; you basically had to allow whatever resources it wanted. With cgroup v2 enabled, you can actually set limits, and that opens up a whole world of security benefits. Now we can start running containers as non-root and gain all the security benefits that come with that.

Just a quick timeline of how this feature was added to Kubernetes. In late 2019, we first started seeing all the major container runtimes add support for cgroup v2. In 1.19, initial support just for nodes was added to Kubernetes, and by that I basically mean nodes will run, and that was it. No cAdvisor support yet, so we couldn't actually get the metrics out, but you could run nodes using cgroup v2.
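The pressure stall information mentioned above has a small, fixed text format. A cgroup's `memory.pressure` (and likewise `cpu.pressure`, `io.pressure`) reports "some" and "full" lines with rolling averages over 10, 60, and 300 second windows plus a cumulative stall time in microseconds. A sketch of reading it, with sample contents hardcoded:

```python
# Parse cgroup v2 pressure stall information (PSI). Sample contents below
# stand in for /sys/fs/cgroup/<group>/memory.pressure.

SAMPLE_MEMORY_PRESSURE = """\
some avg10=1.23 avg60=0.45 avg300=0.10 total=123456
full avg10=0.50 avg60=0.12 avg300=0.02 total=45678
"""

def parse_psi(text):
    """Return {'some': {'avg10': ..., 'total': ...}, 'full': {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        stats = {}
        for field in fields:
            key, value = field.split("=")
            stats[key] = float(value) if key.startswith("avg") else int(value)
        result[kind] = stats
    return result

psi = parse_psi(SAMPLE_MEMORY_PRESSURE)
# avg10 sitting well above avg300 suggests pressure is spiking right now
# rather than climbing slowly, which is exactly the distinction Tony makes.
print(psi["some"]["avg10"] > psi["some"]["avg300"])  # True
```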
After that, we started seeing some alpha features filter into subsequent releases, one of which is memory quality of service, which Tuj will talk about in just a minute. Then probably the key moment, when cgroup v2 actually became practical for Kubernetes, was November of 2021, with the release of cAdvisor 0.43. This is the point where we got full cgroup v2 support in cAdvisor, which means we can actually get metrics back out. So not only can we set limits, we can actually see what a container is using, and have full enforcement over that. I have a small asterisk here: there were a couple of small metrics bugs in 0.43, so really we got that support with 0.44, which came in February of the following year, but 0.43 was the official cgroup v2 release. And finally, we got full stable support for cgroup v2 in Kubernetes 1.25, which came out just this last August. So that means we have support across the entire feature set. I'm going to let Tuj take it over from here, and he'll talk a little bit about what that means.

Thanks, Tony. So what does this all mean for me, or more appropriately, I guess, for all of you? These hierarchy changes matter because pressure stall information now comes in on a per-container, or more properly per-cgroup, basis. Instead of being tied to just the process, much more at that top level, it can now be brought down into the cgroup itself. So you can get that overall picture, and that ties into the memory QoS perspective. Prior to cgroup v2, you can set the memory limit higher than the memory request, right? But memory isn't compressible. You can't use memory that doesn't exist. Kubernetes before, when scheduling, could let a pod take all of that memory at once, and as another pod came in and wanted to go a little bit above its request, one of those two would OOM, right?
Because the memory just doesn't exist. When Kubernetes assigned memory before, there was nothing in between: a pod would take it all at once. Now, using the cgroup v2 machinery, it can see what's there and hand out memory more gradually, a little more to one pod, a little more to the other. There's a formula it uses so that it doesn't allocate all the memory at once, giving a basic quality of service to both pods as they go beyond their requests toward their limits, and it reports those metrics back so you as a user can also see where your application is headed inside that pod and those containers. It's a huge win for those of us that oversubscribe nodes. I don't know how many of you out there have nodes at three, four, five hundred percent oversubscription rates, because users just don't use what they request, but for us that's really common. So this is going to be a huge win: we don't end up OOMing in those sorts of situations. Facebook is also doing some really cool stuff with userspace OOM killers.

It also brings in something that's not in Kubernetes yet, but that cgroup v2 is going to make possible: container-aware OOM killing. A common pattern is to have a logging sidecar running with just a little bit of memory, because usually your logging isn't that bad. Say your application suddenly starts logging heavily and that sidecar spikes in memory usage. Right now, today, with cgroup v1, an OOM kills the entire pod. One of the potential features with cgroup v2 is to OOM-kill just that one container, restart that one container, and the rest of the pod stays up, so your application doesn't go down.
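The "formula" mentioned above is along the lines of the Kubernetes memory QoS alpha feature (KEP-2570): the request maps to `memory.min` (guaranteed), and a throttling threshold partway between request and limit maps to `memory.high`, so the kernel slows a container's allocations before it slams into `memory.max` and OOMs. This is a hedged sketch; the exact behavior and the default throttling factor may differ between releases.

```python
# Hedged sketch of a memory QoS throttling threshold (per KEP-2570-style
# logic; the 0.8 default factor is an assumption, not a guarantee).

def memory_high(request_bytes, limit_bytes, throttling_factor=0.8):
    """Threshold between request and limit where the kernel starts
    throttling the container's memory allocations (maps to memory.high)."""
    return int(request_bytes + throttling_factor * (limit_bytes - request_bytes))

GiB = 1024 ** 3
req, lim = 1 * GiB, 2 * GiB
high = memory_high(req, lim)
# Throttling kicks in after the guaranteed request but before the hard limit,
# which is what smooths out the all-at-once OOM behavior described above.
print(req < high < lim)  # True
```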
Now granted, that feature isn't there yet, but that functionality is possible with cgroup v2, which is a really cool, really powerful feature that can come down the pipe. I'm really excited about that. Tony mentioned rootless containers. As many of you know, there's been a big effort inside Kubernetes to get more secure overall. Kubernetes has historically been, shall we say, unsecure by default. This moves us more toward secure by default: the more we can run things as non-root, the better off we'll be.

All new resource management features, things like memory QoS and the container-aware OOM killer, will be built only for nodes and kubelets running cgroup v2. New features will not be coming to cgroup v1. So moving forward, that's where things are going. Cgroup v1 will still be supported; that's really important to know. However, something to keep in mind: according to the mailing lists, systemd will no longer support cgroup v1 at the end of 2023. So what does that mean for Kubernetes? I don't know, there hasn't been an announcement yet. But if I were a betting person, I'd say we're going to see some sort of deprecation announcement for cgroup v1 in the next few months, because systemd is dropping it and the systemd driver is a requirement for cgroup v2. We're going to see something happen at some point. So it's something to be prepared for and to think about for the future.

So with great power comes many updates, but thankfully this time around there aren't that many to worry about. Cluster operators, what do you need to do to prepare? Well, the most obvious: does your operating system support it? It sounds like a dumb question, but when you're doing your testing it may not be as obvious as you think. You want kernel 5.8 or higher, and you can run a quick little command to see whether cgroup v2 is enabled.
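The usual quick check is `stat -fc %T /sys/fs/cgroup/`, which prints `cgroup2fs` on a unified-hierarchy host. The same answer can be read from `/proc/self/mountinfo`; here's a self-contained sketch that parses sample mountinfo contents (hardcoded, so it runs anywhere).

```python
# Detect whether /sys/fs/cgroup is the unified (v2) hierarchy by parsing
# /proc/self/mountinfo. Sample contents are hardcoded for illustration.

SAMPLE_MOUNTINFO = """\
30 23 0:26 / /sys/fs/cgroup rw,nosuid,nodev,noexec shared:4 - cgroup2 cgroup2 rw
36 25 0:31 / /proc rw,nosuid,nodev,noexec shared:12 - proc proc rw
"""

def cgroup_v2_unified(mountinfo, mount_point="/sys/fs/cgroup"):
    """True when the cgroup mount point uses the cgroup2 filesystem."""
    for line in mountinfo.strip().splitlines():
        # mountinfo layout: id parent major:minor root mount-point opts
        # [optional fields] - fstype source super-options
        pre, _, post = line.partition(" - ")
        fields = pre.split()
        if len(fields) >= 5 and fields[4] == mount_point:
            return post.split()[0] == "cgroup2"
    return False

print(cgroup_v2_unified(SAMPLE_MOUNTINFO))  # True
```

On a real node you would pass in `open("/proc/self/mountinfo").read()` instead of the sample.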
You can see what that returns on the slide. Some operating systems have it enabled by default; some don't. Some require you to reboot to enable it; some let you set it at boot time. If you're running something like an AWS autoscaler that replaces nodes on reboot, keep that in mind: if you have to set it and reboot your node, you may end up terminating the node on that reboot, and then you've done yourself no favors. Kubernetes 1.25 is where it becomes stable; keep that in mind, we highly recommend running that. You can use it on an earlier version, but your mileage may vary. You may have to run cAdvisor as a DaemonSet to get everything, you may not, so keep that in mind. You need to be running the systemd cgroup driver for both your container runtime and for the kubelet. Previously you didn't have to use systemd for both; you just had to be consistent about which driver you used. The good news is that the kubelet will detect which cgroup version you're using, v1 or v2. That means you can launch separate node pools, one using cgroup v1 and one using cgroup v2, so you can test your applications and your performance with the different approaches. That way you can make sure your automation works, your application works, all that kind of stuff. As Tony mentioned, if you're running cAdvisor as a DaemonSet, really make sure you're running the right version. I believe they're up to 0.55, but 0.44 is the minimum version you need to run.

So as a developer, what do you need to do to prepare? Once again, kind of a "duh" moment: make sure your language of choice supports cgroup v2, especially if you're running multi-process or multi-threaded applications. For Java, and I can't quite read the screen here, that's JDK 11.0.16 or later, or JDK 15 or later.
If you're in Go and you use automaxprocs, make sure you're using version 1.5 or later. Python? 3.12, maybe; there's been a Git issue open forever about having any native cgroup support. I don't bring that up to bash Python, because I use it a lot and I'm a huge fan. The point is that some languages, tools, and libraries roll their own cgroup support, and the way they do it is by looking at the file system directly to figure things out. So if you're using an older library or tool that looks at the cgroup file system, it may not support cgroup v2. You need to know what your tools are using, know what your legacy systems are doing: do they even support cgroup v2, and do they need to be updated to support both? I know in Python there are lots of tools and libraries out there with their own way of reading the file system to figure out how much CPU and how much memory a container has, just because there was no native support at the time they were written. So that means testing, testing, and definitely more testing, because otherwise things will break. But hey, don't take my word for it, don't take Tony's word for it: test your stuff.

So let's do a quick demo of doing it right versus doing it wrong. This was put together by Jansen Alphen, a cloud engineer also at Adobe. If the sound crew can turn on the computer audio, we'll do a quick demo by him. Thank you.

Hi, my name is Jansen Alphen. I'm a cloud engineer on the Ethos team at Adobe, and in this demo we'll see what could happen to a Go application when moving to cgroup v2 if it's using the uber-go automaxprocs package to automatically set the GOMAXPROCS variable to match the Linux container CPU quota. Here we have a Kubernetes cluster built with version 1.25, and deployed on this cluster are two pods running the same Go application.
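The failure mode in the demo comes down to where the quota lives. cgroup v1 split the CPU quota across two files (`cpu.cfs_quota_us` and `cpu.cfs_period_us`); cgroup v2 merged them into a single `cpu.max` file. A library that only knows the v1 paths finds nothing on a v2 node and falls back to the machine's CPU count. A sketch of the v2-aware calculation (illustrative, not the actual automaxprocs code):

```python
# Derive a CPU count from a cgroup v2 cpu.max file, whose contents are
# "<quota> <period>" in microseconds, or "max <period>" when no quota is set.

import math

def max_procs_from_cpu_max(cpu_max, node_cpus):
    """CPU count implied by a container's cgroup v2 CPU quota."""
    quota, period = cpu_max.split()
    if quota == "max":           # no quota: use everything the node has
        return node_cpus
    # e.g. "100000 100000" is a quota of one full CPU per scheduling period
    return max(1, math.floor(int(quota) / int(period)))

print(max_procs_from_cpu_max("100000 100000", node_cpus=2))  # 1
print(max_procs_from_cpu_max("max 100000", node_cpus=2))     # 2
```

The second case, falling back to the node's two CPUs, is exactly what the "old" pod in the demo does even though a quota is set, because it never finds the v2 file.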
This Go application spins up a number of threads matching the GOMAXPROCS variable and then does some busy work. Both are running the same code, but the pod suffixed "old" is using an older version of the automaxprocs package, and the one suffixed "new" is using a newer version which includes the patch needed to run correctly on cgroup v2. Both are running on a node with cgroup v2 enabled. These pods have their CPU limits set to one, so we'd expect the package to set GOMAXPROCS to one. If we look at the logs of the new pod, sure enough, that's what we see. However, when we look at the logs of the old version, we see that it was unable to determine the CPU quota, and it defaults to setting the variable to the overall CPU count offered by the node, which in this case is two. What this means is that the application will now spin up two threads of busy work and hit the CPU limit sooner in each CPU period, resulting in the application getting throttled for the remainder of the period.

So on the left, we're showing the throttle percentage: the percentage of CPU periods in which the container ran but was stopped from running the whole period. The lines here show when the pod is being throttled, which for the old pod is 98 to 99% of the time, and the gaps in the lines show when the quota was renewed and no throttling was taking place. In comparison, we see the new pod is not being throttled nearly as much as the old. And as you can imagine, the bigger the delta between the GOMAXPROCS value and the pod's CPU limit, the quicker the application will hit its CPU limit within the period and start getting throttled. Thank you.

Thank you, Jansen. And thank you, everyone. I hope this talk has been informative, and I hope you're as excited about cgroup v2 as we are.
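The throttle percentage shown in the demo can be computed straight from the counters cgroup v2 exposes in a container's `cpu.stat` file. Sample contents are hardcoded below; on a node they would be read from the pod's cgroup directory.

```python
# Compute the demo's "throttle percentage" from cgroup v2 cpu.stat counters:
# nr_periods is the number of CFS scheduling periods elapsed, nr_throttled
# the number of those periods in which the container hit its quota.

SAMPLE_CPU_STAT = """\
usage_usec 4200000
user_usec 3900000
system_usec 300000
nr_periods 1000
nr_throttled 985
throttled_usec 52000000
"""

def throttle_percentage(cpu_stat):
    """Share of CFS periods in which the container was throttled."""
    stats = dict(line.split() for line in cpu_stat.strip().splitlines())
    periods = int(stats["nr_periods"])
    if periods == 0:
        return 0.0
    return 100.0 * int(stats["nr_throttled"]) / periods

print(throttle_percentage(SAMPLE_CPU_STAT))  # 98.5, like the "old" pod
```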
Again, the big takeaways we had for this: this is a major upgrade to something that's so core and important to Kubernetes, and it really sets us up for a lot of great features for operators and developers alike. It's also something to be very much aware of in the next 12 to 18 months in terms of upgrades to both your infrastructure and your applications. Just a couple of talks to be aware of this week: we actually have another cgroup v2 talk. This is such a popular subject that another team is covering it too; our friends over at Red Hat and Google will be talking about it on Friday at 2 p.m. And at 4 p.m., if you're interested in all things Node, we'd definitely love for you to check out SIG Node. Again, they were instrumental in helping us get on the right path with this, and they're a really friendly bunch of folks who would love to tell you more about their SIG, what they do, and how you can help contribute. Also, if you're interested in cgroup v2 and how it affects the stable release, there's a release blog out; every release has a number of blogs on major features, and this is no exception, so definitely check that out. And again, that's what we have. Thank you very much, everyone. I appreciate it.