So we have Philippe, who's going to give us a talk about revamping the libcontainer systemd driver. He works for a small blue website which some of you may be using. Philippe, I think you're brand new there. So why don't you take it away?

Thanks. Hi, I'm Philippe. I'm part of the Facebook delegation. I'll tell you a little bit about me. I work for Facebook and I care a lot about cgroup v2 and about containers. I joined Facebook very recently: I started working there in June. Previously I was at Google, working on Kubernetes, specifically on the kubelet, and focusing on cgroup v2. And I've been a systemd contributor since 2014.

So what is this talk about? It's about containers in the Docker and Kubernetes world. There are different approaches to containers: there's systemd-nspawn, there's LXC, and there's Docker and Kubernetes, and this talk is about that world. It's about cgroup v2 as well, and it's about systemd.

A little bit about the state of the cgroup world. systemd can be considered the main user-space API to cgroups. You can definitely talk directly to the kernel, to the cgroup tree, but systemd tries to expose this in a nice-to-use interface through D-Bus or through units.

There are three-ish modes for the cgroup hierarchy. There's the legacy mode, which is cgroup v1 only. Then there's the hybrid mode, which already mounts the cgroup v2 tree but doesn't really use it for any of the controllers. And then there's the unified mode, which is cgroup v2 only. With hybrid you can actually move some of the controllers into the cgroup v2 tree, that is, enable them there, but that's not something anyone is doing anyway. Hybrid is where most people are right now, ever since the systemd versions of two or three years ago.
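As a rough illustration of the three modes, they can usually be told apart from the mount table. The sketch below is not taken from the talk: the function name `detect_cgroup_mode` is hypothetical, and it assumes the conventional layout where the unified hierarchy is a cgroup2 mount at /sys/fs/cgroup and the hybrid setup adds a cgroup2 mount (typically /sys/fs/cgroup/unified) next to the cgroup v1 mounts.

```python
from typing import Iterable

CGROUP_ROOT = "/sys/fs/cgroup"  # conventional mount point, assumed here


def detect_cgroup_mode(mountinfo_lines: Iterable[str]) -> str:
    """Classify a mount table as 'unified', 'hybrid', 'legacy' or 'none'.

    Input lines follow the /proc/self/mountinfo layout:
    <id> <parent> <maj:min> <root> <mount point> <opts> [tags] - <fstype> <src> <superopts>
    """
    cgroup2_mounts = set()
    has_cgroup1 = False
    for line in mountinfo_lines:
        left, sep, right = line.partition(" - ")
        if not sep:
            continue  # not a well-formed mountinfo line
        left_fields = left.split()
        right_fields = right.split()
        if len(left_fields) < 5 or not right_fields:
            continue
        mount_point, fstype = left_fields[4], right_fields[0]
        if fstype == "cgroup2":
            cgroup2_mounts.add(mount_point)
        elif fstype == "cgroup":
            has_cgroup1 = True

    if CGROUP_ROOT in cgroup2_mounts:
        return "unified"  # cgroup v2 only
    if cgroup2_mounts and has_cgroup1:
        return "hybrid"   # v2 tree mounted, controllers still on v1
    if has_cgroup1:
        return "legacy"   # cgroup v1 only
    return "none"
```

On a Linux machine the function could be fed the lines of /proc/self/mountinfo. Taking the lines as an argument, rather than reading the file inside the function, keeps the sketch easy to try without root and without a particular host setup.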
It's been fairly stable, but that means people are still using cgroup v1. Unified is where we would like to be, with the controllers using cgroup v2.

So, the motivations for why we want cgroup v2. The hierarchy is better and saner, and delegation works better. There are new controls: memory.low, which came up a little in an earlier talk, is something fairly new to cgroup v2, and io.weight is coming to cgroup v2 as well. A lot of improvements are coming to cgroup v2 only. All the eBPF goodness is coming to cgroup v2 as well. Alban talked about traceloop earlier this morning; that's cgroup v2. My colleague Julia is going to talk more about BPF later, and all of this is connected to cgroup v2.

Fedora 31 is going to use the unified hierarchy by default, so that's a strong motivation, and we would like to drive up adoption by other distros as well. cgroup v2 also improves support for nested containers. A lot of the work we've been talking about, with rootless and daemonless containers, is made easier by cgroup v2 as well.

So those are the motivations for better integration with libcontainer.
Well, for better support for cgroup v2 in libcontainer.

I'll give a small explanation of where the components are. Many of you might be familiar with this, but there are quite a few container managers these days, for instance Podman, Docker, CRI-O and containerd. Podman and Docker are mainly user-facing container managers, where you run your own containers, while CRI-O and containerd are backends, servers or daemons, that serve Kubernetes. The kubelet talks to either CRI-O or containerd through the CRI protocol. All of those use runc to actually run the containers, and runc uses libcontainer; the two components are the same project, in the same source tree.

Which of those components support the unified hierarchy, that is, support cgroup v2 natively? Only one of them does at present, which is Podman, and it does so mainly because it uses crun, which is a reimplementation of the runc CLI that has fixes to work with cgroup v2. So does that mean we could simply have the others use crun and that would solve our problem? In fact it doesn't, because everything else links to libcontainer and uses libcontainer, so just changing the runtime is not enough.

So, digging a little into libcontainer. libcontainer has several components that create abstractions over Linux's features for containers.
Namespaces and capabilities are among the things that libcontainer abstracts, and cgroups are one of them. So we're looking at the cgroup part of libcontainer. It supports two separate drivers. One of them is cgroupfs, which essentially writes directly to the cgroup tree, the /sys/fs cgroup file system. The other one is systemd, which uses D-Bus calls to talk to systemd.

If the systemd driver always went through systemd, this wouldn't be a problem, because systemd's D-Bus interface abstracts whether you're running unified, hybrid or legacy and exposes a consistent API. But the systemd cgroup driver of libcontainer actually uses systemd for some operations and then goes around it to change limits and settings in the cgroup tree as well.

The first attempt at revamping libcontainer's systemd driver was rewriting large parts of the systemd cgroup driver to go through systemd all the time, through D-Bus, and not write directly to the cgroup tree anymore. I actually opened a PR for that. That attempt failed, but it was useful to learn something from it.

One part is that touching this legacy code that's been around for a while is hard. There are problems there. One is compatibility with older versions of systemd. The reason the libcontainer driver writes directly to the cgroup tree is that when it was written, systemd did not support many of the limits it was setting. So this first attempt was fine if I ran it against systemd 241 or 242, or 243 for sure, but not if I went back to the systemd in something widely used like Ubuntu 18.04, or even RHEL 7 with systemd v219. So that's definitely an issue that needs a fix. The other part is features that were missing from either systemd or the
kernel. One example is freezing a cgroup, which was only recently made available in the kernel's cgroup v2 implementation. There were some attempts to work around it, but in the end it's a limitation.

So the new proposal is, instead of rewriting the code of the systemd driver, to split it into two separate implementations. One is for the legacy hierarchy and covers legacy and hybrid; basically it's cgroup v1, and it's the current code. The second one handles the unified case, and that one will be able to go through systemd essentially all the time. So yeah, a separate implementation for unified. The advantage of this approach is that we can enable the new implementation only when we detect that we're running on the unified tree, so we don't need to worry about compatibility with older OSes like CentOS 7, RHEL 7, Ubuntu 18 or even 16.

One of the issues I mentioned is forward compatibility of the D-Bus API. This is something that should be solved in systemd to allow for this kind of implementation. It's a general issue for clients of systemd's cgroup API. Right now, if you write a systemd unit with specific options and systemd doesn't recognize some of them, it simply ignores those. That's by design, and that's fine, because if you're using limits that only a newer version of systemd recognizes and you run the unit on an older version, the unit still works, and systemd simply ignores the directives it doesn't know.
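The proposed split can be pictured with a small sketch. This is not libcontainer's code (which is written in Go): the class names `LegacyDriver` and `UnifiedDriver`, the `plan` method and `select_driver` are all hypothetical, and the drivers only return a description of what they would do instead of touching D-Bus or the cgroup tree. `StartTransientUnit` and `SetUnitProperties` are systemd manager D-Bus methods, `MemoryMax` is a systemd resource-control property, and `memory.limit_in_bytes` is the cgroup v1 memory limit file.

```python
from typing import List, Tuple

# (channel, operation, argument): a description of a step, not an action.
Step = Tuple[str, str, str]


class LegacyDriver:
    """Legacy and hybrid case: systemd creates the unit, then the limit
    is written around systemd, straight into the cgroup v1 tree."""

    name = "legacy"

    def plan(self, unit: str, memory_bytes: int) -> List[Step]:
        return [
            ("dbus", "StartTransientUnit", unit),
            ("cgroupfs", "memory.limit_in_bytes", str(memory_bytes)),
        ]


class UnifiedDriver:
    """Unified case: every step goes through systemd over D-Bus."""

    name = "unified"

    def plan(self, unit: str, memory_bytes: int) -> List[Step]:
        return [
            ("dbus", "StartTransientUnit", unit),
            ("dbus", "SetUnitProperties", f"MemoryMax={memory_bytes}"),
        ]


def select_driver(mode: str):
    """Enable the new implementation only on the unified hierarchy;
    legacy and hybrid both keep the existing behaviour."""
    return UnifiedDriver() if mode == "unified" else LegacyDriver()
```

For example, `select_driver("hybrid").plan("demo.scope", 1 << 30)` yields one D-Bus step and one direct cgroupfs write, while the same call with `"unified"` yields only D-Bus steps. Keeping the legacy path untouched is what removes the need to support old systemd versions in the new code.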
But that's not the case with the D-Bus interface. When I create a unit through the D-Bus interface and ask for, let's say, IO weight, which is something new that's coming and is not even in the latest systemd, and systemd doesn't recognize it, that D-Bus call simply fails. This needs to be fixed in systemd, probably through a D-Bus protocol that can take optional directives and report back on the ones that weren't available in that version of systemd.

Missing features in systemd and the kernel are something that has been worked on. For instance, freezer support was mentioned previously, and that's now available: cgroup.freeze is available on cgroup v2 in kernel 5.2. That, combined with the need for a D-Bus API with forward and backward compatibility of directives, means we probably need a fairly recent stack of kernel plus systemd to make this work. The good news is that we seem to be right on time for distros to start switching to the unified hierarchy. So we can assume most of those features will be in place when they switch, and we can solve this problem in a way where all the components are deployed together and everything works.

So the vision for the future is one where libcontainer detects whether the hierarchy is unified or legacy and, when the unified hierarchy is mounted, switches to an implementation that goes through systemd. It's going to be fully functional starting with specific versions of systemd and the kernel. Kernel 5.2 looks like it has most of the features needed, and perhaps the next version of systemd could have everything we need as well. Hopefully that helps drive up adoption of the unified hierarchy by distributions other than Fedora.

And I wanted to do a
call-out to Giuseppe from Red Hat, who has been working on this problem. His focus is slightly different from this one: he's been working on re-implementing the access to the cgroup tree in libcontainer. He has one PR merged, so he has already split the legacy and unified drivers, and that partly fixes the problem. It doesn't fix the problem for runc, but it fixes it for the other users of the library, like the kubelet and CRI-O. It still writes to the cgroup tree directly, and it doesn't implement some of the controllers, for instance the device controller. In cgroup v2 the systemd implementation of that is based on eBPF, so that would require writing eBPF in libcontainer. That's one of the reasons to go through systemd instead: you don't need to implement that yourself.

One alternative or additional approach to consider is using systemd's recommendations for delegation, so that instead of simply creating new cgroups under the root of the tree, it would use the recommended approaches. Lennart wrote a document a while ago with recommendations on how to do this. You can use systemd natively, so you get slice units and scope units. But you can also have your service delegate and then create your tree under your service, or create a new scope with delegation and then create your own tree. That would mean the container manager manages that whole unit.
From systemd's point of view, it's a single cgroup, essentially. There are drawbacks to that approach: seeing this as a single thing means that whenever systemd needs to take action on that unit, it's going to see all the containers running on the machine as a single unit.

One item for future work is evaluating the cgroup v2 controllers against the OCI. OCI is the standard image format for Docker containers and Kubernetes containers. The OCI specifies the image itself, and also which constraints you use, which limits you set and so on. It turns out the OCI attributes for resource control are very tied to the cgroup v1 model, which has changed a lot in cgroup v2, and cgroup v2 is still evolving; we were talking earlier about the new controls that are coming. So this will probably take some work: looking at newer controls, perhaps higher-level controls instead of very low-level ones, and I believe an ability to do extensions as well.

And yeah, that was it. Thank you.

Hey, so I've been reading those pull requests for a while, and I've got the impression that runc is slowly moving to become a wrapper on top of systemd. Is my impression correct?
I think maybe. I think there's interest in supporting the case of not running on systemd anyway. There are many such use cases, for instance nested containers, where you don't necessarily have a systemd PID 1 in the first container, so you want to support the use case of being able to write directly to a delegated cgroup tree. Also, as I mentioned, libcontainer doesn't do only cgroups. This is actually a very small part of what libcontainer does, and it only needs to interact with systemd for those particular cases of managing the cgroup tree. Namespaces are something you set up on your own process, and capabilities are also something you can set yourself. So in that sense, yeah.

Hi. Regarding the fact that OCI basically embeds how cgroup v1 was designed into its API, I wonder whether this also affects systemd. I know D-Bus is obviously extensible, but how is that problem being dealt with for the D-Bus API?

Sure. Yeah, so there were some cases where some limits were met. Looking at the history of systemd, what happened is that cgroup v2 and systemd's support for resource control were developed pretty much in parallel, to the point that systemd at some point even stopped trying to expose some of the cgroup v1 APIs while waiting for the cgroup v2 API to happen.
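To make the point about v1-shaped attributes concrete, here is a minimal sketch of one such translation: turning a cgroup v1 CPU shares value (the shape the OCI resource settings use) into a cgroup v2 cpu.weight. It is an illustration under stated assumptions, not the method of any particular runtime: the function name is hypothetical, and the formula is just a straight linear mapping from the v1 range of 2 to 262144 onto the v2 range of 1 to 10000.

```python
def cpu_shares_to_weight(shares: int) -> int:
    """Map a cgroup v1 cpu.shares value onto the cgroup v2 cpu.weight range.

    v1 cpu.shares: 2..262144 (default 1024)
    v2 cpu.weight: 1..10000  (default 100)
    Zero is treated as "not set" and passed through unchanged.
    """
    if shares == 0:
        return 0
    # Clamp to the valid v1 range before scaling.
    shares = max(2, min(shares, 262144))
    # Linear interpolation between the two ranges, using integer maths.
    return 1 + ((shares - 2) * 9999) // 262142
```

With this mapping the end points line up, but the v1 default of 1024 lands on 39 rather than on the v2 default of 100, which is a small example of semantics that are close without matching exactly.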
There have been some cases where directives were mapped, like MemoryLimit being mapped to MemoryMax, since the name in cgroup v2 is memory.max. Some of the semantics don't match exactly, but they're close enough in some cases. So in that sense I think systemd bypassed these problems, for the most part, by planning at some point to implement mostly cgroup v2.

Yeah, the Kubernetes community was somewhat hostile to changing to cgroups v2 for a while. Have...

Sorry?

Yeah, so when we were trying to move to cgroups v2, the Kubernetes community, and specifically Google... Thank you. ...was hostile to changing to cgroups v2, because they just thought cgroups v1 was good enough. Have you seen that attitude change? And I was sad to see you leave Google to go to Facebook.

I think, essentially, what happened is a question of prioritization, right? It wasn't the top priority. And now people are looking especially at things like eBPF and how much is coming with eBPF. You see people talking about eBPF all the time, and that's the thing that I think is changing a lot of the attitude towards cgroup v2: we want eBPF, so let's take cgroup v2 for that reason. But there are a lot of enhancements in cgroup v2. Tomorrow we'll have Anita and Daniel talking about oomd and Johannes talking about senpai, so there's a lot being developed on cgroup v2. People saw the potential in cgroup v2 and invested there, while cgroup v1 was clearly hitting its limits.

Hi. Speaking about that, weren't those cgroup BPF features also enabled by choosing a hybrid approach? Why was the hybrid approach not pursued more?
Yeah, there are actually limits to it. You can get some BPF features, you can run some BPF traces, with the hybrid approach, but what you can do is fairly limited. If you actually unlock the unified cgroup tree with the controllers, the information you get is much, much richer. So to a limited extent, yes, you can get some of the BPF features on hybrid, but on the unified approach you get much better integration.

Just to mention, regarding the hybrid approach: in retrospect, I think it was a mistake. We added that to systemd as a stopgap that has no future, and we should not have done that. So yeah, forget about the hybrid mode. If you invest your resources in it, you waste them for nothing.

It's where we are today. Yeah.

So I'm sorry for...

It's okay, I can come here.

I was just kind of curious about some of the stuff you mentioned with crun.

Yes.

I know that a lot of people are using libcontainer in places, but it's been kind of curious, and I'm not just boosting it because, as you said, he works at Red Hat, but we're getting contributions to it from interesting places, because it now supports this stuff natively and it runs lighter, and on MIPS and all this other kind of stuff. Was it even much of a consideration to just do that, or...

I'm sorry, do you mean crun instead of libcontainer, or what?

Sorry, crun instead of libcontainer? No, okay.

Oh, you mean... oh yeah, okay. crun instead of libcontainer, or instead of runc and so forth. Yeah.
Yeah, so crun solves some of the immediate problem of unblocking part of this, because crun is a whole re-implementation and has been done with a lot of cgroup v2 support directly. But the fact is that crun is a standalone container runner, while libcontainer is a library that's actually used by most of the other components, right? So that's why just switching to crun doesn't work for the general case. It would work for CRI-O, by not using libcontainer and using crun, but the kubelet is also using libcontainer to create its slices, right?

Yeah, on the topic of mistakes we should never have made: I think telling people they should use libcontainer, or even suggesting it was a good idea, was a mistake in retrospect, especially since the libcontainer API makes absolutely no sense. It's a bit late to say this now, but it would have been nice to convince people four or five years ago not to do this. But yeah, we're stuck with it, unfortunately. I do think that even if we do get cgroup v2 support in libcontainer, we should still convince people to stop touching it directly, at least until we redesign it or something.

Right. One last question.

So, a question on that specific slide. Most of it is red, and that's kind of blocking, let's say, wide adoption of cgroup v2 everywhere. Do you have more or less a gut estimate, a rough timeline, of when that is going to get greener, at least in the top levels?

Yes. So actually I mentioned Giuseppe's PR, which was created and merged after this slide. That actually unblocks a lot of this, and I'm fairly sure it unblocks the kubelet connection, so kubelet and libcontainer can use cgroup v2 already, and I assume CRI-O as well.

Okay.

Yeah, so that unblocks most of it.

All right, thank you for today. Thank you.