Hello, everybody, and welcome to my talk. My name is Peter Hunt. I'm a software engineer at Red Hat. I primarily work on CRI-O for OpenShift, but I also work on conmon and Podman and other container-related technologies. Today we're going to talk about CRI-O — specifically, CRI-O's journey in dropping the pause container. Before we get into the nitty-gritty of what a pause container is, we first have to cover a little Kubernetes background.

To get on the same page, let's talk about what a pod is. You can think of a pod as a logical host: it's a group of containers, pretty much. They share some namespaces and some storage, and they act kind of like a server. You might have a web server container inside it, maybe a logging sidecar, and maybe some other containers that warm up a cache.

Now that we understand what a pod is, let's talk a little about the pause, or infra, container. So what is pause? If you've ever run `docker ps` on a Kubernetes node that uses Docker, you may have noticed a mysterious pause container — at the bottom of this picture, we have pause:3.2. The pause container sits and pauses. It really doesn't do much more than that, except at the end of its life, and we'll talk more about what it does then. The idea of the pause container is to hold the pod-level namespaces and share them with the other containers inside the pod, and it's contained within the pod cgroup. It's often referred to as the pod infra container; in this talk we're mostly going to call it pause, but you may see some instances of it being called an infra container.

For some history of Kubernetes pods: in the beginning there was just Docker — this was well before CRI, or rkt, or other runtimes were able to go under Kubernetes. Kubernetes needed a way to give each pod an IP address. How would one do that in Docker?
The way they figured it out was: they started a container, jumped into the network namespace of that container, and assigned it an IP. Then they started the other containers joining the network namespace of that first container, so all the containers shared the assigned IP — and that's how the pause container came to be. As such, it was originally called the net container, because it only held the network namespace. Eventually other namespaces were added to it — the first one being IPC — at which point it was renamed the infra container in the code base. Eventually PID namespace sharing was added, and the pause container's job became a little more involved: in addition to holding the namespaces, by being PID 1 in the pod-level PID namespace it is also responsible for reaping processes in the pod. We're going to talk about that a little more later.

So now that we know what pause is, why would we drop it? We've been using it forever. The first reason is that it uses up space for the binary, the container, and the image. Per node, the image is a pretty small 600 kilobytes — not bad, not really anything anyone would sweat. The memory it uses per pod is around one meg, which is also really not bad. Kubernetes nodes typically cap at around 250 pods, which means 250 megs. That's barely anything in the grand scheme of things, but it is a little bit of space taken away from your workloads.

Another reason is that it takes time to create, mount, and start the container. Below we have an example of creating a pod with a pause container and without. These times don't look all that different, and I'm going to go into more of a performance comparison later. Ultimately, there are really two things a pod creation request needs to do: create the namespaces for sharing within the pod, and set up the networking stack for the pod.
By dropping the pause container, we're doing about half of those things. Finally, there's some process management overhead: there's much more code for setup and cleanup, and we need to keep track of the pause container. Notice here that we have a devoted conmon — a container monitor — for our infra container, to make sure it's still alive and hasn't been killed. Ultimately, no process means no process management; we don't have to worry about any of that, and saved code is saved bugs as well.

So how would we go about dropping the pause container? What does it really do? The first thing you need is a way to keep namespaces alive without an associated process, and luckily Linux supports this out of the box: we can bind-mount namespaces. In the first line here we have `unshare --net`. `unshare` means create a new namespace and jump into it, so `unshare --net` means: make me a new network namespace and put me right inside it. The following command, `mount --bind /proc/self/ns/net /var/run/net1`, says: take my current network namespace — the one we just created with the `unshare` command — and mount it to the specified location. Then, in the terminal below, we can `nsenter` into that `/var/run/net1` and be inside the namespace created by the `unshare` command above.

The second step, which is actually something that took me by surprise while implementing this feature, is that you need to apply the sysctls in the pod. Sysctls are namespaced, and we need to apply them at some point; the time to apply them is after you've unshared into the namespace but before anything is running in it — you unshare into the namespace and configure it then. Some examples are `net.ipv4.ip_forward` and `kernel.msgmax` — they're just ways to configure things inside the namespaces.
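To make that sysctl step concrete, here's a minimal Python sketch of the idea: dotted sysctl names map onto file paths under `/proc/sys`, and applying them is just writing the values once you're inside the freshly unshared namespace. This is an illustration only — CRI-O actually does this in pinns, in C — and the `root` parameter exists purely so the demo can run against a scratch directory instead of the real `/proc/sys`.

```python
import os

def sysctl_path(name, root="/proc/sys"):
    """Translate a dotted sysctl name (e.g. net.ipv4.ip_forward)
    into its file path under /proc/sys."""
    return os.path.join(root, *name.split("."))

def apply_sysctls(sysctls, root="/proc/sys"):
    """Write each sysctl value. Conceptually this must run from inside
    the freshly unshared namespace, before any container process starts."""
    for name, value in sysctls.items():
        with open(sysctl_path(name, root), "w") as f:
            f.write(str(value))

# Demo against a scratch tree standing in for /proc/sys:
os.makedirs("/tmp/fake-sys/net/ipv4", exist_ok=True)
open("/tmp/fake-sys/net/ipv4/ip_forward", "w").close()
apply_sysctls({"net.ipv4.ip_forward": 1}, root="/tmp/fake-sys")
```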
Finally, you need to take the namespaces you've just created and pass them to the OCI runtime so it can use them when creating the container. On the left here we have the old way of doing it, where we used the proc entry for the pause container — /proc/864303 — took that namespace path, and passed it to the runtime, which then unshared the container into it, so the container ended up inside the namespace. On the other side, we pass down a created path instead: we have our mounted namespace path from the unshare, we pass it right to the runtime, and the runtime takes care of configuring it for the new container inside the pod.

Now that we know how, let's introduce how we actually did it, and that is with a binary called pinns — pronounced "pin-n-s," not "pins." It currently lives inside the CRI-O tree. It was inspired by the `ip netns` command, as well as a container networking package that is used heavily in CNI today. It's a C binary that's exec'd from CRI-O. The reason we went with this, rather than doing it natively in CRI-O, is that Go is kind of funky with how its runtime interacts with namespaces: the Go runtime can move your goroutine between OS threads, or interrupt your process, at any time. So you need to lock the OS thread, do all of your namespace-related work, and then unlock it to do that safely. Instead, we just decided to do it in C, where there's no runtime mucking around with things and there's more native support for unsharing and mounting.

CRI-O then takes the mounted namespace from pinns and hands it to the container, and boom — we did it: pause process dropped, for the most part. The one case we still have to worry about is the dreaded PID namespace.
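For reference, this hand-off happens in the `linux.namespaces` section of the OCI runtime spec's config.json: a namespace entry with no `path` tells the runtime to create a fresh private namespace, while an entry with a `path` tells it to join an existing one. The paths below are made-up placeholders, but the shape is what the spec defines — roughly what CRI-O generates after pinns has pinned the pod-level namespaces:

```json
{
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "mount" },
      { "type": "network", "path": "/var/run/netns/example-pod" },
      { "type": "ipc", "path": "/var/run/ipcns/example-pod" },
      { "type": "uts", "path": "/var/run/utsns/example-pod" }
    ]
  }
}
```

In the old scheme, those `path` values would instead have pointed at the pause container's proc entries, such as `/proc/<pause-pid>/ns/net`.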
So remember the special responsibility PID 1 has inside a private PID namespace: it's responsible for reaping children from the kernel's process table — basically calling `waitpid` so those PIDs can be reused. If you don't do this, there's an entry in the process table with no associated process, and that's called a zombie process — one no zombie antidote will fix. For this case — a pod-level PID namespace — we actually keep the pause container.

Now, there are a couple of hacky workarounds we could do instead. What if we had the first process of the first container inside the pod be PID 1, and made sure that PID 1 always had the capability to reap the children of the PID namespace? Unfortunately, we would still need to ensure ordering: if another container jumped in beforehand, then that one would be PID 1, and there would be zombies. And ultimately this is what our pause container already does — it already reaps the children inside the pod-level PID namespace, and it ensures ordering. Luckily, this is not much of a loss, because most pods have container-level PID namespaces instead of a pod-level one. So we can still drop the infra container whenever we're not sharing a pod-level PID namespace, and only use a pause container in the cases where we have one.

Given that, let's talk a little about CRI-O's journey through the different options for configuring sandboxes — which is an analogous term for pods — on the way to dropping the infra container. We started off configuring pods with namespace paths under proc — the string we saw before, /proc/&lt;pid&gt;/ns/net or whatever — and that's how we originally did it, with a pause container. Then, a little down the line, Kata Containers needed a way to pre-configure the network namespace before starting the pause container, so that the VM had a running network to start up with.
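The zombie problem described above is easy to see for yourself. This Linux-only Python sketch forks a child that exits immediately; until the parent calls `waitpid`, the child lingers in the process table in state `Z` — exactly the entry that pause, as PID 1 of a pod-level PID namespace, is responsible for clearing out:

```python
import os
import time

def spawn_zombie():
    """Fork a child that exits immediately; until someone waits on it,
    its process-table entry remains as a zombie."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child exits; the parent has not reaped it yet
    return pid

def state(pid):
    """Read the process state letter from /proc/<pid>/stat ('Z' = zombie).
    The comm field (in parentheses) may contain spaces, so split from the right."""
    with open(f"/proc/{pid}/stat") as f:
        return f.read().rsplit(")", 1)[1].split()[0]

pid = spawn_zombie()
time.sleep(0.2)            # give the child time to exit
assert state(pid) == "Z"   # still in the process table: a zombie
os.waitpid(pid, 0)         # reap it, as PID 1 must do in the pod
```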
And so an option was introduced, manage_network_ns_lifecycle, which used the container networking package and did it all natively in Go to pin the namespace. That made it into CRI-O 1.0. For security, we then moved toward managing all of the namespace lifecycles. About a year ago we had a CVE where an interaction between the kernel OOM killer, conmon (our container monitor), and PID wraparound allowed an unprivileged container to join the network namespace of the host, which could allow, for instance, the host's hostname to be changed — which is very bad. That has been patched in a couple of different ways, and we've nearly fully mitigated it; the final nail in the coffin for that vulnerability is manage_ns_lifecycle, which is now the default option. manage_ns_lifecycle is what introduced the pinns binary and allowed us to pin all of the pod-level namespaces, but without yet dropping the pause container. And finally, for performance, we moved on to dropping the pause container, which corresponds to the drop_infra option.

Given that, let's talk a little about the future. Experimental support for dropping the pause container is targeted for CRI-O 1.19. After that, we plan on having Podman pods also use pinns to configure their namespaces, as Podman currently uses an infra container too. Eventually we want pinned namespaces and a dropped pause container to be the default, so that we get rid of the pause container entirely unless we absolutely need it — the pod-level PID namespace case.

So let's do a quick demo. Here we have our local cluster running — this just has CRI-O — and we're going to start off with not managing the namespace lifecycle and not dropping the pause container. Here we have our two pods, hello1 and hello2, and they're pretty much the same except for their name.
Basically, each is going to be an Alpine container that runs `top` — very simple. So we start by running `kubectl apply` for our hello pod. Now we have our hello pod, so let's look at it with `crictl pods`. Here we have the pod ID for this hello pod.

Let's look at the pause process itself. We can do a `runc list` — notice here we have f3c. And here are the namespaces we asked runc to use for this pod: notice it has a private PID and mount namespace, but we passed down paths for the net, IPC, and UTS namespaces of the infra container. Look at this PID — it matches the 285264 of our pause process, because we're taking the entries in the proc table for all the namespaces of the pause container and passing them down to runc.

Now let's try it with drop_infra=true. We create hello2... `crictl pods`... and here's our `runc list`. Notice we did the same thing as before, except instead of passing the paths from the proc entry, we're passing the namespaces that pinns unshared, configured the sysctls for, and then bind-mounted to the specified location. CRI-O took that location and passed it down to runc, and now we have a pod running without a PID namespace and without a pause container. That's it for the demo.

Let's talk a little about the performance comparison — the moment of truth. Here's a little script we used to test how much better the performance is when dropping the pause container. All we're doing is running a hundred unique pods in parallel, waiting until that's done, then removing all of those pods, and timing the whole operation. We did it ten times each for the dropped and not-dropped pause container cases, and used a tool called multitime to aggregate the data and make the comparison. And here's the data we got.
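The measurement loop can be sketched like this. It's a minimal stand-in for the real benchmark, which drove crictl against CRI-O and aggregated runs with multitime; here `["true"]` is a placeholder for the actual pod-creation-and-removal command, and only the mean-and-stdev aggregation mirrors what multitime reports.

```python
import statistics
import subprocess
import time

def time_runs(cmd, runs=10):
    """Run cmd `runs` times and return the mean and standard deviation
    of the wall-clock ("real") time, roughly what multitime aggregates."""
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        subprocess.run(cmd, check=True)
        samples.append(time.monotonic() - start)
    return statistics.mean(samples), statistics.stdev(samples)

# Placeholder command; the real script created 100 pods in parallel,
# waited, removed them all, and timed that whole operation per run.
mean, stdev = time_runs(["true"], runs=5)
print(f"real: {mean:.4f}s +/- {stdev:.4f}s")
```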
In the dropped-pause case we have a little more than half the real time of the keep-pause case. Now, there are a couple of caveats here. Number one: notice how the user and sys times are about the same in both cases, and usually those are the meaningful metrics for how long a process takes, because the real time includes the noise of kernel interrupts and other things happening on the system. But unfortunately, the way crictl works, it makes a request to CRI-O, CRI-O does all the work, and CRI-O returns a response — so none of the actual work of pod creation and removal is attributed to the user or sys time of the crictl process; it only shows up in the real time. Okay — so that means there's a little bit of noise and these numbers aren't precise.

But what we can take away is that we're doing many fewer things: we're not creating the rootfs, then running the container, then having conmon listen to the container, then waiting for conmon to be listening to the container — we're doing none of that. All we're doing is a couple of unshares and bind mounts, and then going on with our network setup. So this is much better, and we like that. And this isn't even accounting for the code savings we get from removing all the complexity of the pause-container handling, and it's also not accounting for the small but notable amount of memory we save by not having a pause container for every pod.

So that is the presentation. If you want to find out more, here are a couple of different links, and you can also contact me — I'm @haircommander on most things. Thank you very much for coming. If you're watching live, feel free to ask questions; if you're not, you can ping me on any of these and I'd be happy to answer.
I really appreciate you coming — are there any questions? Yes, please put questions in the chat. I think we had one question already — let me find it. So the question was: couldn't we just have CRI-O create a PID namespace and put a process into it that it would later kill? I hadn't thought about that prior to this. There are a couple of things that might mean it doesn't save us very much.

First, we still use the runtime spec generation for keeping the state of the pod — we're basically reusing all the work we did before, making a config.json blob for the pause container even if we drop it — and that's so we can restore the pod. I think we'd eventually like to move away from that, but it's what we're doing currently, so it wouldn't save the spec-generation step. Second, we would still need some conmon-like process, because we need to be able to handle the case where this pause process somehow got killed — so we'd probably still want someone listening to it, or able to catch a SIGCHLD, and Go can't do that very well because of how Go handles signals: the runtime gobbles them up.

And finally, in the Kata Containers case we still need a pause container — I didn't mention this in the talk and I should have. For Kata Containers and other VM-based runtimes, there's an annotation given to the pause container, and as it's passed down to the runtime, the runtime actually creates a new VM for the pause container; later containers are then injected into that VM. So we couldn't be totally pause-less, because we still need to handle that case.
And, oh yeah — we would also need that process to be able to do PID reaping, which is not too hard to do, but it would basically be an equivalent amount of work, with only a little less of the overhead of OCI containers in general. So it wouldn't save us a ton yet.

There's a follow-up question: does the pause container handle a shared cgroup namespace, or is that done via CRI-O? So the pause container used to hold the namespaces for the pod — they were based on the pause container's proc PID entry. But we don't use that anymore; we use pinns, which just mounts the — oh, shared cgroup namespace, okay, sorry. The question was: does the pause container handle a shared cgroup namespace? That's not shared in Kubernetes yet, though we're working on getting that working for cgroups v2, and pinns will support it.

Are there any other questions? If not — if we don't have any other questions, I guess folks can head over to the expo hall. If you click the link on the left of your screen down there, you can go meet other attendees of the conference, or just take a break. The next talk is going to start at 3:30, so we'll see you over there. Thank you so much — and great presentation, Peter. Thank you. Thanks, Beverly. Bye bye, everyone. Thanks for joining.