Hi, we're going to get started now because this is going to be a very packed talk, so I'm launching right into it. I'm Matt Heon, and I'm here to talk to you about Podman. A bit about me: I've been an engineer at Red Hat for the last five years, and I've spent all of those with Dan Walsh, over there, on the containers team. I've been working on Podman for about the last three years, basically since the project was conceived, and I'm one of the lead maintainers on it.

So let's start talking about Podman. This is a pretty simple sample Podman command. You can see we have a volume, we've got a memory limit, we're going to launch a Fedora image, and we're going to run bash. All of this came about because Scott McCarty (I don't think he's in the room) asked me what happens when you press Enter on a command like this. We're going to answer that today, and we're going to trace this command in two different ways: we're going to run it as root, and we're also going to look at what happens when we run it without root, because those two end up being pretty different.

So let's move on briefly to the architecture of Podman, just an overall view of it. You'll notice that we have two of them here: one for root and one for rootless. Rootless Podman is not a setuid binary, which means that we don't really have privileges. We only run with what the user came in with, and for an unprivileged user we're not going to be able to do a lot of things, which means we're going to have to do a lot differently as opposed to root. I'm not going to linger on the architecture here, other than to say that everything aside from the attach socket you see over to the left is a separate process. We'll go into more detail on each of these components later, but let's keep the differences between the root and rootless architectures in mind as we go.

On to the biggest of those differences: the rootless user namespace. This is the big box.
You saw it around the rootless container. Rootless Podman needs a user namespace for two big reasons. The first is that we need to be able to mount some filesystems that we otherwise couldn't. Unprivileged users can't really mount, but within a user namespace we gain the ability to do bind mounts, FUSE mounts, tmpfs mounts, and a few more that are less relevant. Those three are going to be enough to let us establish a container root filesystem.

It's also going to let us get some additional users in play. Normally, when you log in as an unprivileged user, you have access to one UID and exactly one UID: your own. But if you pull any container image, you're going to find that there's more than one UID in play. There are a lot of files that are going to be owned by users other than root, and we really need to support that for the rootless use case. So once we create this rootless user namespace using unshare, we have to get the first and only setuid binaries involved: newuidmap and newgidmap. Those are going to read /etc/subuid and /etc/subgid (you can see those right there), and they're going to grant us some additional users based on that. You can see my username, mheon, in there, and it's going to give me 65536 users starting at 100000.
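The subordinate-ID entry just described can be read as a small lookup table. Below is a minimal Python sketch, not Podman's actual code. It assumes the entry has the form `user:start:count`, that container UID 0 is mapped to the user who launched the container, and that container UIDs from 1 upward are laid over the granted range in order. The launcher UID of 1000 and the helper names are hypothetical.

```python
from typing import NamedTuple, Optional


class SubidRange(NamedTuple):
    user: str
    start: int   # first host UID in the granted range
    count: int   # number of UIDs granted


def parse_subid_line(line: str) -> SubidRange:
    """Parse one 'user:start:count' entry as found in /etc/subuid."""
    user, start, count = line.strip().split(":")
    return SubidRange(user, int(start), int(count))


def container_to_host_uid(container_uid: int, launcher_uid: int,
                          rng: SubidRange) -> Optional[int]:
    """Map a UID inside the container to a host UID (assumed layout).

    UID 0 becomes the launching user; UIDs 1..count are taken from the
    subordinate range; anything else is unmapped.
    """
    if container_uid == 0:
        return launcher_uid
    if 1 <= container_uid <= rng.count:
        return rng.start + container_uid - 1
    return None


# Entry matching the numbers in the talk; 1000 is a hypothetical host UID.
entry = parse_subid_line("mheon:100000:65536")
root_in_container = container_to_host_uid(0, 1000, entry)
first_extra_user = container_to_host_uid(1, 1000, entry)
```

Under these assumptions, root in the container is simply the unprivileged launching user on the host, and every other container user lands somewhere in the granted range.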
That's the third and the second number respectively, which makes some sort of sense. Once we have these users in here, we have access to them and we can use them in the container, so if you want to run systemd in that rootless container, you're actually able to.

And finally, the rootless user namespace gives us the ability to change the users around a bit in terms of ordering. Namespaces really are a way of altering the way a process looks at the rest of the system, and the user namespace allows us to remap users, both in terms of excluding and including users from the host and in terms of reordering them. So we're going to take UID 0 in the container and make it whoever launched the rootless container. You're still not going to be root even though you launched this container; you don't gain any privileges as far as the kernel is concerned. But you can pretend to be root within the container, and that's pretty important.

Once we've finished setting up this rootless user namespace for the first time, we're actually going to keep it around, because any subsequent containers need to join the namespace as well if they want to share anything with containers that are already in it. For example, if they want to share namespaces, they all need to be within the same rootless user namespace. So we launch what's called a pause process to keep this alive, and every subsequent Podman invocation as rootless is going to join it. Again, this is only for rootless. If you're root, you can mount anything you want and you have access to all the users, so there's no need for it.

Now let's move on to the first real step once we're past this, which is pulling the image. If you've already seen Nolan's talk, he went into more depth on this than I'll be able to, but let's go on anyway and do a brief overview. If the image is available locally, we can obviously skip this step.
Let's assume that it's not. The first real step is to figure out what we need to pull. If you remember that sample command I showed you, the image was just fedora, and fedora is not very descriptive. We need to figure out what the full name of the image is, and we do this using something called search registries. These are defined in registries.conf, and we use them to generate some candidate full image names; you can see those over there. These are as they would appear on a standard Fedora 31 system. You can see that there are five potential candidates, and we're going to try to pull those one by one in that order. The first one to pull successfully is the one we're going to use. A pull in general is just a series of HTTP GET requests. We ask the registry, "Do you have a manifest for this image?" If it doesn't, we assume it doesn't have the image at all and go on to the next one. If it does, we try to pull all the layers of that image, and once we're done pulling them, we save them into container storage.

Now we get to actually create the container. The first step is to parse the CLI of Podman into what we call a create config. This is basically a convenient struct that holds all of the changes you made on the command line, plus some baked-in defaults. Once we make that struct, we turn it into an OCI spec. Most of you are probably familiar with the fact that Podman does not directly launch containers itself. It calls something called an OCI runtime, and the way we define the container that the OCI runtime is going to build is the OCI spec. That is going to contain things like the memory limit we included and the volume we included, and it's going to contain some baked-in defaults as well. But there are things it's not going to contain. The OCI spec, for example, does not care about images; it expects all of that to be done before it runs. And it can't care
about things like, say, a libpod named volume. If you make a volume with podman volume create, we don't really know enough to make a bind mount for it yet, and the OCI spec doesn't care about it either. So we end up with a bundle here: an OCI spec, but also some things that don't quite fit inside of it. We pass that bundle into libpod, which is the container creation library used by Podman. All our container operations are managed by libpod; the repo is actually named containers/libpod. We call create in there, and that saves an immutable copy of this container definition: the OCI spec and the things that aren't quite in it. Once we have that saved, the container is created as far as libpod is concerned.

But we have one more thing to do: we need to make this container in the storage library. We just pulled down an image, but an image isn't quite a container definition. We need something on top of it. Images are obviously immutable; you want to share them between multiple containers without making changes to them. But the container is not immutable. If I go into that container, which is running bash, and I touch some files, I want those to show up. So we have a read-write layer on top of the image, and that's going to be made by the storage library, but it's not going to be mounted yet. We're just making things.

Now we're going to prepare to start the container. If we were running just podman create, we would have halted right there. But we're going to go on and call something called start-and-attach in libpod. As the name implies, it starts the container and attaches to the container. That's the remaining work of podman run. The first step of this is an internal function called init, and init is going to make the container ready to start but stop right before it starts it. The first step of init really is to get the container mounted.
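The idea of one read-write layer sitting on top of shared, immutable image layers is what an overlay-style mount expresses. Below is a minimal Python sketch that only builds the option string for such a mount; it does not mount anything and is not the storage library's code. The directory paths are hypothetical, and the `lowerdir`/`upperdir`/`workdir` option names are those of Linux overlayfs, where the leftmost lower directory is the topmost layer.

```python
from typing import Sequence


def overlay_mount_options(image_layers: Sequence[str],
                          upper_dir: str, work_dir: str) -> str:
    """Build overlay mount options for a container root filesystem.

    image_layers: read-only layer directories, topmost layer first.
    upper_dir:    the container's read-write layer (receives all changes).
    work_dir:     scratch directory overlayfs needs next to upper_dir.
    """
    if not image_layers:
        raise ValueError("at least one image layer is required")
    lower = ":".join(image_layers)
    return f"lowerdir={lower},upperdir={upper_dir},workdir={work_dir}"


# Hypothetical layout: two image layers plus one container-specific layer.
options = overlay_mount_options(
    ["/layers/fedora-top", "/layers/fedora-base"],
    "/containers/abc123/diff",
    "/containers/abc123/work",
)
```

A second container from the same image would reuse the same lower directories with its own upper and work directories, which is why touching files in one container never changes the image.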
We need to mount it using overlayfs as root, which is basically going to take all those layers we downloaded from the image, plus that read-write top layer we made earlier, and merge them all into one directory that we can access as the container filesystem. Unfortunately, as I said earlier, rootless doesn't have the ability to make many mounts, so we cannot use kernel overlayfs. The solution here is fuse-overlayfs, which was developed by Giuseppe (I think he's in the room). fuse-overlayfs is a reimplementation of kernel overlayfs in user space, so we can use it without privileges, which is pretty neat. It does roughly the same thing, but with FUSE.

Once we've created the filesystem, we need to create the network. The container needs a network namespace, and as root we use something called the CNI plugins. If you're not familiar with these, it's a series of binaries that each perform a separate, discrete task. In typical use, one of them modifies system configuration to add the Podman network to a bridge, one adds some iptables rules so we can forward traffic to and from the container, and one handles any port forwarding we need. All of these make big, sweeping system changes that, again, we don't have privileges for as rootless, so we again have to work around this. The solution here is something called slirp4netns. slirp is a small binary that originated in virt land; I believe it was part of libvirt at first. It basically tunnels traffic from inside the container to outside the container. It acts as a manual bridge, so to speak, and lets us forward traffic without directly creating a network interface. It's got its limitations, obviously, but when you run as rootless, when you have no privileges, you have to accept some sacrifices.

And then we are going to create some
container-specific files, because we don't want to use everything in the image as is. The image probably has a resolv.conf, but it is in no way related to the network that's running this container. So we want to take the host's resolv.conf, but we also want to be able to change it; you want to be able to alter it in the container. So we make a copy of the host's resolv.conf and set it up to be bind mounted into the container. Then we finalize the OCI spec. Now we have everything we didn't have before: we made the spec earlier, but we didn't know where the container was mounted, we didn't know about these extra bind mounts we've added, and we didn't know where the network namespace was. Now that we've finished preparing, we can finalize the spec and write it to disk, and then we can use it to launch conmon.

Conmon is the container monitor, as we call it, and you can see a sample conmon command line right there. What I want you to take away from this is that you should never, ever invoke conmon yourself; let Podman do it for you. It's basically a very lightweight C binary, and once we fork it off, it double forks to daemonize. Once it has finished doing that, it provides some services for us. Podman itself is daemonless, which means the podman process can go away at any time while the container is potentially still running in the background. What if I want to do, say, a podman attach to it? Or what if it exits?
I need to get the exit code. Conmon monitors the container and provides a few services, like attaching to it if we do a podman attach, and it also stores the exit code for us. Once Podman creates conmon and conmon double forks, it has very little to do but invoke the OCI runtime, and the first thing it runs is the OCI runtime create command. What I'm about to say here is true for runc and crun. It's not true for Kata or gVisor. There are various other runtimes, and most of them are hypervisor based; they have similar end results, but the steps I'm about to describe are very different. So this applies to runc, crun, whatever your default probably is.

The OCI runtime create command makes what we call an init process. That is the larval form of a container, and it starts adding security restrictions and setting up what will eventually become the container. It makes all the namespaces. We already made the network namespace before, but it establishes a mount namespace, PID namespace, and IPC namespace, for example. Then it finishes the root filesystem. We mounted the image before, but all the bind mounts are handled by the OCI runtime. Then it creates its cgroup. The container has a cgroup, but this is another place where root and rootless differ. In the cgroups v1 hierarchy, it's not safe to give a rootless container delegation: it would be able to modify resource limits that were previously placed on it, so it could basically ignore any resource limits. What that means is that unless you're running cgroups v2, we can't really create cgroups for the container.

Next, we drop our capabilities. We're not running a privileged container here, so we drop any capabilities that we don't want to retain. This actually happens for rootless as well, because
rootless established the rootless user namespace, which gave us a set of fake capabilities within it, and we drop those capabilities too. Next, we initialize our security features: seccomp and SELinux. Once that's done, we're basically almost at the container, but we stop here, right before we execute that first thing in the container, that bash process that we want. Conmon and the OCI runtime signal Podman that we're done with this, and then they go to sleep and wait for the next step, because there's one more thing that we have to do: we need to attach to the container.

Unless you specify -d, attach is going to happen. If you do, we proceed straight to the next step: we start the container and then we just print the ID. But let's assume that we didn't. Attach is, as I said before, handled by conmon. Conmon is the parent of the OCI runtime (the OCI runtime didn't double fork, so it's a direct child), and conmon has its standard streams. So conmon opens a Unix socket for us. Anything that gets written to the container's standard out and standard error, it adds a header to, so it can multiplex it over that socket, and then it just sends it out. Anything we send to that socket gets written to the container's standard in. The reason we have to do this now, and not after the container starts, is that this is not buffered. If we start the container and then attach to it, anything that happens between the start and the attach is lost. We could write buffering code, but it's easier to just do this.

And now we get to the final step. We invoke the OCI runtime; Podman does it this time, not conmon, because we're just invoking OCI runtime start. That really just contacts that init process that has been sleeping and tells it to wake up and move on to the next step. And that next step is to exec that bash
process. Now, this is not a fork-exec; this is just an exec. We overwrite runc with the new container process, the bash process. And since runc has already set up all these security features for itself, already joined the cgroups, and already set up the namespaces, we now have PID 1 in our container and it has started running.

Let's see, I think we have a bit of time, so we'll go briefly into this: what happens when the container exits? Conmon is waiting for this. It's waiting on the container's PID, and it saves the exit code once the exit actually happens. But it also does something else: it invokes a command called podman container cleanup. Conmon can't talk to our database, so podman container cleanup takes the exit code that conmon stored and stores it in the database. Then it tears down any remaining container resources: it tears down the network and unmounts the container. Once it's done, we've freed all our resources and the container is basically done. One more thing: if --rm was given, the container is removed at this point.

All right, I think that was about it. I know this was a very quick overview and I rushed through a bunch of things, but we only had 20 minutes. So if you have questions on anything that wasn't clear, or anything you'd like more detail on, feel free to ask.

Yes? That is... what? Okay, so I've been asked what happens when conmon gets a PID that is not its child, and that is honestly bizarre. I have no idea how that would happen. If you want to open an issue on this one, we'd be glad to hear it, because that sounds very interesting. Any other questions?
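The exec-then-wait pattern from the main part of the talk can be sketched in a few lines of Python. This is an illustration of the general technique only, not conmon's or runc's implementation: a child process replaces itself with the target command via exec (no second fork), while the parent waits on that PID and records the exit code in a file named after the container ID. The function name, directory, and container ID are hypothetical.

```python
import os
import tempfile
from typing import List


def run_and_record(argv: List[str], exit_dir: str, container_id: str) -> int:
    """Exec argv in a child, wait for it, and save its exit code to a file."""
    pid = os.fork()
    if pid == 0:
        # Child: exec replaces this process image; the PID stays the same.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec itself failed.
            os._exit(127)
    # Parent (the "monitor"): block until that exact PID exits.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)
    else:
        # Killed by a signal: follow the common shell convention.
        code = 128 + os.WTERMSIG(status)
    with open(os.path.join(exit_dir, container_id), "w") as f:
        f.write(str(code))
    return code


with tempfile.TemporaryDirectory() as exit_dir:
    code = run_and_record(["sh", "-c", "exit 7"], exit_dir, "abc123")
    with open(os.path.join(exit_dir, "abc123")) as f:
        stored = f.read()
```

Because the exit code is written to disk, a later short-lived process can pick it up without the monitor needing access to any database, which mirrors the division of labor between conmon and podman container cleanup described in the talk.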
So the question is why use conmon and not systemd, and the answer is largely attach. We cannot convince systemd to forward what we need from the container, standard-streams-wise. So we really need something to hold open those streams for us and to forward content to and from the container. That's the big justification for conmon. In the end, it's a few kilobytes of memory and basically zero CPU, so we're okay with leaving it around.

So the question is comparing Docker and Podman. Briefly speaking, Podman's goal is to provide a Docker-compatible interface, so our command-line front end is basically identical. We've got a few places where we differ, but not many. We also add some features on top of what Docker provides: pod support, for example, and generate kube and play kube. But we do this in a manner that is daemonless. We have no daemons, so there is no process waiting in the background and managing the containers aside from conmon, which dies the moment the container does. So we are lighter weight and potentially more portable because of that. I think that's a good summary. Any other questions?

So the question is, on image pull, what decides the order in which we pull the potential candidate images? It's pretty simple: the order they're defined in registries.conf. We have a list of search registries in registries.conf, and we try those in order.
That is, the order they're listed.

We make a file and then we write the exit code to it, and that goes in a temporary directory just for the container. Container storage, in addition to mounting the container, makes us a few directories where we can store files related to it, and we throw the exit code in one of those. When we read the file back, we know basically where it is: it's named with the ID of the container, its contents are the exit code, and the time it was created is the exit time.

So the question is how Kata Containers is different from the other OCI runtimes. The biggest thing is that we didn't need to make a virtual machine here. Kata Containers is VM based, so it needs to spin up a virtual machine. They do a VM per pod, which is actually rather interesting. I believe they still do some of the things we do, like namespacing, within those VMs, but they need to spin up a VM with the root filesystem from the container. I believe right now they're using device mapper to manually share it between host and container; there are other ways, like virtio-fs or 9p. But the big thing here is that Kata runs inside a virtual machine. I believe it's still doing some container things inside that VM, but containers are separated by a separate kernel instead of just sharing the same kernel. Other questions?

Yeah, so the question is non-root Podman's advantages and disadvantages. The advantage is definitely security. You are running with absolutely no capabilities besides one setuid binary, which isn't even part of us; I think that's coreutils or shadow-utils. Anyway: no privileges, and it can be run basically anywhere.
Wherever it's configured, any user can run it without worries. The disadvantage is definitely speed, because we're using a FUSE filesystem and we have to manually shuttle traffic into and out of the container. I'd say another disadvantage is that you can't do some things you normally would be able to. Podman, for example, can normally make NFS volumes; there are no privileges to do that as rootless. So there are some things you just cannot do without root. But yeah: the advantage is security; the disadvantages are that you can't do a few things and it's slightly slower. Any other questions?

Yes, I'm doing a lightning talk in about half an hour. Come to that; you will be happy. Anything else? Oh, never mind, we are out of time. Thank you, everyone.