Hi, my name is Clemens Lang, and today we need to talk about your use of root privileges in containers. I'll start out with a bit about me. I studied computer science in southern Germany in 2011, got interested in open source projects while doing Google Summer of Code with the MacPorts project. I'm using a Mac here. After that I landed a job at BMW doing infotainment. That's where I spent seven years building infotainment for cars, first doing some software integration work, doing packaging. And this is where the connection to today's talk comes in: we wrote a thousand-line, single-binary container runtime in C to simplify building software for a platform, and I got very familiar with the various namespaces involved in doing that. Then I switched to over-the-air software updates for the entire car. I think we were the second in the market after Tesla to do that for the entire car. And for the last few years I did security at BMW: CISO, Secure Boot and all that, you know, SELinux, whatever you have.
And that eventually brought me to Red Hat, where I now work in the crypto team. So first note: I don't do anything with containers in my day-to-day work, so this is sort of a talk from an outsider. What I'm doing at the moment is patching OpenSSL. I try to get rid of SHA-1 where I can; you know, that might have fallen on some of your feet recently. One of the culprits was probably me. And I'm also dealing with FIPS certification, because that's a fun topic for everybody.
So, root and containers, right? Before there were containers, there was some conventional wisdom that we always applied: if you run a service, create a separate user for it, just for separation purposes. Then Docker came along and we started running everything as root again, because that's just what Docker did. But nobody thought this was a problem. And honestly, we still don't think it's a problem. And it probably also isn't that big of a problem, right?
So, just to put the entire talk into perspective for you, I brought an example here. Let's assume I create a directory in /var/lib and then bind mount that into a container. That container is running as root, so I'm running the entire command as root. Inside that container, I just happen to copy /bin/cat into that bind-mounted directory and then run a chmod command to give it the setuid bit. That means anybody who runs that binary will now get effective root permissions, and that also works outside of the container. So if I then, outside of the container, switch to the user nobody, run that particular binary and give it an argument of /etc/shadow, I will get the contents of that file, even though the nobody user should normally not be allowed to read it.
This isn't really a surprise to anybody, and it's not really a security vulnerability as is, because if you bind mount stuff in, you should know what you're doing. But if we were running this with rootless Podman, this wouldn't be an issue. You would still get a setuid binary, but it would be for that particular user, who might not have privileges to do anything else. So we could improve by running this container with rootless Podman.
Right, so running things with rootless Podman. That brings us to the issue of networking, because if you want to offer a service, you probably want network in your container. In rootless Podman there are basically two options. The first is slirp4netns, which works using a tap device but has a couple of limitations. For example, containers can only communicate with each other via exposed ports on the same host or via the localhost interface. And all the requests that you get will seem to originate from the IP address that's associated with the tap device, so you lose the information of where the request was actually coming from.
And you might want to use that to filter for which IP networks you offer a particular service. There's some improvement on top of that, where you essentially take slirp4netns, put it in a user namespace that owns a network namespace, and then do typical networking using netavark after that. That's great, because it now allows you to do standard networking between containers. But it still uses the slirp4netns tap device, so all requests still originate from the same IP address. So this isn't really ideal if what we want to do is run a service.
So the question that I was asking really turns into: can we run each container as a separate rootless Podman user, but with proper networking, proper in quotes, as in the rootful networking? That's the question I want to answer today.
So you've now seen my motivation; this is the outline for the talk. We have to talk a bit about theory, but it will be quick, I promise. And then I'll outline the various solutions that I found by following a mailing list post linked in a presentation somewhere in the Podman user group in 2021, I think. That was recently removed from the web server, boo, so I have to go to archive.org to get it now.
Right, let's get into it. Some theory. Why is it that inside of the container we can even read a file that's owned by root? To know that, we need to understand how user namespaces work. User namespaces basically give you a separate UID range, and that UID range is mapped from inside the container to outside the container using a mapping file, and this is what this mapping file looks like. You can do this with pretty much every container that uses user namespaces. It essentially tells you: zero in the container is user 1000 outside of the container, and repeat this for the following number of UIDs, in this case one. And then after that one, UID 1 in the container is mapped to UID five-two-four-something outside, for the next 65,000-and-something UIDs.
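The lookup rule behind such a mapping file can be sketched in a few lines of Python. This is only an illustration, not kernel code: the helper names are made up for this example, and the second line of the example map (524288, 65536) is an assumed, typical sub-UID range, since the talk only gives those numbers approximately.

```python
from typing import List, NamedTuple, Optional


class UidMapEntry(NamedTuple):
    inside_start: int   # first UID as seen inside the container
    outside_start: int  # first UID as seen outside the container
    count: int          # how many consecutive UIDs this entry covers


def parse_uid_map(text: str) -> List[UidMapEntry]:
    """Parse text in the three-column format of /proc/<pid>/uid_map."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        entries.append(UidMapEntry(*(int(f) for f in fields)))
    return entries


def to_host_uid(entries: List[UidMapEntry], container_uid: int) -> Optional[int]:
    """Translate a UID inside the container to the UID outside, if mapped."""
    for e in entries:
        if e.inside_start <= container_uid < e.inside_start + e.count:
            return e.outside_start + (container_uid - e.inside_start)
    return None


def to_container_uid(entries: List[UidMapEntry], host_uid: int) -> Optional[int]:
    """Translate a UID outside the container to the UID inside, if mapped."""
    for e in entries:
        if e.outside_start <= host_uid < e.outside_start + e.count:
            return e.inside_start + (host_uid - e.outside_start)
    return None


# Assumed example values: container root is host user 1000, followed by
# a sub-UID range. The exact range on a real system will differ.
EXAMPLE_MAP = """\
         0       1000          1
         1     524288      65536
"""

if __name__ == "__main__":
    entries = parse_uid_map(EXAMPLE_MAP)
    print("container UID 0 is host UID", to_host_uid(entries, 0))
    print("host UID 0 inside the container:", to_container_uid(entries, 0))
```

With a map like this one, host UID 0 has no counterpart inside the container, so nothing in the container can act as the real root user.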
That mapping is what it will typically look like on a modern system where you have sub-UIDs. And the rule for accessing files is that containers, and I looked this up in the kernel, can't access inodes that are owned by UIDs and GIDs that are not mapped inside your container. So if you don't map UID zero, then nobody in the container can access root's files. It's as simple as that. So that's theory part one.
Theory part two is on networking. For that, you need to know that any non-user namespace is associated with a particular user namespace. If you want to do an operation inside that namespace, you need the required capabilities in that user namespace. Sounds complicated, but it will be a lot easier once we get to the example. Managing a network connection requires CAP_NET_ADMIN. That's the capability the Linux kernel checks when you try to modify the IP address of a device, for example. If you do this in a network namespace owned by a user namespace in which you are root, then you usually have that capability, and the action is allowed. However, changing the host's network namespace requires CAP_NET_ADMIN in the host's user namespace, which is the root namespace that you typically don't see, but it exists. And that also tells us something: if we want to use the real networking, then at some point we will have to have CAP_NET_ADMIN in the root user namespace. There's no way of doing any of this without using actual root. So for some pieces, we will need root.
So my first idea when I was trying to do this was: okay, I'll start a rootful container, but Podman offers me this flag, --uidmap, that allows me to configure which UID mapping I actually want. So let's just not map the root user from outside into the container, and then the problem that I was trying to solve is solved. This is what these lines here do.
So essentially, I'm saying zero in the container maps to the user that I currently am outside of the container, and the second --uidmap line just does the same thing with sub-UIDs, which we will ignore for now, because we don't need to understand it to understand the concept I'm going for. When I prepared this talk, I reran this command, and my heart almost stopped, because I thought: wait, this didn't use to work. And this is the error message that I used to get when I did this. Turns out this is fixed in Podman 4.5, right? So this is what you should be doing. Thanks for coming to my TED Talk, goodbye.
So, I mean, now this works, but I'm still going to tell you what I did before it worked. You might learn a thing or two, and you might still not want to use this particular solution, but we'll get to that.
So I did some Googling. I found a presentation that essentially said: yeah, you can run these two commands, then set up the network manually, and that should work. And here's a link to a mailing list post of a guy who probably did that at some point in time. I clicked that and thought, okay, this looks nice, I can probably do this. The idea here is that all of this is done as a user now, so we are running rootless Podman. I create the container; create, not run or start or any of that. I create the container without networking. I give it a name, because we will need that name for the commands that follow. Right after, I run podman container init. That's also an interesting command, because it sets up all the namespaces but doesn't actually start anything inside your container. So you have the namespaces available, and then you have time to modify them, for example to configure the network, which I did here in a script, let's call it magic.sh. After that, I run podman start, and my container starts as normal. So the question really is: what's inside this magic.sh?
And it's this. And it's kind of a lot, right? Let's go through what this does. I use podman inspect to figure out the PID of the container that we initialized but didn't start. Next, the sudo ln line is really just there to give the network namespace a name that the ip utility from the iproute package can use. Then I do what Podman internally also does, which is set up a virtual Ethernet pair, move one side of that pair into the container, and rename it inside of the container to eth0, which is what we would expect the network interface's name to be. Then I bring it up on both sides and configure an IP address so we can have communication going on.
That is a lot of work. And note that we had to do manual IP configuration for this: I had to choose an IP address and set it. podman inspect won't know about this, because we didn't use Podman to configure the network. We must repeat this every time we start the container. And I haven't yet dealt at all with exposing ports, which requires writing firewall rules, which I really don't want to do manually. This is tedious work. And yeah, I could probably write a script. But at this point I was about to give up on this, because I thought: I don't want to deal with firewall rules and exposing ports, let's just run everything as root.
And then I thought: how does Podman actually do this? Where's the code in Podman that creates all these network interfaces, chooses an IP address and assigns all of this? And it turns out there isn't any, because what Podman does is call netavark for this and just pipe a bunch of JSON into it, and netavark takes care of configuring the network interfaces, moving them into the right namespace and so on. So I thought: can I just create this JSON configuration, pipe it to netavark, and it will do what I want for me? And it turns out, yes, I can.
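For reference, the manual steps described for magic.sh (name the network namespace, create a veth pair, move and rename one end, bring both ends up, assign addresses) might look roughly like the following. This is a sketch under assumptions, not the script from the talk, which is not reproduced in this transcript: the function name, the interface names veth-host and veth-ctr, and the example addresses are all hypothetical. The code only builds and prints the commands; actually running them needs root.

```python
import shlex
from typing import List


def veth_setup_commands(
    pid: int,
    netns_name: str,
    container_cidr: str,
    host_cidr: str,
    host_if: str = "veth-host",  # hypothetical interface name
    peer_if: str = "veth-ctr",   # hypothetical interface name
) -> List[List[str]]:
    """Return the commands for a manual veth setup, in execution order."""
    return [
        # Give the container's network namespace a name the ip tool can use.
        ["mkdir", "-p", "/run/netns"],
        ["ln", "-sf", f"/proc/{pid}/ns/net", f"/run/netns/{netns_name}"],
        # Create the virtual Ethernet pair on the host.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", peer_if],
        # Move one end into the container and rename it to eth0 there.
        ["ip", "link", "set", peer_if, "netns", netns_name],
        ["ip", "-n", netns_name, "link", "set", peer_if, "name", "eth0"],
        # Bring both ends up.
        ["ip", "link", "set", host_if, "up"],
        ["ip", "-n", netns_name, "link", "set", "eth0", "up"],
        # Assign the manually chosen addresses on both sides.
        ["ip", "addr", "add", host_cidr, "dev", host_if],
        ["ip", "-n", netns_name, "addr", "add", container_cidr, "dev", "eth0"],
    ]


if __name__ == "__main__":
    # Example values only; the PID would come from podman inspect.
    for cmd in veth_setup_commands(12345, "rootless", "10.0.0.2/24", "10.0.0.1/24"):
        print(shlex.join(cmd))
```

Even in this reduced form there is nothing here for publishing ports, which is the part that would need firewall rules.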
So I created a new Podman network. You can give it IPv6 or not if you want to; I mean, we want to get rid of legacy IP, so you should these days. Then generate the required JSON. That contains the IP address that you want to give the container, so we still have that problem. It also contains the exposed ports. Then pipe that to netavark setup with a path that identifies the network namespace in which you want to do this. In return, that will give you a name server configuration that you should somehow get into the container. I was lazy; I just wrote it to /etc/resolv.conf.
What does that JSON structure look like? Unfortunately, it's not documented, at least not in the netavark READMEs or documentation and also not in the Podman documentation. So I reverse engineered it, and this is the rundown of what it is. It tells you: okay, I'm looking at this container ID with this container name. Here's a list of the port mappings, from which container port to which host port, how many ports, which protocol. It also gives you a list of the networks that you want to attach to and which IPs you want inside each network, and this also gives you DNS, so you can get name resolution and service discovery if you also specify name aliases. And then there's this network info block here at the bottom, which I omitted because it's really just the output of podman inspect on the network.
Right, and this is the point where I show you that this works. And let's pray to the demo gods, because I'm running this over Wi-Fi. But if it doesn't work, I have a TTY recording of me showing this. Can you still hear me while I'm sitting? Yes, great. I have two shells here on a Fedora 38 system. One is root and one is the test user. I said we want to start out as the test user, so let's do that. I want to define a runtime directory.
Let's call our container rootless, and for that reason I'm choosing that particular runtime directory, and it probably doesn't exist yet, so I'm going to make it. Then I'm running podman create. This cidfile that I'm specifying here, I'll explain in a second, because I automated all of this; I'm not going to do it manually now for everyone. Thanks. And that expects the ID of the container in this particular path, so that's why I'm creating it. Disabling network, giving it a name. We're starting Fedora 38, and just for the demo we're starting a Python web server on port 8080. So this was created now.
Now I need to run the container init command that we saw earlier, so podman container init and then the name of the container. That passed. Now we can see in podman ps -a that we have the container. It's in status initialized; it's not running yet, right? So let's get the PID for this, so we can see what's going on. Time-wise, I should probably skip some of this. Let's go to... yeah, I scripted some of this.
And this is the part where we need to start running things as root. As root, I have a script here that allows me to set this up. I need to use the exact same runtime directory that I used here, so let me copy this and fix it, because obviously id -u will return something else. When I run setup, I specify a name that's used to generate an IP address; in this case, I just used the container name again. Then I have a secret that I use so that the IP address I'm generating isn't predictable. Test is the user that I'm running the container under. Then I want to attach this to the rootful zero network. Let me quickly check that this network actually exists, so we don't get an error. It exists. And I want to publish a port: that's 80 to 8080, so 80 is outside of the container and 8080 is inside of the container. I could also specify a network alias here, but I'm not going to show you the resolution anyway, so let's just skip it.
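The talk does not show how the setup script turns the name and the secret into an IP address. One plausible way to get an address that is stable for a given name but not predictable without the secret is a keyed hash, sketched below. Everything here is an assumption for illustration: the function name, the use of HMAC-SHA256, the example subnet, and the convention that the first host address belongs to the gateway.

```python
import hashlib
import hmac
import ipaddress
from typing import Union

Address = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def derive_address(name: str, secret: str, subnet: str) -> Address:
    """Derive a stable address inside subnet from a name and a secret."""
    net = ipaddress.ip_network(subnet)
    if net.num_addresses < 4:
        raise ValueError("subnet too small to derive a host address")
    # Keyed hash: the same name and secret always give the same digest,
    # but the result cannot be guessed without knowing the secret.
    digest = hmac.new(secret.encode(), name.encode(), hashlib.sha256).digest()
    # Skip offset 0 (network address), offset 1 (assumed gateway) and the
    # last address (broadcast on IPv4), then pick one of the remaining ones.
    usable = net.num_addresses - 3
    offset = 2 + int.from_bytes(digest, "big") % usable
    return net[offset]


if __name__ == "__main__":
    # Example values only; the subnet would come from the Podman network.
    print(derive_address("rootless", "not-a-real-secret", "10.89.0.0/24"))
```

A scheme like this does nothing about two names hashing to the same address, so a real script would still need to check for collisions on the network.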
I mistyped something. This exists, and the container ID exists in there. There must be a typo. I don't see it. No, no, that's not the issue. I mean, at this point, this is the case where I'm stopping this and just showing you the recording, because I don't have time to debug this right now.
So we see the same thing, and now I'm sure that this will work, right? And the advantage is that I have time and I don't have to type, so I can tell you what's happening. We already saw this: I'm creating the runtime directory, then creating the container, without network again, and again I'm giving it a name. We're starting the exact same Fedora 38 container as before. Next, the container is initialized. Then again we see the status is initialized here. I'm getting the PID of that container, because in this case I'm going to show you some of the network namespaces. What we see now is the process that's actually running, right? The container isn't running, but there's the crun process, which initializes the namespaces. This also lists the namespaces that we have. If you have time to look at the details, you'll notice that the numbers behind them are different. That means the namespaces have in fact been created, both the user namespace and the network namespace.
And we can also see, and that's what this nsenter command is doing here, that we can now enter the namespace and look at the IP configuration inside of it. We should expect that there will just be a localhost interface, because we told it not to set up any network. And that's what we see here, right? There's no eth0, no other network connectivity.
Now it's the point where I want to set up the network. Again we see the rootful zero bridge exists. That's the bridge that I want to attach to this container. Again the path to the runtime directory that it created, then setup to tell it to set up.
Rootless is the name, then the secret that I use to generate the IP address, test as the user, rootful zero is the network name, then publish because I want to publish a port, and in this case I'm also specifying a network alias that will be in the DNS server inside of the container. This is the successful output, a lot of JSON. You don't need to understand it; it just gives you the network and DNS configuration. And now we see, if we rerun this nsenter command, that the Ethernet connection exists at this point.
Now we need to start it, obviously, because the process inside the container wasn't running yet. So I ran podman start rootless; now it's running. And now we still need to test it, because if we didn't test it, then it's broken, obviously. So I'm figuring out my own IP address and using curl to send an HTTP request to the Python web server inside of the container, and it works. So our networking worked as expected. I'll skip the stopping because we're running out of time.
I also have the same thing automated with systemd. I put the exact same commands in a systemd service file, and for the lines that require root privileges, systemd allows you to just specify a plus at the beginning of the command and it will run them as root. That's a nice trick to get all of this into a single systemd service file. I'd show you, but we're running out of time, so you'll have to go to the website where I published this. You don't have to scan the QR code now; it will be on the last slide again, so no hurry. That will also contain the rootful network Python script that I just used to do this.
So what did we achieve so far? We now have automatic IP configuration; I had to re-implement it, since otherwise Podman would have done it, but the script does it for you. We can expose ports. With the systemd service file that I talked about, I can have this container controlled by systemd, and that takes care of correct startup and teardown.
podman inspect still won't know about this network, because we added the networking manually. And if we try to use systemd notifications, that also won't work in this particular configuration, because systemd will refuse the notification; it will see it, but it will refuse it.
So what's next on this? Before Podman 4.5 introduced the working --uidmap flag, where this just works out of the box, I would have said maybe we should add a mode to Podman to drop all privileges except for network configuration. Now I'm not so sure. Honestly, if you're on Podman 4.5 or newer, probably just use --uidmap to do the same thing. And maybe there's some improvement to be had for rootless networking: there's a talk tomorrow at 9:30 on rootless container networks getting shaped with pasta, if you want to give that a shot.
Right, that's it. Thank you for attending. Any questions? Yes?
Any standardized, I mean doesn't change under your hands and doesn't break the user's free.
Right, the question was: the JSON format for netavark isn't documented, so probably not standardized; am I afraid that this will break with the next netavark update? Yes, very. On the other hand, it's also an interface across two processes, between Podman and netavark. I know those two are developed in unison, but I'm at least hoping that they will preserve backward compatibility and that what I'm currently doing will continue to work. It also looked like somebody really wanted to document the JSON interface and just didn't, so I think lack of time is the only reason why it wasn't documented.
So the question is: I demonstrated that the setuid bit trick won't work, because it will map to the unprivileged user. What about file capabilities? I actually can't tell you. I'd have to look into the kernel source code to see what it does. But I'm assuming the rule that I learned from Michael Kerrisk, who has a great training on all those isolation APIs; if you have a chance, attend it.
What I learned from him is the general rule that you need the permissions in the user namespace that owns what you're trying to access. If that principle holds, then it shouldn't be possible, because in the user namespace that owns the files, which should be the root namespace, you as an unprivileged user wouldn't have the capability to do that. But I'd have to test.
Other questions? Questions from the Internet, maybe? Thank you, and enjoy DevConf.