Okay, let's start another session. First of all, thank you. You probably noticed that the food trucks are already available: two of them are here in the main square, and another three are in the parking lot behind the pavilions. Right now we are going to have Rootless Containers, presented by Giuseppe Scrivano, principal software engineer at Red Hat, and Akihiro Suda, software engineer at NTT, Japan. In this talk, they'll discuss how to build and run containers without root privileges.

So hi, everyone. Today we are going to talk about rootless containers, that is, how to run containers without having root privileges. I'm Giuseppe Scrivano. I work at Red Hat on container-related stuff across different projects.

I'm Akihiro Suda. I'm a software engineer at NTT, which is a large telecommunications company in Japan. I'm a maintainer of Moby, which is formerly known as Docker Engine. I'm also a maintainer of BuildKit, the next-generation backend of docker build, and a maintainer of containerd, which is a CNCF project.

The rest of this slide is a demo of Usernetes, which is a distribution of Kubernetes that can be executed as a normal user. On this node, my UID is 1000 and my user name is "user", and I don't have sudo. I have the binaries of Kubernetes and CRI-O under this directory, and these binaries don't have the SETUID bit set. Everything is running as an unprivileged user. We can see CRI-O is working as an unprivileged user, and hyperkube is working as an unprivileged user as well. Even flanneld, which provides multi-host networking, is running as an unprivileged user. The cluster is composed of three nodes: a CRI-O node, a Docker node, and a containerd node, so the cluster mixes three different sorts of runtimes. This node is running CRI-O, but we also have a containerd node, and containerd is running unprivileged as well. The Docker node is also running dockerd as an unprivileged user. On this cluster, we're running pods of nginx.
nginx has three instances running on three different nodes, and each one has a different IP address. The nginx processes are running as an unprivileged user as well, of course. We also have cross-node networking. For example, on this node the nginx instance's IP is 10.5173, and we can connect to the instance on another node easily as well. So this is unprivileged, but inside the container we can gain root, limited to the scope of the container: the UID becomes root, and we can even do some package management with apk. The Wi-Fi here is pretty slow, so apk is taking a while, but apk works as well.

So, let's go to the presentation. Let's start with an introduction to rootless containers. Rootless containers refers to the ability for unprivileged users to create, run, and otherwise manage containers. It's not just about running containers as an unprivileged user; it also entails running the container runtime, and orchestrators such as Kubernetes, as an unprivileged user. This can be confusing, so don't confuse it with docker run --user, which executes a process in the container as a non-root user: dockerd, containerd, and runC are still running as root. Our work is also different from the USER instruction in a Dockerfile, which is substantially the same as docker run --user. With this Dockerfile, notably, you cannot do RUN dnf install, because you are not root in the container. Also, our work is different from usermod -aG docker foo, which adds the user foo to the docker group, allowing non-root users to connect to /var/run/docker.sock. This is substantially equivalent to allowing the user to gain root on the host: the user can just do docker run --privileged -v /:/host to gain root on the host. And of course, our work is not about running sudo docker, or chmod +s, which sets the SETUID bit on the Docker binary.
Our work is also different from dockerd --userns-remap, which executes containers as a non-root user called dockremap using user namespaces. With this remapping, inside the containers, dockremap can behave as if it were root. This is very similar to rootless containers, but it still requires dockerd, containerd, and runC to be executed as root, so it is different from our work.

The primary motivation of rootless containers is to mitigate potential vulnerabilities of container runtimes and orchestrators. But our work can also be used to allow users of shared machines, such as HPC clusters, to run containers without the risk of breaking other users' environments. It can also be used for isolating nested containers, as in Docker-in-Docker.

Container runtimes have been suffering from many vulnerabilities. For example, five years ago, a vulnerability called Shocker was found by a security researcher: a malicious container could access the host file system as root using the CAP_DAC_READ_SEARCH capability, which was effective by default. There was also a vulnerability in the docker build command which could run an arbitrary binary on the host as root, because of an archive issue related to LZMA. Also, last year, there was an issue in containerd: a malicious container image could remove /tmp on the host when the image was pulled from the registry. This is not when the container was created; it happens when the image is pulled from the registry. And just a couple of weeks ago, we found and fixed a serious issue that relates to Minikube: a malicious container could gain write access to the proc file system and the sys file system when the host root file system is a ramfs. This can result in arbitrary command execution as root on the host, via /proc/sys/kernel/core_pattern or /sys/kernel/uevent_helper. As far as I know, Minikube was known to be affected, and it was fixed in the latest release.
You can try the container breakout on Minikube using this command. There were also a bunch of vulnerabilities in Kubernetes, for example. Two years ago, there was a vulnerability that allowed a malicious container to access the host file system via volumes. Last year, there was an issue that could be used to gain cluster admin, which has root privileges on the nodes. There was also a vulnerability in Git which affected Kubernetes gitRepo volumes. And play-with-docker.com, which is an online Docker playground implemented using Docker-in-Docker with a custom AppArmor profile, had a vulnerability that allowed loading malicious kernel modules. This is not really an issue in Docker itself, but a misconfiguration of the AppArmor profile.

So rootless containers can mitigate these vulnerabilities, but they are powerless against kernel vulnerabilities and also hardware vulnerabilities. Our work is just a new layer for what is called a defense-in-depth approach, so it should be used in conjunction with other security layers, such as seccomp or SELinux.

So, now we can start looking at some implementation details of how rootless containers work. Namespaces are the kernel feature that allows us to have containers. There is the mount namespace, which gives the process a different view of the file system from the one the host has. There are other namespaces, like the network namespace, which gives the process a different view of the network stack. I think the user namespace is the most interesting one, because it enables really nice scenarios like rootless containers. What the user namespace does is create a mapping of IDs from the host, or more generally from the parent user namespace, as you can have a nested tree, to the process running in the rootless environment. So a process can believe it is running as root, but in reality it is running as the unprivileged user that created the user namespace.
Not every ID from the host must be mapped inside the user namespace, only a subset. In fact, for rootless containers this is very important, because once you get root privileges inside the user namespace, you have control over any other user present and defined in that namespace. By default, an unprivileged user can map only itself inside the new environment. So you can create a mapping from the unprivileged user to any other single ID, because from the kernel's point of view it will just be mapped back to the same user. The process can keep full capabilities, as if it were running as root, but of course there are restrictions. Like I said, you can change your ID, but only to IDs that are already present in the namespace; you can't change to any ID you want. Or you have the capability of configuring the network, but only for the network devices that are defined in the new environment.

I think just one ID is not enough for running most images out there: you can't run DNF, you can't install packages. For that, there are SETUID programs that allow additional IDs to be defined for each unprivileged user. So in addition to your own ID, you can have multiple IDs mapped inside the rootless environment. These tools are now packaged for basically any distro, so if you add a user on the system, by default you will get a range of IDs allocated for your user. Well, SETUID programs are dangerous, and we try to avoid them wherever possible, but in this case it makes some sense, because the logic of managing the additional ranges is handled in user space, not by the kernel. There were a few issues with these tools, and there is also the need to maintain a centralized database of all the ranges and which user they are allocated to: you can't allocate the same IDs to different users, otherwise they would be able to read each other's resources. The simple alternative, of course, is to use a single mapping, but as I said, that would break many images.
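To make the mapping concrete, here is a small sketch (a simplified model, not kernel code) of how the kernel translates IDs through a user namespace, following the format of /proc/PID/uid_map entries: each entry maps `length` IDs starting at `inside` (in the namespace) to IDs starting at `outside` (in the parent namespace). The specific ranges below are an example of a typical rootless setup.

```python
# Sketch of the kernel's ID translation for user namespaces (assumption:
# a simplified model of /proc/<pid>/uid_map entries, not real kernel code).

def host_uid(container_uid, uid_map):
    """Translate a UID inside the namespace to the host (parent) UID."""
    for inside, outside, length in uid_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None  # unmapped: shows up as the overflow UID (65534) on the host

# A typical rootless mapping: root in the container is the unprivileged
# user (UID 1000) on the host, and 65536 additional IDs come from the
# user's /etc/subuid range starting at 100000.
uid_map = [
    (0, 1000, 1),        # container root -> unprivileged user 1000
    (1, 100000, 65536),  # container UIDs 1..65536 -> subuid range
]

print(host_uid(0, uid_map))    # 1000
print(host_uid(1, uid_map))    # 100000
print(host_uid(33, uid_map))   # 100032
```

This also shows why a single mapping breaks images: with only the first entry, any container UID other than 0 is unmapped.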
It can work in some cases, like applications that only need a single user, but it's very limiting. The other path we tried to follow is to limit the privileges of these SETUID applications. In fact, new versions are no longer installed as SETUID programs; they use file capabilities, because all they need are just two capabilities out of the whole set you get when you become root. So even if this application were compromised, it would still have only two capabilities available, which limits the damage.

So, with recent kernels, the unprivileged user can create network namespaces along with user namespaces. With network namespaces, the user can create iptables rules and also isolate abstract UNIX sockets, which are a kind of UNIX socket that doesn't have a path on the file system. The user can even set up overlay networking with VXLAN, run tcpdump, and do a bunch of other stuff. But there's a problem: the unprivileged user cannot set up a virtual Ethernet pair across the host and the namespace. That means the namespace cannot connect to the Internet. The prior work by the LXC folks was to use a SETUID binary called lxc-user-nic for setting up the virtual Ethernet pair across the host and the containers. But there's a problem: such a SETUID binary can be dangerous, and lxc-user-nic has had a CVE so far. Our approach is to use Slirp, which is a completely unprivileged user-mode network stack. Inside the namespace, we create a TAP device, and we send the file descriptor of the TAP device over a UNIX socket to the parent namespace. The parent namespace has a Slirp process, and the Slirp process provides TCP/IP networking to the TAP file descriptor. We have several Slirp implementations, but slirp4netns, which is our own implementation based on QEMU's Slirp, is the fastest, because it avoids copying extra packets across the namespaces.
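The key mechanism here, sending the TAP device's file descriptor over a UNIX socket, can be sketched in a few lines. This is a minimal illustration (not slirp4netns code): a pipe stands in for the TAP device, since creating a real TAP device needs a network namespace, but the SCM_RIGHTS fd-passing is exactly the technique described above.

```python
import array
import os
import socket

# Minimal sketch of the fd-passing slirp4netns relies on: the namespace
# side sends the fd of its TAP device over a UNIX socket, and the slirp
# process in the parent namespace receives it. A pipe stands in for the
# TAP device so the example runs without extra privileges.

def send_fd(sock, fd):
    # Attach the fd as SCM_RIGHTS ancillary data to a 1-byte message.
    sock.sendmsg([b"\x00"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                              array.array("i", [fd]))])

def recv_fd(sock):
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_SPACE(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:fds.itemsize])
            return fds[0]
    raise RuntimeError("no file descriptor received")

namespace_side, slirp_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
read_end, write_end = os.pipe()        # stand-in for the TAP device
send_fd(namespace_side, read_end)      # namespace hands its "device" over
device_fd = recv_fd(slirp_side)        # "slirp" now owns a handle to it
os.write(write_end, b"packet")
print(os.read(device_fd, 6))           # b'packet'
```

Once the slirp process holds the fd, it can read raw frames from the namespace and answer them with its user-mode TCP/IP stack, with no privileges needed on either side.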
slirp4netns can reach more than 9 gigabits per second when the MTU is about 64 kilobytes. We also need to provide a port forwarder, which is used for inbound connections. A user-mode port forwarder can be implemented independently of Slirp. slirp4netns has a built-in port forwarder that can reach 7 gigabits, but if you use socat, that can reach more than 9 gigabits, and our own optimized implementation can reach 28.2 gigabits. It's still preliminary, but it's very fast compared to other implementations. We can also support multi-node networking. As far as we know, VXLAN with Flannel is known to work: VXLAN encapsulates the inside packets in UDP packets and provides L2 connectivity across rootless containers on different nodes. Other protocols should work as well, except ones that require access to raw Ethernet headers, so GRE is not likely to work.

So, before a container can be used, the image has to live somewhere. Most of the storage backends that work fine with root containers are not usable with rootless containers, simply because an unprivileged user doesn't have enough privileges to set up the storage. Ubuntu unlocked overlayfs, and there it is usable like you would do with root, but that is not supported upstream, because the kernel folks think it is not safe to expose overlayfs to an unprivileged user. Btrfs allows unprivileged subvolume management, but it needs to be configured beforehand by the admin. And device mapper is completely out of reach for an unprivileged user. The simplest workaround is to just extract the full image for each container. This works fine as long as you use BusyBox, but once you start using bigger images, you realize that it doesn't scale: each container will duplicate the entire image. There are a few alternatives. One is to use reflinks. Reflink is a feature that you can find in XFS and Btrfs that creates a new inode for the file while the data is internally deduplicated.
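The VXLAN encapsulation mentioned above can be sketched with the header layout from RFC 7348: an 8-byte header with an "I" flag bit and a 24-bit VXLAN Network Identifier (VNI) is prepended to the inner Ethernet frame, and the result travels inside a UDP packet (port 4789 by convention). This is only an illustration of the header format, not a working tunnel.

```python
import struct

# Sketch of VXLAN encapsulation (RFC 7348): an 8-byte header precedes the
# inner Ethernet frame. Byte 0 carries the flags (the I bit, 0x08, marks
# a valid VNI); bytes 4-6 carry the 24-bit VNI; the rest is reserved.

def vxlan_encap(vni, inner_frame):
    if not 0 <= vni < 1 << 24:
        raise ValueError("VNI is a 24-bit value")
    # Pack: flags byte, 3 reserved bytes, then VNI shifted into the top
    # 24 bits of a 32-bit word (low byte reserved).
    header = struct.pack("!B3xI", 0x08, vni << 8)
    return header + inner_frame

pkt = vxlan_encap(42, b"\x00" * 14)  # 14-byte dummy Ethernet header
print(len(pkt))        # 22
print(pkt[0])          # 8
print(pkt[4:7].hex())  # 00002a
```

Because the inner Ethernet frame is just payload bytes inside UDP, no raw-socket privileges are needed, which is exactly why VXLAN works for rootless containers while GRE, which needs its own IP protocol, does not.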
The issue with reflinks is that you still need to create the inodes, so if you have many files in the image, you will still create that many inodes. Since Linux 4.18, there is a nice feature in the kernel that allows you to use FUSE file systems from a user namespace. So what we did was re-implement overlayfs as a FUSE file system. All the advantages of using overlay as root, you now also have as a rootless user: there is the same deduplication for layers, and it's very fast to set up a new container because you don't need to copy any file. In addition to the basic support for the kernel feature, we added built-in support for shifting IDs; I will talk about that in more detail later. The cost, of course, is complexity: it's quite a lot of new code that can bring new bugs into the stack.

When you create a user namespace, by default we map your unprivileged user to root, and then all the additional IDs specified in /etc/subuid, from one to as many as you have available. But you can still play around and create your own mapping with user namespaces, and we allow that. This has a cost on the storage, though, because each time you create a different mapping, you need to be sure that the image on disk reflects the mapping you are using. If you start using, for example, some files that are owned by the wrong user, you will see in the user namespace that all the IDs are wrong. If you are not using fuse-overlayfs, the solution now is to create a copy of the image: for each different configuration of the user namespace, you need to create a clone of the image. What fuse-overlayfs does instead is lie to the system about the ownership of the files. It does this remapping on the fly, without creating a different image first. Of course, it's less expensive than copying and chowning all the files, as you would need to do using plain overlayfs.

Cgroups are the biggest problem with rootless containers at this point. Cgroup v1 is not safe to be used by a rootless user.
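The two ideas just described, layered lookup and lying about ownership on the fly, can be modeled in a few lines. This is a deliberately simplified toy (dicts instead of directories, no whiteouts or copy-up), not fuse-overlayfs code, but it shows the mechanism.

```python
# Simplified model of an overlay filesystem with ID shifting: lookups
# check the writable upper layer first, then the read-only lower layers;
# ownership is remapped when reported, instead of chowning files on disk.
# Layers map a path to (data, uid_on_disk).

def lookup(path, upper, lowers):
    """Overlay lookup: the writable upper layer wins, then lowers in order."""
    for layer in [upper] + lowers:
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)

def shifted_uid(disk_uid, uid_map):
    """Report the in-namespace UID for an on-disk UID (the 'lie' to stat)."""
    for inside, outside, length in uid_map:
        if outside <= disk_uid < outside + length:
            return inside + (disk_uid - outside)
    return 65534  # nobody: the on-disk UID is not mapped in the namespace

# Image files owned by container root are stored under the unprivileged
# user's own UID (1000 here); extra users come from the subuid range.
uid_map = [(0, 1000, 1), (1, 100000, 65536)]
lower = {"/etc/passwd": (b"root:x:0:0:root:/root:/bin/sh\n", 1000)}
upper = {}

data, disk_uid = lookup("/etc/passwd", upper, lowers=[lower])
print(shifted_uid(disk_uid, uid_map))  # 0: the file appears owned by root
```

The point of the shift function is that changing the user-namespace mapping only changes the translation table, not the bytes on disk, which is why no image clone is needed.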
By default, it's completely owned by root and managed by systemd. There were some workarounds, which of course introduced their own set of issues, like using, again, a SETUID program, as LXC does. Cgroup v2 will solve the issues that we have, but it still cannot be used, because it's missing some features that are needed for running containers. Once it is feature-complete, it will solve all these issues, and any rootless container will be able to manage its own cgroup subtree.

So let's look at the current options, starting from the runtime. runC started supporting rootless mode two years ago. It was basic support: a single user, no cgroups. A later release added support for multiple IDs, and also, if you make the cgroups path writable, runC will start using it. Then there is Podman, the daemonless alternative to Docker that was introduced before, and we already showed all the nice stuff you can do with rootless. It uses all the features we showed before: by default, for rootless containers, it uses slirp4netns, and it uses fuse-overlayfs, which is very fast, for the storage. I think you can use rootless Podman in the same way as you would use rootful Podman; there is no difference in the CLI. Each user has its own storage and configuration, completely separate from the system. When we create a Podman container, we create a new user namespace every time. This adds an extra layer of security compared to what we were used to before. If you run a container, you might notice that on the host you get a bunch of mount points, and if you don't do a proper cleanup, they are leaked. The restriction we had with rootless user namespaces, that you can't create any resource on the host, was actually beneficial for solving this problem, because the kernel simply blocks us from leaking resources onto the host.
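As a side note on the cgroup v2 delegation mentioned above: on systemd-based hosts, delegating a subtree to an unprivileged user is typically done with a drop-in for the per-user service. The following is a sketch based on systemd's documented Delegate= setting; the exact controller list is an example, not a requirement.

```ini
# /etc/systemd/system/user@.service.d/delegate.conf
# Ask systemd to delegate these cgroup v2 controllers to the per-user
# service, so an unprivileged user (and a rootless container runtime)
# can manage its own subtree under
# /sys/fs/cgroup/user.slice/user-<UID>.slice/user@<UID>.service/.
[Service]
Delegate=cpu cpuset io memory pids
```

After a systemctl daemon-reload and re-login, the user's runtime can create and manage cgroups below its delegated subtree without root, which is the model the speakers describe for feature-complete cgroup v2.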
Also, since containers run in different user namespaces, they can't join each other's namespaces, which is, well, another layer of protection. But Podman has pods, and pods are by definition a group of containers that need to share resources, which conflicts with what I just said. So what we do for a pod is create just one user namespace, and every container that is part of the pod joins that same namespace, so they are able to share resources and still feel like part of the same environment.

Docker is also going to support rootless mode in version 19.03, but unlike Podman, fuse-overlayfs is not yet supported. Actually, rootless containers were started by the LXC folks, and LXC has supported what we call rootless mode since six years ago. But unlike our work, rootless LXC requires a SETUID binary for setting up network namespaces. There is also LXD, which is a daemon for managing LXC containers, but LXD still requires the daemon to be executed as root. Singularity, which is popular in the HPC community, also supports rootless mode when --userns is specified, but unlike our work, Singularity doesn't support creating network namespaces with Internet connectivity. Cloud Foundry's Garden container runtime supports rootless mode as well, but unlike our work, it requires a SETUID binary for setting up network namespaces. And we are also working on another OCI runtime called runrootless. runrootless is not rootless runC; it's different from rootless runC. runrootless doesn't require a subuid/subgid configuration, because it can emulate subuid and subgid using ptrace and xattrs. This is suitable for LDAP environments, because subuid configuration can be difficult with LDAP. But the current implementation has significant overhead because of ptrace, so we are planning to replace ptrace with Tycho Andersen's new seccomp framework that is going to be merged in kernel 5.0.
There's also udocker, which is another implementation of Docker. It supports both a ptrace mode and a runC mode, and the ptrace mode can be used in conjunction with runC.

For the image builders, rootless mode is fully supported by Buildah. Like Podman, it has the same set of features: it uses fuse-overlayfs, and slirp4netns for the network. Running in rootless mode, you should really be able to build the same images that you are able to build as the root user. Additionally, Buildah supports different isolation modes. When you run a command to build the image, Buildah creates a new container. The default mode is OCI, which creates a container that would be run by runC as root. The rootless isolation mode creates an OCI configuration that is usable by a rootless user. In addition to these two, there is chroot isolation, which is usable when Buildah is already running inside a container: you don't need to isolate the build from the host, since you are already in a container. So when you run a command, it just gives you a basic environment that is more similar to chroot than to what we have in a container.

BuildKit is the new backend of docker build that has been used in Docker since version 18.06, but BuildKit can also be used as a standalone, rootless daemon. Rootless BuildKit has been used in OpenFaaS Cloud. BuildKit is also used by img, which was created by Jessie Frazelle. img is mostly the same as BuildKit, but it doesn't need a daemon. Rootless BuildKit and img can be launched as an unprivileged user on the host without any extra configuration. But when you deploy rootless BuildKit on Kubernetes, you need to set securityContext.procMount to Unmasked, so as to unmask files under the proc file system, so that build containers can mount procfs with dedicated PID namespaces.
This seems problematic, but it's not a real concern as long as everything is running in rootless mode. Also, the next version of BuildKit will no longer require this securityContext configuration, but in that case there is no PID namespace isolation between the BuildKit daemon container and the build containers. Google last year released Kaniko, which is a kind of unprivileged container image builder, but it's different from our approach. Kaniko itself needs to be executed in a container, and Dockerfile RUN instructions are executed without creating nested containers inside the Kaniko container, so a RUN instruction gains root in the Kaniko container. We consider it inappropriate for malicious Dockerfiles, because of the lack of isolation. Also, recently, Uber released Makisu. This is very similar to Kaniko with regard to unprivileged execution.

Next is the current adoption status in Kubernetes. The kubelet and kube-proxy still need to be patched with regard to cgroups and sysctl, but we don't need any patch for kube-apiserver and kube-scheduler. For kube-proxy, we have some proof-of-concept patches, and we are going to propose a Kubernetes Enhancement Proposal soon. We are also planning to work on kubeadm integration. With regard to CRI runtimes, both CRI-O and containerd support rootless mode already. For CNI plugins, Flannel is known to work without any modification. And we provide Usernetes, which is an experimental binary distribution of rootless Kubernetes that can be installed under the home directory without root. Just download the binary archive from github.com, unpack the archive, and run run.sh. You can bring up a single-node cluster just with this run.sh, and then you can use kubectl. We also provide a Docker Compose YAML for demonstrating the multi-node cluster we showed: that cluster is composed of a Docker node, a CRI-O node, and a containerd node, and Flannel VXLAN is configured by default.
But this Docker Compose YAML is just a proof of concept, so it doesn't set everything up, and we need some contributions. We are also planning to provide a YAML for deploying Usernetes on an existing Kubernetes cluster. So, any questions?

(Sorry, could you speak louder?) So the question is: what happens when root in a rootless container performs some file system operation, for example adding a user; how does it affect the host? The answer is that the real root on the host needs to provide a subuid file like this one. If the new user's ID is in the range of this subuid file, it's fine; but if it's out of the range of the subuid file, the useradd command will fail.

(Performance of what?) The question is whether we benchmarked fuse-overlayfs. No, we don't have numbers for that. I did some tests, like building containers, and it takes about the same time. I would expect FUSE to add some extra cost, but we have no numbers for that.

The question is whether cgroup v2 will allow cgroups inside a rootless container. Yes, because you will be able to delegate your subtree inside the rootless container, so root inside the container will be able to manage it, and it should be no different from using it on the host.

The question is whether there is any new work happening in the kernel space for rootless containers. Not that I'm aware of. Is there anything happening? Tycho Andersen's new seccomp framework can replace ptrace, and that is very useful for rootless containers: with the new seccomp framework, we don't need subuid and subgid configuration, which is very useful for LDAP environments. And overlayfs, maybe at some point, will be usable rootless.

The question is when Tycho's patch is going to be merged upstream. I'm sorry, I'm not sure; I hope it will be merged soon.

Sorry, could you speak louder? Modify the kernel to allow what, for which device? If the kernel...
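The subuid answer above can be made concrete with a small sketch: parse /etc/subuid entries ("user:start:count" per line) and check whether a given container UID is backed by the user's allocation. The file contents and user name here are assumptions for illustration.

```python
# Sketch: check whether an ID used inside a rootless container is backed
# by the user's /etc/subuid allocation. Each line has the form
# "user:start:count". The sample contents below are an assumption.

def subuid_ranges(content, user):
    ranges = []
    for line in content.splitlines():
        name, start, count = line.split(":")
        if name == user:
            ranges.append((int(start), int(count)))
    return ranges

def backed(container_uid, ranges):
    # Container UID 0 is the user itself; container UID n (n >= 1)
    # consumes the (n-1)-th ID from the concatenated subuid ranges.
    if container_uid == 0:
        return True
    n = container_uid - 1
    for start, count in ranges:
        if n < count:
            return True
        n -= count
    return False

etc_subuid = "user:100000:65536"
r = subuid_ranges(etc_subuid, "user")
print(backed(1000, r))    # True: within the 65536 allocated IDs
print(backed(70000, r))   # False: useradd for this ID would fail
```

This mirrors the answer given in the talk: operations on IDs inside the allocated ranges succeed, while anything beyond them fails.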
The question is about changing the kernel to allow adding and removing devices from user space. Well, a rootless container does not gain any additional privilege that your unprivileged user didn't already have, except managing the additional IDs; that's the only extra privilege you gain. Besides that, from the host's perspective, it looks exactly like an unprivileged process, so there won't be any difference regarding devices for rootless containers. If you could do that, it would mean gaining extra privileges. Okay, it seems we have no more questions. Thank you all for your attention.