Okay, good morning, good afternoon, good evening, depending on where you're joining us from. Welcome to today's CNCF webinar, Rootless Containers in Gitpod. My name is Kristi Chan and I'll be moderating today's webinar. We would like to welcome our presenters today, Christian Weichel, Chief Architect at Gitpod, and Alban Crequy, Director of Kinvolk Labs at Kinvolk. A few housekeeping items before we get started. During the webinar, you are not able to talk as an attendee. There is a Q&A box at the bottom of your screen. Please feel free to drop in your questions and we'll get to as many as we can at the end. This is an official webinar of the CNCF and as such is subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would be in violation of that Code of Conduct. Basically, please be respectful of all your fellow participants and presenters. Please also note that the recording and slides will be posted later today to the CNCF webinars page at cncf.io/webinars. With that, I'll hand it over to Christian and Alban to kick off today's presentation. Take it away.
Thanks, Kristi, for the intro. So welcome. Today we're going to talk about rootless containers in Gitpod, and to dive right in we first have to talk about what Gitpod is. Gitpod is an open source project that automates development environments. You can think of it like a CI system: where CI automates regular builds, Gitpod automates the provisioning of development environments for pretty much every developer. So it has ready-to-code dev environments, meaning all your tools are there, code is downloaded, code is compiled, and you can start working with the click of a button. And it does that behind the scenes by provisioning Kubernetes pods. So each workspace that you start within Gitpod is actually a Kubernetes pod.
And we want those pods, those workspaces that you can start in Gitpod, to feel pretty much like your local machine, except you get a new local machine for every task you want to do. So there's no previous state that can impede what it is you're trying to do. And for a long time after we started out, one of the big differences between your local machine and what Gitpod would give you was what you could do within such a Gitpod workspace. For example, there was no sudo, meaning you couldn't install things after your workspace was running. You could only do that in the Docker image that you would bring to the workspace. And also there was no Docker, which in a cloud native environment is a bit tricky. So what we really wanted was to enable those two key features: that you could have root in your workspace and be able to install things after the fact, once it's running, and also that you would have Docker and could do docker build, docker-compose, etc. And this talk is pretty much about the technologies that made this possible and how we enable this in Gitpod. So now this is possible. Now you can run Docker, you can do apt-get install. You basically have root within your workspace.
And the most naive way possible of doing this is by simply giving you all the privileges within the workspace container. We could just run as root, so to speak. But the clear and obvious downside is that everyone inside the workspace would then effectively be root on the node. They would effectively have all the privileges they need to potentially escape the container and to have really a lot of privileges on a node that is shared with, say, 25 other users. So clearly this is not an option, and we need some good way of isolating those workspace containers from each other, but also from the node. And this is where Linux isolation tech comes in. And I'll hand over to Alban to talk about that.
Thank you.
So there are different ways to isolate the pods more from each other and from the host. One way is to think about VM-like container runtimes. Those are, for example, Nabla containers, gVisor, Kata Containers, Firecracker. These technologies provide improved isolation compared to what Linux containers offer. They work in different ways. For example, Nabla containers use unikernels. This means that for every new workload there will be a different unikernel built specifically for that workload. There is gVisor. What it does is re-implement the system call interface in user space code. So when your application makes a system call, instead of talking to the Linux kernel, it talks to this interface, this application kernel. There is Kata Containers, which builds lightweight virtual machines and is compatible with several hypervisors, for example QEMU or Firecracker. Those different VM technologies provide more isolation, but they also come with more limitations compared to normal Linux containers. There could be compatibility issues, or they could have decreased performance, for example with network traffic or file system access. What we want in general is to have higher density. That means being able to put a lot of pods on the same node without too much overhead. So next slide.
There is an alternative approach, which is not to use VMs, but to use what is called user namespaces. User namespaces are a feature of the Linux kernel, among other namespaces, for example network namespaces, PID namespaces, and so on. Currently, that's a feature that is not provided by Kubernetes. So Kubernetes works like you see on the picture on the left. It has worker nodes, the kubelet, and so on, that don't use user namespaces. What a user namespace does is isolate users. That means the user root inside the container is not the same as the user root on the host. So it provides some isolation. There are different ways to use user namespaces in Kubernetes.
Here I show three different ways to use them. The second one from the left is called KEP 127. What it does is add a new field in the pod spec, a bit the same way that you have hostNetwork in the pod spec to say whether or not you want to use a new network namespace in your pod. It adds a new field for user namespaces. So I show in this picture where the user namespace would be located in this architecture. That's a KEP, which means Kubernetes Enhancement Proposal. That's not something which is merged in Kubernetes yet. That's something we are working on with others in the community. Another way to use user namespaces is the next one, KEP 2033. It's the so-called rootless mode, because it allows you to run the different Kubernetes components without being root. For example, you can run the kubelet without being root. You can run the container runtime without being root. So in this approach you have a user namespace that goes around all the components of Kubernetes. And the last solution is the one written by Gitpod, where we don't touch Kubernetes. So we can use upstream Kubernetes without modification. And inside the pod, inside the workload, it makes use of user namespaces, so it creates the user namespace at that point. In this way it works on current Kubernetes without patches.
So how do you create a user namespace? This is an example, sort of a walkthrough of how to create such a thing. It starts with the unshare syscall (there are other syscalls that can also do that), which creates the user namespace itself. Then, once you have that user namespace, you need to establish the UID and GID mapping that maps a user ID from within that namespace to a user ID outside of that namespace. This happens by writing to files in the proc file system. And then lastly you need some execve syscall to get a hold of the capabilities inside that user namespace.
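The mapping step of that walkthrough can be sketched in a few lines of Python. This is a minimal illustration, not Gitpod's implementation: the helper names `format_id_map` and `write_id_maps` and the `proc_root` parameter are made up for this example, and the mapping values are arbitrary. It assumes the standard kernel file format of one `inside-ID outside-ID length` triple per line in `/proc/<pid>/uid_map` and `/proc/<pid>/gid_map`.

```python
import os


def format_id_map(mappings):
    """Render (inside_id, outside_id, length) triples in the text format
    used by /proc/<pid>/uid_map and /proc/<pid>/gid_map."""
    return "".join(
        f"{inside} {outside} {length}\n" for inside, outside, length in mappings
    )


def write_id_maps(pid, uid_mappings, gid_mappings, proc_root="/proc"):
    """Write the UID and GID mappings for the user namespace of `pid`.

    On a real system this needs sufficient privileges in the parent user
    namespace. An unprivileged writer would additionally have to write
    "deny" to /proc/<pid>/setgroups before touching gid_map.
    `proc_root` is a hypothetical parameter so the function can be tried
    against a scratch directory instead of the real /proc.
    """
    base = os.path.join(proc_root, str(pid))
    with open(os.path.join(base, "uid_map"), "w") as f:
        f.write(format_id_map(uid_mappings))
    with open(os.path.join(base, "gid_map"), "w") as f:
        f.write(format_id_map(gid_mappings))


# Illustrative values only: map IDs 0..65535 inside to 10000..75535 outside.
print(format_id_map([(0, 10000, 65536)]), end="")
```

The sketch deliberately leaves out the unshare and execve steps; it only shows what ends up in the two proc files once the namespace exists.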
And you basically get the full capability set at this point, including CAP_SYS_ADMIN, CAP_NET_RAW, etc. You can try this yourself with this command. This lets you observe the steps that happen to make this work. So unshare -U (uppercase U) creates the user namespace, and -r maps your current executing user to UID 0 inside that namespace. And the strace in front just traces what's happening. So this is all fine, except in a Kubernetes pod we would need to give quite far-reaching capabilities to make this work. To write these two files, you need CAP_SYS_ADMIN in the outer namespace. And because at this point Kubernetes does not provide user namespaces yet, this outer namespace would be the node as a whole. And for security reasons we don't want to provide CAP_SYS_ADMIN on the node inside the workspace. So we need to find a solution to that.
The way we built this within Gitpod is that the root process that we start inside a container is something we call supervisor. Supervisor ring 0 is the thing that gets started as the command of the workspace container. It then starts the user namespace as supervisor ring 1. Once we have that, we make a gRPC call to a daemon service that runs on the node, which we call workspace daemon. This service has the capabilities on the machine to actually write those files. And we pass the PID of the process that identifies the user namespace we want to write this UID and GID mapping for. That's all nice, except now we have to do PID translation. The reason for that is that containers in general are in essence a collection of namespaces and other isolation tech, and one of those namespaces is a PID namespace. This is why any process that you start as the root of the container becomes PID 1. And it's not the actual PID 1 on the node, say systemd or init or something like that.
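This double identity of a process is visible in `/proc/<pid>/status` on reasonably recent kernels: the `NSpid` field lists the process's PID in each nested PID namespace, outermost first. A minimal Python sketch of a lookup built on that field follows. The function names and the sample status text are invented for illustration, and this is one possible approach under those assumptions, not necessarily how workspace daemon does it.

```python
def parse_nspid(status_text):
    """Return the NSpid entries of a /proc/<pid>/status file as ints.

    The first entry is the PID as seen from the namespace of the procfs
    being read, the last one is the PID in the innermost PID namespace.
    Returns an empty list if the field is missing.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


def find_host_pid(statuses, container_pid):
    """Find the host-side PID of a process known only by its container PID.

    `statuses` maps host PIDs of candidate processes (for example the
    children of the container's init process) to the text of their
    status files.
    """
    for host_pid, text in statuses.items():
        nspid = parse_nspid(text)
        # More than one entry means the process lives in a nested namespace.
        if len(nspid) > 1 and nspid[-1] == container_pid:
            return host_pid
    return None


# Made-up sample data: two processes inside one container.
candidates = {
    48000: "Name:\tsupervisor\nPid:\t48000\nNSpid:\t48000\t1\n",
    48211: "Name:\tsupervisor\nPid:\t48211\nNSpid:\t48211\t17\n",
}
print(find_host_pid(candidates, 17))
```

Restricting the candidates to descendants of the container's own init process matters, because PID 17 exists in many containers on the same node.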
So the PID that we'll receive from supervisor ring 1 will not be the PID that workspace daemon sees in the node's namespace. Outside of the PID namespace of the container, that process has a completely different PID. So we need to do some translation here, and there are a few ways this could be done. There is no syscall yet that can just do this translation for you. There are some tricks using Unix pipes, but the information is also in the proc file system. If we look at the status file of a process, we see that there is an NSpid entry, which lists the PIDs in the child namespaces from the perspective of the process that's looking at this file. Because we know that the PID we're looking for must belong to a child process of the workspace container, we can look at the children of that workspace container, look at their status files, and this way identify the correct PID.
So now we can create a user namespace and we can establish the UID mapping within this user namespace. Now we're left with a problem, precisely because this is working really well. If we look at the file system, we see that the UIDs all of a sudden don't make sense anymore, at least at first glance. But thinking about it, this is exactly the expected behavior, because on the file system we have some files that belong to actual, proper UID 0 and we have some files that belong to a user that has a mapping within this user namespace. And the ones that actually belong to proper UID 0 are shown as 65534 in here, because we don't have a mapping that maps a user inside the namespace to UID 0 outside. To illustrate that, what we would like to have is a file system that from within the user namespace looks like this. You have a whole bunch of files and folders that belong to UID 0 and you have some that belong to some other user, in this case 33333.
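The 65534 effect can be reproduced with a small model of how an ID mapping is applied. This is a simplified Python sketch, not kernel code: the function names are invented, and the mapping (IDs 0 to 65535 inside shifted by 10000 outside) is an assumed example. 65534 is the usual default overflow ID that the kernel reports for owners without a mapping.

```python
OVERFLOW_ID = 65534  # usual default of /proc/sys/kernel/overflowuid


def to_inside(outside_id, mappings):
    """Translate an on-disk (outside) ID to what a process inside the
    user namespace sees. Unmapped IDs show up as the overflow ID."""
    for inside_start, outside_start, length in mappings:
        if outside_start <= outside_id < outside_start + length:
            return inside_start + (outside_id - outside_start)
    return OVERFLOW_ID


def to_outside(inside_id, mappings):
    """Translate an ID used inside the namespace to the ID stored
    outside. Returns None if the ID has no mapping."""
    for inside_start, outside_start, length in mappings:
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None


# Assumed example mapping: inside 0..65535 <-> outside 10000..75535.
mapping = [(0, 10000, 65536)]

print(to_inside(0, mapping))       # file owned by real root on the node
print(to_inside(10000, mapping))   # file owned by the shifted root
print(to_outside(33333, mapping))  # where the workspace user lands outside
```

A file owned by real UID 0 has no counterpart inside the namespace, which is why an unshifted root file system appears to be owned by 65534.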
And in this example, we have a UID mapping where UID 0 inside the namespace is UID 10000 outside of the namespace, and UID 33333 inside the namespace is 43333 outside. So basically just plus 10000. In order to get this view on the left from within the user namespace, we would need to have a file system on the node that actually looks like this, that actually has this UID shift implemented. But in reality, the file system that we need to do the shift for is the root file system of our container. And this root file system was put there by the snapshotter of the container runtime, which doesn't know about this UID shift and also doesn't care. So in reality, the file system on the node looks exactly the way we would want it to look from within the user namespace. So we need some mechanism that dynamically does this UID shift for us.
There are a few technologies that can do that. For example, there is fuse-overlayfs, which has the benefit that it can be used without any privileges outside of the user namespace. You can use it completely from within the user namespace, because FUSE can be mounted within user namespaces and the only other thing that's needed is a userland process. There is very little upfront cost: all you have to do is start a process. But the runtime cost is comparatively high because every operation has to go through userland. On the upside, it is not very platform specific. There is also OverlayFS metacopy. Metacopy is a mode in OverlayFS where it just copies the metadata to the upper dir. So what we could do is mount an OverlayFS on top of the file system that we would like to shift and then basically do a chown on that file system. And this is exactly where the upfront cost comes in. This chown is potentially very expensive if the root file system is large. The runtime cost is comparatively low.
In terms of platform specificity, OverlayFS, to my knowledge, can only be mounted from within user namespaces on Ubuntu, because they have a non-upstream patch that ticks the right box, so to speak, on OverlayFS. And lastly, speaking of Ubuntu, Ubuntu has support for a file system they call shiftfs, which can do this UID shift at mount time, so to speak. It doesn't completely work from within the user namespace, because you need something that they call a mark mount, and this you can only do with privileges in the outer namespace. But it has very little upfront cost: all you need is a mount. Runtime cost is very low. It is quite fast and it runs entirely in kernel space, but it is very platform specific. It only works on Ubuntu. For gitpod.io, which is the SaaS offering, the SaaS version of Gitpod, we ended up going with shiftfs because we have control over the environment that this runs in, and we deeply care about workspace startup time and performance.
So now that we have the UID mapping established, we're using the same trick that we used to write the UID and GID maps to create this mark mount, this privileged operation that we need for shiftfs. We make another gRPC call to the workspace daemon, which then creates that mark mount for us. Once we have this mark mount, we can use it to mount the shifted file system. Then we do bind mounts for /dev, /proc, etc., other bits of the file system of the container. And then we start supervisor ring 2, which basically does a pivot_root to this new file system. This is how, inside ring 2, you are inside this user namespace but also looking at a shifted file system. So to you, all file system permissions and ownership look correct. This is all nice, except we cannot just mount proc for this new file system.
But we want to do that because supervisor ring 2 also creates a PID namespace, to hide this mechanism away and also to prevent users of the workspace from escaping this new root file system. And we cannot mount proc because, if we look at proc within that container, we see that there are a bunch of files that have a mask placed on top of them. In the proc file system there are a bunch of files, a bunch of objects, that are singletons within the kernel and are not namespaced. For example, /proc/kcore or /proc/sched_debug, which might even leak information about other namespaces, hence other containers on the node. So what Kubernetes, and more specifically the container runtimes, do is mount masks on top of those files and folders in proc in order to prevent workloads from accessing them. And in the kernel, there is a check for whether such a mask is present. If so, it prevents users from mounting /proc, because that would essentially render those masks useless. In order to work around that, and to never offer an unmasked proc to the workspace container, we again rely on workspace daemon to make that mount for us. The way that works is that we call out to workspace daemon, passing in the PID of the target PID namespace. Workspace daemon does the proc mount, establishes the masks, and then moves this entire masked proc mount into the mount namespace of supervisor ring 1, into the new file system that we're creating.
That's nice. So now we have root inside our workspace, it feels like root, and things like apt-get are working, but Docker isn't working yet. Rootless Docker has seen a lot of work, first and foremost by Akihiro Suda, who has worked relentlessly on things like RootlessKit and in general on making Docker work in a rootless mode, but our friends from Kinvolk, Alban and his colleagues, have also done a lot of work in this space. So how do we make this work?
The key issue here is that Docker needs a lot of capabilities with regards to networking, and we can provide those capabilities by wrapping Docker, or the Docker daemon specifically, in a network namespace. To do that, we need to provide some networking to the outside world, so to speak. For this we use slirp4netns, which is a userland mechanism that makes a network namespace's connection to the outside world work without needing privileges in the outer namespace. Then there are proc mounts: the containers that run in this Docker daemon will also need their own proc mounts because, among other things, they also have their own PID namespaces. We use the same trick that we used to create the proc mount for the workspace container as a whole, for supervisor ring 1. We basically call out to workspace daemon and ask it to mount proc for us. Now, this isn't quite as easy as it might sound, because we need to catch the right moment to do that. And we do this by interjecting into runc. As part of the OCI runtime spec, the container orchestrator, so to speak, in this case Docker or containerd, will create the OCI runtime spec, and in there it will have something like mount proc. We sit in between there: we modify the OCI runtime spec and add ourselves as a hook in the container lifecycle to actually do that proc mount.
Okay, so much for how this looks on paper. Let's have a look at how this looks in the real world. So this here is a Gitpod workspace that runs in a tab in my browser. Obviously, there's a full-blown container behind it. This is what we've just been talking about. And in here, I can do things like this. I can just install new software, for example, but I can also run sudo docker-up. And this will start the Docker daemon with the process that I just described.
And now I can just run Docker containers. Right. So I just started Alpine. I can also do that while exposing ports, and then Gitpod will realize that this port is now served. At the moment, there's nothing actually running on it. But if I start a web server in here, then I can access this service that now runs inside a Docker container in my workspace. So networking also works across this boundary. Right. Back to Alban.
Thank you. As you have seen, there are different ways to make it work. There are some difficulties, and it might be easier in the future to implement such an architecture if we have more features in the Linux kernel. I will talk about a couple of those now. One patch set that is currently being reviewed is idmapped mounts. That's something that does something similar to shiftfs, but instead of being an Ubuntu patch, it is being pushed upstream and is currently being reviewed. Once we have that, I'm hoping it will be easier to do this kind of shiftfs operation. That will be useful for the rootfs of the container, to be able to have this different ownership of files, and it will improve performance both in time and in disk space. Another use case is volumes. When you have a host volume in Kubernetes, you do a bind mount from the host into the container, and you want to have this shifted ownership on those files. So that's one thing we'd like to have.
On the next slide, there is another thing, another feature that I'm enthusiastic about. It's called seccomp notify. It's a kind of new seccomp architecture with a seccomp agent. So what is the use case for that? As you have seen before, there is this interface between the workspace and the daemon outside, with methods like prepare user namespace or mount proc, that run a privileged operation like mount. And I'm enthusiastic about seccomp notify because it will be able to provide a proper interface for this kind of thing.
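As a rough illustration of what a policy with the notify action can look like, the Python sketch below builds a seccomp profile in the JSON shape used by the OCI runtime spec (`defaultAction`, `listenerPath`, `syscalls`, and the `SCMP_ACT_NOTIFY` action). Support for these fields depends on the runtime version, and at the time of the talk it was still in progress. The helper name `build_notify_profile` and the socket path are hypothetical, and the allow-by-default policy is chosen only to keep the example short.

```python
import json


def build_notify_profile(syscalls, listener_path):
    """Build a seccomp profile dict that defers the given syscalls to an
    external agent instead of allowing or denying them in the kernel."""
    return {
        # Everything not listed below is simply allowed (sketch only;
        # a real profile would be far more restrictive).
        "defaultAction": "SCMP_ACT_ALLOW",
        # Unix socket on which the agent receives the notify file descriptor.
        "listenerPath": listener_path,
        "syscalls": [
            {"names": sorted(syscalls), "action": "SCMP_ACT_NOTIFY"},
        ],
    }


# Hypothetical agent socket path, for illustration only.
profile = build_notify_profile(["mount"], "/run/seccomp-agent.sock")
print(json.dumps(profile, indent=2))
```

With a profile of this shape, the decision about each mount call is made by whatever process listens on the socket, which is the role the seccomp agent plays in this talk.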
So what you will be able to do is have the container run the mount system call normally, and then seccomp notify will intercept that and send a message to the seccomp agent, which will run the mount system call on behalf of the container. On the next slide, I will explain a bit more about that. At the top right, you see I have a seccomp policy. That's where you define, for each system call, whether you want to allow or deny access to that system call. But with this seccomp notify feature, you have an additional action you can take, called notify. What it does is say that every time a process in the container uses that system call, the decision is deferred to an external agent called the seccomp agent. And then this agent will be able to take a decision on the system call on behalf of the container. Here I put a diagram where you see runc at the top left. When you use runc, or it's the same thing in Kubernetes when you start a pod, what happens internally is that it will fork and exec a couple of times to create this child process, and then it will execute the seccomp system call with this notify feature, and you get a file descriptor to be able to receive the events, in this example the mount system call. That file descriptor will be passed to the seccomp agent, which will be able to run actions on behalf of the container. So when the container in the end calls mount, the seccomp agent will do that. What it means is that it has the potential to make things simpler for the pod, because we could just use Docker inside the pod normally. And when it runs the mount system call, it will automatically call the seccomp agent to do that, without having to implement this gRPC interface.
Okay. And on the next slide, a sort of summary of the different future technologies, in the Linux kernel or in general, that I think are interesting. So first, in Kubernetes, the support for user namespaces: there are two Kubernetes enhancement proposals for that, which I described before.
And on rootless containers: if you go to this GitHub page about rootless containers, you will find a lot of history and lots of interesting projects, like RootlessKit, Usernetes, which is about running Kubernetes without being root, slirp4netns, which we talked about, and a project that does the same thing but with more performance using seccomp notify. Seccomp notify in Kubernetes doesn't exist yet. That's a work in progress. But here are some different pull requests. There is work in progress to make it available in the OCI runtime spec and in runc. In crun and conmon, the work is already done. And at Kinvolk we are working on this seccomp agent, which is a generic seccomp agent to make it easy for you to use this seccomp notify feature. Thank you. That's my last slide.
Yeah. So briefly, to sum up: Gitpod provides dev environments that are built for the cloud and automatically provisioned. User namespaces are the key tech to provide root within these workspaces. And then thanks to all the amazing people that actually make this stuff work, first and foremost Kinvolk, also Akihiro Suda, and the community as a whole. Thank you very much.
All right. Thank you both for a great presentation. At this time, we're going to move into our Q&A segment. So if you have a question for our presenters, feel free to submit it either through the chat or through the Q&A box at the bottom of your screen. It doesn't look like we have anything submitted yet, but we'll give folks just a few seconds here to submit their questions. Okay. It looks like we might have a shy group among us today. Alban and Christian, I know at the beginning of your slide deck you have your Twitter handles. Do you want to go back to that slide, just in case folks do think of questions later, so they have a place where they could reach out? Perfect. Awesome. So you can see both of their Twitter handles here on the screen. So feel free to reach out with questions.
I'm sure they would love to chat with you more about this cool thing called Gitpod. Well, that'll do it for us today here at CNCF webinars. Thank you again, Alban and Christian, for this presentation, and thank you all for tuning in. A reminder that the recording and slides will be posted later today to the CNCF webinars page. Stay safe out there. Continue to wear a mask, and we'll talk soon. Bye.