Okay. Hello, everyone. Let me just turn on the screen share. Okay. All right. So, hey, everyone. Oh, I'm getting a call on... Christian, mute. I muted. Okay. Yeah, your phone wasn't, or something, I don't know. I was getting a call on the phone. Okay. All right, weird. Sorry, let's start again. Hey, everyone. I'm Stéphane Graber. I'm Christian Brauner. We both work at Canonical on LXD and on related container projects like LXC, LXCFS and kernel features related to those. Today, we're going to be talking about how to make your own containers from base Linux primitives. So, first things first. What are containers? For some, the big chunk of metal. Not that kind of container we're talking about today. Containers are effectively isolated systems. They're pretty similar to virtual machines, but different in that they share the kernel with the host. There is no virtualized firmware, no virtualized hardware. A container is a group of special processes sharing the host kernel. On Linux, containers are purely a userspace concept. It's effectively a lie that containers are a thing. It's userspace deciding to use a number of kernel features together, and the result is called a container. That also means that what a container really is depends on whose software you're using, and that can vary quite widely, as we'll see in more detail. There is effectively no such thing as a container, and no single way of detecting one either. So, you might want some of your monitoring tools to be able to track your containers. You can probably do that on a per-container-runtime basis, but there is no general way of knowing, oh, this process is running in a container named something. Because, again, depending on how you use those different bits, things can change quite a bit. Let's look at the main components used to create those containers. Many containers start with filesystem isolation.
That's using chroot or pivot_root to give your process a different filesystem, which can be a different distro, a different version of your distro, or just a very restricted subset of your filesystem. On top of that, you can use namespaces to further isolate a number of kernel objects. Those namespaces are, I think, in the order they were introduced: the UTS namespace first, which is kind of just a demo namespace to some extent; it's actually just holding your hostname. So it lets you have a different hostname, but that's it. Then the mount namespace, which gives you a different mount table, followed by the PID namespace, which gets you a different view of the process hierarchy. Followed by the IPC namespace, which lets you isolate your inter-process communication, things like shared memory, on a per-namespace basis. Then the network namespace, which gets you your own set of network devices and loopback, followed by the user namespace, which gets you a completely different view of your system's users, UIDs and GIDs. It used to be the most recent one, but it actually no longer is. Then we had the cgroup namespace, which was introduced to have a different view of the cgroup hierarchy. And our most recent addition is the time namespace, which lets you have offset-based time tracking compared to the base system. So you can have a container whose time is offset by, say, 40 seconds. Yeah. Yeah, that one got introduced very recently, and we added support for it in LXC last week. Then you've got a number of security measures you can apply on top of that. It's probably worth mentioning that the user namespace is by itself a pretty big security measure, and that a lot of the others actually act as safety nets when a user namespace is in use. But those are the Linux security modules, so think AppArmor or SELinux or Smack, which can let you apply policies on what files and what objects you might be able to access from within a container.
seccomp, which lets you filter system calls and block old legacy syscalls, or newer syscalls that you don't really trust yet. We also have capabilities, which let you drop a number of kernel capabilities, so access to specific parts of the Linux kernel. Resource controls, that's done mostly through cgroups on Linux, and that lets you prevent a number of attacks, especially denial-of-service type things. We'll be going into each of those in much more detail and show you how to actually use them all together to create your own containers in this presentation. We've also got a list of caveats; I think I mentioned some of those already. User namespaces are also a security feature, I mentioned that. And, oh yeah, the devices cgroup, which is technically a cgroup, but not really resource control so much as access control in that case. The devices cgroup lets you restrict what devices a container can access, so what device nodes and such, effectively. All right. First, a word of warning. We will be showing you how to use all of those different features. At the end of this talk, you might be under the impression that you're ready to write your own container runtime. We strongly recommend that you don't. There are a lot of corner cases and specific issues that apply, that we know of, that we've discovered over the years and that we know how to deal with. In general, if you can use an existing container runtime, use existing libraries; do not try to reinvent the wheel, because you're going to run into problems. Some simple things: the order in which you create some namespaces, the order in which you apply some of those limits, matters, even though it doesn't seem like it should. And if you do it in the wrong order, you can end up with devastating security issues. So yeah, something to keep in mind: don't reinvent the wheel, try to use existing libraries.
Also, if you are in any way tempted to use privileged containers, don't. Those are a terrible idea. We'll go into some more detail about the user namespace and some of that later on to explain what the difference is there. But effectively, yeah, just don't use privileged containers. All right, so I'll hand it over to Christian to go through the description of the different features, and for each of them, I'm going to give a small demo afterwards to show you how they actually work. Yep, so Stéphane mentioned a lot of the components, or most of the components, that containers are actually built from. And as he said, one of the problems is that everybody has their own opinion about what a container is and what it isn't. And people usually have strong opinions about it. But I guess one of the things that most people agree upon is that it at least has to involve some form of isolation from the rest of the system. Otherwise, I mean, technically you could, but you can't just create a new process and call it a container. That's not the point of the whole exercise, right? Then writing a container runtime would be trivial as well. One of the aspects is obviously filesystem isolation. The filesystem is usually shared between all users. So all users have the same view of the filesystem. Well, you start out in a different home directory, but still, you can access most directories and so on. And obviously, this is something that people have been thinking about even before containers on Linux: how can you isolate your filesystem view? And what most people call the predecessor of any container or container runtime is the chroot syscall, which gives you your own view of the filesystem hierarchy. It's easy to use, that's one of its advantages, but it's also terribly insecure, as Stéphane will show you. It's pretty trivial to break out of a chroot.
So it's really just something that you want to use for fun, but not to write a container runtime. So because of all these shortcomings, the kernel provides another system call, the pivot_root syscall, which gets around some of the security issues that chroot has. It's a secure version of chroot, essentially. It's a bit harder to use and often involves a mount namespace. It doesn't have to, but it often does. It has a couple of restrictions. You can't use it on the initial RAM disk, which means if you want to run a container on a RAM disk, then you can't use the pivot_root syscall, which is not that great, especially if you want to run system containers, so containers that boot a whole system. You also have to move a bunch of mounts around in some instances, and you can't use pivot_root if your root is a shared mount point. So if you want your container's root filesystem to be MS_SHARED, that won't work with pivot_root, which is a shame, because it means mount propagation is not really something that you can use there. You can use it with containers, obviously, but you can't use it for your root filesystem, which is a shame, to be honest. Yeah, and usually pivot_root should be done in a mount namespace. So both of them get you a private view of the filesystem hierarchy. That's what I said, so suddenly a different directory becomes your slash, and you protect your own root filesystem. And Stéphane will now go on to show you how you can escape a chroot, hopefully, and how pivot_root protects you from that. Yep. Okay, so as you can see, I'm running this as a normal user on my laptop. I'm going to be using the unshare command to create a new user namespace, a new mount namespace, a new PID namespace, remap my current user to root, and fork a process for good measure. At this point, I'm root-ish. I mean, I'm root in that namespace, but not real system root.
In that directory, I do have a directory structure here called alpine-edge, which shows a normal Linux filesystem structure. I can chroot to it. Oops, I need sh, because Alpine does not have bash. There we go. And we see that, yeah, I'm on slash now in that container. Okay, that's fine. Let's say now, in my tiny container, I mount /proc. Well, then you'll see there's a tiny problem, because /proc will show you, oops, all the processes from the host in there, including PID 1, which now lets me do this, at which point I am no longer in that container anymore. I'm on my actual system. Yeah, it's so much easier to escape it that way entirely. Yeah. So that's a bit of a problem with chroot. And that's why, instead, what you could do is the same thing. So create the same namespaces. We need to make that directory a mount point, so we just bind-mount it on top of itself. Then we go into it, and then we use pivot_root to replace slash with that. So that's the replacement for chroot. Okay, that's done. At which point I can exec sh. And now I'm back to being in a container, same view as before. Now let's do the same thing. Let's mount /proc and try to escape with /proc/1/root. Doesn't work. There's no /bin/bash. That's because /proc/1 points back to us. So if I do sh, I'm back where I started. I can't escape through that particular issue anymore. So that's really the difference between chroot and pivot_root. pivot_root is also actively used by Linux distributions during boot to switch from the initial RAM disk over to the final physical hard disk that you're booting from. So that's its other use, really, outside of containers. But yeah, pivot_root effectively lets you replace a lot more of the references with the target mount hierarchy instead of the old one. Okay, let's switch back to the slides and switch to the next topic: namespaces. Ah, yes. Sorry, by the way, sometimes there's a bit of lag.
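The pivot_root dance from that demo can be sketched in code. This is a rough illustration of what the unshare command does under the hood, not LXC's actual implementation; the pivot_root syscall number is assumed for x86_64 (it differs on other architectures), and it assumes unprivileged user namespaces are enabled on your kernel:

```python
import ctypes, ctypes.util, os, platform

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

CLONE_NEWNS = 0x00020000
CLONE_NEWUSER = 0x10000000
MS_BIND = 4096
MS_REC = 16384
MS_PRIVATE = 1 << 18
MNT_DETACH = 2
SYS_PIVOT_ROOT = 155  # x86_64 number only; syscall numbers differ per arch

def pivot_into(new_root):
    """Best run in a forked child: unshare user+mount namespaces, map
    ourselves to root, bind-mount new_root onto itself and pivot_root
    into it. Returns 'pivoted', or 'unsupported' when the kernel
    refuses (e.g. unprivileged user namespaces disabled)."""
    if platform.machine() != "x86_64":
        return "unsupported"  # the syscall number above is x86_64-specific
    uid, gid = os.getuid(), os.getgid()
    if libc.unshare(CLONE_NEWUSER | CLONE_NEWNS) != 0:
        return "unsupported"
    try:
        with open("/proc/self/uid_map", "w") as f:
            f.write(f"0 {uid} 1")
        with open("/proc/self/setgroups", "w") as f:
            f.write("deny")
        with open("/proc/self/gid_map", "w") as f:
            f.write(f"0 {gid} 1")
    except OSError:
        return "unsupported"
    # pivot_root refuses shared mounts, so make our mount tree private.
    libc.mount(b"none", b"/", None, MS_REC | MS_PRIVATE, None)
    # The new root must be a mount point: bind-mount it onto itself.
    if libc.mount(new_root.encode(), new_root.encode(), None, MS_BIND, None) != 0:
        return "unsupported"
    os.chdir(new_root)
    # pivot_root(".", ".") stacks the old root on top of the new one;
    # detaching "." afterwards drops the old root entirely.
    if libc.syscall(SYS_PIVOT_ROOT, b".", b".") != 0:
        return "unsupported"
    libc.umount2(b".", MNT_DETACH)
    os.chdir("/")
    return "pivoted"
```

Fork before calling it, since the namespace changes are permanent for the calling process; a real runtime would use libraries for this rather than hand-rolled syscalls, as the talk keeps stressing.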
So if I'm reacting a little slower, it's not because of being in Europe, I guess. Yeah, namespaces. I mean, this is what most people think of when they think containers on Linux. And to some extent, as someone once put it: on Linux we didn't think about containers, so we invented namespaces to get around some of the inflexibilities of the kernel. I like that idea. So namespaces give you a lot of flexibility to do a lot of different things, not just implement containers. They are obviously most closely associated with containers, but there are a bunch of them that have users outside of, independent of, containers. So yeah, multiple namespaces. We had seven, and since two kernel releases ago, we have eight namespaces: mount, UTS, user, net, IPC, cgroup, PID, and finally, we also have a time namespace. And the oldest one, as far as I remember right now, is the mount namespace. And actually, I think it was invented independently of containers. One of the first motivations was to give each user their own mount hierarchy when they log into the system. There is even an old article somewhere from IBM out there that mentions this. So mount namespaces are, again, concerned with the mount hierarchy. And the easiest example is that you want to get your own private mount table, right? If you share a mount namespace and you mount something, then you unmount it for every user. So you plug in a USB stick, somebody mounts it, but then some other user unmounts that USB stick and then it's gone. So the idea is obviously: what if we implement a mechanism that gives you your own private mount table, such that mounts in one namespace don't affect the mounts in another namespace? So I could mount the same USB stick in one mount namespace and in another one.
And if I unmounted it in one of them, I wouldn't automatically unmount it in the other one. If only it were that easy. That's the main thing people think about when they think mount namespaces, but mount namespaces also come with a tiny feature that is called mount propagation. And mount propagation lets you set up dependent mounts, for example. So mount events only propagate into a dependent mount namespace, but if you unmount something in the dependent mount namespace, it won't show up in the parent or dominant mount namespace. And then you have something called shared mount propagation, which means if you mount or unmount in one mount namespace, it also propagates into any other mount namespace that has shared mount propagation set. So it's a bit complicated. And actually, we use some of this mount propagation trickery to implement features such as hotplugging mounts into containers and so on. So mount namespaces are a lot more complicated than they probably need to be by now, but they're pretty important, especially when you think about using pivot_root. We have UTS namespaces; as Stéphane mentioned, this is rather unexciting insofar as you can change your hostname, which is important for containers, obviously. You have the network namespace, which is concerned with isolating the network stack. It's your set of private network devices, private iptables, private routing tables, and so on. And you can obviously see that this has a use case completely independent of containers. Network namespaces, I guess, are the prime example of having usage outside of containers. The IPC namespace, for most people, is also pretty unexciting; it's just concerned with isolating inter-process communication objects, SysV IPC and so on.
And the cgroup namespace, this is really, I guess, a namespace that only exists for containers, because it's concerned with, similar to how chroot gives you a private view of your filesystem root, the cgroup namespace gives you a different view of your cgroup root. So if you, for example, are located in /sys/fs/cgroup/mycgroup, you don't see the whole /sys/fs/cgroup/mycgroup path, you would only see slash if you're in a container. So you have the impression that mycgroup is actually your cgroup root. So this is really just something that exists for containers, I think. PID namespaces are important insofar as they isolate your process identifiers, meaning in a new PID namespace you can have PID 1, even though, as seen from the host PID namespace, it's PID 12, for example. And PID namespaces in themselves are pretty interesting. I could talk about them for a long time, but I won't, otherwise Stéphane will get annoyed. And the last one is the time namespace, which is, I guess, most important for container migration. So when you migrate a container from one physical host to another physical host, you can easily end up in a scenario where monotonic time seems to go backwards, which can be a bit of an issue, especially when you migrate containers. So you can specify an offset to make sure that when you restore a container on the new host, monotonic time or boot time actually increases. This namespace has just been implemented, and we've added support for it recently. And the namespace API is a bit complicated, I would say. Stéphane mentioned that the way and the order in which you create namespaces sometimes matters. I guess the prime example is the network namespace. If you create a user namespace and a network namespace at the same time, I think you can get the iptables or nftables or routing table ownership wrong.
So you need to create the user namespace first, then write an ID mapping, and then unshare your network namespace to get the ownership right. I thought the permissions in /sys also get wonky if you do user before network. Oh, well, that's actually something that would be fixed by a patch I'm thinking of, but I digress. So the way and the order in which you create namespaces can be quite important. And you can create them with the clone syscall, which is usually how container runtimes do it. Then there's the unshare syscall, which unshares your namespace, or also creates a new namespace, if you want to think about it like that. And you can change your namespace; you can attach to the namespace of another process as well. There is an API for this too: the setns syscall. And I guess the most important namespace that I haven't mentioned so far is the user namespace, which is concerned with isolating the privilege concept on Linux. So isolating IDs, giving a container the impression that it actually runs as UID 0 as seen from its own user namespace, while if you look at it from the outside, you see it runs as a completely unprivileged UID, 100,000 or something. It also encapsulates capabilities, meaning if you have CAP_SYS_ADMIN in a non-initial user namespace, it doesn't mean you have it in the initial namespace. So capabilities are per user namespace as well. And one last thing I should mention is that each namespace that's not a user namespace has an owning user namespace. Well, technically, I guess a user namespace also always has an owning user namespace. So if you create a new network namespace, it will be owned by the initial user namespace. If you create a user namespace and then create a network namespace from within that user namespace, that network namespace will be owned by the user namespace that you created before. And this way, the permission checking is always right.
And so you see that there's another dependency between namespaces: namespaces have owning user namespaces. I could go on, but Stéphane will now give you a nice example of how namespaces can be used. Yep, okay, so let's switch back to, there we go. So namespaces. The easiest way to show namespaces on Linux is again with the unshare command that I used earlier. It's got arguments to control the creation of a mount namespace, UTS namespace, IPC namespace, network namespace, PID namespace, user namespace and cgroup namespace. It doesn't currently have the time namespace, but that's just because I'm using an old version of it. I'm sure someone has sent a patch for that one already. Let's start by looking at the namespaces for our current process. You can see them in /proc, and that shows you a symlink-looking thing, which is actually a magic link, for every one of the namespaces that your process is in. Because I didn't unshare anything yet, at this point those should all match the host namespaces, the initial namespaces. So now if I do, whoops, I'm just going to unshare the user namespace and remap myself for this process. Now if I look there, if you look pretty closely, you can see that the numbers for all of the different entries are the same, except for the user namespace, which has changed because we just unshared it. Now, let's do something a bit more useful and unshare a mount namespace from within that. We could once again go and look, at which point the mount namespace changes from 1840 to 2725. Okay, let's do another namespace now. So network goes from 2008 over to 2731. That also means that now we've got an empty network namespace. So if I look at all the devices in there, we only get a loopback device, which every network namespace gets. Okay, and lastly, well, not lastly, but more lastly, let's do a new PID namespace.
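Those magic links in /proc can also be read programmatically, which is handy if you want to check whether two processes share a namespace. A small sketch (the function name here is mine, not an existing API):

```python
import os

def namespace_ids(pid="self"):
    """Return {namespace name: 'type:[inode]'} read from /proc/<pid>/ns.
    Two processes are in the same namespace exactly when the magic-link
    targets (the inode numbers) match."""
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}

if __name__ == "__main__":
    # Print e.g. "mnt        mnt:[4026531840]" for every namespace we're in.
    for name, target in namespace_ids().items():
        print(f"{name:10} {target}")
```

Comparing `namespace_ids("self")` before and after an unshare shows exactly the inode changes Stéphane points out in the demo.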
So now we've done both a mount namespace and a PID namespace. That means we can do a new mount of /proc. There we go. And if we look at the list of processes, now we only have two. We've got PID 1, which is the shell that I used to create this PID namespace, and we've got PID 19, which is the ps command that just ran. If I try to change the hostname at this point, from catherine over to blanc, it's going to tell us we can't. That's because there's another namespace we didn't unshare yet: the UTS namespace. So let's do that, unshare UTS. Hey, now we can change the hostname. And if we spawn a new shell, we've got the new hostname in place. So that's a bit of a view of what unshare lets you do. It really lets you unshare one namespace at a time, or multiple namespaces at a time, and build the namespace view that you want. It also does convenient things like that UID remapping from your normal UID over to root. One thing I can quickly show for the user namespace here: if I go back to just creating a user namespace, I show up as root, right? UID 0, GID 0. Now, if I create a random file, say blanc, we see that blanc is owned by root. Now, if we get out of that container and we look at the ownership, we're going to see that blanc is owned by me. That's because that's what the user namespace does. And out of the box, you're always allowed to map your own user ID and group ID to UID 0 and GID 0 inside a user namespace, which is exactly what unshare does. So when you see root, you're really not root. It just looks like you are, but you're actually still your own user outside of the container. Yeah, so this is the ID mapping concept that the user namespace encompasses. I'm always looking for words; I mean, I've worked on this in the kernel and in user space, and I still find it hard to succinctly explain how the user namespace works. All right, and the next one is going to be seccomp.
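That "map your own UID to root" trick that unshare -U -r performs can be sketched directly against the kernel interface. A minimal illustration with ctypes, assuming unprivileged user namespaces are enabled (the function name is made up for this example):

```python
import ctypes, ctypes.util, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
CLONE_NEWUSER = 0x10000000

def map_self_to_root():
    """Run in a forked child: unshare a user namespace and map the
    current UID to 0, mirroring `unshare -U -r`. Returns the UID as
    seen inside the new namespace (0 on success), or None when
    unprivileged user namespaces are unavailable."""
    outer_uid = os.getuid()  # must be read before unshare()
    if libc.unshare(CLONE_NEWUSER) != 0:
        return None
    try:
        # "0 <outer_uid> 1": map exactly one ID, our own, to root.
        with open("/proc/self/uid_map", "w") as f:
            f.write(f"0 {outer_uid} 1")
    except OSError:
        return None
    # getuid() now reports the mapped ID: root inside, still us outside.
    return os.getuid()
```

Any file this process creates afterwards shows up as root inside the namespace but as the original user from the outside, which is exactly the blanc-file demo above.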
Right, so I'm always torn here about whether this is a core feature. It's kind of related to LSMs, to Linux security modules, which we will mention in a little bit, but to some extent it's not, because LSMs are usually treated as an optional add-on to container security, while seccomp is considered to be at the core of container security, I guess. And I think that's due to the fact that it's been around in Linux for such a long time, and it operates at a very low level, at the entry point of the syscall path in the kernel. So seccomp, secure computing, obviously. And it allows you to restrict syscalls, I guess that's the easiest explanation. For unprivileged containers, it's not necessarily so much of a problem, but for privileged containers, it definitely is. There are a bunch of syscalls that, if you were to just allow them, could allow you to escalate privileges, to potentially escape the container, or, in general, just do things that you shouldn't be allowed to do. And I guess Stéphane's favorite example has always been open_by_handle_at, which is a way, I guess it even works with, does it work with unprivileged containers? I'm not sure right now. No, it doesn't. It's blocked by default by the kernel; I think you need CAP_DAC_READ_SEARCH or something. You need some extra privileges there. Yeah, but basically, even if you have used pivot_root, you can use it to escape to the host root, which is obviously pretty bad. So this is a syscall you definitely want to block. Go ahead. Yeah, it's a very nifty syscall that caused a bunch of issues for Docker and others a few years back. Effectively, what it would let you do is you would open, I think, a file descriptor to some directory, and then it would let you get a file descriptor for something relative to that directory.
So you would effectively open, say, slash in your container and then say, I would like dot dot slash from there. And it would quite happily traverse onto the host and escape the pivot_root entirely. So that was a bit of a problem. It's been blocked by all container runtimes for years now, but that's the kind of example of issues with privileged containers and old syscalls that people don't really remember. It's a syscall that hasn't received much love, for sure, over the years. So yeah, definitely a syscall you want to restrict, especially for privileged containers. But also, it's the principle of least privilege: if you run an application, it's usually a good idea to only give it access to what it absolutely needs. This obviously becomes more of an issue when you think about browsers doing video decoding and encoding and so on, running all kinds of plugins that you really don't trust. So you want to restrict the syscall interface and, for example, just allow them to open specific files or call specific syscalls. And seccomp is the way to do this. In the easiest implementations, you just say: here's the set of syscalls you're allowed to use. That would be the allow-list approach. Or you have a deny list, where you block all syscalls that you think shouldn't be allowed, but you allow all others. Obviously, the allow-list approach is the smarter approach, because if a new syscall that you'd deem unsafe gets added to the kernel, you're still in the clear. And you can instruct seccomp to, for example, report a specific error code back to user space when it blocks a given syscall. And the usual convention that we follow with our container runtimes, at least, is that we return EPERM, because most programs will know how to gracefully move on when they receive EPERM. Yeah.
Yeah, or using ENOSYS, which is the other way of getting a nice fallback on your syscalls: effectively pretending that the kernel does not support the syscall, which then causes the calling program to go through a compatibility code path. But you can also get more fancy than that. Seccomp makes use of a dialect of BPF. Most people nowadays, when they think of BPF, think about eBPF, extended BPF, but what seccomp uses is a predecessor, essentially cBPF or classic BPF, which is a retrofitted term actually; it used to be called just BPF. And you can write seccomp filters in cBPF, which has some limitations, but it's still pretty expressive. It means you can define more complex syscall filters and, for example, filter based on arguments passed to a system call. But because of the way cBPF works, there are some limitations. One limitation, for example, is that all pointers are essentially opaque to seccomp. So if you wanted to filter the open syscall and you were to filter on the path passed to open, which is not a great idea anyway, you can't do this, because seccomp can't dereference pointers. It can't chase pointers. So any structure that is passed by pointer is opaque to seccomp, but you can filter on any register-based arguments. So if you have a flags argument, then you can filter on open's flags argument, also not that useful, or on mount's flags argument. For a bunch of syscalls, this is pretty helpful. You could, for example, restrict the unshare syscall to only allow you to unshare the user namespace, but not any of the other namespaces. So it's actually quite useful. And seccomp recently has been extended; this is work we've been involved in as well. You can intercept syscalls. Well, you could always do this, but you can now also outsource the decision whether or not a syscall is supposed to be blocked or continued to another user space process.
So you can use seccomp to supervise and emulate syscalls. The way this is done is that a task can get a file descriptor for its own seccomp filter. And then that file descriptor can be handed off to, for example, the container runtime. This file descriptor is pollable, meaning you can get events for syscalls that you have registered in your seccomp filter. When a syscall event of interest arrives, the file descriptor becomes readable, and then you can read the seccomp information from the file descriptor, which includes the syscall number, the architecture and the syscall arguments. And then you can go on and, if you wanted to, you can now chase pointers in user space, which seccomp itself can't do. There are ways to do this safely, but one needs to be careful. I could go into detail, but that probably means we would be running out of time. Yes, and so we're using it to emulate syscalls, meaning you get a notification for a syscall, for example the mknod syscall, and then you can inspect the syscall arguments and you can realize: oh, right, the container is just trying to create a dev console, which is a pretty boring device node and which we bind-mount into the container anyway. So why not emulate the syscall in user space for the container, create the device node in the container's mount namespace and call it done. So this is what you can do with seccomp notify, as it's called. But that's a pretty advanced usage, I would say, for seccomp. Usually you just really use it as an additional or core security mechanism, whereby you restrict the container to only use a certain subset of syscalls.
For system containers, we could do that too, but since we're mainly concerned with running unprivileged containers in the first place, we can, for the most part, rely on the kernel blocking all dangerous syscalls anyway. So having an allow-list approach for system containers is usually not a good idea, because you're booting a full init system. But if you're running a tiny application, just one single process in your container that doesn't need to do a lot, then you can use an allow list and only allow a very small set of syscalls. But Stéphane can give you a fun little demo of returning a weird little error code via seccomp. All right, so first, let's just get ourselves another namespace, we keep doing that. This time we're going to get a mount namespace as well. There we go. And let's look. So there's a seccomp.c file here. Let's look at what that does. So that's using seccomp directly; we're not even using libseccomp, really, just writing a bit of BPF, which effectively says that if you're trying to call the mount syscall, we're going to return ENOANO as the error, and for everything else, you get to go through. And when you run that binary, what it will do is spawn a bash shell as a sub-process with that profile applied, and the sub-process will inherit that filter. So if we run this, well, actually, let's first just do a normal mount. So we'll just try to mount a tmpfs on /mnt. Fine, no problem there, let's unmount it. Now, let's get ourselves that restricted shell. Everything is working as normal there, except, if I try to do this, I get ENOANO, which is the "No anode" error. And that's kind of the most basic example of just blocking a syscall using seccomp. For more complex use cases, you definitely want to use something like libseccomp to do a lot of the abstraction for you.
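Since the seccomp.c from the demo isn't reproduced here, here's a sketch of the same idea: a raw cBPF filter installed via prctl, no libseccomp, that answers the mount syscall with ENOANO and lets everything else through. The mount syscall number and audit-arch constant below are assumed for x86_64; other architectures are simply allowed through, which is exactly the kind of per-arch wrinkle the next paragraph warns about:

```python
import ctypes, ctypes.util, errno, struct

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

PR_SET_NO_NEW_PRIVS = 38
PR_SET_SECCOMP = 22
SECCOMP_MODE_FILTER = 2
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000
AUDIT_ARCH_X86_64 = 0xC000003E
NR_MOUNT = 165  # x86_64 only; syscall numbers differ per architecture!

# cBPF opcodes: load word absolute, jump-if-equal, return constant.
LD_W_ABS, JEQ_K, RET_K = 0x20, 0x15, 0x06

def insn(code, jt, jf, k):
    return struct.pack("HBBI", code, jt, jf, k)  # struct sock_filter

# struct seccomp_data: syscall nr at offset 0, audit arch at offset 4.
FILTER = b"".join([
    insn(LD_W_ABS, 0, 0, 4),                     # load the arch token
    insn(JEQ_K, 1, 0, AUDIT_ARCH_X86_64),        # x86_64? skip next insn
    insn(RET_K, 0, 0, SECCOMP_RET_ALLOW),        # other arch: allow all
    insn(LD_W_ABS, 0, 0, 0),                     # load the syscall number
    insn(JEQ_K, 0, 1, NR_MOUNT),                 # mount? fall through
    insn(RET_K, 0, 0, SECCOMP_RET_ERRNO | errno.ENOANO),  # -> ENOANO
    insn(RET_K, 0, 0, SECCOMP_RET_ALLOW),        # everything else: allow
])

class SockFprog(ctypes.Structure):
    _fields_ = [("len", ctypes.c_ushort), ("filter", ctypes.c_char_p)]

def deny_mount_then_try():
    """Run in a forked child: install the filter, then attempt a mount
    and return the errno the attempt produced (the filter is permanent
    for this process and inherited by its children)."""
    prog = SockFprog(len(FILTER) // 8, FILTER)
    libc.prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)  # required when unprivileged
    if libc.prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ctypes.byref(prog)) != 0:
        raise OSError(ctypes.get_errno(), "could not install seccomp filter")
    if libc.mount(b"none", b"/nonexistent", b"tmpfs", 0, None) != 0:
        return ctypes.get_errno()
    return 0
```

On x86_64 the mount attempt comes straight back with ENOANO from the filter before the kernel even looks at the arguments, just like in the demo; as the talk says, use libseccomp for anything real.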
One thing to keep in mind is that syscall numbers differ between architectures, and can even differ within the same architecture, depending on exactly what mode you're in. So you want to be careful with that and not make too many assumptions. For example, in this case I could probably use the 32-bit version of the mount syscall and still bypass that filter. There are a lot of corner cases around syscalls that you need to keep in mind; that's why there are good libraries around this, and you should be using them. Some of those libraries also let you optimize your BPF code for performance, because if you want to allow, say, 200 syscalls, you don't really want one comparison after the other — that can quickly make things slow — so there is some amount of optimization that exists. And then you have sub-architectures: think about running a 32-bit user space on a 64-bit kernel. I think we had cases where it was a 64-bit kernel with a 32-bit user space, then a container with a 64-bit user space, and then another container with a 32-bit user space. Try writing a correct seccomp filter for that and making sure everything lines up — even libseccomp sometimes gets confused in these scenarios, or used to — so this is really not a trivial exercise. New kernels, thanks to work by Arnd Bergmann, will make sure that the same syscall number is used across most architectures. The only exceptions are Alpha — but everyone has a DEC Alpha at home, I assume — and IA-64, and I think the MIPS architectures as well. So nothing that's widely in use... although I do have an IA-64 server right here next to me. But yeah, it's not as easy as it looks. — There's a lot of MIPS equipment out there. — Yeah, there is.
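The per-architecture point is easy to see with a syscall everything has: getpid. Here's a small sketch (the number table is the assumption — these are the getpid numbers I'd expect on each listed architecture) showing that the same call is invoked by a different number depending on where you run:

```python
import ctypes, os, platform

libc = ctypes.CDLL(None, use_errno=True)

# getpid(2): one syscall, four different numbers
SYS_getpid = {
    "x86_64": 39,
    "i686": 20,     # 32-bit x86
    "aarch64": 172,
    "armv7l": 20,   # 32-bit ARM EABI
}

arch = platform.machine()
if arch in SYS_getpid:
    # invoke getpid by raw number; must match the libc wrapper
    pid = libc.syscall(SYS_getpid[arch])
    print(arch, pid == os.getpid())
```

A filter that hard-codes one column of this table silently means something different on every other architecture — which is the bug class the libraries exist to prevent.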
And think of all your Wi-Fi routers and such — most of those are still MIPS; some of the newer stuff is ARM, but MIPS is actually common, so you need to be careful there. People actively use LXC containers on MIPS and have run into some of these issues in the past. If I remember correctly, MIPS indicates the ABI by prefixing the syscall number with a 4, 5 or 6 — each ABI starts at a different base — and IA-64 has an offset of 1,024 or something. So generating a correct seccomp filter can be a non-trivial exercise. Even if you use the notifier example I mentioned before and read the raw syscall information from it, the kernel gives you a syscall number back, and you need to know what architecture you're actually on to know which syscall you're being notified about. Otherwise you're going to be very confused when you think it's the mknod syscall, but on this architecture that number is actually mount. That can actually happen. But so much for seccomp. Okay, next up is capabilities. That should be a slightly faster topic, as long as we don't go too much into the set theory that's involved in some of this stuff. Yes, capabilities. So, once upon a time... no. Traditionally, when we think about performing operations that have an effect on the whole system, or can be considered critical in some way — shutting down the system, mounting a device and so on — they're guarded by some sort of privilege requirement, and the traditional concept for Linux, for Unix, POSIX I guess, is being root, being UID 0.
And that means — it's not strictly true anymore, but basically — if you're root, you're allowed to do anything you want with the system, which is non-ideal. People familiar with Linux or traditional Unix systems will remember all the issues we've had with setuid bits. If you want to enable an unprivileged user to perform an operation that requires privilege, you need some mechanism for that, and the traditional workaround — some people nowadays would call it a hack — is to set the setuid bit on a binary and have that binary owned by root. When an unprivileged user calls that binary, it runs with elevated privileges and can perform the given operation. But obviously that means any flaw in that binary can potentially be used to attack the system. And that has of course never happened, ever, in the whole history of Linux... So it was a bit of an issue, people realized this, and people tried to come up with different ideas for solving the problem. People still have strong opinions about this one, and Linux has gone in a specific direction. You can like it or not, but that direction is capabilities — which differ from what you'd think capabilities are if you only know them from the computer science literature; they're not really capabilities as you would classically do them. But they are a way of splitting up the root privilege into separate, distinct privilege sets. The closest you can come to being root with capability support is having CAP_SYS_ADMIN, which among other things allows you to mount file systems. And I obviously have all of this in my head — I'm definitely not looking at the man page right now.
Oh yeah, CAP_SYS_ADMIN has a long list — if you have a terminal in front of you and type man capabilities, you can see the long list of what you can do with CAP_SYS_ADMIN; it's basically the new root. It tends to be the one we use whenever we don't have anything better. — Yeah, every time we don't know how to guard something in the kernel, we get another ns_capable(CAP_SYS_ADMIN) check. — But there are a bunch of finer-grained capabilities. We have CAP_SYS_RESOURCE, which lets you override resource limits; CAP_SYS_TIME, which lets you set the system clock; and so on. There's a long list: CAP_CHOWN, CAP_SETUID, CAP_SETGID — change ownership and so on. So it definitely has some benefits. And capabilities have an interesting property since the addition of user namespaces. Before, I briefly mentioned that user namespaces isolate UIDs and GIDs, such that root inside the container isn't root outside the container. They do the same thing with capabilities: when you ask the question "do I have this capability?", what you're really asking — what the kernel understands — is "do I have this capability in the relevant user namespace?" If you do, you can perform the operation; if you don't, you can't. But the world is obviously not that simple. You're not always asking the kernel "do I have the capability in my current user namespace?" For some operations — for example calling mknod, creating device nodes — what the kernel actually checks is that you have the capability in the initial user namespace. So some capabilities, in some circumstances, are not checked against the user namespace you're currently in, but against the initial user namespace, because the operation affects the whole system.
So capabilities are in this slightly weird state where you have to have the capability in the right user namespace, but in general it's "a capability in a given user namespace". And they come in different sets. Stéphane has listed them here: effective, inheritable, permitted, ambient and bounding capabilities. And there is set theory on the capabilities man page — I'm not joking. The interesting set for us right here is the effective set, which is the one the kernel looks at when you ask "hmm, do I have this capability in the given user namespace?" You can expand that to "do I have the effective capability in that user namespace?" — that is, do I have this capability right now. Inheritable capabilities — you'd think those are the capabilities you take with you when you exec a new process, but that's not actually how it works; it's also pretty complicated. Ambient capabilities were invented so that you can preserve capabilities safely across execve, and there are some restrictions there too. So you see, it's interesting. Usually for privileged containers you drop a bunch of capabilities, especially CAP_SYS_ADMIN, because if the container has CAP_SYS_ADMIN, it's game over anyway. For unprivileged containers, as I said, capabilities are per user namespace, so — especially again for system containers — you don't really need to drop a lot of capabilities; you usually only do it when you want to lock down your process or container even more than just making it unprivileged. And last but not least, we have file capabilities. I mentioned the setuid bit that you can set on certain binaries to make it possible for unprivileged users to perform privileged operations.
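All five sets can be inspected for any process through procfs, which is handy for following along with the set theory. A minimal sketch (helper names are mine; the hex bitmasks are what the kernel exposes in /proc/&lt;pid&gt;/status, one bit per capability number from linux/capability.h):

```python
def capability_sets(pid="self"):
    """Parse the capability bitmasks out of /proc/<pid>/status:
    CapInh, CapPrm, CapEff, CapBnd (and CapAmb on newer kernels)."""
    sets = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Cap"):
                name, value = line.split()
                sets[name.rstrip(":")] = int(value, 16)
    return sets

CAP_NET_ADMIN = 12  # bit number from linux/capability.h

def has_cap(bitmask, cap):
    """True if capability number `cap` is present in `bitmask`."""
    return bool(bitmask >> cap & 1)

caps = capability_sets()
print(has_cap(caps["CapEff"], CAP_NET_ADMIN))
```

An unprivileged shell typically shows an all-zero CapEff, while root in a fresh user namespace shows a full mask — which is what the capsh demo later in the talk is displaying in friendlier form.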
The setuid bit was an all-privilege-or-no-privilege-at-all kind of mechanism, and file capabilities are a way to selectively delegate a specific subset of privileges. For example, ping nowadays has CAP_NET_RAW, not CAP_NET_ADMIN. — CAP_NET_RAW. — Thank you, CAP_NET_RAW. So you can set a file capability on a given binary, and when you execute that binary, the capability gets raised and you can perform the privileged operation. It's a more fine-grained setuid — don't quote me on that. And they're also namespaced now, which is nice. They weren't for a long time, but that's work a former colleague of mine actually did. You can now set file capabilities inside a user namespace, which is obviously pretty great, because before, if you unpacked or untarred a root file system, the file capabilities weren't preserved or couldn't be set, and now that's actually possible. We make quite a bit of use of this in LXD itself. And Stéphane can now give you a demo of how this works. Okay, so, capabilities. Let's just reset this once again. First, let's see what we have: I'm running as a completely unprivileged user on my laptop, so I've got nothing. Now I can unshare a user namespace — and a network namespace as well — remap root to my user, and look at what we have now. Okay, I've got everything. capsh would normally show us the entire list; I've just shortened it here, but it means I've got everything. That means I can create a new network device in the network namespace I just created — and if I look with ip link, the device is now there. Now let's run capsh again, and this time drop CAP_NET_ADMIN and spawn a new shell with that. If we look at what capabilities we have: everything minus CAP_NET_ADMIN. And sure enough, if I now try to create another dummy device, I'm no longer allowed to, and obviously it's not there; we only see the previous one.
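File capabilities are stored in the security.capability extended attribute on the binary, so you can check for them without any special tooling. A small sketch (the /bin/ping path is an assumption — the location, and whether your distro uses file capabilities on ping at all, varies):

```python
import os

def file_capabilities(path):
    """Return the raw security.capability xattr for a binary, or None
    if it has no file capabilities (or xattrs are unsupported)."""
    try:
        return os.getxattr(path, "security.capability")
    except OSError:
        return None

# a ping carrying CAP_NET_RAW would return a VFS_CAP_REVISION_* blob
print(file_capabilities("/bin/ping"))
```

The namespaced variant mentioned above (v3 file capabilities) adds the root UID of the owning user namespace into that same blob, which is what makes setting them from inside a container possible.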
But that's what dropping capabilities lets you do: it lets you block off some specific part of the kernel API. We could also have done it the other way around — allow only the capabilities we actually need and drop everything else — but in this case it shows how capability dropping works on Linux. — And every user namespace, I think we didn't mention this: if you create a user namespace, you start out with the whole set of capabilities. — Yep, you get everything. Which confuses some people, because they assume that if they have CAP_SYS_TIME, they're allowed to change the global system clock. Well, no, actually you're not, because the global system clock checks CAP_SYS_TIME against the initial user namespace, not against your current user namespace. So even though you've got the capability in your user namespace, it doesn't actually let you change system time. A number of programs do the wrong check there and get super confused. But yeah, a user namespace gives you the entire set of capabilities against that particular namespace. Yep. Next up: Linux security modules. This is what I briefly touched on before. Maybe it's just me, but I've always seen LSMs as somewhat optional for containers — not in the sense that you shouldn't use them, but if you think about what a container is, you wouldn't say something isn't a container just because it doesn't use a Linux security module, I think. And there are different ones: you have AppArmor and you have SELinux, which are the two big ones people think about, and at least as of now, distros usually use either AppArmor or SELinux, but not both.
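Going back to the point just made — that a fresh user namespace hands you the full capability set — that behaviour can be reproduced programmatically. A hedged sketch (it may bail out with "unavailable" on kernels or sandboxes that forbid unprivileged user namespaces, and the exit-code convention is mine):

```python
import ctypes, os

CLONE_NEWUSER = 0x10000000
libc = ctypes.CDLL(None, use_errno=True)

def root_in_fresh_userns():
    """Fork; the child unshares a user namespace, maps itself to root,
    then checks it is uid 0 with a non-empty effective capability set.
    Returns 0 = confirmed, 1 = unexpected, 2 = userns unavailable."""
    uid, gid = os.getuid(), os.getgid()  # capture before unshare
    pid = os.fork()
    if pid == 0:
        if libc.unshare(CLONE_NEWUSER) != 0:
            os._exit(2)  # kernel or policy forbids unprivileged userns
        try:
            # an unprivileged process may map exactly its own IDs
            for path, data in (("/proc/self/setgroups", "deny"),
                               ("/proc/self/gid_map", f"0 {gid} 1"),
                               ("/proc/self/uid_map", f"0 {uid} 1")):
                with open(path, "w") as f:
                    f.write(data)
        except OSError:
            os._exit(2)
        with open("/proc/self/status") as f:
            eff = next(int(line.split()[1], 16)
                       for line in f if line.startswith("CapEff"))
        # root inside the namespace, with a full effective set
        os._exit(0 if os.getuid() == 0 and eff else 1)
    return os.WEXITSTATUS(os.waitpid(pid, 0)[1])

print(root_in_fresh_userns())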
So the major LSMs are SELinux, AppArmor and Smack. Linux security modules let you implement additional security: not just DAC, discretionary access control, but mandatory access control. That's obviously a big topic in security; we're not going to go into detail here, but Linux doesn't have just one mandatory access control mechanism, it has a bunch of them: SELinux, AppArmor, Smack. We now even have a BPF-based LSM, I think, even though it's not yet concerned with access control, I guess. For example, on Ubuntu we use AppArmor: we have an AppArmor profile for all of our containers, which we load by default and which blocks a bunch of operations we deem unsafe. For privileged containers it's a much bigger deal, because for privileged containers the Linux security module actually does the heavy lifting of making your container even just remotely secure. If you're running a privileged container without a Linux security module and a decent profile, you might as well not run a container at all — unless you have a specific use case, I mean. For unprivileged containers, it's an additional safety net, I'd say; that's always how we've used it. So every unprivileged container also has an AppArmor profile — or, if you're running on Fedora for example, an SELinux profile — as an additional safety net. One of the things we've been interested in, and work that is slowly going upstream, is making it possible to stack LSMs. Similar to how with namespaces you can start a container and then another container inside it — nested unprivileged containers, a user namespace inside a user namespace — there's also the use case of an unprivileged container that runs another unprivileged container...
...where in the first one you want to run AppArmor and in the second one SELinux, or the other way around. Ideally we'd end up, at some point in the future, in a scenario where you have one container running AppArmor, another running SELinux, another running AppArmor, and so on — mix and match through the whole stack. I don't know how feasible that is, but at least I hope we can end up with one level of stacking, where you have AppArmor on the host and SELinux inside a container. That would already help a lot, because right now, as far as I understand, that's not something you can do. But this is work that takes its time. There is — or has been — some resistance from maintainers, and it's also difficult to implement correctly, because some LSMs used to have, maybe still have, expectations about what level of the stack they're at, like who gets the last say. Yeah, and with AppArmor we at least have AppArmor namespaces, so we can nest AppArmor profiles and each container can get its own profile. That's what we do when we nest containers, in LXD at least. So yeah, LSMs are a pretty big security mechanism: a core, must-have feature when using privileged containers, and an additional safety net when dealing with unprivileged containers. But Stéphane can give you a demo. Yep. Okay, so since I'm running on Ubuntu, this demo is going to be about AppArmor. First I'll show what the current LSM label is in my case, which is unconfined — that's what you normally get outside of a container or outside of an application-specific profile. Now I'm going to spawn bash under the LXD container default profile, which is already defined on my system, at which point we can see that it's applied.
That particular restriction isn't super useful in itself; the point is to show that even an unprivileged process is allowed to switch to a profile that's already loaded on the system. Now let's spawn an actual container. Okay. Now that the container is running, I'm going to grab its process ID, and from the outside I can look at that particular process and what its label is. You can see this one is rather long. To explain what it all means: the beginning is the AppArmor namespace — lxd-c1_&lt;/var/snap/lxd/common/lxd&gt; — that's the namespace. The next part is the profile applied at the base of that namespace, which is another auto-generated lxd-c1 profile; that's the actual policy LXD generated for that container. And on top of that, within that namespace, we're unconfined. So from inside the container it looks like you're effectively unconfined, even though the parent profile applies to you. We can check that by getting a shell inside that container: if we look at /proc/self/attr/current now, we see unconfined. And we can even see that the container has profiles loaded — it's got 28 profiles loaded inside the container, inside that AppArmor namespace. And from within the container we can poke around a bit and figure out that we actually are confined: we can look at ns_name, which is the name of the AppArmor namespace we're part of; we can see how many levels of namespacing are currently applied — one, and it can only be one, because AppArmor doesn't support more than one level of namespacing right now. And there's another flag in there — ah, "not stacked" — which tells us we've got the namespacing in place but not stacking. So that's what you currently get with AppArmor on modern distros.
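The label the demo keeps reading is just a procfs file, so checking your own confinement is a one-liner. A small sketch (returns None where no major LSM exposes attr/current):

```python
def lsm_label(pid="self"):
    """Return the LSM confinement label of a process, the way the
    AppArmor demo reads it, or None if no label is exposed."""
    try:
        with open(f"/proc/{pid}/attr/current") as f:
            # AppArmor appends a mode like " (enforce)";
            # SELinux NUL-terminates its context string
            return f.read().rstrip("\x00\n")
    except OSError:
        return None

print(lsm_label())  # e.g. "unconfined" outside any profile
```

Inside the demo's container this would print "unconfined" too — which is exactly the illusion the AppArmor namespace creates, while the parent profile still applies from the outside.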
Next we've got resource limits with cgroups. — Yep. Hopefully at some point we'll have the stacking with SELinux and AppArmor. — Yes. I'm quite looking forward to being able to run things like Android, or Red Hat and CentOS with their own confinement, on top of an AppArmor system — or the other way around, running on a Red Hat based distro and then running confined Ubuntu or SUSE or other distros that use AppArmor. Right. So, cgroups. The resource limitation feature, also something most container runtimes use — it makes sense if you don't want your container to eat all of your memory or hog all of your CPU. It's mostly concerned with limiting various system resources, obviously: CPU and block I/O are examples; there's the cpuset controller; you can fine-tune exactly how much memory your container is supposed to get with the memory cgroup, and so on. We're in a state where we're dealing with two major versions of cgroups, which has caused quite a bit of churn in user space. Cgroup v1 is the thing most distros still use — Fedora has switched over to being cgroup v2 only, and at some point I guess most distros will follow, but most major distros right now still use cgroup v1. In cgroup v1 you have the concept of a resource controller: you mount the cgroup file system, but you're not done at that step — you also need to mount a given controller, where a controller is something like cpuset or memory, and the controller is what you're actually interested in. You can mount all controllers into the same mount point, so they all show up together, but for whatever reason that's not how it's usually done with cgroup v1. What we have standardized on over time is mounting each cgroup controller into what is called a hierarchy — a separate cgroup hierarchy.
So you have /sys/fs/cgroup/memory, /sys/fs/cgroup/cpuset, /sys/fs/cgroup/cpu, and each of those directories under /sys/fs/cgroup is its own controller; under each of them you can create sub-cgroups — a cgroup for your container and so on. It's quite a bit of work, and if you want to code this, it involves a bunch of loops and moving things around. So there was some complexity associated with this, and there were also issues with how processes on different levels of the hierarchy could compete with each other for resources — a child process further down the hierarchy could compete with a process higher up — which made cgroup v1 less than ideal for limiting system resources. Cgroup v2 aims to rectify all of these problems. It has caused a bit of churn, as I mentioned, because the user-space experience is so different from what you're used to with cgroup v1, and also you now have init systems like systemd that are not just an init system but essentially also a cgroup manager — for good reasons, obviously — that have specific expectations. Older versions of systemd don't know about cgroup v2; newer versions do. So if you're on a system that boots with only a cgroup v2 mount point, like Fedora, but you then run a distro whose init binary only knows about cgroup v1, you have a bit of a problem right there — but there are different strategies to solve it. I don't need to go into the details; it's a pseudo file system, but it's quite important for containers, for resource limitation. We've been supporting cgroup-v2-only systems for quite a while now, and most other container runtimes should have caught up by now as well. Yeah, okay.
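The v1-versus-v2 difference is directly visible in /proc/self/cgroup, which is also what the demo below keeps inspecting. A hedged sketch of parsing it (the "pure v2" heuristic — a single entry with hierarchy ID 0 and an empty controller list — follows the documented format):

```python
def cgroup_memberships(pid="self"):
    """Parse /proc/<pid>/cgroup into (hierarchy_id, controllers, path)
    tuples: many lines on a v1/hybrid host, one '0::/...' line on a
    pure cgroup v2 host."""
    entries = []
    with open(f"/proc/{pid}/cgroup") as f:
        for line in f:
            # format is  hierarchy-ID:controller-list:cgroup-path
            hid, controllers, path = line.rstrip("\n").split(":", 2)
            entries.append((int(hid), controllers, path))
    return entries

def is_pure_v2(entries):
    return len(entries) == 1 and entries[0][:2] == (0, "")

memberships = cgroup_memberships()
print(memberships, is_pure_v2(memberships))
```

On the hybrid Ubuntu 20.04 setup from the demo you'd see one line per v1 controller plus a trailing hierarchy-0 line for the unified (v2) hierarchy.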
Let's just show how cgroups behave. So, I'm on Ubuntu 20.04, which is on a hybrid setup. If we look at /proc/cgroups, we can see I still have all of the cgroup v1 controllers listed there. And if I look at /proc/self/cgroup, we'll see those same controllers again, with the last one, 0, being the unified hierarchy for cgroup v2. And if we look at /sys/fs/cgroup, as Christian mentioned, they're all mounted separately, with "unified" being cgroup v2. Okay, so let's play with them a bit. Let's unshare a new user namespace and map root to my user. There we go. Now I want the process ID of this shell, and I'm going to switch to another terminal and become real root in that one — first time I do this in the demo; everything else was inside the user namespace — and create a new subdirectory inside the pids controller. So that's cgroup v1, okay? I'll call it "demo". Then we write the process ID of the other shell into the cgroup.procs file in there. Now if I switch back to my first terminal and look at /proc/self/cgroup, we can see that — where's pids? — yeah, pids is now listed as /demo. It's been moved. And now we're going to apply some limits. Let's write 1 into pids.max in the pids controller — the maximum number of processes — and try doing anything. Yeah, that's not working so well, huh? We can't fork anymore, because we've got a limit of one process and there's already one process in there, so it just fails. Now we can move that limit back up to, say, 5, which should make it quite a bit happier once bash retries. There we go, now we're good. So that's the basics of setting up cgroups — obviously that's what your container manager normally does for you. We still have that container c1 I created earlier, so let's apply some limits to it real quick. Actually, let's first go see what it looks like: if I go in there, I've got 16 gigs of RAM.
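The sequence in that demo — create a sub-cgroup, move a task into it, write pids.max — is nothing but mkdir and file writes. Here's a sketch of the same steps; it's pointed at a scratch directory so it runs unprivileged, whereas against the real /sys/fs/cgroup/pids you'd need root and the kernel would actually enforce the writes:

```python
import os, tempfile

def limit_pids(controller_root, name, pid, limit):
    """Mimic the demo: create a sub-cgroup under the pids controller,
    move a task into it, then cap how many processes it may have."""
    cg = os.path.join(controller_root, name)
    os.makedirs(cg, exist_ok=True)
    with open(os.path.join(cg, "cgroup.procs"), "w") as f:
        f.write(str(pid))        # move the task into the new cgroup
    with open(os.path.join(cg, "pids.max"), "w") as f:
        f.write(str(limit))      # fork() beyond this limit now fails

# dry run against a scratch directory instead of /sys/fs/cgroup/pids
root = tempfile.mkdtemp()
limit_pids(root, "demo", os.getpid(), 5)
```

With limit set to 1 and one task already inside, any fork fails with EAGAIN — which is exactly the stuck bash from the demo; raising pids.max back to 5 unblocks it.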
And if I look at the processors, I've got 4 CPUs visible in there. Now let's change that to 2 CPUs, and if we exec back in there and look at the processors... that didn't work. Sweet. And memory is down to one gig. That's another funny aspect of the cpuset cgroup, which I was hoping not to have to cover, but apparently I do: if you remove CPUs, you need the entire tree to remove them properly. It doesn't let you yank a CPU out from under the tree — if anything in that container created a sub-cgroup, then you can't remove it, which is what happened here. If we restart the container, then the limit should be correct. We've got some work planned in LXD itself to try to mitigate this kind of issue by automatically figuring out the right thing to do for the entire tree within a container, but it gets pretty hairy pretty quickly. So that's something to keep in mind with cpuset: it's a bit weird when removing CPUs or reconfiguring. — You have a nested container in there, right? — I don't have a nested container, but I've got systemd running in there, which probably created a slice of some kind, which then pins the CPUs. So that's something that can happen and something worth keeping in mind: you can't always reconfigure your cgroups. Okay. So, yeah, we've got something weird on the audio bridge, but I think we'll just keep going and hopefully things will be okay. So, the last one of these. This is a slide I once made; we don't need to go into a lot of detail, but this is one way of how you usually start a container, or what's involved in writing a container runtime. You can do this in different ways — you could implement it as a state machine or what have you — but looking at it from a more procedural perspective, what you usually do is place a bunch of barriers to synchronize.
So basically you have your container manager, a supervising process, which starts and then forks off a child, and this child process will eventually become the container once it execs. But before it can actually exec, there is a lot of work involved in setting up the container, and a lot of coordination between the parent process and the child process to make sure everything gets set up correctly. For example, because we often need to interact with a container by attaching to its namespaces, the container manager usually preserves a bunch of namespace file descriptors so it can quickly setns() into the container's namespaces. It usually also sets up cgroups and networking — although I worked on a kernel patch a while ago, now upstream, that lets you spawn a process directly into a given cgroup by passing a file descriptor to the new clone3 syscall. Instead of having to create the cgroup and then move the process into it manually, you can create the container right in that cgroup, which is performance-wise a pretty big improvement because of how the locking is done inside the kernel. And then, obviously, you specify the namespaces you want the container started with when you create the child process. But as we said before, some namespaces require you to unshare instead of clone, because of ordering issues, or simply because — the time namespace is a good example — clone hasn't yet been extended so that you can actually set an offset at process creation time. That's something we need to do in the future. So right now you need to unshare, write the offsets for your time namespace, and then setns into it.
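The fork-then-synchronize dance being described can be sketched with nothing but fork and a pipe acting as the barrier. This is a toy skeleton of the handshake, not any particular runtime's implementation — the real setup steps are left as comments where they would go:

```python
import os

def container_startup_skeleton():
    """Toy version of the manager/child handshake: the child blocks
    until the parent has done its half of the setup, then would go on
    to mounts, LSM, seccomp and finally the exec of init."""
    r, w = os.pipe()            # parent -> child barrier
    pid = os.fork()
    if pid == 0:                # child: the future container
        os.close(w)
        os.read(r, 1)           # wait for the parent's setup to finish
        # ...child would now set up mounts, write its LSM profile,
        # load a seccomp filter, and exec the container's init...
        os._exit(0)
    os.close(r)
    # parent/manager half: this is where it would write
    # /proc/<pid>/uid_map, create cgroups, and move network
    # devices into the child's namespaces
    os.write(w, b"go")          # release the child
    os.close(w)
    return os.waitpid(pid, 0)[1]

status = container_startup_skeleton()
```

Real runtimes typically use a socketpair rather than a one-shot pipe, since the handshake runs in both directions over several stages — but the barrier idea is the same.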
Also, the container manager needs to write the ID mappings after it has created the user namespace, so that the container can setuid to root inside its user namespace and become privileged over it. Then the child can go on to set up mounts and write its LSM profile — that's something the container process needs to do itself — set up seccomp, and configure network devices, with the manager, for example, handing one end of the network device pair into the container. So it's a multi-stage process that's tricky to get right. I mean, it's not impossible — none of this is magic — but it's something that requires a lot of care to not introduce security issues. And finally, when this is all done, you call exec, the init process you chose to be your container starts up, and at that point you're done. But it's a lot of code. I've looked at most container runtimes, and any serious runtime that wants to be as generic and secure as possible has to do quite a lot of work to get this right, especially when you consider unprivileged containers coming into the mix. But yeah — Stéphane. Okay, I guess I can do this one. So, to recap, just before we take a few questions in the couple of minutes we might have left: containers are a user-space fiction. There is no concept of a container in the Linux kernel. It's all done in user space by combining all of the different technologies we've shown you so far. At what point you call what you've built a container is kind of up to you — which is also a bit of a problem when evaluating different technologies: you have to understand the different pieces they're using, and how they're using them, to know whether the security guarantees they might be advertising are correct or not.
Just saying "we're okay running untrusted workloads on our infrastructure as long as they're in containers" is definitely a problem, because that can mean anything from the most bare privileged container, which could be a massive security issue, all the way to a fully set up user namespace plus all the restrictions on top, which would actually be pretty safe. So yeah, something to keep in mind there. Building safe container runtimes is very hard. There are a lot of moving pieces, a lot of weird corner cases, a lot of differences based on kernel versions. In our case we support all the way back down to 2.6.32, and that's not always super fun, because there have been a lot of changes since, and some things that work fine on older kernels don't on newer ones, and vice versa. So that's a bit of an issue to keep in mind too. Architectures matter, especially if you care about seccomp. If you're going to write seccomp policies — well, you shouldn't be doing it directly, you should use one of the libraries — and even then you need to keep in mind that multiple personalities are a thing on Linux, and that blocking a syscall for one personality doesn't necessarily block it for the others; there might be ways around your profile that way. Security matters: containers share the kernel, so privileged containers are a very bad idea, because they run as real root and can jeopardize the security of the entire platform. You also want to make sure you don't pass unsafe devices through. You don't want to pass /dev/sda to a container, because even if it can't mount it, it can still write to it, which would let it do very nasty things to your entire system and potentially escape and gain privileges, or get at data it's not supposed to access. You need to be careful about anything you expose to containers and decide whether it's fine given your security model.
Resource management is also important. Denial of service attacks are a thing. They're not necessarily as disruptive as stealing data or breaking your system, but they're still quite problematic. So you want to make sure you properly configure all the resource limits so that a container can't easily run the system out of resources with something like a fork bomb, for example. And lastly, don't reinvent the wheel. I think we mentioned it a few times, but there are a bunch of libraries around for cgroup management, seccomp, SELinux, AppArmor. They all have libraries, they all have examples, they all have existing solid policies you should build on. And if possible, just don't write your own runtime; use one of the existing ones. Or you can use liblxc, which is what we wrote, which lets you choose what pieces you want to use and what configuration you want for all of that. Doing it through a library means that a lot of the mistakes we learned from in the past, you won't have to learn for yourself the hard way. Okay, so it looks like we've got about five minutes left and we've got a few questions. I did prioritize them on our side, so I'll go through some of those. Christian, you can always jump in. People are adding more questions, but let me try to... oh, someone just messed with the priorities. Please don't do that. I did prioritize them before. Okay, well, someone is playing with them right now; whoever that is, please don't. Okay, so that one I wanted to skip, that one I wanted to skip... yeah, someone really messed up the priorities that I applied. That was probably me. Sorry. Okay, I'm just going through them again; sorry, this will take a tiny bit of time. Okay, so there was a question about the difference between cgroup resource restrictions and cgroup namespaces. So, cgroup namespaces are effectively a feature on top of cgroups.
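The fork-bomb point can be made concrete. Here's a small illustrative sketch using Python's standard `resource` module to clamp the per-user process-count limit; in practice container managers also set a per-container cap through the cgroup `pids` controller (`pids.max`), which is the more robust mechanism.

```python
# Sketch: clamp RLIMIT_NPROC so a runaway fork loop hits a ceiling instead
# of exhausting the system's process table. This is a per-process/per-user
# rlimit; per-container limits are normally done with the pids cgroup.

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)

# Choose a cap: 4096 unless the current soft or hard limit is already lower.
cap = 4096
if hard != resource.RLIM_INFINITY:
    cap = min(cap, hard)
if soft != resource.RLIM_INFINITY:
    cap = min(cap, soft)

# Lowering the soft limit is always permitted for an unprivileged process.
resource.setrlimit(resource.RLIMIT_NPROC, (cap, hard))
print(resource.getrlimit(resource.RLIMIT_NPROC)[0] == cap)  # True
```

After this, fork() fails with EAGAIN once the user has `cap` processes, instead of taking the whole host down.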
You would normally create a cgroup in all the resource controllers for your container, then apply the restrictions that you want for the entire container, and then create a cgroup namespace. That cgroup namespace means that inside the container, when processes look at /proc/self/cgroup, everything shows up as /, as if they're at the root of the cgroup tree. They can then create sub-entries in their cgroup tree that are more restrictive than what was applied to the container itself. So the idea is that you can then run systemd slices, you can run sub-containers, and all of those will just work and will be able to create their own sub-cgroups that are more restricted than their parent. Cgroup namespaces by themselves are not a resource restriction feature but more of an information protection feature, you could say. Right, I mean, it does plug a small information leak, in that you can't see where you are in the host's cgroup tree anymore. But more importantly, it lets you easily create sub-entries without having to think about the very long path you're in, like /container/name/something, while the view you've got in /sys/fs/cgroup is actually just a subset of that. Previously you needed to figure out the match between your process's cgroup and the view you've got in /sys/fs/cgroup, which was very confusing for a while. That's what the cgroup namespace fixes, by lining up everything inside the container so that it just makes sense and matches what you would expect on the host. There was another question about Docker using all of those kernel features, saying that Docker was based on LXC before. So that's true: Docker at the beginning was indeed based on LXC, and it was effectively a wrapper around LXC. That's not been true for years now. They've re-implemented things in libcontainer, which then turned into containerd, which then runs runc.
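The /proc/self/cgroup view described above is easy to show. A small sketch (the example paths are illustrative; the `hierarchy:controllers:path` line format is the real one, with an empty controllers field on the cgroup v2 unified hierarchy):

```python
# Sketch: what /proc/self/cgroup looks like for the same process seen from
# the host versus from inside a cgroup namespace created at the container's
# own cgroup. The long host path collapses to "/" inside the namespace.

def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup lines of the form hierarchy:controllers:path."""
    view = {}
    for line in text.strip().splitlines():
        hierarchy, controllers, path = line.split(":", 2)
        view[controllers or "unified"] = path
    return view

host_view = "0::/machine.slice/container-demo.scope"  # seen from the host
container_view = "0::/"                               # seen inside the cgroup ns

print(parse_proc_cgroup(host_view)["unified"])       # /machine.slice/container-demo.scope
print(parse_proc_cgroup(container_view)["unified"])  # /
```

Inside the namespace, a nested manager (systemd, a sub-container runtime) can simply create children under "/" without knowing the real host path.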
So the entire architecture of Docker has changed. Most Docker deployments are using privileged containers. Don't be fooled into thinking that not passing --privileged to Docker means that you're running unprivileged. You're not; you're just not quite as privileged as if you had passed --privileged. Unless you manually tell Docker to do so, Docker does not use the user namespace out of the box, which means that you are in a potentially tricky situation. Docker containers, because they are single-process, can however benefit from much tighter capability restrictions, and AppArmor and other LSM restrictions. You can also run those processes as an unprivileged user inside the privileged container, which can alleviate a number of issues. But unless you've manually configured your Docker to use unprivileged containers, you're not using unprivileged containers, and you should be quite careful with that. Let me check. "The principle of least security." No, I misspoke; this is why I made fun of myself before. It's the principle of least privilege, not the principle of least security. I mean, you could also have a principle of least security, it's just... I don't know how long you'd stay employed. Then there was a question about cgroups with real time: whether real-time runtime limits are overruled by anything, bare metal or outside. Yeah, I don't know about the exact interaction between the RT flag and, presumably, the CPU cgroup in that case. Not sure if you know anything about that, Christian. Question number 11? I'm only seeing... ah, okay, there's a second page. "Are RT runtime limits overruled by anything bare metal when a non-RT process needs to do some CPU-intensive task?" Real-time runtime limits are... yeah, I would need to know more details. We've not looked into that part particularly closely, so we can probably just vaguely cover what's possible around cgroups for CPU.
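The capability-restriction point is easy to inspect on a running system: the kernel reports a process's effective capabilities as the CapEff hex bitmask in /proc/self/status. A small sketch that decodes such a mask (the bit numbers are the kernel's real capability constants; only a few well-known capabilities are listed here for illustration):

```python
# Sketch: decode a CapEff bitmask from /proc/<pid>/status into capability
# names. CAP_CHOWN is bit 0, CAP_NET_ADMIN bit 12, CAP_SYS_PTRACE bit 19,
# CAP_SYS_ADMIN bit 21; this table is deliberately incomplete.

CAP_BITS = {
    0: "CAP_CHOWN",
    12: "CAP_NET_ADMIN",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}

def decode_capeff(hex_mask):
    """Return the names of known capabilities set in a CapEff hex string."""
    mask = int(hex_mask, 16)
    return sorted(name for bit, name in CAP_BITS.items() if mask & (1 << bit))

# A fully privileged root process holds all of the known bits:
print(decode_capeff("0000003fffffffff"))
# A process running as an unprivileged user has an empty effective set:
print(decode_capeff("0000000000000000"))  # []
```

Comparing CapEff inside a default container against a --privileged one makes the difference in attack surface very visible, CAP_SYS_ADMIN in particular.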
Real time and C groups are not really a thing, right? So they don't really get along as far as I know. So there was a session scheduled last year at Linux Plumbus where how to make C groups in real time play along nicely. But I don't think this is something that works as of now. Yeah, it's a bit tricky. You can do some specific, like my best guess for that would be do specific things. My best guess for that would be do specific pinning on Leola. That's a very specific CPU set to be used for real time tasks and you can maybe get things going that way, but otherwise it's pretty tricky. Anyway, we're out of time. We will be on the Slack channel for a little bit if people have more questions they want to ask there. Otherwise, thank you everyone for listening, thank you everyone for coming. I hope you enjoyed it and we'll see you all at some other later conference event, maybe in person at some point. All right, thank you.