My name is Christian. I work as a kernel engineer on the LXC and LXD team, and I'm one of the core maintainers as well. I'm going to be talking about interesting kernel container work that has happened in the last year, and I'm also going to touch on some things that are planned to be worked on. There's a lot of interesting stuff happening, and I'm not going to be able to talk about all of it simply because there is so much. I originally hoped to go into a lot more detail on the various patch sets and also do some demos, but timing-wise that's going to be a problem in 30 minutes. We'll see how deep we can actually look into the various features.

So the first thing. I'm basically starting from kernel 4.15, which is one of the first kernels that got released in 2018, I think. A use case we have had for a long time with system containers, with LXC and LXD, is that users often want to mount specific directories into their containers. So we allow mount hot-plugging: users can inject a new mount into a running container, say their home directory. We run unprivileged containers by default, which means you specify an ID mapping, and that ID mapping will usually block you from actually writing to the directory you just mounted into the container. And you probably don't want to recursively chown your whole home directory.

So the solution users usually came up with, and what we supported, was to punch a hole in the ID mapping. Basically you say: my own user ID, which is 1000, I'm going to map straight through, so UID 1000 inside the container means the exact same thing as UID 1000 outside the container. If you do that, you can write to your home directory. It turns out that users, especially if they share a system container, want multiple directories with different UIDs in the container, so they want to punch multiple holes into their ID mapping. And the kernel for a long time had a limit, which was set to five, so you could only have five ID mappings.

And the reason for that was... does anyone know? Performance, actually. The struct that was chosen inside the kernel to hold the mappings was sized to fit into exactly one cache line. When the kernel performs lookups, for example when it tries to determine whether a write to a specific directory is allowed, it has to look up whether the UID doing the write has a mapping inside the user namespace, and depending on what it is doing it also performs lookups in the other direction. This needs to be very performant so that you don't run into problems with cache eviction and so on. So the struct size was chosen such that it held exactly five mappings.

What we wanted to do was bump this limit significantly. Originally we thought ten should be enough, and then people came up with "oh no, we have a hundred UIDs we need to map through". Great. And obviously I wrote a first version of the patch that just bumped the limit as high as it could possibly go, which is 340; beyond that it gets really tricky with cache line sizes and so on. So the task was to go beyond five mappings and make that as performant as possible without regressing the base cases, meaning zero mappings, or anything up to five mappings: those should keep exactly the same performance.
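To make the hole-punching concrete: an ID mapping is a set of extents, each written as a line of the form "ID-inside-namespace ID-outside-namespace count", and punching a hole just means adding an extra single-ID extent. Here is a minimal sketch with an illustrative PID and illustrative ranges; in practice the container runtime or newuidmap writes this for you, and you need the appropriate privileges to map through IDs other than your own.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch: write an ID mapping for a container init process that maps
 * 65536 IDs starting at host ID 100000, but "punches a hole" so that
 * UID 1000 inside the container is UID 1000 on the host as well.
 * Each line is: <ID inside the namespace> <ID outside> <count>.
 * The PID and ranges are illustrative. */
int main(void)
{
	const char *mapping =
		"0 100000 1000\n"      /* container 0..999      -> host 100000..100999 */
		"1000 1000 1\n"        /* container 1000        -> host 1000 (the hole) */
		"1001 101001 64535\n"; /* container 1001..65535 -> host 101001..165535  */
	int fd = open("/proc/1234/uid_map", O_WRONLY);

	if (fd < 0 || write(fd, mapping, strlen(mapping)) < 0)
		perror("uid_map");
	if (fd >= 0)
		close(fd);
	return 0;
}
```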
So how did we make this happen? A bunch of smart people obviously thought about this; I'm not smart enough for it. Joking. So you have the struct uid_gid_map, and inside there is a sub-struct, uid_gid_extent. An extent in this case just means a single ID mapping. What we did is put a union into the struct. There is a u32 holding the number of extents, that is, the number of mappings you specified. If you are in the base case of up to five mappings, you use the inline array of struct uid_gid_extent, so for up to five mappings nothing really changes and you get the same performance as before. If you go over five mappings, you instead have a forward and a reverse pointer. Each points to an array of extents, and they are sorted in different ways: either by the ID as seen inside the user namespace, or by the ID as seen from the ancestor user namespace. So when the kernel performs a lookup, it can use binary search on those arrays, depending on which direction it is looking up. And this turns out to be pretty performant.

So we ran the numbers. If you want to see how to write a really shitty patch, look at version v1, which was my first patch, and scale from zero mappings up to the 340 mappings you can now specify; you can see what a great idea that was. The test we performed was: create one million files, make sure they are in the cache, stat each of those files a bunch of times, and calculate the mean stat time. The mean stat time is 158 nanoseconds for zero mappings, which is the same for both patch sets. If you scale linearly, like v1 did, you end up with something like 2,760 nanoseconds at around 300 mappings. With the algorithm we came up with in the end, the one you just saw, you end up with 348 nanoseconds. So it's not even 200 extra nanoseconds that you get slapped with when you use 340 mappings. That's pretty good.

We're still working on this problem to some extent, because this is a stop-gap measure. The proper solution would be something like shiftfs, which another member of my team has been working on. It's in good shape; well, it's actually more or less ready. Once we have shiftfs, which I'm not going to touch on now, this is all not going to be a problem anymore. But right now, if you need to map through specific UIDs and GIDs, this is the way to go, and you can do it with up to 340 mappings.
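For reference, the struct described above ended up looking roughly like this upstream; this is a sketch based on include/linux/user_namespace.h, so details of the field names may differ slightly.

```c
/* Rough shape of the mapping struct after the rework; one extent is one
 * ID mapping. */
struct uid_gid_extent {
	u32 first;        /* first ID inside the namespace   */
	u32 lower_first;  /* first ID in the parent namespace */
	u32 count;
};

struct uid_gid_map { /* still sized to fit in one 64-byte cache line */
	u32 nr_extents;
	union {
		/* up to 5 mappings: stored inline, exactly as before */
		struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
		/* more than 5 mappings: two sorted arrays, one per
		 * lookup direction, searched with binary search */
		struct {
			struct uid_gid_extent *forward;
			struct uid_gid_extent *reverse;
		};
	};
};
```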
Right. The next thing is something you have probably heard of: unprivileged mounts in user namespaces. Who has heard of this? A few, at least. This is work that has been going on for a long time, done by a lot of different people, spearheaded by Eric Biederman from Red Hat, and Seth Forshee on my team has done a lot of work on this as well. It's a multi-year effort; it's not something that landed in a single kernel. By the way, I should have mentioned this: the hashes on the slides are all git commit hashes that you can use, once I put the slides up, to look the commits up in the Linux kernel tree, so you don't have to search through the whole tree for the commits I'm actually talking about. You might find that helpful or not.

This is genuinely interesting work. You can mount a bunch of pseudo filesystems in user namespaces right now: devtmpfs... no, not devtmpfs, that's definitely not possible; devpts and tmpfs and a bunch of others. And these are the filesystems you need, but not really very interesting filesystems. What users actually care about is mounting proper block-based filesystems: ext4, XFS, whatever. This has been a long-standing request. The problem is that you face problems on two fronts: security in the VFS, and security in the individual filesystem you are talking about.

So the first step that needed to happen was to make the VFS itself safe when mounting filesystems in user namespaces. That included work like tracking, in the superblock, the user namespace the filesystem was mounted in. It included dealing with device node creation: when you are able to mount a filesystem inside a user namespace, you immediately become privileged with respect to that filesystem, which means you can mknod on it all you want. The trivial attack you can come up with inside a user namespace is: mknod /dev/kmem, if that's enabled in your kernel, then write to random kernel memory, kernel crash. That's probably not a good idea, so you need a way to block unprivileged users from doing that. Then there were extended attributes: work that Serge Hallyn has done to namespace filesystem extended attributes so you can use them inside of user namespaces.

But even if you go through all of that and make the VFS itself safe, which is what most of these patches did... oh, does everybody know what the VFS is? Okay, good. So from the VFS perspective it's fine now; the VFS might say, from our perspective you can go ahead and mount filesystems inside user namespaces. Now you have an additional problem: VFS security is independent of filesystem security. A specific filesystem maintainer must be willing to essentially say, I can guarantee that even in the face of a corrupted filesystem image it's still totally fine to mount this inside a user namespace, the kernel is still safe. And that's a guarantee that I'm pretty sure not a lot of filesystem maintainers want to give, which is reasonable, because filesystems were, I think, never designed with unprivileged users or user namespaces in mind. The attack I'm thinking of is: if you are able to mount a random image that somebody gives you inside a user namespace, and that somebody corrupts the superblock, then you mount it, kernel crash. That's a huge attack surface. The only real filesystem that can be mounted inside user namespaces as of now is FUSE; that's a recent addition in 4.18.
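As a minimal sketch of what this looks like from the user-space side, assuming an unprivileged user on a reasonably recent kernel: create a user plus mount namespace, map yourself to root inside it, and mount one of the few filesystems the kernel permits there. The mount target is illustrative; swapping "tmpfs" for "ext4" is exactly what the restriction above forbids.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

/* Write a small string to a proc file, reporting errors. */
static void write_file(const char *path, const char *buf)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, buf, strlen(buf)) < 0)
		perror(path);
	if (fd >= 0)
		close(fd);
}

int main(void)
{
	char map[64];
	uid_t uid = getuid();
	gid_t gid = getgid();

	/* New user namespace plus new mount namespace, unprivileged. */
	if (unshare(CLONE_NEWUSER | CLONE_NEWNS) < 0) {
		perror("unshare");
		return 1;
	}

	/* Map our own UID/GID to 0 inside the new user namespace. */
	write_file("/proc/self/setgroups", "deny");
	snprintf(map, sizeof(map), "0 %u 1", uid);
	write_file("/proc/self/uid_map", map);
	snprintf(map, sizeof(map), "0 %u 1", gid);
	write_file("/proc/self/gid_map", map);

	/* tmpfs is one of the filesystems allowed in a user namespace;
	 * trying ext4 or xfs here fails with EPERM. */
	if (mount("tmpfs", "/tmp", "tmpfs", 0, NULL) < 0) {
		perror("mount");
		return 1;
	}
	printf("mounted a tmpfs inside a user namespace\n");
	return 0;
}
```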
There is also talk about making this possible for overlayfs. There are patches for it, but they haven't been upstreamed yet, and I don't know how comfortable the overlayfs maintainers would feel at this point with making overlayfs mountable inside user namespaces. But it's definitely one of the filesystems that could be a good candidate. For the others, it's probably very, very unlikely to ever happen, to be honest. I would conjecture.

The kernel regression bit is quite interesting: we basically regressed user space at one point by making mknod possible inside user namespaces. That is reverted now, and I don't have time to go into it, but it was an interesting thing that happened this year.

The next piece is something that Lennart always hates me for. No, joking. So device management in Linux is split between user space and kernel space. On a newer Linux kernel, the kernel creates device nodes inside devtmpfs, and udev triggers a bunch of rules to do stable naming of devices and so on, all that kind of crap, and I don't know whether I'm allowed to swear here. So the kernel creates device nodes, and user space runs the udev daemon. The glue between these two sides is essentially uevents: the kernel creates a device node, the kernel sends out a uevent, and based on that uevent udev triggers and runs its rules.

One of the problems was that for a long time, uevents were broadcast into all network namespaces. They are sent over netlink, which is why they are tied to network namespaces, and they got blasted into every single one of them. The problem was that the credentials these uevents were sent with, the UIDs and GIDs, were not fixed up relative to the receiving user namespace. That basically meant that if you ran a thousand containers on your machine and you had a kernel cache event, from the slab caches for example, I think those at least trigger uevents, or you created cgroups or whatever, you got uevents sent into all network namespaces without them being usable, because udev never accepted them even if you ran system containers with udev inside. It simply ignores uevents that do not come from UID and GID zero. Which was pretty fun.

So the first step we took was to implement uevent namespacing: only network namespaces owned by the initial user namespace receive uevents, and all the others do not. But then we thought, hmm, actually there is a valid use case here. What we allow you to do with LXC and LXD, for example, is to inject devices into the container; you can pass in USB devices and so on. Wouldn't it be cool if you could see uevents inside the container, run udev inside the container, and trigger rules based on that? So what we did, and this is what uevent injection means, is allow you to forward uevents into a specific network namespace by adding a write method, if you want to call it that, to the kobject uevent socket. You can receive a uevent on the host, strip the uevent sequence number, and relay it to the kernel; the kernel will append a new sequence number and make it visible inside your unprivileged container, at which point udev sees the uevent and you can trigger, for example, your USB rules or whatever you have. LXD comes with this by default, and it has been a pretty nifty feature, actually. At some point, if we think it's reasonable for other cases, we might enable it for those as well.
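For a feel of what "seeing uevents" means in practice, here is a minimal sketch of a listener using the same netlink protocol udev uses. Run inside a container's network namespace, it only ever prints something if the kernel delivers or injects uevents there, as described above.

```c
#include <linux/netlink.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Minimal sketch: listen for kernel uevents over netlink, the same way
 * udev does. */
int main(void)
{
	struct sockaddr_nl addr = {
		.nl_family = AF_NETLINK,
		.nl_groups = 1, /* kernel uevent multicast group */
	};
	char buf[4096];
	int fd;

	fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_KOBJECT_UEVENT);
	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("netlink");
		return 1;
	}

	for (;;) {
		ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);

		if (n <= 0)
			break;
		buf[n] = '\0';
		/* A uevent is "ACTION@DEVPATH" followed by KEY=VALUE pairs. */
		printf("uevent: %s\n", buf);
	}
	close(fd);
	return 0;
}
```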
Right, the next one is a patch set we're pretty excited about; it will land in the 5.0 kernel. I talked about this briefly before, or you might have gathered it from my talk: user namespaces come with a bunch of limitations, including that you cannot mknod, you cannot mount any interesting filesystems, and so on. The denials for this are all essentially hard-coded in the kernel. The kernel has a generic security policy saying mount is not allowed for these types of filesystems, because security. But your container manager, for example, might have a much better idea of when it is safe to mount a filesystem. If your container manager knows, oh, this device is specifically targeted at that container and it's totally safe to mount it, then you would like to defer the decision to the container manager instead of the kernel.

And this is exactly what seccomp trap to user space allows you to do, which is work Tycho has done, who is sitting right up front here, the guy who gave a talk earlier. What you can essentially do is specify a seccomp filter for interesting syscalls like mknod or mount; let's use mount as an example. When the container itself performs a mount, the kernel traps it, and you get a file descriptor back for the container. From that fd you can read interesting information about the syscall the container just performed, and based on that information the container manager can instruct the kernel to either report success or report failure, depending on whether it makes sense. The container manager can also perform the operation itself on behalf of the container and then tell the kernel to report success, because I did this just for you. This is a very powerful mechanism, and it's not just interesting for container workloads and container managers; it's really interesting for a bunch of other stuff, sandboxing in general. Again, if you want to talk to the guy who did this and want more information about how it's actually usable, talk to him. I feel like I should make you stand up. Joking. All right, did I miss anything? Okay.
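To make the flow concrete, here is a rough sketch of the two halves, assuming a kernel and headers that ship the user-notification API (it was queued for 5.0 at the time of this talk); passing the notification fd from the container to the supervisor is left out.

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Container side: trap mount(2) to the supervisor, allow everything else. */
int install_filter(void)
{
	struct sock_filter filter[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mount, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
	/* Returns the notification fd to hand to the supervisor. */
	return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
		       SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}

/* Supervisor side: read one trapped mount() and fake success for it. */
void handle_one(int notify_fd)
{
	struct seccomp_notif req;
	struct seccomp_notif_resp resp;

	memset(&req, 0, sizeof(req));
	if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
		return;

	/* req.data.args[] holds the syscall arguments; the supervisor could
	 * now perform the mount on the container's behalf. */
	memset(&resp, 0, sizeof(resp));
	resp.id = req.id;
	resp.error = 0; /* report success back to the container */
	ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
}
```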
Right, this is work that is currently going on: there is a new mount API being written for the Linux kernel. Well, actually it's in pretty decent shape already. This is work by Al Viro and David Howells, and it's pretty exciting. The old mount API has a bunch of bugs which you might or might not be aware of. The most severe one, for example: under certain conditions on remount, the kernel will just silently ignore mount options. Say you remount and specify that the filesystem should be remounted nosuid; under certain conditions the kernel will just ignore the nosuid option, and you end up with a mount point that still honors setuid binaries, which can obviously be a security issue. That has been a long-standing problem. There is also no way to recursively bind-mount a whole mount tree read-only, for example; there's a generic problem insofar as mount properties currently cannot be applied recursively. This has caused a bunch of CVEs as well. The new mount API tries to solve most or all of these problems, and there's obviously a long discussion going on.

One of the other nice things about the new mount API: mounting is a core concept in Unix systems programming, and right now it's all crammed into one single syscall, mount(2), with a bunch of mount options and a bunch of ways to configure filesystems. By now it's probably not a good idea anymore to have just one syscall for all of this. So what David Howells came up with, I assume it was David who came up with the idea, is to split this into three distinct steps. That means more syscalls, but the concept is: create a mount context, configure the mount context, and then commit the mount. Which is very Unixy, right? Open, you get an fd, you perform operations on the fd, and then you close the fd; it's the same kind of model. So you have the fsopen() syscall, which basically creates a new filesystem context for a new superblock, then fsconfig(), which you configure the superblock with, and fsmount(), with which you commit to it. And there are a bunch of other syscalls as well. The mount API is being actively worked on right now, and it's going to be pretty exciting, I think.

Another thing: a good friend of mine, Aleksa, is working on restricting path resolution. This is something container runtimes have been plagued by in the face of symlinks and so on: it's pretty annoying to guarantee that users cannot trick you into writing to or accessing a file you don't want them to access. This has caused a bunch of security bugs, especially in combination with mounts. There was a prior patch set with a similar idea; Aleksa picked it up and extended it quite a bit. The idea is that you get new flags for the open and openat calls to restrict path resolution. openat() lets you specify a dirfd, then a path relative to that dirfd, and then a bunch of flags, and depending on those, path resolution works differently. So the idea is to add a bunch more flags. O_XDEV means no mount point crossing: if you specify a dirfd and a path relative to it, and resolving that path would involve crossing a mount point, that's not possible anymore. O_NOSYMLINKS does what it says on the tin: no symlinks allowed at all. There is one special case, O_NOMAGICLINKS: the /proc/<pid>/fd/<nr> entries are magic symlinks that can beam you around the filesystem depending on where the file descriptor came from, so they are also a potential security issue, and O_NOMAGICLINKS lets you selectively block those. There is O_BENEATH, which guarantees that if you try to escape the dirfd you give to openat, that is, go up above the dirfd, you get EPERM or EACCES or whatever the appropriate error is. And O_THISROOT is an interesting idea: it's basically chroot without chroot. You give it a dirfd and a path, and even if the path tries to go upwards, it will always stay relative to the dirfd, so you are guaranteed to stay underneath it. Pretty cool idea.
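To come back to the new mount API for a second, the three-step flow sketched above looks roughly like this in code. This is a sketch of the interface as it eventually took shape upstream, so it assumes a kernel and libc headers that already provide these syscalls; the device and target paths are illustrative.

```c
#include <fcntl.h>
#include <linux/mount.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	int fsfd, mfd;

	/* 1. Create a new filesystem context for ext4. */
	fsfd = syscall(SYS_fsopen, "ext4", FSOPEN_CLOEXEC);
	if (fsfd < 0) { perror("fsopen"); return 1; }

	/* 2. Configure it: each option is a separate, typed call, so errors
	 * can be reported per option instead of one big EINVAL at the end. */
	syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "source", "/dev/sdb1", 0);
	syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);

	/* 3. Commit: create the superblock, then turn it into a mount object. */
	if (syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0) {
		perror("fsconfig");
		return 1;
	}
	mfd = syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOSUID);
	if (mfd < 0) { perror("fsmount"); return 1; }

	/* Finally attach the new mount at /mnt. */
	if (syscall(SYS_move_mount, mfd, "", AT_FDCWD, "/mnt",
		    MOVE_MOUNT_F_EMPTY_PATH) < 0) {
		perror("move_mount");
		return 1;
	}
	return 0;
}
```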
And last but not least, this is something we did that is going to land in 5.0. There have been a bunch of users and use cases, don't ask me why, that want to run Android inside of containers, system containers. But the way Android is structured, it uses the binder IPC mechanism, and it always grabs hold of a couple of binder devices when it boots up. And the way its manager works, as soon as it has grabbed hold of those binder devices, you cannot reuse them in other containers. Meaning you need a way to dynamically allocate binder devices, which wasn't possible. So we came up with the idea: OK, let's write a tiny filesystem. Which is an idea, I learned afterwards, that we essentially stole from kdbusfs; so this is a version of kdbusfs, if you want to put it like that. You can mount a new filesystem, and it's mountable inside a user namespace. When it's mounted, the only thing showing up in there is the binder-control device. You can then use that binder-control device to issue ioctls on, similar to loop-control, and say: give me a new binder device. So you can dynamically allocate as many binder devices as you want, you can have separate binderfs instances inside all of your containers, and you can run all the Android instances you actually want inside containers. This is upstream, it's done, and it should be in 5.0 unless Linus decides to revert it because he hates me, I don't know.

And last but not least, this is something we are currently working on, or rather I am currently working on: file descriptors for processes. This is obviously not a completely new and crazy idea; FreeBSD has had it for a while. We want something similar for Linux, ideally ending up at some point with a whole file-descriptor-based API for processes. Why do we want this? Well, the first patch set, which I'm going to send as a pull request for the 5.1 merge window because we're at v7 now and most people have acked it and think it's a good idea, is about the kill syscall. The way PID allocation works in Linux, process identifier allocation, is that it's cyclic. Usually the limit is set to 32,768, so that's the highest PID you can get, and it only wraps around once all of the PIDs have been allocated at least once. But on a system under heavy load, with lots of processes being created and exiting, you can end up in a scenario where a PID gets recycled. If that happens behind your back and you send a signal to the recycled PID, you might end up sending SIGKILL to a process you really didn't want to kill. That has been a long-standing problem. So one of the ideas is: let's use file descriptors for processes. Let's use something that pins struct pid inside the kernel rather than struct task, so you don't waste a lot of memory. The way you do this is you get a file descriptor by opening /proc/<pid>, which by design already pins struct pid inside the kernel. So now, if you try to signal a process through this pidfd and it has exited behind your back, you get ESRCH, "no such process". Which also means you can use this to test for the existence of processes: send it signal 0, and if you get ESRCH back, you know it has exited behind your back; if it succeeds, it's either a zombie or still alive. So a pidfd-based API gives you a bunch of nice properties.
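A minimal sketch of that existence check, assuming a kernel and headers that already have the pidfd_send_signal() syscall from that patch set; the PID is illustrative.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open /proc/<pid> to pin struct pid, then signal through the fd. */
int main(void)
{
	int pidfd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC);

	if (pidfd < 0) {
		perror("open");
		return 1;
	}

	/* Signal 0: deliver nothing, just check the process. Because the fd
	 * pins struct pid, ESRCH here reliably means "it exited", never
	 * "the PID was recycled for someone else". */
	if (syscall(SYS_pidfd_send_signal, pidfd, 0, NULL, 0) < 0) {
		if (errno == ESRCH)
			printf("process is gone\n");
		else
			perror("pidfd_send_signal");
	} else {
		printf("process (or zombie) still exists\n");
	}

	close(pidfd);
	return 0;
}
```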
Yeah, I think that's mostly what I wanted to cover at this point, and we don't have a lot of time. I could give you a few demos, but it's probably not worth it; I'd rather stop here and let you ask a bunch of questions about the stuff we did. Thank you.

"One of the problems I always have with mount is that if I do something wrong, I just get EINVAL and I have no idea what happened. I saw here you have fsconfig, and there's a key and a value. But presumably that's just building up the same mount option string that ultimately gets passed to the filesystem to parse for the superblock, or has he lifted all that code out? So will I get EINVAL at config time instead of at mount time?"

You will probably get EINVAL at config time, along with the rest of the mount API work.

"Have you looked at that? Because you'd have to refactor every filesystem, right?"

That's what they are doing: they're switching filesystems over to this fs_context stuff, and they have to do it anyway. You can verify that in the kernel tree. But also, I talked to David about this, and he said this is something that bothered him as well. If I remember correctly, there is a notification mechanism built into the new mount API that will give you informative error messages when something goes wrong. Other questions?

"Can we expect something like audit logging inside namespaces?"

Sorry, what?

"Something like audit logging. Audit logging inside namespaces."

Are you asking whether we'll have audit trails for containers? I mean, this is the whole audit container ID discussion that is going on upstream. It's not exactly per container; it's going to be a bit more abstract, at least that's how I understood it, and I didn't follow that patch set too closely. You get an identifier that you can assign to whatever you consider a container, and then you get an audit trail tied to that. The problem is that containers are not a kernel concept; they're basically just a user-space thing. Then again, David Howells would disagree: I have seen his patch set, and he has added two syscalls as a proof of concept to basically make containers a kernel concept. Any other questions? Well, thank you.