Hello everyone, and welcome to this session of the Open Source Summit 2021. As you might have guessed, this is not an in-person session; unfortunately, the current situation did not allow me to travel to Seattle. Nonetheless, I think this is a great event, even virtually, and I really hope you all enjoy this session.

My name is Christian Brauner. I work as a software engineer at Canonical, which is the company behind the Ubuntu Linux distribution, and I'm part of the team that is responsible for developing LXD. LXD is a container and virtual machine manager, and consequently it allows you to run containers and virtual machines. It's different from application container managers such as Docker, Podman, and runc in that it focuses on running full system containers that can be managed and treated just like virtual machines. I'm also the main maintainer and developer of LXC, which is a shared library that provides a simple API to start and manage containers, and we also develop and maintain LXCFS, a tiny FUSE file system providing a virtualized view of various system resources. So we do the whole gamut of user space work around container development, and we are also in touch with a lot of other developers who do user space and kernel space development. In addition to that, I mainly spend my time working on the upstream kernel, where I do development in nearly all container-related areas, but I also focus on some aspects of process management and on file system abstractions, which I really like.

Every year I tend to give a talk that summarizes various developments in the upstream Linux kernel related to containers, or things that might have an impact on containers, or that are vaguely related to containers. Because there is no real in-kernel concept of what a container is (it's a user space fiction), it's debatable what actually counts as container-related kernel work and what doesn't. So this is also a smorgasbord of things that I find interesting and that I think will have an impact on containers in user space.

The first thing I want to talk about is the close_range() syscall. It sounds very unexciting at first glance. It was added in kernel 5.9, and it simply allows you to efficiently close a range of file descriptors, up to all file descriptors of the calling task. So in the example on the slide, we would essentially close all file descriptors starting from zero up to the last file descriptor that could possibly be open.

So what are the use cases for this? Well, one of the use cases close_range() was designed for is to drop file descriptors just before exec. Usually this would be expressed in the sequence you see up there: unshare(CLONE_FILES), then close_range() over the range of file descriptors that you want to close. This is one safe option to do it. It's one way of working around the problem that file descriptors are not closed on exec by default. The obvious solution would be to simply make all file descriptors close-on-exec by default, but we can't switch that over without massively regressing user space. For a whole class of programs, having an internal method of closing all file descriptors is very helpful: daemons, service managers, programming language standard libraries, container managers, and so on.
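Just to give you a rough idea of what the basic pattern looks like in code, here is a minimal sketch of my own (not from the slides): closing every file descriptor above stderr before an exec. The choice of fd 3 as the lower bound and the exec'ed program are just placeholders.

```c
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	/* Close every fd from 3 up to the highest possible fd number.
	 * Older glibc versions have no wrapper, so call it via syscall(2). */
	if (syscall(__NR_close_range, 3, ~0U, 0) < 0)
		return 1;

	execlp("ls", "ls", (char *)NULL);
	return 1; /* only reached if exec failed */
}
```

The third argument is a flags argument, and the flags we look at next plug into exactly that spot.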
The unshare(CLONE_FILES) step in the sequence on the slide is needed if the calling task is, for example, multi-threaded and shares its file descriptor table with another thread. In that case two threads could race with one another: one thread could be allocating or opening new file descriptors while the other one is trying to close them via close_range(). Linux exec semantics are such that only the execing thread survives and all other threads get killed, but they could still sneak in a file descriptor that the execing thread was not supposed to inherit. For the general case, a plain close_range() before the exec should be sufficient, especially if you don't want to incur the cost of two system calls.

The CLOSE_RANGE_UNSHARE flag encodes this exec pattern, which you find quite often in user space with hand-rolled close-range implementations, directly in the syscall and allows user space to avoid the separate CLONE_FILES call completely. If CLOSE_RANGE_UNSHARE is requested and the caller does in fact currently share its file descriptor table, a new private file descriptor table is allocated, all of the file descriptors that are supposed to be closed are closed in the newly allocated table, and then the old table is switched with the new one. So we get an atomic close of all file descriptors before an exec, which is already quite useful for this scenario.

Another use case is quite obvious: killing off hand-rolled close-range solutions. A lot of user space programs, especially libraries, implement closing all file descriptors by iterating through /proc/self/fd and calling close() on each file descriptor. Back when we looked at the various large user space code bases, this pattern of having a hand-rolled user space version of close_range() was pretty common: you had service managers such as systemd, you had libcs such as glibc, you had various container runtimes, and you also had programming language runtimes and standard libraries such as Python or Rust. With close_range(), hand-rolling such functions is not needed anymore; it's a simple system call. It also works when procfs support is not even compiled into the kernel, or procfs isn't mounted, or the caller doesn't have access to procfs, which is another nice benefit.

The performance gain is striking too. I did performance measurements comparing a simple user space close-all implementation with the close_range() syscall. The close-all function, as I've called it here, is really very bare-bones; it doesn't do any complex handling, because it's not as easy as it sounds to write a function that closes all file descriptors in user space, so I did a very simple one that is as cheap as possible. And you can see that the performance gain is really massive, especially if you close a lot of file descriptors. Since we merged close_range(), we also tweaked it a bit to be a little faster, but I don't think this is any serious bottleneck or a function that needs to be extremely performant.
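To show how the whole unshare-then-close dance collapses into a single call, here is a small sketch of my own (the helper name and the lower bound of 3 are just my choices), assuming a kernel with close_range() support:

```c
#include <linux/close_range.h> /* CLOSE_RANGE_UNSHARE, kernel >= 5.9 */
#include <sys/syscall.h>
#include <unistd.h>

/* Give the calling task a private fd table and close fds 3..max in it,
 * so racing threads cannot sneak new descriptors past the upcoming exec. */
static int close_fds_before_exec(void)
{
	return syscall(__NR_close_range, 3, ~0U, CLOSE_RANGE_UNSHARE);
}
```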
close_range() is also designed to allow for some flexibility. Specifically, it does not simply always close all open file descriptors of a task; instead, callers specify a range, as you can see here. For example, you could have a first close_range() call where you close all file descriptors from 3 to 10. Then you could leave a gap where you have allocated or reordered file descriptors that you want to keep open, and then do a separate close_range() call which closes everything starting from 100 up to infinity, more or less. This is useful for scenarios where specific file descriptors are created with well-known numbers and are supposed to be excluded from getting closed. For example, systemd does this quite regularly, where some services allocate a lot of file descriptors and systemd already orders them.

And here we see a new addition to the close_range() syscall, which was merged not too long ago: the CLOSE_RANGE_CLOEXEC flag. The use case comes from the container people, where the runtime interacts with a seccomp profile that might want to block the close_range() syscall, which some people might want to do. So you have two options, essentially. Option one: first install the seccomp profile, then call close_range(), then exec. Option two: call close_range() first, then install the seccomp profile, and then call execve(). But both options have disadvantages. In the first variant, the seccomp profile cannot block the close_range() syscall, nor the opendir()/readdir()/close() fallback used on older kernels. In the second variant, close_range() can be used only on the fds that are not going to be needed by the runtime anymore, and it must potentially be called multiple times to account for the different ranges that must be closed. So neither solution is particularly great. With CLOSE_RANGE_CLOEXEC we can solve these issues: the runtime is able to keep using the existing open fds, and the seccomp profile can block close_range(). What the CLOSE_RANGE_CLOEXEC flag does is, instead of actually closing the file descriptors, it just marks the range as close-on-exec, which is a great addition.

One last nice thing about the system call is that we coordinated it with the FreeBSD folks, so FreeBSD has the exact same system call as Linux does, which is excellent. And the system call has since seen quick and wide adoption in user space; it's used by a variety of projects that jumped right onto it, which is obviously great to see.
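Going back to the CLOSE_RANGE_CLOEXEC use case for a moment, here is a hedged sketch of my own (again, the helper name and the lower bound are placeholders): the runtime flags everything above its well-known fds close-on-exec before the seccomp profile is installed, and the actual closing then happens automatically at execve() time.

```c
#include <linux/close_range.h> /* CLOSE_RANGE_CLOEXEC, kernel >= 5.11 */
#include <sys/syscall.h>
#include <unistd.h>

/* Mark fds 3..max close-on-exec instead of closing them right away.
 * The runtime can keep using them until the final execve(). */
static int mark_fds_cloexec(void)
{
	return syscall(__NR_close_range, 3, ~0U, CLOSE_RANGE_CLOEXEC);
}
```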
The next thing that I want to talk about is seccomp, which is obviously an important building block of containers; it allows us to block system calls and so on, and I'm not going to go into the basics. Over the last year, seccomp has gained an important new feature called the seccomp notifier, also called the seccomp user notifier, and we recently extended it. So I'm going to quickly reacquaint everyone with the seccomp notifier before diving into the new features that were added.

As we know, unprivileged containers come with a range of restrictions: any operation that can be used to attack the system is obviously off limits. Two rather obvious examples are creating device nodes and mounting file systems. If a task running in a user namespace were able to create character or block device nodes, it could, for example, create /dev/kmem or any other critical device node and use that device to take over the whole system. So the kernel simply blocks creating all device nodes in user namespaces, even though there are device nodes such as /dev/null, /dev/zero, and so on that are completely safe. Similarly, if a task were able to mount arbitrary block devices, it could mount malicious file system images and use that to crash the kernel. Other operations are less hazardous but are still off limits for good reasons, for example attaching arbitrary BPF programs.

But of course these restrictions are pretty annoying. Not being able to mount block devices or to create device nodes means quite a few workloads will not be able to run in unprivileged containers, even though they could be made to run safely. Quite often a container manager like LXD will know better than the kernel whether an operation that a container tries to perform is safe, so we need a way to allow the container manager to perform syscalls on behalf of the container. And this is exactly what the seccomp notifier allows us to do; this is the problem it solves.

In its essence, the seccomp notify mechanism is simply a file descriptor for a specific seccomp filter. When a container starts, it will usually load a seccomp filter to restrict its attack surface; that is even done for privileged containers most of the time, even though it's not strictly necessary. With the addition of the seccomp notify mechanism, a container wishing to have a subset of syscalls handled by another process can mark those syscalls with the SECCOMP_RET_USER_NOTIF action in its seccomp filter, and loading the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag will instruct the kernel to return a file descriptor to the calling task after having loaded the filter. This file descriptor is what we call the seccomp notify file descriptor, or user notifier, or seccomp notifier; there are various ways to refer to it. What the container can then do is send this notifier fd to the container manager or another suitably privileged process, which will usually be more privileged than the container, otherwise there wouldn't be much point. And since the seccomp notify fd is pollable, it is possible to put it into an event loop such as epoll (or select, but don't use select) and wait for the file descriptor to become readable. For the seccomp notify fd to become readable means that the seccomp filter it refers to has detected that one of the tasks it has been applied to has performed a syscall that is part of the policy it implements. That is just a very complicated way of saying the kernel is notifying the container manager that a task in the container has performed a syscall it cares about, for example mknod or mount. Put another way, the container manager can listen for syscall events for tasks running in the container. Instead of simply running the filter and immediately reporting back to the calling task, the kernel will send a notification to the container manager on the seccomp notify fd and block the task performing the syscall. What this allows the container manager, or any other suitably privileged process, to do is to emulate the syscall in user space and then tell the kernel whether it should report success or failure back to the intercepted and blocked task.
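To make the two halves a bit more concrete, here is a rough sketch of my own (not the code from the slides): the supervised task installs a filter that forwards mknodat() to user space, and the supervisor reads and answers one notification on the returned fd. Picking mknodat() and unconditionally reporting success are just illustrative choices.

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* In the container: forward mknodat() to the supervisor, allow everything else.
 * A real filter would also validate seccomp_data.arch. */
static int install_notify_filter(void)
{
	struct sock_filter filter[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mknodat, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
	/* Returns the seccomp notify fd, to be handed to the container manager. */
	return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
		       SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}

/* In the container manager: handle one intercepted syscall. */
static int handle_one_notification(int notify_fd)
{
	struct seccomp_notif req = {0};       /* must be zeroed for NOTIF_RECV */
	struct seccomp_notif_resp resp = {0};

	if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
		return -1;

	/* Inspect req.pid, req.data.nr, req.data.args[] and emulate the call here. */
	resp.id = req.id;	/* answer the matching request */
	resp.error = 0;		/* report success to the blocked task */
	resp.val = 0;
	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
}
```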
This syscall emulation mechanism had quite a few limitations, though. We can't intercept all or even most system calls; we will never be able to intercept system calls like fork or clone, and it's very unlikely that this can be made to work. And as things stood, we were unable to intercept system calls such as open() that cause new file descriptors to be installed in the task performing the syscall. We also couldn't operate on file descriptors in lieu of the task, or get file descriptors out of the task, unless obviously we shared the file descriptor table with the task, which is usually never the case for the container manager and would also be super weird if you supervise another task. But maybe I'm not imaginative enough.

This is the reason why we added a new ADDFD ioctl for the seccomp notifier. This new seccomp extension allows the container manager to instruct the target task to install a set of file descriptors into its own file descriptor table before instructing it to move on. This way it's possible to intercept system calls such as open() or accept() and install, or replace (like dup2() would, for example), the container manager's resulting file descriptor in the target task. That's a complicated way of saying you can open files for another process. This new technique opens the door to making massive changes in user space, which is great. For example, techniques such as enabling unprivileged access to perf_event_open() and bpf() for tracing become available with this mechanism; the manager can inspect the program, the BPF program or the perf event program for example, and the way the perf events are being set up, to prevent the user from doing ill to the system. And on top of that you have various networking techniques, such as zero-cost IPv6 transition mechanisms in the future, and so on. So this is actually a really exciting technology.

Related to that, and also useful independently of it, we introduced a new system call, pidfd_getfd(), which is useful for the seccomp notifier as well but stands on its own. In essence, the system call allows you to retrieve a file descriptor from another process. To do this, a pidfd of the target task needs to be specified, plus the file descriptor number that you want to retrieve from the target task. A pidfd is nothing fancy; it's just a file descriptor referring to a process. It's safe against PID recycling and provides a stable, private handle on a process. If the caller has the right permissions, pidfd_getfd() will then return a file descriptor referring to the same file as the passed-in file descriptor number. This is useful in general where UNIX sockets with SCM_RIGHTS, for example, aren't available or can't be used. It's also useful when the target task cannot be expected to cooperate. Syscall supervisors using the seccomp notifier employ the system call, for example, to retrieve file descriptors returned by intercepted system calls such as open() or accept(). And if you think about it, you can already see how this fits together: if the supervised task opens a file and a file descriptor is returned from open(), you can use pidfd_getfd() to retrieve it, perform whatever operation you want in lieu of the task, and even swap out the file descriptor by using the ADDFD ioctl to, for example, transparently replace the file descriptor for the target task. So pidfd_getfd() and ADDFD are a really nice complementary pair of mechanisms.
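A minimal sketch of the pidfd_getfd() flow, just from me to illustrate the shape of it (the target PID, the fd number, and the helper name are placeholders; the caller needs ptrace-like permissions over the target):

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Grab a duplicate of fd `targetfd` from process `pid`. */
static int steal_fd(pid_t pid, int targetfd)
{
	int pidfd, fd;

	pidfd = syscall(__NR_pidfd_open, pid, 0); /* stable handle on the process */
	if (pidfd < 0)
		return -1;

	fd = syscall(__NR_pidfd_getfd, pidfd, targetfd, 0);
	close(pidfd);
	return fd; /* a new fd in our table, referring to the same open file */
}
```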
Another core component of containers is obviously cgroups. They are used to limit the resources of containers, such as memory, CPU, the maximum number of processes, and so on. Most of these features are implemented as what we call controllers, as they allow you to control and delegate the resource that they refer to; an obvious example is the memory controller, which allows you to control the memory available to the container in various forms. In addition to such proper resource controllers, cgroups also have what we call utility controllers. Utility controllers don't delegate a proper resource; instead they implement other useful management functionality. One obvious example is the freezer utility controller. It doesn't delegate any resource; it simply allows you to recursively freeze and unfreeze a given cgroup hierarchy, and it is implemented as a simple file, cgroup.freeze, that can be written to in order to freeze or unfreeze the given cgroup hierarchy.

So far only a few such utility controllers have been available, and with newer kernels we added a new one, the kill utility controller. This controller allows you to recursively kill all processes in a given cgroup hierarchy. Similar to the freezer, it is implemented as a simple file called cgroup.kill; it is only writable, not readable. Writing 1 to the cgroup.kill file will cause all processes in the given cgroup hierarchy to receive the SIGKILL signal, causing their immediate termination. Prior to the implementation of this kill utility controller, user space applications had to recursively walk a cgroup hierarchy down to the leaf nodes, iterate through all live processes, and send each process a signal by hand. This was obviously racy and also costly. With cgroup.kill we can guarantee that concurrent forks can't inject new tasks while the cgroup hierarchy is being killed. It's only available in the unified cgroup hierarchy, and it's available in all cgroups apart from the root cgroup, similar to the freezer utility controller. If you're in user space and want to know whether the kernel supports it, it's easy to check: you just need to check for the existence of the cgroup.kill file in a non-root cgroup. Then you can get rid of the complex code to kill all processes in a cgroup and simply rely on cgroup.kill.
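Here is a quick sketch of my own of what using it looks like (the cgroup path and the helper name are placeholders; this assumes the unified cgroup2 hierarchy):

```c
#include <fcntl.h>
#include <unistd.h>

/* Recursively SIGKILL every process in the given cgroup hierarchy. */
static int kill_cgroup(const char *kill_file /* e.g. ".../payload/cgroup.kill" */)
{
	int fd = open(kill_file, O_WRONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	ssize_t ret = write(fd, "1", 1);
	close(fd);
	return ret == 1 ? 0 : -1;
}
```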
On old kernels, mounting was done using a single multiplexing syscall called mount(). Most people who have programmed on Linux in more low-level environments will have encountered it, and it is also the system call used exclusively by the mount(8) user space tool that probably even more people are familiar with. It was a single system call which clumsily multiplexed a variety of operations, which caused a lot of muddy semantics. For example, you could create new file system mounts, meaning create a new superblock and expose a file system for the first time in the file system hierarchy. You could create bind mounts, meaning mounts of an already existing file system, exposing the same set of files in multiple locations. You could change mount properties for the whole file system, so superblock mount options that apply to the whole file system. You could change mount properties for a given mount, so something that doesn't apply to all mounts of that file system but only to a single one. And you could also change mount propagation for a given mount or a mount tree. So there is a lot of different things going on in this single system call.

Luckily, newer kernels have split this into multiple syscalls; together they form the new mount API. But one of the limitations was that the new mount API, while it did allow changing file-system-wide properties, so things that affect the superblock and apply to the whole file system, did not allow changing mount properties of existing mounts such as bind mounts. This is obviously a severe limitation, meaning that bind mounts could not be manipulated in all circumstances through the new mount API; you had to fall back to the old multiplexing mount() system call. That is not great, because the new mount API has one excellent feature: it operates purely on file descriptors, not on paths, which the old mount API doesn't allow you to do at all.

This is now rectified with the addition of the mount_setattr() system call. mount_setattr() allows you to change mount properties of existing mounts, so it's the last missing piece, more or less, in the new mount API. And in contrast to the old mount() call, it also allows you to change mount properties recursively, which is a great addition. (I'm actually praising myself here, which is a bit weird because I wrote it.) Mount properties can be changed for a whole mount tree, which couldn't be done with the old mount API and which also led to quite a few security bugs; that's why I'm shilling this feature so much. An obvious example: with the old mount API one could not turn a whole mount tree read-only with a single system call. Instead, user space would have to parse the mount information available in proc and then turn each relevant mount into a read-only mount. The obvious problem: what do you do if you had a mount tree consisting of 10 mounts and at the 7th mount you failed to make it read-only, but all the earlier mounts are already read-only? Do you turn them back to read-write? And what happens if one of the mounts can't be turned from read-only back into read-write? It's all really messy, inherently racy, and not transactional at all. With mount_setattr() this can be done in a single system call and is guaranteed to be atomic. What do I mean by this? Atomic sounds a bit rich, maybe: if you turn a whole mount tree read-only and the system call returns successfully, you are guaranteed that all of the mounts are read-only; if it fails somewhere in the middle, none of the changes will have taken effect, everything reverts back to the old state, and you can try again. Maybe in the future we should extend it to also allow ignoring failed mounts, just turning every mount that can be turned read-only into a read-only mount, but that's something for the future.

Another great feature is that mount_setattr() allows you to clear and set mount properties at the same time, also something you couldn't do in the old mount API. This allows a caller to change exactly the mount properties they want to change while not affecting any of the others. The old mount API expected a caller to always specify exactly the mount properties they wanted to set, thereby removing all others they did not explicitly specify. This, for example, meant that making a mount read-only that was nosuid and noexec would remove the nosuid and noexec properties if they were not explicitly specified together with the read-only property. With mount_setattr() this doesn't happen: as you can see on the slide, specifying read-only in the attr_set member of the struct will only make the mount read-only, leaving the nosuid and noexec settings in effect. Only if the nosuid and noexec bits are requested in the attr_clr member will they actually be removed while also setting the read-only flag. So this is a great win, I think. You can also recursively change mount propagation, and what else you can do with this system call we will look at on the next slide.
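To make the attr_set/attr_clr semantics concrete, here is a hedged sketch of my own (the path and the helper name are placeholders, and it assumes kernel headers with mount_setattr() support): recursively turning a tree read-only while leaving every other property exactly as it is.

```c
#define _GNU_SOURCE
#include <fcntl.h>		/* AT_FDCWD */
#include <linux/mount.h>	/* struct mount_attr, MOUNT_ATTR_RDONLY, kernel >= 5.12 */
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_RECURSIVE
#define AT_RECURSIVE 0x8000	/* apply the change to the whole mount tree */
#endif

/* Atomically make /mnt/tree and every mount below it read-only.
 * attr_clr stays 0, so nosuid/noexec/etc. are left untouched. */
static int make_tree_readonly(void)
{
	struct mount_attr attr = {
		.attr_set = MOUNT_ATTR_RDONLY,
	};

	return syscall(__NR_mount_setattr, AT_FDCWD, "/mnt/tree",
		       AT_RECURSIVE, &attr, sizeof(attr));
}
```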
Closely associated with the mount_setattr() system call is the introduction of idmapped mounts, which I am also giving a separate talk about during this OSS, which you can check out. Users can specify the MOUNT_ATTR_IDMAP flag together with a file descriptor referring to a user namespace, and creating an idmapped mount makes it possible to change ownership in a temporary and localized way for the set of files exposed under a given mount. It's a localized change because the ownership changes are only visible by going through that specific mount; all other users and locations where the file system is exposed are unaffected. It's also a temporary change because the ownership changes are tied to the lifetime of the mount. Whenever callers interact with the file system through an idmapped mount, the idmapping of the mount will be applied to the user and group IDs associated with file system objects. This encompasses the user and group IDs associated with files, or with inodes if we look at it from the VFS perspective, and also the security.capability, system.posix_acl_access, and system.posix_acl_default xattrs, because they record UID and GID information as well.

To give a quick glimpse into what idmappings actually are: an idmapping is essentially a mapping of a range of user or group IDs onto another range of user or group IDs of the same size. Idmappings are written to the map files in proc; the exact syntax is not that important for our specific example. In the example here, "0 1000 3", the first two numbers specify the starting user or group ID in each of the two user namespaces, and the third number specifies the length of the idmapping. So in this example, UID 0 is mapped to UID 1000, UID 1 is mapped to UID 1001, and UID 2 is mapped to UID 1002; the length of the mapping is 3. It's possible to specify up to 340 such idmappings for each idmapping type, i.e. for both group and user IDs. And if any user or group ID is not mapped, all files owned by that unmapped user or group ID will appear as being owned by the overflow user ID or overflow group ID, known to most people as nobody/nogroup.

In the common case, the user namespace that is passed in via the userns_fd member together with MOUNT_ATTR_IDMAP to create an idmapped mount will be the user namespace of a container; this is one of the use cases it helps to handle in an elegant way. In other scenarios it will be a dedicated user namespace, for example one associated with the login session of a user, as is the case for portable home directories in the systemd-homed implementation. And it's obviously also fine to create a dedicated user namespace just for the sake of idmapping a mount.
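To make the mechanics a little more tangible, here is a hedged sketch of my own of how an idmapped mount could be created with the new mount API (the source path, target path, helper name, and user namespace fd are placeholders, and error-path cleanup is trimmed): a detached copy of the mount is created with open_tree(), the idmapping is attached with mount_setattr(), and the result is made visible with move_mount().

```c
#define _GNU_SOURCE
#include <fcntl.h>		/* AT_FDCWD, AT_EMPTY_PATH */
#include <linux/mount.h>	/* MOUNT_ATTR_IDMAP, OPEN_TREE_*, MOVE_MOUNT_*, kernel >= 5.12 */
#include <sys/syscall.h>
#include <unistd.h>

static int idmap_mount(const char *source, const char *target, int userns_fd)
{
	struct mount_attr attr = {
		.attr_set = MOUNT_ATTR_IDMAP,
		.userns_fd = userns_fd, /* user namespace providing the idmapping */
	};
	int fd_tree;

	/* Create a detached copy of the mount at `source`. */
	fd_tree = syscall(__NR_open_tree, AT_FDCWD, source,
			  OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (fd_tree < 0)
		return -1;

	/* Attach the idmapping to the detached mount... */
	if (syscall(__NR_mount_setattr, fd_tree, "", AT_EMPTY_PATH,
		    &attr, sizeof(attr)) < 0)
		return -1;

	/* ...and make it visible at `target`. */
	return syscall(__NR_move_mount, fd_tree, "", AT_FDCWD, target,
		       MOVE_MOUNT_F_EMPTY_PATH);
}
```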
So what is this good for? Idmapped mounts can be useful in a variety of scenarios, and I'm just going to name a few. One is sharing file systems between multiple users or multiple machines, especially in complex scenarios. For example, idmapped mounts are used, or will be used, to implement portable home directories in systemd, where they allow users to move their home directory to an external storage device and use it on multiple computers where they are assigned different user IDs and group IDs. This effectively makes it possible to assign user IDs and group IDs at login time, which is great. Another is sharing files and file systems from the host with unprivileged containers, which allows users to avoid having to change ownership permanently through chown, as it often happens that users want to share some data with a container. You can also idmap a container's root file system, so users don't need to change ownership permanently through chown; especially for large file systems, using chown can be prohibitively expensive. You can share files and file systems between containers with non-overlapping idmappings; you just need to attach the idmapping, i.e. the user namespace, of the individual containers to their separate mounts. And it can be used to implement discretionary access control checking for file systems lacking ownership, which is for example what we did with vfat.

You can also efficiently change ownership on a per-mount basis. In contrast to chown, changing the ownership of large sets of files is instantaneous with idmapped mounts. It's especially useful when the ownership of an entire root file system of a virtual machine or a container has to be changed; as mentioned above, with idmapped mounts a single system call will be sufficient to change the ownership of all files. It's also possible to take the current ownership into account, which is something that chown cannot do: idmappings precisely specify what a user or group ID is supposed to be mapped to, in contrast to the chown system call, which cannot by itself take the current ownership into account and simply changes the ownership to the specified user ID and group ID. And it's a locally and temporarily restricted ownership change: idmapped mounts make it possible to change ownership locally, restricting the ownership changes to a specific mount, and temporarily, as the new ownership only applies as long as the mount exists. Changing ownership through chown, by contrast, changes ownership globally and permanently. And in case I haven't mentioned this before, it's also possible to idmap a whole mount tree, so the idmapping can be applied recursively, which is also a really good feature.

Going back to the pidfd API, it saw another extension in the form of non-blocking pidfds. Pidfds, again, are just file descriptors referring to processes, thread-group leaders to be precise, but that is just the current limitation; there's nothing inherently wrong with making it possible to refer to individual threads in the future, it will just be a bit more complicated. These pidfds can already be used with the waitid() system call to wait on the processes they reference, so if you just want to use pidfds to do process management, you can totally get rid of PIDs. But passing a non-blocking pidfd to waitid() used to have no effect; it simply wasn't supported. However, there are obviously users who would like to use waitid() with pidfds that are non-blocking, mix them with pidfds that are blocking, and pass both to waitid(), which is useful in event loops, for example. Blocking pidfds will obviously block when no child process is ready, but non-blocking pidfds will instead return immediately with EAGAIN when no child process is ready, and this makes them very suitable for event loops.
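Here is a hedged sketch of the non-blocking variant, just from me (the helper name is a placeholder, the fallback defines cover older userspace headers, and the caller must be the parent of the child being waited on):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef PIDFD_NONBLOCK
#define PIDFD_NONBLOCK O_NONBLOCK /* pidfd_open() flag, kernel >= 5.10 */
#endif
#ifndef P_PIDFD
#define P_PIDFD 3		  /* waitid() idtype for pidfds, kernel >= 5.4 */
#endif

/* Poll a child once without blocking: returns 1 if it exited,
 * 0 if it is still running, -1 on error. */
static int reap_nonblocking(pid_t child)
{
	siginfo_t info = {0};
	int ret, pidfd = syscall(__NR_pidfd_open, child, PIDFD_NONBLOCK);
	if (pidfd < 0)
		return -1;

	/* With a non-blocking pidfd, waitid() fails with EAGAIN instead of
	 * hanging when the child has not exited yet. */
	ret = waitid(P_PIDFD, pidfd, &info, WEXITED);
	close(pidfd);
	if (ret < 0)
		return errno == EAGAIN ? 0 : -1;
	return 1;
}
```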
Another cool feature that was added is a new capability, CAP_CHECKPOINT_RESTORE. So what is checkpoint/restore? The CRIU project is closely associated with the checkpoint/restore feature, and it essentially aims to checkpoint and restore arbitrary processes and process trees in user space. For example, it can be used to checkpoint a Java virtual machine after startup, which allows users to resume it later and avoid the notoriously high cost of starting up the JVM. In addition to such use cases, it is possible to checkpoint and restore whole containers; that is one of the goals, and it is gaining more and more support in various container runtimes. It's mostly interesting for stateful containers, obviously. So far, CRIU required extensive privileges in order to checkpoint and restore processes, namely CAP_SYS_ADMIN, which is sort of a super capability, so to speak. Newer kernels rectify this with the introduction of a dedicated new CAP_CHECKPOINT_RESTORE capability, which is great: it only covers the functionality and the privileges required to actually checkpoint and restore a process, and we have hopefully made very sure this is safe. This capability can then be set on a CRIU binary by the administrator, making it possible for unprivileged users to checkpoint and restore their processes, and the administrator will probably feel more comfortable setting a restricted capability on the binary rather than a catch-all capability like CAP_SYS_ADMIN.

And last but not least, one of the new additions is support for unprivileged overlayfs. This is a patch set we carried in Ubuntu for quite a while already; it's now also upstream, although it's not our patch set that went upstream, somebody else did the work. The overlayfs file system is heavily used in application container runtimes, Docker springs to mind, and it allows you to efficiently share a file system among different containers (though obviously not when idmappings are involved, so unprivileged containers can't easily share overlayfs file systems currently). It works by providing different writable layers to different containers on top of the same read-only base layer: you create an overlayfs mount for container one and for container two, they use the same read-only base layer, but they have separate writable layers, which makes it very suitable for distributing minimal images, for example. So far, overlayfs mounts could only be created by privileged users, similar to most other file systems. In newer kernels it's actually possible to mount overlayfs inside of user namespaces, making it possible for unprivileged containers to create their own overlayfs mounts and unblocking a range of other use cases.
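To illustrate the layering, here is a minimal sketch of my own (the directory paths are placeholders, and it assumes a kernel with unprivileged overlayfs support and that we are already inside a user and mount namespace with an ID mapping set up):

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* One read-only lower layer shared by all containers, plus a private
	 * upper (writable) layer and work directory for this container. */
	const char *opts = "lowerdir=/images/base,"
			   "upperdir=/containers/c1/upper,"
			   "workdir=/containers/c1/work";

	if (mount("overlay", "/containers/c1/rootfs", "overlay", 0, opts) < 0) {
		perror("mount overlay");
		return 1;
	}
	return 0;
}
```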
And obviously, in a talk like this we can only cover a limited number of features; there is a lot more we could have covered. That includes, for example, the new Landlock LSM, which is a great feature that took a long time to get upstream but is finally upstream; it deserves its own talk, and you should check that out at the Linux Security Summit. There is also the memfd_secret() system call, a new system call which allows you to create secret, protected memory allocations, also really great. And we have extensions to file system monitoring tools such as fanotify, which is now available to unprivileged users as well.

So I hope this is a helpful overview covering some of the more interesting features merged over a longer time stretch, and that this talk has made you aware of new features that you might want to consider using in your applications. That is really what it's all about, because we often hear from people, "I have this problem and I don't know how to solve it," and then you tell them this has been solved since, like, ten kernels back and this is the API you need to use. So this is also a good chance to raise awareness of new APIs early. If you want to know more about some of these features, please make sure to check out the website and repository of the Linux man-pages project. They might already have man pages added, especially for system calls, and certainly for all of the system calls that I wrote. Make sure to check out the website, because your distribution usually doesn't ship the newest version, and a lot of the things I talked about here have examples in the man pages that you can use as a first overview of how to use these new features, which is also excellent.

So with that, I hope you really enjoyed this talk and that you enjoy the rest of the conference, be it in person or virtual. For those who are attending in person, please enjoy Seattle for me, and hanging out with other people, which now seems like a huge luxury. And for those attending virtually: see you at the next event, hopefully.