Hello, everyone. My name is Corey. I am a software engineer at Mirantis and a maintainer on the Moby project, the upstream for the Docker Engine. Today, I'll be sharing one technique we used to make some Docker container operations faster and less resource-intensive on Linux, and explain how you could use the same techniques in your own Go programs. I'd like to start with a motivating example: increasing the limit on how many container image layers can be mounted. Moby typically uses the overlay filesystem to compose the layers of a container image. When mounting an overlayfs, the source argument is ignored and the source directory paths are passed in through the mount options. There is no hard limit on the number of layers in an OCI image, but how many layers can we mount at a time with overlayfs? Well, filesystem-specific mount options are passed into the mount syscall through the data argument. Note how there is no argument for the size of the data. How then does the kernel know how much data to copy into kernel space? While overlayfs in particular expects data to be a null-terminated string, the Linux ABI doesn't actually require that it be one. And besides, even if it did, the kernel can't just trust that user space is passing a valid string of a reasonably small length. So in actuality, the kernel copies one page of data, which in practice is four kilobytes. Taking into account the null terminator, we have 4,095 bytes of options, which, you know, for a traditional filesystem with options like noexec, that's fine, but it's pretty restrictive for overlayfs. There is a new set of filesystem mount API syscalls (fsopen, fsconfig, and friends), new in Linux 5.2, which lifts this limitation. Unfortunately, we still have to support older kernels for now, so we're stuck with the mount syscall and its limitations. One way to increase the number of layers we can squeeze into one mount option string is to reduce the redundancy.
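To make that page-size limit concrete, here is a minimal sketch of assembling the overlayfs data string and checking it against the one-page budget. `buildOverlayData` and the example layer paths are hypothetical illustrations, not Moby's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// The kernel copies at most one page of mount data; on most architectures
// that is 4096 bytes, including the NUL terminator.
const pageSize = 4096

// buildOverlayData joins layer directories into an overlayfs lowerdir
// option string and rejects it if it would not fit in one page.
func buildOverlayData(lowers []string) (string, error) {
	data := "lowerdir=" + strings.Join(lowers, ":")
	if len(data)+1 > pageSize { // +1 for the NUL terminator
		return "", fmt.Errorf("mount options are %d bytes; limit is %d", len(data)+1, pageSize)
	}
	return data, nil
}

func main() {
	// Hypothetical layer paths; real paths come from the image store.
	lowers := []string{"/var/lib/overlay/l1", "/var/lib/overlay/l2"}
	data, err := buildOverlayData(lowers)
	fmt.Println(data, err)
}
```

With absolute paths around 40 bytes each, the budget runs out after roughly a hundred layers, which is why shortening the paths matters.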
The directory paths in those mount options can be made relative to the process's current working directory, so we can change our working directory to the common prefix and squeeze a few more layers in. Now, changing the working directory temporarily is just fine in a single-threaded process, but dockerd is multi-threaded. The current working directory is global state shared by all threads, so changing it would affect every open call in every other thread until it's changed back, and if two threads tried to concurrently mount container filesystems, they would collide. The Moby project historically took the approach of starting itself as a child process, and the child changes the working directory, mounts, and exits. But starting a whole new Go process from scratch dominates the time to issue the mount syscall, and we have to deal with moving data and results across a process boundary. There's got to be a better way. Let's first dig a bit deeper into how processes are started. Launching a new program as a subprocess is a multi-step procedure. First, you call fork, which duplicates the calling thread in a copy of its memory space. Forking is fairly cheap on Linux because memory is copy-on-write. Next, the child process sets up the execution context, such as changing the current working directory. And finally, the child calls execve to replace itself with the new program. But the child process doesn't have to exec: it could keep running the same program as the parent until it exits. So that's one possible solution: fork a child process whose only job is to change directory, mount, and exit. Unfortunately, the Docker daemon is written in Go, and fork without exec is not supported in Go programs. You could invoke the raw syscall if you want, but the child process is going to be in a very sad state. The child will inherit copies of all the mutexes in the parent, but will start with just one thread.
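The historical child-process approach can be sketched like this: the parent asks for a helper to be started with its working directory already changed, which the kernel does between fork and exec. Here `sh -c pwd` stands in for the re-exec'ed mount helper, and `runInDir` is a hypothetical name:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runInDir spawns a helper process whose working directory is dir, so the
// helper resolves relative paths against the common prefix without ever
// touching the parent's working directory.
func runInDir(dir, name string, args ...string) (string, error) {
	cmd := exec.Command(name, args...)
	cmd.Dir = dir // the child chdirs here between fork and exec
	out, err := cmd.Output()
	return strings.TrimSpace(string(out)), err
}

func main() {
	out, err := runInDir("/", "sh", "-c", "pwd")
	fmt.Println(out, err) // the child sees "/"; the parent is unaffected
}
```

This works, but as noted above, process startup dominates the cost of the mount syscall itself, and real code additionally has to marshal arguments and errors across the process boundary.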
None of the garbage collector's threads will be running, and the child will likely deadlock rather quickly on one of those mutexes. Go programs are able to spawn new child processes, but the runtime makes all the arrangements to fork and exec on your behalf. It needs to do a lot of preparatory work to make it safe and reliable, details which are deeply tied to the runtime internals. So unless you are the Go runtime, or are willing to tie yourself to the implementation details of a particular runtime, write code in an extremely limited dialect of Go, and pray that toolchain updates won't break you, you cannot fork in Go programs without also exec'ing. Now, while Go programs may not be able to cheaply fork off child processes, they do have an abundance of threads. How come changing the current working directory in one thread affects the current working directory of all the other threads? Well, in a word, because POSIX says so. In a more practical sense, the current working directory is shared because the threading library, or in Go's case the language runtime, has instructed the kernel to make it that way. Threads are spawned using the clone syscall, which, compared to fork, gives the caller more precise control over what is and is not shared between the caller and the child. Clone can also be used to spawn processes, as there's not much distinction between processes and threads: a thread is just a process that shares its thread group ID, virtual memory space, and signal handlers with the other threads in the process. Other pieces of execution context can be shared, but don't actually have to be under Linux. For instance, if the CLONE_FS flag is passed to the clone syscall, the calling process and the child process share the same filesystem information, which encompasses the filesystem root, the umask, and the current working directory. Otherwise the child gets a copy.
Most of the process execution context that can be shared using clone can be unshared using the appropriately named unshare syscall. A thread can call unshare with the CLONE_FS flag to reverse the effects of clone, disassociating its filesystem information from that of the other threads. Note that there is no way to re-associate the thread's filesystem information afterwards. You may be wondering how unshare can be used in Go programs, as threads aren't exposed to application code. All application code runs in goroutines, which do not map one-to-one onto threads. The runtime schedules goroutines onto a pool of threads, not entirely unlike how the kernel schedules threads onto CPU cores. If a goroutine blocks waiting on some I/O, receiving on a channel, or acquiring a mutex, or simply if the runtime decides to preempt that goroutine because it's been running for too long, the runtime may go and schedule some other goroutine onto that same thread. Different goroutines may run on the same thread at different times, and any particular goroutine may run on different threads throughout its lifetime. Normally it does not matter that a goroutine may suddenly find itself running on a different thread, as aside from having different thread IDs, all the threads are practically identical. But unsharing parts of a thread's execution context makes that thread different from the others. It would cause chaos if random goroutines were to be scheduled onto such an unshared thread. For example, the goroutine that wanted to change just its own working directory could unexpectedly find its working directory reverted, and then some other goroutine would see the changed working directory, all at the whims of the runtime. Thankfully, Go has a solution for this: runtime.LockOSThread wires the calling goroutine to its current thread until an equal number of calls are made to runtime.UnlockOSThread. The calling goroutine will always execute on that thread, exclusively.
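You can observe the wiring directly by comparing kernel thread IDs across a scheduling point. This small sketch (the helper name `stableTID` is mine) shows that a goroutine locked with runtime.LockOSThread stays on one thread even after yielding:

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// stableTID reports whether the goroutine stays on one kernel thread
// across a scheduling point while locked to its thread.
func stableTID() bool {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	before := syscall.Gettid()
	runtime.Gosched() // yield; without the lock we might migrate threads
	return syscall.Gettid() == before
}

func main() {
	fmt.Println(stableTID()) // true: a locked goroutine never migrates
}
```

Without the lock, the two Gettid calls may or may not agree, entirely at the scheduler's discretion.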
Since unsharing a thread's filesystem information is irreversible, no other goroutine can ever be allowed to be scheduled to run on that unshared thread. Thankfully, Go also has a solution for this: you simply return from the goroutine function without unlocking it from the thread, and the runtime will terminate the thread and eventually spawn a new one to replace it. This is roughly what changing the working directory to mount looks like, minus any error handling: you spawn a new goroutine for the operation, lock it to a thread, unshare the filesystem information, and then you can simply change the working directory, mount, and return. The ability to wire goroutines to threads makes it possible to do things in Go programs which could not be done in any other way. I'll take you through a few other examples of how it's used within Moby. Path sanitization is really hard to get right from user space. The kernel can do a much better job, especially because it can do it atomically. The openat2 syscall makes it easy to guard against path traversal attacks, though that's only available from Linux 5.6. In order to support older kernels, Moby takes a different approach: sandboxing the thread so it cannot open paths outside of where it's allowed to. For use cases like ours, such as untarring image layers, where we don't need to sandbox arbitrary untrusted code, chroot is arguably perfectly adequate when used from a memory-safe language. The root directory is part of the thread's filesystem information, so unsharing it makes chroot calls thread-local, in addition to chdir. Unfortunately, using chroot makes Moby incompatible with grsecurity kernels, because those kernels block chmod and mknod in chrooted processes. We work around this by instead using pivot_root to change the root mount of the current mount namespace, which is a much more robust sandboxing mechanism as well.
But we can't safely modify the existing mount namespace, as it could be shared by many other processes, not to mention the other threads. So we call unshare with the CLONE_NEWNS flag, which moves the thread into a new mount namespace initialized as a copy of the previous one. Now we're free to mount, unmount, and pivot mounts to our heart's content without affecting the mount table of any other thread or any other process. Another use for wiring goroutines to threads is to enter a container's network namespace. The only information you need to access the network namespace created for a container is the container's process ID. Moby enters container network namespaces to provide a DNS resolver on the container loopback interface, which can resolve the private addresses of other containers, and to forward DNS queries from one container to a DNS server running in another container. The setns syscall is used to move the calling thread into the namespace referenced by a file descriptor. Unlike a more traditional piece of process state, such as the filesystem information I spoke about earlier, the thread can be moved back to its starting namespace with another call to setns. And a thread which has had its namespaces restored is indistinguishable from threads which were never moved at all, and so can be reused by the Go runtime for other goroutines. A combination of unshare and setns can also be used to cheaply create a new network namespace, for example, as I'm demonstrating here. Manipulating the execution context of threads in a language which hides threads from the application is not always going to be easy. There are sharp edges and gotchas which you need to be aware of if you want to apply these techniques to your own Go programs. You may find unexpected and even impossible behaviors in completely unrelated parts of your application if you get things wrong.
The Go runtime and most Go code assume, quite reasonably, that all execution contexts are made equal: that they all have the same file descriptor table, view of the filesystem, UID, GID, network interfaces, et cetera. If you violate the invariant that all unlocked OS threads are fungible, you're going to have a bad time. Make sure to always lock your goroutine to a thread before manipulating that thread's execution context, and only unlock after you've put the thread back exactly the way you found it, which may not always be possible. When in doubt, keep the goroutine locked to the thread and let the runtime terminate it. When writing code which opens handles to a thread's original namespaces, make sure to lock the goroutine before opening the handles, and open them from that goroutine. Otherwise your code might restore the thread to the wrong namespace. I've done this, and it was not a fun bug to chase down. The initial thread of a process is known as the thread group leader. Goroutines can be scheduled onto it, same as any other thread. This is important to keep in mind when modifying the execution context of your program's threads, because the /proc/self magic link refers to the thread group leader, not the current thread. This can trip you up in a couple of ways. Unless your goroutine happens to be locked to the thread group leader, the files in /proc/self are not going to reflect the unshared state of the thread your goroutine is executing on. When writing code which opens proc files for an unshared thread, make sure to open the files for that particular thread: use the /proc/self/task directory with the current thread ID, or the /proc/thread-self magic link on Linux 3.17 and above. Remember to lock the goroutine to the thread first, so the thread doesn't change underneath you. And to avoid any surprises with code, and with external processes, which aren't prepared for your unshared shenanigans, I recommend that you lock the main goroutine to the thread group leader.
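The /proc/self versus per-thread distinction can be checked directly: /proc/thread-self is a magic link that resolves to `<pid>/task/<tid>`, so while locked to a thread it should agree with the path built from syscall.Gettid. A small sketch (the helper name is mine):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"syscall"
)

// lockedThreadSelfMatches verifies that, while locked to a thread,
// /proc/thread-self (Linux 3.17+) and /proc/self/task/<tid> name the same
// thread. Bare /proc/self names the thread group leader, which may be a
// different thread entirely.
func lockedThreadSelfMatches() (bool, error) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	link, err := os.Readlink("/proc/thread-self")
	if err != nil {
		return false, err // kernel older than 3.17
	}
	// /proc/thread-self resolves to "<pid>/task/<tid>".
	want := fmt.Sprintf("%d/task/%d", os.Getpid(), syscall.Gettid())
	return link == want, nil
}

func main() {
	ok, err := lockedThreadSelfMatches()
	fmt.Println(ok, err)
}
```

Without the LockOSThread call, the goroutine could migrate between Readlink and Gettid, and the comparison would be meaningless.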
Go guarantees that the init functions will run on the thread group leader and also that main will also be executed on the leader if lockOS thread has been called from an init function. So long as the thread leader thread is left alone, any code in your process running on unlock or routines can continue to open files through proc self and get the expected results as the threads they're running on will be sharing all the execution context with the leader. And the last gotch I wanna talk about is the parent death signal option. When starting a sub process, you can instruct the kernel to send it a signal if the parent dies. It's very handy for ensuring you don't leak sub process if your process crashes, for example. However, the kernel considers the parent to be the thread which started the sub process. If GoRoutine, which locks and exits, gets scheduled onto the same thread which you had previously used to start a sub process, your sub process will get signaled seemingly at random. You can guard against this by locking the GoRoutine, you will be starting the sub process from to its thread and not unlocking it until after the sub process exits. Thank you.