Hello, and welcome to my talk about syscall interception. My name is Christian Brauner. I am one of the maintainers and project leads for LXC and LXD, together with Stéphane, and I'm also a kernel engineer working on the upstream kernel in a variety of areas. Today I'm going to talk a bit about work we've done over the last couple of years to make containers more usable, and it's a very interesting piece of work that has to do with intercepting and emulating syscalls in user space. So first of all, this is a rough outline of the talk. I'm going to start by reintroducing the concept of unprivileged containers, because this is usually something people don't have a clear grasp on, even though they have been around for quite a while, and in general it's a good idea to keep everyone on the same page. Then I'm going to briefly talk about syscalls, but I won't be going too deep into it, so don't worry. Then I'm going to talk about syscall interception and syscall emulation, and finally I'm going to give a brief demo of this feature. So what are containers? The dictum that most people are familiar with at this point is that containers are a user-space fiction, so they don't really exist, meaning the Linux kernel doesn't really have a concept of what a container actually is. It's more that you take a bunch of different kernel interfaces and combine them in different ways, namespaces, cgroups, LSMs, seccomp, and at some point you end up with something you can call a container. But we can classify them roughly into two distinct types, two main types that are of interest to this talk at least. First of all, privileged containers, and then unprivileged containers. Privileged containers are essentially containers in which UID 0 inside of the container is identical to UID 0 on the host. So UID 0 in the container is actually real root on the host.
That also means that any container breakout from a privileged container onto the host is immediately pretty severe, so it's going to be a big problem. The other type is unprivileged containers, and there it's the other way around: UID 0 in the container is not UID 0 on the host. That means you're not real root, and if you escape such a container, and it has been set up correctly, then you will have the exact same privileges as any unprivileged user on the host. So there's a big security boundary here, and that security boundary is enforced by user namespaces. They are what make unprivileged containers actually possible. User namespaces in themselves are kind of complex, but they are a pretty powerful security mechanism that should be used everywhere instead of privileged containers. The simple explanation is that they allow you to define mappings between UIDs and GIDs. Let's say you're inside a container and you look at your own user ID; you'll see that you're running as UID 0, for example, or UID 1000. So everything looks normal from inside the container. But if you look at this container from the outside, you will usually see UIDs in a very high range. So even though you're UID 0 inside the container, if you look at it from the host you are UID 100000, which can feel weird. This is done by these UID and GID mappings. Basically you're telling the kernel: I want UID 0 inside the container to correspond to UID 100000 on the host, such that when UID 0 breaks out of the container, it will have exactly the privileges UID 100000 has on the host. This is one of the core concepts of the user namespace security mechanism. For example, if you define a mapping like 0 100000 with a range of 1000, that means you map UID 0 inside the container to UID 100000 on the host.
You map UID 1 inside of the container to UID 100001 on the host, and so on. So the container sees UID 0 and the host sees UID 100000, and this is mainly what it's all about. Before we go on to the next slide: user namespaces do not just concern themselves with UID and GID mappings. They also isolate capabilities, which is another security mechanism in the Linux kernel. So when I'm asking, do I have a given capability, I'm usually asking, do I have a given capability in my current user namespace? But some capabilities will be checked against the initial user namespace, the host's user namespace, such that a container can never have these capabilities on the host. And this brings us to the limitations of unprivileged containers. There are quite a lot, and quite a few obvious ones. First of all, they usually can't mount block devices, and they can't create any device nodes. The device nodes especially have to do with what I mentioned before, the capability checks. If I want to create a device node, the kernel will check whether I have the capability to create device nodes. But instead of asking, do I have this capability in my current user namespace, it will ask, do I have this capability in the initial user namespace, to prevent unprivileged containers from creating device nodes, because creating arbitrary device nodes, as well as mounting arbitrary block devices, can be used to actually attack the host. And that's something you want to avoid. Think about creating /dev/mem or /dev/kmem and then just writing into random kernel memory, if your kernel is so configured. That would be a pretty big security risk. In principle, you can think of it like this.
Any operation that requires privilege on the host can't be performed inside of an unprivileged container, because it will usually mean it's something that either affects the whole system if it is changed, for example if a sysctl is changed, or it can be used to attack the host, both things we want to avoid. But obviously, a decent container manager will often know when a privileged operation is safe. Sometimes we know that even though an operation requires privilege on the host, we can guarantee, as the container manager, that it would be safe for the container to perform. Two very obvious examples: one is mounting a block device dedicated to the container. Let's say you have set up a block device, you can vouch that it is not a malicious filesystem image, and you now want to expose this device to the container and then mount the filesystem inside of the container. That's something we would like to do, but we can't, because of the aforementioned limitations. The other is creating harmless device nodes such as /dev/zero, /dev/null, and so on, all the device nodes we usually need in order for a container to function correctly. And you can see that this is actually safe, because the standard practice nowadays is that all these device nodes are just bind-mounted in from the host, and the container will usually have read and write access to these device nodes. If we didn't trust the container with these device nodes, we wouldn't bind-mount them into it. So there is no obvious reason why the container shouldn't be able to create those device nodes. One of the things we kept asking ourselves is: can we somewhat elegantly get around these restrictions without, for example, hard-coding an allow-list of device nodes into the container manager, which wouldn't be an elegant solution? We want something more dynamic. This is where syscalls come into play.
All of the limited operations I talked about, creating device nodes and mounting block devices, are done via syscalls. You create device nodes with the mknod or mknodat syscalls, and you mount block devices with the mount syscall. The kernel, glossing over details, is essentially a request handler, and syscalls are the main requests it recognizes. So when I'm creating a device node, I'm issuing the mknod syscall and I'm asking the kernel: can you please create this device node for me? The kernel will then go on to check that I have the necessary permissions to actually create this device node. This is basically what defines the boundary between user space and kernel space. Every time we need anything interesting done, we usually transition into the kernel and ask the kernel to perform an operation for us. Now a brief overview, because it's important later on to explain how we intercept and emulate syscalls in user space. You see that I've drawn this boundary right here: user space above, and down below is kernel space. When user space performs a syscall, say for example mknod or mount, we transition into kernel space. There is a specific instruction that needs to be triggered, and there is a syscall calling convention that can be different for each individual architecture.
And then at some point the kernel will look up the syscall number in the syscall table. If it doesn't recognize the syscall number, it will return ENOSYS, which means: I don't recognize this, I don't know what you want from me, this is not a valid system call. If it recognizes the system call, it will then go on to actually perform the work you asked it to do, and depending on whether or not you have the right permissions, or whether something went wrong, it will report back whether it was actually successful in performing the syscall. Then it will transition back to user space and report either an error code, or success, or a specific return value, for example a file descriptor or some memory. An interesting aspect of how system calls work, in the Linux kernel at least, is seccomp. Seccomp is short for secure computing, and a lot of people might already be quite familiar with it. What seccomp does is allow you to write filters for system calls, and you can see here on the diagram that it sits in an interesting position in kernel space, even above the syscall table; we will come back to this in a little bit. With seccomp you can restrict the syscalls a task is allowed to make. You can also write more fine-grained filters in classic BPF, which is nowadays called cBPF, not to be confused with eBPF, which is extended BPF. cBPF is not as powerful, obviously, but it at least allows you to filter on specific arguments and values for those arguments. So you could, for example, filter only a specific set of mknod syscalls, or a specific set of mount syscalls, with some restrictions because of the cBPF language. It's something that is widely used, from browsers to data centers. What seccomp usually does is cause the syscall to be skipped and an error code to be reported to user space. So let's go back to this diagram
right here. What you will see is that when user space makes a system call, before the system call is even looked up in the syscall table, it goes to seccomp. If a seccomp filter is loaded for the task that performs the syscall, and the filter actually triggers on this given syscall, then seccomp gets a say in what will happen. There are multiple options: either seccomp can completely ignore the syscall, and then you enter the regular syscall path we looked at before, the kernel just performs the system call; or seccomp instructs the kernel to skip the system call, and seccomp can, for example, fill in an error code or a specific return value and return to user space before you have even verified that the system call you tried to make actually exists. So when we think about extending the capabilities of containers, seccomp seems to be a very natural candidate, because it hooks into the syscall path. What we want, and why seccomp is a great candidate for us, is to somehow outsource the decision about whether a system call is allowed to a user-space process. Right now, seccomp filters are relatively static, meaning that once a filter is loaded, the kernel will always give you the same action for a given system call, for example returning an error code. We can't dynamically let the kernel decide whether or not a specific instance of a system call is allowed. What we really want is, for example, to bring the container manager into the mix, so that the container manager, instead of the kernel, gets to have a say in whether or not a system call is going to be successful. The way we implemented this is a new seccomp option that can be set
on seccomp filters when you load the filter into the kernel. What it does is let you retrieve a file descriptor for a seccomp filter, which we usually refer to as a notify fd, because you get notified on this fd about system calls that the task for which this filter is loaded makes. These seccomp notify fds can be polled, so you can put them into an event loop and then get notifications about system calls; a task can listen for individual events on such a seccomp notify fd. There are a few interesting properties, and we will look into some of them in more detail in a bit. The supervising task can, for example, use an ioctl, the receive ioctl, to read the system call information from the seccomp notify fd, that is, to extract the data from the kernel, which includes the system call arguments and so on. The task can also read the memory of the process that made the system call, which the kernel, because of restrictions of the cBPF language, cannot do in any depth. The task can then use another ioctl to instruct the kernel to report an error code or success for a given system call, and on newer kernels it can even instruct the kernel to continue the system call. So this is the rough outline of what you can do with this, and the reason we're really excited about this feature: you can use it to emulate system calls in user space, system calls that would otherwise fail. The way this is done is that when a container starts up, the container manager instructs it to load a seccomp filter with the notify property set, which means the task will then receive a file descriptor for its seccomp filter from the kernel. This file descriptor can then be handed off to the container manager, and the container manager can place this notify fd into an event loop. Depending on what type of filter
you have written, the container manager could then, for example, choose to be notified about the mknod or the mount syscall, which were our two primary examples. So when a process inside of the container performs the mknod system call, the container manager gets a notification, and the task will stay blocked until the container manager has responded to the kernel and instructed it what to do with the system call. The container manager can then, for example, read the system call arguments, and read the system call's memory if it wants to look at the paths a given mount or mknod system call has been made with, and it can choose to emulate this call by, for example, creating the device node for the container or mounting the filesystem for the container. Usually the container manager will be a more privileged, or at least suitably privileged, process that can perform the operations you want to allow for a given container. So this is pretty powerful and lets you get around a lot of the restrictions we talked about earlier. But one of the problems is that we cannot actually emulate all system calls. I gave the example of the mknod system call, which we can fully emulate, and that is mainly because, first of all, mknod fails completely in containers no matter what arguments it is given, and second, we can write very fine-grained filters. The cBPF language is powerful enough for the mknod system call to express, for example: I only want to intercept mknod system calls for a specific set of device nodes, and only those device nodes do I actually want to emulate. Even if I were to accidentally intercept some other mknod system call, the container manager knows that it would fail anyway, so it doesn't really matter that the system call isn't actually performed; the container manager can just instruct the kernel to return EPERM. But now think about,
for example, the mount system call. The mount system call has the limitation that most of its arguments are memory arguments, pointer arguments: source, target, filesystem type, and data. The problem is that the cBPF language is not powerful enough to dereference, or as we like to say, chase pointers, so you can't instruct the kernel to filter based on the filesystem type argument. That means if I write a filter for the mount syscall, I can filter on the mount flags argument and, for example, make sure that I don't intercept bind mounts, because I know the container can already take care of those by itself, but I can't express that I only want to intercept ext4 mount system calls. Now imagine I intercept mount system calls, and the container, for example, tries to mount a tmpfs filesystem. The problem is that this would usually succeed; the container would be able to mount a tmpfs filesystem if correctly set up. But now the container manager would need to emulate tmpfs mounts as well, which is totally pointless and also very fragile; you need to get all of the security context right in this scenario, so it doesn't make a lot of sense. Other limitations are the open or the connect system calls. The open system call usually returns a file descriptor, but obviously if the container manager intercepts the open system call for a task running in the container and then calls open itself, the file descriptor will be valid in the container manager, but it won't be valid in the process that actually performed the system call. So I can't really do anything with open or connect. And as I said, any system call that is accidentally intercepted needs to be emulated. These are severe restrictions that we have to keep in mind. So I want to talk a bit about ongoing and future
work in this area, where we try to get around some of the restrictions I pointed out on the previous slide. First of all, to solve the problem where we accidentally intercept system calls that we would then need to emulate for the container, we introduced a new property for the seccomp notifier which allows the container manager to instruct the kernel to continue a system call after it has been intercepted and the arguments have been inspected. But this needs to be used with a lot of care, because there is an inherent TOCTOU, a time-of-check-to-time-of-use race. It stems from the fact that when the container performs a system call, the container manager gets notified and the process in the container is blocked. The container manager now goes on to inspect all of the arguments of this call; it for example reads them from memory, parses them, and sees: ah, it wants to create an ext4 mount from this source path, I think that's fine, and then it continues the call. A sensibly privileged attacker could in the meantime write into the memory of the intercepted task and rewrite the system call arguments. That means when you have two processes that have the same privilege level and you want to use one process to deprivilege the other by using the seccomp notifier, that won't work; it is inherently unsafe. So the seccomp notifier can't be used to implement a user-space security policy for equally privileged processes. In other words, you always need to be sure that if someone were to rewrite the arguments of the system call you're about to continue, there are already sufficient restrictions in place that guarantee the system call won't be allowed if it is rewritten to something unsafe, a system call with unsafe arguments. And that is, not just usually, that is
true for user namespaces, because the kernel will ensure that every system call that is unsafe will actually be blocked. But it can't be used to implement a user-space security policy, to be very clear about this; I've also written a long comment in the seccomp kernel header about this that you can go look up if you're interested. In newer kernels we've implemented a system call, pidfd_getfd, which relates to a different API we've implemented over the last years, centered around using file descriptors for processes instead of PIDs. What you need to know here is that you can actually retrieve a file descriptor from another task. That means if you, for example, intercept the socket or connect system call, and you want to connect the container's socket to a different address than the container originally intended, you can retrieve the socket's file descriptor, connect it to the address you want the container to connect to, and then let the container continue on its merry way. This is a pretty neat mechanism. We also made it possible, with the 5.9 kernel, actually the released one, to inject file descriptors into a different task. While the task is blocked, waiting for the container manager to tell the kernel to go on, the container manager can, with a suitable ioctl, inject file descriptors into this task. That means when the task calls open on a path it usually wouldn't be able to open, the container manager can perform that open and then inject the retrieved file descriptor into the target task. We're also able to replace file descriptors. It's a very powerful mechanism that we have worked on and that is now available in the kernel. There are a few other things we would like to do, but with the set of work we've done right here, we can emulate the mknod system
call, we can emulate the mount system call, and with injecting and retrieving file descriptors we are even able to, for example, intercept and emulate the bpf system call, which is pretty neat given how important BPF has become over recent years. So I think that was a lot of talk, and I'm already excited for all the questions you're going to have, but maybe we can go on and do a little demo; I need to stop presenting first. Okay, so here's another try at me giving this demo, sorry for the problems this has caused. Let's launch a new container, and let's also expose a device node to that container: I'm adding a disk device, a block device, to this container. Now I want to intercept, first of all, the mknod system calls. I'm using lxc config set f9 security.syscalls.intercept.mknod, sorry, true, then lxc restart f9. Now let's assume I want to create a device node in the container, which would usually fail completely. So let's say I want to create /dev/zero: mknod zero c 1 5. And see, that worked; the container manager emulated this call, so we now actually have a character device available inside the container. But now let's see if we can mount a block device. We have a disk, a block device at /dev/sda, which carries an ext4 filesystem, and let's say I want to mount this sda to /mnt. The kernel will not allow us to do this, but with syscall interception we can tell the container manager, with config set, to intercept mount, and also to allow ext4 filesystems. We need to restart to refresh our seccomp filter, and now I want to mount this filesystem to /mnt. This worked; see, now I have it exposed here in my container, which is pretty great. You can see there's the mount, there it is. Now some would obviously point out that this is unsafe, but for those people we can also say ext4=fuse2fs, restart the container, and then perform the same mount again, and you will see that instead of having mounted the real filesystem, LXD will have rewritten it to use FUSE, and it will also have given us write access to the directory. So that's a pretty cool mechanism, and as I said, we have advanced features available; you can also intercept bpf nowadays. We're excited about a lot more work coming in this area, making containers even more usable than they are now. And with that I'm at the end of my talk. I've even survived a demo, and I'm going to leave the failure piece at the beginning in; I think you've got to be honest, right? So thanks for attending my talk, and I'm happy to answer any questions you might have. There will be instructions available for asking questions, so ask away. Thanks!
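For reference, the demo boils down to a handful of commands. This is my own reconstruction from the transcript, assuming LXD's syscall-interception configuration keys (security.syscalls.intercept.*); the container name f9 and the device paths are the ones from the demo:

```shell
# Let LXD intercept and emulate mknod for container f9
lxc config set f9 security.syscalls.intercept.mknod true
lxc restart f9

# Inside the container, creating a device node now works:
#   mknod zero c 1 5

# Intercept mount as well, and restrict it to ext4
lxc config set f9 security.syscalls.intercept.mount true
lxc config set f9 security.syscalls.intercept.mount.allowed ext4
lxc restart f9
# Inside the container:
#   mount /dev/sda /mnt

# Safer variant: transparently redirect ext4 mounts through FUSE
lxc config set f9 security.syscalls.intercept.mount.fuse ext4=fuse2fs
lxc restart f9
```

The fuse variant is the one recommended for untrusted filesystem images, since the parsing then happens in a user-space FUSE process rather than in the kernel.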