I'm Stéphane Graber. I work at Canonical; I've been doing container stuff there, running the LXD project for quite a few years now, eight years or so. And I used to work with Christian over there, who's now at Microsoft. What do you do? I work as a principal software engineer, but mainly upstream; I maintain a few kernel features.

All right. In this one, we're going to go through the state of containers in the current Linux kernel: where do we stand now, what's happened recently, and what do we see happening over the next few months or years (it's always kind of hard to predict development), and do a few demos here and there of the main aspects of containers on Linux.

So, the first thing is: what's a container? That's always a bit of a weird one on Linux, because there's really no such thing as a container inside of the Linux kernel. Despite a lot of people wanting there to be, there's just no such thing. To use the words of Serge Hallyn, it's effectively a user-space fiction; there's no such concept. Now, a container is usually a set of processes that are part of some namespaces. Typically we consider that they need to at least be part of a PID namespace for it to be considered a viable container, otherwise those processes are not really namespaced in any useful way. And there's also some optional confinement that we tend to put around it, whether that's the LSMs (AppArmor, SELinux and the like), or seccomp, or capabilities, that kind of thing.

There are effectively two kinds of containers, at least in the way we tend to communicate this to other people. One type is the privileged container, where root inside of the container is equal to root outside of the container. That's been the default for many, many people for a long time, and is unfortunately still the default in the likes of Docker and Kubernetes; there are ways around that, but by default they're still privileged. And then you've got unprivileged containers, which make use of the user namespace, where root inside of the container is not equal to root outside of it, offering a lot of extra protection. That's what we're going to be focusing on the most in this talk. We effectively consider privileged containers to be legacy, something that needs to die and go away, not something we want to spend too much time focusing on.

So let's look at the most basic of containers we can create here. Just switching that over, here we go. On my laptop, we say we want to launch an Ubuntu 22.04 container; there we go. That's using LXD in this case, which has the advantage of defaulting to user namespaces and turning on pretty much all of the features you'd expect. You can go into the container, and it looks like a normal Linux system, everything is running; you even have things like udev and all of the system services running there as you would on any Linux system. If you look at the network side, it's got its own network card, it's got its own hostname, so all of the namespaces are effectively put in place right out of the box. And if we look at what things look like from the outside, you see here at the bottom of the screen that the owner of the processes is actually UID one million outside of the container. So that shows the user namespace at work: inside of the container it looks like UID zero, outside it doesn't.
So should something dramatic happen and the user be able to escape the container for some reason, they're effectively just a nobody user on the whole system, mitigating a lot of potential security issues. So that's kind of your container 101, and unprivileged containers with user namespaces, effectively.

All right. Also, that container effectively had a seccomp policy in place to filter some system calls that we just never want to see allowed. It also has an AppArmor policy applied, which is effectively a path-based policy for a lot of the filesystem things, used as a kind of last resort: should you be able to escape the container for some reason, that policy will still apply to you and might still block even more attacks. Linux bugs are a thing, nobody disputes that, and occasionally some of those bugs allow someone to completely bypass a container and escape it. That's why you want this approach of layering security features on top of security features, so that even if that does happen, you can really limit the blast radius. But you should also be extremely careful about applying all security updates and making sure everything is kept updated.

All right. So one of the basic building blocks most people will probably be aware of is namespaces. This is a kernel technology. A namespace isolates a specific resource that usually would be global in the kernel, and the namespace makes it local. The most famous namespaces, the ones that we currently have, are: the UTS namespace, which is concerned with isolating the hostname, so that the container can have a different hostname than the host. The PID namespace, which isolates process identifiers, so PID one in the initial namespace is different from PID one in another PID namespace. And they are hierarchical, which means all PIDs that have a representation in one of the descendant PID namespaces will have a representation in an ancestor PID namespace, but not in a sibling PID namespace. So if you think about two PID namespace trees (this is the root tree, PID namespace one, and a sibling PID namespace two), these don't share any visibility into each other's processes. But if they fork child PID namespaces, then all of the PIDs in a child will have a representation in each ancestor, and in each ancestor it will be a different number. So an ancestor PID namespace can always say, this is the process I want to send a signal to, and kill it in one of the child PID namespaces.

Mount namespaces isolate the mount table, so you can get a new one. When you create a new mount namespace, all of the mounts of the ancestor mount namespace get copied, so they're private copies, and if you unmount them, usually, you don't unmount them in the parent mount namespace. However, mount namespaces are like a Swiss cheese concept. You think they're private, but then you have mount propagation, which means you can have tunnels and relationships between different mounts. So if you unmount a specific mount in a child mount namespace, and that one is a shared mount which belongs to what we call a peer group of mounts, then all of the other peers in other mount namespaces get unmounted as well. And this is not even the complex part of it; if you want to talk about mount propagation, we can spend a whole afternoon here figuring out its semantics.
So it's very complicated, but the original reason, for example, was that you need some type of flexibility to make mounts show up in child mount namespaces, or in container namespaces. So it can't just be a pure isolation mechanism in the same way that, for example, network namespaces (which we're going to talk about next) are.

So network namespaces isolate network devices. When you create a new network namespace, all of the network devices on the host disappear and you're left to figure out how to give network connectivity to your container, which, as you all know, leads to such beautiful things as Kubernetes networking plugins. Complex networking is great, right? Everybody loves it. Personally, if it goes beyond veth devices, I'm out, because I don't understand it anymore. But the network namespace isolates routing and IP addresses and all that kind of stuff. It's a very powerful concept for sure.

And then we have IPC namespaces, which is usually the most uninteresting one, at least in terms of describing its functionality: it just isolates inter-process communication, so System V IPC, semaphores, message queues, and so on.

And the most important one, arguably, is the user namespace, which is the only namespace that is concerned with isolating the privilege concepts that we have on Linux. So you create a new user namespace and you land in a new world. I'm ignoring writing ID mappings for now, because that's complicated as well and not really that important here, but suffice it to say that if you do id -u inside of a container with a user namespace and you get reported UID 0, then that's different from UID 0 on the host: that UID 0 in the container will be mapped to some other ID (10,000, 100,000, whatever) on the host. And the capability set is also isolated to that container, meaning the kernel usually doesn't ask, do I have a capability globally? It asks, do I have a capability in a given user namespace? And this is important, because all of the other namespaces have owning user namespaces. This is something that was not expressed very explicitly for a very long time, to the detriment of user space, but the network namespace that you are usually in on the host is owned by the initial user namespace. So if you want to perform management operations on, say, a network interface, the kernel will ask: do you have CAP_NET_ADMIN, which is the capability you need to administer a network namespace, in the initial user namespace? And that's usually a pretty high bar; it usually means you need to be root, or you need to be executing a binary that has this specific capability set.

Now, when you create a new user namespace (unshare --user) and then you create a new network namespace within that user namespace, what that essentially does is make the new network namespace owned by that user namespace. So if you perform network administration operations in this child network namespace, the kernel then asks: do you have CAP_NET_ADMIN in the user namespace that owns this specific network namespace? And that sort of logic is generalized across all of the other namespaces as well. If you unshare a user namespace first and then unshare additional namespaces, they all get owned by that user namespace. Consequently, if you unshare a user namespace without unsharing any of the other namespaces, you lose all control and privileges in those namespaces.
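To make that ownership rule concrete, here is a minimal C sketch (not something shown in the talk) of the pattern just described: create a user namespace, map UID 0 inside it to the unprivileged UID outside, then create a network namespace that ends up owned by it. Error handling and the gid_map/setgroups handling are trimmed.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        uid_t outer = getuid();
        char map[64];
        int fd;

        /* New user namespace: we keep running as the same process, but
         * privilege checks are now answered against this namespace. */
        if (unshare(CLONE_NEWUSER) < 0) {
            perror("unshare(CLONE_NEWUSER)");
            return 1;
        }

        /* Map UID 0 inside the namespace to our unprivileged UID outside
         * (gid_map and /proc/self/setgroups are omitted for brevity). */
        fd = open("/proc/self/uid_map", O_WRONLY);
        snprintf(map, sizeof(map), "0 %u 1", (unsigned)outer);
        write(fd, map, strlen(map));
        close(fd);

        /* This network namespace is owned by the user namespace above, so
         * CAP_NET_ADMIN checks for it are answered against that namespace,
         * not against the initial user namespace. */
        if (unshare(CLONE_NEWNET) < 0) {
            perror("unshare(CLONE_NEWNET)");
            return 1;
        }

        printf("uid inside: %u\n", (unsigned)getuid()); /* prints 0 */
        execlp("ip", "ip", "link", (char *)NULL); /* shows only a lone loopback */
        return 1;
    }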
So it's a pretty flexible mechanism, but again, it comes with caveats. You have beautiful things such as: some operations are so unsafe that the kernel cannot just ask, do you have CAP_MKNOD in the user namespace that you're currently located in, an unprivileged one; it will always ask, do you have CAP_MKNOD in the initial user namespace? So the isolation is not necessarily perfect, but it's a pretty powerful mechanism to gain more security for your containers.

I mean, it's not foolproof, right? There is a lot of functionality exposed via user namespaces that can potentially be used for exploits; that's one of the biggest criticisms. New functionality is made available inside user namespaces, but that also means you increase the attack surface. But as it usually goes, people want ever more functionality, so over time it just grows in terms of features. So that's a valid criticism that you can level against it. But in terms of container technology, the user namespace is probably one of the most important aspects.

And there are some more recent namespaces that we don't need to go into in excessive detail, because they haven't been merged yet, but usually we get a proposal for a new namespace every two or three years. Somebody comes up and asks, can we add a time namespace? Which we've added, because it has uses apart from containers. We had the IMA namespace, which is constantly being pushed; I don't know if it's ever going to make it in, but it's something people keep thinking about. And then there's stuff like binfmt, what is it, binfmt_misc? Yeah, binfmt_misc namespacing. I wrote that patch, I should probably remember that. But binfmt_misc namespacing is also a thing.

Usually, when you want to do something like that nowadays, you avoid making it a completely separate namespace that requires a new clone flag (because that's how namespaces are created, either during process creation or via the unshare system call), and instead you dangle it off another namespace, usually the user namespace, because you want to express the notion that the resource you're delegating is owned by that user namespace. That's the preferred way of doing it. Namespaces are an interesting concept, but they're all sort of related to each other and at the same time orthogonal to each other, to the point that it's often very difficult for user space to work sanely with them. And the user namespace is quite convenient for that, because it already has a relationship with all of the other namespaces; it's a very convenient place to dangle everything else off of.

So whenever we need something else namespaced in the next kernel, like the binfmt_misc I mentioned, or some potential thing we might need for tracing at some point (on my team, we've been looking at maybe doing something for logging at some point, because that's also been a bit of a mess in the past), for those kinds of things, instead of coming up with an entirely new namespace, which would eat up clone flags and make it kind of difficult to reason about (because we'd have to figure out what happens if someone wants to use that namespace just on its own), it's quite a bit easier these days to just say, well, this is going to be part of the user namespace. And a user namespace doesn't necessarily mean that you're running things unprivileged.
You can totally create a user namespace and then actually use a privileged mapping inside it, which effectively gets you the benefit of the user namespace as far as ownership of all the namespaces goes, without necessarily losing any kind of privilege on the system.

All right, so just a quick demo here of how that stuff works. There's a very convenient command on pretty much all Linux systems called unshare, which is a convenient wrapper around what the unshare system call can do, and you can use it to play with all kinds of namespaces. In this case, the version I've got supports pretty much all of them: we've got mount, UTS, IPC, net, PID, user, cgroup and time. Cgroup is the one we forgot to put on the list earlier. Oh, sorry, let me fix that. There you go. See, there are so many namespaces that we keep forgetting what exists. Yeah, when we were making this list it looked a bit short; we figured we probably missed one. And yes, we did miss the cgroup namespace, which doesn't do a lot other than, when you unshare it, whatever cgroup you're in at the time of the unshare becomes your new root of the cgroup tree for any operations after that point.

So yeah, the unshare command has flags for just about every single namespace, and where it gets interesting is when you use it unprivileged. So you can say: I want a user namespace, I want a PID namespace and a net namespace and a mount namespace, I want it to remap root to myself, and I want it to fork. Right, so now I'm root. I'm not real root, I'm just a root: in this case, I'm root within that new user namespace. If I do id, it shows that, yeah, I'm root. But it also shows that nothing belongs to anyone anymore, because there is no actual mapping for real root inside of that namespace. But I could go and create network devices, I can go and do mounts, I can do a lot of stuff without ever having had any kind of privilege whatsoever. That unshare command is running just as me, as a normal user; it didn't use any kind of setuid or any other privilege escalation mechanism whatsoever.

So it can be used as a very useful security measure, because you can use it within just a user session to add extra security, even within an application. Web browsers have been known to do that, doing things like running different namespaces per tab, or running the render thread inside of a different namespace to make it much harder to attack. For example, if you're running a piece of software that needs to spawn a sub-process that should never be allowed to go on the network, well, you could just create an empty network namespace for it and it won't be able to do that. So you can do that kind of stuff, and it's very easy to explore with just the unshare command, effectively.

So next, the LSMs. Yeah, in addition to the user namespace, we mentioned that there is a lot more security that you can leverage to secure your container, and one of the most crucial ones is probably Linux security modules. As is tradition on Linux, there isn't just one Linux security module; we have 10, and probably, if you ask me in 10 years, we'll have 20. I'm joking, but we have a lot of different Linux security modules that got merged over the years, and the current state of the art is that the major ones are not combinable. So usually what you have is a host system with an SELinux security profile or an AppArmor security profile applied, which also means that your container will have an SELinux or AppArmor profile applied to it.
That's usually what people do. Everyone kind of knows, I guess, how they work: it's mandatory access control in addition to discretionary access control. So after the DAC permission checks in the kernel, for example when opening a file, have said, this is fine, you're allowed to open it, the Linux security modules then get called and they get another say: are you allowed to open this file? I guess there's a difference between authoritative and restrictive security modules; these sit on top of DAC, they can't override DAC, for example. There was a recent discussion about this, interestingly enough. But yeah, they're pretty important.

There are a bunch more, SMACK probably, and we have two newer ones which are kind of exciting. One is Landlock, developed by a colleague of mine at Microsoft, incidentally. Well, actually it was developed way before that, but he's now with me at Microsoft. Landlock is a completely unprivileged LSM which you can use nowadays, more or less, to replace AppArmor. I don't know if it's at complete feature parity, but the idea is certainly there, and I think it's a more modern and more elegant design around a set of system calls. So this is a very cool idea in my opinion, and it can be leveraged for containers; it currently isn't, just because it's very new technology. I don't know how many people have heard of Landlock before. A little bit, four or five, see?

And the other one is the BPF LSM. Well, how many people have heard of BPF? Okay, okay. So the BPF LSM allows user space to compile BPF programs that can attach to, and this is where it gets very technical, but the way this works is that in the kernel we have security hooks, which are literally a for loop that goes through all of the registered security modules. Capability is one security module, the next one is, for example, SELinux, and then you can also call a BPF LSM hook, because that part is technically stackable. So you can attach a BPF program to a specific LSM hook, for example when opening a file, and then inspect the arguments that that hook gets passed and make decisions based on that. It's actually a pretty powerful mechanism. We use it in some contexts related to this, for example to restrict access to only a specific set of mounts in the unprivileged mounting code that Lennart has done in systemd. So it's a very powerful feature, I'm very excited about it, and I think we haven't really exhausted all of the possibilities that come with BPF LSM hooks, but it is pretty nice. I use it quite often, actually. And it's much more dynamic than SELinux profiles or AppArmor profiles: you don't get into the relabeling issues and so on that you otherwise need to deal with, you can just replace the BPF program. So it's not really new, but it's, I guess, also not used very much outside of the big cloud providers, that's at least the impression I get, but it's definitely something to explore because it's much more fine-grained. Obviously it's also only available to privileged users, so you need a privileged process to hook that BPF program up for your container; the container can't just do it by itself.

And last but not least, we finally got the ability, in I guess the last kernel release or the last two kernel releases, to block the creation of user namespaces.
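As an illustration (not something shown in the talk), this is roughly what such a restriction can look like as a BPF LSM program, assuming a kernel recent enough to have the userns_create LSM hook and booted with the BPF LSM enabled; it would be compiled with clang for the BPF target and loaded by a privileged process via libbpf.

    /* deny_userns.bpf.c: refuse user namespace creation to non-root callers. */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    #define EPERM 1 /* avoid pulling libc headers in next to vmlinux.h */

    SEC("lsm/userns_create")
    int BPF_PROG(deny_unpriv_userns, const struct cred *cred)
    {
        /* The lower 32 bits of bpf_get_current_uid_gid() are the caller's UID. */
        __u32 uid = (__u32)bpf_get_current_uid_gid();

        if (uid == 0)
            return 0;   /* privileged caller: allow */
        return -EPERM;  /* unprivileged caller: refuse the new user namespace */
    }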
So user namespaces never had an LSM hook in them, and there was strong resistance to that. But it also meant that, because user namespaces are available to unprivileged users, you could just call unshare --user --map-root-user and then you could mount tmpfs, you could mount overlayfs and so on as a fully unprivileged user on your system. Which is kind of neat on one hand; on the other hand, it's a huge attack surface, and there are a lot of workloads that might not want to give this exposure to unprivileged users, and they had no way of actually restricting it, which is why all distributions carried a patch for disabling unprivileged user namespaces. It's the same sysctl patch that has existed for 10, 15 years, I think. And now we finally have at least an LSM hook where you can say (and this will probably be the major use case): if this request comes from an unprivileged user, so not from root, then I'm able to refuse the creation of user namespaces, while still allowing the creation of user namespaces by privileged users. So I'm pretty excited that this finally went in, because it was for sure missing functionality. I think the BPF LSM, Landlock and this new user namespace hook are more or less the most recent additions in the LSM world that are exciting for containers.

Yeah, and the ability to turn off user namespaces is also kind of interesting even if you don't do it globally for all users: the ability to do it per process. So those LSM hooks will be quite interesting, because that was kind of the main issue with the sysctl; it was an all-or-nothing, system-wide knob, which is not always ideal. There are definitely cases where, like, do you want your web server to be able to create a user namespace? Probably not. But do you want your users in groups so-and-so to be able to create user namespaces? Maybe. And now with that hook we've got that kind of flexibility. Whereas in the past your only real option was probably to play with seccomp, which I'm going to get into very shortly, but seccomp has some limitations, especially when we look at some of the new system calls like clone3, which uses a struct with a bunch of pointers and fun stuff instead of simple integer arguments, making it impossible to validate through seccomp, and so making that kind of approach effectively impossible short of completely turning off all of the new clone syscalls.

So, speaking of seccomp. Seccomp is not quite an LSM, even though it's often very close to the LSMs. It's effectively a way to apply policies to system calls right at the entry point of system calls in the Linux kernel. Historically it was used to just build simple profiles saying this system call is allowed, this one is not. That got extended a bit to support BPF (not eBPF, but classic BPF) to also be able to evaluate arguments and then, based on those arguments, make a decision whether to just reject the call, keep going, or log it to audit. There are a few other targets that you can use.
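For reference, here is a tiny sketch (not from the talk) of that classic allow-or-reject style of filter, written against libseccomp rather than raw BPF; link with -lseccomp. It allows everything by default and makes mknod and mknodat fail with EPERM.

    #include <errno.h>
    #include <seccomp.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW); /* default: allow */
        if (!ctx)
            return 1;

        /* These syscalls now return EPERM without ever doing any work. */
        seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(mknod), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(mknodat), 0);

        if (seccomp_load(ctx) < 0) { /* install the filter for this process */
            seccomp_release(ctx);
            return 1;
        }
        seccomp_release(ctx);

        /* From here on, this process and its children cannot mknod. */
        if (mknod("/tmp/node", S_IFREG | 0600, 0) < 0)
            perror("mknod"); /* Operation not permitted */
        return 0;
    }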
One of the things that was added somewhat recently (I keep saying somewhat recently, but it's been quite a few years now, though it's still recent for many people; you make it into the kernel and then 10 years later you get your first bug reports from user space) was a new target, again, called notify. Yes, it was probably five or six years ago now that we started working on this stuff. Notify allows for the following: if a BPF pattern matches, instead of just allowing or rejecting the syscall, a notification is sent over an FD to a user-space monitoring process that can then decide what to do. That process then gets to send back the response: continue, or reject, and if you reject, what errno and such you want sent back.

This is quite interesting, because it allows a more privileged process on the host system to process all of those seccomp requests, and then it lets you go even one step further and have that privileged process perform the action on behalf of the calling program and just return the final return value back, effectively never letting the kernel directly process the request, but hijacking it and doing it in user space. This is quite interesting for container managers that primarily work with unprivileged containers, because it lets us fake pretty much whatever the hell we want. We've used it for things like the mount syscall, so we can allow some mounts that are normally not allowed inside of a user namespace. We've done it for things like the sysinfo system call, to look at things like the container's resource limits (the cgroups) and then update the values returned by sysinfo to include those limits directly. We've used it to allow mknod, to allow setxattr, to allow a whole bunch of different things inside of unprivileged containers which would not normally be safe, but if it goes through a privileged process that can look at policy and make a more educated decision as to what's safe and what's not, it might be acceptable.

Although, I consider it to be a stopgap measure, to give containers the impression that they're not subject to the limitations they're actually subject to, in a way. So for example, specifically the mounting thing that we do: it's very complex to do safely and to do correctly. So this is a don't-try-this-at-home kind of warning, I guess. And it also tells us that we are currently lacking the appropriate mechanism that we would want in the kernel, but we're getting to that later. Yeah, and there are definitely a bunch of cases where we want the user namespace to mostly behave like a full Linux system with everything that you can do, but at the same time, because it's running as a non-root user, you don't want that user to be able to, say, grant itself higher process priority than it normally has, that kind of stuff, or give itself more capabilities than it normally should have. But there is then a set of cases that we consider to be safe, and this lets us work around those.
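A rough sketch (not from the talk) of what the supervisor side of that can look like. It assumes notify_fd is the listener FD that was handed over from the target process, for example via an SCM_RIGHTS message or pidfd_getfd(), and that the target's filter sends mknodat to user notification; real code also has to re-validate the request and worry about races, which is glossed over here.

    #include <errno.h>
    #include <linux/seccomp.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>

    static int handle_notifications(int notify_fd)
    {
        struct seccomp_notif req;
        struct seccomp_notif_resp resp;

        for (;;) {
            memset(&req, 0, sizeof(req));
            if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
                return -1;

            memset(&resp, 0, sizeof(resp));
            resp.id = req.id; /* must echo the request id back */

            if (req.data.nr == SYS_mknodat) {
                /* A real manager would look at req.pid and req.data.args here,
                 * apply policy, maybe perform the mknod itself in the right
                 * namespace, and report the real result. We just pretend it
                 * succeeded: the target sees a return value of 0. */
                resp.error = 0;
                resp.val = 0;
            } else {
                /* Tell the kernel to carry on executing the syscall as usual. */
                resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
            }

            if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0 &&
                errno != ENOENT) /* ENOENT: the target died or got a signal */
                return -1;
        }
    }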
I mean, if I have one minute to get philosophical about this: I think the difference is that the way namespaces were architected, the way we originally thought this could work, is that you create a user namespace and then everything you're allowed to do should be done from inside of that namespace, and the kernel should basically vouch for the safety of the operation you're doing, and that's the only thing you're allowed to do. But the problem really is that it doesn't necessarily scale. In a lot of situations (I guess device node creation is a good tiny example) allowing it unconditionally in containers doesn't work, and allowing just the subset of mknod calls that would actually be safe is something you can't really express if you have this notion that everything needs to be performed from inside that specific namespace. In other words, you don't ever ask for the right to do something; the kernel is just always granting a global yes or no to the specific operation you're trying to do, and that usually doesn't scale. Seccomp notify was a way to get around this, because you're implicitly asking the container manager for the right to perform this specific operation. But we really should get away from this notion, yeah, we should really get away from the notion that it always needs to be the kernel that vouches for this. It's way nicer if you can call out to user space and ask: is this operation safe to perform? I think that's a nicer mechanism, especially for stuff like mounting; it's a different design. Yeah, and as it turns out, we need to deal with like 30 years of existing stuff, which is a bit of a problem.

All right, I'm going to pick up the pace because we only have about 10 minutes left. So let's go: just a quick demo of what you can do. I still have that one container here running; I'm going to pass it a block device. If I go in there and we format that block device, there we go, and we try to mount this thing, this is not going to work. That's good, that's the default behavior. With interception, we can actually intercept mount on this container and also allow ext4, which is not a trusted filesystem; this is a terrible idea from a security point of view, but now it works. If you wanted to make this a bit safer, one thing we can do is install fuse2fs, which is a user-space implementation of ext4, and then we can do some magic: in this case we're going to tell LXD to intercept any attempt at mounting ext4 and, instead of running the real thing, just hand it to FUSE. And there, that works. If we go look at the mount, we can see fuse.ext4 is the filesystem here instead of real ext4. So that's one way we can intercept things and actually redirect them to something that's safe, because this is now just a process running inside the container; there's nothing running outside of it, so that's fine. So that's pretty powerful. We're all just reinventing upcalls for containers.

I think this is an exciting new feature that I apparently have to talk about in two minutes, and I want to use 30 seconds of those two minutes to point out that we don't even have man pages for this yet, apart from some system calls. The new mount API is a way to split the single mount system call that we had, called mount, into multiple system calls and make it completely FD-based, which is just so much nicer; you should use FDs for everything, because they provide a stable handle. And the original mount system call has various limitations,
like, for example, you couldn't mount across mount namespaces. You couldn't just say, take this mount and mount it into this container, because the kernel would just say, no, this doesn't work; there's just the single system call and you need to be located in the namespace that you're mounting into. So that doesn't really work. But the new mount API is a split, where you can create an opaque handle, a mount FD, and then you can set mount options on it. You can do this in one namespace, completely privileged if you want to: mount an XFS filesystem, set options on it, create the superblock, and then you have an FD from fsmount which refers to a mount, but a detached one. Detached means you can't reach it from anywhere in the filesystem; it's really just a handle on a mount that isn't alive anywhere, it doesn't belong to any namespace. But that means you can switch into a new mount namespace and then issue a new system call, move_mount in this case, and attach it into this container. So in a way it's really nice, because you can now actually inject mounts into a container, something we are already making a lot of use of, and there was a talk at another conference about how this can be leveraged. So there is delegation built into the new mount API, and this concept of having detached mounts that don't appear in the filesystem is something that had been sorely lacking on Linux. Usually people would do this by attaching a mount, opening it, keeping a file descriptor, and then unmounting it again; this is sort of the same concept, but without that dance, and without that mount ever having belonged to a specific mount namespace or being owned by a specific user namespace. So this is really, really powerful to use.

And as part of the new mount API we also made it possible to create what we call idmapped mounts, which is essentially just a way to change the ownership of a whole directory tree, or just a single file, whatever you want, on a specific mount. So you mount the filesystem, every file is owned by UID 0, and then you say: at this mount point, I want all files that are owned by UID 0 to appear owned by UID 1000. You can actually express that notion now. This is powerful for containers, this is interesting for containers, but it's also interesting, for example, in systemd, where it's used to say: on this specific mount, only UID 1000 can write; if UID 0 tries to write to this mount, that ID isn't mapped, so you can't write anything to disk. It also gives you the ability to make the UID mappings that user namespaces rely on completely transitory, meaning you never persist the ID mapping that a container uses onto disk. That means every time you start a container, you could randomize the ID map for that container, because there are no files on disk that belong to the ID mapping of the previous container. So that's pretty good.

And what we really want to do in the future, and that's what I've talked about, is delegated mounting: if a container calls mount via the new mount API inside of a specific mount namespace, then a process can register itself as the mount handler for that specific mount namespace, get notifications about mounts, and make decisions on whether or not they are allowed. So, a properly designed delegation mechanism that doesn't rely on seccomp.
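Here is a minimal sketch (not from the talk) of that flow with the new mount API: configure and create a tmpfs superblock, turn it into a detached mount, then attach it with move_mount(). It uses raw syscall() since the libc wrappers and SYS_* numbers need fairly recent headers, it needs CAP_SYS_ADMIN in the relevant user namespace, and error handling is trimmed; a container manager would typically setns() into the container's mount namespace before the final move_mount().

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/mount.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        /* Pick the filesystem type and get a configuration context FD. */
        int fsfd = syscall(SYS_fsopen, "tmpfs", FSOPEN_CLOEXEC);

        /* Set options, then create the superblock. */
        syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "size", "16M", 0);
        syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

        /* Turn it into a detached mount: it exists, but is not attached
         * anywhere and does not belong to any mount namespace yet. This is
         * also the point where mount_setattr() could turn it into an
         * idmapped mount, or where the FD could be handed to another
         * process. */
        int mntfd = syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);

        /* Finally attach it somewhere visible. */
        if (syscall(SYS_move_mount, mntfd, "", AT_FDCWD, "/mnt",
                    MOVE_MOUNT_F_EMPTY_PATH) < 0)
            perror("move_mount");
        return 0;
    }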
All right, I'm going to do the shortest demo ever of this. If I go inside of the container again (I've actually been using this all along) and we look at the mounts, we'll see that the root of this container is using ZFS, and there is a flag here: you see rw, relatime, idmapped. So that means the ID mapping is in place, and even though my process tree is running as, like, 100,000... ZFS obviously isn't upstream, and this is an upstream feature, but the ZFS folks jumped onto it very quickly. So if I go look at, oh yeah, it's ZFS, never mind, I can't easily show you, but the file tree on disk would effectively be unmapped: you wouldn't see any one-million-something UID on there, you would just see zero, because that's the ID map we've got loaded in place.

All right, very, very briefly, because we're rapidly running out of time: cgroups, and what's going on there. We're still moving to cgroup v2 overall; we've been saying that for a decade, I think, at this point. Distros have by and large done the move now to cgroup v2; most stable distros are on cgroup v2. There are still a few gaps here and there, net_cls and net_prio being two controllers that are somewhat missing; you can do equivalents with nftables in some cases, but there's still a bunch of user space that doesn't do that, and it's a bit problematic. A bunch of memory pressure stuff was added, which is really nice: effectively getting that PSI value for how much memory pressure you're dealing with, and letting you deal with it with daemons like oomd and some others to take action before the kernel just starts killing stuff. So that's pretty nice. There's also now support for zswap limits directly in the cgroup tree, which was added recently, which is pretty nice. And as I mentioned, hybrid systems are a bit of a problem: we've got user-space tools that literally set up cgroup v1 on top of cgroup v2 and just cause all kinds of problems. So that's the kind of fun we're dealing with, and we're hopefully going to get rid of it soonish.

Just some conclusions. The usual reminder: get off privileged containers, please. That's a thing we keep saying, but user namespaces have been around for well over a decade at this point. It's really best to start using them, because a lot of the new kernel container features are not going to be available if you're not using a user namespace. So even if you want to run things privileged: create a user namespace, put a privileged map in place, and use that. But you need to get onto it. More and more helpers are becoming available to handle mediated resources, doing resource mediation for unprivileged containers to give you that tiny bit of extra privilege when you need it, and not constantly, like with a privileged container. So that's pretty interesting. It can work in different ways, either through seccomp interception, where the workload doesn't need any kind of awareness of what's going on, or with more advanced APIs, where the workload does need to know that it has to hit, say, a UNIX socket or an API or some other kind of service to get something privileged to happen. And we've got a full minute, so you get to say a few more words if you want. No, I think we should really open up for questions. Yeah, if there's like one question in the room, that's probably about as many as we can take. Yes?

Thank you for the talk. So with seccomp notify and your work there, you're able to kind of leave the kernel and send a message over to user space, and it waits. With the eBPF LSM, is there any hope that a similar mechanism could be implemented? Because currently there's no way to await user space, and it very much
limits the use cases of the eBPF LSM.

This is basically a question of whether or not seccomp will support eBPF? Well, no, it's more like: if you use the eBPF LSM and something hits it, being able to ask user space whether to continue or not. I mean, currently I think the closest you can get to that is by using BPF maps: you can have a BPF map that is populated by user space with some rules, and that's then accessed from the eBPF program. But that's not really a notification so much as a way to dynamically change the response of your program. I've got a feeling that people wouldn't really like an LSM hook that straight up goes and asks user space. That would mean the LSM hook would block on user space, right? Yes. It's always a bit of a tricky one; I mean, it was tricky when we suggested it for seccomp as well. You should ask the eBPF people, they're open to all sorts of crazy ideas. I'm joking.

Thank you. I really don't know. Yes, it would be kind of problematic, because security hooks in a way are always a layering violation: they appear all across the VFS stack or wherever, and I can get why it's done that way, I don't necessarily like it, but fine. And so they can be called in pretty interesting contexts; for example, you could suddenly block in the dcache or something. I don't think this would be a good idea. Maybe allowlisting specific hooks; I don't know if that's feasible, and it's really something that the LSM people should answer. I mean, you also expose yourself to very interesting issues, for obvious reasons. I wouldn't be a fan of it. Yeah, and the BPF LSM is one of the few program types that is sleepable, but mainly that's only for reading from user-space memory at this point. But I guess you said that the BPF LSM was sold as not actually being a security enforcement measure, right, as the monitoring sort of thing, which would, I think, add further opposition to that; of course, things go beyond what they're sold for. I think if you want that capability, implement a separate defer-to-user-space LSM that you can stack, and that would allow the idea to be evaluated on its own merits, without BPF or any other sort of baggage attached to it.

I think (Christian can correct me if I'm wrong on this one) what I remember is that one of the issues with LSMs is that you can't build them as easily as an out-of-tree module; they need to be within the kernel tree. I don't think you can easily have one completely on the side. I really don't know. Yeah, there was some complexity there which makes it slightly harder to just come up with your own LSM to ship on the side as, like, a DKMS module or something. I seem to remember that you effectively need it to be in your kernel tree, which then means you need to actually roll your own kernels, which is fine for a lot of people. Yeah, LSMs are not runtime-loadable. Yeah, sure.

It's a kind of more general question: if we assume security bugs, this is all predicated on the kernel getting it right, and the safest bet is always to assume security in the kernel is only temporary; someone will find something eventually. There's always something, so you have to be nimble. With SGX going away (Intel's moving away from it except on the Xeons and stuff like that) and ARM's got some... is there any silicon-substrate help here that you can imagine, even theoretically? There are all sorts of side-channel attacks and stuff. I mean, as long as you can keep the bad guys away from the physical machine... Is there any hope here for actual security? It's pretty tricky, right?
It's something that I know Canonical looked at a while back, trying to come up with something, and the short answer was that we never really managed to. We talked with a lot of the silicon vendors, and yeah, cool, they've got some really nice security features, but they all kind of depend on it not being a single kernel that does everything. If you're dealing with virtual machines you can do a lot of interesting things, whether it's AMD SEV or those kinds of features, but when you're dealing with a single kernel it becomes a bit trickier. I know that IBM was doing some amount of research on this.

I think for me it becomes a workload question, in the sense that there's Kata Containers and all that kind of stuff, where you really blur, or try to blur, the distinction between a container and a virtual machine. To which I would always say: does it have a separate kernel? Then it's a virtual machine, end of discussion, it's not a container. So I think, for example, I would assume that you would never say, I'm running a bunch of unprivileged containers and I'm giving this out to untrusted customers on the same machine, because that's really not, in my opinion at least (you might have different opinions here), the use case that I see, because you're making promises that you can't actually hold, in my opinion. But if you own, for example, your cloud or your machine, or this is a workload that you control but that is untrusted, and you just want extreme density, then that's the case where I think unprivileged containers make a lot of sense, especially when combined with the pressure stall information that cgroups provide, because it's not just a matter of isolating user IDs and privilege and so on, it's also a matter of: can a container guarantee the resource constraints that you want to have, like memory constraints, CPU constraints and so on? And nowadays we're in much better shape thanks to cgroup v2, because it's a lot more strict, so these guarantees can be given, and we know of companies that run such workloads. But if it's untrusted machines you're giving out to users and they need to be isolated from each other, especially if it's not the same customer, then you will always use virtual machines. So my question basically is: what sense does it make to implement a hardware feature specifically for containers, when what you really want is a virtual machine? Think about making virtual machines better and faster instead.

But containers, in a sense, are crucial for things like user-space services. If you have systemd services, and you have thousands and tens of thousands of them, you want to sandbox them as finely as possible, you want to resource-constrain them as much as possible. That's absolutely a use case where you don't suddenly want all of your systemd services to run in separate virtual machines. I'm pretty sure someone would now interject and say, no, that's exactly what we want; but once we are performant enough to actually do this, maybe, sure.

I think we're going to have to vacate the room for the next talk, but yeah, I would usually agree that for tenant separation VMs do pretty well, and then containers work really well inside of such an environment. After that, depending on how much you care about security, you can go multi-layer with your actual workload: run it as an unprivileged user inside of the container, put maybe an LSM around it, put maybe a seccomp profile around it, then do the same thing on the container
itself, and make sure everything is kept up to date, and the likelihood of all of those going wrong at the same time becomes extremely low, to the point where you're effectively safe, or safe enough; there's no such thing as being truly safe in this world. Thank you. We're still going to be around if you have any questions outside, but we need to free up the room.