Hello, everyone. I'm Stéphane Graber, I work at Canonical on the LXD project. And I've got Christian Brauner, who's these days at Microsoft, working on Linux security, kernel-y type stuff mostly, I think. And we're going to be talking about seccomp, system call interception, and how that can be useful for containers. So, for anyone who came to our talks before, especially at the security summit, this is gonna sound slightly familiar, sorry. Containers, and different types of containers. So, you've got your privileged containers, you've got your unprivileged containers. One is bad, one is okay. Guess which is which. Privileged containers are unfortunately what everyone uses, which is slightly unfortunate. You know, your Docker, Kubernetes and all of those have the annoying tendency of using privileged containers. Something like LXD defaults to unprivileged containers, which make use of the user namespace as the primary barrier for security.
Whereas with privileged containers, it's kind of playing whack-a-mole, trying to plug all the holes and trying to run as many LSMs and seccomp filters and whatever to try and prevent someone from just gaining root on your entire system. So, obviously we want to go more towards everyone should be using unprivileged containers; privileged containers should not be a thing. But there are a bunch of things that don't quite work inside of an unprivileged container. They're effectively running as a random user on your system, and we don't allow random users on our systems to do a lot of things. But there are ways to kind of fix that. In some cases, the way to fix it is we add new namespaces, we add some knobs, and then root can tweak some of the knobs and allow unprivileged users to do privileged things to an extent. That's fine, but we're not gonna have 50,000 different namespaces, and there have been some issues with getting some specific new ones in the past, so that's always a bit tricky. So, that was the state of things before we started really looking into seccomp and what we can do with things like syscall interception, as a way to give more privileges to otherwise unprivileged containers, but in a way that's quite controlled and mediated by processes on the host. So, seccomp and system call interception. I mean, most people here are familiar with seccomp, I assume, more or less. And one of the convenient things is that seccomp sits in the system call entry path, and it's actually run before you perform any system call. And so, fairly early on you can run your seccomp program, and one of the problems that we traditionally had, for example, is with certain system calls — one that we will mention later is mknod, for example.
And we want to be able to sometimes create device nodes in the container, because if you set up a standard container, what you usually do is you make /dev/null, /dev/zero, /dev/tty, /dev/console and so on available to the container. These are usually pretty boring device nodes, but the kernel just flat out disallows the creation of device nodes in unprivileged containers. The kernel will require that you are capable within the initial user namespace, so on the host, and you will have to have CAP_SYS_ADMIN in order to create device nodes. And the reason for this is simply that you could otherwise create stuff like /dev/kmem, /dev/mem, whatever, some random block device, and then crash the kernel trivially. But think about /dev/null and /dev/zero, or /dev/random and /dev/urandom, which are required to even run containers. The way we do it in unprivileged containers usually is we bind-mount the devices from the host into the container. But there is actually no really good reason why we shouldn't be able to just create these device nodes in the container. You could probably have an allow list in the kernel and say, here is a set of device nodes that's okay to create, but it's kind of hacky and it doesn't feel right. So for a long time we weren't open to this. But as we found out some time ago, actually we kind of do something like this already, because nowadays we allow the creation of so-called whiteouts, which is what overlayfs uses to indicate that a file, for example, has been deleted — and it's a device node which has major and minor number zero, zero. So I don't know if it's still such a good argument to say we shouldn't have allow lists for device nodes in the kernel. But we currently at least don't have one, and we once tried to make it so that you could create device nodes but you couldn't open them.
That broke all of user space and all container workloads, because the standard assumption is, well, container runtimes try to create a device node, and if they fail to create it, then they will assume, okay, I'm apparently in an unprivileged container and therefore I'm now falling back to bind-mounting the device node in. And if it succeeds, then the container manager will often assume that it's fine to interact with this device node and that you can open it. And this assumption is nowadays well spread across every container runtime in user space. So just making the mknod system call succeed in the kernel and then denying the open is not a great idea. So we thought, okay, this is just one specific instance where we're dealing with something that seems safe, and we would like to be able to delegate this ability to an unprivileged container. And this is where seccomp system call interception first came into play — an idea thrown around, I think, back at the Linux Plumbers Conference in 2017, or maybe before that. And the idea is, well, if you load a seccomp filter, you can say deny mknod system calls for a specific block device — you can filter on arguments with seccomp, right? You can say, if this device number is such and such, fail the mknod system call, otherwise allow. So the idea is basically that, since you could already do that, you could think about a mechanism to outsource the decision to another process, and that was the whole starting point for seccomp system call interception.
So instead of having a static policy that you load into the kernel, and then the kernel denies or allows based on this static policy, you would implement a mechanism such that another process in user space, for example, can get notified about a specific system call being performed, and then you could somehow envision a mechanism where this user space process then tells the kernel what to do — in some vague sense, for now. And seccomp system call interception is exactly that. You have a filter of a specific new type that you load into the kernel — SECCOMP_RET_USER_NOTIF, I think that's it. And what you get after you've loaded that filter is a file descriptor for the seccomp filter — the seccomp filter stack, I guess, is technically correct. And this file descriptor is pollable, so you can hand it off to, for example, a container manager, and the container manager can stuff it into an event loop. And then when a system call is performed — say the filter says notify me about mknod system calls — the user space process will get the notification: hey, somebody just made a mknod system call. And then we have a bunch of ioctls, and one of the ioctls lets you read the information about the system call that has been performed by the process — or task, I should say, the thread in question — and it will contain the syscall arguments and, yeah, the regular seccomp information that you usually get. And then you can inspect that, see for example what mknod system call has been performed, and decide whether or not you want to allow or deny the system call for the container. What you would usually do in user space is then have the container manager, which has just been notified about the system call being performed, emulate the system call in user space.
You can tell the kernel — basically, the kernel is not in the picture of what actually happens after it has notified you. You can just tell the kernel: continue the system call (we will get to this in a second), deny the system call and report a specific error code to user space, or report success. But nothing more happens; anything that you want to do, you will need to do in user space. So the kernel only gives you a notification that the system call has been performed, you can inspect the system call arguments, and a privileged process in user space can then decide what to do with the system call. It can, for example, say: oh, this is a mknod system call, I'm now going to emulate it in user space, because I can vouch that this is safe to do. And then, after I have finished emulating the system call and created a device node for the container and made it available to the container, I'm going to tell the kernel, okay, continue — or, if I failed to emulate it, tell the kernel to report back an error code. But — safety concerns, to make this a little bit more engaging: can you think of a specific safety issue, anything that really becomes a problem? Yeah — the question was: is there a race condition where a process could change the data after you checked it, say with pointers? Yeah. So I, for example, just spoke about the mknod system call, which more or less is a pretty boring system call, because it only has integer arguments. Which means the information that you get back from the kernel in this seccomp_data struct, I think it is, you can just directly read, and nothing will happen to it, right? But say you have a pointer argument. Seccomp, in the traditional BPF implementation, can't really chase pointers. So what you will need to do is open /proc/<pid>/mem and then use the pointer values of the specific arguments in the seccomp_data struct as offsets into this memory, read out the process memory, inspect it, and then perform the system call. So the problem here, obviously, is that we have a mechanism in seccomp where we can continue the system call with the seccomp notifier. Meaning you could — and people immediately jump to this conclusion — have the idea: I'm going to write a user space safety mechanism based on the seccomp notifier for the open system call. I could, for example, read the path argument that is passed to the open system call, look at the path and say, oh yeah, this path is fine, I can guarantee that nothing will happen, and then continue the open system call. The problem is that, in the meantime, someone could have rewritten the path that you just parsed out, read, and inspected, and, I don't know, written /etc/shadow in there — and then you're like, yeah, you should totally continue the system call. So you have a problem. TOCTOU races are actually a big issue. I mean, if you are very careful, you can program around this, but as soon as you bring continuation of system calls into play, you have serious issues. For unprivileged containers, this is often not a problem. Think, for example, about the mknod system call: the kernel is the ultimate security boundary, the kernel will disallow any mknod system call anyway, so it will never succeed. So if you continue a system call, even if somebody has rewritten it to something completely unsafe, the kernel will still deny the mknod system call anyway. It's similar for a few others, for example mounting a block device file system and so on. So really, continuation of system calls in the face of TOCTOU races like the ones that we talked about only works if you can guarantee that the kernel, your ultimate security boundary, will do the right thing even if the memory is rewritten. Otherwise you cannot do this. So it specifically means you
cannot use the seccomp notifier — and I've written a long comment in the seccomp code about this — you cannot use the seccomp notifier to implement security policy in user space. Because people were really excited about this, right, when this landed: oh yeah, we can write security policies for this in user space. Yeah... no. Still another thing to keep in mind is that seccomp system call interception will often require that you have a trusted privileged process on the host. It's pretty uninteresting if, for example, you have nested containers — you have two unprivileged containers, and in the outer unprivileged container you have the daemon running which is supervising the system calls of the unprivileged nested containers — because, you know, it can't create device nodes, it can't mount file systems, so it's not really any help. You always need to have a privileged process on the host, for the interesting work that you want to do, that supervises the system calls of a bunch of unprivileged containers. All right, so I'm gonna be going through a bunch of the different system calls we've been looking at and implementing so far, and kind of why we've had to do that. It's funny, because the list we ended up with is quite different from the list I had initially, back in Los Angeles or wherever I came up with the crazy idea of, how about we intercept system calls and do stuff in user space. Like, some of them we knew from the beginning we were gonna need; some others kind of just showed up as we ran into issues with specific workloads, and we were like, well, if we do that one system call, we can unblock things. So the current list is: mknod as mentioned, setxattr, bpf, sched_setscheduler, mount and sysinfo. So that's what's implemented in LXD right now. I know that other projects have been using the system call interception to some extent for other things here and there; not sure what everyone has been using it for, but that's what we've done so far on our side. So the first one would be
mknod and mknodat. We kind of already went over the part about creating safe device nodes. So effectively, you know, you have your /dev/zero, /dev/random, those kinds of things. So that's one thing that we wanted to allow. That makes things easier — for example, we couldn't run something like debootstrap or another similar tool inside of an unprivileged container before, because as soon as it needed to create device nodes, it was failing. Now this works, which is nice: we can build images for a whole bunch of different distros without having to run in a privileged environment, which is really convenient. But the other thing that really caused us to look into mknod was actually overlayfs. Because — so, as mentioned, the whiteout is just a character device 0:0, which works, it's allowed everywhere. But it didn't used to be; it used to be c 0:0, but creating one was not allowed. So we actually had to put syscall interception in place so that running Docker inside of an unprivileged container would be able to unpack its layers in the overlayfs format and create the whiteouts properly. We generally consider interception of mknod to be safe, as in, there's no obvious way to completely DoS the host or anything. We're not aware of any bug in that particular code on our side; it's been working pretty well. We don't enable any of those by default, but that's one where we can tell people: yeah, you can turn this one on, and even if a random user has access to that container, you're probably going to be fine. The next one we've done is setxattr, and that's for the same reason, in this case overlayfs. Because another way to mark — I think it was also whiteouts, or something else — was to use an xattr. And there were kind of two ways to do it. One would be, you mount overlayfs and you actually delete the file or something, and that will cause — I think it was the opaque directory marker, or something like that, that was using an xattr. Anyway, you could do some action through overlayfs and trigger it, which is fine. But as it turns out, Docker doesn't do that. What it does is actually unpack its layers directly to disk, without going through overlayfs, and it sets the extended attributes directly. So we needed to also do interception there, and we effectively have an allow list on our side for the few xattrs that are safe. We obviously don't want to allow all of them — having access to, say, the security.* ones or any of those would be extremely bad — but we are allowing whatever is needed, effectively, for Docker in this case. So the goal had always been — the reason we originally did these workarounds — we wanted to be able to unpack a whole root file system inside of an unprivileged container, inside of a user namespace. This was something that we'd wanted to do fairly early on, because you don't want to unpack a root file system — especially one that has exactly the UIDs and GIDs that you would expect, starting from UID 0, plus all of the xattrs that might be on files, like security.capability or trusted xattrs, and setuid and setgid bits and so on — directly on the host. It should be done in an unprivileged container, in a user namespace, so all of these privileges are ideally isolated from the host. And we got fairly far along. The only things that didn't work were extended attributes — we did work a while ago on a way to basically namespace extended attributes — and whiteout creation, and some specific xattrs that we needed to be able to set. And on newer kernels, I think none of this should be needed anymore; this is basically just for kernels where that isn't available. And the next one is the mount system call. Oh yeah — so, another security concern that I didn't talk about: for mknod it's fairly easy, right? You just want to create a device node for a container. Usually what this involves is you attach to the mount namespace of the container and then you create the device node
inside of the container's mount namespace. So you don't actually need to play any specific games with the privilege level or security level of the process that is attaching to the container. But now, when you talk about system calls like the mount system call, suddenly you have all kinds of privilege levels in the mix. You have capabilities that you require or might not require, you might have LSM profiles, you might have additional filesystem UIDs that you need to set — in general, change UIDs and GIDs — and you might need to attach to specific namespaces, such as the mount namespace, user namespace or the pid namespace, depending on what you need to do. So one additional security concern that you need to keep in mind is: you need to find the exact balance between retaining the privileges that you had on the host and assuming the privilege level of the target process inside of the container for which you emulate the system call. Because you're basically emulating a system call for another process, you need to assume their personality to some extent, while also retaining the necessary privileges to perform the operation. The mount system call is a prime example of this. You need to attach to the mount namespace; sometimes you also need to attach to the user namespace, if you, for example, mount a FUSE file system — and that becomes really tricky to get right. So this is really low-level user space programming that touches a lot of privilege mechanisms, and that's difficult to get right. This is all a nice mechanism, but these are things that you always need to keep in mind. So why intercept the mount system call? Well, oftentimes you might have dedicated file systems for the container that were originally provided by the host, such as a dedicated ext4 file system or XFS file system, for which the host or the container manager can vouch that it is safe to mount. Because you cannot let a container mount arbitrary file systems — you could have a malicious file system image. So usually, mounting block-based file systems at least — so anything that is not a tmpfs, more or less, or sysfs or procfs — requires privileges on the host. But as I said, in some scenarios there is a well-known path on the host, or a disk, that you can mount inside of the container. So the container manager, via the seccomp syscall interception mechanism, gets notified about the mount system call, it inspects the arguments of the mount system call, and then it performs the mount for the process inside of the container. And there you can also see the original problem — as was correctly pointed out earlier — the mount system call is full of pointer arguments; basically only the flags argument is not a pointer, and it isn't really that interesting. The additional problem is that the mount system call is a terrible multiplexer. It can mount file systems, it can create bind mounts, it can mount pseudo file systems, it can change mount attributes for a superblock or for a specific mount. So there are all kinds of layers that you need to be aware of here. LXD, for example, implements a way where you can allow-list specific file systems. So you could say, make it possible to mount ext4 or XFS file systems. That's in general not a great idea, because it means that you can mount arbitrary file system images from inside of the container. And as people might know, it's possible, as an unprivileged user, to create an image, create a file system superblock on it, and then, for example, corrupt the superblock in some tricky way. If you have read the kernel source code and you've figured out there is some bug in the ext4 file system, then with system call interception, LXD or some other container manager would diligently mount that file system for you — and here we have an exploit. But as I said, if you know what you're doing, and you
can vouch for the safety of the file system that you're about to mount, then intercepting the mount system call is actually really nice. I have ideas on how to do what I call delegated mounting in the VFS itself, but until we have that, this is actually something that we can do. All right — and I'll show this a bit later on — but LXD also lets you do pretty interesting things, like automatically setting up an ID-shifting layer on top of an intercepted mount. But the coolest thing we've got is actually intercepting mount and then not mounting the actual file system, but calling the FUSE equivalent instead, which actually makes it pretty safe, because you're not hitting the file system driver in the kernel; it's just being redirected straight to user space inside of the container. All right, and this one we can go over quickly: I implemented a POC for bpf system call interception some time ago, and this leverages quite a bit of work that we did over the years in the kernel. Because something that we haven't talked about so far — I mentioned intercepting the open system call for another process. Well, the problem, obviously, is that you need to share the file descriptor table with the process in question in order to do this, because otherwise the file descriptor won't be valid in the target process. So a while ago I did work on what is called pidfd, which introduces file descriptors for processes in Linux. We have a system call called pidfd_open, and we also implemented a system call called pidfd_getfd, and what it allows you to do is get a file descriptor for a file in another process. Which means we have also made it possible to intercept accept, connect, open and all that stuff — and bpf as well. Because you call the bpf system call, you get a file descriptor back; you get that file descriptor out in the container manager, you set all of the options that you want, and then you give the file descriptor back to the process that you intercepted it from, because we also made it possible to inject file descriptors into another task safely. So when the process calls bpf — which will not work in unprivileged containers — we intercept this system call as well, attach the program for the container, and send the file descriptor to the target process inside of the container. And one of the limited program types where we allow this is the BPF cgroup device program, so that a container can further restrict the devices that it has access to. All right, the other one we did very quickly is sched_setscheduler. We don't consider that one to be safe — because I find it dodgy. Yeah, it is a pretty dodgy call, but it turns out Android uses it a bunch for some reason, and so when running Android inside of unprivileged containers, we had issues with Android being a bit unhappy about things. So we've added that option, which our users can opt into for specific containers where it's needed. Effectively, it allows changing some of the scheduling types and that kind of stuff. In theory, the worst that could happen is you end up with processes that have an extremely high priority, but some of the alternative scheduling options and stuff are not super well tested in Linux and not considered to be extremely safe, so there are potential issues with this one. The last one we've done recently is sysinfo. For LXD, we've had that thing called LXCFS for a while, which lets us look at the cgroup limits and expose them inside the container in things like /proc/meminfo, /proc/cpuinfo, those kinds of things, to show the actual limits you have instead of the host-wide resources. That works pretty well, except that a bunch of processes use sysinfo to get that same data — and sysinfo is a syscall, so LXCFS can't do anything, because it's not a file.
So we've recently started intercepting sysinfo as well, and we effectively fill the sysinfo struct ourselves with the data coming from cgroups. That way, you get the uptime of the container correct, and the amount of memory available and so on is all correct as well. All right, let's get into a bit of a demo. So, I'm just on my laptop here. I've got a container called seccomp that's running. It currently has nothing special configured on it, so I can get a shell inside it, and say we try to create /dev/null — let me get the name right, it's mknod null c... there. Yeah. Operation not permitted, out of the box. Now, we can change that with security.syscalls.intercept.mknod true, then bounce the container — because we need to actually rewrite the seccomp policy when that happens, we need to actually restart it — then go inside it, and mknod works. Just to make sure that a bad mknod still doesn't work, let's try to create a block device — that still doesn't work. And if we look on disk, we've got /dev/null here, created properly. So, that's the basic interception for mknod that we have. Now, for something that's a lot more interesting — let me just get back into the container. I do have a block device that's passed in by the container manager at /dev/sda. Let's create a file system on it — there's already one on it, which just gets recreated, there we go. So, just making that ext4, there we go. And I'll try to mount this thing. Well, that's not gonna work, because you're in an unprivileged container. So, kernel says no — no big surprise there. (And you also created the image inside of the container. — You're right, exactly.) So, it's exactly the kind of scenario where the container also literally created the file system itself, so this would be a very bad one in theory. But say, for some reason, you really trust that container and you still want to allow it: security.syscalls.intercept.mount true, and security.syscalls.intercept.mount.allowed ext4.
Okay, again, we need to bounce it for the seccomp policy to update. If we go back in there, mount works. But if we look at the permissions in there, they're all a little bit wonky. And that's because the mount was done by real root, but this is inside of a user namespace, so everything is shifted. And so, something that should be root:root turns out being nobody:nogroup, because it's all the overflow UID, and you can't actually do anything on this file system. So, it's not very convenient. To address that, we've got another option. So, this one is shift true — oh, mount.shift, sorry. There we go. The policy is already in place, it's just a small tweak, so you don't actually need to bounce it for that. Do the mount again and, look at that, it's all nice now. So, what it did in this case is it actually set up the VFS ID-mapping, shifting stuff, on top of it, so that the permissions now all look correct. But again, this is not a very good case, because the user was literally allowed to create the file system. They could write whatever the hell they wanted onto their sda and then mount it and attack the kernel — and hey, you're root, like real root. Or you just crash the system. Not great either way. So, that's where we've got a bit of magic. I did install fuse2fs inside that container before, which is a FUSE file system for ext2, 3 and 4. And now, let's go and clear some config. So, I'm gonna be undoing the config for the shift, I'm gonna be undoing the config for the allowed file systems. So, that means I just have the mount interception enabled with nothing currently configured. And then I'm gonna add this one, which is fuse equals — ext4 equals fuse2fs — which means if we intercept a mount and we see ext4, we're gonna call fuse2fs to handle it instead. So, go in there and just try mounting it again. Okay, all right. Permissions all look good, everything is fine.
But now, if we look at /proc/mounts and you look at what the file system is — yeah, this changed a bit. So, now it's FUSE handling it. And this one we actually consider to be generally safe to enable for most users, because it's running a process as themselves inside their own container. It's exactly the same thing as if they ran FUSE directly, but it works for all their processes, even those that have no idea that they should be calling FUSE. Something that just calls the mount syscall normally will just work. So, that one is pretty nice. The other thing I was gonna show you is the sysinfo bit. So, we're gonna set a limit on the instance: limits.memory, I'm gonna set that to 256 megs. Go inside the instance, look at the free output. So, that's 256, that's great. That works because of LXCFS doing the file overlay. If we look at /proc/mounts, we see that meminfo is overlaid by LXCFS already; that's why this works with free. But if we run — I've got a small binary that just does the sysinfo syscall — we can see here, if you look at total memory, it's 16 gigs. So, it sees my entire laptop and not just the container. That's not very good. The load info is probably also wrong. The uptime is also wrong. It all shows the host info, which is not amazing, especially because we already have all of that data inside of LXCFS. If I run uptime, I'm getting three minutes, because that's how long the container's been running. We've got that data already. We had a lot of bugs that users reported — for example, when they had specific programs, I think in this case it was Java, that preallocate a chunk of memory based on the system RAM. If Java looks at /proc/meminfo, then it would probably be fine. If it uses sysinfo, then it will think, ooh, 16 gigabytes, I'm gonna use four gigabytes. And we saw some other issues, like Alpine Linux's implementation of free.
I'm not sure what version of the standard utils they're using, but their version of free uses sysinfo; it doesn't use /proc/meminfo. So, suddenly we had everyone using Alpine be like, hey, the limits are not working. It's like, no, the limits are working. You just don't see them, and you're gonna be hitting them and gonna be very unhappy at that point. So, now we've got the interception for sysinfo. Just put that in place. Again, need to restart to update the seccomp profile. And if we go back in, meminfo should still be fine. And if I look at sysinfo, we're gonna see total RAM has actually turned into 256 megs. And uptime and the load and everything are also correct, no longer showing the entire system. So, that's most of the useful stuff I wanted to show. There's not too much value in showing the sched_setscheduler one, because the only way I can test it is with like five lines of C showing that the syscall doesn't fail, and it's kind of hard to show it doing anything useful. Same thing with setxattr: I could show that you can set it, but it's not super useful. So, let's switch back to slides. And so, what's next? What are we doing next with this stuff? I don't know if you wanna cover the user-space side of intercepting specific system calls. I think in general, for seccomp, at some point I'd like to bring up the possibility to, at least in some limited way, port it to eBPF to extend its abilities. We have a problem right now, and we've come up with ideas of how to solve it in various ways, all of which I find quite distasteful. And that is, we have switched away from having system call design limited by seccomp. So, what do I mean by that? For a long time, if you wanted to add a system call and you passed a struct, somebody would come yelling at you and tell you: go away, we can't do this, because then the system call will never be filterable, because seccomp can't chase pointers. So, don't do it.
We finally got rid of this requirement, because it's just not feasible, not even to support multiplexers. io_uring, for example, would be a good example of a recent multiplexer. But it also matters for system calls like the new mount API: we have the mount_setattr system call, which takes a struct argument because it has a bunch of additional arguments that wouldn't fit in the six-argument limit we usually adhere to for system calls. We have the clone3 system call, which takes a struct argument. We have openat2, which also takes a struct argument. And in general, that has proven to be quite useful, simply because we get around all of these extensibility issues. These new system calls have been designed with extensibility in mind, such that you can extend the struct, and every time the struct changes, it's correctly padded and so on. It's backwards and forwards compatible, so you can actually extend existing system calls without always introducing a new one. It's always a trade-off. You could make the argument that we should always have a new system call for this, but then you end up in a situation where you have dup, dup2, dup3, accept, accept4. Not even the versioning for that is clear. So, I think in general this is a step in the right direction. But obviously now we have a problem. We've seen it with glibc, for example, which tried to switch to the clone3 system call because it has, overall, I think, a nicer API that user space can use, and they would like to switch to it. But as soon as you run glibc inside of a container and the container has a seccomp filter, then in general it will often block, or may block, the clone3 system call outright, because it cannot filter it. Say you have a use case where you don't want your containers to be able to create additional containers.
So, you don't want them to be able to pass any clone flags, like CLONE_NEWUSER or CLONE_NEWPID or whatever. Then you can filter on the traditional old clone system call, because it has a flags argument, which is filterable by seccomp. If you do it for the clone3 system call, the flags argument is a u64 within a struct, which seccomp cannot see and cannot filter on. So, the only way to prevent your container from creating additional containers is to block the clone3 system call outright. Which means if glibc wants to use clone3, they actually can't in a lot of interesting cases. Additionally, there is a problem standardizing system call interception in general in user space. What we usually recommend is: if you wanna communicate transparently to one of your workloads that a given system call is not available, then your seccomp filter should return ENOSYS. That way glibc, for example, would get ENOSYS when it tries to perform the clone3 system call, and glibc would be like, oh, okay, I see, clone3 is not available, fine, I'm falling back to clone. If your filter reports EPERM, then glibc will go: what? I don't understand, and fail. So, that's a big issue, and I think a lot of the container runtimes have now slowly tried to switch this, but the reality is that a lot of them do still report EPERM. And in general, it would be nicer if we could write seccomp filters that could at least follow first-level pointers. You cannot do this with the traditional old-school BPF language, but you could for sure do it with eBPF. The question just becomes whether or not we feel fine with eBPF in seccomp.
I think we're good on our side. As far as what we're looking at maybe doing moving forward, like other syscalls we've got some interest in intercepting for LXD, we've had interest in one that's gonna scare a lot of people, which is init_module and finit_module, so that containers can effectively load kernel modules. We would not actually let them load the kernel module. What we would let them do is pass in the kernel module they would like to load. We would parse the kernel module to figure out what the hell it is, and once we know what it is, compare it to a list of kernel modules the container is allowed to trigger loading of, and then load the host's version of it, not the one that was fed to us by the container. One of the reasons for that is things like firewalling, where you might need to load some netfilter plugins and that kind of stuff. Right now, the way we do it is that the container config can list a number of kernel modules that we will load on container startup, but it would be nicer if we could skip that and load them on demand, as the container actually needs them. So that's one that's kind of interesting, and would also make for pretty interesting demos, I think. But it's a bit scary. And the other thing that, as I've always said, we can do with system call interception, but have not done yet: seccomp is interesting because it sits at the syscall entry point, before we've actually gone through the syscall table. So you can actually implement new system calls on a system entirely in user space, because you can totally have seccomp intercept a new syscall number that does not exist, and then in user space have a C implementation of what you want the system call to do. Prototype it, make sure that everything works, that user space is gonna function and everything. And if that all works, then okay, sure, now you can submit it as actual kernel code.
So that's pretty interesting. I've not seen anyone actually do that yet, but it's something we could do. And with that, we're kind of out of time, but questions? I think it's also break time, so if you do have any questions, come see us. There's a question over there, we can try and take a quick one. Okay, so just to repeat the question: it's around the new pidfd_getfd interface, saying that in the past we could do that using ptrace, getting the fd and then sending it across, so what's better about this approach? I can think of one thing, which is that the task might be running under gdb or strace, at which point you cannot ptrace something that's already being ptraced. So that would actually prevent you from doing it. A lot of the syscall interception stuff we're doing with seccomp, you could in theory do with ptrace. It would be slower, and it would prevent you from running anything else that uses ptrace. For the ptrace approach you probably also would have to have CAP_SYS_PTRACE, right? Whereas in this case you only need to be able to ptrace the process in question that you want to get the fd from, which is a less strict requirement. So it basically requires fewer privileges, if I remember this correctly. I should probably remember the code that I've written, but yeah. Okay, I think we should wrap now because we're going over time. So thank you very much, and feel free to catch us during the break. Thank you.