Next one is going to be Tycho Andersen from Cisco, I think, about forwarding syscalls to userspace.

Hello everyone. So my name is Tycho Andersen, and as Dimon said, I work for Cisco. I've worked for three different places recently. Also, people usually have a picture of themselves on GitHub, but my GitHub picture is like six years old and I didn't have a beard. It wasn't really looking anything like me, so I didn't think that would be very useful.

Anyway, what I'm going to talk to you about today is a little kernel patch set that I worked on recently that basically aims to let you forward syscalls to userspace. So what does that mean? So for motivation, if I do an unshare and I'm in a user namespace and I try to mknod, I'm not allowed to do that. And for those of you who are familiar, 1 and 3 are the device major and minor for /dev/null. And that seems sort of unreasonable, because /dev/null is a fairly harmless Linux device which just sends EOF. And if you look in the kernel code somewhere, there's this check, basically, and the important part is this capable() bit. So there are devices like /dev/kmem or /dev/mem which you would not necessarily want containers to have access to, because then people can do bad things. But for /dev/null, /dev/zero, these kinds of innocuous devices, it might be nice to relax this restriction a little bit. So one option would be to hard-code a list of acceptable devices in the kernel, in the VFS mknod layer, and then move on with our lives. But then every time somebody adds a new device, you have to argue with the maintainer of that device about whether this is okay or not, yada, yada.

So another problem is you can't mount certain file systems. The Ubuntu kernel actually has some out-of-tree patches where there's a sysctl with which you can enable mounting of ext4.
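As an illustration of the "list of acceptable devices" idea from the mknod discussion above, here is a minimal sketch of what such a policy check could look like if it lived in a userspace supervisor rather than in the kernel. The names `ALLOWED_DEVICES` and `mknod_allowed` are hypothetical and not part of the patch set; the major/minor pairs are the standard Linux numbers for /dev/null and /dev/zero.

```python
import os

# Hypothetical allow-list keyed by (major, minor). 1:3 is /dev/null and
# 1:5 is /dev/zero. Devices such as /dev/mem (1:1) and /dev/kmem (1:2)
# are deliberately left out.
ALLOWED_DEVICES = {
    (1, 3): "/dev/null",
    (1, 5): "/dev/zero",
}


def mknod_allowed(dev: int) -> bool:
    """Return True if the device number passed to mknod is on the allow-list."""
    return (os.major(dev), os.minor(dev)) in ALLOWED_DEVICES
```

A supervisor could call `mknod_allowed(os.makedev(1, 3))` before creating the node on the container's behalf. Keeping the list in userspace means it can be changed without a discussion with each device maintainer, which is the objection raised above to hard-coding it in the kernel.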
There's a flag in the kernel, maybe it's FS_USERNS_ something else, I forget what it is, that indicates whether or not a user can mount a file system. But the reason these restrictions exist is mostly because, you know, the file system parser's not safe, or whatever. And some of the maintainers, like Ted Ts'o, the ext4 maintainer, have said, hey, I'll fix weird file system parser bugs that cause exploits. So some of the file systems might be safe. So maybe that's something you'd be interested in doing. And basically, it sort of has the same check here: you must be root and have the CAP_SYS_ADMIN capability to do that.

So what if we could instead do something like this? The PDF looks bad because I had slides with animation, but we'll just use the PDF so that it doesn't die. So anyway, if I start the container and the container runs a mount syscall, that would go into the kernel through the regular entry path, and then do some magic function, and then forward that to the host, which could then setns into the container's namespace, do the mount, assuming it approved of things, and then lie to the container and say, hey, your mount call succeeded. The container didn't really run the mount call in its own context, but we did it for them, we lied to them, and it all worked out.

So you can imagine doing this with mount, or with mknod, or with loading kernel modules. Ordinarily, you would not want to let a container load a kernel module, because then all of a sudden they're running code in the kernel. But there is this thing where it's actually maybe reasonable to look at the module blob that the container's asking you to load and then load the host version. Because in particular, the kernel that the host is running may be different from the kernel that the container was built for. So even if it wasn't malicious, it still might be incorrect. You wouldn't be able to load that module.
But if you're, you know, trying to load the iptables module or whatever, as long as you load the host module and it exposes all the APIs that the iptables thing expects, then it should work just fine. So this is a way that you could allow containers to load kernel modules, assuming that you had a safe version on the host that you actually trust. So this is another application of this. You can imagine there's a whole bunch of these, and it would just kind of be nice to have an implementation of this magic function here.

So how do we implement magic? The easiest way is with ptrace, and you can do this today. In fact, if you look at the checkpoint/restore tool, the live migration tool for containers, they do a lot of this stuff, but actually gdb and every other debugger does this stepping, where you attach to some task, you step through syscall by syscall, you look at whatever the arguments are, and you process them however you want. And of course, the problem with this is you have to stop on every syscall. So that gets slow, and most of the syscalls we don't care about. You maybe only care about mount and init_module, which are fairly uncommon, and the rest of the thing you want to just run, you know, forever.

So there's a second way, which is still using ptrace, where you can use seccomp. Basically this is the assembly code version of BPF, but the important part here is SECCOMP_RET_TRACE. Basically what this says is, if you see a mount syscall, trigger a ptrace event. So then you can ptrace attach to a task, and it runs along normally, not stopping at every syscall, but when this seccomp program says, hey, something happened, it triggers a ptrace event, and you can be listening for that. So this is a lot faster, because you only stop on the syscalls that you're looking for, but it still means that you have to ptrace the task.
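To make the SECCOMP_RET_TRACE filter described above concrete, here is a small sketch that builds a classic BPF program of that shape and runs it through a toy interpreter. It does not install anything in the kernel. The opcode and return-value constants are the ones from the Linux UAPI headers, and 165 is the mount syscall number on x86_64; the helper names (`build_filter`, `encode`, `simulate`, `pack_seccomp_data`) are made up for this example. A real filter should also check `seccomp_data.arch` before trusting the syscall number, which is omitted here for brevity.

```python
import struct

# Classic BPF opcodes used by seccomp filters.
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS: load a 32-bit word at offset k
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K: compare accumulator with k
BPF_RET_K = 0x06     # BPF_RET | BPF_K: return the constant k

SECCOMP_RET_TRACE = 0x7FF00000
SECCOMP_RET_ALLOW = 0x7FFF0000
AUDIT_ARCH_X86_64 = 0xC000003E
NR_MOUNT = 165  # x86_64 syscall number for mount


def build_filter(nr):
    """Return (code, jt, jf, k) tuples: trace syscall `nr`, allow the rest."""
    return [
        (BPF_LD_W_ABS, 0, 0, 0),               # A = seccomp_data.nr (offset 0)
        (BPF_JEQ_K, 0, 1, nr),                 # A == nr ? next : skip one
        (BPF_RET_K, 0, 0, SECCOMP_RET_TRACE),  # notify the ptrace tracer
        (BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW),  # everything else runs normally
    ]


def encode(prog):
    """Pack the program as an array of struct sock_filter (8 bytes each)."""
    return b"".join(struct.pack("=HBBI", *insn) for insn in prog)


def pack_seccomp_data(nr, arch=AUDIT_ARCH_X86_64, ip=0, args=(0,) * 6):
    """Build the bytes of a struct seccomp_data for the interpreter below."""
    return struct.pack("=iIQ6Q", nr, arch, ip, *args)


def simulate(prog, data):
    """Toy interpreter covering only the three opcodes used above."""
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = struct.unpack_from("=I", data, k)[0]
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")
```

Feeding `simulate(build_filter(NR_MOUNT), pack_seccomp_data(NR_MOUNT))` yields `SECCOMP_RET_TRACE`, while any other syscall number yields `SECCOMP_RET_ALLOW`, which is why the traced task only stops on the syscalls of interest.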
And so if I'm a container runtime and I'm trying to implement this lying to containers about what their syscalls are returning, that means I have to ptrace them. So then when the user comes along and wants to debug that, what do they do? If they want to debug, well, they use the Feynman algorithm, which is: write down the hard problem, think really hard, write down the solution. And that's not really ideal. What you'd like to be able to do is use strace and gdb and all these other tools that you're familiar with. So what we'd really like to do is implement something that doesn't involve ptrace.

So that's basically what this patch set does. Instead of doing SECCOMP_RET_TRACE, we can introduce yet another seccomp return code, which is SECCOMP_RET_USER_NOTIF, and that basically would trigger this magic to happen. And then how you would interact with that, as a programmer who's on the other side of it, is, at least in the implementation that I've posted, just a file descriptor. And so you have two things. You have a notification: the kernel sends you this structure. You have an ID, which is like a little opaque cookie, the process ID of the task that did this, and then the seccomp data. And if you look at this, it's the same structure you get for your BPF program here, this seccomp_data; you get basically the same data.

And this is important because this seccomp_data structure does not expose stuff like the actual memory of the process. So for example, for the mount system call, you actually pass it the name of the file system and the source and target locations that you want to mount at. You pass those as pointers. And in the kernel, the way the syscalls are implemented, they're not atomic. So for example, if the seccomp policy runs first thing right after entry, it might read some pointers and check the memory to say, okay, this mount is reasonable. We're just mounting here or there.
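The notification described above (an ID cookie, the PID of the calling task, and the seccomp data) can be decoded on the supervisor side with a few lines of code. The sketch below assumes the layout of `struct seccomp_notif` and `struct seccomp_data` as they appear in the mainline Linux UAPI header, which may differ in detail from the v1 patch set discussed in this talk; `Notification` and `parse_notification` are hypothetical names used only for illustration.

```python
import struct
from typing import NamedTuple, Tuple

# Assumed layout (mainline UAPI):
#   struct seccomp_notif { __u64 id; __u32 pid; __u32 flags;
#                          struct seccomp_data data; };
#   struct seccomp_data  { int nr; __u32 arch; __u64 instruction_pointer;
#                          __u64 args[6]; };
NOTIF = struct.Struct("=QII" + "iIQ6Q")


class Notification(NamedTuple):
    id: int                   # opaque cookie, echoed back in the response
    pid: int                  # task that made the syscall
    flags: int
    nr: int                   # syscall number
    arch: int                 # AUDIT_ARCH_* value
    instruction_pointer: int
    args: Tuple[int, ...]     # raw register values, pointers are NOT followed


def parse_notification(buf: bytes) -> Notification:
    """Decode one notification read from the listener file descriptor."""
    ident, pid, flags, nr, arch, ip, *args = NOTIF.unpack(buf)
    return Notification(ident, pid, flags, nr, arch, ip, tuple(args))
```

Note that for a call like mount the `args` tuple only carries the pointer values for the source, target, and file system type. The strings themselves still live in the calling task's memory, which is exactly why the time-of-check, time-of-use question comes up.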
Those are both unimportant paths. And then it says, okay, go ahead and do the syscall. But then when the FS layer actually reads this, if you've changed that value via another thread, or somehow with shared memory or something, then the seccomp filter already said it was okay, but now you've changed it to something nefarious. So there's a time-of-check, time-of-use problem here. And that's why seccomp traditionally does not expose the task's memory to the programs.

And so you might think, okay, well, don't you have the same problem here? And the answer is yes. You have to be a little bit careful about how you do this. So the magic trick there is: if the task has some shared memory with another thread or something that can mutate its address space, it can change the memory for that task. But if you, as the tracer or the person who's doing the mounting or whatever on behalf of this task, copy all of the relevant memory out and then inspect it all on your side, then it's an atomic thing, because they can only edit their memory, they can't edit your memory. So whatever value the syscall arguments had when you copied them, that's what you're going to see. And then you can apply your policy and say, yes, this is an okay mount, or no, this is not an okay mount. So with a little bit of cleverness, you can work around this time-of-check, time-of-use issue which seccomp has.

So anyway, that's what the notification looks like. So basically, over here on the right side, you read from that, then you do whatever the stuff you're going to do is, and then you respond. And you can respond with an error code, and that's all well and good. As you might imagine, "do stuff" is slightly complex. So there's the time-of-check, time-of-use issue that I just mentioned. There's also the problem that actually accessing the memory is itself a somewhat complicated problem.
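The copy-then-inspect trick and the "read, do stuff, respond" sequence described above can be sketched as follows. This is an illustration under stated assumptions, not the patch set's code: `mem_fd` stands for an already-open descriptor on the target's memory (any seekable file works for experimenting), `ALLOWED_TARGETS` is a made-up policy, and the response layout follows `struct seccomp_notif_resp` from the mainline UAPI header (`id`, `val`, `error`, `flags`), which may not match the v1 patches exactly.

```python
import errno
import os
import struct

# Hypothetical policy: the only mount targets the supervisor will accept.
ALLOWED_TARGETS = {b"/mnt/data"}

# Assumed layout: struct seccomp_notif_resp { __u64 id; __s64 val;
#                                             __s32 error; __u32 flags; };
RESP = struct.Struct("=QqiI")


def snapshot_cstring(mem_fd: int, addr: int, max_len: int = 4096) -> bytes:
    """Copy a NUL-terminated string out of the target's memory exactly once.

    After this returns, the supervisor holds a private copy that the
    target (or a thread sharing its memory) can no longer modify.
    """
    raw = os.pread(mem_fd, max_len, addr)
    end = raw.find(b"\0")
    if end < 0:
        raise ValueError("string is not NUL-terminated within max_len")
    return raw[:end]


def build_response(notif_id: int, target: bytes) -> bytes:
    """Apply the policy to the private copy and pack the reply."""
    error = 0 if target in ALLOWED_TARGETS else -errno.EPERM
    return RESP.pack(notif_id, 0, error, 0)
```

The important design point is that both the policy decision and any mount performed on the container's behalf use the bytes returned by `snapshot_cstring`, never a second read of the target's memory. Re-reading would reopen the race that the copy was meant to close.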
Basically, there are these /proc/<pid>/mem files where you can open some other task's memory, seek in it, and then read whatever addresses you want. Or, if you really like pain, you can use ptrace's peek and poke requests. But those are not super awesome to use.

And then there's actually a third problem which I sort of haven't discussed. Say you're doing this with, you know, "I'm going to fake opening a netlink socket for you" or something. You can imagine doing all sorts of crazy stuff with this. But in that case, the system call actually returns a file descriptor, and you need to inject that file descriptor back into the task. Because, you know, if you open a file descriptor here, it's got to get there somehow. And there's a number of ways to do that. Basically, using ptrace, you inject some code that then does a recvmsg on some file descriptor or something. You could inject code that just opens the file descriptor from somebody else's /proc/<pid>/fd, if it's actually visible. Or a potential extension to this patch set is that we could add a way to do this very nicely from the response structure, which may or may not be interesting. We'll see how this whole discussion goes in the first place.

So then there's a question of how you get hold of one of these things. I told you, you read and write over this file descriptor. This is also sort of an open question; if you have an API design that you think is cool, come talk to me. Right now I have sort of two different ways to do it. The first one is that you install a seccomp filter as you normally would with the seccomp syscall, but you can pass this additional flag. And then you get the FD back, and then you can sendmsg that FD to some server or whoever. And then that server will subsequently read, write, poll, do whatever. Of course, the problem with that is if your seccomp policy blocks sendmsg, then you can't do anything. So that's not really ideal.
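Handing the listener file descriptor to a separate server process with sendmsg, as described above, relies on ordinary SCM_RIGHTS descriptor passing over a Unix socket. Here is a minimal sketch of that mechanism; the function names `send_fd` and `recv_fd` are chosen for this example, and any descriptor (a pipe, a file) can stand in for the seccomp listener when trying it out.

```python
import array
import os
import socket


def send_fd(sock: socket.socket, fd: int) -> None:
    """Send one file descriptor over a Unix socket as SCM_RIGHTS data."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))]
    # At least one byte of ordinary data must accompany the ancillary data.
    sock.sendmsg([b"F"], ancillary)


def recv_fd(sock: socket.socket) -> int:
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole int.
            fds.frombytes(cdata[: len(cdata) - (len(cdata) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no file descriptor received")
```

The receiver gets a new descriptor number that refers to the same open file, so the server can read and poll it independently of the sender. This also shows the limitation mentioned above: if the installed filter blocks sendmsg in the sending task, this hand-off cannot happen.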
Another way would be to use ptrace again: you can attach to a task and get a listener for one of its seccomp programs, and then you don't have to sendmsg or have any of this weirdness, but you'd have to use ptrace to do that. And some people don't like ptrace. So if you have another API design that sounds cool, I'm happy to hear it; this isn't really that hard to implement.

So there are some implementation issues. There's synchronization across tasks, because you're passing these messages back and forth between tasks. This is somewhat challenging. It took me a while of sitting and thinking to figure out how to do it. I think it's right, but maybe it's not. So come tell me I'm wrong. There's another issue about synchronization with an expired ID; that's sort of a detail.

Status: the code is on GitHub. We discussed this at Linux Plumbers, which was sort of the genesis for this idea, and then it sort of sat on the shelf for six months, and so that's why I implemented it. I posted v1. Those of you who are very astute will notice that the date in that URL is today. I posted it this morning. So there's not a lot of discussion there yet. Join in. You could tell me I'm wrong.

I was going to do a demo, but I only have five minutes. So I'll do this. I'm a polite person. I think my session or the VM itself froze on me; obviously this machine is not that super awesome. So, screw the demo. I wasn't really going to do it anyway.

Very cool. Thanks.

If you have any questions, I'm happy to answer.

That would be my controller in userspace. So would you want to use the kernel to check the permissions of the tasks that you're tracing?

See, that information is mostly exposed through /proc/self/status and other things. So you can tell what capabilities a task has.

Would you need to reinvent the logic from the kernel of which capabilities are required?

If that's the logic you want, yes. So, you know, you can imagine lots of other things.
The idea here, I think, is really to relax the constraints a little bit. Container orchestrators have a lot of insight that I think the kernel, just as a general rule, doesn't. So the idea here is to sort of exploit that. We can allow containers to do more things but still be secure. Although somebody pointed out to me that the check for whether this is a valid syscall number runs after seccomp. So you can actually use this to implement new syscalls. So I think with some of the applications, it's just going to get a little bit weird. But, you know, other questions?