All right, next we've got Christian Brauner, who is going to be talking to us about supervising and emulating syscalls.

Hey, everyone, I'm Christian. I'm one of the LXD maintainers; I mostly work on the kernel, though. And the mic... maybe it's not high up enough. Okay, this is better, if I keep my head this way. Okay.

Right, I'm going to talk to you about supervising and emulating syscalls and how we make use of this inside of LXD. Just a brief introduction: I'm not going to explain what syscalls are, but think of them as a way that allows user space to communicate with the kernel and make a request, for example "open a file for me", "write something somewhere", and so on. So, simplified, like a fancy request handler.

In a lot of scenarios you run an app and, per se, there is nothing restricting this app from making any kind of syscall it wants. I mean, ignore the permission checking that the kernel performs. In general, if the app knows the syscall number and how to do syscalls, it can just ask the kernel to perform something for it. For a lot of apps you might want to restrict the way they can request something from the kernel. You might, for example, want to allow them only a certain set of syscalls, not all of them. You might want to restrict how they can call a specific syscall, with what arguments, and so on.

Most operating systems have a similar concept, but on Linux the part of the kernel that is responsible for allowing you to do such sandboxing is seccomp. It allows you to intercept system calls and then denies or allows them. For example, a container runtime, one that runs privileged containers at least, would usually block a syscall called open_by_handle_at, because with symlink trickery you can open any file on the host system that you want. What is usually done is that it returns ENOSYS. So a seccomp profile is loaded for
that application, and then if the application makes the open_by_handle_at syscall, the kernel returns ENOSYS or EPERM or whatever to tell the application: you cannot use this syscall, it is off limits. So you can see where this is going, right? You can have blacklists and whitelists and so on, and most container runtimes make use of this in some form or other.

But one of the limiting things is that the kernel actually never blocks. So the policy is, to some extent, not fully dynamic. You can't really say: if an application performs this syscall, stop it and wait for me to tell you what you're supposed to do. That's not possible with seccomp. Well, it was not possible with seccomp. But it would open up a lot of interesting possibilities and would extend how you can sandbox applications in a much more fine-grained way.

And there's an interesting fact that you should keep in mind for the later part of the presentation: seccomp always runs before the syscall number is looked up in the syscall table. What the kernel usually does is this: you request a syscall and you pass down a number in a specific register, dependent on the architecture, and the kernel then takes this number, looks up the corresponding syscall in the syscall table, and calls it. Obviously, if this syscall number doesn't exist in the syscall table, the kernel will tell you this syscall doesn't exist. Seccomp runs before that. Maybe you know where I'm going with this; if not, I'll explain in a little bit.

So, right, seccomp never blocks. What could you do if it actually would block? Well, there would be a lot of interesting use cases. What we wanted to do is load a profile that specifies: if the application performs the mknod syscall, then I want the kernel to not deny or allow it. I want the kernel to wait for me to tell it
what it's supposed to return to user space. So the kernel blocks and waits for a response from the corresponding process to tell it what to do, and this is exactly what the seccomp notifier is doing. It's something that's been toyed around with in various forms, and the guy who implemented this is actually here: Tycho, who did the original kernel work for this.

One of the factors is that seccomp asks user space for the return value and errno, but the execution usually doesn't continue in the kernel. So if the application performs a syscall, for example a mknod syscall, and it blocks, the only thing that we can do right now is emulate the syscall in user space. I'm going to give a demo of this in a little bit. What this allows you to do is basically fake... well, "fake any kind of syscall" may be wrong, but you can very much expand what you can do with containers right now.

We touched upon this in the first talk, I think; somebody asked what the limitations are that you usually experience with unprivileged containers, which are the containers that you should use because they're actually safe, or way safer than privileged containers. Well, you can't do a lot of things, right? If you want to mount any kind of block device inside of a container, the kernel will not allow you to do this, because file systems in the kernel are usually not able to protect themselves against malicious images. If you allowed an unprivileged user to mount any kind of image they have been given, stick in a USB stick, mount it, and so on, you could crash the kernel easily.

In user namespaces you're also not allowed to do any kind of mknod syscall. If you could, you could create, I don't know, /dev/kmem, /dev/mem, whatever it is, and then write into random kernel memory as an unprivileged user. That would also be pretty bad.
So there's quite a bunch of limitations, and this obviously means that most user space tools have not been written with containers in mind, and especially not with user namespaces or unprivileged containers in mind. It's kind of a catch-up game where user space is slowly accepting containers as a common thing. Tools that get written nowadays will usually take care to function in user namespaces, but a lot of old tools, I don't know, fakeroot or whatever, will usually try to create device nodes and so on. Then they're running in a container, and there is no good reason why they shouldn't be allowed to run in a container, but they will fail.

The obvious solution that you could throw in my face, if we're thinking about the mknod case, is: why not have a whitelist of devices inside of the kernel that we deem safe? Well, that's kind of messy. Then you get into a catch-up game where you have to wait on the kernel version that gets you the right device node and knows whether it's safe or not.

And oftentimes, especially in the container case, you have a container manager and you have the container, and the container manager is sort of your main authority apart from the kernel. The kernel denies things that are globally unsafe, and your container manager will usually be able to judge whether or not a specific operation is allowed for that container. It will usually have an idea of the workload. Well, not necessarily, but for a lot of containers it will know: what's the workload? Can I trust this process inside of the container? Do I think a specific syscall is safe? And so on. Previously there was no nice way to actually delegate this power to the container manager itself, and this is where the seccomp notifier really becomes powerful.
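The kind of per-container judgment described here can be sketched in a few lines. This is only an illustration of the decision step, not LXD's actual code: the function name `decide_mknod` and the `SAFE_DEVICES` table are hypothetical, and apart from the console device used in the demo the listed entries are assumptions.

```python
import errno
import os
import stat

# Hypothetical allowlist keyed by (file type, major, minor).
# The console entry (c 5 1) matches the demo; the others are
# illustrative assumptions, not LXD's real list.
SAFE_DEVICES = {
    (stat.S_IFCHR, 5, 1): "/dev/console",
    (stat.S_IFCHR, 1, 3): "/dev/null",
    (stat.S_IFCHR, 1, 5): "/dev/zero",
}


def decide_mknod(mode: int, dev: int) -> int:
    """Return 0 if the manager would emulate the mknod, else -EPERM."""
    key = (stat.S_IFMT(mode), os.major(dev), os.minor(dev))
    return 0 if key in SAFE_DEVICES else -errno.EPERM


if __name__ == "__main__":
    # A character device 5:1 (console) is accepted ...
    print(decide_mknod(stat.S_IFCHR | 0o600, os.makedev(5, 1)))
    # ... while a character device 1:1 (/dev/mem) is refused.
    print(decide_mknod(stat.S_IFCHR | 0o600, os.makedev(1, 1)))
```

The point of the design is that the list lives in the container manager, so it can differ per container and does not depend on the kernel version.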
So you load a seccomp profile. The mknod syscall doesn't work, but now you have this seccomp profile that stops the process when it performs the mknod syscall, right before it's even looked up in the kernel's syscall table, and then it sends you a message on a file descriptor. The user space process that is supervising the container can then inspect the arguments of the syscall and, based on that, make an informed decision. It can say, for example: it's safe, the container just wants to create the console and not something really malicious. That's the general idea, and then it denies or allows it.

The concrete implementation is this: a task in the container loads a seccomp profile and gets an FD for this profile. This FD is the so-called notifier FD. The task can then hand it off to the container manager, like LXD, Docker, Podman, whatever, or conmon. It can then be registered in an epoll loop, in an event loop, or in whatever form, and you can wait for events on it. Every time a syscall that is registered in the filter for the container is made, the supervising manager will get an event on the notify FD. It can then receive the syscall arguments and which syscall has been made, inspect them, and then do whatever it wants, and at the end tell the kernel to tell the process "the syscall succeeded" or "the syscall failed". Excellent.

Maybe I'll do the demo first. This is the gist of it, but we've expanded it quite a bit over the last while.

This is too small, right? Better? Okay. What do I do with this? So I have a container running here. Not running, apparently, but I can start a container. Let's go with f1. As we are the LXD project, we run full system containers, as you can see. We are in... one second.
We are in an unprivileged container: uid_map, and, for the really suspicious, gid_map. So what we obviously can't do is stuff like this. Wait, let me see. No, whatever: mknod bbb c 5 1. That basically says: make me a device node called bbb. I know the arguments are really terrible. c is character device, and 5 1 is console in any case. But the kernel will not allow you to do this at all. This is off limits, because otherwise you could create any device node, and there is no static device list in the kernel and so on, as I said.

Now, that was the sdb ext4 file system. That's for a later part of the demo, so don't get suspicious.

So now we can... let's see: config set f1 security.syscalls.intercept.mknod true. And now, because seccomp profiles currently can't be reloaded dynamically, restart f1. Go inside. Let's try the same exercise again.

So you can see we don't fool you. There is no mount or... oops. Well, this is an actual device node. Yeah, it's a character special device node. And obviously we haven't changed the kernel. I haven't rebooted the kernel and so on. What we did is we used the seccomp notifier. The seccomp notifier sent a message to the LXD daemon saying: this container wants to make a mknod syscall. Then the LXD daemon was like, okay, let me look at the arguments. We have a whitelist in LXD that states, for example, /dev/console, yeah, whatever, totally fine, and it performs the mknod action for you. It emulates it in user space, so to speak. And if I were to change this to a block device, it hopefully fails. Well, yeah, obviously it fails.
I mean... oh, I know why this still works: because I modified the demo to allow me any device node. But good, because I wanted to show cool shit.

So obviously this is an interesting use case, because it allows you to run tools like fakeroot and so on, but it's not the most interesting use case. The more interesting use cases are these. Currently we're very limited. If you request from the LXD daemon that we mount something for you, so the user on the host types something in and states "I want this mount to appear in the container", then we can already do that. We can inject mounts, no problem whatsoever. But often you have a tool running inside of the container, and we previously had no way to intercept its syscall. We had no way of knowing that it actually wanted to mount something that we consider to be safe.

So what you can do is you can also intercept (five minutes, okay) the mount syscall and then make it possible to have various stuff mounted in your container. So let's start with... that's mount, right, and then allowed, and we state that it's fine to mount, let's say, ext4. I should have a device node in there, hopefully. Yep, so I have a file system. Let me briefly remove this; the reason is that I want to show you that it actually doesn't work by default.

So if you take /dev/sdb, which is an ext4 file system, and try to mount it, the kernel will tell you: no, this is off limits, because it could be a malicious image, you're trying to crash me or whatever. But if we set allowed and do a restart now, and I do the mount, there it is. Because we have allowed ext4 mounts in our policy, the LXD daemon will now intercept these syscalls, and if you call mount from inside of the container, it will mount the ext4 file system for you.
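The mount policy used in the demo can be pictured as a simple lookup. This is a sketch under assumptions: the key spellings `security.syscalls.intercept.mount` and `security.syscalls.intercept.mount.allowed` are reconstructed from what is typed in the demo, and `decide_mount` is a hypothetical helper, not how LXD evaluates the settings internally.

```python
import errno


def decide_mount(fstype: str, config: dict) -> int:
    """Return 0 if the manager would perform the mount, else -EPERM.

    `config` stands in for the container's configuration keys; the
    comma-separated format of the allowed list is an assumption.
    """
    if config.get("security.syscalls.intercept.mount") != "true":
        # Without interception the kernel's own refusal applies.
        return -errno.EPERM
    raw = config.get("security.syscalls.intercept.mount.allowed", "")
    allowed = {fs.strip() for fs in raw.split(",") if fs.strip()}
    return 0 if fstype in allowed else -errno.EPERM


if __name__ == "__main__":
    cfg = {
        "security.syscalls.intercept.mount": "true",
        "security.syscalls.intercept.mount.allowed": "ext4",
    }
    print(decide_mount("ext4", cfg))   # allowed by the policy
    print(decide_mount("btrfs", cfg))  # not listed, so refused
    print(decide_mount("ext4", {}))    # default: refused
```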
The problem with this, obviously, is that it's only safe if you do it via FUSE. So instead of enabling the file system, we also allow you to set an option where you can rewrite it to FUSE. Any ext4 mount inside the container will then be done by FUSE, which is a safe user space implementation of this. It's the only way you can safely do this. But if you know your workload, and you know that this unprivileged container is running something that can only ever have access to specific types of device nodes, then it's probably also fine if you set the allowed property.

One of the limitations that we've recently gotten rid of, which just got released, is a patch that I've merged for 5.5: we now also allow you to continue syscalls. What I said before is that the kernel never continues syscalls from user space. Actually, this is now possible when you tell the kernel to do so. But be aware: this cannot be used to implement a security policy at all.

Imagine the following scenario. A container performs a syscall, it gets stopped, and you now inspect the arguments. Based on the arguments that you copied out, you make a decision and you say: continue this syscall. An attacker with equal privilege could rewrite the syscall arguments in the meantime. So you need to be sure that, if that happens, the kernel will still be sufficiently protecting you.

Yeah, I actually have more, but I think we're out of time. If you want to just do the demo very quickly while people ask questions or whatever.
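The race described above can be illustrated with a small, deterministic model. It is only a toy: `TaskMemory`, `is_safe` and `race_demo` are hypothetical names, the steps run in a fixed order instead of on real threads, and no kernel interface is involved.

```python
class TaskMemory:
    """Toy stand-in for the memory of the intercepted task."""

    def __init__(self, path: bytes) -> None:
        self.buf = bytearray(path)

    def read(self) -> bytes:
        return bytes(self.buf)


def is_safe(path: bytes) -> bool:
    # Hypothetical policy: only images under /safe/ are acceptable.
    return path.startswith(b"/safe/")


def race_demo():
    mem = TaskMemory(b"/safe/disk.img")

    # 1. The supervisor copies the argument out and checks it.
    snapshot = mem.read()
    approved = is_safe(snapshot)

    # 2. Another thread with equal privilege rewrites the argument.
    mem.buf[:] = b"/evil/disk.img"

    # 3. The supervisor says "continue"; the kernel reads the
    #    argument from user memory again and sees the new value.
    seen_by_kernel = mem.read()
    return approved, snapshot, seen_by_kernel


if __name__ == "__main__":
    approved, snapshot, seen = race_demo()
    print(approved, snapshot, seen)
```

The check passes on the snapshot, yet the value the continued syscall acts on is different, which is why continuing is only acceptable when the kernel's own checks would still refuse anything dangerous.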
I don't... you still have about a minute. I don't have your expertise in doing demos very quickly, but I also want to take the chance to take questions.

We do have the actual instructions for that in the release post for LXD 3.19. So if you want to try it for yourself, the explanation of what the demo was going to do is there.

There are more features coming. Sargun, someone from Netflix, has worked on a new syscall that I've merged for 5.6 that also allows you to retrieve file descriptors, which means it will also allow you, at some point, to bridge sockets. So you can intercept, get an FD from the task that you intercepted, and then connect it to a different port or a different IP address, whatever you want. So it's pretty cool. We can take one question.

If an attacker with equal privilege could change the parameters, why not add another option to allow the application to be blocked while the security check is being done? Because this way you could actually use this as a security mechanism.

I think I have really bad ears, so I didn't understand it.

You mentioned that you cannot use this as a security mechanism because the attacker can change the parameters while the check is running, in a race condition. So why not add an extra option for the application to be blocked while the check is being done? This way it could be a security mechanism.

The thread itself is frozen when you hit that. The problem is if you're multi-threaded. And the problem is not only multi-threading, but also that, at the time seccomp is running, the kernel has not copied any of the memory.
So as soon as you get out of seccomp and into the actual handler in the kernel, the logic there will read again from user space. So you can use it to do a quick check and deny, but you cannot really do any more, because otherwise the value might still change underneath you. And that's just because of where seccomp is in the stack.

That comes back to my point about running before the syscall is looked up in the syscall table. The syscall obviously has not been performed yet. The actual memory copying happens, well, mostly in the actual syscall itself.

The cool thing, which I haven't demoed (I have a demo for this), is that you can invent syscalls with this. That's why I pointed out that it runs before the system call number is looked up in the system call table. So technically you could, for example, say that some random number, which I have now invented as a new syscall, is a way to tell LXD to perform a certain kind of operation for you. And the syscall doesn't even need to exist in the kernel. So technically, even if you're on an older kernel, we could for example say: oh, it doesn't have this syscall, let's emulate it in user space. Yeah, you can emulate missing syscalls with it. You can prototype new syscalls with it. You can do a lot of really weird stuff. Okay, I'm well out of time. Thank you very much.
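The ordering argument, that the filter sees the syscall number before the syscall table does, can be made concrete with a toy dispatcher. This is a model only, not the kernel interface: the `Notification` and `Response` classes loosely mirror the request and response messages of the notifier, and the syscall numbers, `SYSCALL_TABLE` and `supervisor` are invented for illustration.

```python
import errno
import itertools
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Notification:
    """What the supervisor receives (loosely modelled)."""
    id: int
    pid: int
    nr: int
    args: Tuple[int, ...]


@dataclass
class Response:
    """What the supervisor sends back (loosely modelled)."""
    id: int
    val: int = 0
    error: int = 0


# Toy syscall table: number 1 adds its two arguments.
SYSCALL_TABLE = {1: lambda args: args[0] + args[1]}

INVENTED_NR = 4242            # arbitrary number, absent from the table
NOTIFY_FILTER = {INVENTED_NR}
_ids = itertools.count(1)


def supervisor(req: Notification) -> Response:
    # Hypothetical emulation: the invented syscall doubles its argument.
    if req.nr == INVENTED_NR:
        return Response(id=req.id, val=req.args[0] * 2)
    return Response(id=req.id, error=-errno.ENOSYS)


def do_syscall(pid: int, nr: int, args: Tuple[int, ...]) -> int:
    # The filter runs first, before any table lookup.
    if nr in NOTIFY_FILTER:
        req = Notification(id=next(_ids), pid=pid, nr=nr, args=args)
        resp = supervisor(req)
        assert resp.id == req.id  # a response must match its request
        return resp.error if resp.error else resp.val
    handler = SYSCALL_TABLE.get(nr)
    if handler is None:
        return -errno.ENOSYS      # unknown number, nobody intercepted it
    return handler(args)


if __name__ == "__main__":
    print(do_syscall(100, INVENTED_NR, (21,)))  # answered by the supervisor
    print(do_syscall(100, 9999, ()))            # not filtered, not in table
```

Because the filter is consulted first, the invented number never reaches the table lookup, which is what makes emulating missing syscalls or prototyping new ones possible in this model.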