Hello everyone, thanks for tuning in. My name is Ignat and today we're going to talk about sandboxing your code on Linux with seccomp. Just to let you know, this session is prerecorded because my home broadband is not good enough for a live presentation, but I should be online while it is being streamed, so I'm happy to answer your questions during or after the presentation. Thanks for tuning in, and let's go.

First, a little bit about myself. I work at Cloudflare, mostly on performance and security. I'm based in London. I'm also passionate about cryptography and enjoy low-level programming: the Linux kernel, bootloaders and other scary low-level C stuff.

OK, to understand sandboxing, let's talk about Linux system calls and seccomp first. On modern computers we don't run our code directly on the hardware, right? We use operating systems, with Linux, Windows and macOS being the prominent examples. The operating system runs on top of the hardware, and our applications run on top of the operating system, which separates computing into two environments: applications run in user space and the operating system kernel runs in kernel space. The kernel is a middle layer between applications and the hardware which provides useful abstractions and services.

So, system calls then. If you run your application on some kind of operating system, at some point you need to communicate with it: to request some services from the operating system, or access to some hardware. For that purpose the operating system kernel provides an interface, and system calls are basically that interface: a set of functions your application can call to request services from the kernel on behalf of the application.
So yeah, system calls are just a well-defined interface between user-space applications and the operating system kernel, and they provide many features. For example, hardware-independent abstractions: your application does not need to care whether it runs on an old spinning disk or on a modern SSD, and it doesn't need to know the details of your network implementation. The operating system provides basic high-level services like file I/O, network access, and also time, sound and some other services. This architecture also allows the operating system to enforce different security models, like access control lists, permissions and privilege separation between applications. And because we run multiple applications on the same hardware, the operating system kernel is also responsible for resource management, to ensure that hardware resources are shared fairly between all the applications.

So what is seccomp then? Seccomp is just yet another system call, another operating system interface. It provides a way for an application to notify the kernel which other system calls the application intends to use, and sometimes which system calls the application will definitely not use. The kernel then enforces the system call policy provided by the application. Basically, seccomp is a tool for an application to confine, or sandbox, itself. And remember, seccomp is just a tool: the documentation says it's not a full sandbox solution, it's a tool for creating sandboxes.

So let's go back to our example. The application can use the seccomp system call to provide its system call policy to the kernel. Think of it as a contract between the application and the kernel, where the application promises to use only a subset of system calls and probably never to use certain others. And because it's a contract, if the application later breaks it, the kernel is free to take action.
It usually penalizes the application by terminating it. So why would an application actually do that? If it doesn't provide any policy, the Linux kernel will never penalize the application. Why would an application take the risk of providing a policy if it might, for example, accidentally violate it in the future and be abruptly terminated?

Let's consider an example. Imagine you're writing a simple clock application, and you're writing it on Linux so you can use seccomp. What does a clock application actually need? It only needs to know the current time. So you can define a seccomp policy for your application and pass it to the kernel, and your policy would say: hey, I'm just a clock application, and I will only use the gettimeofday system call, because a clock only needs to know the time. When your application is executing and at some point needs the time, it uses the gettimeofday system call to get it from the kernel, and because that's within the allowed policy, the kernel will allow the call to happen and return the time.

But imagine that at some point, because you wrote your clock application in a low-level unsafe language like C or assembly, someone hacks your application and achieves arbitrary code execution. The attacker will probably use the application to collect some sensitive information from your system, and they will try to send it to themselves over the network to steal it. So they will direct the application to send the data over the network, and the application will most likely use the send system call. But because the send system call was not part of the original policy, the original promise by the application, the kernel will not allow it to happen and will immediately terminate the hijacked application, preventing the data leak. That's why seccomp is very useful.
Seccomp is not very important for a well-behaved application, but it's very effective if the application gets exploited; it's mostly a good way to protect against arbitrary code execution attacks. So sandbox your applications.

OK, how do we actually do that? Let's jump into code now. We'll write a simple command-line application. If you're familiar with the uname command-line tool, it will be something similar: our application will use the similarly named uname system call to get some static information from the local operating system kernel and print it to the user. Let's see how it works. If we compile this code and run it, we see that our operating system is Linux, which is great, because we can now experiment with seccomp further: seccomp is a Linux-only feature.

Now let's try to sandbox our application. We'll modify the code and add just one additional function before the main logic, which we'll call sandbox. Here is the function implementation, where we define our policy. To illustrate that seccomp is working, we will actually prohibit our own application from using the uname system call, but allow all other system calls. To do that, we need to write this scary listing. Unfortunately, seccomp rules are a quite low-level feature, and they are written as BPF programs. The operating system provides some useful macros, but the whole rule still looks very much like assembly. The gist of this scary listing is this part: our rule says that we allow every system call within our process except the uname system call, and if the application tries to use uname, we block it. Instead of terminating the application, we instruct the kernel to return an error code: ENETDOWN, "network is down".
And finally, in our sandbox function we need to actually apply, or enforce, that policy by using the seccomp system call to send our seccomp filter program to the kernel. Let's see how it works now. If we recompile our modified source code and try to run it, we see that the uname system call failed with "network is down". And we know that seccomp is working for two reasons. First, there is no way the uname system call can return a "network is down" error by itself: uname just reads some static data from your local operating system kernel and doesn't need any network access. Because our application receives that error code, we know we hit our seccomp rule. Second, what the error code exercises is the error path in the application: the application has an error path defined where it prints the error to the user, and we see that output. To actually print something to the user, the application has to use another set of system calls, notably the write system call. So we know that other system calls are allowed, because the application was able to print the error code to us.

So it's quite simple, but it's definitely not simple to write seccomp rules by hand. It's a quite low-level, assembly-like language: hard to write, hard to review, hard to debug and update. Even our small example contained quite a large and rather unreadable listing; imagine if you had to define a more complex policy for a complex real-world application. If you write rules by hand, you also need to be aware of some low-level details: seccomp BPF rules do not operate on system call names, they operate on system call numbers, which are the kernel's internal representation.
And to deal with system call numbers, you also need to be aware of the architecture, because the same system call may have different numbers on different architectures. There are also some other quirks, like setting the no-new-privileges bit for the process you want to sandbox, and the description of this is buried very deep in the seccomp man page.

Luckily, as for many complex things, there is a high-level abstraction library, and seccomp is not an exception. We can actually rewrite a similar policy using the high-level libseccomp library, which is even recommended on the seccomp man page. We will modify our policy a bit. Usually, just returning an error code when the process violates its seccomp policy is not great, because you still give malicious code the chance to recover from the error and try to bypass the sandbox or do the evil thing some other way. It's usually better to just terminate the application, which is what we'll do here: instead of returning an error code, we still prohibit the uname system call, but we tell the kernel to immediately terminate the application.

And that huge manual BPF listing can be boiled down to three small statements with the high-level library. First we define the default action: by default, we tell the kernel to allow every system call. Then we add our specific rule with a small statement: we prohibit the uname system call, and if the application tries to use it, we just terminate it. Notice that here we can reference the uname system call by name; we don't have to deal with numbers, because libseccomp resolves them for us. Finally, we again pass the compiled policy to the Linux kernel using a wrapper around the seccomp system call. Let's see how it works now. We recompile our modified source code, but this time we also need to link with the seccomp library, because we used it in our code.
And if we execute the tool, we see the "bad system call" message. Notice that the application did not get the chance to print the error this time, because it was immediately terminated when it tried to use the uname system call. That's why we don't see the "uname failed" message anymore: our policy is stricter.

OK, this is all well and good, but there is a problem: in the two previous examples we actually had to modify the application's source code to add sandboxing support, and that's not always possible. If this were a live presentation, this would be a show-of-hands question. Think of the following: you're a developer, you have a project, you know sandboxing exists, and you know sandboxing is the most effective way to protect your code from the potential security vulnerabilities you might introduce. How many developers actually think, "I will most likely write bugs and security vulnerabilities, so I should sandbox my code"? Of course, not many, right? If you're a developer starting a new project, you have so many other priorities. It's usually: we have to deliver the primary functionality first, we have to deliver the MVP, we have to do this and that, and we'll think about security later. That's the typical case. There are also some proud developers out there who will probably never even admit they might write bugs or security vulnerabilities in the first place. So this is not really an option.

Here is an inherent problem with seccomp: seccomp applies the defined rules, the policy, to the current process. The model is that developers themselves are expected to add sandboxing support into their code, and there is actually no external interface to apply seccomp rules to a running process.
But on the other hand, you have these other people, operators, sysadmins and SREs, who run that code in production. They would really like to sandbox that code for better security, but they have almost no control over the seccomp policy, because they have no access to the source code, or even if they do, they're not very familiar with the code base, et cetera. There is also that whole class of applications everyone runs: closed-source proprietary tools, where the source code is completely unavailable, or third-party applications which really should be sandboxed by default, because you cannot even audit the code, but you have no means to do so.

So with seccomp there is a gap. On one side you have the developers of the code, who have the ability to sandbox their code but most of the time are not incentivized to do so. On the other side you have the operators, sysadmins and SREs, who are incentivized to sandbox the code but don't have the ability, because they don't have access to the source, for example. This is where no-code seccomp comes into play: it would be nice if you could apply a seccomp policy to any process on a Linux system without having to recompile it, or even without having access to the source code.

And it turns out there is a solution: you can do that with systemd. If you run your code through systemd, using systemd as your service manager, and you dive deep into the systemd documentation, you can define the so-called SystemCallFilter directive. You supply a list of permitted system calls to systemd, and systemd converts it into a seccomp policy and injects that policy before actually starting your service. If the process later violates the policy, it is terminated with the SIGSYS signal. You can read more about this in the systemd documentation. So let's see how it works.
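The SystemCallFilter mechanism described above can be sketched as a unit-file fragment. The service name and binary path here are hypothetical, not from the talk; the directive names are systemd's own.

```ini
# myos.service -- hypothetical unit for the talk's demo tool
[Service]
ExecStart=/usr/local/bin/myos
# Deny-list style: "~" inverts the list, so everything except uname is allowed.
# A violating call terminates the service with SIGSYS.
SystemCallFilter=~uname
# Alternative: fail the filtered call with an errno instead of killing.
#SystemCallErrorNumber=ENETDOWN
```

The same properties can be passed ad hoc to `systemd-run` with `-p`, which is what the next demo does.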
Let's go back to the original, non-sandboxed version of our tool, which basically just calls uname and prints the result. Now let's try to run this tool from systemd. We can use systemd-run to run our tool as a transient systemd service, and we see it runs fine: we still see our output, and systemd also reports that the service exited successfully with exit status zero. So all is fine.

Now we can try to sandbox our code. We add the SystemCallFilter property with a deny list that denies the uname system call. If we run it, we see that it doesn't print any results anymore, and systemd tells us the application was terminated with a signal, which is what we expected. And to apply our custom seccomp policy we didn't need to modify the source code of the myos tool; we didn't even need access to the source code, we just operated on the same vanilla binary.

We can also simulate our first policy, where instead of terminating the application we tell the kernel to return a custom error code. For that we add yet another property, SystemCallErrorNumber, set to ENETDOWN. Now we see behavior similar to our very first seccomp example: the tool is able to print the result, we see that the uname system call failed with the "network is down" error code, but the process itself is not terminated, it just exits with an error code, which is how we programmed it from the start.

So great, this is all well and good. Why don't we just use systemd to sandbox any code on our system? It's possible, but there is always a "but", and there is some small print. If you read further down the systemd documentation, you will notice this line: some system calls, notably execve, exit and some others, are implicitly allowed.
They're always allowed and don't have to be listed explicitly, and even if you tell systemd to prohibit these system calls, they will still be allowed, because systemd itself needs them to function properly. Most of these system calls are OK, except execve. execve is a quite dangerous system call. Let's consider why blocking execve is good and why you should try to do it if your application logic doesn't need the call; most applications don't, unless, for example, you're writing a shell.

Let's remember our clock application example. We said that seccomp is the most effective measure to protect our code from arbitrary code execution attacks. So what does an arbitrary code execution attack look like? If an attacker is able to exploit your code and make your application malicious, they will try to expand their execution capabilities by directing your application to use the execve system call to run some other application, and most of the time that's a shell. The attacker just needs shell access to your system; then the attacker can issue arbitrary commands and do whatever they want. This is the most common path for an arbitrary code execution exploit: exploit the application, make it invoke the execve system call, and replace the application code with a shell. So blocking execve is quite handy for protecting against this kind of attack.

That's why we developed our own sandboxing solution, which we call simply the sandbox tool. It's a toolkit to inject custom seccomp rules into almost any process, without any access to any source code. It actually follows the systemd approach of sandboxing any code on your system with no code changes, but takes it one step further. The toolkit consists of two pieces. One piece is a shared library, which is designed for dynamically linked executables.
And the other part is a launcher for statically linked applications. The advantage is that the toolkit can block any system call, including execve; it works on binaries written in any programming language, and it works even on proprietary binaries. And the tool is open source, so you can check out the source on GitHub.

At this point you may be thinking: what shared library? You said we're talking about zero-code seccomp support; doesn't a shared library imply we need to add some code to our application to use it? Well, no, because it's a bit special. Let's go back to our toy uname-like tool, which prints the currently running operating system, in its non-sandboxed, vanilla version. To sandbox our command-line application with our new Cloudflare sandbox toolkit, all you have to do is run it like so. We do two things here: we preload the library from our Cloudflare sandbox toolkit into the application's process address space using LD_PRELOAD, and we configure the desired seccomp policy using an environment variable. In this case we configure a deny-list policy with only one system call, uname. When we run it, we see it behaves as expected: when it tries to call the uname system call, our application is immediately terminated by the kernel, and we know that because it doesn't even get the chance to print the error message to the console.

So yes, it is a dynamic library, but the magic is that you don't have to call it from your code; all you have to do is somehow link it into your process address space. Some people don't like the LD_PRELOAD approach, and sometimes LD_PRELOAD is actually not usable, because it has exceptions in Linux. To work around the LD_PRELOAD limitation, we can actually patch the compiled executable and add our sandboxing library as a permanent runtime dependency.
If we do that, we don't need LD_PRELOAD anymore, and we can run our application any time just as it is; all we need to do is define the environment variable with the desired seccomp policy. And notice that we patched the compiled executable with no access to the source code, and we still get our result.

OK, but what about static binaries? If we recompile our original application as a static binary, we see that the LD_PRELOAD approach does not work anymore. That's because for static binaries there is no runtime linker, which means linking dynamic libraries into the process address space no longer works. For this specific use case we have the other part of our sandbox toolkit, the sandboxify command-line tool. It has the same configuration interface with environment variables, and all you have to do is run your application through the sandboxify launcher, much the same way as we ran our application under the systemd service manager. This way we can enforce a custom seccomp policy on processes spawned from static executables. In this case we see again that the application was terminated immediately after it tried to use the uname system call and wasn't able to print anything to the console. So all you have to do is replace the LD_PRELOAD directive with our custom sandboxify launcher, which is part of the toolkit.

In a nutshell, the Cloudflare sandbox toolkit is quite easy to use, and it has a very simple configuration interface based on environment variables. The default and preferred mode is to supply your policy as an allow list: everything in the list is allowed, everything not in the list is denied, and the application is terminated immediately if it tries to use any system call not on the allow list.
The more flexible but probably less secure approach is the deny list, which we saw earlier: every system call is allowed by default except the ones in the list, which might be dangerous syscalls or whatnot. There is also an environment variable for the default action. By default, if your application violates the configured seccomp policy, it is terminated, but you can instruct the kernel not to kill the application and actually allow the system call to happen, just logging that the attempt was made. This is very useful in the early discovery stage, where you're working with some new executable and you're not exactly sure which system calls it even uses. You can run your application in a so-called permissive sandbox, where you just monitor which system calls the application uses, or you can use this mode to verify that your defined seccomp policy is not too tight and you're not blocking legitimate syscalls.

And as I mentioned before, the toolkit itself consists of two parts. The first part is libsandbox.so, a shared library. It's useful for sandboxing dynamically linked executables only, and you just need to load that library into the process address space, either using LD_PRELOAD or by patching the executable directly. The other part is the sandboxify command-line tool, which can be used on both statically and dynamically linked executables; to make use of it, you launch the executable through our custom launcher.

At this point many people ask: why do we even need the first piece, the libsandbox.so shared library? If the launcher can be used on both statically and dynamically linked executables, why not keep just the launcher and never use the shared library approach? And the answer is: they're different.
And to understand the difference, let's review the process startup stages from a thousand-foot view. When you start any process, when you launch any executable, there are two major stages. The first stage is some kind of runtime init: every piece of code has some kind of runtime. If you write in C, you have the C runtime; if you write in Go, you have the Go runtime; if you write scripts in an interpreted language, you have the interpreter runtime. And then you have the actual main logic. If you use the sandboxify approach to inject custom seccomp rules, the rules are injected before the runtime init stage happens; but if you use the libsandbox approach, the rules are injected after the runtime init stage.

Why does that matter? What is this runtime init? The runtime init, regardless of the language you use, is basically a set of system calls which are never used afterwards; that's why it's "init". The runtime sets up some resources, process memory, mappings, et cetera, and it uses a lot of obscure system calls to do that, but most of these system calls are never needed in the main application logic. And here we have the advantage of libsandbox: because its seccomp rules are inserted after the runtime init stage, if you're using the preferred allow-list approach, all you have to allow are the system calls from your main application logic. But with the sandboxify approach, because enforcement happens before the runtime init stage, you actually need to allow all the system calls from both your main logic and the runtime init stage, which is usually a lot more. You end up allowing system calls your main logic doesn't need, and if you don't allow them, you cannot start your application at all, because the runtime init stage will just crash. So let's see a concrete example.
Let's go back to our toy uname-like tool. We'll change the policy now: previously in our examples we used the deny-list approach to prove that seccomp was working, but this time let's ask, what is the minimal allow list we need for our toy application to function properly? If you use the libsandbox approach, it turns out you only have to allow four system calls for the application to function properly, so you have a very tight sandbox. If you run the same application using the sandboxify launcher, then because of the C runtime init stage for this specific application, you now need to allow many more system calls in the allow list. With the dynamic library, the libsandbox approach, if you have a dynamically linked binary, you can end up with a much smaller allow list, greatly reducing your potential attack surface by disallowing many system calls your main logic doesn't use.

And that's basically it, that's what I had for today. Here are some useful links. The first link is to the seccomp man page, which has a lot of the low-level details and quirks I've briefly mentioned in this presentation. The second link is the official repository of the libseccomp library we discussed today; libseccomp is actually used under the hood in the Cloudflare sandbox toolkit. The third link is to the systemd documentation which describes how to inject seccomp rules into any code with systemd. And finally, again, the link to our Cloudflare sandbox toolkit. We hope you find the toolkit useful, that you'll apply it to your applications and give us some feedback, or even better, some pull requests and more functionality. Thank you very much, that's it for today. Now I'm happy to take any questions you have in the chat. See you, and stay safe.

Hello everyone. Thank you for joining my presentation.
I've tried to answer all the questions inline, but there are still some building up, so I'll try to tackle them one by one. Let me just publish the first one. Al asks: it's not unusual for complex applications to need the read, write and/or recv system calls; how do you protect against those? The general idea with seccomp is that you try to come up with a minimal allow list of all the system calls your application needs, and prohibit all other system calls. In most cases just blocking execve is probably even good enough, because execve is dangerous and is the main tool attackers use to get arbitrary code execution if they manage to exploit your code. So even if you just block execve, and your application logic does not need it, and it doesn't unless you're writing some kind of shell, it's a good idea to have that in place. Consider the very famous ImageTragick code execution exploit in the ImageMagick image processing library: if you were executing that code in a seccomp environment, you would most likely be less vulnerable to that kind of attack, and an image processing library probably doesn't need to execute any other binaries, I hope.

OK, to the next one. Rustam asks: is libsandbox being distributed by distros? No, because I just pushed the code to our public GitHub repository today. I hope that will change in the future, but I know some distros are very open to new maintainers taking on new packages, so if you want to take that on yourself, feel free, and you'll get my full support.

The next question. Wayne asks: can we use both approaches, which would allow a longer system call list during runtime init and then a more restricted list during the main code? This is a good idea actually, I hadn't thought about that. In the current implementation probably not, because both tools read the same environment variable.
So to define different seccomp policies, you would probably need different environment variables. Maybe in future releases we'll consider changing this, so that libsandbox uses one set of environment variables and sandboxify uses a different set; then it might be possible. Keep in mind, though, that once a seccomp policy is in force, the only thing you can do is tighten it further, and you also have to remember to allow the seccomp system call itself in your initial policy.

OK, the next one. Thomas asks how tools written in things like Go and Rust use dynamic libraries. Again, because the library is linked into the process at load time, the implementation language doesn't matter; what matters is whether the binary is dynamically or statically linked. I know Go by default has a quite sophisticated system for determining, when you compile your code, whether the output binary will be dynamically or statically linked. Most of the time, if you use some operating system services, especially networking or name resolution, you will get a dynamic binary; if you don't use them, you might get a static binary. Also, if you compile a Go application with CGO disabled, you will most likely get a static binary. Rust by default produces dynamically linked binaries. You can just check with the ldd command whether you have a dynamically or statically linked binary, and then decide which approach to use.

OK, and, I'm sorry if I don't pronounce your name properly, the next person asks: can we filter a system call based on parameter values? In general, yes: because all the rules compile down to BPF programs, it's technically possible to write BPF programs that make decisions based not only on the system call numbers but also on the passed parameter values, although the current library and the systemd facility don't support it.
So if you want that approach, you would probably have to write your own BPF program yourself, specific to your particular application.

Emily asks: how are you collecting the list of system calls needed to run common programs under sandboxify? There are basically two approaches. The simpler one: if you have access to the system and a command-line tool you want to sandbox, you can run it under strace, which has a flag to count all the system calls the application uses, and that forms the basis for your policy. For more complex applications, in our experience you don't exercise the whole execution path, so you will probably get an incomplete list. That is the reason we included the permissive mode: an environment variable that instructs the kernel not to terminate the application but just to log the offending system calls. For production services, we put the new service into this so-called permissive mode first, our log and metrics collection system gathers those logs over a few days, and then we analyze them and compile the initial list of all the system calls we need to allow. That's basically why the discovery mode, the permissive mode, is in the code in the first place.

Okay, I think we have another batch of questions. Sorry. Jason asks: how does sandboxify limit execve when it needs execve itself to execute the target binary? I urge you to look into the code. Basically, on newer kernels you can run the child process under ptrace and temporarily suspend seccomp filtering. This is our main difference from systemd: systemd just spawns the child process, sets up the seccomp policy and does the execve.
We spawn the child process temporarily under ptrace, set up the seccomp policy, suspend seccomp rule processing before doing the execve, and also put a breakpoint on the exit of the execve system call. We then continue execution of the child process, hit the breakpoint once it has executed the execve but has not yet returned to user code, resume seccomp rule processing, and just continue running the application. It's basically one step beyond the systemd approach, and it's how we're able to block any system call, including execve itself.

Renato asks: is it possible to change the allowed or blocked system calls at runtime? Yes, but you can only tighten the policy, and you have to remember to allow the seccomp system call itself in your initial policy, because if you block it, you cannot make any further seccomp calls. If you don't block it, you can add more rules, and the resulting sandbox will be the combination of the previous policies plus your new rules. You cannot loosen it; you can only tighten it.

Vasily asks: what is the license? I have to double-check on that, and thanks for pointing it out.

Emily asks: what happens when you sandbox an executable that itself declares a seccomp sandbox? It's the same as I mentioned before: if your initial policy allows the seccomp system call, the code will be able to further sandbox itself, and the resulting sandbox will be the tighter combination of both. If you prohibit the seccomp system call from the beginning, the application will simply be terminated when it tries to sandbox itself.

Okay, I think we're out of questions here. We can continue the chat in the Q&A Slack room, which is called track two, Linux systems. I'm waiting for you there.
Thank you very much for listening to my presentation, and that's basically it. See you online, and I'm looking forward to your contributions and feedback on the sandbox tool. Thank you.