Okay, welcome everybody. Thank you for joining me for this presentation about system call filtering with seccomp. I realise it's day four or possibly even day five of a long week for all of us, so I'll try and keep the energy levels as high as I can during this presentation.

Just to introduce myself, my name is Simon Goder. I've been working in embedded Linux for nigh on 15 years now. My background is in applied mathematics, and I got into the technology industry doing algorithm development for 3D graphics and worked at a semiconductor company, STMicro, in Bristol in the UK for many, many years. It's there that I got into helping customers use Linux, primarily in home entertainment, set-top boxes, those sorts of devices. I've been with Doulos for nearly ten years and I'm part of the team that writes and delivers training in the embedded Linux space: drivers, Yocto, security, etc.

So what we're going to talk about today is system call filtering with seccomp, and the key word in the title, which I've just not said of course, is the word practical. We're not going to get into the detail of the underlying technology. That's not what this talk is about. What we're trying to give you is an overview of what this thing is and what it can do for you in your system. That's the sort of top-level title: designing to the worst-case scenario. It's about what could go wrong when you're thinking about the security of your product. So we'll introduce some basic concepts, which I'm sure most of you are familiar with, but it's useful just to make sure we're all at the same level. Then we'll introduce seccomp and the accompanying library, libseccomp, and look at some examples. We're not going to get too hung up on APIs and parameters. That's what the man pages are for. But we'll at least introduce the basic concepts. And then we'll look at how we can use this technology in practice in a real system with things like systemd and containers.
So when you design a product, your primary goal is that the device does what you expect and what you want and what your customers want. It needs to be correct in that sense. But there could be some misuse of that device through some flawed design, some hardware problem, some weakness in a piece of software. That weakness may be something you do know about or it may be something you should know about, but it's there nonetheless. And the goal of security in an embedded system like this is to try and protect against this misuse of the device. We're trying to prevent or at least mitigate the effects of somebody attacking your device.

In the picture there, we can see the device in the middle. It's talking to other devices. It's talking to users. It's potentially talking to cloud services. And, represented by the rather scary red lightning bolts, we have potential attacks coming into that device. What we're trying to minimize is the so-called attack surface. We have different attacks being made, and the sum of those attacks or potential attacks is referred to as the attack surface. We want that to be as small as possible. It can be summarized with a phrase which is used in lots of different contexts, which is: hope for the best, but plan for the worst. And that's really the goal when you're securing an embedded system, an embedded Linux device.

Some basics, which I'm sure most of you understand, but it's worth just going through. The first is how do applications actually talk to the kernel and ultimately to the hardware within that system? They do that via a mechanism called the system call interface. There's around 300 system calls. It varies with kernel version. The later the kernel, the more system calls we have, typically. These are normally called via a corresponding library call from your C library. There's not a one-to-one mapping.
There's plenty of system calls which are not supported in the library, but typically you make a library call, it gets turned into a system call, it goes down into the kernel and something happens. So we have an open call here into our runtime library, converted into a sys_open into this magic block, the VFS, the virtual file system switch. It's that block that decides: well, you're calling open, what does that really mean? What are you trying to open? Is it a file on a disk? Is it a display? Is it a sensor? Where do I actually need to run some code in order to service that open request? Special structures called fops structures, file operations, are registered with that block, and it's those that provide the routing through to the underlying driver function, whatever that might be for the particular hardware you're dealing with.

So user space generally can't interact with hardware directly. It's a sort of service relationship, if you like. User space asks the kernel to do something, and the kernel generally will do its best to respond to that. Or there are circumstances where it will say no: permissions, for example. Effectively, what is happening in the kernel is at the highest level of privilege within the system, and user space is at the lowest level of privilege within the system.

The second underlying idea we need to get to grips with is how user space processes are actually launched in a running Linux system. We have an idea of a parent process and a child process. The parent will call a system call, fork, and that will create the child process, which is effectively a duplicate. It's slightly more complicated than that, but for now we can accept that it's effectively a duplicate of the parent.
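As a minimal sketch of the parent and child relationship just described: the parent calls fork, the child does a trivial piece of work and exits, and the parent collects the exit status. The helper name run_child_and_wait is made up for illustration, and the child here never loads a new program.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code`, then wait
 * for it in the parent and return the child's exit status.
 * Returns -1 if fork or waitpid fails, or if the child did not exit normally. */
int run_child_and_wait(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;          /* fork failed, no child was created */

    if (pid == 0)
        _exit(code);        /* child: a duplicate of the parent, exits at once */

    /* parent: block until the child has finished */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_child_and_wait(7) from a small main should hand 7 back to the parent, which is the same status a shell sees when a command finishes.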
And typically what happens, apart from in the most simple sort of hacky test programs (and trust me, I'm no stranger to simple hacky test programs, so I know what I'm talking about), is that the child will call one of the so-called exec family of functions, which loads a new program file into that child process, and off it goes. The parent can exit or, typically, will continue to run. Often that parent is the init process within the system, the first user space process that runs when the kernel boots, process ID one. The kernel will look for the init program and start it running, and init will then launch all the other services. Or, on a more mundane level, the parent will be the shell. Every time you type a command into a shell and hit enter, this is what's happening. You can see that for yourself if you use the system call tracer, strace, and you just strace the shell command: the first line will always be some variation on this, exec something, and the new program that's being loaded. When we're training newbies to Linux on this sort of thing, it often only makes sense when you show them that, and then they understand that relationship between the parent and the child.

So where does seccomp fit into this? Seccomp is a mechanism within the kernel that allows those system calls to be filtered, stopped, logged, processed in some way. It's a kernel mechanism, as I say. It uses something called BPF underneath, which is coming up in all sorts of different contexts at the moment. As I say, we're not going to get stuck into the detail of that. Maybe a good place to start, if you're interested, is a presentation Michael Kerrisk did on this topic. Oh, it must be five years ago now. Have a look on elinux.org. You'll find a link to that. As is often the case, what Michael Kerrisk says is a good place to start with these sorts of topics if you want to know that low-level detail.
Luckily, we don't need to know the low-level detail because we have a user-space library, libseccomp, which provides us a nice set of APIs and supporting tools that allow us to make use of this functionality. So what the diagram shows us at the bottom is the kernel with the seccomp block within that. In user space, we have our parent process, which is setting up the filtering using the library. It creates a child process. That child process attempts to do something it's not supposed to. Perhaps it's been compromised in some way by a buffer overflow or a command injection or one of the myriad ways that programs can be compromised. And the kernel is able to terminate that process, stop it in its tracks. It can't do whatever the attacker wanted it to do. That gives us a very powerful level of protection within our system.

So as I said, I don't want to dwell on APIs too much. That's what the man pages are for. Ultimately, again, thank you, Michael Kerrisk and his colleagues. But we can just have a look at the basics here, and I'll build on this in subsequent slides. So we have a little snippet of code. Of course, we have a header file. Always important. Then we're setting up an empty instance of a special data structure which represents the seccomp filter. Then we initialize that filter and we give a default action. The action we've chosen here is to kill the process. So if you just left it there, nothing would run in that child process, because we're saying: any system call, stop. Obviously, that's not ideal. We typically want to do a little bit more than that. So we can then add rules. Effectively, by setting that default action to a kill, we've put a blanket over the whole thing, and we're now putting pinholes in that blanket to allow certain actions to take place.
And if you look at the list of actions available: we can kill threads or processes, we can trap, we can make the call fail with a chosen error number, we can trace with ptrace, we can log and allow, or we can just allow, or, the bottom one, we can hand off to some other monitoring process. The fact that we can allow things and we can kill things means we can think about it in one of two ways. We can say: these are the calls we'll allow, but nothing else. Or alternatively: these are the calls we won't allow, but anything else is fair game. So we often refer to this as an allow list or a deny list, one way or the other. There's two ways of looking at the problem, if you like.

So, a slightly expanded example. We're setting the default action to be terminate or kill, but we're adding three exceptions to that. We've got three rules with the action allow, and we've got read, write, and close. System calls map down to integer values, and these vary by architecture, so if we're writing portable code, which hopefully we are, we use this macro provided by the library that knows what architecture you're on and will map appropriately. The zero here indicates that any parameters are acceptable, and we'll see on the next slide that we can be even more targeted in our rules. Finally, we load the filter into the kernel and away it goes.

So this is a more complex example of a rule. Again, we're not going to get stuck into the detail, but by placing a three in that third, no, fourth parameter, we're saying we're going to allow sys_read, but we want to check the parameter values.
And that allows very fine-grained control over what you're doing in that system. There's a fairly complicated set of macros that the library provides, but what it comes down to is a set of tests that can be applied: equal to, less than, greater than, those style tests. We can see here that we're checking for equality of the first two parameters and then less than or equal on the third numeric parameter, the buffer size. So you can be incredibly targeted in what you will accept or, equally, what you will deny.

One question that might be in your mind at this point is: how do you know what system calls your application, your process, actually makes? You could guess, I suppose, and you could go through a process of elimination, which would be an interesting way to spend a few days, but there are a couple of maybe more scientific ways we can do that. One is to create an empty filter, but rather than killing everything (oops, apologies) we log everything. Then you run your process, you perform tests on it, use cases, et cetera, and then you look in the system logs via journalctl or whatever the appropriate logging mechanism is that you have. You'll see that seccomp has logged. It's given me the name of the process, just hello world in this case, and the system call number. That's the architecture-specific system call number. Of course, we need to know what system call that maps to. I don't know if you can see it at the bottom; you can see it in the PDF if not. We have a tool provided by the libseccomp library. You tell it what architecture family you're on, and it will map that back to the appropriate system call. So you can go through that list, make a note of all the different system calls you need to put rules in place for, and away you go.

We can also use strace itself, the system call tracer. You can generate a list of system calls.
Now there's a lot of stuff in that output that we may be less interested in, so if you're that way inclined, you can do some command line trickery to strip out all the information and generate a plain list. This is a beauty; this is from some Docker documentation which I found. It uses the strace statistics with -c and does a bit of hacking around with it to generate a plain text list, which is quite handy. One interesting point: if you've got a child process that launches subsequent child processes, grandchildren I suppose, then we can use strace -ff, which will generate logs for all of the descendants in separate files so you can build up a complete picture. That will even work if you're launching a containerized process. You can trace from the container launch, the container runtime, into the running container and get information. So that's a pretty powerful feature.

So here's a sort of summary of what we've seen, this very, very simple example. We include the header. We initialize the filter with a default action. We set up our rules and then we create the child process. The parent needs to be compiled with the appropriate link option, -lseccomp. The child is blissfully unaware, as children often are, that anything is going on here. It doesn't know it's running in this confined environment until something goes wrong.

This isn't typically how we're going to work in a real system. We're going to be working as part of a wider framework of processes and applications. It's important to know how this works, but typically we're going to be using something a bit broader. So for example, we can do this as part of our initialization with the init program. Now, we're focusing on systemd. It's becoming the default in embedded systems. We still see SysVinit being used, and you could do this with SysVinit, a script-based initialization, but you'd have to do it by hand, whereas systemd has that integration built in.
And that's why it's becoming more widely used: because of the integration with security features like SELinux and those sorts of things. So if you're not too familiar with the init process, it's the first user space process, PID 1, launched by the kernel as soon as the storage is mounted for the target root file system. It's responsible for starting up the other user space processes within your system. pstree is a really nice tool for visualizing those relationships.

Systemd can set system call filters per service. As part of the service configuration, there's a parameter, SystemCallFilter, that you can use to set system call filters, and you don't have to worry about the APIs underneath. So the example here is just a test service. We've got the variable SystemCallFilter equals, and you'll see a tilde or a squiggle, and then the system call uname. The tilde is a negation. It means block. So if this service attempts to call uname, it will get blocked.

To make things even easier, systemd actually provides predefined sets of functionality that you can block or allow, depending which way you want to do it. The names you can see are fairly self-evident as to what sort of things they include: file system system calls, network I/O system calls, module system calls, that sort of thing. You can see the full listing in the documentation. In addition to this, systemd has a basic allow list of the things you need at least for your application to start running. So you can add to that basic list, you can add predefined sets in a space-separated list, and you can negate these things as well with the tilde, with the squiggle. So you can build up quite a complex filter fairly easily. If you wanted, I don't know, certain I/O events, but not all of them, you include the I/O event set and then negate some of the individual system calls in the same list. That gives us that control over what we want to do.
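As a rough illustration of what such a service file might contain, here is a sketch of a unit. The unit path and the program path are hypothetical. SystemCallFilter and the @-prefixed set names come from the systemd.exec documentation.

```ini
# /etc/systemd/system/uname-test.service  (hypothetical name and path)
[Unit]
Description=Test service that calls uname

[Service]
# Hypothetical program that prints the kernel version via uname()
ExecStart=/usr/bin/uname-test

# Deny-list form: the leading ~ negates, so uname is blocked and
# everything else stays allowed.
SystemCallFilter=~uname

# Allow-list form, shown commented out as an alternative: start from a
# predefined set, then take one call back out on a second line.
#SystemCallFilter=@system-service
#SystemCallFilter=~epoll_pwait
```

The alternative is written as two lines because that is the form the systemd documentation describes for combining an allow list with a later negation. By default a blocked call terminates the service rather than returning an error.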
Often these days, we're dealing with systems that work with some sort of isolated execution environment or environments. Systems like LXC and Docker are becoming more and more commonly used in embedded systems. We can have pretty full Linuxes running inside a container with multiple processes and a normal init program like systemd, or we can have very stripped-down, lightweight containers that run one or two processes, very targeted in what they need to do. This of course has numerous advantages from a security point of view, from an update point of view, all the aspects of a modern online embedded system. The two I'm going to talk about are LXC and Docker, because they both provide an interface to seccomp out of the box, or more or less out of the box.

LXC is maybe not quite the original, but certainly one of the very earliest container runtimes that's been available in Linux. It's still a powerful interface. It used to sit underneath what Docker does, but not any more. If we want to use seccomp with LXC, we have to first make sure that our LXC has been configured and built so that the seccomp support is built in. So for example, in Yocto, in meta-virtualization, the LXC recipe doesn't include this feature. You have to manually add it when you build LXC within Yocto. Once you've got seccomp built into LXC and the kernel appropriately configured with CONFIG_SECCOMP, then we can add a line to our LXC configuration file, lxc.seccomp.profile, and that points to a text file which contains either a deny list, with an explicit statement of that, or an allow list, with an explicit statement of that. The number 2 there refers to the version of seccomp; the support in LXC has been there for so long that it had version 1 support originally. So for the deny list, we set the name of the system call and the action. What do we want to happen to this system call?
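As a sketch of how those two pieces fit together, the container configuration gains one line pointing at the profile. The path is hypothetical.

```text
lxc.seccomp.profile = /etc/lxc/uname-deny.seccomp
```

And the profile itself, in the version 2 format, could be as small as this. The choice of uname is purely illustrative.

```text
2
denylist
uname errno 1
```

The first line is the version, the second states which kind of list follows, and each remaining line is a system call name followed by its action. Older LXC releases spell the list keywords differently, so it is worth checking the lxc.container.conf man page for the version in use.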
So here's an example of the errno action, where we want to return an error code of 1 if this system call is called. For the allow list, we don't need to give an action: because we're allowing it, we're just providing a raw list. And so once this container has started, it will apply the system call filtering. In fact, that's not quite true. If you look at the logs from LXC, the seccomp filtering is applied really pretty early on in the launch process of the container and will trap a lot of LXC runtime calls as well. In my experiments with a single-process container that just did a printf in a while loop, there were 48 system calls that needed to be allowed before that would successfully launch. So I'm not sure how useful it is. You have to balance getting things to work against the fact that there's no point in having a security fence that's got so many holes in it that anybody can get through.

Docker, as you may not be so surprised to learn, has a more fully featured and maybe more structured set of support for seccomp, as is the nature of these things. It actually runs containers with a default seccomp filter. Whether you knew that or not, it's already there, assuming you've built your Docker with the seccomp feature enabled and, of course, you have the kernel support. It blocks around 40 to 44 system calls by default. Some of them are deprecated, obsolete system calls. Some of them are known to be risky, including the system call that lets you load BPF code, for example. It won't kill the container in the event of one of these calls. What it will do is return a permission denied, so it's using the errno action for that. You can run without the default profile with the option we can see here, --security-opt seccomp=unconfined. Obviously, that's generally not a good idea. What we can do instead is find that default profile and extend it, modify it, change it, and add our own calls into that list.
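For a sense of what a custom Docker profile looks like, here is a deliberately tiny sketch. It is not Docker's default profile: it allows everything and makes a single call, uname, fail with an error, purely as an illustration.

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["uname"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

A profile like this is passed at launch with --security-opt seccomp= followed by the path to the file. In practice, starting from a copy of the default profile and editing it is the safer route, since this sketch gives up all of the default restrictions.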
It's a JSON file. It's pretty readable, pretty easy to understand. We set the default action and we can set individual system call actions and the parameter matching. The text is quite small there, but it is fairly straightforward to implement that. That will go in and it will set the filters for the code running within that container.

So what we've seen there, very briefly, is how seccomp provides a pretty powerful mechanism for limiting what an application or a process can do, whether running directly or running in some sort of containerized environment. It's about thinking about the worst case in your system. What if a particular process is exposed? Maybe you only use this on the end-user-facing parts of your system, the bit where people type in their username or password or whatever it might be, the more exposed parts of the system. We can manage system call filtering per process with the fairly naive mechanism, with something like systemd, or with the container.

But it's really important to understand that this is just one tool in the toolbox. There's many, many, many things we need to do to have a secure embedded Linux system. We've broken them down here into various categories. I'm not going to go through all of this, obviously, which would be pretty dull, but we've got kernel security features and operating system features like MAC, DAC, capabilities. There are other things we can do in systemd to make the system more secure. We've got the programming, the raw code: things to think about like what compilation options are we using, even what language are we using. We're seeing Rust being talked about more and more. Is it the universal panacea that we're sometimes led to believe? We need to think about whether we need to update our system over the air or directly, and the implications of that.
Are we doing networking? Do we need VPNs, secure tunnels? Platform security is very SoC dependent: secure boot, and, if you're on Arm for example and you've got TrustZone, what could you do in that secure execution environment that would make things better? File system security: integrity checking, encryption. And then, maybe more at the management level (maybe management is a dirty word sometimes), having a secure development process in place and thinking about security at the beginning. We deal with customers and industries all around the world, and it's sometimes a little bit frightening when people say, yeah, we've nearly finished our project and now we're thinking about securing it. That obviously is an alarming thing to hear. So there's lots and lots of things to think about.

So we have a little bit of time. I'm afraid I'm a coward, so I've got some demos that are video recordings rather than doing a live demo. I struggle to type and talk at the same time. So if we just pick one of those: this is a systemd demo. We have a simple service that's just going to call uname. It's not doing anything terribly exciting, but I just wanted something to be running as a service within my system. We're running on a QEMU arm64 built with Yocto, as it happens. You can see the process has run; it just spits out the kernel version. Then I've got the service set up, and I have the SystemCallFilter line commented out at this point. So if I just start that service manually, we would expect it to start without any issue and be running quite happily, and there's the message we expect. Nothing very exciting.
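The service program itself is not shown in the talk. A stand-in that does the same job, calling uname and reporting the kernel version, might look like this sketch. The helper name kernel_release is made up for illustration.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: ask the kernel for its release string via the
 * uname system call and copy it into out (truncated if it does not fit).
 * Returns 0 on success, -1 if uname fails. */
int kernel_release(char *out, size_t len)
{
    struct utsname u;

    if (uname(&u) != 0)
        return -1;

    snprintf(out, len, "%s", u.release);
    return 0;
}
```

A service built around this would print the string once and exit. With uname blocked by a filter whose action is to kill, the process would be terminated at the uname call and never reach the print.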
If I stop the service, edit the file and uncomment the second line (notice the tilde, so I'm going to prevent uname), then restart the service. I think to force the symlinks to be regenerated I need to disable and then re-enable. I didn't realise how slowly I type until I see it on the video. And we're not seeing any messages, as we would expect. I think if we look at the status of that service, and I'll start the service first, there are no messages coming out, so obviously it's not working. But if we look at the status, we can see that it's failed to launch. So the system works.

For what it's worth, all the little snippets that I've used for my experiments are on our GitHub. Again, it's in no way production-ready code. Don't take any of it and put it in your next car, because you're going to get problems. But it's all freely available. And really it's just down to me to say: if anyone's got any questions, ask now, or come and find us. We're towards the other side, and we'll be here for the rest of the afternoon. So thank you very much.