Hello, everyone. My name is Mickaël Salaün. I work for the French Network and Information Security Agency. I'm going to talk about a project I started three years ago. It's about hardening Linux, and this talk will be a bit focused on containers.
So, to picture how an attack takes place: the attacker usually starts by compromising a process or a set of processes, then gains more privileges, so elevates the attacker's privileges, to finally gain access to some sensitive data. Here we are going to try to harden user space to make the attacker's work more difficult.
So how do you make something harder to attack? Of course, you may want to use good languages with good security properties to avoid some common security issues, but you may use some legacy software or third-party software, so you may not be able to follow these good practices. However, you may want to follow the least privilege principle, which means giving each set of processes only the rights it needs to perform its legitimate actions. And then you may use some compartmentalization techniques, like for example containers, or dedicated users for services. Here we're going to focus on containers.
For containers, there are some constraints and some pros and cons. What you want when you use a container is to add an abstraction from the host. You may want to update your host at one time and update your container after that, so you don't want to sync all these updates. What is inside a container image may be unique, so you need to have a security property tied to this container, to the services running in it. And this security property should be independent from the host, because the container is something standalone. So what we want is to be able to embed the security policy in the container image, to be independent from the host. There are multiple ways to enforce a security policy on Linux.
Of course, there are the four major LSMs, which are SELinux, AppArmor, Tomoyo, and Smack. All of them may be used to set access rules on files, or even on the network, processes, and IPC. They are designed to enforce access control in a fine-grained way. However, they are not designed to let you embed your own security policy in your container, so you have to configure the security policy for the whole host. There is some work going on with namespacing LSMs, but that's not here yet. And last but not least, you cannot enforce a security policy if you're not running as root. So you need the system administrator to be able to enforce a security policy on your processes, even in an unprivileged container.
There are multiple other features available with Linux. A really interesting one is seccomp, and especially seccomp-bpf. It is kind of a firewall to protect the kernel from user space. As the documentation says, it is not a sandboxing mechanism, but it helps reduce the attack surface exposed by the kernel to user space. And obviously there are namespaces, which are the basis for containers. This is not an access control system, but it may be used to only show some parts of the system resources, like only some file hierarchies or some subset of the network. But you cannot fully use all these features in an unprivileged way. Even with user namespaces, you cannot use every namespace feature, and there may be some concerns about security issues.
Another interesting point with seccomp is that, even if it's not a sandboxing mechanism, you can embed your security policy in your applications, and in your container as well. For example, there are some container managers which use seccomp-bpf to harden their containers at run time. So it is independent from the host. Obviously the host needs to provide some features, but not the security policy.
And then here comes Landlock. Landlock is a Linux security module.
It is targeted at fine-grained access control. So you should be able to enforce a security policy on file paths and some more complex objects, which you cannot do, for example, with seccomp. It is designed to be available to every user of your system, the root user or even unprivileged users. And it is designed so that you can embed your security policy in your application, in your container, without being too tied to the host. So it is a way to create a programmatic access control.
Now let's take a look at what Landlock may look like on a host. This diagram gives you an intuition of how it works. In this example, the process sandboxes itself. It creates a set of Landlock programs, which are loaded in the kernel. These programs are designed to filter some accesses to the system. In this case, these programs filter file system accesses, so every access to files with open, or access, or whatever. This enforcement is only done for the process which enforces the security policy on itself, not for other processes. So if this process is part of a container, this security policy only applies to this process, and it is independent from the host.
There have been several versions of Landlock, and the next one will be the eighth. It is not ready yet, but it will be released in the following weeks. I will give you an overview of what's inside this patch set. And anyway, I published an alpha release of this version, so you'll be able to check it and test it if you want.
This patch set for the Linux kernel is a minimum viable product. It is only meant to demonstrate that it may be useful for a lot of use cases, without bringing too much code, to limit the work of the kernel reviewers. It is focused on file system access control, which is one of the most difficult kernel resources to enforce access control on. And underneath, it is powered by eBPF.
So what is eBPF? eBPF stands for extended Berkeley Packet Filter.
It came from the classic BPF that you may know if you use tcpdump or Wireshark to filter network packets. eBPF is an in-kernel virtual machine, which is designed to safely execute code in the kernel. This code is loaded by user space, and may be loaded by unprivileged processes as well. It is already widely used in the kernel: for example, for network filtering, seccomp-bpf uses it internally, and for tracing, debugging, and a lot of other stuff.
There are two main features provided by eBPF which are really important and useful for Landlock. The first one is the ability to call, well, not arbitrary functions, but dedicated functions, which are designed to be run by a specific type of eBPF program. There are multiple types of eBPF programs: there may be one for filtering network packets, one for debugging, and so on. So there is also one for Landlock. The other really important feature is a new kind of IPC, which is called an eBPF map. It is a way to store data which may be accessed from user space and from any eBPF program. It is a kind of table, with a set of keys which are tied to a set of values.
And of course, because eBPF is meant to load and run code at the kernel level, there are some security implications, so there are some safeguards too. The main one is static program verification at load time. Whenever user space loads a program in the kernel, the kernel checks a lot of things that the eBPF program may do, like, for example, accessing memory. The kernel follows what the eBPF program wants to do with this memory, so it tracks the memory and does a lot of checks with that, which allows it, for example, to restrict pointer leaks from the kernel. And obviously there are execution flow restrictions, to only execute code from the eBPF program and not jump to an arbitrary address in the kernel. So that's what is used in Landlock to express an access control rule.
And what is used to enforce an access control rule is the LSM framework. LSM stands for Linux Security Modules. It is mainly an API to extend the kernel with code to enforce a security policy on user space. There are a lot of decision and enforcement points, which are scattered all over the kernel, just before the code that may use some kernel objects. For example, if you're trying to open a file, just before the open action there is a check to see whether the current process is allowed to go through this path. There are more than 200 hooks in the kernel, so it's quite complex. All of them are quite specific to the kernel code, and they may evolve with the kernel code as well. So it's not a stable API, and that is done on purpose.
With Landlock, the idea is to try to bring some part of the LSM features to user space, in a safe way, and accessible to unprivileged processes. So it's a bit different. For example, the idea of a hook in Landlock is to define a set of actions on specific kernel objects, like, for example, opening a path, walking through a file path, or even opening a network connection. Landlock programs are eBPF programs typed for Landlock. They are designed to check some accesses related to a specific Landlock hook. And triggers are meant to only run a program when it is designed to be run, for example to only check read, write, or execute actions, and so on.
Now let's take a look at how we can write a Landlock rule, so a set of Landlock programs. It's mainly eBPF. The main way is to write the eBPF program in C. Actually it's not really C, it's a subset of C, but it compiles with Clang, which is really useful. So you write your eBPF program in C, you compile it, and you get eBPF bytecode. Then you can embed this bytecode in your application. When this application is run, it is then able to load the eBPF program in the kernel. So let's take an example.
The example I will follow for the next slides has two goals. It is a really common use case to be able to access files in a read-only way or in a read-and-write way. But it's a bit more than only paths: it goes through file descriptors. Then we want this security policy to be enforced on a set of processes. And this security policy may be updated at runtime. You can go check the code, I just uploaded it on this website, but I will explain some parts of it right now.
To be able to enforce security on a whitelist of file hierarchies, you first need to create a whitelist. And actually, in the kernel, there was no way to express this as needed for Landlock. The goal is to be able to list a set of files, but there are some challenges here. First, obviously, we want this to be efficient, because there are a lot of access requests performed at the same time by a lot of processes. We want this whitelist to be updatable by user space at runtime, to be able to add more files, more restrictions, or to remove them if you want to. And last but not least, we want this whitelist to be usable by unprivileged processes as well. So, for example, we cannot use extended attributes from a file system, because extended attributes can only be written by someone who is allowed to write to these files. If you are in a read-only environment, you will not be able to write extended attributes, so that's a problem. Another constraint is that if you want to run multiple security policies concurrently, you want to compose them, so you need to be able to identify a file, or file hierarchies, for multiple security policies which may not be run by the same user. So you need to be able to have multiple identifiers for the same files.
So the idea is to create a new dedicated eBPF map. There are already multiple eBPF map types for multiple use cases.
This one takes as a key an inode, which is not only the inode number but the device as well. So it's an inode from the kernel's point of view. It is designated with a file descriptor. And the value is a 64-bit value which may be used as a tag. So it is a kind of extended attribute, independent from the file system.
With this kind of map, we are able to identify a file, but not a file hierarchy. Yet what is quite useful is to be able to say: OK, this directory, I want it to be available in a read-only way, without listing every file in it. To express a file hierarchy, there is another feature of Landlock. Landlock uses eBPF programs, but it can chain them in a sequence. And there are multiple Landlock program types as well. The first one here is called FS walk. It is called for every step of the walk through a file path, so for every directory you go through, this program will be executed. In this example, I chained an FS pick program with the FS walk one. The FS pick program is dedicated to evaluating whether to allow access to the leaf of a file path, so usually a file. You can then add some triggers to only run this program for a subset of actions: in this example, open, changing directory, and getting attributes. So I created a subset of triggers which are related to read-only actions. For write actions, I created another FS pick program, which is also triggered when accessing the leaf of a file path. This one will be triggered for write kinds of operations, like writing to a file, or linking and unlinking, and so on.
So how does this work? This set of programs is able to share a common value, which may be used to keep a state across the execution of multiple programs. Let's see an example. Let's say we are in a container running a web server, and the web server wants to access a file, which is /public/web/index.html.
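Before following that access through the programs, the inode-keyed map described above can be approximated in user space. The Python sketch below is only an illustration under stated assumptions, not the kernel eBPF map: the `InodeMap` class, the `inode_key` helper, and the tag values are hypothetical names chosen for this example, and the device and inode number come from `os.stat`.

```python
import os
import tempfile

# Hypothetical tag values; they only mean something to the programs reading them.
READ_ONLY = 1
READ_WRITE = 2


def inode_key(path):
    """Identify a file by device and inode number, not by its path string."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


class InodeMap:
    """User-space stand-in for a map from an inode to a 64-bit tag."""

    def __init__(self):
        self._tags = {}

    def update(self, path, tag):
        if not 0 <= tag < 2**64:
            raise ValueError("tag must fit in 64 bits")
        self._tags[inode_key(path)] = tag

    def delete(self, path):
        self._tags.pop(inode_key(path), None)

    def lookup(self, path):
        """Return the tag of the inode behind path, or None if it is not listed."""
        return self._tags.get(inode_key(path))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        public = os.path.join(root, "public")
        os.mkdir(public)
        whitelist = InodeMap()
        whitelist.update(public, READ_ONLY)
        # Another spelling of the same directory resolves to the same inode.
        print(whitelist.lookup(os.path.join(public, ".")) == READ_ONLY)
        # The map can be updated at runtime: entries can be removed again.
        whitelist.delete(public)
        print(whitelist.lookup(public) is None)
```

Because the key is the inode and not the path, the tag does not depend on the file system supporting extended attributes, and several independent maps can tag the same file differently.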
The first execution of the FS walk program will be to allow or deny access to the first element of this path, which is the slash, the root of the file system. At the bottom you can see the map, which was previously loaded and set with three keys: /etc, /public, and /tmp. Each of these keys is tied to a value, which only makes sense for the programs you wrote. In this case, I decided that the value is one for read access and two for read and write access. So this FS walk program, when executed, looks for the inode of the file in this map. The slash is not in it, so it just allows the walk to go through. Then the FS walk program is executed again and looks at which inode is requested. The inode of public is in the map, so the program knows that it is allowed to be used in a read-only way. This program can then write in its context cookie that, OK, I saw a directory which is allowed. So for the next executions, it remembers that we are in an identified file hierarchy. The next execution sees the web directory. The web directory is not in the map, but that's not a problem, because we know that we are in a specific file hierarchy. At the end, when we reach the leaf of the path, index.html is not in the map. However, we still know that we are in the file hierarchy which is tied to /public. And then the FS pick program, which is tied to the open action, can allow this action to be performed.
So let's take a look at how we can write this kind of BPF programs, Landlock programs. First, we need to define some metadata, some properties. Like I said, we need to tie a Landlock program to a Landlock hook. In this case, it is an FS pick hook. Then we need to chain it. In this case, it is the third program, which is an FS pick, and it is tied to the previous one.
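The walk through /public/web/index.html described above can be replayed with a small user-space simulation. This is a sketch under stated assumptions, not Landlock code: it is written in Python rather than eBPF, path strings stand in for inodes, and the names `fs_walk`, `fs_pick`, `components`, and `check_access` are hypothetical. It also ignores details such as `..` and symbolic links.

```python
READ_ONLY = 1   # tag meaning "read access", as in the example
READ_WRITE = 2  # tag meaning "read and write access"

# Map content from the example. The real map is keyed by inode;
# path strings stand in for inodes in this simulation.
INODE_MAP = {"/etc": READ_ONLY, "/public": READ_ONLY, "/tmp": READ_WRITE}


def fs_walk(cookie, inode):
    """Run for each directory walked through; returns the updated cookie."""
    # If the directory is listed, remember its tag; otherwise keep the state.
    return INODE_MAP.get(inode, cookie)


def fs_pick(cookie, inode, want_write):
    """Run on the leaf of the path; decides whether the access is allowed."""
    tag = INODE_MAP.get(inode, cookie)
    if want_write:
        return tag == READ_WRITE
    return tag in (READ_ONLY, READ_WRITE)


def components(path):
    """Yield '/', '/public', '/public/web', ... for an absolute path."""
    yield "/"
    current = ""
    for part in (p for p in path.split("/") if p):
        current += "/" + part
        yield current


def check_access(path, want_write=False):
    *directories, leaf = list(components(path))
    cookie = 0  # no whitelisted hierarchy seen yet
    for directory in directories:
        cookie = fs_walk(cookie, directory)
    return fs_pick(cookie, leaf, want_write)


if __name__ == "__main__":
    print(check_access("/public/web/index.html"))                   # True
    print(check_access("/public/web/index.html", want_write=True))  # False
    print(check_access("/tmp/upload.bin", want_write=True))         # True
    print(check_access("/home/alice/notes.txt"))                    # False
```

The cookie is what lets a single map entry for /public cover everything below it: neither web nor index.html is in the map, yet the state recorded while walking through /public is still available when the leaf is evaluated.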
From the user space point of view, you identify a Landlock program with a file descriptor. You can then attach some triggers to this Landlock program: in this case, append and other write-related triggers. All of these are dedicated to write actions, because that is the kind of action I wanted to express.
Then we need to write the actual BPF code. What is shown here is a snippet of the actual code; you can see the whole code on the website. The BPF program here is kind of a main function, and it takes a context as an argument. In this context, there are mainly two values. The first one is a cookie: you can use this value to keep a state between different program executions. The second one is the inode, which you can then look up to see whether it is in a map or not. Here, the first instruction just copies the cookie. The second one calls a function. This function basically looks at the map: if the inode passed as its argument is in the map, it returns the tag, the value of the inode in the map. Otherwise, it just keeps the tag recorded in the cookie. If the cookie contains a tag which allows writing, which is here an arbitrary value, I think something like two, then the program allows the action. Otherwise, the action is denied.
Once you have written your Landlock program and the Landlock metadata, you need to load them in the kernel. For this, we use the bpf syscall, with the program load command and some attributes containing mainly the BPF bytecode, the BPF program type, which is here a Landlock hook, and then the metadata, which are specific to Landlock, that we saw just before. So the process calls bpf and loads the program in the kernel. Then, to apply this program to the current process, we use the seccomp syscall. But we do not enforce a seccomp filter, we enforce a Landlock program. So here there is a new seccomp syscall command, which is to prepend a Landlock program.
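The program body and metadata described above (a context holding a cookie and an inode, a lookup that falls back to the cookie, and triggers that select the actions a program is run for) can be modelled in a few lines. This is a hedged Python sketch of the logic only, not the C code shown in the talk and not something that can be loaded with the bpf syscall. Every name in it (`Context`, `Program`, `lookup_tag`, `fs_pick_write_body`, `run`, the trigger strings other than "append") is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, FrozenSet, Optional, Tuple

Inode = Tuple[int, int]  # (device, inode number)

TAG_READ_ONLY = 1
TAG_READ_WRITE = 2  # arbitrary value meaning "write allowed"


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"


@dataclass
class Context:
    cookie: int   # state kept between program executions
    inode: Inode  # object the access is requested on


def lookup_tag(inode_map: Dict[Inode, int], inode: Inode, current_tag: int) -> int:
    """Return the tag stored for inode, or keep the tag recorded so far."""
    return inode_map.get(inode, current_tag)


def fs_pick_write_body(ctx: Context, inode_map: Dict[Inode, int]) -> Decision:
    cookie = ctx.cookie                                 # copy the cookie
    cookie = lookup_tag(inode_map, ctx.inode, cookie)   # call the lookup function
    return Decision.ALLOW if cookie == TAG_READ_WRITE else Decision.DENY


@dataclass
class Program:
    hook: str                            # e.g. "fs_pick"
    triggers: FrozenSet[str]             # actions this program is run for
    body: Callable[[Context, Dict[Inode, int]], Decision]
    previous: Optional["Program"] = None  # program it is chained to


def run(program: Program, action: str, ctx: Context,
        inode_map: Dict[Inode, int]) -> Optional[Decision]:
    """Run the program only if the action matches one of its triggers."""
    if action not in program.triggers:
        return None  # this program is not consulted for that action
    return program.body(ctx, inode_map)


if __name__ == "__main__":
    inode_map = {(8, 100): TAG_READ_WRITE}
    write_pick = Program("fs_pick", frozenset({"append", "write"}), fs_pick_write_body)
    print(run(write_pick, "append", Context(TAG_READ_ONLY, (8, 200)), inode_map))
    print(run(write_pick, "write", Context(0, (8, 100)), inode_map))
```

Keeping the triggers in the metadata rather than in the body means the write program never runs for read-only actions, which are left to the program chained before it.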
As an argument, we pass the Landlock program, so a chain of programs. We loaded the BPF program in the kernel, and then the process applies its own program to itself. This process may be a container manager, for example. Then, for every action that this process requests from the kernel, the workflow goes through the syscall entry point and reaches the LSM hooks, which then call the Landlock hooks, which are tied to a set of Landlock programs designed to be triggered for such an action. These programs can then allow or deny the request before it is forwarded to the dedicated code.
This enforcement is done for the current process, but not only for this one. Let's see another example. Here we have a first process, P1, which is about to fork and create a new one, P2. If P1 wants to enforce a security policy on itself, it can, and the new children of this process will then inherit this security policy as well. It's exactly the same way seccomp works with filters. And if the P3 process wants to enforce its own security policy on itself, it can too. For example, if you're running a web server in a container, your container manager, as P1, can enforce a security policy for the whole container, and P3, the web server, can also enforce its own security policy on itself. Or even, if you want to run a container inside a container, you could: you can enforce nested security policies. And obviously, the P4 process inherits the P3 security policy, and the P1 one as well.
Now, let's see a demo. It's a really simple one. There are two kinds of whitelists, which are configured by user space. The first one is a set of file hierarchies which will be available in a read-only way: for a web server, there is /public, /etc, and so on. And there is another whitelist of what is allowed to be written to.
That is /tmp, to be able to write some temporary files. But for the example, I also added the standard I/O, so stdout, which is identified with /proc/self/fd/1. So let's see that. Let's say we are running as a web server user. We are in the public directory, and we can list the files there. There is some junk here, that's not important. We can write some files: here, for example, I write the file index.html. That's a common use case. So let's say it was the root user who configured the web server.
Then we want to launch, let's say, a web server. In this case it's a shell, but that makes no difference. There are two variables. The first one is an environment variable listing the paths which should be available in a read-only way; you can see /etc, /usr, and /public. The other variable is called LL_PATH_RW, for the read-write paths, which includes the standard I/O, /tmp, and so on. This Landlock program is a sample that you can find in the code. You can see it as a first sandboxer, or a really light container manager. It will run /bin/bash, the shell.
So here we are in the new environment. It is the same namespace, but with the Landlock rules applied. We are in the same directory, and you can still list the files inside. But as you can see, some accesses are denied. You can see that the dot-dot is denied. Dot-dot points to the slash directory, and indeed, the slash directory was not allowed to be accessed in a read-only or read-write way, so we cannot get any info about this directory. However, we can get info about the rest. And of course, if I want to write some files in this directory, it is denied, because it is only accessible in a read-only way. If we go to the /tmp directory, we can list what's inside, and we can still create files, as the security policy says. In the same way, if there is some private directory, we will not be able to see it.
And just to finish, we cannot go to the root either. So if we can enforce this on the root, obviously we can enforce this on arbitrary directories.
To wrap up, Landlock is dedicated to hardening user space in a programmatic and embeddable way: you can program your own access control rules as you wish, and embed them in your applications, for a set of processes, independently from the host. One thing I didn't demo here is the ability to update the whitelist at runtime, which may be really handy. And it is designed for unprivileged use, so any user may use it. Currently, I made some small patches in the kernel just to prepare the landing of Landlock. It's about 2,000 lines of code, most of them in the security/landlock directory of the Linux source code. You can follow the next version, the v8 patches, on the Linux kernel mailing list or on Twitter, and I will release it in the following weeks. You can get some more information on this website. Thank you. Any questions?
Can this only be used to filter authorization to the file system? Or can you also change the UID? What I have in mind is to be able, in a container, to have some user ID and to actually access the file system with another user ID, which means that you actually make some change to the UID, not just restrict... Could I do this?
Yeah. Landlock is dedicated to enforcing restrictions. It is a new layer of security on top of the existing ones. So if there is an existing security policy, or even the DAC security policy, you cannot change it and you cannot map one ID to another. For this, you may want to use user namespaces.
It's going to be a bit difficult to take any other question, as half the room is changing already. It looks like there are none. Well, thank you very much.