Hello everyone, my name is Deven Bowers. I'm a software engineer at Microsoft, where I worked on code integrity systems within Windows for four years, and I've been working on code integrity systems within Linux for the past two. I'm here to talk about our new LSM and the last year of what we've done — what kind of prototypes we've added — and a feature we're particularly proud of called integrity namespaces.

So real quick, what is IPE? IPE is a proposed LSM. It's not yet upstream, though the patches are posted publicly, and it seeks to address code integrity for fixed-function devices. Now, that's a little bit of a misnomer: it actually seeks to provide a trust-based access control solution, as opposed to solving code integrity in and of itself. Trust-based access control here means: we prove the integrity of a file, and then the policy authorizes what is allowed to run based on that integrity claim. And what do I mean by fixed-function devices? Fixed-function devices essentially only do one thing. An example that's going to come up a lot during this presentation is the IoT device, or a container host. If you black-box the containers themselves, all a container host does is deploy, run, and report on containers — one big mechanism of container management. Other examples include IoT devices, as I mentioned. I recently bought a washer which, much to my chagrin, is connected to the Internet of Things, so its security is probably terrible. My washer only needs to wash my clothes, dry them, and maybe emit a beep after the cycle is done. It doesn't need to run Doom, even though I'm sure somebody will find a way to make it run Doom. IPE is really designed around these scenarios, to ensure the promise that the device continues to only run that one thing.

This presentation is not a walkthrough of all of IPE. If you're interested in the policy constructs, how it works under the covers, things like that, I gave a presentation at LSS last year that you can look up online; it should answer those questions for you.

So real quick, it's been a pretty good year for IPE. We've added a prototype for FS Verity integration that allows full parity with what we have for dm-verity in IPE's public postings: you can allow anything signed via FS Verity's built-in signatures, and you can allow or revoke based on a specific digest as produced by the fsverity utilities. Additionally, we added support for forcing integrity verification on specific file reads. The way this works right now is that you identify a file an application opens by a specific path from user mode. Say I know my sshd server opens /etc/ssh/sshd_config — I've seen the open syscall for it. You can then write an IPE policy that says: if anyone tries to open /etc/ssh/sshd_config for read, make sure it is an integrity-verified object. In that sense we've managed to accomplish something for legacy applications that you can't really recompile, but it's a relatively fragile system.
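As a rough sketch, a policy combining these pieces might look like the following — the EXECUTE rules follow the grammar from our public postings, while the READ rule is hypothetical, since the syntax for that prototype isn't settled:

    policy_name=Example policy_version=0.0.1
    DEFAULT action=DENY

    # Trust executables on a signed dm-verity volume, or carrying an
    # FS Verity built-in signature.
    op=EXECUTE dmverity_signature=TRUE action=ALLOW
    op=EXECUTE fsverity_signature=TRUE action=ALLOW

    # Hypothetical rule for the file-read prototype described above:
    # reads of sshd's config must be integrity-verified.
    op=READ path=/etc/ssh/sshd_config fsverity_signature=TRUE action=ALLOW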
That fragility is why there's an "in progress" next to that item: we want to make it more robust, and the intended way of doing that is the trusted_for patch set, which was recently posted. It adds a new way to ask: is this file descriptor trusted, according to the trust-based access control on the system — the integrity claims, whether through IPE or IMA? Am I allowed to open this and treat it as trusted?

And finally, the last big one is namespaces, which is a little bit of a misnomer. These aren't namespaces in the traditional sense that you'd create with the clone or unshare syscalls, where you get a namespace primitive like mount, PID, or user. It's better thought of as a variable policy, a contextual policy: a new definition of trust created for a process tree. So it's a little different, and it's probably an inaccurate name at this point.

Some motivation for why we wanted to build this kind of variable trust context: we want to increase IPE's flexibility. The locked-down systems IPE was originally proposed for are pretty rare, because they impose a ton of requirements on the people building them. The whole system must be designed around trust-based access control: the update story needs to be reworked, the build systems need to be set up correctly for code signing, and you might have to maintain multiple PKI chains. It becomes a resourcing nightmare, which essentially means only the really big companies with hundreds of engineers on a project can get these systems working. For that reason they're pretty rare, and we want to extend IPE's applicability to other kinds of systems.

Additionally, some execution contexts are designed to be unrestricted, just by the function of the device. As I mentioned, container hosts are the example we're working through for this entire presentation. A container host creates a sandbox that protects the rest of the system from attack, isolating a small fragment of the system while sharing the same kernel. We want to make IPE applicable to that scenario. Since the whole kernel is shared, if we used a normal IPE policy, all the containers would be subject to it. That's not usually a great thing, especially if you're running other people's containers, as a Kubernetes node does. But you still want that kind of security policy for the host, so that no one can just log in and execute whatever they want. So we need a way to make sure that execution contexts owned by other individuals — contexts that are properly sandboxed and protected — aren't subject to the host's policy but still have a way to use IPE and gain some benefit.

Which brings me to my next point, on applicability: trust-based access control — proving the code integrity of your executable resources — is useful even if it only covers one process. You still get some security benefit for that one process, and it's better to make small steps toward a big security goal than to not adopt the goal at all because it's too much work in the first place.
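To make the trusted_for idea concrete, here's a minimal userspace sketch of how such a check could be consumed. The enum and the syscall shape follow the publicly posted proposal, the syscall number is just a placeholder, and none of this is a settled ABI:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Placeholder: the real number depends on the kernel's syscall table. */
    #ifndef __NR_trusted_for
    #define __NR_trusted_for 448
    #endif

    /* Usage value from the posted proposal; illustrative only. */
    enum trusted_for_usage { TRUSTED_FOR_EXECUTION = 1 };

    int main(void)
    {
        int fd = open("/etc/ssh/sshd_config", O_RDONLY);
        if (fd < 0)
            return 1;

        /* Ask the kernel: do the system's trust mechanisms (IPE, IMA, ...)
         * vouch for this file descriptor before we consume its contents? */
        if (syscall(__NR_trusted_for, fd, TRUSTED_FOR_EXECUTION, 0) != 0) {
            fprintf(stderr, "sshd_config failed integrity verification\n");
            close(fd);
            return 1;
        }

        /* ... safe to parse the configuration here ... */
        close(fd);
        return 0;
    }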
Finally, there's the isolation principle. This is really good for least privilege: since a trust context, a trust policy, applies to a process tree, we can now define least-privilege-style rules on our processes. My sshd process shouldn't need to load kernel modules — it's just a remote connection for a headless terminal or something similar. So I can say: sshd, you're allowed to execute things, but I'm not allowing you to load kernel modules. That reduces sshd's attack surface dramatically — it can't load a potentially vulnerable kernel module, or one that was signed previously but now has an exploit available.

So this is the classic fork/exec diagram. I hope everyone's seen it before; it's a pretty basic thing on Linux, and it's basically how the IPE namespace works from an interface perspective. As a consumer, what do I need to do to set up a namespace? Well, fork/exec is a pretty common pattern, so we plugged into it immediately: after you call fork and you've created the new process — the start of the new process tree — there's a file you write in securityfs, since we're securityfs-based, as is typical for an LSM. All you need to do is write the setup for your namespace, your trust policy. You define some metadata associated with the context — a name or an ID for the context itself — whether the context starts in enforce or permissive mode, and what the active policy for that trust context should start as. After that, you close the file, as you would with any open syscall, and once that's done, your variable context is created. Then you follow the rest of the classic fork/exec flow to spin off your new process. After the last process in the tree calls exit, the namespace — the trust context — is cleaned up for you, the calling application, without you needing to do anything. It's easy, simple, low-overhead, and fits nicely into this model.

So real quick, I have a demo here. This is a real system with IPE enabled on a real distro — Arch Linux, where I've applied the IPE patch set to a full kernel build; I think it's about two versions out of date now. It's a two-stage policy where we bootstrap trust: we start by trusting only the initramfs, and the initramfs trusts the rootfs partition, which is a dm-verity volume. That's why you see this error from fsck — fsck expects a read-write drive, and it's read-only because of dm-verity. If we look at dmesg, we can see we loaded the OS policy, which is the initramfs policy that expands trust to the rootfs. You can also see that BPF was blocked, because it did something that didn't match IPE's view of what it should be allowed to do. If we look at the policy that was loaded in the initramfs, we see everything we expect: we're allowed to make namespaces, or contexts; we trust FS Verity signatures for what I need in the demo — my custom version of runc; then the dm-verity root hash, which is our root filesystem; and then the initramfs itself. We can also see that IPE is in enforce mode, which confirms exactly what we saw in the dmesg output.
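To make that calling flow concrete, here's a minimal sketch in C of what a consumer would do. The securityfs path and the key=value fields are assumptions based on the demo, not a final interface:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: set up the new trust context before exec'ing. */
            const char *setup = "id=12345 name=NS2 "   /* context metadata */
                                "enforce=1 "           /* start in enforce */
                                "policy=OS-Policy";    /* initial policy   */
            int fd = open("/sys/kernel/security/ipe/new_ns", O_WRONLY);
            if (fd < 0 || write(fd, setup, strlen(setup)) < 0) {
                perror("ipe namespace setup");
                _exit(1);
            }
            close(fd);  /* the context goes live once the write finishes */
            execl("/bin/bash", "bash", (char *)NULL);
            _exit(127); /* exec failed */
        }
        waitpid(pid, NULL, 0);  /* last exit tears the context down */
        return 0;
    }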
So if I try to run runc with this container I've added, we get permission denied, as we expect, because it doesn't match any of our integrity policy requirements: it's on a writable drive, it's not the rootfs, it's not the initramfs, and it's not FS Verity signed. Everything's blocked, exactly as we fully expect — even if we change into the directory, no dice; you can see from the path that it's blocked. Then we chroot into the rootfs and try to run steam locomotive: same deal, exactly what we expect. That's kind of boring, right? It just shows we've had no regressions from our previous behavior.

So now we do something slightly different with the namespace, still on the same page: we're still blocked even if we create a namespace with the currently active policy. We have our OS policy, which is exactly what we want. We spin off a new process tree via bash. We set our metadata — an ID of 12345, with the extra metadata name NS2 — start it with the OS policy, and then all we do is finish that write to the securityfs new-namespace node. That creates our new context, and when we chroot, it's still going to fail, because obviously we're running on the same policy — but we're in a brand-new namespace, which you can see by the name NS2 right there. We can even see that the policy was started as the OS policy, and so on. Now, if I manage to type "exit", we should see no namespaces whatsoever except the root namespace in the namespace listing. And lo and behold, it's cleaned up automatically for everybody.

So what happened there exactly? This diagram follows that whole example through. When we ran bash, that created a new schedulable task in the kernel. There's a convenient security hook that's called whenever a new task is created, and we just say: inherit the current namespace immediately. We assign our little LSM blob memory region to point at whatever the parent task's namespace is, and return back to fork. When they call write — which is the brand-new part of our flow — we check the restrictions on the current namespace: are we allowed to create namespaces of our own? Are we required to have a policy? Things of that nature. If that all succeeds, we set it up: we allocate and do all the fancy stuff we need to make sure we'll start in a good state. After that, all we do is swap the two regions of memory on the task, so the current task now points at the slightly different namespace we've already set up and made active, and we return from write. When you call exec, we just check against the task's active policy — in that context, that namespace, I was talking about earlier. And when something calls exit, we decrement the reference count; when it reaches zero, there are no more processes referencing it, and we schedule the free. Pretty easy.
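As a rough sketch of that kernel-side flow — the type and function names here are illustrative, not the actual patch set:

    /* One trust context per process tree, hung off the task's LSM blob. */
    struct ipe_ns {
            struct ipe_ns *parent;
            struct ipe_policy *active_policy;
            refcount_t refcount;
    };

    /* fork: the task_alloc hook makes the child inherit its parent's ns. */
    static int ipe_task_alloc(struct task_struct *task,
                              unsigned long clone_flags)
    {
            struct ipe_ns *ns = ipe_task_blob(current)->ns;

            refcount_inc(&ns->refcount);
            ipe_task_blob(task)->ns = ns;  /* just point at the parent's */
            return 0;
    }

    /* write to securityfs: validate, build the new ns, swap it in. */
    static int ipe_ns_create(struct ipe_ns *parent, struct ipe_policy *pol)
    {
            if (!ipe_ns_may_create(parent))  /* e.g. children forbidden */
                    return -EPERM;
            /* ... allocate, copy restrictions, set the initial policy,
             * then swap the task's blob pointer to the new ns ... */
            return 0;
    }

    /* exec checks the task's active policy; exit drops the reference. */
    static void ipe_task_free(struct task_struct *task)
    {
            struct ipe_ns *ns = ipe_task_blob(task)->ns;

            if (refcount_dec_and_test(&ns->refcount))
                    ipe_ns_schedule_free(ns);  /* last user gone */
    }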
So, a little less contrived, a little less arcane — let's walk through a more concrete example of how something could be set up using this. As I said, we'll use a container or Kubernetes host scenario. The host is still fixed function: it deploys and runs containers, and we're just black-boxing what the containers do. The system might ultimately look something like this: you've got systemd and Moby — which is Docker, for those of you who don't know about that split — and a setup where every container namespace has an allow-all policy, while the specific policy for the host partition is allow-host-only: children may be created, but each child is required to have a policy. This environment comes with a few requirements. Workloads are unsigned and isolated from the host, but the host itself is immutable, and it's high-uptime and perf-sensitive — in the cloud wars, one percentage point of perf loses you customers. And finally, the big one: the host owner is not the same as the container owner; the host owner has no contact with the container owner and vice versa. So it's important that we accommodate the least restrictive needs by default, while still letting a container owner with the knowledge and technical expertise set up exactly the namespace they need for their container.

So on that same system from before, we log in and load a more restrictive policy, because we want to trust only the host. I've called this first, more restrictive policy ipe-restrict, very intuitively. We load it into the kernel by catting it to the new_policy node, which adds it to our current context. If we look at the policies node in the IPE securityfs tree: there's that host-lockdown policy. Now we make it active, which is done by writing 1 to the policy's "active" node. First, though, let's look at the policy in full: namespaces are only allowed as children, they're required to have a policy, they're required to be enforcing, and we only allow the necessary signatures — because I needed a custom version of runc — plus this specific dm-verity root hash, which is our specific rootfs. So if I echo 1 to active, that policy becomes active and we're now enforcing it. This is kind of boring because nothing really changes about the execution state. We can see it was loaded and then activated: that 1800 event is the load — something was loaded into the kernel — and the 1802 event says this was made the active, enforcing policy.

Next I run my enable-fsverity script, which loads my key into the FS Verity keyring — it's a slightly different keyring from dm-verity, which uses the system's built-in keyring — and then I write the namespace policy that I'm going to use for these container namespaces, so it's available when I set up the namespace itself. If we look at the content of this policy for our children, it just allows everything — but you're not allowed to create namespaces of your own, because I don't trust you that much. Then we quickly go over to /var/demo and try to run runc — my custom version of runc. This is really runc, nothing sketchy; you can see the commit, v1.0.0-dirty, the real help text, and so on.
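For reference, the child policy I just described — allow everything, but no nested namespaces — might look roughly like this. The DEFAULT line follows the posted grammar, while the namespace-control line is hypothetical syntax for the prototype:

    policy_name=NS-Policy policy_version=0.0.1
    DEFAULT action=ALLOW

    # Hypothetical control: processes under this context may not
    # create further IPE namespaces of their own.
    ns_create=DENY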
So if I try to run even the upstream version of runc, it fails — exactly as we expect, because it doesn't know how to create a new IPE namespace and so fails immediately. If I use my custom version of runc, we get into the container, because it created a namespace, and we can run steam locomotive. There it goes, right across the screen — a fantastic application if you need people to learn how to type ls correctly. If we look at the namespaces tree, we can see our new namespace, 98765; that's exactly what we're running. If we look at the policies within 98765, there's that NS policy, which is exactly what we loaded before. And when we exit that namespace, the last reference is dropped and it's freed — no more namespaces, cleaned up on behalf of the user, and we're good to go.

So what happened there exactly? We have our typical container stack here, starting from runc on the right. Inside runc there's that nice little fork/exec diagram we saw earlier. In this fork/exec, just like our proposed changes, runc calls fork, creates the namespaces — the official ones, like mount, PID, user — and then calls the security systems to build their own, more contextual policies to apply to the container: AppArmor, then seccomp (which isn't an LSM, but is a very similar product of similar security value), and then SELinux. We just slot IPE in right at the end to do the open, write, and close we mentioned previously. It's important that we go at the end, because we're changing what's allowed to be accessed — executed or read — and the other security systems might need to load something like libselinux that wouldn't be allowed under the new trust policy. So we want to apply IPE as late as possible, just before the container starts properly.

Now we're getting into more of the theory and design of how we created an LSM namespace. When we originally approached the design, there were two approaches we considered. The first was a general security namespace: something like a more official mount/user/PID namespace that lets multiple LSMs leverage the same framework and build something appropriate for their needs. This is a really good idea in theory, but it falls short in execution, because you can't solve any of the harder namespace questions on behalf of the LSMs. How do you reconcile two conflicting policies? If there's a super-policy inherited from above, like in the root namespace, how does it translate to everything in the children namespaces? Who has permission to view which policies in securityfs? It turns into a really complicated problem that isn't really solvable at the LSM layer as an abstraction. It essentially degenerates into LSM blobs — a reserved memory region per LSM, maybe with a more customized interface that ferries information to the correct LSM, but still just LSM blobs. So we abandoned that idea pretty quickly and went with LSM-specific namespaces, which we built for IPE. The concept was pretty natural to expand.
The old approach essentially already had an implicit context: it was for init — init being the first thing spun up on the system — and that was the only trust context we had. All we did was make that generic. It can be login, it can be a shell, it can be anything you want; you name it, it can be it. And it applies to a process tree: a process and all of its descendants, no matter how deep you go. In general, the principle is that this is just a new trust context for that process tree. If you want to isolate things, you should be using the other, far more official namespaces — mount, PID, user. We also wanted the policy to be able to both restrict and expand controls, which I'll get into on the next slide, and we needed the parent to be able to provision limited information about the children — because as soon as we jump into that namespace, we treat it as its own thing, and whoever set it up was trusted at the time they made that decision.

For IPE's namespace properties, we needed some flexibility; the existing namespaces enforce rigidity. Early in the design, one of the reviewers asked: can you attach this to the user namespace — does that make more sense? Privileged containers don't have a user namespace, so does it make sense for this to be a separate thing? There were two responses. First, a user namespace is a pretty heavyweight mechanism for this. A user namespace might work for a container host, though container hosts often run privileged containers, which don't have a user namespace: root is real root. But consider a multi-user system, where multiple people log in and every user gets a different view of what's allowed to be executed through their trust policy. It's pretty expensive to spin up a new user namespace on every login, and does that even make sense conceptually for the system? Second — this was also raised during my presentation at LSS this year — how does this make sense with privileged containers? Well, it's in the same vein as SELinux policy. Alexi also recommends that if you're going to use privileged containers, you should lock them down with a mandatory access control system. Privileged containers still exist, so this is just another way to further restrict and protect a privileged container: it can be started with a fixed policy that you know is going to run, mitigating some of the damage a privileged container — which has more power to effectively screw up your system — can do.

Additionally, requirements can change for namespaces, and our implementation should be conscious of that. For example, we might have a VM host node in the cloud that is responsible only for spinning up VMs, with a trust-based access control policy that says: I'm not allowed to execute anything I don't trust based on a code integrity claim. It's locked down by default — no debugging tools — but when something goes wrong, someone needs to be able to log in and diagnose what's going on.
As soon as that happens, they need access to debugging tools — unless you have logs a million miles long and strace on every syscall, that's how you figure out what's going on. So this poor soul logs in, and the integrity policy — the trust policy — needs to expand to include the debugging tools needed for diagnostics. And when they leave the node, it should restore the old integrity requirements. Another example is the container host we've been working with, so I'll breeze through it: the host is integrity-verified, the containers need to run regardless, so we can remove the integrity requirements inside the container — or tighten them, in the case of privileged containers, where you may want to lock things down even further. The final example is what I'd call "shared edu Linux", just off the top of my head: a bunch of students using the same computers, like a laptop cart or something similar. When you log in — especially at schools — it's completely restricted; you can't run any games or anything. 99% of students get the same level of application control and trust-based execution as before. But a computer science student needs a lot more: a compiler, an IDE, the ability to run their own code, which typically isn't signed. How do you do that? When I was in high school, they gave me a whole separate account — a very heavyweight solution — or a whole separate computer. Instead, why not create a different kind of execution context at login, so these students can execute exactly what they need to? In general, IPE's approach to all these changing requirements is to make the policy the driving force that enforces them. The policy can say things like: require a policy to be active, require everything to be enforced — no more permissive switching for you — and prevent additional namespaces from being created.

The next big topic in IPE's namespace properties is isolation. When you say the word "namespaces" — and as I've said, namespaces is probably a bad name for this feature — it implies isolation from other parts of the system, and an implicit hierarchy: everything belongs to a namespace implicitly, and things higher in the tree can see lower in the tree and have more power over them, so to speak. In a general isolated system, you might have something like this diagram: namespaces with various different policies, different applications, nesting, and so on. When we originally designed this, we wanted namespaces to always maintain a bond between a root namespace and a lower namespace, with the policy in the root namespace affecting all of its children continuously over time. This actually created a problem — a dependency graph between higher-level and lower-level namespaces — which is why we abandoned it.
There are a lot of edge cases: when a process dies in the middle, your integrity requirements suddenly change — what happens? So we abandoned that issue entirely and said: when you create a namespace, we provision the initial state and then leave it alone. Namespaces are now effectively completely isolated, separate instances; the hierarchy is only used for permission checking.

Then there are the interfaces. Ours are exposed through securityfs, and securityfs has one instance across the entire kernel: when you mount securityfs in a new place, it brings the whole tree with it, so everyone can see every securityfs node. In the case of privileged containers, you can see every single node. So we just assume that anything using our IPE namespaces can mount securityfs and see everything it cares about — and since that's true, we must assume everyone can see the namespaces. If we have a multi-tenant device — multiple users, say the US government and the Chinese government, if that were ever an actual thing, or Home Depot, Microsoft, Google, whatever, all running on the same host — we need to make sure each is kept separate from everyone else, because obviously you don't want one entity to be able to view into another entity's container, especially as part of a targeted attack. For that reason, two things. You can name your namespaces however you want; you just have to be aware that when you identify a namespace, there are going to be portions that leak. If you name all your containers after the tenants using them, you're implicitly leaking some information. So we created a division between identity and purpose in IPE's namespaces. There's the ID, which is exposed to everybody — everyone can see your ID, and it can be whatever you want — and there's a name on the inside of the namespace listing, which is more metadata: who am I, what is my purpose. The name itself provides no security benefit; the real protection of these interfaces is a permission handler on the directory which says — back to that concept of the hierarchy — only if you belong to a namespace higher than mine in my tree can you view my information. In our multi-tenant example, the root namespace — the owner of the host — can see into the Microsoft, Google, and Amazon containers, but those containers can't see into each other, because none of them is above the others in the tree. We divide these two concepts so that we can associate metadata and private data with namespaces while still having a unique primary key to identify each container. The result is that, if you use anonymized IDs, an attacker can only learn how many namespaces exist on the box — and maybe not even get that much of an information disclosure.
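As a sketch of that permission check — again with illustrative names — a viewer may look into a namespace's securityfs nodes only if the viewer's namespace is an ancestor of (or the same as) the target:

    static bool ipe_ns_may_view(const struct ipe_ns *viewer,
                                const struct ipe_ns *target)
    {
            const struct ipe_ns *cur;

            /* Walk from the target up toward the root; succeed if we
             * pass through the viewer's namespace on the way. */
            for (cur = target; cur; cur = cur->parent)
                    if (cur == viewer)
                            return true;
            return false;  /* siblings and descendants get -EPERM */
    }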
This implementation isn't without its flaws; I'll admit there are some serious issues with management. In general, when it comes to management and namespaces, there are two philosophies: a centralized and a decentralized management story. I favor centralized management, because if you design the system around a centralized point, updating becomes a lot easier and you have only one point of failure to investigate when something goes wrong. However, right now we have a problem: you're essentially trusting, completely blind, that the namespaces aren't lying about which namespace they're in. There's currently no extension that says: based on your process ID — based on who's contacting me — I can look up, from the root namespace, which namespace you belong to. That's a big flaw we're looking to solve eventually. You can also get around this with a decentralized management solution, but that comes with its own issues, like updating: you basically have to maintain a separate library for every single thing that wants to manage its own namespaces, with cascading updates, and it becomes a very complicated process very quickly.

As for future work on IPE: we want to be upstream eventually. This is going to be a long and arduous process, based on my postings so far. The patches are posted as of today, October 17th, and if anybody's interested in reviewing, please feel free to jump on and criticize me, boil me alive, whatever — do the Linux kernel maintainer thing. Next is support for targeting specific signers via policy, the next big feature we're going to tackle. In our view, the keyrings have a kind of cross-pollination issue: when you add something to a keyring and validate a signature against that keyring, the signature can match any key in it whatsoever. So if you have a kernel-module signer versus a user-mode signer, you have no way of knowing which key was matched, or of forcing the match to a particular set, other than creating a whole separate keyring. We want the policy to be able to say: use the system trusted keyring, but for kernel modules I only want to see this certificate as the signer; for standard execution workflows, only this certificate; and for these configuration files, only this one. The policy gets to drive a finer-grained selection of certificates. Additionally — I talked about this earlier — we want to support that trusted_for patch set. That's something we want to do once we're upstream and not the other way around: it's much better for the system as a whole to get upstream first, so everyone gets the goodness of the implementation, it's tightened and hardened, and everyone feels comfortable — or at least less bad — about it before you start changing other things in the kernel. The last two items I'll gloss over a little. First, hardened resistance against rollback attacks. IPE has a policy version — I mentioned this in last year's presentation — and we don't allow you to roll back to an earlier policy version, preventing rollback to vulnerable policies. But the kernel loses all state when you reboot, so the policy version resets to zero, giving a slight window of opportunity to load an insecure policy and potentially compromise the system. We want to investigate that and figure out how to close the gap. And the final big thing, which will come much, much later: we'd like to figure out how to verify eBPF programs.
We've done some initial work in that space to see how feasible it would be, with the loader rearranging instructions and all of that, but it's somewhere down the road. And on that note, I'd like to thank you for your time, and I'll take any questions.