All right. Next up we're going to have John Johansen presenting on LSM stacking and namespacing. All right. Thank you. So, as Stefan said, basically we want to make the LSM available to containers. Currently that's a real problem. So what is the LSM? It provides security in Linux; it's a kernel subsystem. You often see it associated with MACs, mandatory access control, when people talk about it, so SELinux and AppArmor, but it's actually also used by Yama, LoadPin, Tomoyo, and others. And then there's a whole bunch of other proposed LSMs out there that haven't been merged yet that want to do things other than MAC. The problem with the LSM right now is that you get one subsystem in the kernel, and it's shared by all your containers. That's not what we want for containers. And the problem is the LSM doesn't want to be namespaced. Not really. We can have multiple LSMs enforcing at the same time, and we'll talk about that in a second, but it's very limited. And the reason the LSM doesn't want to be namespaced is that it's providing security for your system, right? So if you bring up your container, say you're on Fedora or whatever and you have SELinux running, it actually has policy that's controlling and confining what that container can do, and it's protecting the system from the container. And that's the big reason they don't want a traditional "here's your little namespace off on the side". What's wanted by the LSM is to be able to have the system enforcing something and maybe have the container having its own. Not every LSM has the same requirements: Yama or LoadPin certainly don't have the same requirements as SELinux. So we can do simple things to have... We do have multiple LSMs. This landed in 4.2; it was landed by Casey. What it does is give us what we call a major LSM and minor LSMs. There's a limitation of a single major LSM, so you only get SELinux, or only AppArmor, or only Smack or Tomoyo.
And then you can have a minor one like LoadPin or Yama, anything that basically doesn't use the security blob. So the LSM infrastructure provides a service to the security modules: it provides a set of hooks on the internal data structures in the kernel, and along with those hooks there is a security blob that the module gets to use. Casey has been working on bringing us better LSM stacking. He's been doing this for a long time now, almost five years. The idea is to move the security blob management, which is the defining feature of a major versus a minor LSM, into the infrastructure provided by the LSM subsystem in the kernel. It's not quite so simple, though. Container LSMs can't change the LSM that the system has, right? So you're going to have to have something more than that. You can't change the policy rules of the system; you're going to have to do your own rules separately. All these LSMs, the way stacking is set up, are defined at boot. So say you did have major stacking working, and you bring up, say, AppArmor and SELinux: your host system has AppArmor and SELinux, not just the container. That's not really what you want to do. The other problem right now is app containers. We've been talking mostly about system containers, but there are app containers that have started using LSM policy. Snappy does it. And it would be nice to be able to make it available to other system or app containers as well, I should say. The problem is the kernel infrastructure was never designed for sharing, and this has made fixing stacking issues really hard for Casey. So we start with the user interface problems, right? The interfaces provided for the LSM were never planned for this. We have the /proc/<pid>/attr/* stuff, and securityfs. Lots of services actually use these, like if you do ps -Z to look at what your confinement is.
And so how we have to handle this is we basically virtualize them. We give them a per-task definition and say, well, you're under this LSM, even though multiple LSMs are running. And that works okay. So for the interface, we set a default: when you boot up, you get a default LSM, so it looks like maybe you booted up with only SELinux, but you maybe have AppArmor and SELinux at the same time. And then new interface versions are added. These can be used, but they have to be picked up: you have to extend your user-space applications. So for ps -Z to know about the other confinement besides the default one that we're virtualizing, you have to update it to be able to read the new interfaces and use them. So libselinux needs updating, libapparmor needs updating, the ps tools, and several other tools like top. That's not too bad; we can get there. We're working on it. Networking. Networking is the real problem. There are these awful things called secids in the kernel. They're a 32-bit number, they have no lifetime, they're exposed to user space, and really poorly exposed to user space at that. They're all we can get into the network subsystem; the subsystem maintainers won't give us a full pointer or anything to work with. They're used by the audit subsystem. They're global. None of this really works for containers. I mean, say you're booting an SELinux system and you want to run a container with SELinux in it: the container and the host are going to share that same number, and that's not what you want, because the networking subsystem can't distinguish between them. So what Casey's been working on here, and this is a work in progress, is virtualizing the whole secid. So there is still a single global secid, but the infrastructure manages virtualizing the per-module secids and combining them together, finding a new one. It's kind of ugly, but it's about all we can do.
So we have to remap these at runtime, and we still have lifetime issues with them that have to be resolved, and that's going to take quite a bit of work to fix in the kernel networking stack. Another one is SECMARK. I don't know if you know what SECMARK is; if you've written iptables rules for SELinux, you will. The problem is SECMARK is exposed to user space, and it was never really planned for having multiple LSMs at the same time. The way the networking is set up right now, you get a user namespace and your network namespace is tied to a user namespace, but the secid itself is actually global and doesn't fit in with this model, nor does the SECMARK being converted to the secid. So there's some work there that still has to be figured out. But unless you have two LSMs that are trying to compete on the SECMARKs at the moment, we can kind of work around it. Then the NetLabel stuff. We're talking CIPSO, CALIPSO, XFRM. They weren't designed for this either, and it's a huge problem. So we have a few potential solutions, one of them being you only get the label if they both agree. So if you have something stacked, I don't care, Smack and SELinux, and they're both trying to label a packet, the packet will only actually go out if the labeling from the two is the same. Otherwise, you could also have a fallback and say it works as long as only one of the LSMs is trying to label it. Beyond that, there's not a lot of space to actually handle and combine things on these. Thankfully, they're mostly avoided. LSM stacking is not namespacing, however. It's a good start, but it's still not providing what containers need, right? So like I said, say you want AppArmor and SELinux available and you bring up your system: they're both booted, on the system, on the host. That's not what you want. So what we need to do is extend LSM stacking beyond the current situation.
Maybe we bring up all three, and the system has those and the containers have those. That's no good, right? Even worse, none of the LSMs were originally written with namespaces in mind. IMA, they've been discussing it, and they have some patches where they're starting to provide some namespacing for their audit and a bit of their infrastructure. Smack has some patches to provide labels to namespaces, which is an important part of the work, but it actually isn't namespacing Smack itself. They do have a little-used thing called preprocess rules that Casey thinks they could use for namespacing, but he's not working at that level yet; he's still working on the stacking end of things. SELinux has begun the discussions about namespacing itself; they sent some stuff out to the mailing list in October. Where they're really at right now is they have a lot of global state. If we're going to containerize this, that global state has to be moved off and set up so that they can have namespaces, and that's mostly what they're working on at the moment. The audit system is still a work in progress for namespacing as well, and that's going to be needed because SELinux, AppArmor and Smack all call into the audit subsystem. AppArmor is the exception. It has its own policy namespaces, it does virtualize its interfaces, and it has an internal stacking separate from the LSM stacking we've been talking about. That's a product of how it does some of its policy, and it can actually be leveraged to do stacking and some namespacing today. So LXD does support doing an AppArmor-in-AppArmor confined container: you could have a system with AppArmor and then boot a container that has a system with AppArmor, and they'll both have their own policy. This is, again, separate from the LSM stacking. But the LSM infrastructure is not our only problem.
We're missing security blobs for namespaces, a way to actually set up namespaces in the kernel and store information there. That's got to be added. We need some new hooks in the system to catch every transition we need around namespaces. Namespaces, if you follow the kernel at all: we don't have a container concept in Linux, we have a whole bunch of different namespaces, and there needs to be better integration for this to happen. A big one is attributes, right? If you think of SELinux, it stores its labeling on files in the security xattr. And to do namespacing, it's got to namespace that xattr somehow. The security xattr is already namespaced off the security module: when you go to the security xattr, you go to the SELinux name under it. So SELinux has to come up with a backwards-compatible scheme so they can namespace their labels inside there, and the same idea goes for Smack as well. This will also make xattrs larger. So when you run a container that's writing xattrs, it can make the xattr quite large, which is going to be a performance problem. We've had pushback from the file system people about that, but you just have to do it; there's not much else you can do. And there needs to be some kind of interface to set up an LSM namespace. We don't have an agreed-on interface; we don't have a consensus on what that should look like or exactly what it means to namespace the LSM. But we're coming to something that might work. So remember I talked about the default LSM a while ago and how, in a stack, to support the old interface, each task could have its own default LSM, what it's expecting to see, and that could be changed. So the idea basically is that LSM stacking can already call back into the LSM modules multiple times. We can extend this and support it per task instead of just at the system level, and so that becomes kind of the namespace there.
And what we do when you're in a namespace, which has a list of stacked LSMs, so you could have maybe five LSMs on that list, whatever, is we can subset which LSMs in that stack you can see. So when your container is up, it only cares about and can only see that it's got AppArmor in it. On the system, in its namespace, it doesn't see AppArmor, it only sees SELinux. And so with a very thin layer on top of the stacking, we can actually simulate a namespace on top of the existing stacking patches. And that also provides the ability for a container to extend that namespace stack. That's about where we are. So this has been a lot of work, like I said. Casey's been working on the stacking stuff for about five years. I picked up some of the namespacing stuff and started looking at LSM stacking and namespacing last year so that we could get this working with containers. The AppArmor team has done a lot towards providing container support. IMA is now working on it, the Smack team is working on it, the SELinux people are working on it. And the LXD team have put in a lot of work so that you can actually use that AppArmor feature that was added, so you can do container-on-container with AppArmor. Unfortunately, I did have a demo, and in the grand tradition of presentations it broke this morning, and I couldn't resurrect it before the other presentations. So, any questions? Question: with stacking of LSM modules more than one level down, would this be a big performance problem, and how is it going to be implemented in the kernel, since more than one LSM sub-module is a problem at the moment? So more than one LSM sub-module isn't a problem. The way the current stacking is set up, each LSM hook point actually iterates down a list. And so there is a performance impact, and it depends on your LSM what that performance impact is going to be. Right now, say on Ubuntu, when you boot your system, it has three LSMs on by default: it has AppArmor, it has Yama, and it has capabilities.
Capabilities are actually run as an LSM, the default LSM; they're always used, but they run through the same infrastructure. So the system is already running three of those. Adding another one, it depends on the LSM what the cost is going to be. Each LSM has a different cost, and there is an incremental cost every time one of these hooks is called, because if you have four LSMs, obviously you're walking down a list of four entries, right? So for containers, what we're trying to do, like I said, is take that list that the system already has, and instead of making it system-wide, you make an LSM namespace block that has the list of hooks. And so only tasks within that set are going to have to do that extra hook, right? So when you open up your, say, Fedora container on Ubuntu, it's going to boot up and, if it's set up right, if we can get the LSMs cooperating, you'll add the SELinux hook to that list, so now it's a list of four. And even if you have Yama, and Fedora does have Yama, it can share the Yama hook that's already there, so we're not adding an extra Yama hook call-out. We're adding one extra layer for that one container, and only the container incurs that hit. There certainly is a hit, and like I said, it's very dependent on your workload. I have not done any benchmarking of it. These are still early days: we're actually booting AppArmor and SELinux together and getting containers to run like this, and we're just happy getting that far right now. Sorry, I don't have a better performance range for you. One thing I can mention quickly, as the LXD team member: in our case, what we're most looking forward to is, first, that it will let us run CentOS or Fedora or any of the standard SELinux distros on top of LXD on Ubuntu or Debian or anything else, really. The other thing that's going to be interesting for some folks is being able to run Android inside containers, since Android also uses SELinux.
That's going to be a game changer once that work is done, merged, and working. We're getting very close. We're at the point now where, with those branches, we can reasonably easily get AppArmor enforcement in containers on top of an SELinux kernel. The other way around is still a bit trickier. I did have a demo of it. The reason it's trickier is that SELinux labels the objects. If you don't have SELinux labeling the objects when you boot your system up, then the container orchestrator, or however you're setting it up, is going to have to do some labeling of anything that you're bind-mounting in. Any kernel object, any file system, any mount that hasn't been labeled by SELinux has to be labeled before it can be bind-mounted into an SELinux container. That's where the difficulty is right now for that. Another use case: this would allow you to run Snappy applications with confinement on Fedora, where today what happens is they fall back to what's called classic mode, with no confinement. Docker images too: some of those images do have some AppArmor support, and we can extend it now so that they can load up on, say, Fedora and have an AppArmor profile there. Ideally, we'd get to the point where they could have an SELinux profile within their Docker container if they really wanted, or Smack or something else, which would possibly bring you Tizen apps too, just like Android. Question: because you would have two systems running, SELinux and AppArmor, one namespaced and the other not, is there any chance of having a conflict, or would the main module still override the one being used within the container when it comes down to, let's say, controlling file access permissions on shared volumes or something like that? Right, okay.
So how it's being done is both of them will be consulted, and for permission to be allowed, when you have them both on the list, they both have to say yes. So say I boot up a Fedora system and there's an MCS label put on the container, and my container is an Ubuntu system, right? When I go to access a file in that container, AppArmor is going to be consulted, can you access this file, and SELinux is going to be consulted, and it's going to say yes or no as well. And so they both have to agree and allow it. If either of them disallows it, then it won't be allowed. So the idea here is, again, the host confinement on that container is still being applied, and then the container itself gets to add extra confinement on top of that. And to be clear, that's needed, because a process can effectively switch its own display LSM, so it can switch between SELinux and AppArmor, and you don't want that to be an obvious way of bypassing the preconfigured LSM system. Another question? Yeah, any chance we could see mount labels on bind mounts? So that's one of the things that still needs to be worked out. There are cases where labeling is not very good, right? So with bind mounts, there's no label on the mount structure in the kernel; it's on the underlying object. There is a default, you can set a default context for a mount, but it applies to the whole mount and everything underneath it. I mean, with SELinux you can set context= and it applies to everything under the mount, or you can set defcontext= and it applies to anything that isn't labeled, right? We would like to get to where you could virtualize that and see only what you should out of it.
The other one, the file system side, is difficult right now: depending on how you look at the xattr, how do you virtualize that as well, so that when the container goes and looks at that file, it doesn't say, oh, I can see the host label on that as well, right? So that's an area that still needs work. Anything else? All the way up. It sounds like this work could be useful to allow ordinary user accounts to use security modules to contain their programs. For example, if you're compiling something you've made, you want to make sure it doesn't write outside the source directory? So, yeah, the LSMs are used to confine user accounts. You could conceivably have different users under different LSM confinement. I don't necessarily know of anybody planning to use that, but it would be no different from, say, running your application containers under a different confinement, right? So I think the use case discussed here would be, effectively, every user running under some kind of confinement profile with a namespace set up so that they can themselves load additional profiles, so that Denny's makefile can load a policy to make sure it can't read anything other than the source code it's meant to build. Yes, okay, so that is possible. The question is how much you open it up and where you open it up. Do you open up policy loads? Are you comfortable opening them up to anything beyond root? What kind of scrubbing are you doing on them, and how far do you open them up? Do you open them up just to container root, the namespace root, user namespace root, whatever you want to call it, or do you open them up to regular users? This isn't available in AppArmor yet, but I can tell you that is actually a use case AppArmor has planned for. They have not opened up the ability for a user to load policy into a namespace yet.
Being able to do that, you have to be really sure that the user's not going to break something when they load their policy, and that's something that has to be properly vetted. It just takes a long time to get to that stage. Anything else? I guess not. All right, thank you very much.