So I'm John Johansen. We're going to talk about namespacing LSMs a little bit. So containers, they've got lots of use cases. We've covered this before; you can see the earlier talks if you haven't. Some of them even involve LSMs. Snappy uses AppArmor a little bit. For system containers, the use case is, for example, running an Ubuntu container on top of Fedora and wanting Ubuntu to run just like it does natively, with AppArmor, or doing the reverse with Fedora on Ubuntu, that kind of thing, right? Another one is Anbox or things like that, which run Android in a container, so that's SELinux. So there are some different use cases, and if we can do this, it will open up even more. Of course, we all know that containers aren't really a thing on Linux. It's up to the container manager to define what they are. They're made up of a whole bunch of different components. Nothing new there.
LSM, quick review. I'm sure probably everybody knows what it is. If not, it's just a set of hooks and blob data in the kernel infrastructure that is leveraged by AppArmor, SELinux, TOMOYO, Smack, and a couple of others. And then there's a whole bunch of out-of-tree ones that people are trying to get in.
Containers, when they use namespacing, kind of look like this. You're imagining: I've got a container here with AppArmor, I've got a container here with SELinux. LSMs really don't want that. They want something more like this. We've got SELinux at the root, or whatever, on the host, and then the container can have something else. That's fine. But we want to have a bound on the container. The host wants to apply policy. And whatever container is running a separate LSM can have a separate policy, but it can't get around the bound that the host is applying. You could do AppArmor on AppArmor, which we've actually been doing for years. We can discuss that a little bit in a minute.
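The bounding idea can be pictured with a toy model. This sketch is not how any LSM represents policy: it assumes a policy is just a flat set of (operation, path) grants, and the names `effective`, `host`, and `container` are invented for illustration. The only point is that what a task in the container may do is the intersection of what the host bound and the container policy allow.

```python
from typing import FrozenSet, Tuple

# Toy model (assumption): a policy is a flat set of (operation, path) grants.
Permission = Tuple[str, str]
Policy = FrozenSet[Permission]


def effective(host_bound: Policy, container_policy: Policy) -> Policy:
    """A task in the container gets only what BOTH policies allow."""
    return host_bound & container_policy


host = frozenset({("read", "/etc/hosts"), ("write", "/var/log/app.log")})
container = frozenset({("read", "/etc/hosts"), ("read", "/etc/shadow")})

allowed = effective(host, container)
print(sorted(allowed))
```

The container policy asks for `/etc/shadow`, but because the host bound never granted it, it drops out of the effective set.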
In the past, we had the limit of a single LSM. That got lifted a while ago for minor LSMs. And now, with Casey's stacking work, we have multiple major LSMs. You'll notice that this is different from the container model I just showed. We've got these LSMs all at the system level, all on the host. Whatever you put in the stack, that's what's there, and they apply to every task on the system. Now, this doesn't match completely with what we were talking about before, but it does work for some containers.
So there are user space interfaces, right? The LSM infrastructure provides a few, but every LSM also provides its own. AppArmor has its own set of interfaces that are different from the SELinux, Smack, or TOMOYO interfaces. Every one is unique. You will notice that two of the infrastructure interfaces are highlighted in yellow. Those ones are somewhat mistakes. They're common, so they're shared by Smack, AppArmor, and SELinux. This is actually a big problem. Casey's going to talk more about that. For his stacking work, there's this display LSM value that's part of the stacking, and there are new interfaces as well. We're not really going to cover those. We're going to leave those to Casey. With the stacking you can also see what LSMs are in place and running on the system.
But those yellow ones, let's get back to them. If we virtualize them with the display LSM, it starts looking like the model we saw before. We can have all three of these LSMs at the root of the system on the host. But with the display LSM, as long as the container is running old code that isn't aware of stacking, it only sees the one LSM, whichever is set as the display. This can actually work for new, stacking-aware code as well. There just has to be some work in the LSM itself to do some things.
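As a rough sketch of what virtualizing the shared interface means, here is a toy model. The classes `Task` and `LsmStack`, the method `legacy_read`, and the labels are all hypothetical; this is not kernel code or a real API. It assumes each task carries one label per active LSM and that the shared legacy interface shows only the display LSM's label, defaulting to the first LSM in the stack.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Task:
    # Label each active LSM has assigned to this task (toy data).
    labels: Dict[str, str]
    # Which LSM the shared legacy interface should show; None means default.
    display: Optional[str] = None


@dataclass
class LsmStack:
    order: List[str] = field(default_factory=list)  # boot-time stack order

    def legacy_read(self, task: Task) -> str:
        """What stacking-unaware code sees through the shared interface."""
        shown = task.display or self.order[0]  # default: first LSM in stack
        return task.labels[shown]


stack = LsmStack(order=["selinux", "apparmor", "smack"])
host_task = Task(labels={"selinux": "system_u:system_r:init_t",
                         "apparmor": "unconfined", "smack": "_"})
container_task = Task(labels=dict(host_task.labels), display="apparmor")

print(stack.legacy_read(host_task))       # SELinux label (first in stack)
print(stack.legacy_read(container_task))  # AppArmor label via display
```

All three LSMs still label both tasks; only the view through the shared interface differs, which is why old code inside the container believes it is running under a single LSM.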
We can talk about that a little more later. So let's boot with multiple LSMs. We're going to use the lsm= boot parameter to set different things. We're going to boot AppArmor and SELinux here. It's coming up on Ubuntu, and it'll just take a minute because I'm booting a full GUI for various reasons here. And I'm not liking the demo gods. They always blow my demos away. This is a screen capture, right? So it should work. So you can see we've got the LSMs there, right? We booted up. We've got AppArmor, we've got SELinux in place, all well and good.
Let's do it again. This time we're going to switch it. We're going to go SELinux and then AppArmor in the LSM stack. So it's coming up, maybe. It's freaking out a little bit. And you notice all that red there, right? We start getting failures. In fact, we'll get another one popping up there, and it never gets out of this. It won't boot. So why, right? This is one of the gotchas that we didn't catch when we were doing this originally. In fact, it doesn't surface with a lot of the use cases we were doing, but it does surface with SELinux first under Ubuntu. It doesn't affect Fedora. We'll get to that.
So let's back up. Why does the order matter? We're booting with AppArmor policy. We're not booting with SELinux policy. So SELinux is enabled, but it's not loading any policy. The old code is not aware of LSM stacking, so it's using the shared interface, like we talked about, which goes to the display LSM. The display LSM is the first LSM in the stack by default. So AppArmor in the first case, SELinux in the second. There's the clue. Ubuntu is built with support for both SELinux and AppArmor enabled in all its user space components. By default, D-Bus is built to support SELinux and AppArmor. So in D-Bus, the SELinux code goes and looks and says, I've got no policy. I'm not enforcing anything. That's OK. The AppArmor code comes in and says, I've got policy. That's great. And I'm enforcing. OK, I get an SELinux label.
And that's a no-no. So it kills it. That's why we failed to boot. So, first gotcha, right? Virtualizing those shared interfaces with the display LSM wasn't quite enough. AppArmor has an aa_is_enabled check that its library provides, though some applications actually bypass the library and do the check directly. What it does is check for securityfs being mounted with AppArmor in it, so that it knows it's there. And it also checks the AppArmor enabled kernel parameter, or most of them do. So what's happening is we're finding AppArmor is enabled, but we don't see that the display LSM isn't AppArmor. And we're talking old code here. We want to run old code. We can't change the ABI from the kernel to user space. So this has got to work. So what do we do? We change AppArmor a little bit so that what aa_is_enabled looks at, the securityfs check and the enabled parameter, gets virtualized by the display LSM as well. So the old code that's checking this will see, hey, AppArmor is not enabled if the display LSM is not set to AppArmor. That fixes this problem.
Another gotcha. And this is all gotchas, by the way. Pretty much, anyway. OK: Fedora host, SELinux policy on boot, no AppArmor policy on the system. So we're just going to boot. Then once we've booted, we're going to create an AppArmor policy namespace or just load some AppArmor policy. This is what we get. SELinux is denying AppArmor. So what's going on? This is the AppArmor code, what it's doing: it's calling ns_capable for the current user namespace with CAP_MAC_ADMIN. We'll cover why in a second. But that loops back into the system; the LSM stack loops it back through, so SELinux gets the capable call. And so what happens is SELinux says, no, you can't manage my policy. It denies it, rightly, right? So what do we have to do? We have to go back in. We change the AppArmor code, it's really simple.
We change it to a direct call to the system capability check, cap_capable, instead of looping back through the capable hook. And then we do the AppArmor capable call directly. The only reason it was done through ns_capable before is long history: originally the LSMs were doing the stacking themselves, composing the capability module themselves. With stacking that went away. So instead of having it composed in a whole bunch of places, we composed it in one place, in the capable hook. Then we just had to call capable, it would compose back through, and we were good. Well, it bit us.
So we can try a simple container. We're going to do a chroot here. And a chroot is not really a container. And it should not be doing that. OK, so anyway, we'll just describe what's going on here, or should be going on here. I'm not sure why it's dead. So a simple chroot: we set up a chroot, just bind mount proc and bin and everything into a simple chroot directory that we set up, chroot in, and load some AppArmor policy. So what happens is you have two shells open. One's in this chroot. One's outside. Run the command in the chroot. Policy gets applied. Great. Run the command outside the chroot. Policy gets applied. Not so great. Again, we go back to the original demo of what LSM stacking is doing. It's applied to the whole system, right? So we need more than just loading policy once we've chrooted or set up a new namespace, whatever.
So yeah, we need more than LSM stacking, right? Let's back up. We talked about how AppArmor is being applied to the whole system, SELinux is applied to the whole system, not just the chroot. In this case, if we're setting up a container, we want policy to apply just to the chroot or the namespace, whatever we're setting up as our container. We need a way to apply the LSM to just part of the system, right? A namespace, essentially. LSM stacking is not LSM namespacing, at least not yet. Maybe one day. Let's just get stacking first. OK.
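The difference between the two chroot outcomes can be sketched with a toy model. Everything here is hypothetical (`PolicyStore`, the namespace tuples, the `/bin/ping` profile); it assumes policy namespaces form a hierarchy in which a profile applies to tasks in its own namespace and in anything nested below it.

```python
from typing import Dict, Set, Tuple

Namespace = Tuple[str, ...]  # () is the host/root; ("c1",) is a child of it


class PolicyStore:
    def __init__(self) -> None:
        self._profiles: Dict[Namespace, Set[str]] = {}

    def load(self, ns: Namespace, executable: str) -> None:
        self._profiles.setdefault(ns, set()).add(executable)

    def applies(self, task_ns: Namespace, executable: str) -> bool:
        # A profile applies if it lives in the task's namespace or an ancestor.
        return any(
            executable in profs and task_ns[: len(ns)] == ns
            for ns, profs in self._profiles.items()
        )


ROOT: Namespace = ()
CONTAINER: Namespace = ("c1",)

# Stacking only: policy loaded from inside the chroot still lands in the root.
stacking_only = PolicyStore()
stacking_only.load(ROOT, "/bin/ping")

# With a policy namespace: the same load is scoped to the container.
namespaced = PolicyStore()
namespaced.load(CONTAINER, "/bin/ping")

for label, store in (("stacking only", stacking_only), ("namespaced", namespaced)):
    print(label, "inside:", store.applies(CONTAINER, "/bin/ping"),
          "outside:", store.applies(ROOT, "/bin/ping"))
```

With stacking alone, the shell outside the chroot is confined too; once the load is scoped to a namespace, only the shell inside is affected.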
So AppArmor, however, needs to set up its policy namespace if we're going to do this. If it sets up its policy namespace, we enter the policy namespace and then do the chroot, and everything's good, right? Like you'd expect. AppArmor, for a long time now, has had policy namespaces and its own ability to do some composing, bounding, stacking, whatever you want to call it. AppArmor policy namespaces are hierarchical. A task can have multiple confinements on it. So there's a system confinement on that task, in the blue, and some sub-namespace's confinement on the task. And AppArmor will compose those two. So it's bound by the system. For the container, you don't just replace the AppArmor namespace or the AppArmor policy or host policy, whatever. You add to it. It's composing. We've been doing this for a while on Ubuntu, so LXD can do AppArmor-on-AppArmor containers. But it doesn't work with any other LSM because it's just in AppArmor.
To get there, every one of AppArmor's interfaces is virtualized. Securityfs entries: we use a special jump link, a magic jump link, whatever you want to call it, in securityfs to jump into the AppArmor fs to support the policy namespace directory. And that will virtualize the directory structure for us. Procfs files: when a task goes and looks at the proc files, they're actually virtualized by the namespace, so you get different labels out of them. getsockopt, that's another interface. That has to be virtualized as well. And this is all AppArmor doing it. It's not the LSM infrastructure. Every LSM is going to have to do that currently. Even some of our kernel parameters have to be virtualized, like enabled.
In addition, there was some feature development that isn't necessarily needed for every case. For the container case, system containers anyway, it's CRIU. You've heard about CRIU for SELinux today, I believe. So we also had to add CRIU support.
So we needed to hold on to the policy in a form that we could dump back out. AppArmor actually transposes its policy, unpacks it, and does operations to make sure it's kernel-friendly and fast. So we actually have to hold on to the original policy; we compress it and store it. And then we can dump it back out to user space if CRIU wants it, and then they can reload it. The policy also has to be namespace-aware when we're doing this. When the container tries to do this, it gets set up correctly again for the proper namespaces. Obviously, we had to add interfaces for container managers to do this. We're not going into a lot of details. While we're at it, we're going to hit some more issues with getting these namespaces working.
OK, system containers. Securityfs. We talked a little bit about this. We had to virtualize our securityfs files. Here's another problem. Even though we're virtualizing them, when you boot up an operating system image, it says, I'm going to mount securityfs. Securityfs is not multi-mount capable. So what happens is you try to boot the image, and it fails to boot. Great fun. So short term, we can have the container manager map securityfs in. That's what we're having LXD do right now. And then the image has to be container-aware. It says, oh, I'm running in a container, so I will do this instead. I don't have to mount that. Really sucks. We don't want that. Long term, we need to make these things multi-mount capable to make this work. But you can work with it as it is now, as long as your container manager sets things up right and your image is set up for it. What can you do?
Seccomp and no_new_privs. This one is so much fun. So no_new_privs was added with seccomp. It's a prctl, and what it does is let the task say, hey, I don't want to have any new privileges, so lock the system down. And the LSMs respect that, sort of, largely.
But the problem is if the container manager is doing this and doesn't expect things, it doesn't know what's going on. You try to boot, say, an image with LSM policy. And then the LSM policy is trying to do some domain transitions, do some other stuff. Those aren't going to be allowed. So it tries to do something, it's locked out, and it fails. And this is really problematic. There are issues all over the place with this around the LSMs. We'd have to build into policy that we are allowing overrides. But if you're running a host image, like in the system container case, and you're booting an operating system, it doesn't think it's in a seccomp jail. It doesn't work to make the LSM policy aware of it, because it can't be.
So we go back to what AppArmor is doing with its namespaces, its internal stacking and bounding. We had to add the ability for AppArmor to track what the confinement was when no_new_privs is set. So when that gets set on the task (it's not exactly when it's set, but we won't get into that implementation detail), it grabs that confinement and stores it off, so we know what it is. And now, when we go to make transitions in our policy, we can allow those transitions. As long as those transitions are stacked against, or bounded by, the policy that was in place when no_new_privs was enforced, we can guarantee that no new privileges are being added, and we can allow those transitions for the LSM. So you boot your container, your system image, and it works. It can do those changes. You've got the host OS that's locked down its container image or confinement, and underneath that, the container is free to make changes. Any other LSM is going to have to do something similar if it's respecting no_new_privs. Whether it's actually supported through the LSM infrastructure or not, that would be nice, but at the moment, it's not. Anybody who wants to do that is going to have to do it themselves.
Nesting of containers.
So LXD, and not just LXD, there are lots of cases for nesting of containers. One case would be LXD containers: you boot an operating system, and then in that operating system, your virtual machine, whatever you want to call it, you say, I'm going to use LXD and boot something else, right? There's one case. Another: you're in a container, again you've got your host, you're in your container running something, and you want to run an application container, a Snappy or Docker or something else, where it's again using an LSM. So it doesn't actually have to be nesting system containers. It can be a system container with something else nested.
We run into problems, again. Specifically, user namespaces. Go figure, right? Currently, there's no way to map capability requests to the policy namespace that AppArmor has. There's no way to map a user namespace, and the capability requests that come along with it, to the policy namespace, and no way to know when the user namespace is even created. Again, we're missing some hooks here. This is a solution that needs work, and we're working on it. But we can do something. We can associate the root user namespace with the root policy namespace that we have for AppArmor. And we can limit our stacking depth for containers doing this kind of thing right now to two, right? So we know if the user namespace is equal to the root namespace, we're good, we can use the root policy namespace, we can get that mapping that we need. If it's not, we're limiting our stack to two, and then we know it's the other one. So then we can actually figure out where things are.
Long term, what we need to really fix this is some new LSM hooks, and we need to be able to associate policy with the user namespace. What we don't want, though, and this is general, not just AppArmor but other LSMs as well, is to tie our policy namespaces to the user namespace explicitly, the way network namespaces are tied to the user namespace.
We actually don't want that. We want more flexibility for the LSM namespace, whatever you want to call it. We have cases where we're using policy namespaces, AppArmor is anyway, where you're not going to change the user namespace. So there's a lot there that we don't want. We need to come up with some infrastructure, and there are some patches in the works, but they haven't been posted yet, to propose ways to fix this.
So with a little support from LXD, we can actually run an Ubuntu container on Fedora. The little support being that LXD, like I said, has to be aware of AppArmor. It has to be able to set up the policy namespace. It's got to be able to set up the mappings that we already talked about. A second video, not showing. Okay, so we've shown this demo before, unfortunately. We can try, see. Can we get that to go? So we had a demo there of just simply launching an LXD container. You can see the AppArmor policy at the host level and how it's confining and changing in the container. And then you can go in the container and look at its virtualized policy. It's a really simple demo. I will see if I can get that up for people to look at. Not sure why it's dead.
Future. We do have some additional things to do. So, a move to dynamic LSM stacking. For the namespacing that we have right now, AppArmor is using LSM stacking. Multi-LSM, AppArmor with SELinux or anything else, is not possible without LSM stacking. It's all built on top of LSM stacking. But AppArmor itself is doing the additional work of stacking the policy, tracking that, and doing the bounding. We could conceivably do this in the LSM infrastructure, and move some of that from AppArmor into the LSM infrastructure, so other LSMs could take advantage of it and not have to recreate that part of the work.
There are some costs to doing that. Instead of using pre-allocated vectors and some of the optimizations that the infrastructure is doing right now, we would have to do some dynamic mapping of some kind, linked lists maybe. There are different possibilities, and we would actually have to see whether the cost is worth it, or whether we force this on each LSM to support if they're going to. That's something we still have to look into for the future. There also would need to be some kind of interface agreed upon. Getting any interface agreed upon is fun. Casey, you want to take that? Okay, so any questions?
So I'm very sympathetic to the no_new_privs problem. I think it took us three or four kernel releases before we sorted that out, and that's not in the stacking case. But when you were describing it, I was curious. It sounds like basically, if you have a process that spawns off another nested AppArmor namespace instance, whatnot, and that process is running under no_new_privs, you take a snapshot of that policy and do your calculations?
Yes.
Well, I guess what I'm wondering is, and this is something I'm trying to figure out for SELinux: what happens if you were to change the host policy? Is the nested AppArmor policy still going to be operating under that snapshot that you took, which might not be valid anymore?
No. So what happens, and it's kind of ugly, is we grab the confinement label that is in place when no_new_privs gets invoked, and we store that off, and we also carry along the new confinement that is changing, so what it's supposed to be without that. When we go through, we dynamically compose those when you create a new task, or on a new exec or change of confinement. It doesn't have to be done for every task. You can just inherit the composed one. And then with that composed one, okay, we're doing the bounding, so we have to walk everything in that.
So then when we go to change confinement, what we do is we start with the broken-apart label, and we compute what we need for the new confinement. And while we're doing that, we're also checking whether the saved no_new_privs confinement needs to change. If policy allows that change, then we have this direct mapping of which components change. We update the locked no_new_privs confinement, and we save that new one off. And so now we get the new composition of the changed host policy, because it was allowed. And it's locked down as if it's no_new_privs under that change, because no_new_privs still applies. You don't pull that away from the task. And then you get the whole new composition of the two. It's ugly, but I couldn't come up with another way to do it.
So I also have a question about the no_new_privs thing. Just to clarify, did I understand it correctly that the problem is that you're trying to spawn a container and the process that's trying to spawn the container is already subject to no_new_privs? Or is that different?
Potentially. That's not necessarily the case. There are several ways you could set it up. It could be that it's already in no_new_privs. Ideally, your container manager wouldn't apply that until late. The problem is, you set up a new container. Say you don't set up no_new_privs first. It's almost the last thing you're going to set. So you're not confined by no_new_privs when you set up your container. You set up a new policy in the container. So your task has an AppArmor namespace and it has some policy in that namespace that it's taking. Let's just say it's unconfined even, right? So now when you start your container running, that task starts running and it has no_new_privs applied to it.
It hits an exec and it says, okay, I want to do this exec, and the LSM rules, the AppArmor policy, say: when you run this program, this exec, I'm going to change the policy for this application to some other label, some other profile, whatever you want to call it. But no_new_privs is going to block that. And the system image that you're running in the container isn't aware that it's under no_new_privs. It's running like it's on the host.
But you said the container manager is no_new_privs and you're not setting no_new_privs?
No. Let's say the container manager is not under no_new_privs. Let's say we're doing this sanely. The container manager is the worst case, and the container that it's setting up is set to no_new_privs right before it starts running the container. So we don't have to deal with trying to figure out what the confinement is when we're setting this container up.
Oh, so why do you set no_new_privs on a container?
Some container managers are setting no_new_privs because they're setting seccomp policy. Now, not everybody's doing that.
But when you're creating a user namespace, you can set the seccomp policy without setting no_new_privs.
Who says you're using a user namespace?
Okay, that answers the question. Thanks.
Right. So containers are this amorphous blob, right? You have all these different types of container managers. Not everybody's using user namespaces. Docker, for example. So yeah, no_new_privs is a real pain.
More questions? If not, let's thank John.
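The no_new_privs handling discussed in the talk and the questions can be sketched with a toy set model. This is an illustration only, not AppArmor's label implementation: it assumes a confinement is a flat set of permission names, and `TaskConfinement`, `nnp_saved`, and `transition` are invented names. It shows a profile transition being allowed under no_new_privs because the result is stacked against the confinement saved when the lock was applied. It does not model updating the saved confinement when host policy changes.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

Perms = FrozenSet[str]


@dataclass
class TaskConfinement:
    current: Perms                      # what the task's profile(s) grant now
    nnp_saved: Optional[Perms] = None   # confinement captured at no_new_privs

    def set_no_new_privs(self) -> None:
        # Remember the confinement in force when the lock is applied.
        if self.nnp_saved is None:
            self.nnp_saved = self.current

    def transition(self, target: Perms) -> None:
        """Change profile, e.g. on exec, without ever widening privileges."""
        if self.nnp_saved is None:
            self.current = target       # not locked: plain transition
        else:
            # Locked: stack the target against the saved bound.
            self.current = target & self.nnp_saved


task = TaskConfinement(current=frozenset({"net", "read", "write"}))
task.set_no_new_privs()
task.transition(frozenset({"read", "ptrace"}))
print(sorted(task.current))
```

The transition goes ahead, but `ptrace` never appears in the result because it was not part of the confinement in place when no_new_privs was set.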