Okay, can you all hear me? Is that all right? Good morning. I'm Jake Edge. I'm here from LWN.net. If you haven't checked out LWN, we like to think of it as the premier news source for the Linux and free software development communities. So go give it a look if you haven't visited before. I put together the weekly security page for LWN and also write articles about various other topics that are hopefully of interest to free software developers of various sorts and Linux developers. And today we're here to talk about namespaces for security.
So this is something of an agenda, I suppose, for what we're going to be talking about. We'll start off with the kinds of threats that systems face these days, the kinds of effects that attacks against systems can have, and the defenses that are typically used. Then we'll talk about namespaces in particular: the types of namespaces, how you create them, how you use them, and some examples of things that you might do with them. As with any security technique, namespaces are no silver bullet. There is no silver bullet in security, or perhaps anywhere else. So this is part of a defense in depth. You could add namespace-based security to other techniques that you're already using or plan to use. As he said about questions: anytime you have one, raise your hand and we'll get the mic to you. You don't have to wait until the end.
So what kind of threats are we talking about? There are lots of different kinds of threats, of course, but you have targeted attacks, where somebody is after your data in particular or access to your system in particular. Those are much harder to defend against and are maybe outside of the scope of what we're talking about today. These days I guess it's not just attackers we're worried about, but maybe governments going after data in a particular location, doing a lot of reconnaissance and so on.
That's a very hard threat to defend against. So what we're talking about here are mass attacks, where someone is running a particular script against a million web servers trying to exploit a particular web application flaw, say, or that sort of thing. No problem.
So typically one of the things that gets attacked is network-facing services, right? I mean, the attacks that we suffer today typically come from the network, but network clients shouldn't be forgotten as a potential target. If a network client is talking to an attacker-controlled server, it's possible that that server could induce some behavior in the client that could lead to a compromise of one sort or another. DNS cache poisoning is one that I think gets less attention than it deserves, in that there are millions, hundreds of millions perhaps, of Linux-based routers out there that people are using to access the internet. They often have a caching name server on them, and if an attacker can poison the cache on that name server, then they can redirect any traffic for yourbank.com to an IP address they control. That's a pretty powerful technique, I suppose.
Web application flaws, of course, are reported daily if not hourly or even minutely. There are any number of them at any given time, and they often don't get patched all that quickly, so then these mass attacks try to exploit those flaws. And then there are the cross-site attacks: cross-site scripting and cross-site request forgery. So those are the kinds of threats that systems under our control face these days.
So what kinds of effects can a typical attack have? What can an attacker do when they successfully attack a system?
Typically it depends on whatever network service is running. Generally these services run under their own user. Let's say the Apache user is running the web server and that account gets compromised, which means that the attacker can now run any code with the permissions of that user. And the Apache user on most systems is not a privileged user, but that almost doesn't matter. Privilege is not typically what attackers are looking for. The Apache user can typically access the network, and it can access the file systems, though generally only some subset of the file systems that is set up for it to use. It can access other processes.
So where does that lead us? Well, network access gives an attacker the ability to send spam. It's also very common these days to use it for distributed denial of service attacks. There are lots of botnets out there, and these botnets get rented or sold or whatever to people who want to do these kinds of attacks or want to send spam, and so the network access of a regular user allows an attacker to do that. They don't need any privileged access at all.
File system access, of course, means confidential information, configuration settings, keys to be able to log into some other system, whatever else. I mean, there's a lot of stuff. Now, one would hope that the Apache user wouldn't have access to all that information on the system, given traditional UNIX access control mechanisms, but that is certainly a danger.
And if you can see and deal with other processes, you can do things like ptrace them or send them signals or kill them outright, so that you have a denial of service. Ptrace is particularly pernicious in the sense that, for instance, if you had multiple Apache servers running and one of them was compromised, then the Apache user can ptrace the other Apache servers that are running and possibly extract confidential information like usernames, passwords, or some other kind of credentials.
And then there's privilege escalation. I think it's fairly well accepted by most kernel developers (Matthew Garrett said something similar to this in a talk yesterday) that on any given kernel there's some way available to escalate your privileges to root. It's probably not known. It's some sort of zero day, but they exist. And so compromising a regular user always seems like, maybe not a minor event, but not a major event, maybe. But in any case, if you can then escalate your privileges beyond the Apache user, say, to root, the whole system is compromised at that point.
So our standard toolbox to avoid these kinds of threats includes things like Unix permissions, your traditional rwx on the file system that restricts access to particular files or sets of files. Users and groups kind of go hand in hand with that. Certain users have certain access and so on. We also have mandatory access control in the Linux kernel. There's SELinux and Smack and AppArmor and TOMOYO. All of these add additional layers of security for protecting different kinds of resources.
There's capabilities, which was an attempt that I don't think really worked out very well overall. Root has an enormous number of things it can do, and capabilities was an idea to separate those out into individual capabilities that could be granted to a program or a user to give them a limited set of the abilities of root without having to become root. The idea was good, I think, but it turns out that it's very difficult to specify what those capabilities should be, and the way they are implemented now, it's a little difficult to add new ones. It's difficult for programs to know which are available and which aren't. And then there are a handful of the capabilities that are supposed to be less than root but can be leveraged to become root. So they are essentially the equivalent of root. It's a bit of a mess.
A more recent one is the seccomp sandbox.
It's a way to put processes into a sandbox and only allow them to make certain system calls, or make those system calls with certain arguments. So you could put a process into a sandbox and say it can't make the write system call, or it can only make the write system call to standard error and standard out. It's used by a number of things, or it's in the process of being used. It's pretty recent. The Chromium browser was sort of the genesis of it, I think, at some level, and the idea there was to put the rendering process into a sandbox with a restricted set of things that it was allowed to do, so that if the JPEG renderer had a flaw in it of some sort, what an attacker could do with that was very limited. It could only make the system calls that the normal rendering process would make. There are lots more. Linux certainly does not suffer from a lack of security techniques, I suppose.
So namespaces have actually been around for quite some time. The original namespace was put in in the 2.4 days, I believe. But the set of namespaces has been growing over time, and at some level they may be reaching completeness, though people keep proposing new ones. The most recent one that was added, and I guess more or less completed in the 3.12 kernel, where the last piece of the puzzle was finally dealt with, is user namespaces.
But the idea of a namespace is a way to partition the global resources of the system so that members of one namespace see a different view of a global resource than members of a different namespace. The global resource could be things like process IDs or the mounted file systems or the host name. So the idea is that it sort of provides invisibility. Say you have two namespaces that are siblings to each other. Let's say you have the root namespace, which is your standard default namespace.
That's what you get if you don't do anything to add namespaces. If you have two children of that, which are siblings, essentially, they can't see the resource that the other one is using. So they each get their own view of it. It's kind of a lightweight virtualization technique. You can combine it with things like cgroups, control groups, to make containers, which are almost like virtual machines in that they have their own view of the system. The processes in that container believe that they are the only things running on that system, but in reality, they're all running on the same Linux kernel. That's the difference. That's the lightweight part. VMs, of course, are each running on their own kernel.
Namespaces can be used for a number of different things, containers being one of them, testing and debugging being another. You can isolate a program that you're working on, or a program that you've gotten maybe from a dodgy location that you're worried might do something unexpected, and you can isolate it from the global resources like the global file system and so on. And then what we're here to talk about today is that they can be used as an isolation mechanism for security purposes.
So if you run a distro kernel, which probably lots of folks do, you probably have most of the namespaces already built in. User namespaces would be the exception, for a couple of reasons. One is that the 3.12 kernel that finally completes the picture isn't even out yet. That's not due until Halloween. But also, there are some questions about user namespaces that I'll get to in a second. But in any case, to configure a kernel to use namespaces, it's in the general setup menu. There's namespace support. And so then you have CONFIG_NAMESPACES, which is the overall namespace support. And then you have configuration options for each of the different kinds of namespaces. And we'll get to those kinds in just a second.
So depending on which kernel you're using, CONFIG_USER_NS, the user namespaces option, was not available depending on what file systems you also had built into that kernel. And all of that goes away with code that Linus just merged last week for 3.12. So post Halloween, at least if you're using the most recent kernel, those problems go away. User namespaces allow regular users to become root in their own namespaces. And as you might guess, there are a number of subtle bugs that have been found, where there are assumptions in the kernel about what root can and cannot do that don't consider the namespace issue. So I suspect it'll be a while before distributions turn those on, or at least turn them on so that regular users can create user namespaces. They may well turn it on so that root can do it. But there are almost certainly a handful of subtle bugs still out there that need to be shaken out.
So you create namespaces generally with the clone command. I'm sorry, system call. Clone is what underlies fork. Unshare and setns do different operations on namespaces. All three of them share this set of clone flags that say which kind of namespace or namespaces are to be created. So if we look at this CLONE_NEWNS, that is actually the mount namespace. At the time mount namespaces were added, it was the only namespace. This was way back when. And no one, I guess, at that point envisioned other namespaces. So it was just "create a new namespace". And the other ones are UTS, PID, NET, IPC, and user. And we'll talk about what each of those does in a minute.
But clone is the system call that starts a new process. And so if you pass one or more of these flags to the clone call, then that process will be started in a new namespace. Unshare does sort of a similar thing, except that it doesn't create a process. It creates the namespaces, takes the existing process, the current process, and puts it in those namespaces.
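As a rough illustration of the flags and the unshare call just described, here is a minimal Python sketch. The flag values are the ones defined in the Linux headers; the helper names `combine_flags` and `unshare_namespaces`, and the use of `ctypes` to reach the C library's `unshare()`, are assumptions made for this example and are not from the talk. Actually calling `unshare_namespaces` needs root or the appropriate capabilities.

```python
import ctypes
import os

# Clone flag values as defined in the Linux <sched.h> header.
CLONE_FLAGS = {
    "mnt": 0x00020000,   # CLONE_NEWNS, the mount namespace
    "uts": 0x04000000,   # CLONE_NEWUTS
    "ipc": 0x08000000,   # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "pid": 0x20000000,   # CLONE_NEWPID
    "net": 0x40000000,   # CLONE_NEWNET
}


def combine_flags(kinds):
    """OR together the clone flags for the requested namespace kinds."""
    flags = 0
    for kind in kinds:
        flags |= CLONE_FLAGS[kind]
    return flags


def unshare_namespaces(kinds):
    """Move the current process into new namespaces of the given kinds.

    Hypothetical helper: it calls the C library's unshare() via ctypes
    and raises OSError if the kernel refuses (for example, without root).
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(combine_flags(kinds)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

Run as root, something like `unshare_namespaces(["uts"])` followed by `socket.sethostname("f19")` would be expected to change the host name only for the calling process and its children, leaving the rest of the system alone.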
And setns is used to join an existing namespace. The trick there is that in order to join a namespace, you have to have some sort of handle for the namespace that you want to join. And we'll see how you go about getting that.
One thing to note: the systemd program has a mixed reputation, I guess you could say. A lot of people like it. A lot of people don't. I tend to like it. But in any case, whether you like it or not, systemd-nspawn is a useful tool for messing around with namespaces. The code, of course, is available. And it's pretty nice, clean code that's easy to look at, to take a peek at how it uses namespaces, how it sets them up, etc. And when I do my demo, you'll see that I use it. It's certainly nothing to be feared, even if you don't like systemd.
But did I skip over? Yeah, we didn't do this, did we? Sorry. Yeah, I thought it got a little confusing there in the middle. Keyboard bounce or something. So backtracking a little bit, we have these different kinds of namespaces. The mount namespace, as I was saying, was actually the first created. But UTS is probably the simplest. UTS stands for Unix time sharing. And all it does is essentially virtualize the host and domain names. So if you create two different UTS namespaces, and we'll see this, they can have different host names. So this is all part of the idea of doing a lightweight virtualization. Mount namespaces will give you a different view of the mounted file systems. This is another thing we'll see in the demo. The process or PID namespace virtualizes PIDs. So each PID namespace has from one to PID max, whatever that is, to itself, and it has a mapping of those process IDs. Those process ID numbers in the namespace map back to an actual process ID, a real process ID, if you will, in the root namespace.
But if you're inside of that namespace, you can't see the process IDs in the root namespace. You can only see the process IDs in your PID namespace. Similarly with the interprocess communication; that's the System V stuff, the message queues and shared memory and so on. The IPC namespace virtualizes those.
Networking is an interesting one. It separates all the networking devices, your eth0 or whatever, into separate namespaces, so that when you create a new networking namespace, it has no networking devices. You have to actually add the networking devices that you want that namespace to be able to use, and processes inside of that namespace cannot use the global networking devices, and thus the global networking setup, unless you configure things that way.
And then the user namespace is the most recent one, and it virtualizes the UIDs and GIDs, so that you can have a user 1000 in two separate sibling namespaces that have no correspondence to one another. This 1000 can't look at that 1000's files and vice versa. And more interestingly and importantly, you can have a root in both, neither of which is the root of the root namespace, but each is root for that user namespace and can operate as a privileged user within the namespace.
So let's see if I can go a little slower back to where we were. Okay, so this, pardon me, is my attempt to use LibreOffice Draw to show a diagram of what you're about to see in the demo. We've got the root namespace, and it's got its set of PIDs, and it's got a particular file system, or directory in this case. And that directory, /srv/f19, sorry about that, is just a directory with a stripped-down Fedora 19 in it that I use for the demo. And it gets mapped to the slash directory inside of the namespace. And this is a namespace that gets created by systemd-nspawn.
It has its own set of PIDs that correspond to the PIDs in the root namespace, but processes inside of this namespace cannot see or use these numbers at all, right? If they refer to process 444, it doesn't exist. There is no such PID inside of the namespace.
So then I talked about needing to be able to get a handle on namespaces. And the way that you do that is with the proc file system. For a particular process ID, there is an ns directory, a namespaces directory, and it has files. Which files it has depends on which namespaces you have configured into your kernel, but it could have as many as six files, for all the different namespaces. And those files are references to the namespaces, so that for setns you can open those files. So you have an open file descriptor that you can pass to setns to say: this is the namespace of interest, put my process in this namespace. And we'll see what those look like. I mean, when you do an ls, they sort of look like symbolic links, but they're actually some magic kernel object that you can use to reference the namespace.
So this is almost by way of a caveat, or at least it really confused me when I first started playing around with this stuff early in the year. In the abstract, based on what I've already told you, you would expect that given two separate mount namespaces, if you mounted something in one, you would not see it in the other. And if you mounted something in the other, you would not see it in the one. And that can be true, but in general it may not be. Evidently, that conceptually correct behavior was painful for some of the kinds of applications people were trying to build with mount namespaces, and may still be. So this whole idea of shared, private, and slave mounts came about.
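The /proc handles described a moment ago can be sketched in Python as follows. This is only an illustration: the function names are made up for the example, the `kind:[number]` form of the link targets is how an `ls -l` of the ns directory renders them on the kernels I am aware of, and `join_namespace` assumes the C library exposes `setns()`. Reading another process's entries, and joining its namespaces, generally requires root.

```python
import ctypes
import os
import re

# Link targets in /proc/<pid>/ns look like "uts:[4026531838]".
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'uts:[4026531838]' into (kind, id)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def read_namespaces(pid="self"):
    """Map each entry in /proc/<pid>/ns to its namespace identifier."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _kind, ident = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = ident
    return result


def differing(ours, theirs):
    """List the namespace names whose identifiers differ between two maps."""
    return sorted(k for k in ours.keys() & theirs.keys() if ours[k] != theirs[k])


def join_namespace(pid, name):
    """Open /proc/<pid>/ns/<name> and pass the descriptor to setns()."""
    fd = os.open(f"/proc/{pid}/ns/{name}", os.O_RDONLY)
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        # A second argument of 0 lets the descriptor refer to any namespace type.
        if libc.setns(fd, 0) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)
```

Comparing `read_namespaces()` for a shell with `read_namespaces(pid)` for a process inside a container, via `differing`, would be one way to see which namespaces the two do not share.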
And so if a particular file system hierarchy is mounted shared, or has the shared flag, then child namespaces see what gets mounted in it, and parent namespaces see what gets mounted in it. If it's private, then neither of those things is true. And then the third is slave, where child namespaces see it and parent namespaces do not. And you can change the behavior using these mount commands that I have here. And you can use the recursive variant to change that for all mount points underneath. It's basically a question of where further mounts will actually show up. And a lot of, well, maybe not a lot, but certainly Fedora defaults to shared mounts. So if you start noodling around with this stuff like I did, and you mount some things in one place and don't expect them to show up in the other, you will be very surprised. So you'll see when I do the demo that I set them all to private so that that doesn't happen.
Oh wait, you should wait for the mic.
So for shared, it's really just children and parents? It's not all namespaces on the system, it's just the children and the parents of the current one?
If they are shared, all namespaces in the system see it. And, yeah, if it's slave, only the children will see it.
So now, assuming all goes well, I'm going to show you a little demo of how this all works. So first off, is this font big enough for people? Can everybody see that? Is that reasonable? Okay. So these are just regular shells. I have two shells here. Just me, nothing up my sleeves. And I'm going to log them in as root, because messing with namespaces requires root privileges. So okay, first off, let's start by making the mount space private. So that recursively sets the mount space private for the entire hierarchy on the system. And then we'll start.
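For reference, the "make everything private, recursively" step corresponds to `mount --make-rprivate /` on the command line. Underneath, that is a mount(2) call with propagation flags, which can be sketched in Python as below. The flag values come from the Linux headers; the helper names and the `ctypes` plumbing are assumptions made for this illustration, and the call itself needs root.

```python
import ctypes
import os

# Mount flag values as defined in the Linux <sys/mount.h> header.
MS_REC = 0x4000        # apply to the whole subtree
MS_PRIVATE = 1 << 18   # mount events are not propagated either way
MS_SLAVE = 1 << 19     # mount events are received but not sent back
MS_SHARED = 1 << 20    # mount events are propagated both ways

PROPAGATION = {"private": MS_PRIVATE, "slave": MS_SLAVE, "shared": MS_SHARED}


def propagation_flags(mode, recursive=False):
    """Build the mount(2) flags for a propagation change."""
    flags = PROPAGATION[mode]
    if recursive:
        flags |= MS_REC
    return flags


def set_propagation(path, mode, recursive=False):
    """Change the propagation type of the mount at path (hypothetical helper).

    For a propagation change, mount(2) ignores the source, file system
    type, and data arguments, so placeholders are passed for those.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]
    rc = libc.mount(b"none", os.fsencode(path), None,
                    propagation_flags(mode, recursive), None)
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

Under these assumptions, `set_propagation("/", "private", recursive=True)` would be the programmatic counterpart of the first command in the demo.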
So there I have told systemd-nspawn that I wanted to use /srv/f19 as the root file system for the set of namespaces that it's going to create. And we'll see in a second which namespaces it created. And it runs this bash process as the init within that set of namespaces. So if we look here, all we have is just the bash init process, and then the ps that I just ran. That's within the namespace. And if we go outside of the namespace and we look, we'll see this bash here, 2402, is actually the one that it reported right over here as the init process in that container. But if we, let's say, wanted to try and kill the systemd-nspawn from within the namespace, there is no such process, 2401. It exists, but it's not mapped to a process ID within the namespace.
Okay, so if we look at the namespaces for that process, and this is the init process in the namespace, you can see the different namespaces that it is associated with. And if we look at the same information for something outside of the namespace, we can see that we have some different values. The IPC namespace is different between them. The mount namespace is different. The net namespace is the same. The PID namespace is different. And the user namespace is the same, and the UTS is different. So my laptop has its own hostname, and the hostname over here is f19.
And then if we just touch a file in the temporary file system... so now I guess I should have shown it wasn't there, but it doesn't really matter, I don't think. So anyway, it's there. And we can actually see it on this side, because the same file is there. But if we bind mount the password file on that, we see that the password file is bind mounted on that particular file. But we come over here: still empty. Append something to the file.
They're not related. They're in separate namespaces. And the mount propagation was set up such that the mount that was done in the root namespace won't show up as part of the child namespace.
So let's talk about a few examples of the kinds of things you can do. Once you start thinking about this, you can come up with lots of different ideas. And when you start adding in user namespaces or network namespaces, which I talk about here, there are additional ideas that you can come up with for ways to further isolate your processes, so that if they are compromised, the amount of damage is limited or eliminated, depending on the situation.
So say you have some sort of update checker that checks for updates, but you're concerned, perhaps, that the other end might be compromised, or that your DNS cache gets poisoned sometimes and you go off and actually check for updates at an attacker-controlled site. Maybe you would set up a mount namespace to run that update checker, which allows it the access it needs to the libraries to run, and it has its own private /tmp. And then if it finds that there actually needs to be an update applied, it communicates that to the system some other way, via some mechanism.
Another might be to run multiple instances of a web application in separate PID namespaces. In that case, if that web application got compromised in some fashion, whatever program or shell the attacker was running would not have access to the process IDs of the other processes that might be of interest, so that the ptrace thing that I talked about earlier wouldn't work. You could combine those two things and have a separate mount namespace and PID namespace for a web application, so that if it gets compromised, it can't actually access any of the files elsewhere in the file system.
With network namespaces, you could set up a separate network namespace to run an httpd web server worker process. That network namespace has no networking, essentially, and you pass it a file descriptor with the connection from the client. So it can still read and write information to that client, but it can do no other networking. So if it gets compromised, it has no access to the network. You could also separate your network namespaces in such a way that you have some for local network access and some for internet access. And so for any internet client, let's say someone runs something that talks to the internet and gets a bad JPEG or whatever, the compromise that results from that can't access your internal network, because it doesn't have any of the networking devices and configuration to do so. And you can keep going from there. So that's all I have today. I'm happy to take questions.
So what are the performance penalties you might incur on, like, large servers where you're trying to isolate applications? Do you take any hits in performance that are noticeable?
That's a very good question. I don't know the answer to that. That's probably something I should look into. If you are running a distro kernel, you already have those checks in there, even if you don't have namespaces enabled, right? I mean, they're very simple tests. I would think the impact should be minimal, but it's definitely something I should look into. Thanks.
So why would this be preferable to just chrooting an app?
I'm sorry?
Why would this be preferable to just chrooting an application?
Well, chroot has its own set of problems, right? I mean, chroot is fairly easy to escape if you become root on the box. And chroot does not do things like separate PIDs or separate network devices. It's only focused on the file system.
So if that's the only piece that you think you need to protect, then, well, it really depends on your application, of course. But namespaces are a more foolproof chroot in the file system area, and then they add all these other kinds of capabilities. Does that make sense?
I could just point out that if you're chrooted in and then someone manages to get root within the chroot, then they can mknod your root file system device, mount it within the chroot, and chroot into that, and now they have root on your system.
Any other questions, comments, thoughts? Well, thank you very much for your time and attention.