So, I apologize for the glasses. I'm Jake Edge. I'm from LWN.net. We like to think of it as the premier news and information source for free software and Linux developers. We certainly have a developer focus. I've written a bunch of articles from the conference already, and no doubt we'll write some more. If you haven't visited, please give us a look. That's the advertising part of the show, I guess, moving on from there. I'm here to talk about namespaces for security. I edit the security page and generally write the lead on that weekly for LWN. So, I claim to know something about security and I'm learning about namespaces. There we go. Sorry. So, what are we going to talk about today? First off, we're going to start with threats, the kinds of threats that we want to protect against. Then the kinds of effects that those threats can cause, and the defenses that are typically used to thwart or reduce the effectiveness of those attacks. And then we'll get into namespaces: what types of namespaces there are, how they're created, a little bit about how they're used, and some examples. That's kind of an overview of where we're going today. So, this is the Embedded Linux Conference, so we'll try to focus a little bit on embedded devices. One of the big problems for an embedded device maker is a mass attack, when some attacker out there figures out some sort of vulnerability in, let's say, a home router that an ISP has put 100,000 of into people's homes, and they own some large percentage of those routers. The device maker gets a black eye and the ISP gets a black eye, and not only that, but your customers have now been violated, essentially, so it's not just a PR problem, it's a real problem. And that normally happens through some sort of network-facing service. Web servers are the most likely one, but there are other network services that could be affected.
Or sometimes it's a network client. Maybe the device has some sort of update client that it runs that goes out and talks to a service to see if there's an update available for the device. So, if an attacker can interfere with that traffic in some way, they may be able to do something bad to the device. DNS cache poisoning is kind of in that same area. If I can change your DNS cache so that it holds the wrong mapping from a particular name, an important name, to an address, I can do a variety of different nasty things. Web applications, of course, are the poster child for the kinds of threats that are faced, and more and more devices have web interfaces of various sorts, either for configuration or monitoring, or maybe that's part of the function of the device. And cross-site attacks are among the more common attacks on web applications that you hear about: cross-site scripting, cross-site request forgery, those kinds of things. So those are the kinds of threats that a device faces. And what is it that an attacker is going after when a device is attacked? If an attacker is targeting a particular individual, all bets are off. That's a very hard scenario to handle. But the more common situation that we see is where a group of attackers is doing a mass attack on a large number of devices or systems through a single kind of vulnerability, or a multi-vulnerability cocktail, I suppose. So what are they trying to do? What's the end goal? The end goal really is to compromise the system in some way, and that typically means compromising a service, and then that means that the attacker can run code with the abilities that the user ID associated with that service has. And that user ID doesn't need to be any kind of privileged user.
Most regular users on a Linux system have lots of abilities that an attacker might want to use: the ability to use the network, make connections on the network, set up servers to accept connections on the network as long as they use a high-numbered port, and look at the file systems, where there are various kinds of information. Other processes are out there, and those processes may in fact contain interesting information. So if the user that gets compromised has network access, then the attacker could use it to send spam, to do a distributed denial of service, or to make the system join a botnet for other kinds of things. With file system access, there could be confidential information or configuration information, or maybe the system itself is a storage server of some kind, and so its whole functionality is to make available all of this information, which is maybe the target of the attacker. And if you can access processes, those processes may, let's say, contain credentials from a user on the other end, username and password kinds of things, or keys. Processes often have useful information in their memory that an attacker might want to get access to. So if they can access the processes, that's the kind of thing they might be able to do. They can also send a signal with kill, or just kill a process and cause a denial of service. The other thing to consider is local privilege escalation flaws. Often the kernel has some sort of local privilege escalation where a local user can turn their privileges into those of root. That is often considered to be a lesser severity of vulnerability because many people run systems without any untrusted users, and if you don't have any untrusted local users, a privilege escalation is sort of a non-event. But you do often have a lot of untrusted local users in the sense of the services that you run.
So if you're running some web app using the apache user, or sshd or something, and an attacker can compromise those services, those users become the untrusted users, and a privilege escalation then can lead to a root compromise. If I could hit the right button. There we go. So Linux has a huge number of different security technologies that can be used to try to thwart attacks or to reduce the impact of attacks when they do happen, starting with the Unix permission model: users and groups and the ability to not allow certain users to access certain sets of files. That's called discretionary access control, because users can decide what permissions to put on their files. That's unlike mandatory access control, which is things like SELinux and Smack, where the administrator gets to decide what the permissions on files are. That's sort of another step up in terms of protection. The knock on SELinux is that it's very complicated. It's also, I think, fairly large. It's typically not used in embedded environments, at least resource-constrained ones, although some folks have done that. There are hacked-down, smaller versions of SELinux, and there's SE Android, which runs on Android systems. Tizen uses Smack, which is another kernel mandatory access control system, as a large part of its security model. Then you have capabilities. Capabilities are an attempt to take all of the different things that root can do and split them into a fine-grained set of capabilities, and then you can apply those capabilities to a particular program selectively so that it only has the pieces of root's privileges that it needs to do its job. In theory, that's all really cool. In practice, it's a little dodgy. It suffers from lack of attention, I think. There's really no capabilities maintainer in the kernel. The capabilities were added haphazardly. They were assigned haphazardly.
There are ones like CAP_SYS_ADMIN, which is this huge thing that covers a whole raft of different kinds of abilities. And then there are some that are very focused. There are also quite a number of them that can be leveraged to get root privileges. So using them as a restrictor so that you can't get all of root doesn't really work. Brad Spengler did a sort of study of, I don't know, a dozen or so of them that can lead to root. The seccomp sandbox is something that's relatively new. It's been added in the last year or so, and it allows a process to limit what a child process can do in terms of the system calls that it can make and the arguments that it can pass to those system calls. It was originally done for the Chrome browser, at least that was the reason that the person who wrote the code was interested in it, but a lot of other things are starting to use it as well. The idea for Chrome was that the renderer could be spawned off with only the set of system calls that it needed to do its job, and then if there was a JPEG vulnerability, the renderer would be very limited in what it could do if it got hit by that vulnerability. There are others; Linux certainly does not suffer from a dearth of security systems. So namespaces aren't new by any means. They go quite a ways back. Al Viro added what we now call the mount namespace as just "the namespace" quite a ways back, and at that time at least he was not thinking that there would be other namespaces, but since that time there have been another five added, I guess. And with the recent advent of the user namespace, large chunks of which were just merged for 3.8, which was released, what, last week, namespaces at least conceptually at some level are complete. There's still certainly work to be done for namespaces, but the conceptual whole is there, or at least visible on the horizon.
What they are is a way to partition global resources within the kernel so that different sets of processes have different views of these global resources. So you could imagine a group of processes that sees process IDs that are different from those of a different group of processes, let's say. The process IDs that this group sees are a subset of the process IDs that that group sees, and the numbers have no correspondence between them. Or you could imagine that a set of processes has a certain view of the network, of what network devices there are, that's entirely different from and doesn't correspond to the network devices that another group of processes sees. The same goes for the file system tree and others. So it's a way to take the global resources of the kernel and split them into separate groups that only the processes that are members of a particular namespace can see. It was done, or at least from the outside it seems to be, with a focus on containers: the idea of a lightweight virtualization where you have a set of processes that sees a view of the system that makes it appear that they are the only ones running on that system. As far as they can tell, they are in their own system, but what's really happening is that they're running on the same kernel, possibly with many others of these same containers. It's a lightweight virtualization, as opposed to hardware virtualization where you have virtual machines and you have separate kernels for each one and separate file systems and so on. You can share the resources of the same kernel and some of the resources of the file systems, and it's a much lighter-weight solution. There are a number of reasons you might want to do that, certainly testing and debugging.
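As a rough illustration of the idea that every process belongs to a set of namespaces, here is a minimal Python sketch that lists the namespace identifiers of a process. It assumes a Linux system where the /proc/PID/ns directory (the interface described later in the talk) is available; the function name namespace_ids is made up for this example.

```python
import os


def namespace_ids(pid="self"):
    """Return {namespace_type: identifier} read from /proc/<pid>/ns.

    Returns an empty dict if the directory does not exist (for example
    on a non-Linux system or for a PID that is not running).
    """
    ns_dir = f"/proc/{pid}/ns"
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return {}

    ids = {}
    for name in names:
        try:
            # Each entry is a symlink whose target names the namespace.
            ids[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Not readable (permissions) or not a symlink on this kernel.
            ids[name] = None
    return ids


if __name__ == "__main__":
    for name, ident in namespace_ids().items():
        print(f"{name:8} {ident}")
```

Two processes that report the same identifier for an entry share that namespace; different identifiers mean they have different views of that resource.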
You can set up a little environment on your Fedora system that is, let's say (and I'll show an example of this later), a Debian sid system. You can go in there, and inside of that namespace it acts just like a Debian sid system. It uses the executables from Debian sid and so on, but it's running on a Fedora kernel in a normal Fedora user space. There are also security uses for namespaces; from my description of the separation earlier you could perhaps even see where I was headed with that. So there are six different types of namespaces currently available, if you're running a 3.8 kernel anyway. The first that was added was the mount namespace. In fact, the flag for setting up a mount namespace is just CLONE_NEWNS, "new namespace"; as I said, Al Viro was not thinking about other namespaces at that time. Probably the simplest namespace is the UTS namespace. That name comes from Unix timesharing; there's the struct utsname where a bunch of names are stored, the ones that uname uses, for instance. That namespace allows you to change the host and domain name inside the namespace so that it's different from the host and domain name outside the namespace. You can see that in a container situation you might want to have 20 containers, each with its own host name and domain name, well, perhaps not domain name, but perhaps also. And so for the processes running in that namespace, as far as they can tell, they're running on a host with an entirely different name than the processes that are running in the root namespace. You have a PID namespace for processes, and in that case the PIDs inside of the namespace bear no relationship to the PIDs outside of the namespace. Now, you have the root namespace, which has all of the processes in the system, that's normal Linux, and it may have a PID for a particular process that inside the namespace has an entirely different number. So if you had a container, it would have its own PID space, totally separate from the host,
if you will. The PIDs that are in the namespace can only be used by system calls and processes in that namespace. If you use a PID that's outside of that namespace, if you somehow magically got a PID number from the host system, the system call would fail; there is no PID of that number. The IPC namespace is for the message queues and semaphores and shared memory stuff. The network namespace is for separating networking between these different namespaces, so that the view of the network inside the namespace is different from the view of the network in another namespace or outside of that namespace. And then the user namespace does the same thing with UIDs and GIDs, so you have an entirely separate set of UIDs and GIDs inside the namespace and outside the namespace. Part of the reason behind user namespaces existing is the idea that regular users will be able to create namespaces. Inside of a namespace there will be a root user, and inside of the namespace it will have UID 0. Outside of the namespace that will have a different UID; let's say it will have my UID, because I created the namespace. So outside of the namespace it has no privileges other than the ones that I have, but inside it's the root user. If I can create a user namespace in which I am root, then I can create other namespaces of other types underneath that, which means that a regular user can create namespaces, rather than today's situation without user namespaces, where you have to be root to create namespaces. There's of course a big issue here: if the root inside the namespace has privileges outside the namespace, it's a very easy privilege escalation, if I can create myself a root user, essentially. So the work that has been done over the last year or so, well, I'm sure more than that, but visibly over the last year or so, is to ensure that the root user inside a namespace does not have any extra privileges outside the namespace.
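A minimal sketch of the inside/outside UID relationship just described, assuming the three-column mapping format (first ID inside the namespace, first ID outside, length of the range) that the kernel exposes in /proc/PID/uid_map. The UID 1000 and the function names are hypothetical, chosen only for illustration.

```python
def parse_uid_map(text):
    """Parse uid_map-style text into (inside_start, outside_start, count) tuples."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, count = (int(field) for field in fields)
        ranges.append((inside, outside, count))
    return ranges


def outside_uid(inside_uid, ranges):
    """Translate a UID seen inside the namespace to the UID seen outside.

    Returns None when the UID is not covered by any mapped range.
    """
    for inside, outside, count in ranges:
        if inside <= inside_uid < inside + count:
            return outside + (inside_uid - inside)
    return None


if __name__ == "__main__":
    # Hypothetical mapping: root inside the namespace is UID 1000 outside.
    mapping = parse_uid_map("0 1000 1\n")
    print(outside_uid(0, mapping))  # root inside maps to the creator's UID
    print(outside_uid(1, mapping))  # unmapped, so None
```

With a one-entry mapping like this, UID 0 inside carries only the privileges of the mapped outside UID as far as the rest of the system is concerned, which is the property the talk describes.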
And in fact, as we'll see a little bit later, that job is not quite done, so you can't build a kernel with network file systems and turn on user namespaces at this point. If you want to build a kernel with namespaces, if your distro doesn't come with them or if you're building for some device and you haven't turned them on, it's easy to do. It's in the "Namespaces support" submenu, I guess, under "General setup". There's a half a dozen different config options, and right now, as I said, CONFIG_USER_NS depends on the network file systems being turned off, because that code has not yet been converted to do the right thing for the user ID changes. The patches to do that are in flight now, and I assume they'll probably land in 3.9, and then we'll go merrily on our way. So to create a namespace, the main entry point is clone(), which is sort of the underpinning of fork(), I guess. It's a lower-level system call that creates a new process, and you can pass flags to it to control its behavior. Some of the flags that you can pass are these CLONE_NEW* types of flags. Those can be ORed together so that you can get multiple new namespace types in a single call. So you could call clone() with CLONE_NEWNS and CLONE_NEWNET and get a new mount namespace and a new network namespace that are associated with the process that it creates. unshare() is a way to create a new namespace without creating a new process, so essentially you are saying "put me in a new namespace" of whatever type you specify with the clone flags. setns() then is a way to join an existing namespace. The trick there, and we'll talk about it in a second, is how you specify which namespace it is that you want to join, but setns() says "take this process and put it in this namespace". systemd-nspawn is something that comes with systemd, the uber-popular init program that nobody has any opinions about, but in any case it doesn't use systemd the init process
particularly. It can, but it's an interesting program, and we'll see it here in a minute. It can set up a container for playing around in very easily. It's also a nicely easy-to-read piece of code that uses things like namespaces and bind mounts and some of the kinds of techniques that you might want to use if you are going to use namespaces for security and for partitioning on a system. So it's worth looking at even if you hate systemd; the code is there, and it's interesting stuff. So here's my attempt at a diagram to give you the idea of what these namespaces look like. You have the root namespace: when you boot your machine and you log in, or even if you don't log in, there's a single namespace that's running, that's the root namespace, and it's there for all of the different namespace types. What we're showing here is that we've created a child namespace, let's say with a clone(), and we did a CLONE_NEWPID and a CLONE_NEWNS, so we got a new PID namespace and a new mount namespace. When we did that, the new process that was created was 238 in the root namespace, and that became 1, i.e.
init, in the child namespace. Then sometime later somebody did a ps ax in the child namespace, which was PID 12 there, which corresponds to PID 249 in the root namespace. So those processes are the same, but inside the child, if you referred to PID 249 you would get nowhere, because that PID doesn't exist; you would have to refer to PID 12 or PID 1. And similarly, we've got this mount point /srv/sid that is mapped to / inside the child namespace, and so the child namespace can't see /srv/sid, it can only see /. And then systemd-nspawn mounts some things like /proc and a private /tmp and a few other things like that, which are completely contained within the child namespace and can't be seen in the root namespace. That's the level of separation. So earlier, when I talked about setns(), I said you needed a way to have a reference to a particular namespace. Well, all processes have a /proc/PID/ns directory, and under that directory, depending on which of these namespaces you've actually configured into your kernel, it will have entries for mnt, pid, uts, ipc, net and user. Those are magic kernel files, essentially, that can be used to reference the namespace. So, and we'll see an example of this later, if you need the mount namespace of a given process, you can open /proc/PID/ns/mnt for that PID, and that will give you a reference to that mount namespace. You can pass that fd to setns() if you want to put yourself into that mount namespace. So that's what those are for. Mount namespaces can be confusing, depending on how your distribution is set up. Many distributions by default do what's called mount namespace propagation. You probably would expect, given what we've said so far, that if you have two mount namespaces and you mount something in one, you don't see it in the other, and that would be the sort of
the normal situation. But because of the way mount namespaces are often used, many distributions, or I don't know about many, some distributions will by default make it so that those mounts are visible outside of the namespace. Basically they'll propagate the mount into other namespaces that refer to the same mount point. You can change that behavior. There are sort of three different modes. There's private, which is what you would think of as the proper way to do it, given what mount namespaces are supposed to be, in that if I have a mount point here and it's in these two namespaces and I mount something on it here, it's not seen there, and vice versa. That's private. Shared is the reverse of that: you always see whatever happens. And then there's this idea of slave: if the mount point is marked as a slave, it will see any changes that the parent namespace makes, or grandparent or whatever, but whatever it does will not propagate to the parent or grandparent. And there are recursive versions of those types. So basically you can change how the mount points are treated, and it can be very confusing, trust me on this, if you are playing around with mount namespaces and you didn't realize that your distribution does this to you. You mount something inside and you expect that it's not going to be seen outside, and if the thing that you mount is /proc it makes it even more interesting. So anyway, be aware of the fact that some distributions will do that. In this case systemd on Fedora is doing that by default; it's making all mount points shared. It's easy to turn that off, but you have to know that it's happening in order to do that. So let's go ahead, if I can manage not to mess this up too terribly, and look at this. I have a directory here called /srv that has two subdirectories, which are maybe exactly what you'd expect them to be. I used yum in the Rawhide case and debootstrap in the sid case
to grab the rudiments of a Rawhide or sid system into these directories. So they look pretty similar, and these are just directories, and I'm just running on my normal system; I just have two shells here running as root. But if we use systemd-nspawn to start a container, what it calls a container, on /srv/sid, it has now created multiple namespaces for this process, the process being the shell, and mounted things in such a way that the root directory is that /srv/sid directory. So now I can run commands inside of this container, and I'm running the Debian versions of these commands. You can see that that is, sorry, yeah, that is all of the processes that are running inside of here. You can see that PID 1 is the shell. But outside of the container, if we look, you can see this nspawn here has created this bash, 5506, and we can see all of the different namespaces that that process belongs to. And if we look at this, we'll see that a command just in my regular root namespace has different identifiers for the IPC namespace, it has the same for the net namespace, different for the PID, same for the user, and different for the UTS. And if we look, I mean, my prompt sort of shows it, right? The hostname here is oozle, here it's sid. That's the different UTS namespace; that's something that systemd-nspawn sets up automatically. And so then we can do things like touch a file in the temp directory, and if we do, there's that file. And if we bind mount onto it... if I could type, I'd be dangerous, right? So now if we cat it we get the password file. Come in here... am I suffering from demoware here? I'll swear this all worked an hour ago. Of course. Well, I guess I'll describe what worked an hour ago. Very odd. So you could mount a password file, look at it inside the container, and it was still just an empty file, because I touched it over there and that's how I created it. I could echo something into it; it was there in
the container; outside of the container it was not. No idea what happened there, but anyway, that's the kind of separation that the container is supposed to provide. I wonder if my mount propagation got changed. Anyway, seems like there was... so, one other thing that can possibly screw up... sorry, let's go back in there. If we do an ifconfig, I've actually not been able to get my wireless to work very well here, but if we do an ifconfig inside the container, you see there's an eth0 and a loopback, and if I could get out, things like ping google.com would work. But if we add --private-network, no eth0. But here we still have an eth0, and we could have a wlan0 if I could actually get on the net. So the view of the network between the two is different. So let's talk about some examples of the kinds of things that you could do to use this namespace idea to separate, and either eliminate a whole class of things that an attacker or a compromised binary could do, or at least limit it severely. So if, like I was talking about earlier, we have an update checker that periodically checks with an update server, and we're concerned that somebody, a man in the middle, gets in there and in some way messes with it, and we're worried about what kinds of things it might be able to do under those circumstances, we could set it up in its own namespace. Just put the things inside the namespace that it needs to run, and mount them read-only, so it can do very little. Have a private temp that's accessible to the outside, but that's all it gets. The update checker can check for updates and put a file into that private temp, and that's all it can do. And if it gets compromised, it can check for updates and put a file in that temp, and that's all it can do. Now, it still has access to the network, because we haven't put it into its own network namespace. That's another thing that could be done: it could be put into its own network namespace, and some kind of connection could be handed down to it that it could
use to periodically check. Or the network namespace could be configured in such a way that it can only talk to a particular set of hosts, because if you put network devices into a network namespace, then each one gets its own set of iptables, and so it can have its own firewall rules. So there are a number of different things you could do there. You could run multiple instances of a web application in separate PID namespaces. Maybe this web application takes usernames and passwords from users and then does whatever it is that it does, and so an attacker wants to try and get those passwords, perhaps, and would subvert one of those processes to then ptrace the others and get information out of those processes. You could turn off ptracing, or there are a variety of things that can be done there, or put them in separate PID namespaces; then the one that gets compromised can't even see the PID to ptrace the other. You could combine those two ideas to isolate some web application even further (I just pick on CMSes and PHP because they seem to be constantly in the news with various kinds of flaws). Set up a network namespace to run httpd worker processes: you spawn off a worker process, you hand it the file descriptor of the connected socket from the client, and it can handle that one particular connection, and if that worker process gets compromised, it has no access to the network. You can have separate network namespaces. I think I sort of... Sammy mentioned this earlier. Go ahead. So, files, or you could mount them. The separate namespaces don't eliminate shared files. You can still have the same thing mounted in this namespace and that namespace, and they can see the files, but if you mount something differently... do you see the distinction I'm making? Well, if you want to share them via files. Another way, you could pass file descriptors via Unix sockets, or there's a whole variety of different techniques for doing that. Does that make
sense? You can tell that you were in a container because the name of the... that's an artifact of how systemd-nspawn set that up. There's actually another mode that you can run systemd-nspawn in, where it will run systemd and actually boot up the container; then process 1 will be init. Yes? Other questions? Going once... I think this is pretty much all I have. I haven't looked at FreeBSD jails in a long time, so anything I would say would probably not be... It is a lightweight virtualization kind of idea also, and it's specifically targeted at security applications. As I remember, it's also specifically targeted at the file system, not things like processes or network or that sort of thing, but I could be well out of date. Okay. Yeah, I think that's true. If my knowledge of jails is correct, which could be in question, then I thought that that was mostly targeted at the file system and separating the view of the file system inside of the jail from the view of the file system outside of the jail, but I could be wrong about that. If so, then this adds more capabilities and, like you say, is finer grained, if you will. Agreed. The what?
Well, yeah, I mean, it's come through a variety of different routes. There's all the LXC stuff and... yeah, okay. Well, I mean, it's kind of like linearly, yeah, they're all related, and basically this is what people were eventually able to get merged. Yes. Yes, and they can do things like copy-on-write. They can share some storage and then do copy-on-write if one of them changes it, essentially. Yeah, I'm sure, but I don't know anything about it, I'm afraid. No, they're not exactly synonyms. A container is sort of a conceptual thing; a namespace is part of the implementation that can be used to create a container. And containers often also use control groups to restrict the amount of resources that the container can get. So the container is sort of a bigger thing, and a namespace is a piece of the implementation for a container. I don't know. I think... well, I mean, that's essentially no different than two processes on a regular system trying to do weird things to the block device, right? They both have to have permissions to it, and if you do that, sort of all bets are off, right? The mount namespace is at the file system layer, so it's dealing with file systems and not block devices. I mean, obviously down below somewhere there's a block device. Yes? Well, you're talking about a user namespace, because if you're root in a non-user namespace then you're root, period. Well, that's a good question. One of the things that systemd-nspawn does is set up a device tree, that's the wrong term, a /dev directory that contains the limited things needed, so there is no /dev/mem in here. And this is not a user namespace either, so that's not really a helpful answer to your question. Well, as it turns out, systemd-nspawn drops the capability for mknod, so you can't mknod, I forget the syntax, but what is it, name b 234 256. So, you know, the regular root has lots of
abilities to probably circumvent much of what namespaces try to separate. Regular root is maybe not all-powerful anymore, but it has all sorts of tricks up its sleeve. So typically when you're going to roll this kind of thing out, you're going to be running it as some other user, either in a user namespace or as just some regular, less privileged user inside the namespace, you know, a regular root-namespace user but less privileged, like apache or something. I think you need Dan Walsh here. I mean, SELinux is a very complicated system. I think it's getting better in the sense that the Fedora folks and the RHEL folks are really riding herd on it to try to reduce that complexity, but I still turn it off, first thing I do when I install Fedora, and I'm arguably a security guy at some level, right? So yeah, I mean, SELinux has its place, but I'm not sure the embedded space is where that place is. It's pretty resource intensive, for one thing, and it's complicated. But on the flip side of that, SELinux does very well in non-dynamic environments where things aren't changing, so you don't have to relabel and you don't have to think of all the different possibilities of who wants to access what. If it has a fixed set of functions, then SELinux might be right. Does that sort of answer your question? Right, this kind of stems from people wanting to host multiple websites on the same kind of box, sort of. Although one of the things that the folks at Red Hat are doing, they have this OpenShift, they have so many Open things, but yeah, OpenShift, I think, where you can just sign up on their website and you get a container, essentially, with an Apache web server and some amount of, you know, Django or Rails or whatever installed, and you can go build a web app in this OpenShift thing. That's done using a combination of containers and SELinux, and these days it's actually been plumbed into systemd so that
when that connection comes in and the user authenticates, then systemd spawns this container and switches everything down to it, and off things go. Other questions? Yes? A network device that gets hot-plugged? Well, there shouldn't be any barriers to... the namespace support, right, where it will allow you to make these namespaces, the network one is targeted at network devices, so if your device got plugged in and came up as a network device, yes, it could be put into this namespace and some other thing put into a different namespace and so on. And I think I'm running out of time. Thank you.
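As a footnote to the talk's description of clone() and unshare(), here is a minimal Python sketch of how the CLONE_NEW* flags combine by bitwise OR. The numeric values are the ones defined in the kernel's linux/sched.h header; the dictionary and helper names are invented for this example, and nothing here actually creates a namespace.

```python
# Namespace flag values as defined in the kernel's <linux/sched.h>.
CLONE_NEWNS = 0x00020000    # mount namespace
CLONE_NEWUTS = 0x04000000   # host and domain name
CLONE_NEWIPC = 0x08000000   # message queues, semaphores, shared memory
CLONE_NEWUSER = 0x10000000  # UIDs and GIDs
CLONE_NEWPID = 0x20000000   # process IDs
CLONE_NEWNET = 0x40000000   # network devices and stacks

# Hypothetical short names, used only by this sketch.
NAMESPACE_FLAGS = {
    "mnt": CLONE_NEWNS,
    "uts": CLONE_NEWUTS,
    "ipc": CLONE_NEWIPC,
    "user": CLONE_NEWUSER,
    "pid": CLONE_NEWPID,
    "net": CLONE_NEWNET,
}


def combine(*kinds):
    """OR together the flags for the requested namespace kinds."""
    flags = 0
    for kind in kinds:
        flags |= NAMESPACE_FLAGS[kind]
    return flags


def describe(flags):
    """Return the sorted namespace kinds whose flag bit is set in flags."""
    return sorted(kind for kind, bit in NAMESPACE_FLAGS.items() if flags & bit)


if __name__ == "__main__":
    flags = combine("mnt", "net")
    print(hex(flags), describe(flags))
```

A value built this way is what would be passed as the flags argument to clone() or unshare() to request a new mount namespace and a new network namespace in one call.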