All right, so my name is Jorge. I work at Google. It is obviously not a coincidence that the talks were scheduled this way. I'm going to be covering a lot of stuff about namespaces. I think I'm more in James's camp: I think user namespaces are mostly OK. I'm not super worried about them, and we are not super worried about them at Google either. So I'm going to be talking about Minijail, which is a tool that we wrote at Google to do sandboxing, all sorts of sandboxing. We tried to make a tool that we could use for everything: whenever the Linux kernel grew a new sandboxing feature, we tried to add it to Minijail and make it work. We use it on Chrome OS. We use it on Android. We use it on the server side. We use it for build farms. We use it for fuzzing. We pretty much use it everywhere, which is interesting because it gives us a good idea of how people like to use sandboxing and how to make things easier for people to use. The reason we wrote Minijail was that if you run this command on the laptop that you're using right now, if you have a Linux laptop, it's probably going to look something like this. If you have a Mac OS laptop, you can run it and it works. Do people mind closing the doors? There's a lot of noise from outside. It's going to look something like this. And the red there is obviously because red is bad, root is bad. Why is root bad? Because bluetoothd is listening on the network. It's listening on the radio. So there's essentially one bug separating any random attacker, close enough to your laptop to attack its radio, from compromising bluetoothd and using its root privileges to just load a kernel module. On most Linux systems, root equals kernel module loading, and that equals kernel code execution. So we're one bug away in bluetoothd's code base, which I don't know if you guys trust. I'm not picking on bluetoothd in particular. There's a lot of red in there, so it could be anything in there. But these code bases are not necessarily super robust.
Any of those processes could be listening on the network or on a radio. And they're one bug away from essentially just fully compromising the system. And this is the Linux workstation that I use day to day. It's running a super modern kernel, but it still looks like this. And why does it look like this? Well, it's basically because we have kind of misaligned incentives. On the one hand, the admins that put together this distro don't really know what permissions the software needs, and the devs that write the software don't really know the environment in which their software is going to run. So they cannot really make assumptions about how much sandboxing or privilege dropping they can do. So it's always a little complicated to be sure of what environment you can assume, and to use that to drop privileges. So that's the result. The result is that things look like this, and you're one bug away from being completely owned, which is not cool. If there's any sort of privileged access on that machine, well, it's not completely crazy to think that I could just set up a beacon right here in this room and exploit everybody's vulnerable Bluetooth stacks. That would be super easy to do. I have not done that, just to be clear. But we are in a situation in which the person that writes the software is not the person that runs the software. So we need a way to bridge these things. And the other problem that we have, at least that we found, is that sometimes developers will try to use the privilege dropping mechanisms that are provided by the Linux kernel, but they might do it wrong. Why? Because there are so many pitfalls. In this case, somebody tried to write a switch-user function that will drop root, but what did they do? They forgot to check the result of the setuid call. If you're in security, you might say, oh, they're super noobs, they didn't know this. But it doesn't matter whether we think we are better than everybody else.
It matters that this software is out there. And we want to fix it, right? So what's the problem here? The way to exploit this is to cause the setuid call to fail. The program will still run with root privileges, and then you exploit another bug in the process, but instead of exploiting a program that's running as a non-root user, you are exploiting a program that's running as root. The right way to do this is, if setuid fails, you abort the program, right? That is the only safe way to do it. This process also tries to set up capabilities, and it mostly succeeds, but I have omitted the 15-plus lines that you need to set capabilities using almost any interface available today, because it's just so tedious. And it's more likely that you're going to introduce a bug when it takes 15-plus lines to set up capabilities than if it just takes three. And this is what we did with Minijail. Minijail is about not reinventing the wheel. We don't want every single developer writing Chrome OS system software or Android system software to have to understand all the intricacies of dropping privileges on the Linux kernel. We want them to have a single library or executable that they can use to just do the right thing easily. The easier it is for developers to do the right thing, the more often they will do it. So by using Minijail, we turn the 15-plus lines of setting capabilities into like one, or three because of formatting. And Minijail is unit tested and integration tested. It's never going to forget to check the result of a setuid call, right? So this is good. This means that we have one single library that holds the privilege-dropping code, which is obviously a security-critical piece of code. That's good. We can unit test it. We can integration test it. We can make sure that it always works.
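The rule stated above, abort if the privilege drop fails, can be sketched in a few lines. This is an illustration in Python, not Minijail's code: `drop_root` and its `osmod` parameter (which exists only so the function can be exercised without real root) are hypothetical names. Note that Python's `os.setuid` raises `OSError` instead of returning an error code, so the unchecked-return bug from the slide is a C hazard; the sketch simply makes the abort explicit.

```python
import os
import sys


def drop_root(uid, gid, osmod=os):
    """Permanently switch to uid/gid, aborting the program if any step fails."""
    try:
        osmod.setgroups([])  # shed root's supplementary groups first
        osmod.setgid(gid)    # group before user: after setuid we may lack the right
        osmod.setuid(uid)
    except OSError as err:
        # Never keep running with root privileges after a failed drop.
        sys.exit(f"cannot drop privileges: {err}")
    # Confirm that the drop really took effect before continuing.
    if osmod.getuid() != uid or osmod.geteuid() != uid:
        sys.exit("still running with the wrong UID after setuid")
```

A caller would invoke `drop_root(uid, gid)` once, early, before touching any untrusted input.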
But eventually, I guess some people on the team realized that we were maybe 85% of the way to building real containers, the kinds of containers that people have been talking about today, that Robert was talking about earlier. So we took Minijail all the way. And Minijail is essentially underlying this, which many people might not recognize. This is an Android app. And this is Chrome OS. So Minijail is essentially underlying this new technology that Google added to Chrome OS, which allows you to run Android applications natively, with no emulation or virtualization or anything. There's just an Android system running inside a container. Things are plumbed in and out correctly so that essentially this works. You can click. You can play Plants vs. Zombies on Chrome OS without breaking most of the security guarantees of Chrome OS, like verified boot. So how do we do it? Well, we built this thing called Minijail. It's just a sandboxing-slash-containment helper that we use on Android, on Chrome OS, and on Brillo, which is an Android-related project that Google is building for IoT, for the internet of things. And it essentially allows you to do many of the things that people have been talking about today in a way that we think makes it easier for non-security people to use correctly. So the first thing that you want to do is kill all the red lines that I had on my slide before. We don't want to run things as root. Most things don't actually need to run as root. So why do we do it? Well, let's not do that. I can run the id program, very creatively. If I run it under Minijail, I can drop UID and group ID and just run it as my normal user. And your reaction at this point should be to not be impressed, because sudo can do this. sudo can do this very, very easily. And it's OK. You should tell me that. But my reply is going to be: fine.
If sudo can do that, why does my Gubuntu workstation in 2016 look like this? There is an answer for that. The answer is that it's not as easy as saying, oh, I'm just going to not run as root. Because if you try to not run bluetoothd as root, it will complain. Why? Because it needs a certain amount of root permission to set up your Bluetooth interface. And that is where capabilities come in. Capabilities are a way to partition the permissions that are usually held by root, so that you can grant specific subsets of that functionality directly to a process without granting that process all of root's power. For example, and I'm going to keep picking on bluetoothd: bluetoothd needs permission to configure a network interface. But that shouldn't give it permission to, for example, reboot the system or mount things. So what we're doing here is an example, but it's actually using the same capabilities mask as bluetoothd. That capabilities mask essentially means: give this process the permission to administer network interfaces and to open raw sockets. It's CAP_NET_ADMIN and CAP_NET_RAW. And as we can see, if we run cat under Minijail in this case and we look at what its capabilities mask looks like, well, it looks exactly like what we set up. And now when we combine capabilities with UID changing, or root dropping, we end up in a situation in which our system, or in this case a Chrome OS system, looks like this. Everything that's exposed to the network is not running as root. Now, it's not completely deprivileged, because most of these things, dhcpcd and wpa_supplicant and bluetoothd, do need a small amount of permissions to set up the interfaces that they control. But it's one thing to be able to reconfigure an interface.
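The mask mentioned above can be worked out by hand. The following Python sketch builds it from the two capability numbers (12 and 13, as defined in `<linux/capability.h>`) and parses the `CapEff:` line that `/proc/<pid>/status` exposes; `cap_mask` and `parse_cap_eff` are names made up for this illustration.

```python
# Capability bit numbers from <linux/capability.h>.
CAP_NET_ADMIN = 12  # administer network interfaces
CAP_NET_RAW = 13    # open raw sockets


def cap_mask(*caps):
    """Combine capability numbers into the bitmask form used in /proc status."""
    mask = 0
    for cap in caps:
        mask |= 1 << cap
    return mask


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


mask = cap_mask(CAP_NET_ADMIN, CAP_NET_RAW)
print(hex(mask))  # 0x3000
```

On a Linux system, `parse_cap_eff(open("/proc/self/status").read())` returns the mask of the current process, which is what the cat example in the talk displays.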
And it's a significantly different thing to be able to remount a file system, or to reboot the system, or to do any of the other things that a full root user is able to do. So Minijail allowed us in Chrome OS to essentially take a system in which most things were running as root and turn it into a system in which most things are not running as root, and to try to follow the principle of least privilege, so that things only get access to what they need. Yes, yes, but most things don't, unfortunately. They would also have to drop things from their bounding set, too. But yes, if I go back to my ps output, if you list the allowed capabilities for most of those things, they're not dropping anything, essentially. So we kind of want to get to this situation in which things are not running with privilege and cannot gain extra privilege. But this is not really all that we can do, or all that we want to do, because even though these programs don't have access to root functionality, they still mostly have access to the entirety of the kernel API. And the kernel is a really big piece of software, and it has a lot of lines of code and a lot of bugs. As Kees was mentioning this morning, there are always going to be bugs. So even though any of these programs might not be able to directly, for example, mount a new file system or remount a file system, because they no longer have root permissions, they might still try to exploit kernel bugs, because they have full access to the kernel API. But at the same time, if these programs were never expected to mount anything, well, they probably should not have access to the mount system call at all. And some people might know where I'm going with this, and where I'm going with this is seccomp. So the next logical step, after you've made sure that the kernel will not grant privileged functionality to these processes, is to also remove access to the system calls that these processes never need.
And the reason to do that is this: even though, if a program running without root privileges tries to call mount on something, mount will eventually say, no, actually, you don't have CAP_SYS_ADMIN, so I'm not going to allow you to mount anything, there's a lot of code that runs before the kernel actually returns anything to the user. And every single line of code that runs at that point is one opportunity for a bug to allow you to do stuff with the kernel that you were not allowed to do. So seccomp is a way to essentially give the kernel a decision tree that tells it which syscalls should be allowed for this process and which syscalls should not. There's a line in /proc/self/status that shows whether seccomp is enabled. In this case, we can use Minijail to set up a policy, in this case on cat, that essentially looks something like this. The policy language that we have in Minijail is somewhat ftrace-inspired, because of historical artifacts. But we only need nine system calls for cat to work. And I've lost count, but there are probably something like 350 system calls right now in the Linux kernel. So we don't need to expose 350 minus nine system calls to a program that doesn't need them. Seccomp runs on syscall entry. So there's literally no syscall-specific code that gets reached by a possibly untrusted or malicious program if the seccomp decision tree doesn't allow it. So by combining non-root running plus capabilities plus seccomp, we end up in a situation in which the process really cannot do anything besides what it was actually meant to do. And cat works. Those nine system calls are essentially everything it needs. Fine, cat is a very simple program. But even the seccomp policy for the Chrome renderer, which is probably one of the most complex pieces of software that you can put inside seccomp, probably shaves off more than 50% of the syscalls that are available to any Linux process.
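To make the allow-list idea concrete, here is a small Python sketch that reads a policy in the `syscall: 1` style of Minijail's policy language and answers whether a given syscall would be permitted. It is a toy model only: it handles just the unconditional `name: 1` form, treating `#` lines as comments is an assumption of this sketch, and the syscall names in `EXAMPLE_POLICY` are illustrative, not the nine from the slide. Real filtering is done by the kernel through a seccomp BPF program, not by user-space string matching.

```python
EXAMPLE_POLICY = """\
# Illustrative allow-list, not the policy shown in the talk.
read: 1
write: 1
openat: 1
close: 1
fstat: 1
exit_group: 1
"""


def parse_policy(text):
    """Return the set of syscall names that the policy allows unconditionally."""
    allowed = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, rule = line.partition(":")
        if not sep or rule.strip() != "1":
            raise ValueError(f"unsupported policy line: {raw!r}")
        allowed.add(name.strip())
    return allowed


def is_allowed(allowed, syscall):
    """Default-deny: anything not explicitly listed is refused."""
    return syscall in allowed


policy = parse_policy(EXAMPLE_POLICY)
print(is_allowed(policy, "read"))   # True
print(is_allowed(policy, "mount"))  # False
```

The default-deny shape is the point: `mount` is refused not because of a permission check deep inside the kernel, but because it never appears on the list.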
So there's a lot of stuff there that code that's either potentially malicious or dealing with untrusted input definitely doesn't need. And by making it accessible for non-security developers, we actually got people from outside our security team writing the policies themselves, which to us, given the size of the security team versus the size of the full engineering team, was the only way we could realistically reach a situation in which most of the software in the system was running with reduced privilege. Now, the way this works, and the reason why this looks as tight as it looks, is that when we sandbox dynamically linked programs, we actually apply most of the sandboxing techniques in the process that's supposed to be sandboxed. Minijail interposes a very small loader for dynamically linked programs using the LD_PRELOAD environment variable. And this loader will intercept the libc startup function, which is __libc_start_main. And it will use that to find the location of the real main function, which is over there. And it will essentially enter the jail right before calling the real main function. And there are two important reasons for this. Reason number one is that libc will do a lot of stuff when preparing the runtime for a normal Linux binary, and we don't necessarily want to have to allow all those things in our policy. There's a lot of stuff that libc will do that the actual program, once it's executing its own code, won't need to do. So by entering the jail, by applying sandboxing right here, we don't need to put into the syscall filtering policy a bunch of syscalls that libc will use but the program won't. Now, if you craft a malicious ELF executable, there will be code executed by that ELF executable before any of this happens.
Our view with Minijail, on Chrome OS in general and on Android, is that we're usually more worried about trusted programs, that is, programs compiled by us, dealing with untrusted input than about completely untrusted programs that are given to us in binary form. So this works. And the fact that we can apply sandboxing right before main, and not right after fork, means that we can exclude from our allowed policy a bunch of things that libc needs but the program doesn't. There's another reason why we need to do this preloading trick. And the reason is this: this is how capabilities are inherited over execve. All the P's are process capabilities and the F's are file system capabilities. Now, if you don't have file system capabilities enabled, all the F's are going to be 0. It's very easy to see that if all the F's are 0, then the new permitted and effective sets are going to be 0. Because there's an OR with two ANDs at the top. If the two things in red are 0, then the two ANDs are 0 and the OR is 0. If F effective is 0, obviously the new effective set will be 0. And inheritable doesn't matter for permitted or effective unless those F's are nonzero. So we're essentially left with a situation in which the only way for capabilities to survive an execve, which is kind of what you want, because you want a loader or a launcher that launches you into a sandboxed container, is to use file system capabilities. But file system capabilities are tricky, because they allow programs to gain new capabilities at runtime. And you might want to have a system in which there's no way for anything in there to ever gain new capabilities. You might want that because it's easier to reason about a system in which processes can only shed capabilities but never gain them. That makes it a lot easier to reason about what any process can ever do on the system.
So we're essentially left with a choice: we can either accept a system in which processes can gain capabilities as well as shed them, or have a system in which you can never gain capabilities, because there are no file capabilities, but in which capabilities will never be able to survive an execve. And that's not really an ideal situation. We would like a situation in which capabilities are preserved over execve, but there's no way to gain them. There are only ways to shed them. And obviously, we're not the only people who noticed this. Everybody who's tried to use capabilities to do anything useful has noticed this. So eventually, a bunch of people figured this out. And eventually, Andy Lutomirski submitted and landed ambient capabilities, which essentially work the way you kind of want capabilities to work. They can be inherited across execve. And processes can drop them. But unless file capabilities are enabled, you can never gain new capabilities, which makes a lot of sense. You essentially want to have a bunch of process trees started by init in which you give processes a small set of capabilities which you think they need. And if they don't need them, they can drop them. But nothing can ever gain new capabilities. And when you're in that situation, reasoning about your system becomes a lot easier, which is kind of cool. All right. So this is one case, I think, in which user space kind of needed kernel developers to change something in a way that made the kernel security primitive significantly easier, or even possible, to use. What we learned with Minijail is that sometimes, when we try to combine all this security functionality provided by the kernel and use it in Chrome OS or Android, things don't quite gel the way you would want, when it's not only the security team using them, but the security team working with a significantly bigger team of engineers that might not necessarily have security training.
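The inheritance rules discussed above, including the ambient set, can be written down as a small pure function. The Python sketch below follows the transformation documented in the capabilities(7) manual page for a non-root process executing a binary without set-user-ID or set-group-ID bits; the names `Caps` and `caps_after_execve` are made up for this illustration, and the sketch does not enforce the kernel's invariant that ambient capabilities must also be permitted and inheritable.

```python
from collections import namedtuple

Caps = namedtuple("Caps", "permitted effective inheritable ambient")


def caps_after_execve(p_inh, p_bound, p_amb, f_inh=0, f_perm=0, f_eff=False):
    """Model the capability sets of a process after execve().

    p_* are the process sets before execve, f_* are the file capabilities.
    Assumes a non-root process and a file without setuid/setgid bits.
    """
    # A file carrying capabilities counts as privileged and clears ambient.
    file_is_privileged = bool(f_inh or f_perm)
    new_amb = 0 if file_is_privileged else p_amb
    # The OR of two ANDs from the slide, plus the ambient term.
    new_perm = (p_inh & f_inh) | (f_perm & p_bound) | new_amb
    # F(effective) is a single bit: all permitted caps, or only ambient.
    new_eff = new_perm if f_eff else new_amb
    return Caps(new_perm, new_eff, p_inh, new_amb)


NET = 0x3000        # CAP_NET_ADMIN | CAP_NET_RAW
ALL = (1 << 40) - 1  # a bounding set with every bit this sketch cares about

# No file capabilities, no ambient set: nothing survives the exec.
print(caps_after_execve(p_inh=NET, p_bound=ALL, p_amb=0))
# With the ambient set populated, the capabilities survive.
print(caps_after_execve(p_inh=NET, p_bound=ALL, p_amb=NET))
```

With every F at zero, the only term that can keep `permitted` and `effective` nonzero is the ambient one, which is exactly the gap that ambient capabilities were added to fill.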
You want to make things easier for them, because you can accomplish a lot more when the whole team is helping than when you have three security engineers trying to keep everything from going crazy. However, and there were some allusions to this in previous talks, sometimes code is not expecting a failure. Whenever a program accesses a resource, you kind of have two possible answers if you don't want to grant that resource, right? You can say no, which is kind of what we've been talking about. Capabilities will cause some system calls to return EPERM, and seccomp will cause some system calls to return EPERM. But sometimes the code is not prepared to receive an error. So you have another option, which is to return a dummy object. And that's kind of how I think about namespaces. Namespaces allow us to virtualize or separate certain pieces of functionality, and allow the kernel to return essentially not dummy, but fake objects in place of real objects. And that's kind of cool, because it makes it easier to port random software, third-party software, that we might want to run securely on our systems without having to do open-heart surgery on the code and change everything. So we actually implemented it. It was kind of cool that James had his list of namespaces, because it allowed me to confirm that all of them are implemented in Minijail, which I didn't know. They keep adding namespaces, so we try to keep up. But all of them are implemented in Minijail, now I know. The ones that we use the most are PID namespaces, because it's always a good guarantee to know that the process that you're running will not be able to gain privileges horizontally by exploiting other processes on the system. It might try to go through the kernel, and we have seccomp to deal with that, and all of Kees's work leading the Kernel Self Protection Project to deal with that. But we want to prevent the process from exploiting horizontally.
Now, it's not necessarily easy to talk to other processes when they're running as a different UID, but that code has bugs. The ptrace check that prevents ptracing other processes, for example, has had several bugs in the past. So it's a lot easier when the kernel will not even allow you to see the other processes. So Minijail also allows you to create PID namespaces. It works very similarly to the nsenter program in this case. But the value to us is that it does what nsenter does, but it also does what capsh does. And it also allows you to apply a seccomp policy on top. So we have one single binary that essentially allows us to use every single sandboxing and containment primitive available in the Linux kernel at the same time. And that has proved invaluable for people on our teams who don't necessarily have security training to be able to use these primitives. So this is what ps will look like if you run it inside a PID namespace, which makes sense because it's a PID namespace. The only processes that are going to be seen are the ones that are inside the namespace. But there's obviously a trick here. You might expect only ps to be inside the namespace, but that's kind of tricky. The same way that any system has an init process that is responsible for launching every other process, but also for reaping dead processes, each PID namespace has to have an init-like process to reap all the dead processes inside the namespace. Those processes do not get reparented to the init outside of the namespace. But we want to run processes inside the PID namespace that don't necessarily know how to be init. So Minijail supports launching a very small init-like process as PID 1, so that we can run things inside the namespace that might not expect to be in a PID namespace. Now, ps is not a very complex piece of software. It essentially just reads /proc.
So in the previous invocation, if instead of running ps you list /proc, it will kind of look the same way. It has to look the same way, because ps is just listing /proc. Now, how does this work? When you enter a new PID namespace, if you want /proc to represent the new PID namespace, you need to remount /proc. /proc will always show the state of the system tied to the PID namespace of the process that mounted /proc. So if you are in a new PID namespace, you want a process inside that namespace to remount /proc, so that instead of listing everything outside of the namespace, it only lists things inside the namespace. And the best way to do this, at least for us, was to use mount namespaces. So in Minijail, the use of PID namespaces usually requires entering a new mount namespace at the same time, which allows us to remount /proc without modifying the parent's mount situation. You could also chroot and mount /proc inside the chroot, but we already have namespaces, so we might as well use all of them. James also showed this. I don't really care whether the identifiers are long or not, but sometimes I want to know if things actually got put in the namespace that I expected, or at least in a different one. So that's what I use the numbers for. In this case, it's very easy to see that they're different. I mostly want them to be different, even if they're not the most useful identifiers right now. And now, user namespaces. We might have security concerns about them, but they are what essentially ties everything together. And the key thing for us, and for everyone, is this: it is not a coincidence that, up until now, all my command lines had a pound sign before minijail. All of these things were actually run as root. Until we implemented user namespaces, Minijail itself would run as root, and it would generate a bunch of non-root process trees on the system.
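The namespace identifiers mentioned above are exposed as symlink targets under `/proc/<pid>/ns/`, in the form `pid:[4026531836]`. The Python sketch below parses that form and compares two of them, which is all that is needed to check that a process ended up in a different namespace; `parse_ns_link` and `same_namespace` are names made up for this illustration, and the numbers in the example are arbitrary.

```python
import re

_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a /proc/<pid>/ns/* symlink target into (type, identifier)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def same_namespace(target_a, target_b):
    """Two processes share a namespace when type and identifier both match."""
    return parse_ns_link(target_a) == parse_ns_link(target_b)


# Example values only; real ones come from os.readlink("/proc/<pid>/ns/pid").
outside = "pid:[4026531836]"
inside = "pid:[4026532452]"
print(same_namespace(outside, inside))  # False
```

On a Linux system, `os.readlink("/proc/self/ns/pid")` gives the value for the current process, and the same call against another PID gives the value to compare it with.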
But we had a lot of questions from people inside Google that wanted to use Minijail in situations in which they didn't have root privileges. And obviously, this didn't just happen with Minijail. It happened with containers as well. And user namespaces kind of tie all these things together. So the same way that you can use nsenter, this is kind of what it looks like with Minijail. That's my user ID on my Linux workstation. And this has allowed us to start using Minijail in a bunch of situations, especially in build systems and fuzzing infrastructure. There's a really big fuzzing cluster that Google uses on Chrome and Android. We call it ClusterFuzz. And we use Minijail to essentially sandbox a bunch of untrusted binaries that we want to subject to our fuzzing. And we wanted to be able to run this without root. But sometimes you need to run things that expect to be root inside the container, like Android. Android init expects to be root. We are going to lie. We're going to say, you're just root inside the container, but we need to allow Android init to do most of the things that a normal init on a normal system would try to do as root. And this is mostly what the system looks like. There are a lot of boxes there, but essentially, we bind mount a few things to make sure that when we run /init, which is Android init, the init program used by Android, it has everything it expects in the locations it expects. And then it will actually just start a bunch of processes, which happens to be the way Android works. And everything mostly just works because of the magic of namespaces. The two big modifications that are made to the Android system to make this work are these. Input events are plumbed in to the Android system instead of being read by the system directly, which is what allows you to click on the Android window. And graphics buffers are piped out using FDs.
Essentially, textures are written to FDs, and then the textures are composited outside of the container. But apart from those two things, there are no other modifications to the system required to run Android on Chrome OS, which is pretty cool. To me, this is pretty awesome. Thanks again to the magic of containers, we can let most things happen unmodified, which would never have been possible without something that, instead of denying a request from a process, just returns something that's kind of like the right thing, but not exactly the right thing. And it's super cool. It solves a bunch of problems in Chrome OS. And it was mostly the driving force for Minijail to gain the last bit of container functionality, mostly related to user namespaces. And it mostly looks like this. These are all Minijail API calls, and we essentially set up all the namespaces. The cgroup namespace is missing there. We set up the UID and group ID maps that James mentioned, and we do use pivot_root, which I didn't have time to explain in more detail. It's kind of like chroot, but easier to reason about. And we essentially just run Android init. We just execute one binary, but it's inside a container that has everything set up correctly, and everything just kind of works, which is cool. And I'm almost out of time, so, important acknowledgments. I didn't write this. I wrote a big chunk of it, but not all of it. Will Drewry wrote the initial version of Minijail, and Elly Jones rewrote it in C. Dylan Reid wrote a big chunk of the container stuff, and the Chrome OS team contributed a bunch of other container stuff. Lee Campbell wrote the ELF parser that we use to know whether we can use our LD_PRELOAD trick, and Kees Cook over there reviewed a lot of code. I haven't really been able to find anyone else. Google has a very strong code review culture, so you need someone to review your code before you land it. Nobody else really knows about this, but Kees does.
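The UID and group ID maps mentioned above are what let a process be "root inside the container" while remaining an ordinary user outside. Each line of `/proc/<pid>/uid_map` holds three numbers: the first ID inside the namespace, the first ID outside, and the length of the range. The Python sketch below translates an inside ID to the outside ID it really runs as; `parse_id_map` and `to_outside` are names made up for this illustration, and the example IDs are arbitrary.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    ranges = []
    for line in text.splitlines():
        if not line.strip():
            continue
        inside, outside, length = (int(field) for field in line.split())
        ranges.append((inside, outside, length))
    return ranges


def to_outside(ranges, inside_id):
    """Translate an ID seen inside the namespace to the real ID outside it.

    Returns None when the ID is not covered by any mapped range.
    """
    for inside, outside, length in ranges:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None


# "Root" inside the namespace is really unprivileged UID 1000 outside.
single = parse_id_map("0 1000 1\n")
print(to_outside(single, 0))  # 1000
print(to_outside(single, 1))  # None
```

A one-entry map like this is the unprivileged case from the talk; a container running a full init typically needs a wider range so that the other users it expects also exist.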
So I'm like, sorry, Kees. Another CL your way, and this is probably like number 100. It is what it is. We use Minijail a lot. It has shipped from the beginning in Chrome OS. It's going to be used on Android starting with 7.0, with Nougat, mostly for seccomp, and it is used to implement Android on Chrome OS. And there we go. You can clone it, you can compile it, you can use it; it's BSD-licensed. It lives on android.googlesource.com, because that's where we develop it the most. It works. The executable will work on any Linux system. It will work on Android systems. It will work on Chrome OS. It probably will work on Windows 10 if you install the weird Linux compatibility layer thing. Maybe not. I don't really know. I haven't tried. I've seen only one Windows laptop in this room so far. So if you want to try, be my guest. I would love to know. And that's our mailing list. There are not a lot of people there, because we haven't really... There's a bunch of Minijail forks on GitHub, which is totally fine, because it is BSD. Some of those people actually work at Google and maintain external Minijail forks. We've very successfully asked them to contribute back some of the improvements that they made. We're trying to guide development a little bit better, and not have people just do random forks to improve functionality. So send questions; I'm subscribed, happy to answer. And yeah, that was Minijail. Questions? No questions? No answer? That's a good question. So far, this similar discussion came up when we were porting Minijail to Android and trying to understand how we fit with SELinux. Right now, I see them as very orthogonal approaches. So the answer is: not tomorrow. But my view on Minijail is that it should do what people think is useful, because that's the whole point. It's a helper tool. I don't care what goes in.
As long as it makes sense, I want Minijail to be useful. So it did come up in some contexts in which it would be useful to be able to use Minijail to launch a process, on Android specifically, and set up SELinux at the same time. Right now, that would be kind of hard, slash impossible. So yeah, maybe. Why not? We should do it. I don't know. I think we're good. Over there. So we do, for seccomp. Capabilities, there are not that many of them, so we've mostly done them by hand. But there is a script in the Minijail source code repository that will take an strace output, or many of them actually, and compile a seccomp policy in this language that orders the syscalls the right way. It will make sure that the policy is consistent. If you have mmap, and your architecture actually exposes mmap and mmap2, it will put both of them in. It will make sure that the right syscalls are added. It will make sure that architecture-specific stuff works correctly. So essentially, we kind of compiled all of this logic into that Python script. And it makes sure that a basic execution environment works, and things like that. All right, thank you guys. Thank you. Thank you.
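The idea behind the policy-generation script described in the last answer can be sketched as follows. This is not the script from the Minijail repository: `policy_from_strace` is a name made up for this illustration, ordering syscalls from most to least frequent is an assumption about what ordering them "the right way" means, and the mmap/mmap2 pairing and other architecture handling from the talk are left out.

```python
import re
from collections import Counter

# Matches "name(" at the start of an strace line, optionally after a PID prefix.
_SYSCALL = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def policy_from_strace(strace_text):
    """Build an allow-list policy from strace output, most frequent call first."""
    counts = Counter()
    for line in strace_text.splitlines():
        match = _SYSCALL.match(line)
        if match:  # skips "+++ exited +++" and "--- SIGNAL ---" lines
            counts[match.group(1)] += 1
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    return "\n".join(f"{name}: 1" for name in ordered)


SAMPLE = """\
openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3
read(3, "root:x:0:0"..., 131072) = 10
read(3, "", 131072) = 0
close(3) = 0
+++ exited with 0 +++
"""

print(policy_from_strace(SAMPLE))
```

A policy generated this way only covers the code paths that the traced runs happened to exercise, which is presumably why the script in the talk accepts several strace outputs at once.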