Well, I can make it small enough and just run it not in full screen mode; that's fine. So this should all work. The point of this talk is unprivileged containers and what we're doing from the file systems point of view. I gave a talk a few days ago about unprivileged containers, so I'm not going to repeat everything in that talk, and I'll mostly skip over the bit about containers. The reason I was altering these slides, so that I had something different to present, is that I was making a few changes in response to what various people have said about container technologies and the state of containers. So this is a bit about who I am. I've been an open source advocate for a long time. I got into containers in 2011 and I've been doing them ever since, so in many ways my container technology experience predates Docker. But the primary point is that I'm principally interested in the kernel primitive interfaces for containers, not most of the Docker DevOps stuff. Fortunately, most people here are interested in kernel stuff as well, so I don't need to explain the big difference between them. So we can get straight into the kernel API for containers, which Richard Briggs went over a bit: the cgroup and namespace API, and the fact that every container orchestration system uses this same API. So if you're working in this API on any feature or any security thing, whatever you do is automatically inherited by all of the orchestration systems. But, as a caveat, that's provided they turn it on. So one of the problems with container technologies, or I don't see it as a problem, I see it as a benefit, is that the virtualizations are completely granular, which means that if you choose not to have them, you don't have to have them. That means things like Docker can get away with not turning on the user namespace, because they're not forced to have it. But equally, it means there's an infinite configuration space in terms of the ways you could set up containers. The alleged OCI standardized container format is nothing more than a lie in these terms, because all it's standardizing is the packaging format for containers with immutable infrastructure, nothing else. It does nothing to standardize how you actually set up containers, so we'll still have this infinity of possible configurations. And if you think of an infinity of possible configurations, a subset of that, which is also infinite, is demonstrably completely insecure from the get-go. So you have to be very careful about how you actually set up containers. I'm not going to bother much with cgroups, because most of the interesting container stuff is going on in namespaces at the moment. If you look at what's going on in the kernel, the cgroups people are arguing basically about the v1 versus the v2 API, which isn't really feature enhancement; it's the somewhat political discussion we've got into at the moment. Namespaces are where most of the interesting development, and certainly the interesting security development, is going on. So I'll just mention a few of the cgroups, just to prove I actually know what I'm talking about, principally because you couldn't fact-check me at the moment even if I didn't, and then we'll move on to the interesting stuff, which is the seven namespaces. So there were six, plus the new cgroup namespace.
So there's the network namespace; IPC, which is used to virtualize the System V IPC primitives, which means we can have a separate IPC namespace per container if we choose; and the mount namespace, which is one of the really important ones, because it means we can have a separate mount tree per container. The interesting thing about the mount namespace in Linux is that Linux mounts have always been based on something called the subtree concept. So when you set up a mount namespace, you actually get a cloned subtree right at the top, and you get identical mount namespaces for the two containers. And if you want the container's mount namespace to look different, the first thing you have to do is start throwing out attachments to the subtree until you get there. So there's somewhat of a complexity in setting up mount namespaces securely, and we'll get into some of that when we come to the problems with unprivileged containers, because they primarily lie in mount namespaces. The PID namespace you've heard mentioned; a lot of container subsystems, Docker included, use it. I've never quite figured out why, because the main reason for the PID namespace is if you're doing system containers like LXC. And for those of you who don't know the difference between a system container and an application container: principally, a system container runs init, so it looks like a fully booting machine operating system. An application container, Docker would like to pretend, just runs applications, which is sort of true, but it doesn't in the Docker case, since they have a whole ton of infrastructure in there that they get through the mount namespace. But, oh boy. Yeah, well, that's okay. I could prove the difference between Linux and Windows by doing the install in front of you while I'm doing the presentation, but I consider the risk too high and the publicity benefit to Microsoft too great to do it. So if we continue, the PID namespace exists primarily so that init can run as PID 1 in multiple containers. That's the main reason it exists. One of the other benefits is that it virtualizes the process subtree at a point, so if you want one container not to be able to see the processes inside another container, it's a security feature as well, which is why Docker mostly turns it on. The UTS namespace allows us to virtualize the host name. The host name and domain name need to be virtualized if you want to bring up an NFS server inside a container, because the NFS server and the NFS client often use the kernel primitives for getting the host name and domain name. And then we have this final thing, which is what got dissed this morning, which is the user namespace. So instead of telling you why you should fear user namespaces, I'd like to tell you why you should love user namespaces. That's what most of the talk will be about. And I'll give brief mention to the cgroup namespace. It exists; it's a recent 4.6 feature, and it exists primarily to virtualize certain information about cgroups. So in the long run, we have a hope that the cgroup namespace will actually enable us to bring up unprivileged cgroups. But it looks like there's about half a decade of arguing between where we are today and where I'd like us to be, so I won't waste any of your time by telling you about it, because as you know with Linux, by the time we get there, it will look nothing like what I predicted in front of you.
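To make that list concrete, here is a minimal sketch, not from the talk itself, of the clone/unshare flags behind those seven namespaces. The flag names are the real kernel ones; everything else is for illustration, and CLONE_NEWCGROUP needs a 4.6 or later kernel with matching headers.

```c
/* A sketch, not from the talk: the clone/unshare flags behind the seven
 * namespaces.  Any subset can be requested, which is the "completely
 * granular" property mentioned above. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	int flags = CLONE_NEWNS      /* mount tree                               */
	          | CLONE_NEWIPC     /* System V IPC                             */
	          | CLONE_NEWNET     /* network stack                            */
	          | CLONE_NEWPID     /* PID 1 per container (applies to children)*/
	          | CLONE_NEWUTS     /* hostname / domainname                    */
	          | CLONE_NEWUSER    /* UID/GID mappings                         */
	          | CLONE_NEWCGROUP; /* cgroup view, kernel >= 4.6               */

	if (unshare(flags) < 0)
		perror("unshare");
	else
		printf("now in seven fresh namespaces\n");
	return 0;
}
```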
But one of the things that you hear a lot about this API is that it's completely toxic, so it's very difficult to use. Docker will proudly boast that the reason you should use Docker is because you can't use the native container primitives in Linux. But unprivileged containers, as I see them, are a way of setting up containers so that you don't have to rely on an orchestration system, and they're also the backbone of all security in container subsystems; we'll get into why as well. It's definitely a way of setting up containers without root privileges, and that is a very contentious term because it means different things to different people. What it is, in truth, is a way of putting a UID into a container that, inside the container, can pretend it's root, so it behaves as if it has root privileges; but if you look from outside the container, it's a nobody UID, an unprivileged UID. So this idea of setting up a container without root privileges, you can do in two ways. One is the simple way, which is basically: don't damn well put root into your container. This is what a lot of people should do, but don't. And the other way is that, OK, you realize you have some sort of application which requires some sort of rooty privileges, so we have to give you a UID zero inside your container that has some sort of elevated privilege over all the other UIDs inside your container. User namespaces can be used in both cases for the separation, but the contentious bit is the unprivileged root user that actually has some privileges. So if you think about the problem, the ideal would be to put a fully unprivileged UID inside the container and just pretend it's root, giving it no additional privileges whatsoever; but you unfortunately find that a lot of applications that depend on root to do things start falling over, because there isn't sufficient privilege there to actually do stuff. So this is the reason you have problems with the user namespace: I use it as a way of setting up containers without root privileges. But to set up a container you have to call unshare, and unshare is a call privileged to the root user, which means I can't do it unless I first set up a user namespace and pretend to be root. So a lot of the container primitives in Linux can't actually be used without becoming root, and that means that if you want an unprivileged container to use the container primitives, you actually have to have a root set up inside it to do so. And obviously I want to do this independently of any orchestration system, as I said. I'm interested in things like micro-execution containers: very tiny, short-lived containers, truly invariant, that just have a script inside the container. And the current state of unprivileged containers is that in 4.8-rc3, user namespaces work really well, and cgroups just don't work at all. This is why this talk is entirely about namespaces: I just can't get cgroups to work for me. So rather than dissing them, let me explain a bit more about how user namespaces actually work, because this is the key to understanding the security primitives that Docker will eventually be relying on, and why there is such a problem and such an issue in security terms. The real issue is that, effectively, the user namespace gives enhanced privileges to a user; it's a mechanism for elevating the privilege of a user.
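As an aside, here is a small sketch of that chicken-and-egg problem, assuming a kernel that permits unprivileged user namespaces: unsharing the mount namespace alone fails for an ordinary user, but succeeds once a user namespace has been created first.

```c
/* A sketch of the chicken-and-egg problem described above: as an ordinary
 * user, unsharing the mount namespace alone fails with EPERM, but it works
 * once a user namespace (in which we hold full capabilities) exists. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	if (unshare(CLONE_NEWNS) < 0)
		perror("unshare(CLONE_NEWNS) as a plain user");  /* EPERM expected */

	if (unshare(CLONE_NEWUSER) < 0) {                        /* become "root" first */
		perror("unshare(CLONE_NEWUSER)");
		return 1;
	}
	if (unshare(CLONE_NEWNS) < 0) {                          /* now it succeeds */
		perror("unshare(CLONE_NEWNS) inside the user namespace");
		return 1;
	}
	printf("mount namespace created without real root\n");
	return 0;
}
```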
Every time you do that, you get the potential for security threats, which is what Richard was complaining about this morning with what Red Hat has seen. However, what I can tell you now is that it is also very possible to use this mechanism to give a user sufficient privileges that they think they're root, but they can't do any damage to your system. That is the sort of ideal we're all aiming for. So the allegation that I heard this morning is that we're not there. But I'm here to tell you that we are, because the Bluemix container cloud, which, if you've been to any IBM keynotes, you'll have heard about up to your ears, is currently running on a bare metal environment using only the user namespace to protect the hosts from the guests. So in fact we're demonstrating, in a publicly accessible cloud today, that user namespaces are sufficiently well set up for us to do security separation. We've essentially bet the running of our business cloud on this. That's obviously an invitation to go and hack it, so please don't tell IBM I told you to do it. It's Eric Biederman, I know. But the point is that Eric is very conservative, and you also have a guy called Dan Walsh at Red Hat who says very much the same thing as well. I'm not telling you that all of the problems are solved. What I'm telling you is we think it's secure enough now to do the correct privilege separation that allows us to bring up bare metal containers. And if you think of all the promise of containers to the service providers, bare metal is really where we need to be, because that's where we get all of the density benefits of containers without any of the junk overhead of running hypervisors. So this is a very important milestone, and this is why we're willing to, well, I say we're willing; I suspect IBM Research sort of did it and the cloud group went along with it and they don't really understand what we've done. But this is what we've done. So the key point for you as low-level systems people to understand is that we have a public cloud that is bare metal containers using only user namespaces. But the thing that I'm interested in with user namespaces is certain rooty privileges you have to have if you want to bring up unprivileged containers. So I spend a lot of my time worrying: how do I actually do this in the cloud? And the key thing is the ability to create other namespaces. What a user namespace actually is, is a mapping between interior and exterior IDs for a container. It's controlled by about four files here. The map entries are basically a triple: a UID inside the namespace, the UID it maps to outside, and a range. So in order to map, say, 100,000 UIDs, I don't need 100,000 entries; as long as they're contiguous, I can just do them as a range mapping. It's basically just a shorthand way of doing it. So we can map UIDs, and we can map group IDs. You don't have to worry about the project ID one; that's a project quota thing, and it only really applies to XFS currently, although I think ext4 is growing it. So you can forget that. And then we have this thing called setgroups, because one of the security problems we discovered with user namespaces is the ability to change groups once you're within the user namespace. There's an exploit there: with the UNIX group semantics, you can actually use group membership as a way to deny access to resources.
And if you can call the setgroups call, you can drop that group and suddenly get access to the resources you were denied. So for those of you who use this obscure mechanism, that's why it exists. The shadow utilities actually provide a way of assigning ranges of user IDs and group IDs to unprivileged users. Right at the moment, if you start a user namespace as an unprivileged user, you're only allowed to map one UID, and that's your own, and you're only allowed to map it either to yourself or to root and to nothing else, which is, again, a security thing. But if you want to cope with ranges of maps, you have to have some privileged program do it for you, and newuidmap and newgidmap from the shadow utilities package are the way you do this. And what user namespaces do is retain the concept of an owning namespace. So you can get an interesting issue where you can set up many namespaces, but unless I actually have the right user namespace, I get permission denied trying to enter them, unless I'm real root. So the way you set up unprivileged containers is you have to begin with a parent user namespace, and then all other namespaces that you create are owned by it. Unmapped UIDs inside the user namespace are completely inaccessible. Again, this is a security feature. So usually when you set up a user namespace, you set up a subset of user IDs, not the full four billion we have in Linux, or whatever that huge number is; usually you set up only a few thousand to a few hundred thousand, depending on how your container operates. If you actually were to escape from the container, all of the UIDs outside your mapped range are fully inaccessible to you. Even if you have rooty privileges, you're denied permission to access files that are mapped outside of your namespace. I can give you a small demo of this, assuming I have enough time after the projector cock-up. But this is basically what a user namespace has: it has a mapping from the interior UID, the UID in the container, to what the kernel sees as your UID, which is effectively the exterior mapping. And the range that's mapped is a subset of the range that would be allowed; anything that's unmapped, you will never get access to. And then you have the concept of an owner kuid as well, and this is roughly equivalent to root, although if you don't set up a mapping to root, the owner does not get rooty privileges, because most of our root checks rely on a UID check, which is fudged by the user namespace. So just being the owner kuid is not enough to get you root; you need the owner kuid and a mapping of that UID to zero within the container, and then you get rooty privileges. Like I said, every other namespace hangs off this. So the security problem that I want to talk about is container images, because if you think about the way this all works, all file system accesses occur using the kernel's view of the UID, which is basically the full range. And what that means is that if I'm trying to mount file system images into an unprivileged container that's operating in a UID range that's way, way away from, say, 0 to 1,000, which is the usual range for system images, then if I put a standard Docker image into that unprivileged container, everything will fall over, because my UID range is completely wrong for mounting what is real root in that image into my container.
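For reference, here is a rough sketch of what something like unshare -r does with those map files; the exact behaviour of the tool may differ, but the file format and the setgroups requirement are as described above.

```c
/* A rough sketch of what "unshare -r" does under the covers.  Each map line
 * is the triple "ID-inside-namespace  ID-outside-namespace  length", and
 * setgroups must be set to "deny" before an unprivileged process may write
 * gid_map.  Error handling is minimal. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static void write_file(const char *path, const char *text)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(text, f) == EOF)
		perror(path);
	if (f)
		fclose(f);
}

int main(void)
{
	uid_t uid = getuid();
	gid_t gid = getgid();
	char buf[64];

	if (unshare(CLONE_NEWUSER) < 0) {
		perror("unshare");
		return 1;
	}

	snprintf(buf, sizeof(buf), "0 %u 1\n", (unsigned)uid);  /* our one UID becomes 0 */
	write_file("/proc/self/uid_map", buf);

	write_file("/proc/self/setgroups", "deny");              /* close the setgroups hole */

	snprintf(buf, sizeof(buf), "0 %u 1\n", (unsigned)gid);
	write_file("/proc/self/gid_map", buf);

	printf("uid is now %u\n", (unsigned)getuid());            /* prints 0 */
	return 0;
}
```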
So let's say I've set up a container where I've mapped 0 to 1,000, and I map them to real UIDs 100,000 to 101,000. When I try to mount an image that's at 0, all I see is nobody and no group, and I have no permission to do anything. Operating system images tend to get rather annoyed when that happens, and they refuse to boot, just as you saw with my laptop and systemd, because they can't alter any of the directories that they need to alter to be able to get the operating system to run. So the only way we can get operating system images to run is if we also have some mapping on the back end, from the kuid and kgid, the real kernel UIDs, back into the file system, that allows an unprivileged user at 0 to come into the kernel at 100,000 and then, when that 100,000 tries to write to the disk, to go back to 0 again. As you can see, this is starting to set up a security and permissions problem that has somewhat of a potential for being a nightmare. So this is the way it happens today. There's my container with what it thinks are its UID and GID. On top of that, every time you come across this container-to-kernel boundary, you get mapped from your UID to the kernel's view of your UID. This kernel view is what's actually visible in the host, which means that if UID 0 escapes from the container into the host, it's a fully unprivileged user ID 100,000, which is where a lot of the security aspects of this come from. But the problem is that if I'm trying to go from here to a file system, the writes occur at the kernel's view of the UID and GID, not at the container's view, and that is the specific problem here. This means that if your image is shifted, and some of the Docker images in the repository do have UID shifts in them, it will work; but you have to know ahead of time what the range shift is and have programmed your container for it, and it's just a complete nightmare scenario that nobody wants to deal with. So the ideal is that all images that go into any repository will always be at physical UID 0, and we have to manage the mappings on the back end, and a lot of the argument is about how we do this. The way that we actually want it to work is that the container writes at its UID into the kernel. That UID gets remapped for permission purposes, so if it escapes, it's still at the kuid. But as soon as it writes to what is its root, it writes at what its own UID is; and if it got access to any other part of the file system tree, it would still be writing at the kernel's view. So the problem we have is: how do we actually set this up so it will work? And right at the moment, this is an unsolved problem in Linux, so it's what we're actually trying to work on. An old solution that has existed for a long time, that everybody seems to know about but nobody uses (I have actually tried to use it), is something called bindfs. It's a fuse file system that can do UID shifts, and it's actually very useful. The drawback is that when you specify the UID and GID shifts, it's one UID mapping per entry, and that means if I want to map 1,000 UIDs, I need 1,000 entries, and unfortunately that's a bit bigger than the fuse command line can cope with. So it's a soluble problem, but no one's ever bothered to try and solve it, principally because nobody likes fuse and they don't want to use it as a basis for their containers, mainly because everybody whines about the performance of fuse. Now, I don't want to get into that debate; personally, I think fuse would be fast enough.
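Before moving on, here is a toy model, not kernel code, of the shift arithmetic just described, using the 0-to-1,000 mapped at 100,000 numbers from above.

```c
/* A toy model of the shift arithmetic: one mapping of container UIDs 0-999
 * onto kernel UIDs 100000-100999, and what it does to an image whose files
 * are owned by UID 0 on disk. */
#include <stdio.h>

#define INSIDE_BASE      0U
#define OUTSIDE_BASE     100000U
#define RANGE            1000U
#define OVERFLOW_UID     65534U   /* "nobody": how unmapped IDs appear */

/* container view -> kernel view: what actually hits the filesystem */
static unsigned kuid_of(unsigned inside)
{
	if (inside < INSIDE_BASE + RANGE)
		return inside - INSIDE_BASE + OUTSIDE_BASE;
	return OVERFLOW_UID;
}

/* on-disk owner -> what the container sees */
static unsigned container_view_of(unsigned on_disk)
{
	if (on_disk >= OUTSIDE_BASE && on_disk < OUTSIDE_BASE + RANGE)
		return on_disk - OUTSIDE_BASE + INSIDE_BASE;
	return OVERFLOW_UID;
}

int main(void)
{
	/* Container root writes a file: it lands on disk owned by 100000. */
	printf("container uid 0 writes as kuid %u\n", kuid_of(0));

	/* An image file owned by real UID 0 shows up as nobody inside. */
	printf("image file owned by 0 appears inside as uid %u\n",
	       container_view_of(0));
	return 0;
}
```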
But it turns out that it's fairly easy to do this in the kernel as well. If you actually want to check out bindfs, it's by Martin Pärtel, I think; this is his GitHub account. You can download it, compile it, and play with it. And all it really does is a simple remounting of a subtree from one part of Linux to another with this UID shift. But once you have that subtree, you can set up a container's mount namespace to have only this subtree. So effectively you could confine the container so it wouldn't have access to anything outside of this UID-shifted subtree; and if it did, it would still be trying to access the files at its unprivileged UID, because the only place with the correct shift is where you've remounted this bindfs thing. However, in the 4.6 time frame, which we're now well into, actually in the merge window just before 4.6, two other solutions for solving this problem appeared. One was the one that I did, which I'll give you a demo of. I would like to have demoed both of these, but the problem is that the code for the two is somewhat incompatible when you try to put them both on the same laptop, and I couldn't, in time, come up with a kernel that would do both, so I can only demo mine. The other one was the portable root file system solution by Djalal Harouni. So we currently have two proposed solutions, and nobody really knows which one to pick at the moment. So let me tell you how they work. Shiftfs is basically an in-kernel version of bindfs with all of the crap thrown out and range mapping supported, so it's much easier to set up. It takes a mapping range as a mount argument, so it's very, very easy to do. And again, it's basically subtree shifting: you take a subtree at one point and form a new, UID-shifted subtree at another point, and then it becomes your problem to make sure the container has access to this. But if you get it wrong and the container goes back to the origin of the subtree, or to anything outside it, it still sees everything at the kernel's view of the UID, not shiftfs's view of the UID. So only the subtree you mount is shifted, which gives you some form of security protection for shiftfs. However, the fundamental drawback is that root is required to set up this mapping; that is how I did the security piece. So this mapping has to be set up by root ahead of time, and then the container can bind-mount it. So it allows unprivileged users to use this, provided some administrator has already set it up before you actually do it. And obviously, if you think in terms of container orchestration systems, particularly things like Docker, they already have root privileges anyway; they can just do this, and it's fairly easy. If I think about how I would really like to do it, we would probably do a newuidmap/newgidmap-style solution: there would be a file somewhere in /etc that says which pieces of the file system which users should have access to, and with what shifts, and I'd just be able to execute a setuid binary, or a capability-bearing shiftfs helper or whatever if you're doing capabilities, let that do the mount inside my container, and be done with it. So there are many ways of solving this problem that would actually work. And the point is that if you do it within the file system, the container can then remount it, and everything just works. And the security is basically in the mapping range.
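As a purely illustrative sketch of the "root sets up the shifted subtree ahead of time" step: the shiftfs filesystem type comes from the patch set under discussion, but the paths and the mount-option string below are invented, since the real patches define their own option syntax.

```c
/* A hedged sketch of the privileged setup step for shiftfs.  The paths and
 * the option string ("uidmap=...") are made up for illustration only. */
#define _GNU_SOURCE
#include <sys/mount.h>
#include <stdio.h>

int main(void)
{
	/* Present /containers/images/base (files owned by real UID 0) at
	 * /containers/run/base with UIDs 0-999 shifted to 100000-100999.
	 * Must be done by root; the unprivileged container then bind-mounts
	 * the shifted subtree into its own mount namespace. */
	if (mount("/containers/images/base", "/containers/run/base", "shiftfs",
	          0, "uidmap=0:100000:1000,gidmap=0:100000:1000") < 0) {
		perror("mount(shiftfs)");
		return 1;
	}
	return 0;
}
```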
So the way you would get security for multiple independent tenants is to give them non-overlapping UID ranges, which means that if one tenant escapes into another tenant's container, they're likewise fully unprivileged in that tenant's container. And shiftfs can give you the same security because you bind non-overlapping ranges, so if you try to mount a system that you're not supposed to have access to, the UID shift is wrong and you're still fully unprivileged there and cannot get access to it. So it works reasonably well for reasonable security primitives in this case. The portable root file system works in a different way. It allows any file system that's mounted on Linux to be marked as shiftable, and once that file system is marked as shiftable, you can bind-mount it into your container using the UID shift of the user namespace that exists in that container. So this is slightly more insecure, because it means you have to be careful about what you've marked as shiftable, because almost any container can bind-mount it, and so we were discussing adding a few more security primitives to make sure that this actually works. The other disadvantage is that it requires changes to underlying file systems. So the reason the shiftfs code looks more attractive to Al Viro than this code is that this code requires a lot of changes to the underlying VFS code; it threads all the way through the VFS, and then, because it has markers on the underlying file system, it actually touches individual file system code in the Linux tree as well. Whereas if you look at the code for shiftfs, it's basically a single self-contained file; it can do everything with the current VFS interfaces, so no extensions are required. But like I said, both can be used to bind-mount the root file system or an image for unprivileged containers. And the permission to mount is still a bit problematic. The way it was coded the last time I saw it, any container can mount this provided the file system is marked, which I think is too permissive in terms of the way it goes. So we probably still need some permission system to say this container can mount this file system, but that container can only mount that file system, and so on; that's yet to be worked out in this patch set. But now, what happened, and the reason none of this is upstream, is that we also had another problem, which Seth Forshee at Canonical is working on, which is trying to do unprivileged fuse. The problem with fuse is that it currently requires root permission to run, and if your root in the container comes from a user namespace, you're not root enough to be able to run fuse, so fuse refuses to run inside your container. So Seth, and actually Eric Biederman, are trying to solve this problem as well, and one of the ways they're trying to solve it is that they're looking at adding a separate user namespace that does the mapping from the kernel to the file system. Who screamed? Well, Casey, if you want to scream at them, the containers mailing list is where it's all at. If you want to rant on the security list, fine, but no containers person will see it. If you want to object to this, you have to come onto our list and object. But this is currently what they're thinking of: each superblock would effectively have its own mapping, and that mapping would be in effect every time a write was emitted from the kernel to anything that underlay that superblock. So this is roughly what it would look like.
So the container would have its own UID and GID. That would be translated by the user namespace into the kernel's view of that UID. But that kuid would be translated, as you go across the VFS boundary, into this superblock version of the UID and GID. So effectively, it's a double mapping system. The thing about this is that if I make use of it in shiftfs, any lack of security becomes Eric and Seth's problem. So for me, it looks like a very good solution. Obviously, I can see why you might not think of this as perhaps the best solution to have, but for me it means the problem is in the SEP field: it's Somebody Else's Problem now, not my problem. The patches for this basically appeared in the 4.7 cycle. They are not in the 4.8 merge window, so there is plenty of time to discuss this if anybody has security concerns about it. So if any more people want to scream, don't do it at me. But the challenge is going to be: how do we integrate all of this? The reason I'm a bit dubious about it is that double shifting just looks like a much more complex administrative problem to me. But if Seth and Eric can come up with useful primitives that allow us to apply it easily in a secure fashion, I won't object, and I can modify shiftfs to take advantage of it. The bind mounts of shiftfs would just make sure the superblock UID and the container UID map straight onto one another, and it would be fairly easy. So there's an easy way for me to make use of it, and then the security problem becomes their fault. But the challenge really is how to integrate all of this, because now, effectively, instead of having two separate solutions that we just had to choose between, we have two solutions and a proposal that's radically different from both of those solutions, and we have to alter everything to fit so that it can eventually go upstream. So I don't even think any of this will be upstream in the 4.9 merge window; it'll all be several kernel versions beyond. So there is time to fix all of this. Both shiftfs and the portable root file system are based on superblocks, and since the translations for the file system mapping would be done in the superblock, we can both trivially take advantage of it. And one of the things that would help Djalal with his portable root file system is that he'd no longer need to mark the file systems, so it removes some of the objection to his patch, which was that it sprayed changes through the VFS. He still does the translations of the inodes inside the VFS, but looking at his code, I believe it might be possible to defer those translations to the ones that Eric and Seth are doing. And so for him, the advantage is that Al Viro's wrath is redirected to Seth and Eric, and he doesn't have to suffer from it. So to me, to both of us, it looks like a reasonable solution, because I don't have to accept rotten tomatoes from you for the security thing, because that's Eric and Seth's problem, and Djalal doesn't have to accept Al Viro's wrath, because that's now Eric and Seth's problem too. So it looks reasonable. And if you know something about the way kernel development works, anything that takes problems away from me and gives them to somebody else means that my code is more likely to get upstream, so this sort of thing may actually happen at some point. And the problem really is going to be: how do we parameterize all of this in a secure fashion?
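Here is a toy model of that double mapping, again with made-up numbers.

```c
/* A toy model of the double mapping just described: the container's user
 * namespace maps inside UIDs up to kernel UIDs, and the proposed
 * per-superblock namespace maps them back down before they reach the disk.
 * All numbers are illustrative. */
#include <stdio.h>

/* user namespace: container 0-999 -> kernel 100000-100999 */
static unsigned userns_map(unsigned inside)  { return inside + 100000; }

/* superblock namespace: kernel 100000-100999 -> on-disk 0-999 */
static unsigned sb_map(unsigned kuid)        { return kuid - 100000; }

int main(void)
{
	unsigned inside = 0;                    /* container root           */
	unsigned kuid   = userns_map(inside);   /* what the host sees       */
	unsigned disk   = sb_map(kuid);         /* what lands in the image  */

	printf("inside %u -> kuid %u -> on-disk %u\n", inside, kuid, disk);
	return 0;
}
```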
So let's see if I can actually persuade the laptop to do a demo. I just need to shift back so I can actually see my screen when I'm doing this. Can everybody see that? Is that legible at the back? Are you OK? OK. So why doesn't Richard like user namespaces? Does everybody know the kernel primitives for setting up user namespaces, unshare and setns? You've all heard of these, or do I need to give you a brief overview? So you enter user namespaces via either setns, which attaches to an existing namespace, or unshare, which, like a fork or a clone call, creates a new namespace. Within Linux, namespaces are all marked by a weird inode, and you can see this if you look in proc. Let's look at /proc/self/ns. So all of the namespaces are sitting in here. This is a new kernel, I think it's 4.6. You can see that these are symlinks to these sort of weird inode numbers, and each of these inode numbers represents a separate namespace. And this is perhaps why we don't want to use this for tagging audit, because these are the seven numbers you'd be using for the tag, until we get pressure to randomize our inode numbers, right? I mean, you've pressured us to randomize everything else; why wouldn't you pressure us to randomize that? Yeah, I'm sure we'll come up with a solution. I was just showing how bad it could be. So let me create a user namespace. If I just unshare the user namespace... oh, let me demonstrate, first of all, what my problem is. I can't create a mount namespace as me. Right at the moment, I'm user jejb; I'm completely unprivileged. The system denies me permission to do this, so I cannot create any namespaces other than a user namespace, just to prove that I'm unprivileged. I'm at UID 1000, GID 100, which is the standard users group on SUSE, and I'm also in group 469, because I do occasionally play with Docker. So the reason that user namespaces are so hated is because of this: this is not a setuid binary, and I am now root. What unshare -r does is install a mapping in the UID files. If I look at the uid_map, it says my original UID of 1000 is mapped to 0, and that's the only UID that's mapped; every other UID in the system is fully unmapped. And this means that if I look at something in my home directory, it's now root and root. Great for me. But remember, I've only set up the user namespace; I have no other namespaces on the system, so I have full access to the rest of the system. If I look at /root, I am completely unprivileged there. So this is the root file system: completely unprivileged there. UID 0 is unmapped, and if I try to do something in that file system, I get permission denied. So although I appear to be root, I'm not root enough to modify anything outside of my mapping, which for me now is basically just my home directory. One of the things that we really want to do with shiftfs is solve this problem, so I could bring a root file system up at UID 0 inside my container. But if you look at the way I actually set up containers currently, this is my directory for architecture emulation containers, and as you can see, they're all root file systems, but they're all owned by shifted UIDs. So I can, as myself, bring these up in fully unprivileged containers. I did this demo at some length yesterday, so I'll just skip over it quickly; I won't show you all the scripts that do this. But let's build a container for AArch64. So this sets up the namespaces for me, and I can do nsenter, and I have to enter the mount namespace as well. There I am as root again inside this. This is actually an AArch64 container.
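As an aside, the "everything unmapped is nobody" behaviour seen a moment ago looks like this at the syscall level.

```c
/* A sketch of the behaviour from the demo: inside a fresh user namespace,
 * a file owned by real UID 0 is reported as the overflow UID (65534,
 * nobody), because 0 has no mapping there. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/stat.h>
#include <stdio.h>

int main(void)
{
	struct stat st;

	if (unshare(CLONE_NEWUSER) < 0) {        /* no uid_map written at all */
		perror("unshare");
		return 1;
	}
	if (stat("/etc/passwd", &st) < 0) {      /* owned by real UID 0 */
		perror("stat");
		return 1;
	}
	printf("/etc/passwd appears owned by uid %u\n", (unsigned)st.st_uid);
	return 0;                                /* prints 65534 */
}
```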
So I'm running as AArch64 now; this is why I like it. But now I've actually got a range of UIDs mapped, because of the way I set this up; I've given myself permission to do this. So I've mapped 0 through 1,000 at 100,000. I've punched a hole in it for my own UID, and I have to map nobody as well, because operating systems require the nobody user to operate. And this means that I can do things like this inside this environment: I can just become me, and I can actually go to my own home directory. Even if I do a df, the root you see above me is actually a full AArch64 root that I'm just emulating. So I've flipped the real root out of the way and bind-mounted this container root. And I can show you this, because if I go back to being root and I touch a test file in that root, it will actually appear in my container's root. And there it is. And if I come out of the user namespace and just go back, you can still see it there, but this time it's owned by 100,000. So I've done nothing that was outside of my UID range when I did this. But the thing that I can do with shiftfs is I could do this UID map from me to root, only me, and the GID map the same, but then map the root file system into something in my home directory, so this is my real root file system now. Oh, shit. This must be because... OK, we'll just pretend this worked. It's because I prepared the demo for 4.8-rc3, not for this kernel, and of course shiftfs is not upstream, so I need the module, and I've obviously forgotten to put the module in. I think I switched to a stable kernel for the projector and forgot to update something. But anyway, what would have happened is that I would have done the same demo with the test file, but this time I would have shown that, as myself, I could create the test file in my real root file system, and it would have appeared there owned by UID 0 and GID 0. Which really is just a demonstration of why this bind mounting of shiftfs is so dangerous: if it's misused, it gives an unprivileged container the ability to access any file system on the box, and that's obviously the wrong thing to do. So, end of demo; let's go back to the slides and get to the conclusion, because I'm practically out of time. Work, please. So that was the demo, and now what we basically have is the conclusions, which are that the problem of connecting file systems to unprivileged containers is still pretty much unsolved. I've previewed for you roughly how we're going to do it. If there are security concerns, which the squeaking in the room seemed to imply there might possibly be, now would be a good time to object, because we're actually just discussing this at the moment. But this problem has to be solved soon, because without it we cannot mount root images securely into unprivileged containers. So right at the moment, your choices are something stupid like Docker, which is going to put root into the container, because that's the only way it can actually alter its images; or we do unprivileged containers and force Docker to come along, which also means we have to be able to mount what are effectively privileged root images into an unprivileged container. So that's effectively a choice of two evils.
And I honestly believe we have to take the mapping path, because allowing real root to exist in a container is much more of a security threat than building a fake root into a container. So with that, I'll just say these slides were done with Impress.js; it's a web page, which is why you won't find it on the Linux Foundation's website, because apparently they can't cope with that format. What I'll do is put it up on my website and send out a URL, which at some point within the next 10 years the Linux Foundation might be able to redirect to; but James will kindly send it out to the list, so you'll be able to get these slides. And I'll just say thank you and call for questions, if we have time. I think I've taken us into the tea break now, haven't I? Okay, thank you.