So, Amir asked me to do a little BoF. Huh? Oh, really? Sorry. Should I? I probably don't speak very loudly. Yeah, Amir asked me to do a little BoF about ID-mapped mounts, which is something some of you might have heard about over the last two years or so. I first want to start with a little introduction: why did we do this? What is the motivation? Here are a couple of use cases.

First, a non-container use case, because I always like to stress that this is not purely a container feature; it is a generic VFS feature. It is about what is called portable home directories: essentially the idea that you put UIDs and GIDs on disk, for example on a btrfs subvolume used as your home directory, or a USB stick, or some external disk with an XFS or ext4 file system on it, and you want to take this home directory between two different machines. You can obviously have different login UIDs and GIDs assigned on those machines, and then you wouldn't be able to interact with all of the files on disk. The only option to solve this problem today is to recursively chown all of these files, which gets even more problematic if you consider that your home directory doesn't necessarily contain only files owned by your user's UID and GID. There could be files that some daemon created with a specific UID and GID, and so on. If you recursively chown, you lose all of that information. That is an issue that systemd, for example, frequently ran into.

The other case obviously comes from containers, because they're a really big thing by now. If you have a container that runs inside a user namespace, you will often have a UID and GID mapping in place to ensure that UID 0 inside the container doesn't mean UID 0 on the host, so that if you break out, you're not automatically escalating to root. That means the root file system of the container usually needs to correspond to the ID mapping used for the container, which means root file systems need to be recursively chowned as well. The big problem with this is that the larger your rootfs gets, or the more layers you use for an overlayfs mount, the more expensive it gets to chown them.

The next thing is when you want to give each container its own individual ID mapping, so a separate user namespace with a different ID mapping, which is a use case some people have. Then you also have the problem that you need to duplicate all of the storage: the containers can't share layers anymore, to stay in Docker terminology, and each root file system has to be copied into a separate directory. You're wasting a lot of memory and space for no particular reason other than that you want to change the ownership of the files.

And there's a bunch of other use cases where you, for example, want to share data between the container and the host. Say you want to share your home directory; or, as someone recently wrote a blog post about, you want to build a minimal container that uses /usr from the host system inside the container. That won't be possible to do nicely, because the ownership of the files there will not usually match the container's UID and GID mapping.

There are some solutions to these problems already.
There is a very limited number of file systems that allow you to mount them inside of containers. The most interesting file system in this area is overlayfs: you can mount an overlayfs file system inside a container. That doesn't necessarily help you, because the underlying layers that you use for overlayfs need their ownership changed anyway; overlayfs is a stacking file system mounted on top of a bunch of lower layers. So if the containers use different ID mappings, those layers can't be shared anymore. The other file systems mountable this way are rather uninteresting for data sharing or for use as a rootfs: tmpfs, devpts, sysfs, procfs. Neither ext4, XFS, nor btrfs is actually mountable in this way.

Another thing is that file-system-wide ID mappings do what they say on the tin: they change ownership file-system-wide. So even if you were able to mount one of those file systems inside a user namespace, the ownership changes would apply to the whole file system. Think about containers as a bunch of mounts put together, which they usually are: you have a rootfs and then a bunch of additional mounts in there, data-sharing mounts and so on, a couple of bind mounts. And it's not necessarily the case that all of these individual mounts are supposed to use the same ID mapping.

There are a couple of different solutions you could take here, but the most flexible solution, the one that covered all of the use cases people came up with over the years (and it took a long while to actually get to all of those people and talk to the different stakeholders), was to make it possible to change ownership on a per-mount basis instead of a file-system-wide basis. It's a temporary and localized change, in the sense that the ownership change is tied to the lifetime of the mount. That, in a nutshell, from a high-level perspective, is everything that ID-mapped mounts are about: you can change ownership on a mount-specific basis instead of a file-system-wide basis, which makes them very suitable for containers, for example. The API for this is based on the mount_setattr() system call, which is already fairly widely used. It allows you to change various mount attributes: not just the ID mapping used for a given mount, but also things like read-only, read-write, and so on.

Yes, Ted. "So when you say on a per-mount basis, that's mount, not bind mount?" That's bind mount too. "Oh, on a per-bind-mount basis." Sorry, to use simple VFS terminology: on a per-VFS-mount basis. "Okay, on a VFS mount. Great. Okay, thanks."

So this is the API. I don't know how many of you have seen this; it allows you to change mount attributes recursively, which is something the base mount system call didn't allow you to do. You can make mounts read-only, nodev, and so on. To create an ID-mapped mount specifically, you raise the MOUNT_ATTR_IDMAP flag and pass the file descriptor of the user namespace whose mapping you want to apply to this mount. And that is basically the whole magic. The VFS had to be taught to deal with this; file systems don't really need to be aware of it. There are APIs that abstract the gory details away. At least we tried to make it so.
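To make that sequence concrete, here is a minimal sketch, assuming a 5.12+ kernel with UAPI headers that define the new mount API constants; glibc gained wrappers for these syscalls only later, so it goes through syscall(2). make_idmapped_mount() and its arguments are illustrative, not from the talk, and error paths skip cleanup for brevity:

    /* Sketch: create an ID-mapped mount of @source at @target, applying the
     * ID mapping of the user namespace behind @userns_fd. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/mount.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int make_idmapped_mount(const char *source, const char *target,
                                   int userns_fd)
    {
            struct mount_attr attr = {
                    .attr_set  = MOUNT_ATTR_IDMAP,
                    .userns_fd = userns_fd,
            };
            int fd_tree;

            /* Get a detached copy of the mount at @source; it is not
             * visible anywhere in the file system yet. */
            fd_tree = syscall(SYS_open_tree, AT_FDCWD, source,
                              OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
            if (fd_tree < 0)
                    return -1;

            /* Attach the user namespace's ID mapping to the detached
             * mount. */
            if (syscall(SYS_mount_setattr, fd_tree, "", AT_EMPTY_PATH,
                        &attr, sizeof(attr)) < 0)
                    return -1;

            /* Make the ID-mapped mount visible at @target. */
            return syscall(SYS_move_mount, fd_tree, "", AT_FDCWD, target,
                           MOVE_MOUNT_F_EMPTY_PATH);
    }

This is also roughly the sequence that the mount(8) support discussed next performs under the hood.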
"Is there a command line helper program that makes this easy?" Yeah, this should be merged fairly soon. I can probably give a demo here. Let me try, and let me significantly increase the font size. Can you all see this?

I talked with Karel Zak, the maintainer of util-linux, because systemd services already want to make use of ID-mapped mounts directly for isolated services, and they wanted to have this available in the mount tool. So it should be available in mount(8) soon. Let me see if I have something mounted already.

"For this room, it may be useful to know there is already a binary in fstests that gives you this." Oh yeah, xfstests. I should probably say that we have a large test suite (on the order of 15k lines) associated with ID-mapped mounts, upstream in xfstests, that aims to cover the behavior of ID-mapped mounts under all possible combinations. It tests VFS behavior including ACLs, capabilities (setting and getting), setgid inheritance, and setuid/setgid execution, because that is where all of this stuff becomes relevant. File systems can just run xfstests with that, and they should have a clear idea whether or not they implement this correctly. Every time we fix a bug or see a regression, we immediately add a test to xfstests. And xfstests also has a binary to create ID-mapped mounts.

But for user space it would obviously be nice if you could do something like this. This is the command: mount has a set of options called x-mount (there is a bunch of complicated stuff you can do with those that people don't know about), and it will gain a new mount option called x-mount.idmap. You can then specify ID mappings explicitly, to say "this is the ID mapping that I want to use", but if you wanted to, you could also point it at /proc/<pid>/ns/user for the user namespace that you want to use. It's fairly flexible. The syntax is something like: I want to map UID 65534, for example, to UID 1000, and then you give it a range for how many UIDs and GIDs you want to map. Here I'm using a shortcut for mapping both UIDs and GIDs. So this means: map 65534, which is the nobody/nogroup user, to UID 1000 in the target mount.

If we look at the source mount, you can see there are two entries in there, a directory and a file, that are owned by nobody/nogroup within this mount. But since we created an ID-mapped mount at the target mount, if we look at them from the target mount you'll see brauner:brauner, and brauner is my user, which is UID 1000. I can prove this. So, in this mount...
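The demo glosses over where the user namespace file descriptor comes from. Here is a hedged sketch of one way user space can produce the userns_fd consumed by the earlier sketch: fork a child into a fresh user namespace, write the mapping into its map files, and open its /proc/<pid>/ns/user. The "65534 1000 1" lines correspond to the 65534:1000:1 mapping from the demo; writing mappings for arbitrary IDs requires privilege, and get_userns_fd() is illustrative, not from the talk:

    /* Sketch: build a user namespace whose UID and GID maps read
     * "65534 1000 1" (map on-disk ID 65534 to ID 1000, range 1) and
     * return a fd to it, suitable for mount_attr.userns_fd. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void write_file(const char *path, const char *s)
    {
            int fd = open(path, O_WRONLY);

            if (fd < 0 || write(fd, s, strlen(s)) != (ssize_t)strlen(s))
                    exit(EXIT_FAILURE);
            close(fd);
    }

    static int get_userns_fd(void)
    {
            char path[64], c;
            int p[2], fd;
            pid_t pid;

            if (pipe(p))
                    return -1;

            pid = fork();
            if (pid < 0)
                    return -1;
            if (pid == 0) {
                    close(p[0]);
                    if (unshare(CLONE_NEWUSER))
                            _exit(EXIT_FAILURE);
                    if (write(p[1], "1", 1) != 1) /* namespace exists now */
                            _exit(EXIT_FAILURE);
                    close(p[1]);
                    pause(); /* keep it alive until the parent has its fd */
                    _exit(EXIT_SUCCESS);
            }

            close(p[1]);
            if (read(p[0], &c, 1) != 1) /* wait for the child's unshare() */
                    return -1;
            close(p[0]);

            snprintf(path, sizeof(path), "/proc/%d/uid_map", pid);
            write_file(path, "65534 1000 1");
            snprintf(path, sizeof(path), "/proc/%d/setgroups", pid);
            write_file(path, "deny");
            snprintf(path, sizeof(path), "/proc/%d/gid_map", pid);
            write_file(path, "65534 1000 1");

            snprintf(path, sizeof(path), "/proc/%d/ns/user", pid);
            fd = open(path, O_RDONLY | O_CLOEXEC);

            kill(pid, SIGKILL); /* the fd now pins the namespace */
            waitpid(pid, NULL, 0);
            return fd;
    }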
"This is very fascinating. It sounded like you mentioned there is a way to call out to a service, or to a pseudo-file, for the mappings as well? Typically these would be stored centrally; the example I think of a lot is NFS or some network file system, where you have exactly that thing." Oh, you mean you want to call out to a service and use the ID mapping it gives you?

"Let me give you an example. You have two containers running on the host. Container one is a member of domain Pepsi. Pepsi does not have a user brauner, so you're mapping that; if that user showed up there, he's a guest. But in Microsoft, another container running on the same host, yes, you do exist, as UID 1196 or whatever. So with central storage of these mappings, however you do it, with SSSD or winbind or whatever it is, or some future service, it makes sense that depending on who owns that container, Coke, Pepsi, or Microsoft, those IDs could be mapped differently."

They can, yeah.

"And so what I'm wondering about is how you would call out to a service that, knowing which namespace the container is in, whether Coke, Pepsi, or Microsoft is running it, would provide you the UID mapping for it. Sort of like what SSSD or winbind does today."

I mean, maybe I'm totally misunderstanding, but that seems like a user space problem. If you have infrastructure to call out and say "give me the ID mapping that you want to use for this container", then you retrieve it, and then you can set up the container with that ID mapping for that specific mount, and so on.

"I guess what I'm saying is that there are thousands of these entries, stored centrally and then cached in services like SSSD and winbind. They cache other things too, group memberships and all the things needed for actual evaluation. So these services already do all of that, but they're not going to provide you the data unless you ask them, because they're typically hooked into by PAM and NSS; you're looking at logon, at who-am-I, and commands like this. The thing I'm a little confused about is, when you set up these mounts, is there a way to automate it so the mount command can just go off and ask the user space service for the right thing?"

Yeah, that's possible. That's what I meant: it's a user space problem. If you have a way of retrieving the mappings you want to use and translating them into a form that can be consumed by the mount binary, then this is doable.

What I wanted to illustrate in the last step is the important thing, and this is the test. If I now create a file in there and look at it from here, it will be owned by brauner:brauner; if I look at it from the source mount, it will be owned by nobody/nogroup. The ID mappings essentially work in such a way that if an ID-mapped mount says "map 65534 to UID or GID 1000", that means: if I call stat from that ID-mapped mount, I'm getting 1000 reported as the owning UID and GID for a file that is stored on disk as nobody/nogroup. Consequently, if I create a file as UID 1000, or change ownership and say I want this file to be owned by UID 1000, that means I'm putting a file on disk as nobody/nogroup, which conversely will be reported as being owned by UID and GID 1000 through stat. This is an important thing to note.
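Spelled out as code, the two directions look like this; a hedged illustration assuming the hypothetical /source and /target mounts from the earlier sketches (on-disk owner 65534 mapped to 1000) and a caller running as UID/GID 1000 with write access to the directory:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
            struct stat st;
            int fd;

            /* Creating as UID/GID 1000 through the ID-mapped mount writes
             * 65534:65534 (nobody:nogroup) to disk... */
            fd = open("/target/file", O_CREAT | O_WRONLY, 0644);
            if (fd < 0)
                    return 1;
            close(fd);

            /* ...so the unmapped view reports nobody:nogroup... */
            if (stat("/source/file", &st) == 0)
                    printf("on disk: %u:%u\n", st.st_uid, st.st_gid); /* 65534:65534 */

            /* ...while stat through the ID-mapped mount reports the
             * caller's own IDs. */
            if (stat("/target/file", &st) == 0)
                    printf("mapped:  %u:%u\n", st.st_uid, st.st_gid); /* 1000:1000 */
            return 0;
    }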
And this is used already. For people a little more active in the container community: it's now part of the spec, or it's becoming part of the runc spec; it's used in crun; it's used in runc; LXD has a pull request open for this, or already supports it; systemd-nspawn supports it right off the bat and makes heavy use of it; systemd-homed makes use of ID-mapped mounts; and systemd services will gain support for it. So there's a lot of activity going on around this. The file systems that currently support it are ext4, XFS, and btrfs, and we have a patch series for overlayfs that is scheduled, I guess, to be merged.

"What does a file system need to do to support this?" It doesn't really depend much on the file system. Network file systems are a bit special, but I have prototypes; I had a prototype for cephfs, which I think Jeff Layton already saw, and it needs a bit more work, but in principle it's really easy. All of the inode methods already pass down the relevant ID mapping, the user namespace that is attached to the mount, and then the file system just needs to switch to the generic helpers that we have for mount ID mapping. I have also written a long document (around 900 lines) describing how ID mappings work. As long as the file system uses, for example, inode_init_owner(), then everything is already there; a sketch of the pattern follows after this exchange. If you look at the patches for ext4 or XFS, they were fairly minimal. The only time it needs a bit of thinking is when your file system does anything directly with UIDs and GIDs, which not a lot of file systems do. XFS did it, for example, in a few quota allocation paths, but other than that it should be fairly simple. And I'm obviously always willing to help out if people have a use case for this and think it's something they want to support.

"I mean, definitely NFS and SMB; these come up all the time, right? People are running containers all the time over NFS or SMB. What I'm thinking, at a very high level, is that whether you're talking about AFS or any of these, most of them don't use UIDs; the ownership is expressed as a globally unique number in the inode. What's needed is a way to translate that globally unique number to a specific UID that's different for each container."

For network file systems things can get a bit tricky. Even before this was merged, I looked at network file systems, because I expected this to be complicated: what role do UIDs and GIDs play in CIFS when it interacts with the server, and so on? For example, what cephfs does, if I remember correctly, is always send the FSUID of the caller with any request it makes to the server. That FSUID is more or less only used on the server when you have access restrictions there, where you say, for example: if someone sends me a UID that doesn't match the UID I configured on the server, they are not allowed to interact with any files or create any files on disk. So when you go through a client which uses an ID-mapped mount, you always need to make sure that you send the ID-mapped FSUID to the server, or at least figure out what you want to send to the server. It's things like that that can get complicated.

"I think for NFS and AFS and SMB it's much simpler than that, because on the wire they have a globally unique number; there's no issue like this. The only trick is that when a UID comes in, let's say on create or whatever, they have to map that."
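The sketch referenced above: a hedged, illustrative kernel-side fragment of what "switching to the generic helpers" means, in the API of the kernel series this talk describes, where inode operations receive the user namespace attached to the mount (newer kernels pass a struct mnt_idmap instead; examplefs and examplefs_new_inode() are made up):

    /* The VFS hands every relevant inode operation the mount's user
     * namespace; sticking to helpers like inode_init_owner() makes a
     * local file system ID-mapped-mount aware almost for free. */
    static int examplefs_create(struct user_namespace *mnt_userns,
                                struct inode *dir, struct dentry *dentry,
                                umode_t mode, bool excl)
    {
            struct inode *inode = examplefs_new_inode(dir->i_sb);

            if (IS_ERR(inode))
                    return PTR_ERR(inode);

            /* Maps current_fsuid()/current_fsgid() through @mnt_userns
             * before storing the result in inode->i_uid / i_gid. */
            inode_init_owner(mnt_userns, inode, dir, mode);

            d_instantiate(dentry, inode);
            return 0;
    }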
Yeah, Chuck. "I was wondering, do you have to remount if you want to change the mapping, without unmounting the files?" No, we implemented it in such a way that the ID mapping can't be changed once it is established. You can't do a remount and then attach, for example, another user namespace, because it would have been rather complicated to do that nicely in the VFS: you get into lifetime issues. You would need to guarantee that, for everyone who wants to operate on the ID-mapped mount, the relevant object, the struct user_namespace, doesn't go away behind their back, and that's all kinds of complicated. So the way this is done right now is: you create a new detached mount with open_tree() and OPEN_TREE_CLONE, which is, I showed this on the slides I think, a new system call in the new mount API that gets you a detached mount, meaning it's not visible anywhere in the file system. Then you can change the ID mapping, and then you attach it to the file system. As soon as you have changed the ID mapping, or have the mount attached to the file system, you cannot change anything anymore.

"Okay, a second, possibly related question: how does this scale in the number of mappings per mount? For example, if you've got a multi-user system with a thousand users on it and you want to have a unique mapping for every one of them." There are two ways to understand this: either it's a question about the number of mounts, or about the number of mappings per mount. Originally, user namespaces only supported up to five individual mappings. You saw that line up here, 65534:1000:1; originally you could only have five of those. Back in 2015 or 2016, I don't know exactly, I changed user namespaces to allow up to 340 individual mappings, and that's sort of the limit. This actually has cache line implications, because it is in a hot path. One of the advantages of attaching a user namespace to the ID-mapped mount, instead of, for example, calling override_creds() in the VFS, is that you don't get any of those lifetime issues and you can work under RCU nicely, and so on. But you are in a hot path with these mappings: every time you call, for example, fsuidgid_has_mapping(), which the VFS does every time a file is supposed to be created, it checks: can this UID and GID be represented by the file system? Is there a sensible value assigned to them? So it gets called a lot, along with a bunch of additional checks, and every time it looks into the attached mapping of the user namespace and checks whether this UID and GID are mapped. The way this works is: if you have up to five mappings, it's a simple array; if you have more than five, up to the 340 individual mappings, you have a forward and a reverse pointer to arrays sorted either by the first ID (the 65534 side) or by the second ID (the 1000 side), and then you can use binary search to guarantee that this performs really well. The struct is optimized so that it's cache-line aligned, so increasing the number of mappings will be difficult.
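A hedged userspace illustration of that lookup scheme (not the kernel's code): extents of the form first:lower_first:count, kept sorted, resolved by binary search; the kernel keeps one copy sorted by each side so both mapping directions stay fast:

    /* Illustrative sketch: each extent maps `count` IDs starting at
     * `first` to IDs starting at `lower_first`; with the array sorted
     * by `first`, a binary search resolves an ID in O(log n). */
    #include <stdint.h>
    #include <stdio.h>

    struct extent { uint32_t first, lower_first, count; };

    static int64_t map_id(const struct extent *map, size_t n, uint32_t id)
    {
            size_t lo = 0, hi = n;

            while (lo < hi) {
                    size_t mid = lo + (hi - lo) / 2;

                    if (id < map[mid].first)
                            hi = mid;
                    else if (id >= map[mid].first + map[mid].count)
                            lo = mid + 1;
                    else
                            return (int64_t)map[mid].lower_first +
                                   (id - map[mid].first);
            }
            return -1; /* unmapped: the "has this ID a mapping?" case */
    }

    int main(void)
    {
            /* One extent: map on-disk 65534 to 1000, range 1. */
            struct extent map[] = { { 65534, 1000, 1 } };

            printf("%lld\n", (long long)map_id(map, 1, 65534)); /* 1000 */
            printf("%lld\n", (long long)map_id(map, 1, 4));     /* -1   */
            return 0;
    }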
Yes? "Thank you. About performance: if you look in user space, there are three or four ID-mapper utilities; different file systems have their own upcalls for ID mapping, and presumably all of those have to be changed to take advantage of this. What I'm wondering about is calling out to nfsidmap, cifs.idmap, rpc.idmapd, all the different ID-mapping helpers that are already in user space. Is that always going to be slow, or is there a better way to make them faster?"

Aren't those mapping between names and numbers? So between brauner and 37, not between 37 here and 37 on this mount? That's a totally different thing. The UID and GID mapping that NFS uses, and I'm not an NFS expert, so please correct me if I'm wrong, is concerned with mapping UIDs to user names by calling out to user space. This is a separate issue: this is about how we deal with network file system identities that are foreign to your system.

"Yeah, I mean, if I have a container where UID 37 is, I don't know, christian or something, and I go into another container where it's UID 96, you're doing two nfsidmap calls, right? You're seeing user@something and looking up its ID, and then you're doing the reverse: in one container you look up what the user name was, and in the other container you look up the ID for the user name you just queried." "Is that after going through Christian's mapping layer, or before? Effectively, are you talking about the raw ID from the medium?" "Well, he's translating between the medium and the application and back again; he's going to represent that ID differently, but at some level they have to look up the right one." "But when the network file system sees it again, it's seeing the raw number, not what's presented to the user." "Well, you're seeing a mappable number in a namespace; they're different namespaces. But that's a separate part of the problem and isn't covered by Christian's work. We need to deal with that too at some point." "I believe the ID-mapper programs map the kuid, so effectively that's constant, and this does the UID mapping on top." "But two devices or media may have the same UID: plug in a USB stick with an ext3 file system on it, and it may have ID 37; you go look at NFS and may also see ID 37, but these are different 37s."

I think this is a slightly different topic, and I think we've exhausted this one; I know we're almost out of time. "Is there any thinking about wanting to support project IDs? I know some container systems use project IDs for their own use, but what if you want to do nested containers, yada yada?"
I think we definitely need to revisit this issue. We've basically dodged it for years and didn't really bother with it, because nobody could be bothered to define clear semantics. When I looked at the project ID paths while doing the original work, I was confused on a lot of levels: usually, for example, you interact with the user namespace that the file system was mounted in, but in all, or most, of the project ID paths, the inner user namespace is used. So it's a very schizophrenic situation where it's not very clear what the intended semantics are. The main problem I want to tackle going forward is that we need to provide better documentation of what's going on with UID and GID handling with quotas, and come up with better semantics for a few other things in the VFS layer. But it's definitely on my to-do list, because people want to do this and we get requests for it all the time.

"The only question I had: is project ID support broader than XFS now? Because when I last looked, it was only XFS." Ext4 also supports project IDs, and I think there was some question that Darrick and I were puzzling over, whether the two did it the same way in the presence of namespaces, and how project IDs are mapped in namespaces. The intent is to unify it, but I confess I'm confused about what the semantics should be. As far as I understand, quota can mean a lot of different things for different file systems. Having worked with these parts in user space as well, not just in the kernel: it's very difficult. If you set up a container root and want to say "I want the container to only have this much quota", then btrfs will require a very different setup than XFS or ext4, and it's really hard for user space to actually get this right.

"I just want to point out, before we let Amir do his thing, that Jan says in the chat that the VFS quota code, the normal stuff, seems to be missing the ID-mapping handling; it looks like it doesn't do any of this conversion for things like Q_GETQUOTA." Anyway, let's move this to the hallway track. Amir, you're up.