Yeah, so this will hopefully be less controversial than certain comments made in the previous talk, so we should get through it very quickly. This is basically finally taming the last bits of the magic-link issues that we had on the container runtime side of things. It was something I worked on in the original openat2 proposal a while ago, but I removed it because there were a couple of semantic issues that were not clear. So I was just hoping we could have a very short discussion about which parts are missing, whether it makes sense to everyone, whether everyone is happy with it, and whether I'm going to get things thrown at me when I post this on the list. That's basically it.

Effectively, the current status of all the work I've been doing to make filesystem stuff, or at least VFS stuff, less awful for containers: openat2 is merged. libpathrs, which is a whole separate thing we could talk about some other time, is basically a way to safely do path operations on containers; that's still something I'm working on. And then this part is just the magic-link hardening part of things.

So first of all, I guess I didn't put a slide in for this, but what is a magic link? A magic link is anything that uses nd_jump_link in the VFS. It's basically a thing that looks like a symlink, but it's not a symlink. With openat2 you can block resolving these things, so you can't be tricked into opening one by accident. But the hardening we're talking about here is about what an attacker can do, and we have had several CVEs in container runtimes specifically about this issue. Because once you have a handle to the underlying file, okay?
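As an illustration of why a magic link is not an ordinary symlink, here is a minimal Python sketch, assuming Linux with procfs mounted at /proc. The function name is made up for this example. It opens a scratch file read-only, unlinks it, and then reopens it for writing through /proc/self/fd/N, which today succeeds whenever ordinary file permissions allow it.

```python
import os
import tempfile


def reopen_for_write_after_unlink():
    """Upgrade a read-only descriptor to a writable one via /proc/self/fd.

    Hypothetical helper for illustration only. Returns the text that
    readlink() reports for the magic link and the file's final contents.
    """
    tmp_fd, path = tempfile.mkstemp()
    os.write(tmp_fd, b"original")
    os.close(tmp_fd)

    ro_fd = os.open(path, os.O_RDONLY)  # we only hold a read-only handle
    os.unlink(path)                     # the original path no longer exists

    link = f"/proc/self/fd/{ro_fd}"
    # readlink() returns a path-like string, but opening the link does not
    # walk that string: the kernel jumps straight to the open file.
    target = os.readlink(link)

    # Reopen the same file for writing through the magic link.
    rw_fd = os.open(link, os.O_WRONLY | os.O_TRUNC)
    os.write(rw_fd, b"overwritten")
    os.close(rw_fd)

    data = os.pread(ro_fd, 64, 0)       # read back through the old handle
    os.close(ro_fd)
    return target, data
```

The only check applied on the reopen is the usual permission check on the file, which is the gap the rest of the talk is about.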
Because magic links let you get access to the underlying file, if you have a container manager that is joining a container, there is a window in which, depending on what kind of privileges you set up, what kind of permissions the container process has, and many other things, you can grab a handle to, for instance, /proc/[pid]/exe of the container manager. You have a handle to this file now. Obviously you can't open it for writing at that time because it's the executable of a live process, so by definition it can't be opened for writing; you get ETXTBSY. But if I have a handle to this thing, I can wait for the process to die and then reopen it for writing.

You might ask why on earth a container process would have the rights to open a file on the host as root. Well, because that's how Kubernetes and every other major... well, obviously some people don't do this because they realize it's a bad idea, but for most stuff that runs in containers, people are not using user namespaces. They're not protecting things the way they should be. But there are other cases too: you could imagine a setup where I have downloaded the code necessary to run my own containers on my machine and I run it. It would be very strange for a container to be able to overwrite stuff that's on the host.

So anyway, being able to fix this would probably be nice, because at the moment runc and LXC and crun and pick your container runtime all do a variety of awful things to make this attack go away. The one that we do in runc, and in LXC, and I think basically everyone else, is that we make a copy of the entire binary for every container. When you start a container, the first step is: okay, I now open /proc/self/exe.
I create a memfd, I copy the contents into the memfd, I seal the memfd, and then I execve that memfd, because then I'm absolutely sure that even if you can overwrite the damn thing, it won't affect other containers on the system. There are several caveats about things, but anyway, I'm sure we all agree that this is all absolutely awful and should not exist. Unfortunately, it's necessary because we can't defend against it any other way. So I want to solve this problem, and in addition, being able to restrict the reopening of files, effectively a capability-style setup, I think makes sense in general. I think we can solve both of these with this one thing.

So basically there is a patch set which I posted a while ago. It's in my tree, so it exists. The design is this, and I'll explain what it means exactly in a second: when you try to reopen a magic link, so you try to open /proc/self/fd/blah, it will not allow you to reopen it if the mode you requested is not a subset of the mode of the magic link itself. It also adds something that it would be great to have, which is O_EMPTYPATH, which basically lets you do exactly the same thing but without having to go through procfs: you just have the file and you say, please reopen this for me. It also adds a way to mask this. With openat2 you can set a new field in the open_how struct to mask reopening, so you can say: open this file, but do not allow it to be reopened for writing through reopening. And then all of this information is exposed in fdinfo as an extra field.

So what exactly does it mean to say that you need to be a subset? What this means is that if you do an O_PATH open of a regular file, you can reopen it any way you like, unless you have a mask set. So this is all without a mask set.
So it's just a regular open. With a regular open with O_PATH, you can reopen it with anything, which is the way it works now. Then with O_PATH of a magic link, you copy the mode of the magic link. Because magic links, unlike regular symlinks, have magic modes, and the mode actually means something: if you open a file for reading only, the mode has only the read bit set. It doesn't have the write bit, for instance. With caveats, but this is the general way it works. So when you do an O_PATH, the O_PATH will copy the mode. That means if you do an O_PATH of /proc/self/exe, or /proc/blah/exe of the container, you get the same restrictions. So the restrictions get copied with O_PATH, and if you open it with any other regular mode, it's just the obvious thing: if it's read it's read, if it's write it's write. So reopens are based on the magic-link mode, which is based on the f_mode, and when you do it with O_EMPTYPATH it just uses the f_mode.

So yeah, that's the current patch. The questions are, and we'll go through each one, I have slides for each one, but if you would like to scream at me now, please do. The first question is whether or not those semantics make sense. Andy suggested these semantics when we were talking about it some time ago, and I like the concept.
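The subset rule described here can be written down as a small model. This is only a sketch of the semantics as stated in the talk, not kernel code: the names `READ`, `WRITE`, `link_mode_for` and `reopen_allowed` are invented for illustration, and the mask handling is an assumption about how the proposed open_how field would behave.

```python
import os

# Illustrative permission bits for the model (not taken from the patch set).
READ = 0x1
WRITE = 0x2


def requested_mode(flags):
    """Translate open(2) access flags into READ/WRITE bits."""
    acc = flags & os.O_ACCMODE
    if acc == os.O_RDONLY:
        return READ
    if acc == os.O_WRONLY:
        return WRITE
    if acc == os.O_RDWR:
        return READ | WRITE
    raise ValueError("unsupported access mode")


def link_mode_for(open_flags, source_link_mode=None, mask=0):
    """Mode carried by the magic link of a descriptor opened with open_flags.

    source_link_mode is the mode of the magic link being opened, or None
    when opening an ordinary path. mask lists bits to forbid on reopen.
    """
    if open_flags & os.O_PATH:
        # O_PATH of a regular file: unrestricted. O_PATH of a magic link:
        # the restrictions of that link are copied.
        base = READ | WRITE if source_link_mode is None else source_link_mode
    else:
        base = requested_mode(open_flags)
    return base & ~mask


def reopen_allowed(link_mode, flags):
    """A reopen succeeds only if the requested bits are a subset."""
    return requested_mode(flags) & ~link_mode == 0
```

For example, a descriptor opened with `O_RDONLY` yields a link mode of `READ`, so `reopen_allowed(READ, os.O_WRONLY)` is false, while a plain `O_PATH` handle with no mask can still be reopened with `O_RDWR`.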
I think there are some slightly hairy issues we can get into where it's not quite clear exactly how things should work, but it does at least protect against this one thing. Then there are the questions of magic-link semantics. Because part of these changes will change the way that certain restrictions work, and will change part of the API, one of the things I would like is to narrow down exactly how we would future-proof this against certain future features. For instance, if we wanted to make it so that you could have a file descriptor that you couldn't exec: I open this file and you can only read it, you cannot fexecve it. You can't do this today, but if we want to add this feature in the future, it would be a good idea to at least plan for it, so that this stuff doesn't need to change once we add that feature.

But yeah, so, directories. This is quite boring, but basically from the f_mode of a file you cannot detect that it's a directory, so you would need to have another way of detecting this, and so on. For the future-proofing aspect, we would want directories to have rwx all set, because at the moment directories are not restricted by this patch. So I guess the question is: should there be an f_mode "is a directory" bit that is only cosmetic at the moment? Should we even use an f_mode bit, or is there another way we should be doing it?

And also readdir is currently not restricted, because of the way that magic links work. Well, they're magic, as in the name: because it's all being piped through nd_jump_link and everything else, you're in a situation where, if you want to add more restrictions that apply to this stuff, you need to rework the way that lookup works such
that you save this information during nd_jump_link and then handle it during the later part of lookup. Which means that if you wanted to restrict resolution through these things, which this patch doesn't do, such that, say, I create this handle to this directory but you cannot readdir it or you cannot go underneath it, you would need to add a bunch of restrictions in a bunch of places. So I didn't add this because it wasn't clear. I mean, to me it's obvious that it's something which would not be a bad thing to have, but it's not clear whether people would be happy with having this littered all over the place, and so on.

Yeah, is there another mic?

Does it make sense to add a system call specifically for reopening a file?

The problem is that people use it, so you can't ban it. And it's not like people use it in a nefarious way. In container runtimes we use this, we abuse this to hell and back, because we need it: there are certain security properties you cannot get without using it. So having it as an official kernel thing is great. One of the reasons why it's tied to this patch set is because you cannot add this feature until you fix these holes in how reopening works, right? I should point out that obviously DAC rules apply: if you could not normally open the file, you can't, and that's fine. The problem is that this stuff totally bypasses namespaces, and there is no way of restricting it because this is kernel API.

Doesn't chmod work? Shouldn't chmod work to change it?

The mode of symlinks doesn't matter. Mode bits on symlinks are completely ignored.

Yeah, but could they matter for magic symlinks?
I mean, that's what this does, but yeah, could it be done with chmod? I mean, I don't think it's visible.

Okay. One thing to keep in mind: magic symlinks are not all that magic, because every time your script mentions, say, /dev/stdin, that actually resolves to /proc/self/fd/0. So it's not like we could freely change the behavior of that stuff without breaking the living hell out of an unknown number of scripts on hell knows how many systems. It's really not that exotic a feature.

We really need this to be backwards compatible, because, for example, we do crazy stuff where we open /dev/pts/ptmx as an O_PATH file descriptor, because that doesn't perform an open on the actual ptmx device, stash that file descriptor, and then later on use it to retrieve the actual slave side of that pty device, and that needs to continue to work. So this whole thing needs to be implemented in a backward-compatible way. The thing is, though, why I think we want to have this is that people keep coming to us with hacks, well, patches, to fix the /proc/self/exe hell. The recent attempt, for example, was to somehow restrict this. But this would be our way out: if we had this properly done, then this whole problem goes away, because then you wouldn't be able to write to this /proc/self/exe file anymore, provided the container engine opened...

What happens if somebody tries to bind /proc/self/exe, with following links on the source, and then uses whatever pathname it's bound to to open it for write? Where do you stash that information?

Yeah, that is a problem. Okay, the first thing is that that is something we should also figure out a way to deal with. But the practical upshot is that for most containers, where we're dealing with untrusted code and everything, they can't do mounts like that.
So you wouldn't be able to do the mount now. Obviously we would want to also fix that. But at the moment the problem is that you can do this reopening stuff. Again, there is a DAC permission check, but we're talking like 99.999% of the containers running on the entire planet are running as root with no user namespaces. So DAC permissions are not enough. But I agree that that is a separate problem, and I would need to sit down and think about how we would deal with it, because I agree that is something which we'd also need to block.

The whole /proc/self/exe attack vector is really just an attack vector because they don't use user namespaces. I mean, user namespaces are not the solution to everything, but they do block the /proc/self/exe attack, whereas if you just have a container with a bunch of namespaces, that is a valid attack vector.

There are situations where even with user namespaces you would be vulnerable, but how practical they are is a different topic. For instance, going back to when I first started working on rootless containers a long time ago for runc, the use case was: I have no access at all. I've been given a machine by a university supervisor who hates me and doesn't want to give me any packages at all. I download all the stuff by myself, so the runc binary is owned by me, and then I start a user namespace where I am root, because I can only map one user. In that case you could overwrite it. Now, how useful that use case is is a different topic. The point is that there are use cases, even with user namespaces, where that's also a problem.

This was part of the original openat2 patch set.
Thanks for working on this, by the way. I would find this extremely useful if we could make it work, because I think it makes O_PATH file descriptors a lot more useful as well. You get a lot more guarantees from them. And that is also, by the way, one of the reasons why I try to prevent O_PATH file descriptors from being usable in ever more system calls. There was a recent patch, for example, where someone wanted to use an O_PATH file descriptor in the setxattr system call, and I really don't like it, because it makes O_PATH as a concept way less useful by giving it ever more capabilities with no way of limiting them. If we have a way of meaningfully limiting it, then this is a different story, and I think it would be a great addition.

Yeah. So basically, at the moment it solves this one issue, the /proc/self/exe issue; that issue is solved with the patch set. The only question is that it seems to me it would make more sense if we actually just had this. Right now, because the permission modes of symlinks don't matter, and because magic links act like symlinks in many ways, you end up with a situation where there is no check when we are going through a symlink of whether the mode makes sense, even if we're using it as a directory. It doesn't check that because it's a symlink, or it's a magic link, and then nd_jump_link means that it's even more magical. But the point is that when I say magic, I mean it's not within the realm of a regular symlink, where you just replace that part of the path with the contents of the symlink.
That's what I mean by magic link. But anyway, it seems to me that the most logical, the nicest thing would be if the mode of a magic link to a directory, for instance, acted the same as if it were an actual directory. So if it had no read bit, you could not readdir, and if it had no exec bit, you could not resolve through it. And this would then all be stuff that you can deny using openat2.

Now, there are things to consider if we were to do this with directories. The reason it's not in the patch set is that it becomes way more complicated, because you have to think about the fact that every single *at syscall then has to reject it. Technically, if you do an open underneath it, you need to block that because it doesn't have read permissions or exec permissions or whatever. So it gets more complicated, which is why it's not in the current patch set.

Sorry, word count. I apologize, I speak too fast. I'm sorry.

Okay, watch it back at half speed.

Yeah, sorry. The short version is... and then the other thing is the exec bit, for instance. At the moment there is no way to have a file handle that a process cannot fexecve, because there is no restriction. Again, obviously only if it has the exec bit set on the actual file. But the point is that if you have this handle to, let's say, a setuid binary or something, and you want to hand this handle to someone else and you don't trust them.
They could exec it if they wanted to. Now, obviously you would hope that the setuid binary wouldn't have bugs or whatever, but the point is that there's no way to restrict that, and you cannot restrict resolution through directories on a per-file-handle basis. I'm not saying we must implement it now; it's "should we implement it now, question mark". But even if we don't implement it now, I think the design of this should have it in mind, so that if we were to add this in the future, at least you wouldn't have to rethink the way the magic-link changes would work.

Well, right now write bits on directories themselves have nothing to do with opening files in them for write. What are we suggesting to change?

No, so what I'm suggesting with directories is that basically, if I have an O_PATH to a directory and I have set it such that it does not have a write bit in /proc/self/fd/blah, then I cannot mkdir using that thing. So mkdirat would fail, or O_CREAT under that thing would fail, is what I'm saying semantically.

That's anything under the thing?

Well, it's directories under the thing. So I open /foo, or /opt, with O_PATH, but I set the mask such that you cannot write to it, and then I try to mkdirat on that file descriptor. That this would fail is not what the patch does. I'm saying that to me that semantic makes sense, but I don't know if it's something we should pursue, or if it's too complicated to do everywhere. I don't know. That's why I'm bringing it up.

I mean, while you were saying that, it gave me the impression that it's probably painful, but it sounds fairly straightforward.
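For reference, a small Python sketch of today's behaviour that this part of the discussion is about, assuming Linux: an O_PATH handle to a directory can be used as the base of mkdirat, and nothing on the handle itself can forbid that. The function name is illustrative only.

```python
import os


def mkdir_beneath_o_path_handle(parent, name):
    """Create `name` under `parent` using only an O_PATH directory handle.

    Hypothetical helper for illustration. Returns True if the new
    directory exists afterwards.
    """
    dfd = os.open(parent, os.O_PATH | os.O_DIRECTORY | os.O_CLOEXEC)
    try:
        # Equivalent to mkdirat(dfd, name, 0o777): only the permissions on
        # the directory itself are checked, not anything about the handle.
        os.mkdir(name, dir_fd=dfd)
        return os.path.isdir(os.path.join(parent, name))
    finally:
        os.close(dfd)
```

Under the semantics floated in the talk, a handle opened with a write mask would make this call fail; the current patch set does not do that.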
Basically, you're putting an ACL in the struct file that anything that wants to do something with that struct file has to obey. So you can say of this struct file: if you're using it as a base for mkdirat, you can't do writes, can't do lookups, or can't do reads. And then if you reopen it, you can't give the reopened file any more than the permissions in the source file.

Yeah, that's what it does. It's a capability-based system. And the thing is, on the file itself, the patch set already does that. The question is: okay, I now want to resolve through it. Let's say I want to go underneath it. At the moment, the only thing the patch touches is the last component of a lookup, so that part is checked. But when you walk through it, or when you have an *at call that is based on it, and stuff like that, those things are not blocked, mainly because I wasn't sure whether people would be happy with changing all of that. We need the exe stuff to not be broken, and the current patch set does that, but all the other stuff, it seems to me, would be a good idea. I don't know if this will be unpalatable.

I don't think so. We don't have anything of that sort on directories, and that's how the system has been ever since, probably, '69 or so. Changing that...

Yeah, it would be opt-in. You would have to create an O_PATH of this thing that has this mask. If you create a regular O_PATH, it's the same as always. But if I open a directory and I want to give this directory to another process, but I don't want them to be able to make files underneath it or something...

Okay, what happens when you bind that thing somewhere else? Now you have a directory...

Yeah, okay, so you set a bit on it.
So you're not allowed to bind this.

Yeah, so I'm going to put a giant asterisk on this entire talk, which is: TBD on bind mounts. I agree. You could say the same thing for basically everything here, which is that if you can bind-mount it, this stuff does not help. But the thing is that in 99.99% of all containers everywhere, they can't bind-mount. That doesn't mean we shouldn't fix the problem. I'm just saying that if it were possible for you to say you can't bind it, like if you had something like mount's unbindable as an option on the open... I don't know, I would need to think about it. I agree it's a problem. I would be happy to take suggestions on how to fix it. I accept patches to my tree.

And then, I don't have a slide on mounting, but this is a very minor thing which I did try to block a while ago, but unfortunately, Al, you said that we would never remove this: you can currently mount on top of symlinks. I think you can do it easily with the new mount API, but with the old mount API you can do it if you go through enough indirections through magic links. If you open an O_NOFOLLOW handle to a symlink, you have a handle to the symlink, and you then mount through /proc/self/fd/blah for that O_NOFOLLOW handle, you can mount on top of a symlink, and as long as it's a non-directory it'll work as normal. And it's all great.
The only problem is that if you do this on top of a magic link, from a userspace perspective there is literally nothing we can do to protect against this. You cannot, because... okay, so from the use case of...

Actually, I suggest a much simpler solution for that: don't allow binding anything on /proc/[pid]/whatever. We shouldn't be allowing any mounts in that area.

By an unprivileged user, or do you mean by anyone?

What happens when the process exits?

It's equivalent to overriding a system call. So you're saying, from the kernel perspective, just block mounting on top of procfs?

Yes. Not procfs itself. We already have a way to mark an inode as "don't mount on that". We already have that mechanism, and I would say that anything that isn't persistent enough, and /proc/42/something/something definitely isn't, should just get that treatment by default.

Okay, I cannot tell you how happy I am to hear that. I will write the patch literally as I exit the room. Yes, I would love to have this. Yes, please. We need to stop this, because there are so many... I don't have time now, but I was going to spend like five minutes talking about how many awful things this enables.

Sorry, we'll have to accept that we have /proc/fs/nfsd. You have to be able to mount that.

That's fine, but that is persistent. We should just do it for the /proc/[pid] directories, for the thread subdirectories in procfs, and anything under them. Places like /proc/sys are obviously not going away. There is no magic there.

No, but /proc/[pid]/ns is kind of... it's not used, because it's automatic: when you start a process, it has its own thing.
Yeah. Again, if you don't block this somehow, even with openat2 magic links are still unsafe, because you cannot be sure of the thing you end up at. RESOLVE_NO_XDEV blocks mount crossings, but with magic links I do want to cross a mount point; I just don't want to be crossing a bind mount on top of it that I can't be aware of. And you might say, oh, I'll just check mountinfo. Well, it turns out that you can have race conditions there. So if we can block this entirely, I'm very happy to hear that people would be happy to have a patch to block this.

Yeah, because you could bind-mount bits of proc onto other bits of proc, and so it looks like...

That we can handle. So what we can do is this. Say I want to access /proc/self/exe, for instance, which we do in runc: we re-exec ourselves through /proc/self/exe. I open the parent, not /proc/self but /proc/<my pid>, with openat2 with no xdev, no symlinks, no magic links, no funny business. That last one isn't a real flag, but whatever: you block all this stuff, and then you open exe underneath that. In that case I can check that /proc is the real /proc, because the inode number of the procfs root is set in the kernel as part of the API, so I can check that it's the real thing. So I am definitely sure the root is right. I'm definitely sure that all the directories are right, because openat2 guarantees that for me. And then I'm finally at blah. Once this is done, I know that nothing can over-mount exe.
I can then just execve that exe and there's nothing that can mess with it, because at every single step I have checked that it is actually what I think it is. So as far as I can tell that actually works. And also, with OPEN_TREE_CLONE you can create a private copy of proc, so additionally there are no race conditions either, but that's a different discussion. But yeah, I think that would protect against this thing.

You're looking at me funny, but let's talk about... sorry, I think we're out of time. Sorry.