I hope this is going to be a discussion about FUSE pass-through. It's a feature that's long been requested, because lots of FUSE file systems are just passing through data from underlying file systems. They are not changing anything in the files, but doing something to the directory structure or the metadata. Right now, if a FUSE file system does that, all I/O goes to the user space server request by request, and obviously that's slower than if the file could be read directly by the kernel. This means that the server needs to tell the kernel which file to pass through to.
One interesting question is what the granularity of the mapping should be. Should it be just whole files, or could it be blocks or byte ranges? If we have a more granular mapping than whole files, it could be used to implement block-based file systems as well, because then we could have parts of the device mapped to files in the file system.
Another question is how to establish the mapping. One issue is that currently replies from the server are done via write, and if we pass open file descriptors through that, it could be a security issue. One way to resolve that is to implement an ioctl, so passing the file descriptors which establish the mapping should be done via an ioctl.
Sorry? Why does this need to be done?
The basic problem is that somebody can hijack a write and then send a random FD through the write. You can trick a program into writing random data, and there are examples of this given in the exploit. You can trick a privileged program into sending random data, so it sends an FD it did not want to send, and then that gets attached to the file. I must admit I was kind of skeptical, but I wasn't going to argue with Jann Horn. It was easier just to do what he said rather than trying to argue that one. Because it's much, much, much harder to trick a random program... You can trick any program into doing a write. Well, that's not true.
You can trick many programs into doing a write, because programs write random stuff all the time. Tricking a program into calling a specific ioctl is a much more challenging problem. And the examples were just sudo stuff redirected to the correct FD.
The seccomp notifier uses ioctls for data transmission that is not random file data, but structured data that contains file descriptors and so on. We use ioctls for that as well, for exactly the same reason.
I was convinced in the end that this was a very valid objection, because only these ones allow you to hook it up to a different FD.
The issue is this. Normally, if you want to say sudo blah blah blah and write some random file as root, and you try to redirect it, the unprivileged process has to open that file for writing, which means you have to pass a security check for opening for writing. But here you're talking about a thing where you can trick a process into, effectively, doing an administrative operation which changes how other things work on the system. How do I put it? Basically the problem is the indirection. If you're just writing to a regular file, you have the write check on the file, and it's fine because you're writing to that file. But in this case, if you're writing to a file but you write a special magical thing to it, effectively you can trick a process into sending administrative commands, which is different. That would be my view of it. There are examples of exploits, like Mempodipper and stuff like that, where having a process that can write to a random file can cause issues. There is a history of that sort of thing.
Is it a permission check?
I mean, because permissions are checked on open, not on write. But I imagine you do a permission check on ioctl too. So I don't think it's a permission check. It's about... Yeah. So Jann Horn has a very good explanation. I can dig that out and... It's an interesting thing.
So anyway, basically we have two approaches to establishing this mapping. With the fuse2 prototype, I used an ioctl just to register a file descriptor, which can later be used to map. FUSE BPF took a different route and uses an ioctl to send the whole lookup reply, because they use a protocol in which the mapping is established during lookup.
The next question is the lifetime of the mapping. I guess it's related. fuse2 did explicit teardown of this mapping. I'm not sure what FUSE BPF does, but you will explain.
One concern I have with this is how mapped open files are visible to the outside world. If we just add a mapping, it will create an open file inside the kernel, or at least a reference to a path, but it's not visible anymore. You can't see it with lsof. We have a FUSE file, and you will see that it's open, but you can't see the file that is referred to by that FUSE file.
So AF_UNIX would have the same issue, because you can send FDs over that which you can't then see.
Yeah, but I can confirm that when you're developing something like this, not being able to see the file handles is a right pain. Because you leak them, you lose them, and then you have no idea who's got them.
Oh, no. Yes. No, no, I'm just saying. I hadn't really thought about this, but this has been, not a big problem, but a pain in the butt while developing FUSE BPF. It would be nice to solve.
I want to say as well that there is a way you can cause this through, if you create... Wait until we get into it in 30 minutes.
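The "register a file descriptor, refer to it later, tear it down explicitly" idea described above can be modelled in a few lines of Python. This is only a toy sketch of the concept, not the fuse2 or FUSE BPF interface: `BackingRegistry`, `register`, `read`, and `unregister` are hypothetical names, and `os.dup` stands in for the kernel taking its own reference to the open file. It also shows why visibility is a concern: once the server closes its own descriptor, the only remaining reference is the one held by the registry.

```python
import os
import tempfile


class BackingRegistry:
    """Toy model of a kernel-side table of registered backing files (hypothetical)."""

    def __init__(self):
        self._next_id = 1
        self._files = {}  # backing id -> duplicated file descriptor

    def register(self, fd):
        # Stand-in for the registration ioctl: take our own reference to the
        # open file so it survives the server closing its descriptor.
        backing_id = self._next_id
        self._next_id += 1
        self._files[backing_id] = os.dup(fd)
        return backing_id

    def read(self, backing_id, size, offset):
        # Pass-through read: served from the backing file, no server round trip.
        return os.pread(self._files[backing_id], size, offset)

    def unregister(self, backing_id):
        # Explicit teardown: drop the reference taken in register().
        os.close(self._files.pop(backing_id))

    def registered_ids(self):
        return sorted(self._files)


def demo():
    registry = BackingRegistry()
    with tempfile.TemporaryFile() as f:
        f.write(b"hello backing file")
        f.flush()
        backing_id = registry.register(f.fileno())
    # The server's descriptor is closed here, but the registry's reference
    # keeps the file open.
    data = registry.read(backing_id, 5, 0)
    registry.unregister(backing_id)
    return data, registry.registered_ids()
```

Calling `demo()` reads through the registry after the original descriptor has gone away, then tears the mapping down so nothing stays registered.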
Basically, there is a way in which you can construct a mount namespace that is completely hidden, and in that you can then bind a file descriptor and close it, but keep the handle of the file open in such a way that there's no way for you to detect it as a process on the host. So lsof can't see it. It is possible to do. But the point is that I think this is a more general problem that should be fixed, by the way. I don't think it's a FUSE-specific problem. There needs to be some way... I mean, this goes all the way back to the "how do I unmount my USB" problem, effectively. It's something we should fix, but I don't know if it's FUSE-specific. It seems to me like it's a more generic problem, and we have lsof as a hack to get around... Sorry, that's... Yeah.
So, for overlayfs, this is sort of resolved by having a file descriptor. You don't see the underlying file open, but you see the overlay file open, so you have an idea. And this is sort of solved for the pass-through FD, at least an auto-closed pass-through FD for FUSE, but not for... Yeah, so with overlayfs you can derive which file it is referring to, because it's specified in the mount. For pass-through, you can't derive it, because the mapping is established by the server, and it could be arbitrary.
Oh, namespaces are also a problem. Say you're trying to pass a file descriptor from a FUSE daemon that's in one mount namespace to a mount that's in another mount namespace, that sort of thing.
In what sense would this be a problem?
I was asking, could it be a problem? So you show the FD somehow. You do lsof on it, but it refers to a mount that is not in your namespace, for example. Or it refers to something that's in a different user namespace from you, a different network namespace, whatever.
Specific permission checks, or we can do what we already do today: the sender, the receiver, whatever, must be in the user namespace hierarchy, for example, of the one that it's received from.
Okay. Did I understand correctly that this is a question about permissions?
So what happens if you see it and it appears to point to something in your namespace, but it's actually something in a different namespace? There could be confusion there. Namespaces are just horrible.
Well, I mean... Okay. So, to my understanding, this is what the FUSE FD pass-through solution does, right? But this is a different proposal, and there are patches for the FD pass-through. My question for this session is: Miklos, do we need the FD pass-through as well as the BPF? That's an open question.
I didn't hear any of that because I was trying to get the screen to show up on the other screen. Anyway, this is kind of how you interact with setting up FUSE BPF initially. We have a block that we attach to the lookup response, where at the moment we pass in the name that the BPF program was registered under. And then for the backing directory, at the moment we're passing an FD that we're basically just using to get access to the backing inode. So the lifetime of the connection is whatever the lifetime of that inode is. Well, the backing path, yeah, sorry. We don't currently allow that to change over the lifetime of the program. You'd have to invalidate the existing connection. And we also require that this goes through an ioctl path, like you mentioned, where basically all we're doing is flagging that this came from an ioctl, and then it goes to the same place that the FUSE write would go.
So, I guess, going back to the basic overview of what FUSE BPF is doing.
So, we have a goal to try to be as easy to use as FUSE is, with nice defined entry points, despite how complicated this graph is compared to what it is for FUSE. We have a set of calls which, for the most part, mirror what the FUSE user space calls would be doing. And we provide two hooks for a BPF program to alter the parameters you're passing in. We have one up front, where you can change some of the input arguments, and one at the end, where you can alter some of the return codes. So, for instance, for something like reading from a file, in the pre-filter you might say: for this range we can handle it directly, just go directly to the lower file, whereas for some other offset you might need to defer to user space. And an example of something you might do in a post-filter for a read: maybe there's a section where you want to redact the information or alter something.
Yeah, so the lifespan of the connection to backing is the lifespan of the inode. Or, I think we have a connection between backing file as well, right?
Yeah. Each object, the inode, file and path, is linked to the corresponding backing file and path. Except the path also has the mount point.
Is there no way to use this to trick the FUSE daemon into loading arbitrary BPF scripts, perhaps remotely?
So the way the linkage currently works is that you have to register that program up front. Under this system, it can use anything that has been registered with FUSE specifically. So you have to have set that up at some point. I've just been using bpftool struct_ops register for that.
I mean, this is no more dangerous than the other BPF. Which... That's what he was afraid of.
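The pre-filter and post-filter hooks described above can be sketched as a small simulation. This is not the FUSE BPF API: `pre_filter`, `post_filter`, `filtered_read`, and the constants `DIRECT_LIMIT`, `REDACT_START`, and `REDACT_END` are hypothetical names chosen for illustration, the backing file is modelled as a `bytes` object, and the user space daemon is a stand-in that upper-cases what it returns so the two paths can be told apart.

```python
from dataclasses import dataclass

PASSTHROUGH = "passthrough"
USERSPACE = "userspace"

# Hypothetical policy: reads starting below this offset go straight to the
# backing file, anything else is deferred to the user space daemon.
DIRECT_LIMIT = 16
# Hypothetical byte range that the post-filter hides from readers.
REDACT_START, REDACT_END = 4, 8


@dataclass
class ReadArgs:
    offset: int
    size: int


def pre_filter(args):
    # Runs before the backing call and decides where the request is handled.
    return PASSTHROUGH if args.offset < DIRECT_LIMIT else USERSPACE


def post_filter(args, data):
    # Runs after the backing call and may alter the result; here it masks
    # every byte whose absolute file position falls in the redacted range.
    out = bytearray(data)
    for i in range(len(out)):
        if REDACT_START <= args.offset + i < REDACT_END:
            out[i] = ord("*")
    return bytes(out)


def userspace_read(backing, args):
    # Stand-in for the daemon: marks its output by upper-casing it.
    return backing[args.offset:args.offset + args.size].upper()


def filtered_read(backing, args):
    if pre_filter(args) == USERSPACE:
        return userspace_read(backing, args)
    data = backing[args.offset:args.offset + args.size]
    return post_filter(args, data)


BACKING = b"0123456789abcdefghij"
```

With this setup, `filtered_read(BACKING, ReadArgs(0, 10))` is served from the backing data with bytes 4 to 7 masked, while `filtered_read(BACKING, ReadArgs(16, 4))` takes the user space path.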
Yeah, one thing I did not mention: at the moment there is this kind of difference between interacting with the FUSE daemon and going through FUSE BPF. Your context is different, which we plan to adjust by grabbing the daemon context in the FUSE init reply or something like that, so that we can be accessing files in the same sort of context that we would as the daemon. That is not an answer to the question that you asked. It's something that I forgot to mention.
So, a question for my understanding. If I look at this diagram, then to me it looks like you can use FUSE BPF without any BPF, because it's basically this idea that FUSE is just a pre-filter and a post-filter. Is that correct?
Yeah, that is correct. If you remove all of the BPF, what you're left with is kind of a direct pass-through that you're not interacting with in any way, which is kind of like a more expensive bind mount.
But what's the role of the pre-filter and the post-filter in the FUSE daemon then? What does that do?
Can I try to explain? Yeah, sure. The idea is, of course, you want to offload many, maybe most, of the operations. But it's not useful to pass through all of them, which is what the patches do. You want to be able to say, I want to pass through everything except renames. Or everything except this file. So you could either work whatever you want into the options of FUSE, or you can write a BPF program that encodes this logic. That's one of the uses, maybe the main use.
Part of the problem comes from having this talk before the one where I say what FUSE BPF is. So if you have a time machine, I would recommend going to the Wednesday talk first.
So the calls to user space that we have here are different from the regular FUSE path. These are more limited, in the sense that they are used instead of a BPF filter and just deal with this limited set of arguments that it's allowed to change.
Maybe you're trying to do something that BPF doesn't currently handle, or the verifier objects when you try to do it and you want to prototype something out. So you would write the pre-filter or post-filter in user space to do that, and then it would do some alteration to the arguments that you pass into the backing call. And going back to the read and write examples, which we have not fully implemented yet, the idea there is that you would be able to say, instead of reading this whole range, we're just going to handle this amount in this chunk and then this amount like this, because maybe you want to do different things there. We have plans eventually to allow for multiple backing files, but that's not currently set up. We currently just have a one-to-one sort of mapping.
Yeah, yeah, it's pretty much exactly for that.
That would be during open, I assume?
Yeah, during... There is another patch set that's doing that during open. That's what I call the pass-through FD patches. They're in Android right now, and there are patches that have been posted but not upstreamed, but their approach is different. It attaches the filter during lookup.
Yeah, so it kind of extends what that approach was doing to everything. So for instance, you could do a similar thing for the directory structure, where let's say you have a directory with one file that you don't want certain users to have access to, or even to know exists. You would be able to purge that from readdir, and I believe normally you would get a different error code if you were to try to access it. Of course, if you have read-write access, there's nothing you can really do there.
The reason I'm asking is that this is eerily reminiscent, in some details, of a general proposal from a while back where a BPF program could run during open, and then during open you could, based on the BPF program, refuse the open system call or redirect it to somewhere else.
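The directory example above, purging one file from readdir and failing direct access to it, can be sketched in plain Python over a real directory. This is an illustration of the filtering idea only, not FUSE BPF code: `HIDDEN`, `readdir_post_filter`, `lookup_pre_filter`, and the choice of `ENOENT` as the error returned for a hidden name are all assumptions made for the example.

```python
import errno
import os
import tempfile

# Hypothetical set of names that should be invisible to the caller.
HIDDEN = {"secret.txt"}


def readdir_post_filter(entries):
    # Post-filter on readdir: drop hidden names from the backing listing.
    return [name for name in entries if name not in HIDDEN]


def lookup_pre_filter(name):
    # Pre-filter on lookup: refuse hidden names before touching the backing
    # directory. Returns 0 to continue or a negative errno to fail.
    return -errno.ENOENT if name in HIDDEN else 0


def filtered_listdir(path):
    return sorted(readdir_post_filter(os.listdir(path)))


def filtered_lookup(path, name):
    err = lookup_pre_filter(name)
    if err:
        raise OSError(-err, os.strerror(-err), name)
    return os.stat(os.path.join(path, name))


def demo():
    with tempfile.TemporaryDirectory() as d:
        for name in ("public.txt", "secret.txt"):
            open(os.path.join(d, name), "w").close()
        listing = filtered_listdir(d)
        try:
            filtered_lookup(d, "secret.txt")
            lookup_errno = 0
        except OSError as e:
            lookup_errno = e.errno
        return listing, lookup_errno
```

As the discussion notes, this only hides the file from callers that go through the filtered path; anyone with direct access to the backing directory still sees it.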
Yeah. So this is effectively allowing you to do that for open. We would have the ability to say, I think that would be in the pre-filter, do not do this thing. You would just return an error code. Now I could scroll to the... I have a lot of BPF-specific stuff that would take a lot more than two minutes to cover.
Maybe that's more appropriate for your next talk.
Are there any questions about this thing that I will not have fully explained until Wednesday?
I have a question, maybe for Miklos. Do you think there's a need for the open pass-through in addition to this work, or do we need to combine them? I'll just say for everyone, it's the Android patch that is upstream, and it deals with pass-through of read and write, which is the most performance-related thing.
It's not... Okay. I don't consider Android upstream, for some reason. So what I would love best is that some sort of pass-through is first implemented in FUSE, because then we have one solution which is stand-alone, which doesn't have BPF in it, and also because I guess that's the simpler part in doing the pass-through. And if we have that, I guess it's much easier to add the BPF filters.
So I guess we probably will structure the patches that way for your acceptance. I mean, to a certain extent, setting up everything for pass-through means taking all of the FUSE calls and constructing how to handle them in a pass-through mode. And then all we have to add is the two lines of code that say "and pass this to a filter". So we'll do it the way you ask. So I think the integration is actually... I was going to say 90%, but that's not fair at all, because Daniel's been playing with BPF for the last six months.
I've probably just structured the patch set currently so that we have the kind of just direct pass-through up front.
I would need to look at some of the details of how the other pass-through interface works to know how to change... We don't currently, but we plan to allow changing of the backing file, which is the difference. I'm about out of time, I think.
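The "pass through everything except renames, or everything except this file" policy mentioned earlier in the session amounts to a small routing decision per request. The sketch below is a hypothetical illustration of that decision in Python, not a BPF program and not part of any FUSE interface: `DAEMON_OPS`, `DAEMON_PATHS`, and `route` are invented names, and the path `/special.conf` is a placeholder.

```python
# Hypothetical policy tables: operations and paths the daemon wants to
# handle itself. Everything else is passed through to the backing file system.
DAEMON_OPS = {"rename"}
DAEMON_PATHS = {"/special.conf"}


def route(op, path):
    """Decide whether a request goes to the daemon or straight to backing."""
    if op in DAEMON_OPS or path in DAEMON_PATHS:
        return "daemon"
    return "passthrough"
```

For example, `route("read", "/data.bin")` is passed through, while `route("rename", "/data.bin")` and `route("read", "/special.conf")` both go to the daemon.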