All right, so I'm going to give some details on the current status of FUSE BPF and some of the things where we're still debating where we want to go. FUSE BPF starts from the observation that when you have a FUSE file system, oftentimes what you're doing is making some modifications and then dealing with a lower file system in some way, and this is one of the efforts to formalize that behavior. We want to perform similarly to the native file system, or at least as close as we can get to it, while keeping all of the nice ease of use of FUSE. So we have nicely defined entry points, and we're trying to keep the interface similar to what you would see from the FUSE daemon. Just a brief overview: when you're using classic FUSE as a stacked file system, you end up transitioning to and from user space several times, calling out to the FUSE daemon and then back in through the VFS layer. With FUSE BPF, we try to stay inside the kernel as much as possible. So this is our general flow. A call comes in. You optionally filter some of the inputs to whatever command you're running; for instance, if you're doing a lookup, maybe there's a differently named file you want to look up on the lower file system. There's an optional call out to do the same from user space if, for whatever reason, what you're doing can't be done in BPF, say you're reading from a database somewhere. Then we handle the backing call directly within the kernel as a stacked file system. Afterwards, there's another hook where the BPF program can alter what has been returned. In the case of a read, once you've got your data back, there may be some sections you want to change. Or, in one of our use cases, during certain sorts of accesses we want to hide the existence of a directory; in that case, we might alter the returned error code so it looks like the directory isn't there at all, as opposed to a permission error.
So we switched over to using a BPF struct_ops approach. In v1 of the patch set, we were adding our own program type and had an awful lot of void pointers, which was unpleasant to deal with. We have more or less two hooks per FUSE opcode: a pre-filter and a post-filter. For the most part, these just give access to generic structs. Some things are handled within a special fuse buffer, which I'll go into in more detail on the next slide, but that's mostly for fields that have variable length, like strings and data buffers. The BPF program also has the option to say: actually, we just want to use normal FUSE, so you can fall back to that path. One of the nice parts of the struct_ops implementation is that you only need to implement what you need. To give a deliberately dumb example you would never actually want to do: if your stacked file system just added a character at the end of every file name, then you would make an adjustment in lookup, do some things for readdir, and that would basically be it. If you're making minimal changes, you only have to worry about the specific hooks you care about. More on what fuse buffers are: ideally we would just expose these as dynptrs, but we're trying to avoid unnecessary copies. For instance, when we're handling a lookup, we have an existing string inside the dentry, the dentry name, and naturally we aren't going to overwrite that in place; that would be bad. So instead we create a copy via a helper call on these fuse buffers. I had some nice suggestions on the mailing list about changing the names of these functions. But the buffer retains information, like whether we had to reallocate it, which FUSE then cleans up after the fact once we no longer need it.
And that reallocation is why we can't just use dynptrs as is, although all interaction with the data is done through the dynptr helpers. So this is how you interface with FUSE BPF: you write your struct_ops program, and either at mount time or at lookup time, you create a linkage between a given dentry/inode, the struct_ops program you want to run with it, and what the backing file is. Currently we're just passing the FD number, which we then interpret when we're dealing with the FUSE response, though we're open to changing that. In our use case it didn't matter that we'd have to pass in an additional FD, since we do that at one level and then inherit the same directory structure underneath. If you needed a different mapping for every file, that would not be a very nice approach. Here's an example of two of the opcodes we have: as you can see, opendir and open are pretty similar, so there would be the possibility of combining things like this, although at least in your struct_ops implementation you can always use the same program for both when you're defining the structure. Paul has run some basic performance tests on this. This is using a RAM disk, so it very much exaggerates the differences we see; on a real system we'd be a lot closer to native performance and the difference would be much smaller. I don't know if Paul wants to comment. Yeah, this is a RAM disk. As Daniel says, it's a fairly slow processor by modern standards and a fairly fast RAM disk, so it's probably a worst-case scenario for I/O versus CPU. But this is what we've got. Basically, we're seeing significant slowdowns with plain FUSE. We see far smaller slowdowns when we actually use FUSE in real life, typically 10 to 20 percent, but in these exaggerated tests the slowdowns are bigger.
But the FUSE BPF gap is much smaller, even slightly faster on the first test, though that's probably noise. We still have a lot of things we're working on. We've been focusing on the use cases we have within Android. One of the big items on our list is that there's currently a difference in the context you're accessing things through. Naturally, if you're going through the FUSE daemon, you're running everything with the FUSE daemon's credentials, and we haven't yet done the correct credential mapping to do the same from our stacked file system. It's on the to-do list, but I was busy rewriting everything to use struct_ops, so I haven't gotten around to it. We plan to do something like grabbing the daemon credentials in the init response. There are still some opcodes we're not dealing with, like ioctls. Ioctls are very fun; I'm not quite sure how to deal with various things there. There are some cases where our pre- and post-filters are not fully hooked up. I haven't done that, since if we're going to change any of the formatting there, we'd prefer to do that work only once. Sorry, I was going to say, on the daemon credentials thing: io_uring had the exact same problem, and my understanding is the solution was to fork off a thread from the actual process, because it's too complicated to deal with all the credential stuff. I don't know if that's practical for the BPF thing, but since that's a problem io_uring very painfully already went through, I'd say go talk to them and see how they did it. No, I mean, I merged a generic API for this for 6.4. It's called user workers, which is a generalization of this concept. I'm all for using preexisting stuff there. One thing we found in Android is that any form of thread switching and worker queues really messes up latency, and then basically we can't ship it.
We've had huge trouble even with dm-verity because of its worker threads. And I think it's really pretty simple: we just need to do all the I/O in the context of the FUSE daemon. As far as I can tell, every time I think this through, that actually seems like a great security model, the correct security model. I'm very open to discussing it, because it seems too easy, but every time I think about it, it seems correct. The FUSE daemon will then return, of course, whatever it wants to user space, so we get a double check: the FUSE daemon has to be able to do it, and the user space app has to be able to do it. I think that's the right model. It's the model FUSE currently has, and it's the model we should reflect in FUSE BPF. We shouldn't be changing that, because it's tried and tested. Different question: I'm wondering about the pre- and post-filters in user space. Do you really need them? My thinking is, if you're already in user space with a pre-filter, haven't you already paid the performance penalty? You might as well do all the processing in user space. Well, we still save one transition there. If we go back to the, that was the wrong direction, going back to here, yeah: once we've gone out to user space, you would otherwise likely call back into the VFS layer through various syscalls. We do save that, though there's a lot less of a benefit. Right, right, but don't you probably have multiple syscalls there anyway? Because otherwise you wouldn't have put your pre-filter into user space in the first place. If you've gone to user space, you're probably doing expensive stuff, and you might as well stay there. One example we have is hiding information in files. What you'd probably do is use some sort of BPF map to say which bits of the file have to go.
Say after a read, you do a read from the file that gets the data into a buffer, and then you use a map to say: these are the areas we have to send to user space, hopefully very small areas. In our case, it's taking the location information out of a picture. User space is going to do the actual work of finding what it needs to modify and how, but only when you do a read from the EXIF header of the picture. So yes, for that particular read and that particular small part of the file, there's a high price to pay. But it's a picture, megabytes in size; for all the rest of the reads, you stay in the kernel completely. If you stay in the kernel completely, I have no objections, but then you're also using the BPF pre- and post-filters, right? What I'm asking about is the case where you do not use your BPF pre-filter, but you use your FUSE pre-filter, the user space one. Well, in this particular use case, you only go to the user-mode pre-filter when the BPF pre-filter has decided it's a good idea, which is a very small fraction of the time. Yeah, exactly. And you've already read the data. And you're not really trying to optimize that path; that's the rare path. Why, in that case, bother with a user space pre-filter? Why not just say: okay, this one rare request is handled completely in user space, not as a pre-filter going back to the kernel and then somewhere? Actually, the answer is: possibly that's a better idea, but mostly, you quite often haven't set up the necessary node IDs and so on to actually be able to call out to user space through the normal FUSE path. So it's very different; we've found it useful so far, and in the situation I just described, it works. Another situation is, actually, no, that one doesn't work; it's only the permission thing that Daniel was talking about.
In some cases we can't decide from the BPF whether we should show or hide the file. It's rare, but it happens; it's to do with multiple groups. In that case, you go to user space, and that's a bit more processing, but it happens at the same point in the flow, after everything else has happened. It actually makes the coding easier, is what I'm trying to say. But yes, you don't gain much performance. If every call ends up in user space, then this is pointless. I don't disagree. All right, so I think this is where I was confused yesterday, so I just want to make sure I'm on the right track. Your thing does two things. One is it sets up a mapping where you can attach a backing file for an existing open file: when you look up in FUSE, you say, here's a file descriptor I want to associate with it, and now all operations are going to go to the backing file system. And then you also did all this pre-filter, post-filter stuff for every single operation. So we've got two different concepts here: the actual pass-through part, which is the association, and then the pre-filter, post-filter thing. Yeah, those are the two things it's doing. I guess there are some cases where we're just using pass-through itself, and there's some limited use for that. For example, as opposed to a bind mount: if you're doing a move and you have two different backings on the same file system, we don't hit the EXDEV issue you'd get moving across a bind mount. Right, so what I'm thinking of is lazy loading: I intercept the lookup and pull the file in, and from then on I don't give a shit; I just want to use the backing thing, and the filtering stuff is just kind of extraneous.
Yeah, so in that case you would just set the backing file and not set any BPF program, and it would skip that entirely. Okay, so in that case there is no BPF? Yeah, in that case there is no BPF. Okay. During lookup you can change the BPF program: you can add one, change it, or remove it, and the same with the backing file. So in your case, during the lookup you see that the file you want is there, so you just say: no BPF, associate with this file, done. And then it will be very near native performance. Is lookup the only way I can attach something? I was thinking about it for ComposeFS, and that would basically attach something to everything, and I would rather do it at open than at lookup. So we've already had a complaint from Facebook about this, sorry, Meta, and we've realized that yes, we probably shouldn't actually be opening the file; at lookup, what we should be doing is associating the path. So we're going to change lookup. Currently we require an FD at lookup if you want to set it. Meta has requested that we make it so you can just pass in a path, and I suggested a path relative to an existing FD. The usual FD-plus-path, either one of which can be null, would be a nice solution: you pass in the FD of the current directory and the name of the thing you're looking up. We haven't done that yet, but that's what we're going to do. When do you actually get to lookup, given that we're working on atomic open? We had to do some dirty stuff for atomic open, didn't we? Yeah, currently we have to have an existing object when we're connecting, so we're generally doing that with the backing folder we want already set up and existing. We did look into setting up the linkage at, say, mkdir time if you're creating.
The question was specifically about atomic open, and the honest answer is that we haven't thought about it too much, but I think it can be done. The question is where your lookup is: whether it's a lookup from an open coming in or a lookup from a stat coming in. So where do you attach it? It's the lookup, the first lookup, when the dentry is created. Actually, one interesting point: even if the dentry is a negative dentry, we do it at that point. We didn't do that at first, and it worked so much better when we changed it so we actually attach the BPF at negative-dentry lookup time. It was weird how big an improvement that was. I'm just wondering: would it be possible to allow attaching during lookup and during open, or changing it later? Wouldn't that solve the problem? So, given that one of the requests when people have used pass-through was for changing the backing file, which is where this question is coming from: what we've done is stick our heads in a large bucket of sand. We're avoiding that question, because there are obviously a lot of questions about locks, about how we would protect things, about what happens to stuff in the cache. There's a whole slurry of questions that, quite simply, we are not ready to answer, and frankly we probably need the help of the experts in this room to answer them. If we're going to start allowing things to change, what does that do? And the answer is: I don't actually know. So at the moment I'm just saying you can't change it, because I can at least tell you what's going to happen if I say that, and we don't have any use cases yet where changing it would be that useful. If you don't change it at open, but you set something at open for the first time, would that have the same problems? So, I mean, don't forget the dentry and the inode are created at lookup time.
What if at lookup time you don't yet set any BPF program or backing FD; could you then do it later, at open, or would that still be hard? That would be impossible, because at open time all we have is the new inode. Well, I suppose we could go and dig about in the parent's inode, but the way we've engineered it is that the object you're looking at, the inode or dentry, whatever it is, is where we look to see whether it's got an associated backing inode or backing dentry. So at open time, in our current design, the BPF must already be in the inode. Can't you have a backing file for an open file and a backing dentry for a lookup? Two different things, different operations: you have I/O operations on a backing file, like Alessio's patches, and you have the directory-entry operations on a backing dentry. Why mix the two? I don't necessarily know the answer, but the simple logic I was going for was: we're going to have backing dentries for the dentries, backing inodes for the inodes, and backing files for the files. When they're created, we put those things in. Again, other things could be done, but that is simple; it's understandable, and you can work out what's going to happen. That's not to say people in this room can't be clever and do better things, but that's what we came up with, and it is at least a simple model. It would probably be more complicated if you did it at open time, for sure. And keeping the implementation simple, especially at the beginning, is probably a good idea. So let's not get too creative.
Just very quickly, we don't have to discuss it now, but basically, if the internal thing is going to be doing path resolutions and so on, you definitely want to make sure you support resolve flags. For ComposeFS it doesn't really matter because it's all a blob store, but imagine the classic trivial example: I want to open files using the same path user space gave me, and just mess around with a few other little things. If you don't support resolve flags, you're going to end up with: oh, and now I've managed to escape outside of the root that's supposed to be controlled. I'm just saying, make sure that's taken into consideration. I'm aware that we've skirted around the whole namespace problem, because Android doesn't use namespaces in this way, so we haven't really addressed it. And I'm saying even if you ignore the namespace problem: imagine you have an unprivileged process that cannot access /data on Android, or whatever. The FUSE daemon, I don't know what the security policy is, but let's imagine it can access /data without any restrictions, and it has /data/blah, where it's storing some magical thing that it gives access to, to the user space program that has access to FUSE. Now, as user space, through various tricky means, I get you to resolve dot-dot-slash sequences, and that gets passed into the BPF thing, and the BPF thing doesn't know what the path is meant to be. This is the classic extract-a-tar-archive-inside-a-directory problem, or the container image problem, things we've dealt with elsewhere.
So openat2 has resolve flags, which handle this in the lookup path. If you just have the resolve flags and pass them through, that in theory would be enough. There are probably other things we might want, but at least to me it seems resolve flags would be the minimum. And it's probably not complicated; it's one extra field. Yeah, we'll definitely want to add that. At the moment we just have the FDs, because that is very straightforward and all the resolution has already been done. So, there are some general issues we've run into doing this. One of them, which falls more on the FUSE end, is that when we're doing direct pass-through, we default to a node ID of zero. If for whatever reason you've done this and now, way down the line, you decide you want to call out to the user space daemon, you're in the situation of: ah yes, we have node ID zero; we want to do something with that, and libfuse will rightfully say no, we don't know what that is. Currently you could assign a node ID in BPF and have some communication layer there, but it would probably be better if we can come up with some standard way of doing this, or hand out some block of IDs that BPF would be able to assign, with some means for user space to learn more about them. That's something we haven't run into much, but other people might. Another minor thing: when I was setting up the struct_ops program, there isn't any existing module support there. So to get my patches working, I have a hacky arrangement that moves a lot of FUSE-specific stuff into BPF and adds a registration call-out when you register the FUSE module. I'm going to look into a proper ability to register a struct_ops type from a module. I don't know if there are any objections to that; from other talks, I think it's probably okay.
There's also an issue where at the moment we have a whole lot of struct_ops callbacks. When I was initially coding it up, I think I ended up at 63; the limit was 64, and the actual usable limit was 37. So at the moment I have a hacky patch that just allocates two pages instead of one so we don't run off the end of our trampoline. That's a patch that needs to get cleaned up. And occasionally we've been running into some limitations with dynptrs, although every time I pull in the latest changes, a lot of those seem to go away; some adjustments I made went in a few days ago. So over time that part is getting nicer and nicer, and it'll probably be particularly nice when we're dealing with readdir and we want to iterate over the returned entries and things. Yeah, and then we have plans for upstreaming. It's a very big patch set, currently 30-something patches, and I'm trying to arrange it in a way that makes it as easy as possible to review. Currently I have mostly the pass-through patches up front, and then the BPF changes come later. I don't know if anyone has any thoughts on what good partial steps would be. So this was a suggestion I was going to make: make it essentially two separate patch sets, the backing stuff first and then the BPF next. I tried to look at it last night and I passed the fuck out, because it's 37 patches, and I was looking at it kind of confused. This is why I asked the question: I got confused because these are two very separate concepts. I recognize that as a project it's one thing, but it was hard for me to grasp initially, and seeing it as two different things would help at least me approach both of them with the right frame of mind. Yeah, I guess part of our hang-up with that is just that we don't see as much use for the pure pass-through version of it.
But it is a good intermediate point, I do agree. Right, I think that's the thing: I agree it's probably not as useful on its own, but it is for merging, and it's not to say they can't both be merged at the same time, just that from a review standpoint it's easier for me. But of course now I know, so, whatever. So, very naive question: why is pass-through alone useful? I don't understand why it's so useful; maybe we'll take it offline. It's very useful. Currently, if you have a FUSE server, you get a request for, say, lookup, say open, and you handle everything: you get read, you get write. I'm talking about the old patch set. Now you can, on open for a specific file, reply with: pass through all the I/O on this file, and it's done. You can give the same answer on lookup of a file or a directory, which is more useful, and then pass through every lookup, readdir, whatever operation. So it's very useful; we have a use case for it, and it's very useful. And you need to follow the FUSE object model. The FUSE protocol has inodes, node IDs, and it has file handles, files, and the operations are either file operations or inode operations. So you need two different objects to pass through; I don't see another way. I guess my feeling is slightly different; not that I'm disagreeing, though I am slightly. When people were proposing to extend file pass-through to directory pass-through, I was kind of like: not without some sort of ability to filter that. And actually, let's give some credit here: there was a guy who presented ExtFUSE at Plumbers in 2019, I think, and his talk, although we didn't use any of his code, inspired us to think about this. When we heard that talk, and we were asked for directory pass-through, I put the two ideas together, and N years and one pandemic later, this is what you get.
Yes, I agree that directory pass-through is maybe the part I can't imagine being useful without some sort of filtering. But file pass-through is very useful, so maybe you should start by merging that: just file pass-through, then add directory pass-through, and then BPF. I don't see why directory pass-through is not useful. I mean, some directories are accessed natively, readdir and create operations done natively, and others are not; you need to go to a server. I don't see why that's wrong. Yeah, so, I mean, no, it doesn't, just, yes: the FUSE lookup, not the FUSE BPF lookup, can also set the thing. The only problem is you end up with a situation where a tiny little requirement in one file deep in a directory tree means the whole directory pays the performance penalty of staying in FUSE. And that's not okay; that's exactly the reason we started down this path. Yeah, so I'm not arguing that none of this is useful; I think all of it together is very useful. I'm mostly just saying that, for my tiny brain, I had a hard time separating the two concepts. It's not an issue, because now I understand and now I'll know, but I think there's a logical split there that might be helpful for the maintainer; otherwise, I think it's fine the way it is. There is one other fringe benefit of just the pass-through part: if you have two different pass-through directories backed by the same file system, when you do a move, they're actually the same file system, so we don't run into EXDEV trying to move between them, which I guess you would with a bind mount. Let's go have a break. Oh, I guess one other thing I was going to throw out, if we have a moment, one thing we've been considering changing: currently we have a pre-filter and a post-filter around our backing call.
It would be kind of nice to have just one BPF program there, which would then call some kfunc-like thing to do the backing call in between, so you could handle it all at once. That's something we're thinking about, but I'm a little scared of verifying that and of making sure people can't do the wrong thing within it. All right, thank you very much. So now we have a 20-minute coffee break, and afterwards there's some follow-up on the BPF side. See you in a bit.