Let's move on to the next half session. Miklos described the problem last year, maybe give the mic to you; David also talked about it: how to monitor a very large number of mounts in a mount namespace, when mounts go away and mounts are created, how to monitor this efficiently. [inaudible] There is already an unmount event. [inaudible] We could use marks, a mark on a mount; if it's a mount, it has an inode. I think you can listen to all the traffic. [inaudible] I did it, and the way I did it is kind of like how Windows does it: because everything is addressed, there are no hard links, you don't need to consider a hard-link-type thing, so a path is almost an exact descriptor of a mount, except that with paths you can have multiple mounts with the same path. So I just said: you put a watch there, you say "everything under this path", and anything that happens under that path you get a notification for, or a mount. So basically a mount tree is what you need to watch, but you place it on a namespace, on your namespace; the path is just a filter. Each namespace has a separate mount tree, so watching mounts is slightly different, because you need information about which mount changed, so you need the mount ID, for example. But the problem with the mount ID is that mount IDs are reused whenever a mount is...
So I added an extra 64-bit mount ID. Well, it will eventually get reused, but your computer will probably be retired first. When you say added, you mean RFC patches? Well, yeah, I've posted patches for it, it works. Yeah, but your patches are not for fanotify, they're for watch_queue. So it's the same concept, but with a different interface. Well, it produces notifications; we could make the notifications come out somewhere else. I think the notification stuff and, for example, the fsinfo stuff should have been decoupled in that patch set, and in your patch set it was like one big thing, right? The problem is that the notification queue has a limit, and you can overrun the limit. So if you move a whole mount tree, or if a whole load of people do mounts, it will generate a notification for each of them, and if you're not quick enough, the notification queue in the kernel will overrun. I don't think we need ring-buffer-type efficiency for this use case, so fanotify is less limited in that respect. Well, it still has a limit. Yeah. And if you overrun the limit, you then have to find some way to find all the... So with fanotify you can have two types of queues. One is in principle unlimited, though obviously there is a limit somewhere, and then the allocation will basically block until it can succeed. So you will block the mount until you can generate a notification message, and then you cannot lose events; that's one way you can set up fanotify. The other is that you simply drop the event if you cannot accommodate it because the queue is already full, but you generate a special type of event saying that information was lost. Yes, the reason I added the fsinfo thing was so that you had a faster way of finding out what had changed, because otherwise you end up parsing /proc/self/mountinfo. So we... We're too slow. We also have the eventual user of this API here; I made him come along. How performance-sensitive is this actually for you?
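The two fanotify-style queue setups described here, blocking the producer until the event can be queued versus dropping the event and queuing a single overflow marker, can be sketched with a toy queue. The class and names below are purely illustrative (a stand-in model, not the kernel API); `Q_OVERFLOW` mimics the role of fanotify's overflow event.

```python
from collections import deque

class EventQueue:
    """Toy notification queue illustrating the two overflow policies
    discussed above: 'block' (producer waits until there is room) and
    'drop' (event is discarded, one overflow marker is queued)."""

    OVERFLOW = "Q_OVERFLOW"  # stands in for the kernel's overflow event

    def __init__(self, limit, policy="drop"):
        self.limit = limit
        self.policy = policy
        self.events = deque()
        self.producer_blocked = False

    def push(self, event):
        if len(self.events) < self.limit:
            self.events.append(event)
            return True
        if self.policy == "block":
            # In the kernel this would block the mount operation until
            # the allocation succeeds; here we just record that fact.
            self.producer_blocked = True
            return False
        # Drop policy: discard the event, but queue one overflow marker
        # (think of it as a reserved slot) so the consumer knows that
        # information was lost and must resynchronize.
        if not self.events or self.events[-1] != self.OVERFLOW:
            self.events.append(self.OVERFLOW)
        return False
```

With the drop policy the consumer sees at most one overflow marker per burst and knows it has to rescan; with the block policy no events are lost, at the cost of stalling whoever is doing the mounts.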
I don't know. Right now we parse /proc/self/mountinfo, and that, of course, is terrible. But for me, if events are dropped, that's fine, as long as we have a way to figure out later what happened in the meantime. So we want an event that tells us very clearly that events were dropped, and we want an API that gives us this information on our own terms afterwards, which is better than /proc/self/mountinfo. So I'm actually... I mean, Amir knows this, I guess. I'm actually very interested in having an API that tells me, for a mount, the immediate children of that specific mount, and gives me events when immediate children come and go. That's probably all I really would like to have. Because that's what I did in fsinfo. Ian Kent, I don't think he's here, implemented the thing for systemd to use fsinfo, and apparently it performed quite a bit better. But yeah, I mean, I'm totally on board if we can get a new API. Right now what we have to do is really horrible, because in large Kubernetes installations we get like millions of events and we can't keep up; we're just busy dealing with that stuff. Does it matter if it's fanotify? I don't really care. I mean, to me it would be natural... fanotify sounds great to me because it has the queue handling already and things like that, so we're not reinventing it. I don't know, I've never looked at, what's your thing called again? The watch_queue. Yeah, the watch_queue. I don't know what it has for... You know what I like about fanotify? I think it can even coalesce events, right? Kind of, so that it doesn't drop them. Which events? Can it coalesce events subtree-wise? Can it not do that? I don't know. I think it was discussed... Coalesce events?
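For reference, the interim approach being criticized here, parsing /proc/self/mountinfo to answer the "immediate children of this mount" question, can be sketched roughly as follows. Field positions follow the mountinfo format described in proc(5): field 1 is the mount ID, field 2 the parent mount ID, field 5 the mount point. The helper names are made up for illustration.

```python
def parse_mountinfo(text):
    """Parse /proc/self/mountinfo content into a list of
    (mount_id, parent_id, mount_point) tuples."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip malformed lines
        mounts.append((int(fields[0]), int(fields[1]), fields[4]))
    return mounts

def immediate_children(mounts, parent_id):
    """Return the mount points mounted directly under the given mount,
    i.e. the 'immediate children' systemd wants an API for."""
    return [mp for (mid, pid, mp) in mounts if pid == parent_id]
```

On a live system you would feed it `open("/proc/self/mountinfo").read()`; the pain point in the discussion is that this full parse has to be redone from scratch on every change.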
Yeah, so basically: if it drops events somewhere, then instead of replacing them with a single bit of information that tells me some information was lost, it would rather replace a couple of these events with a new event that just tells me something got lost below this subtree. It doesn't have that; it only coalesces events with the exact same, or very, very similar, information. But, for example, you could do something like that. I mean, it's pretty much equivalent to watch a namespace and to recursively watch the direct children of a mount; you can sort of do one with the other. For fanotify, if you had to do mount events for direct children, it would be pretty simple. Because one limitation, at least, of the current API is that it needs a path as the descriptor of the object. So it's easy to give a path describing a mount or a filesystem or an inode. But if you're watching a mount namespace, maybe it's hidden, maybe it doesn't have a path yet, so there are complications. But if you're watching direct children, you can always set up a watch: let me know if something is mounted on top of that mount. It's pretty easy. And then what information you get in the event could be configurable. So maybe you'd get the inode where it was mounted, or other things. Or maybe you get nothing; you just get an event that something was mounted. And this would be coalesced, so multiple events of "something was mounted" collapse. What I would like to have is an O_PATH file descriptor simply as part of the event, for the child. You can have an O_PATH FD or a filesystem file handle where you can get it, but of what? Of the mount point? Of the mount point, yeah. You can get it or not get it; if you don't, it's easier to coalesce many events. So: something has happened to the direct children, and then you go and probe what happened. I may have a silly question, but what's the actual use case for this? What are you guys intending to do with this info?
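The coalescing idea sketched in this exchange, where payload-free events ("something was mounted under mount X, go probe") can be merged while payload-carrying events cannot, could look like the following toy function. This is a hypothetical illustration of the proposal, not an existing kernel interface.

```python
def coalesce(events):
    """Collapse a burst of (parent_mount_id, detail) events into one
    payload-free 'changed, go probe' event per parent mount, keeping
    first-seen order. This is only possible because the coalesced
    event carries no per-mount payload (no O_PATH fd, no file handle);
    the consumer re-enumerates the children afterwards."""
    seen = []
    for parent_id, _detail in events:
        if parent_id not in seen:
            seen.append(parent_id)
    return [(parent_id, "changed") for parent_id in seen]
```

The trade-off discussed above is exactly this: the richer the event payload, the harder it is to coalesce without losing information.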
In systemd, currently, the boot-up process is to a large degree starting services, waiting until things happen, then starting the next services. And a good part of that is setting up mounts. So we can do stuff like: we start that service during late boot once /var/mysql or something has been mounted. And so we need these notifications. We use this heavily for individual services, and people use it heavily: they run app so-and-so, and immediately before starting app so-and-so they mount something, and because it's a dependency tree it gets pulled in ahead of that service, so at the point the service runs the mount is there, and when the service dies it gets automatically removed, unless somebody else also has a dependency on it, and things like that. So for that we need to track exactly what's going on in the system, and right now we always do that by parsing /proc/self/mountinfo, and it's just terrible. I don't know how I would describe the event of a tucked mount. I'm not allowed to call it that anymore; I was instructed to call it "beneath" by friendly people on social media who pointed out that the word also has other meanings I wasn't aware of. And also the danger of mistyping, like with io_uring before it was merged. By the way, just to explain why I think that having this per mount point would be really nice, so that we can watch the immediate child mount points instead of having something system-wide: it's mostly because in systemd we generally want to watch only a subset of the entire tree. Things that show up below /sys, for instance, we generally do not care about; they're API filesystems, we never wait for them, they're just there, basically. And that's why, with the system-wide stuff like /proc/self/mountinfo, we mostly ignore the events we get from there.
So for me, the thing that I find nice about the recursive approach is that, even though it would be a little more work for us to manage, we could very specifically select the subtrees that we care about and ignore the subtrees we don't. Yeah, but that brings us back to the inotify-style recursive watch, which is racy; we would need to see whether we can implement that without races. When a child mount is created, you need to automatically add a watch, or something like that. Yeah, the problem is that if you are watching just whether something gets mounted under this mount, then you learn that something got mounted, but then you have to place the watch on the new thing. Yeah, but that's fine with me. I would just treat that as a lost mount event, which I have to handle anyway if the queue overruns, right? It's no different from the queue-overrun case. Yeah, but then you would have to do a scan of /proc/self/mountinfo. Yeah, I don't want that. I mean, that's the thing: I definitely always want the notifications plus an API that gives me the immediate children, so that I can always catch up nicely. And the API for getting the immediate children, I don't know, maybe that's fsinfo, is for me the other side of the medal. Having just fanotify without an API for getting the immediate children of a mount wouldn't be useless, but... You get a notification for a mount event happening, or an unmount event happening, and then you could pass this FD that you, for example, get, to some sort of system call or whatever, in what form, and that gives you the list of child mounts, identified by what? So that's the thing, I think, and I think you wrote this on the list: we need a unique mount ID. When a mount is created, I guess, the notification needs to contain the ID of the new mount.
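The "treat it like a lost event and catch up" strategy described here amounts to re-enumerating the immediate children and diffing against the last known set. A hypothetical sketch (names are made up); note that this is exactly where reused mount IDs bite, since a recycled ID would make the diff silently wrong, which is part of the argument for a never-reused 64-bit ID:

```python
def reconcile(known, current):
    """Given the previously known set of child mount IDs and a freshly
    enumerated one (e.g. after a queue overrun or a missed event),
    synthesize the mount/umount events that were lost.

    Assumes IDs are never reused; a recycled ID would cancel out a
    umount+mount pair and the change would go unnoticed."""
    added = sorted(current - known)
    removed = sorted(known - current)
    events = [("mount", mid) for mid in added]
    events += [("umount", mid) for mid in removed]
    return events
```

After reconciling, the watcher would place new watches on the added children and carry on, which is the race-free replacement for the inotify-style recursive watch being discussed.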
I'm not sure how the mounting-underneath thing could be solved, because if somebody immediately mounts over the top of it, you can't get at the mount underneath unless you can address it by some ID rather than by path. For the systemd use case I don't care, because we only care about the topmost mounts. Okay, I see what you mean. If you have the mount ID, we need a way to query by mount ID, but that shouldn't be a problem. No, I think the problem is currently the way it works, at least for mounts: if you have an FD, that FD refers to a dentry/mount pair, and it is fixed. That struct path usually doesn't change, but as soon as you enter the mount system call world, the first thing we do is lock_mount, and then we walk up from that struct path to the topmost mount. So your FD technically refers to something else, or can refer to something else, by the time the actual mount operation is performed, if you have mounts on top of it. I mean, what I would ideally want is a file descriptor to the mount itself, a mount FD, so that it's clear what it is. At some point, though Christian may beat me to it, I want to make it possible to stick something in to intercept mount calls that are being made and act on them, without the network being involved on the system. Even though it's a digression: we're giving a presentation tomorrow on how we do unprivileged mounting right now, but it would be nice if we had something like this. But upcalls to user space are a giant pain, usually not something that we... stuff like the usermode helper, or also the keyring stuff. Pretty untrustable. Christian is absolutely right: usermode helpers need to die. They're terrible. They bypass the init system, so they don't get security policy applied, they don't get resource management applied, and they live in their own environment. Are you staying for the security summit? Because there are some people there who disagree with you. I'm staying around, yeah.
But I mean, I know a lot of people who agree with me on this too. Yeah, it's just terrible, because they don't have security settings at all; they don't inherit anything from anywhere. FUSE? What's with FUSE? Upcalls. Upcalls in that sense, yes. Anyway, that's a different discussion; it's one of my personal things that I really want to see them go. So, the thing I was saying about O_PATH is: okay, if somebody were to mount on top of it, then you would get some other behavior, but that's just part and parcel of how mount works with everything. If the idea is that you want to get the child mounts or something like that, or you want to do some fsinfo call or whatever that ends up being, effectively some way of saying "I have a thing, please give me the info about this particular mount, for this particular FD", then that presumably would not be translated, because it wouldn't make sense to translate it in that case. Or would it have to be translated? No, it wouldn't have to be, no. So it's just, if you wanted to, I guess, umount? If you wanted to, yes, umount. Or, yeah, mount and umount, because, for example, the mount-property-changing code also doesn't look up the topmost mount usually. So it's really just adding or removing a mount; that goes through the whole mount tree, everything else doesn't. And it makes sense, at least from the times that I had to be in this code: if you were to try to umount in the middle of a mount tree, you would have all sorts of issues, because you would need to reparent mounts and so on, and decide which mounts survive, and it's pretty messy to do cleanly. The way propagation works, your propagation starts from the parent of the mount that you're trying to unmount.
So the reason I'm bringing the O_PATH thing up is because, okay, we're talking about having mount IDs which we hope won't overflow, but might at some point in the future on a system that's been up for 35 years or something, right? Because even with a 64-bit mount ID, the thing is that if you used something like kcmp(), you'd be able to compare two file descriptors. systemd would have to open like 5,000 files for this to work, but you would be able to check it that way, right? To check whether the two things are the same. Incrementing once a second, or once a nanosecond? You will only overflow it in about 500 years. So I don't think you're in any... Yeah, sorry, yeah. It is 64-bit, yeah, sorry. No, you're right, you're right, sorry. My bad. 64-bit all the way. The current mount IDs are a problem because they're not even allocated cyclically, like PIDs are: if you do mount --bind and you get 364, and you umount and immediately mount again, you get the same ID, in case no one in another namespace beat you to allocating it. Can't you just extend them to 64-bit? I mean, they're not exposed in binary form to user space, so you can just... But then you'd suddenly have 64-bit numbers in /proc/self/mountinfo. Possibly. Yeah, that doesn't seem like a regression. But it is exposed in name_to_handle_at, actually. Around this, I proposed that we have another AT flag to get the mount ID. I take back my idea; it's not this... What am I supposed to do with that info? The mount ID is exposed in the system call as a system call argument, even, right? In open_by_handle_at... Yeah, but honestly, open_by_handle_at2, whatever. I mean... Huh? Or an AT flag, but I mean... I think in terms of adding new system calls, it's much easier than this.
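The back-of-the-envelope claim above is easy to verify: a 64-bit counter incremented once per nanosecond wraps after roughly 584 years, while a 32-bit one at the same rate wraps in seconds, which is why the current small IDs recycle so quickly. The helper name is just for illustration.

```python
def years_to_wrap(bits, increments_per_second):
    """Return how many years until a `bits`-wide counter wraps when
    incremented at the given rate."""
    seconds = 2 ** bits / increments_per_second
    return seconds / (365 * 24 * 3600)

# 64-bit counter at one increment per nanosecond: several centuries.
# 32-bit counter at the same rate: well under a second per wrap.
```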
Honestly, I take back my thing, because, for example, in systemd we parse that, and we know it's an int, so we parse it as an int, so you cannot... like, stupid idea of mine, ignore me. You have to have something different, because otherwise you will break user space even though you exported it as a text file: we actually looked into the kernel to see what type it uses internally, and then we used the same type. That's one of the reasons I added a new value that was independent of the old one, because the old one is recycled, it's too small, and people assume it's that small. Honestly, you should all just use UUIDs and the problem goes away entirely. I'm serious. Is that the reason why you put the mount ID in statx, but then it returns the old one? statx returns a 64-bit value, but it returns the old small value. So it's an AT flag. Do you have coffee?