Good to be back. Wow, that seems loud. And of course I have to do the obligatory smile. Okay, thank you. Hopefully that came out well. I liked the old days when you had the crank Kodak cameras and you didn't know whether you got the right picture for months. Anyway, you're probably wondering what a tracing guy is doing at a file system summit. Why am I here? I'm here to talk about dynamically allocated pseudo file systems. I'll start off with a problem statement: pseudo file systems are created with dentries and inodes that are seldom accessed. There are thousands of them just sitting there idle, wasting memory. It would be much more memory efficient to allocate them only when we need them, just in time. So on one of my machines I got curious: how many files are there in these file systems? First, how many files are there in /sys? 45,000, with 5,922 directories. How many in the proc file system? Of course, procfs expands with the number of tasks you have, and this is just a normal machine, a basic server I run. It had 227,000, actually closer to 228,000 files, with about 20,000 directories. Then I looked at my tracing file systems. I'm using find with the -mount option, so it only counts the file system it's actually looking at. Tracefs typically has 15,000 files and 2,600 directories. Debugfs is actually quite small on this machine; well, it's my server, so I don't have that much debugging turned on: 1,695 files and 398 directories. In total, 320,191 files and directories. Why do I care? Well, Alexei actually has to do with this. He's not here, but he was doing something with trace_printk, and I said, why don't you just use instances of the tracefs file system?
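For reference, counts like these come from something along the lines of find /sys -mount -type f | wc -l. A rough C equivalent using nftw(3) is sketched below; the FTW_MOUNT flag corresponds to find's -mount, keeping the walk on a single file system, which matters under /sys and /proc where other things get mounted on top. The function names are my own.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Running totals filled in by the tree walk. */
static long nr_files, nr_dirs;

static int count_cb(const char *path, const struct stat *sb,
                    int type, struct FTW *ftw)
{
    (void)path; (void)sb; (void)ftw;
    if (type == FTW_D || type == FTW_DP)
        nr_dirs++;          /* directory */
    else
        nr_files++;         /* file, symlink, or unreadable entry */
    return 0;               /* keep walking */
}

/* Count files and directories under root, staying on one file system. */
static int count_tree(const char *root)
{
    nr_files = nr_dirs = 0;
    return nftw(root, count_cb, 32, FTW_PHYS | FTW_MOUNT);
}
```

Pointing count_tree() at /sys or /proc reproduces the kinds of numbers quoted above, though the exact counts depend on the machine and how many tasks are running.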
And I didn't like the way he was using trace_printk, so I asked him to do it another way. He came back and said, we analyzed it, and whenever we create an instance it takes a lot of memory; we can't afford that. Your instances are very heavyweight. I mean, the ring buffer alone is 1.4 megabytes per CPU. And then there's this increase in files, so I said, okay, let me take a look at this. How many files are there in my directory? When you create an instance, by doing mkdir instances/foo, it makes a duplicate of the top level so you can enable events. All the event directories within an instance are duplicated so you can enable them, or set filters or triggers or whatever you want, on events in that instance without affecting the top level or other instances. So every time you create an instance, it creates a whole duplicate of the eventfs file system. Yes? Oh, when I had my other laptop I could actually do a demo, but I can't on this one; this is my work laptop. Basically, an instance is a way of creating another ring buffer within the Linux kernel. If you want a separate ring buffer to do something different, for example sched switch tracing, you create one. I think rasdaemon creates a separate instance to do recording that doesn't get affected by other tracing. We do this in ChromeOS: we'll create an instance to do some debugging, whatever specific tracing we want that doesn't affect anything else, so we only get our own trace events. We create an instance, enable events, record. It's a separate ring buffer; that's basically what it is. So it's a way to do multiple ring buffers, but it gets expensive. On the kernel I was using, the average size of a dentry is 192 bytes, and the average size of an inode is 624 bytes.
And if you add all that up, with a dentry and an inode for every single file, that's 14 megabytes of just dentries and inodes. 14 megabytes every time you create an instance. That's a lot of memory for something I don't think needs to be there. So then I was curious: what about all the other pseudo file systems out there? I looked at sysfs; that's 42 megabytes of dentries and inodes. And for procfs, since I had a lot of files, that's 202 megabytes of memory. To be fair to the proc file system, it doesn't actually keep dentries and inodes around, does it? It has its own little structure that it expands. It creates a dentry and inode just in time when you want them, and then gets rid of them again. Okay, how does it do that? There's a proc_dir_entry structure, and as soon as you let go of the dentries, the dput path just frees them. Okay, how do we implement that? My talk could end right now if I could use that for tracefs. Probably. Well, the problem with tracefs, I think, is all the attribute tables we've got all over the place. Right, and it's not just tracefs; debugfs and sysfs and configfs have all these attribute tables too. Ted wants to talk; I see him getting antsy. So proc does this, but it's done as a proc-specific hack as opposed to something other file systems could use. So if your question is, why isn't there a generic version of this, the answer is that it was done once for proc and no one ever thought to generalize it. Yeah, I see. I think it could still be generalized. Basically you've got a tree, and you can just have more than one tree: this one's for procfs, this one's for that, this one's for that. Chris?
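As a back-of-the-envelope check of the numbers above, the per-object sizes quoted in the talk (192 bytes per dentry, 624 bytes per inode; these vary by kernel config and architecture) multiplied by the roughly 17,600 files and directories in tracefs do come out around 14 megabytes:

```c
#include <assert.h>

/* Approximate sizes quoted in the talk; kernel- and arch-dependent. */
#define DENTRY_SIZE 192UL   /* bytes per struct dentry */
#define INODE_SIZE  624UL   /* bytes per struct inode  */

/* Rough cost, in bytes, of pre-allocating a dentry + inode pair for
 * every file and directory in a pseudo file system. */
static unsigned long pseudo_fs_cost(unsigned long nr_entries)
{
    return nr_entries * (DENTRY_SIZE + INODE_SIZE);
}
```

By the same arithmetic, the 320,191 files and directories counted across all the pseudo file systems come to roughly 260 megabytes, consistent with the 42 MB sysfs and 202 MB procfs figures above.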
So when you put up the slide before this one, I hopped onto a production machine and did a find on /proc to see how big it was. It took until Ted started talking to finish. There were around 31 million files in /proc. Yeah, so there's a hack in procfs for this. So in other words, I want that hack. Right, but the machine was pegged at 100% the entire time we've been talking, so it might not be the right hack. What we did in production is we switched everything that looked at /proc, everything doing the equivalent of atop or whatever else, over to BPF iterators. There's something I also wanted to talk about around that, but I don't have it in my slides. Christian, go ahead. Yeah, I just wanted to say, tracefs is a separate file system, right? Yes. I mean, you could probably implement the proc hack in there. The thing is, do we want to implement a hack? Yes, that's the point: I don't want a hack. I want a proper solution here that not just tracefs could use, but debugfs or whatever-fs we have. I don't know what other pseudo file systems there are, but we should have a generic way to do this; any pseudo file system should be just in time. There's no need for all this. Let me jump ahead real quick. Ajay presented this at Plumbers; he had an eventfs. Ajay's not here; he's a VMware person. And when I left VMware, I said, please continue, because he was kind of doing the project for me and asked whether he should stop. I said, please continue if you can. He just sent me the patches on Sunday, and that's why I've been hacking: I applied them on my virtual machine and did testing with and without eventfs. This is just the event file system; this isn't even the rest of the tracefs files.
There are a bunch of tracefs files outside of eventfs, but eventfs is the biggest part, so I just used that. It allocates everything dynamically. It does have a hack just for eventfs, not for the rest, and I want to explain one of the reasons. When I brought this up at Plumbers, Ted was actually one of the people who said, hey, we should do a generic thing for everyone. And that's why I'm here today. Again, what's the best way to do this? We need a good internal API, not another fricking hack. And I'm trying to stay within the code of conduct in how I express that. Right now I could just use the current eventfs code; he hasn't sent the patches to mainline yet because he wanted me to review them first, and like I said, I just got them on Sunday. So I did this little experiment. I did a cat of /proc/slabinfo and /proc/meminfo, then did mkdir instances/foo, and then diffed the before and after to see what it looked like. Without eventfs, this is the meminfo. You'll see, okay, I just did my taxes, so anything in parentheses means a negative number. This is in kilobytes, so that's 14 megs of free memory gone right there; it shows it right here. You can see the slab numbers at the bottom, 10 megs to 11 megs before and after, just from doing that mkdir. It's kind of big. With the eventfs code added, if you look here, the numbers are much smaller. A huge, huge difference. We got about a megabyte extra in slab. There is internal state that still has to be there; we can't get rid of that, but that's only one meg. Instead of 11 megs it's one meg, so almost a factor of 10 decrease. The meminfo before is, again, about 10 megs and so on.
And then comparing the before and after slabinfo, whoops, ignore that last line, I was supposed to delete that slide. You'll see differences of around 12 megabytes between the two in the slab numbers. So how many objects are we dividing this number by? Is it one object this big, or a million objects this big? That's the thing. We could split up eventfs, because there's one structure per event that we care about. But we still need to allocate state, because every event has its own state per instance, and that can't be done just in time. Actually, we could kind of do it just in time: there are ways of saying, allocate this only if they're going to enable it. When something gets enabled, we create the state for it. So we could do it just in time, but right now we're focusing on the dentries and inodes. We could improve this even more by allocating the actual state objects when you enable the event. Which comes back to something else I wanted to discuss: there are some things that maybe we don't want to free, and we need an API to say so. For example, I don't want the dentry for disabling an event to be just-in-time allocated. Like I said, everything is allocated when you look at it, and when you're not looking at it, it gets freed. My fear is that something like memory contention happens, you want to disable events, you go to stop the event, you run into a memory problem, you can't allocate, and now you can't stop the event. So I figured we also need an API to say: keep this in memory, don't free it until we say it's okay to do so. A pool? What, a pool? Yeah, you mean an emergency pool?
Or yeah, have an emergency pool you can pull from, to make sure that if you want to stop events, you can say, stop this event, and it will work. That would work too; having an emergency pool is a good idea. Anyone taking notes? Maybe a stupid question: what API is tracefs built on? Is it built on the sysfs APIs? It has its own API; there's tracefs_create_file, tracefs_create_dir. Okay, so it's a completely separate thing. I kind of copied debugfs and then stripped it down, because that's what Greg told me to do. The whole reason tracefs exists is that a lot of people were telling me, we want tracing on our production system, but we don't want debugfs on it; can you please separate the two? So we created tracefs so that people could leave debugfs out of their build and still have tracefs. I was just wondering, because otherwise, if you used the sysfs or kernfs APIs, one of those would probably have to learn this feature. But since it's separate, you could have your own API. The thing I was scared of, or thinking about, is plumbing this into something that's usable by both proc and tracefs. That sounds interesting, because proc is so special. Right. So what if we do something in tracefs, have that be the guinea pig or whatever, start building up some sort of generic way of doing this, and then slowly maybe move sysfs to it, and maybe procfs? Procfs sounds very, very special, I don't know. But I guess Chris's complaint was that because everything is just in time, and you have thousands of files, reading procfs is going to slow everything down. Well, if you want to test it thoroughly, putting it in procfs would be one good way to do that; you'd know really quickly if it started going wrong. Yeah.
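The two ideas in this exchange, keeping certain control files resident and backing allocation with an emergency reserve, could be sketched roughly as follows. This is a userspace toy, not kernel code; every name here is made up, and the reserve is loosely modeled on the kernel's mempool concept:

```c
#include <assert.h>
#include <stdlib.h>

/* A just-in-time node: the tiny descriptor is permanent, while the
 * expensive part (standing in for the dentry/inode pair) is created on
 * lookup and freed on last release, unless the node is pinned, like an
 * "enable" file you must be able to write under memory pressure. */
struct jit_node {
    const char *name;
    int refcount;
    int pinned;          /* never free the materialized object */
    void *materialized;  /* NULL until first lookup */
};

/* Emergency reserve of preallocated objects. */
#define POOL_SLOTS 8
static void *pool[POOL_SLOTS];
static int pool_top;

static int pool_init(size_t objsize)
{
    while (pool_top < POOL_SLOTS) {
        pool[pool_top] = malloc(objsize);
        if (!pool[pool_top])
            return -1;
        pool_top++;
    }
    return 0;
}

/* "allocator_failed" stands in for a real out-of-memory condition. */
static void *alloc_obj(size_t objsize, int allocator_failed)
{
    void *p = allocator_failed ? NULL : calloc(1, objsize);

    if (!p && pool_top > 0)
        p = pool[--pool_top];   /* fall back to the reserve */
    return p;
}

static void *jit_lookup(struct jit_node *n, int allocator_failed)
{
    if (!n->materialized)
        n->materialized = alloc_obj(816, allocator_failed);
    if (n->materialized)
        n->refcount++;
    return n->materialized;
}

static void jit_release(struct jit_node *n)
{
    if (--n->refcount > 0 || n->pinned)
        return;              /* pinned control files stay resident */
    free(n->materialized);
    n->materialized = NULL;
}
```

The point of the sketch is the combination: ordinary event files come and go with lookup and release, while the disable path can always make progress, either because its node was pinned or because the reserve satisfies the allocation.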
So that's basically what I came here to say: I want to move this forward. We brought this up at Plumbers; at first we just wanted to make sure we were doing eventfs correctly, and the feedback we got from the VFS folks was that this should be more generic. It shouldn't just be for you, because it sounds like it could be useful elsewhere. And that's why I'm here, to ask: how can I go about making this useful, and not just for tracefs? In fact, we created eventfs as kind of a separate file system within tracefs. Right now it's focused just on events, because events have a separate structure, but I think we actually want to bring that up and do it for all of tracefs. And I guess the reason we didn't is that there are control files in there that we were afraid of allocating on the fly. Well, at least I'll let people know what's going on, so if you see patches from me, don't crucify me too badly. Any comments? Did you ask Greg's opinion on this? Because debugfs has this node concept, right, and you need to generalize the concept of an internal node; then the lookup creates the actual dentry from the node tree. That's the general approach. Are you saying to ask Greg to do it for debugfs too? Yeah, just for advice; just consult him about the general approach. Tracefs is outside of debugfs now, so they're not related. But debugfs could benefit from this as well. Of course, if you noticed, at least on my machine, I didn't have many debug things enabled, and there aren't actually many files and directories in debugfs, which surprised me.
This might be me misremembering things, but debugfs, or sysfs in general, I think, has this system where when you create a new entry or a directory, you essentially pin it, and you pin the whole file system with it; for it to go away, you need to call sysfs remove or whatever it is. So the infrastructure is based on the assumption that you, the caller, are responsible for cleaning this up. That's deeply enshrined in sysfs, so it might be a lot of work to actually plumb this in there. Well, yeah, because the idea I have isn't just to clean things up. Is it just the dentries and inodes doing all that pinning? Because sysfs might also be questionable here. It's a bit weird, but we have a couple of file systems like this; securityfs, for example, works the same way: the dentry pins the whole file system, and it's really weird. Oh yeah, that's the underlying thing, yeah. Oh, and there's one more thing I forgot to mention: race conditions. I was reviewing some of the code, and I think there are going to be a lot of issues with circular locking, grabbing the parent locks and so on, because when something's not used, you've got to free it. When do you free it? How do you free it? Obviously you have parent and child nodes; there's a lot of fun stuff there. That's why having something generic matters. I think this has all been pretty well sorted in procfs, because processes appear and disappear all the time, so entries have to be added and taken away, and they'll be taken away while you're using them. But I'm assuming that in procfs, when the task disappears, you just destroy everything from the top down. Here, this could happen anywhere in the tree.
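The teardown-ordering worry raised here, that an entry can be freed from anywhere in the tree while its parent is still live, is usually handled with reference counting rather than lock ordering. A minimal sketch, with entirely hypothetical names: each child pins its parent, and a node is only torn down once its last child and last user are gone, so a parent is never touched after its count reaches zero.

```c
#include <assert.h>
#include <stddef.h>

/* Toy tree node standing in for a pseudo-fs entry. */
struct tnode {
    struct tnode *parent;
    int refs;            /* children plus open users */
    int freed;           /* for demonstration only */
};

static void tnode_get(struct tnode *n)
{
    n->refs++;
}

/* Drop a reference; tear down this node and walk up the tree,
 * releasing each parent reference we held, freeing as we go. */
static void tnode_put(struct tnode *n)
{
    while (n && --n->refs == 0) {
        struct tnode *parent = n->parent;

        n->freed = 1;     /* a real fs would free dentry/inode here */
        n = parent;       /* now drop the ref we held on the parent */
    }
}

/* Link a child under a parent, taking a reference on the parent. */
static void tnode_link(struct tnode *child, struct tnode *parent)
{
    child->parent = parent;
    if (parent)
        tnode_get(parent);
}
```

Because the only cross-node operation is a counted get/put, freeing a leaf deep in the tree never requires holding the parent's lock while freeing the child, which is one way to sidestep the circular-locking problem mentioned above.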
Well, yeah, so the task can disappear, and the whole subtree under the task directory, which is this list of signals and fds and so on, all goes away, and if you've got files open, it stops you accessing them from that point. But let me step back, because I brought that up as the case where the process itself disappears. I'm talking about when the process is still there: you have the tree, and you have to free up these inodes while the process and everything is still there. So they're only created when someone does an ls? Well, yeah, the dput path, via a superblock or dentry operation, takes care of that. You return the appropriate value, and as soon as the file is dput, the dentry and the inode are deleted; they just go away. Okay, and that's procfs? So procfs has its own structure, which describes just the things it needs, and it creates the dentries and inodes on demand. And then it deals with the thing it's pointing at going away while the file is open. So if no one here minds, I'm taking notes on who's talking, and I'll probably send you notes like, hey, where does procfs do this? Because I'm looking at the VFS directory and I have no idea what's doing what. But if you look at the problems with sysfs, it's so intertwined with the device model and all the drivers, and you've got all these static, or what appear to be static, variables that create files behind the scenes. It's going to be interesting to change all that. It might be best to make it create, effectively, a proper backing entry rather than a dentry and an inode; it may make things simpler, I don't know. I'll have to take a look at the code and see what they do. Does procfs have a revalidate to do that? It uses the revalidate callback, probably; d_revalidate is what does that.
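The procfs pattern just described, a small permanent descriptor per file with lookup instantiating the heavy objects on demand and a d_revalidate-style check noticing when the backing object has vanished, could look something like the following userspace sketch. All names are hypothetical; the struct entry stands in for the dentry/inode pair.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Permanent, cheap descriptor (the proc_dir_entry analogue). */
struct descriptor {
    const char *name;
    int alive;                 /* cleared when the task/event dies */
};

/* Heavyweight object created only on demand. */
struct entry {
    struct descriptor *backing;
};

/* lookup analogue: find the descriptor by name, materialize an entry. */
static int lookup(struct descriptor *tbl, size_t n, const char *name,
                  struct entry *out)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (tbl[i].alive && strcmp(tbl[i].name, name) == 0) {
            out->backing = &tbl[i];
            return 0;
        }
    }
    return -1;                 /* no such file */
}

/* d_revalidate analogue: is a cached entry still usable, or should it
 * be dropped and looked up again? */
static int revalidate(const struct entry *e)
{
    return e->backing && e->backing->alive;
}
```

When the backing object dies (the task exits, the event is removed), revalidate fails, the cached entry is dropped, and a fresh lookup returns nothing, which is essentially how the tree stays consistent without pre-allocating anything.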
I suppose d_revalidate, and lookup creating the entry from the internal state, are the two main things to look at. Now, another question, and I'll probably just ask you guys for advice. Ajay sent me all these patches. Should I go with eventfs, or should I say screw it and go back and start making the whole of tracefs dynamic? He has something that kind of works, and it could be a test case for where the bugs are going to be. I don't know the VFS layer at all. It's kind of like how I got my mail server working: I got it working, I don't touch it, and if it breaks I spend hours trying to figure out how to fix it again. That's how I feel working with VFS. Do you have other users that would really like to use this feature? I don't know. Then talk to Greg, talk to some other file system that might have a similar use case, and based on that you can make the decision. What other pseudo file systems are there? There's cgroupfs, but cgroupfs doesn't need to create anything on demand. Yeah. Kernfs is a generalization; it has the concept of a node, of an internal node, so you should look at that area. I'm not sure which pseudo file systems use kernfs. Okay. And then there's a library, libfs, that implements simple file operations for pseudo file systems. But generally, the concept in VFS is that the lookup operation instantiates the dentry and inode from some backing store; that would be your internal state. That's what we look at. And revalidate is used on lookup to make sure the backing store didn't go away, on a network server or something like that. I figured this is probably like normal file systems, where you create things, and then there's reclaim.
When reclaim hits, you free these things up, right? And another reason I would look at procfs rather than sysfs as a base is that sysfs has to do lots of mkdirs and such when it's setting up. Procfs gets around that by creating the inode on the spot from its own tree at the back; it doesn't need the mkdir. I'd probably go to procfs anyway because I'm more familiar with that code; I've actually done hacks in it. And sysfs, because of the kobjects... I still have no idea how kobjects work. I don't think anybody does. You should also be aware that there's a dentry operation, I think d_delete, that decides whether the dentry stays behind in the cache after you drop the reference. For this sort of file system, it doesn't stay unless you explicitly keep the reference. There's some crooked API there, but that's not an issue, just something to be aware of. And I guess I can assume no one here has an issue with me moving forward and trying to fix this. Yeah, okay. Well, I'll show it to Greg. We're about out of time, so that was my presentation, just to get feedback from you. Thank you.