So, NFSD is under new management. It was Bruce Fields for many, many years; I guess he was the heir apparent. Neil gave him the maintainership back in, I think, 2007 or 2008 or 2009, and he did it for a long time. Then last year he stepped back, and he's sort of enjoying a sabbatical from the IT sector. He's well; I'm not trying to cover anything up there. He just decided it was time to enjoy his son growing up, and that's what he's done. So I became the maintainer of NFSD in January of last year, and Jeff joined me as co-maintainer. When was that? I guess June or July of last year. Jeff has done this before, so I'm ably assisted by him.

Some interesting priorities for this work. NFSD has some features in it that no other implementation in the industry has. For example, NFS over RDMA support for just about every fabric you can imagine: Omni-Path (which is obsolete), iWARP, RoCE, InfiniBand, and a couple that I'm not even aware of. NFS over RDMA works on all of them. Our client does too. No other NFS server can say that, so that's a good thing. We also have support for NFSv4.2, which is pretty rare in the industry. So those are things we can be proud of, and I hope I can extend that winning streak a little bit.

I guess my priorities are, number one, functionality: making sure that we're still at the top of the list there. Number two, security. I've been working on both GSS Kerberos and RPC with TLS, which is a sort of newfangled thing that allows us to do in-transit encryption of NFS without the use of Kerberos. I'm pretty excited about it; the cloud folks have been asking for this since, I guess, 2018, but I think we're about in a position where we can deliver it, so I'm very pleased about that. The third is performance and scalability, and that's kind of what this talk is going to be about. The fourth is observability, which means the ability to trace the operation of a server and do diagnostics on a live server without impacting the performance or scalability of that server. So I'm way into tracepoints. I should be into BPF, but I haven't really gotten a taste of it yet; I guess that's next. But yeah, I like tracepoints and I've been putting them in wherever.

So, anecdotally, I've had some reports that NFS reads are slow. Reads have, for a long time, I think for almost 20 years, used a pipe splice mechanism, which is poorly documented. And we broke it pretty badly last year in a couple of ways: Al broke it when he did his pipe iterator work, and I broke it because there was a piece of it that was really not documented at all, and I said, what do we need this for? and yanked it out. And now we know what we need it for. Anyway, I'm told that it's not performing very well. I have not measured this myself, but it's something that I'd like to pay some attention to, to try to understand basically how we need to do it better, and how we're going to join the rest of the file system family by going to full support for folios and iomap and all those wonderful things.

Write also has some performance problems that are not related to what read does, but they both rely on this structure called xdr_buf. This is the basic way that we track the assembly of RPC messages. We put it together from a kvec that's got the actual RPC header in it, that's the head; the classic way to do it is with pages, so there's an array of pages that contains the read or write payload; and then the tail.
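For reference, the xdr_buf being described here looks roughly like this, lightly simplified from the definition in include/linux/sunrpc/xdr.h (the field comments are mine and approximate):

    struct xdr_buf {
            struct kvec     head[1];        /* RPC/NFS header and inline data */
            struct kvec     tail[1];        /* e.g. GSS checksum or XDR pad */

            struct bio_vec  *bvec;          /* optional bio_vec view of the pages */
            struct page     **pages;        /* read or write payload */
            unsigned int    page_base;      /* offset of payload within pages[0] */
            unsigned int    page_len;       /* length of payload in the page array */

            unsigned int    flags;          /* XDRBUF_* flags */
            unsigned int    buflen;         /* maximum the buffer can contain */
            unsigned int    len;            /* actual length of the XDR-encoded message */
    };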
The tail usually contains things like a GSS checksum, or it can contain an XDR pad; RPC messages have to be a multiple of four octets in length, and that's what the tail is there for. Then this other information is for doing zero-copy out of the pages. You can basically point to an offset into the first page in the array and say: start here. Don't start at byte zero of that page; start in the middle of it. And then these fields down here record both the maximum length the buffer can contain and the actual length of the RPC message in the buffer.

You may notice there is already a bio_vec in here. That was sort of pasted in a few years ago because we thought, okay, bio_vecs are the wave of the future, so let's get started and put that in there. The client uses that; the server kind of doesn't. One of the things that has sort of stopped me from doing it is that not everything can support a bio_vec yet. There are no bio_vec-enabled APIs for RDMA, for example. So I haven't gone whole hog on it. But I guess bio_vec is something that at least the socket layer is really into, so we could use it for that. I'm just not sure how to bridge the gaps.

iomap is also interesting. I'm told that one of the things iomap can do that we're really interested in is that it can read a sparse file from a local file system without actually triggering the mechanism that fills in pages with zeros. Chinner told me that if you read an unallocated extent in a sparse file, that will force it to actually allocate blocks on the disk and fill them in with zeros. And we don't want an NFS read to do that, especially with really large files. We would like to preserve the unallocated extents as they are until they're actually written into. So that's something we're thinking about using iomap for, and I guess maybe that's the way we want to go. Is somebody online? Maybe I just heard my own echo. So maybe we want to replace the splice. Al's got his hand up.

In which situation would a read fill gaps in a sparse file? That might be something strange, XFS-specific, but normally a read shouldn't... a read should be possible on a read-only file system, right? Yeah, so it doesn't allocate any blocks, but what happens is we fill in page cache pages with zeros. Yes, that's a different story, yeah. And you could save even that, yeah? You could avoid instantiating a page cache full of zeros.

Okay, we want to do that too. We have a new operation in NFSv4.2 called READ_PLUS, where the operation can actually distinguish between data and holes, that is, basically unallocated extents. Yeah, it's a sparse read. The client can say: I want to read this particular byte range, and the server can respond either, okay, here's the data, or it can say, there's nothing there, and it has a very compact representation of the hole, which saves some network bandwidth. So yeah, we would certainly like the server to do the least amount of work possible when reading.

Yeah, so this is exactly what iomap is good for, because it will communicate to you in a compact way that this is happening. So I probably misunderstood what Dave said, but... But you can't just... I mean, iomap is something that file systems use, typically. If we are reading some random file system under the hood, can we call into iomap to do this? Like maybe I would have exported VFAT or something, right? Yeah, this would be... you know, file systems would have to enable iomap.
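As an aside on the READ_PLUS operation mentioned above: each result segment is either literal data or a compact hole description, which is what makes a sparse read cheap on the wire. This is a paraphrase of the XDR in the NFSv4.2 spec (RFC 7862), not code from the server:

    #include <linux/types.h>

    /* Paraphrased from RFC 7862: a READ_PLUS result is a sequence of segments. */
    enum data_content { NFS4_CONTENT_DATA = 0, NFS4_CONTENT_HOLE = 1 };

    struct read_plus_segment {
            enum data_content       type;
            u64                     offset;         /* where this segment starts */
            union {
                    struct {                        /* NFS4_CONTENT_DATA */
                            u32     length;
                            void    *bytes;         /* the actual file data */
                    } data;
                    u64             hole_length;    /* NFS4_CONTENT_HOLE: just a count */
            };
    };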
So we're probably going to need some kind of operations vector, you know, that gets filled in so the file system can say: I support iomap, and here are the iomap functions that NFSD in particular will call. Kind of sounds like some version of FIEMAP, which we currently have, which is a user-space-facing API, and like our SEEK_DATA/SEEK_HOLE, yeah. So we would have to think about how to make this work for you in a race-free manner, but in principle these are user-space-facing interfaces. That's exactly why we're not using FIEMAP today: because of the races.

The other thing that's probably worth noting: in addition to a flag for "should you even try calling the functions" (and if we actually have a separate iomap ops structure, you can simply check whether you have a pointer to it), it may very well be that on a per-file-system or per-file basis the file system says, I don't support iomap on this file because it's data-journaled or something like that. Then you would try to call it, you'd get an EOPNOTSUPP or some such, and fall back to the page cache methodology. We'll need to do that, because especially for ext4, what we've been planning on doing is supporting iomap gradually: we would do it for the easy cases first and then add support for the more complex cases later. So, just a heads up, this new interface will probably have a "for this particular file you have to fall back" case. Or we might just not bother doing iomap reads from that file system until it supports it for everything. Yeah, I think the challenge there is that what might happen is we'll support it for, like, 99% of the file systems out there, and it would be kind of a shame if you didn't use it just because there was this 1% of, say, fscrypt file systems that almost never show up on a data center server, right?

So, a couple of things about that. One is that not all file systems are exportable, so the ones that are not exportable we don't have to worry about in the first place. The second is that we already have a bifurcation in the read processing on the server side, because sometimes we can do a read with a pipe splice and sometimes we cannot, and then we have to use an iterator. One of the main reasons we'd have to use an iterator is if there is more than one READ operation in an NFSv4 compound; we can't do the so-called fast version of read in that case.

Consider NFS itself: somebody trying to re-export NFS. Your operation "give me data or tell me that there is a gap" looks rather useful for re-export, right? But you are not going to see iomap for NFS, iomap for the NFS client. So what you need is something that tells you where the gaps are. iomap certainly does it, but you'd have to carry that through to the lower file system in that case; otherwise it's just not going to work. That's what we're going for, actually: say I want to read this range and get back a series of pieces. That's what we're going for with iomap, yeah.

So what I want to know is: how useful is the page cache on an NFS server? Because if you have many, many clients, their combined working set might well be very much larger than the memory of the server. That's one possibility. It depends heavily on the workload. Yeah. And there are all kinds of workloads. So what I'm wondering is whether this is a direct IO operation or a page cache operation. And it sounds like we don't actually know the answer at this point.
There's no good way to find out. We basically rely on the page cache and balance_dirty_pages and all the rest of that infrastructure to determine whether pages are going to be kept in the cache or whether they're going to be read in once and then thrown away. Some servers actually have the ability to make smart decisions about that, and ours doesn't. Certain servers like to have these little, very low-powered CPUs and only a handful of hundreds of megabytes of memory, and use that as the basis of being a file server. Our servers are generally large, with gigabytes and gigabytes of RAM, so we do rely on the page cache to go fast.

I'm still kind of wondering: you want this sparse read, or however we call it, to be atomic with respect to what, exactly? Other writes to the file, or? I mean, just with itself. Right now... there was an implementation at one point that Anna did that used FIEMAP. But the problem is that things can change after you get your map, and so it just wasn't atomic enough to do that. So what we really need is a way to do what iomap does; I'm not sure that iomap is the right layer for that. In other words, you want it atomic with respect to hole punching and truncate at the very least, because that's where FIEMAP gets completely screwed. I didn't actually use FIEMAP; I was just doing seeks with SEEK_HOLE or SEEK_DATA. But same thing, yeah.

This is Darrick here; I was also kind of wondering: once you get the iomap mapping, what do you do about whatever may or may not be in the page cache, since the iomap itself tells you nothing at all about what's in the page cache? Well, it tells you what parts you can expect to find data in and what parts are going to be unallocated. The server can build the READ_PLUS reply based on that mapping information; it can do reads on just the parts where it expects data, and it can return holes where the iomap information says there's nothing there. Yeah, but there are other weird traps to that. It can tell you that there's an unwritten mapping on XFS, and there actually can be dirty pages in the page cache in front of that unwritten mapping, but you'll never know. That's part of why the SEEK_DATA implementation for iomap does actually notice that it's been fed an unwritten mapping, and then it goes plonking its way through the page cache to see if there are any pages in there.

Yeah, so SEEK_HOLE/SEEK_DATA would definitely be the better API, as I said, but it has the basic problem that you can be told "here should be a hole" and then suddenly there is data. But then the question is: you could possibly check i_version after you have filled in the request, to see whether... Yeah, yeah. You could basically use i_version to see whether something has changed with the inode after you have formed the request. Or we could just say we don't make a promise about it: if you're going to use READ_PLUS you might get a somewhat fractured view of the file. No, no, I don't like that at all.

This is Jeff. One other possibility is that we expand the page cache's ability to store "there is no data here." Right now, file systems have to memset entire pages to zero if you're reading through a sparse file. We could free the pages that we previously allocated and put in a special entry that says "no data here." That's a lot of work.
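A minimal sketch of the seek-based mapping walk discussed just above, with the i_version recheck bolted on at the end. vfs_llseek(), SEEK_DATA/SEEK_HOLE and inode_peek_iversion() are real kernel interfaces; encode_data_segment() and encode_hole_segment() are hypothetical stand-ins for READ_PLUS result encoding, and this is not how nfsd actually implements it:

    #include <linux/fs.h>
    #include <linux/iversion.h>

    /* Hypothetical encoding helpers, standing in for READ_PLUS reply construction. */
    void encode_data_segment(struct file *file, loff_t pos, loff_t len);
    void encode_hole_segment(loff_t pos, loff_t len);

    static int sparse_map_range(struct file *file, loff_t start, loff_t end)
    {
            struct inode *inode = file_inode(file);
            u64 change = inode_peek_iversion(inode);
            loff_t pos = start;

            while (pos < end) {
                    loff_t data = vfs_llseek(file, pos, SEEK_DATA);
                    loff_t hole;

                    if (data < 0 || data >= end) {
                            /* nothing but hole from here to the end of the range */
                            encode_hole_segment(pos, end - pos);
                            break;
                    }
                    if (data > pos)
                            encode_hole_segment(pos, data - pos);

                    hole = vfs_llseek(file, data, SEEK_HOLE);
                    if (hole < 0 || hole > end)
                            hole = end;
                    encode_data_segment(file, data, hole - data);
                    pos = hole;
            }

            /* The racy part: the layout may have changed while we walked it. */
            if (inode_peek_iversion(inode) != change)
                    return -EAGAIN;
            return 0;
    }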
But I mean, it's certainly something I've been thinking very seriously about doing. That would be useful on the client side for decoding, too. I was kind of wondering whether you actually want something along the lines of: here's a file position, read whatever data you can find at that or any higher offset, and then tell me where you actually got it from. The sparse read returns basically position and length, and then all the data concatenated after that, which is useful here. But that might be the sort of model we want to go for. It would be nice to be able to do a read into the VFS and get back this sort of information and construct what you need for the reply.

Well, I don't see how you can avoid a race there, where someone might fill in an unallocated extent while you're doing the READ_PLUS. It seems unavoidable. Yeah, so I don't see how we can avoid the situation I described before; we just don't make the promise. If you need that promise, then you lock the file. Yeah, that, or we just rely on the lower file system to do it, because ext4 and XFS lock the i_rwsem for a buffered read. Yeah. Yes, they do, I'm pretty sure they do. I thought I looked yesterday; I thought I did. Maybe, okay, maybe I'm wrong. So if we don't make that promise... I think Anna said at one point that actually breaks in fstests when we're using READ_PLUS. Not a mistake. Yeah, I remember it doing that. Oh, I can guarantee you: I tried to lock the file before encoding, and the SEEK_HOLE calls tried to lock it again, and we end up in a deadlock, is what I was also seeing. How can we guarantee that nobody local to the server will want to write something into that area just as our reply is being transmitted to the client? Folks are gonna do something stupid there; I guess it's a fact of life. We can't actually even guarantee it for local file systems. I mean, if you were doing it as two or three separate NFS READ operations, it's undefined what happens if someone is modifying the file out from under you between reads number two and three. If you do a compound read, saying that the result of the compound has to be atomic... I'm not sure how much value that actually adds. Maybe that breaks certain specifications, but if you don't need to make that guarantee, don't; I think it's not worth it. Right. NFSv4 compounds do not guarantee atomicity across the operations.

Okay, so the reason why Willy's here is to talk about folios, and maybe give us steps one, two, and three for getting folio support into the NFS server. That's kind of why I have this structure up on the screen: where do we plumb this in? Essentially, we're building something that looks a lot like sendfile and receive-file. What we've done traditionally is we've got an array of pages; on the receive side, on the server, that's a bunch of anonymous pages and the network layer reads into them. On the send side, at least for sockets, the pages involved in the RPC message are given to the socket and then released from the page array, and we fill those slots in with new anonymous pages for the next request. So, you've said things to me in the past like: it's not good to take a fully allocated folio and then break it up into little pages. Right. I mean, I'm not here to tell you how to write your code; I can tell you about how you can work well with the MM layer and with the file system layer. I would like to. So, the point of the folios project is to manage memory in larger chunks.
So if you go to the page allocator and say, give me an order-5 folio, it will do that. But if you then split it up into 32 order-zero pages, it's like: well, it could have done that for you, and it could have done it more efficiently than you can by allocating it and then breaking it up. So if you are going to allocate larger folios (and please do, because the larger the chunks we keep memory in, the better it is for fragmentation across the whole system), keep them intact; don't break them up. Maybe you allocate an order-5 folio and then you use only part of it before you reuse it for something else. But don't try to over-optimize. Don't say, well, I only need 23 pages and I'll use the others for something else. You're probably better off just using the first 23 pages of the folio, leaving the other nine pages free for a while, and then, once you're done with that request, having that 32-page folio available again.

I suspect the issue is the way we do sends, which is that we give the pages to the network layer. That's where we're getting this sort of "okay, I just gave the network layer nine pages and I need to fill them all back in," and that's kind of where we're getting this page-at-a-time behavior. So if we didn't do that, maybe we could just allocate a bunch of folios and leave them in place; I'm not sure. If only David Howells had come to this talk, because he's way more into what the network layer is doing with pages/folios than I am. I have been happily leaving that mostly to him. Yeah, he kind of asked for this session, so... Bad Dave, naughty Dave. No biscuit. No biscuit, Dave.

Yeah, I think from a system-wide perspective, we are very much looking to have everybody deal with arbitrary-order folios where they currently use pages. And a lot of that stuff just works. It isn't necessarily guaranteed to work, but in a lot of places, you know, you pass them the first page of the folio and they actually work. You have to be a bit brave to do that. You probably want to go in order and make sure they really are going to just work, but if you call put_page() on any page in a folio, it will decrement the reference count on the folio itself. So a lot of stuff does work. I don't want to say "just try it," because that's what I started doing when I was converting the page cache to folios: it's like, yeah, try it, and lots of things crashed and broke, and I fixed them one by one, and we got to where we are today. But, you know, we'd read through: okay, what does the network layer do with this page? Can it handle being told, oh yeah, there are 30,000 bytes in this page? Does it just work? Does it just copy 30,000 bytes starting at that one page? Because if it does, you can pass it a very big folio.

Dave wanted to convert our use of kernel_sendpage() to sendmsg(). Yes, yes. Using an iterator. Yes. So are you, or is Dave, planning to implement an iterator that you can hand a folio to and it will deal with it? Absolutely. Okay. Yes, so that's why he's doing this: so that sendmsg() can take in, I think it's a bio_vec. Right there. So yeah, it can just take in a bio_vec and iterate over it, and bio_vecs can already contain folios. If I just say folios... believe me, it is still typed as being a struct page, but the struct page that is used in a bio_vec is used only, or mostly, for its properties as a descriptor of memory.
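As a very rough illustration of where this points: keep a large folio intact, use as much of it as the message needs, describe that region with a single bio_vec entry, and hand it to sendmsg() as an iterator instead of calling kernel_sendpage(). MSG_SPLICE_PAGES and ITER_SOURCE come from the later sendpage-removal work, so treat this as an after-the-fact sketch of the direction being discussed, not what the server did at the time:

    #include <linux/bvec.h>
    #include <linux/mm.h>
    #include <linux/net.h>
    #include <linux/uio.h>

    /*
     * Sketch: send 'len' bytes that live at the start of a multi-page
     * folio as one bio_vec entry, via sendmsg(). The folio stays intact;
     * only its first pages are in use.
     */
    static int send_from_folio(struct socket *sock, struct folio *folio, size_t len)
    {
            struct bio_vec bvec;
            struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };

            /* Still typed as a struct page, but used here purely as a
             * descriptor for the folio's memory. */
            bvec.bv_page   = folio_page(folio, 0);
            bvec.bv_offset = 0;
            bvec.bv_len    = len;

            iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
            return sock_sendmsg(sock, &msg);
    }

A caller following the guidance above would get the folio from folio_alloc(GFP_KERNEL, order), use only the first pages it needs, and keep the whole folio around for the next request rather than splitting it into individual pages.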
So that would look like a single-entry bio_vec with a very large length, and that struct page would just point to a folio? Yes. So it's already possible, and the ITER_BVEC iterator actually handles it. The trouble is that Dave seems to want to mix kvec and bio_vec to do heterogeneous iterators, which I think is a complete nightmare. It's going to be a bunch of overhead for no good reason. So yeah, it saves you two calls, but defining the semantics for the rest of it... I was gonna say, if that's madness, then I think it wouldn't be difficult to convert the head and tail kvecs into page entries with just one page each. Sure, it could be a very short page. Yeah, they're generally not bigger than a page anyway.

Well, you can't do that if that stuff sits in something kmalloc'd, yeah. Yeah, we'd just allocate a page and, yeah, use the page address. No: if it came from a slab, you can't do that, because it's playing games with the refcount. I see. Because when you grab a reference on a page, you are guaranteed that page won't go away, but kmalloc and kfree have no idea that somebody behind their backs grabbed a reference to that page. It says kfree, hey, no problem, pages from this range of addresses are free, they can be reused by kmalloc, and then you have trouble. Okay, we'll just have to be careful, but I don't think that's rocket science; we just need to be careful. Yeah, I think if you're using kmalloc to allocate that already, then just switch over to using a page directly. Yes, right. Yeah, use the same struct page for both; you just use different ranges in it. Yes. In fact, the common case is that the head and tail kvecs point into the same page. Ah, that should work. Yeah. I think the client side does kmalloc for the head and tail; I don't think the server does. In those cases, it just uses a page and splits it up. Okay, thank you, that's helpful.

That was an opportunity. If anybody has complaints or rotten fruit to throw, I'm happy to entertain questions or comments. Yes. I have one question. I'm testing a workload of just sequential writes to an NFS server, for performance. The problem is that the NFS server can split your IO requests across multiple threads. What I would have been happy to have is per-inode affinity for IO. I tried to look at how to do that and I couldn't figure out how it could be done. Something similar to io_uring, because if you're writing to an XFS file system and you're spreading those writes and reads across threads, you get poor performance because of the shared read-write lock. Oh, I see. So it's like you would like... At least a 50% degradation in sequential write, or something. Aha, the guilty party arrives. Okay, you can answer that, but I don't know, it's hard. I'm happy to talk to you about that offline. That's basically not the way the NFS server is architected, but we can talk about ways of helping that situation. Okay.

I'll answer a question, unless, for the last reason... Yeah, I guess we fixed the problem, so I wasn't needed. The question really was: what the hell are you playing at, Dave? And I think we answered that. Yeah, we were looking at the xdr_buf struct and trying to figure out how to get rid of the head and tail kvecs, at least on the server. I think we figured out how to do that. He's on the call, he's listening.
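To close the loop on the "just use a page and split it up" idea above, here is a purely illustrative sketch: back the head and tail with one real allocated page (rather than kmalloc'd memory), so the same struct page can later be described safely by a bio_vec. This is a simplification of the approach being discussed, not code lifted from the server:

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/uio.h>

    /*
     * Sketch: carve the head and tail kvecs out of different ranges of
     * the same allocated page. The page's refcount pins both regions,
     * which is exactly what slab memory cannot guarantee.
     */
    static struct page *setup_head_tail(struct kvec *head, struct kvec *tail)
    {
            struct page *page = alloc_page(GFP_KERNEL);

            if (!page)
                    return NULL;

            /* head gets the first half of the page... */
            head->iov_base = page_address(page);
            head->iov_len  = PAGE_SIZE / 2;

            /* ...and the tail gets the second half of the same page. */
            tail->iov_base = page_address(page) + PAGE_SIZE / 2;
            tail->iov_len  = PAGE_SIZE / 2;

            return page;    /* caller drops the reference when the buffer is torn down */
    }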