So what this is about is getting the VM parts out of the network filesystems and into a common library. What this does is sit between the VM and the filesystem and handle all of your address space operation calls, apart from perhaps truncate, and it will assist with truncate. It also moves all the folio handling, the multi-page folio handling, into the library, so the network filesystems don't need to know about that. Local caching moves there as well, which means the cache can deal with multi-page folios more easily. It can do content encryption, and when I say content encryption I don't mean transport encryption: it's so that you, on your clients, can unlock your files, but the server can't decrypt them. What occurred to me when I was looking at this is that we have local caching and we have content encryption, and we don't want the unencrypted content appearing in the local cache. We want to save in the local cache the encrypted content we got from the server, and if we make any local changes we encrypt them and then save them to the local cache and up to the server. Otherwise you've got a security hole: if someone takes your laptop, they've got access to all your files. This is easier to do if we can put it all together in one place, and then we can give all the network filesystems access to the same facilities. To make it work with content encryption I had to add buffering capabilities, which means it can do read-modify-write: it can issue an op to read from the file server, you can make a modification to it and write it back. It also allows unbuffered writes, where the write does not necessarily appear in the page cache, and because I'm getting rid of write_begin and write_end in this, I can do a huge unbuffered write and just send it straight to the server, and then it's gone from memory; we don't have to keep it. It's implicit direct I/O, if you like. I'm also adding dirty data tracking, because changes may be tagged with the person who made them, so we may need to write them back with particular authentication; and Ceph, for example, has snapshots, where any particular write belongs to a particular snapshot and you have to write the snapshots back in the right order. We just need to keep a chain of them, so we keep a list of the dirty regions and then write them back in order, so we've just got the current snapshot. I could do this like NFS does, where NFS has a structure for each page that keeps track of this information for that particular page; what I'm trying to do is amalgamate those and keep a list of them separately, so we can have a lot fewer of them and just extend them as we add more pages to a particular dirty region. So what does it require from the network filesystem? We basically get rid of all knowledge of folios and pages from the network filesystem. It only has to provide two main operations, two hooks: one to do an asynchronous read and one to do an asynchronous write, and we drive these by giving them an iov_iter that describes the pages. It may be a bvec, it may be an xarray, it may be the page cache; the network filesystem doesn't really need to know.
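As a rough illustration of that division of labour, here is a minimal sketch of what such a pair of hooks and the sub-request that drives them might look like. The structure and field names are hypothetical, not the actual netfslib API; the point is only that the filesystem sees an iov_iter rather than folios or pages.

```c
#include <linux/fs.h>
#include <linux/uio.h>

/* Hypothetical sub-request: one slice of a larger read or write. */
struct my_netfs_io_subrequest {
	struct inode	*inode;	/* File being accessed */
	struct iov_iter	iter;	/* Buffer description: bvec, xarray, page cache... */
	loff_t		start;	/* File offset of this slice */
	size_t		len;	/* Number of bytes to transfer */
};

/* Hypothetical operations table: the only two hooks the network
 * filesystem has to provide for basic I/O.  Each is asynchronous and
 * reports its result through a completion helper when the server replies. */
struct my_netfs_ops {
	void (*issue_read)(struct my_netfs_io_subrequest *subreq);
	void (*issue_write)(struct my_netfs_io_subrequest *subreq);
};
```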
We just say: here's an iterator, go and read these, go and write these. So we do direct I/O in the netfs layer, but we do encrypted direct I/O, buffered I/O, encrypted and unencrypted; I had all of that working. I'm rewriting part of it, so it will be working again soon. The network filesystem also provides hooks if it wants to adjust readahead. Take Ceph, for example. Ceph has bits of files, two-megabyte blocks, stored on servers all over the place, and it tries to stitch them together to make something that appears to be a file. So when we're doing readahead, we want to adjust the readahead so it falls on block boundaries, Ceph block boundaries: we do a two-megabyte readahead from that block, then a two-megabyte readahead from the next block. What it also allows us to do is make a bigger readahead and dispatch the pieces all at the same time, or just queue them up and dispatch them in order, which is something Ceph would really like; so basically some queuing facilities. And there are two other hooks you need to provide if you want content encryption: an encrypt-block hook and a decrypt-block hook. netfslib does all the work of building up the scatter-gather lists to hand to the encryption code; it just says: here's a pair of scatter-gather lists, encrypt from this one to that one, or decrypt from this one to that one. And the idea is that we will add two operations to the fscrypt library so that, if you're using fscrypt with this, which Ceph is looking to do, you just point the two hooks at fscrypt and it just does it, because the fscrypt information, the fscrypt context, is all put on the inode; you just set it up in advance and go. So Steve, there's a microphone somewhere. I think this is particularly fascinating, because with network filesystems usually the only feature you have is encrypting it at rest, so encrypting it before it goes on the wire is cool. And that's possible; it can be reasonably easy to do at some level, you just have to know what's used, so it could be done, at least on SMB. But the thing that caught my attention was compression, which shows up... oh, I didn't put that on here. So, for example, if you're trying to save a file to a server in Cambridge or Oxford, England, and you're sitting on a bad network somewhere, by the swimming pool at a resort here, you might want to compress the data before you send it. So that's supported in the protocol, at least for SMB, and probably for Ceph and other things too, I guess. I'm working on making it so that I can add that later as well. That's more tricky, because the block size for the compression is usually a lot bigger than your page size or your folio size, so I may need to take this dirty folio and several clean folios around it as a block and then compress them together. I've got the buffering in there, so I just say: here's the buffer, compress into that. The problem is that I guess each of the network filesystems will compress a little differently. Well, yeah, so I'll provide a hook to do the compression; it'd be like the encryption hook. So for Ceph, you wouldn't be using fscrypt. Maybe it works, I don't know. I mean, the thing is that we just have to investigate what the defined algorithm is in the specs, and it's just something I've never looked at. But one of the things that's interesting about compression is that you could do it lower down, after the write, or you could do it here.
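As an aside, to make the shape of the per-block transformation hooks being discussed concrete: the following sketch uses purely illustrative names and signatures rather than the real netfslib or fscrypt interfaces, and shows the idea of the library handing the filesystem (or fscrypt on its behalf) a pair of scatter-gather lists per block. A compression hook could take much the same shape.

```c
#include <linux/fs.h>
#include <linux/scatterlist.h>

/* Hypothetical content-transformation hooks.  netfslib would build the
 * source and destination scatter-gather lists covering one block; the
 * filesystem (or fscrypt, using the fscrypt context on the inode) just
 * transforms one list into the other and returns 0 or an error. */
struct my_netfs_crypto_ops {
	int (*encrypt_block)(struct inode *inode, loff_t pos, size_t len,
			     struct scatterlist *src, struct scatterlist *dst);
	int (*decrypt_block)(struct inode *inode, loff_t pos, size_t len,
			     struct scatterlist *src, struct scatterlist *dst);
};
```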
And if there's advantage to doing the compression at this layer rather than below... I'm about a third of the way, or the code's about a third of the way, complete for compression on the wire, but this might be a good example of something to... Yeah, because I create a separate buffer, and I can make it so that the data being written back is copied to the buffer, possibly encrypted on the way, or we can insert a compression step, and this will handle the compression buffering and just hand the buffer down. It gives you an iterator that happens to point to the compression output buffer, and then you can send it. So, if necessary, we can make this do the buffering required for transport encryption and transport compression. And then a related thing, which you just reminded me of accidentally: there are cases where a file is encrypted at rest, or sorry, encrypted at rest or compressed at rest, and usually the server decrypts it. Today they're decompressed before they're sent over the wire, unless you have a compressed transport. But there are cases where you could use ioctls and read the raw form of it: you have a file that's compressed at rest on the server side. There may be multiple ways that's done, and that may be tricky, but it's interesting because there are two compression cases, I guess, is what I'm getting at. Yeah, there are really two, maybe three, because in the one you're talking about the server can actually decrypt it. Yeah, so the simple way to think about this is: I could go over to my Mac or my Windows system or whatever, right-click and choose compress, and that would force it to be compressed at rest on the server. If you mounted it on your Mac or mounted it on Windows or mounted it on Linux or whatever, it would decompress it before it's sent over the wire. But there's also a compressed network transport option, which re-compresses it for the network. Yeah, mostly what this is doing is providing the buffering and then calling you to make transformations to it, whether that transformation is encryption or compression or just a plain copy, because you want to hand the buffers on anyway. Has anyone got any other questions? That's about all I've got to say on it at the moment. Do you have any support for directory caching right now, or directory operations? I have had patches in the past to make it support AFS directory caching, but AFS directories are just blobs that you parse locally. We can look at adding something to cache directory information later, as an extension to the library, where you don't get it as a blob from the server: you read directory entries and then cache them. And the way to do that may be just to turn them into AFS directories locally, because it's a defined format, and then I can use the same code to parse both. But certainly that's an option. So your goal with this is to essentially replace all of this code for NFS, CIFS, everything? Plan 9 and AFS, yeah. And also I've been asked whether we can do this for FUSE as well, and also OrangeFS. Okay, so how is that looking? Like, I assume this part works. So the read helpers are upstream, and AFS, 9P and Ceph are supported. I've got patches for CIFS; one of Steve's colleagues has tested that and doesn't see any particular performance regression with it, and it removes, was it 1,000 lines of code or 2,000 lines of code, from CIFS, moving it into here. I'm working on the write helpers. I've got them mostly working, apart from truncate.
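To picture the dirty-region tracking mentioned earlier, which the next remark about maple trees versus a linked list refers to, a minimal sketch might look like the following. The structure, its fields and the helper are hypothetical, just illustrating one record per run of dirty pages rather than one per page.

```c
#include <linux/cred.h>
#include <linux/list.h>
#include <linux/minmax.h>

/* Hypothetical dirty-region record: one of these covers a whole run of
 * dirty pages (rather than one structure per page, as NFS keeps) and is
 * extended as more pages join the region. */
struct my_netfs_dirty_region {
	struct list_head	link;		/* Kept in snapshot/submission order */
	loff_t			start;		/* First dirty byte */
	loff_t			end;		/* One past the last dirty byte */
	unsigned int		snapshot_id;	/* e.g. the Ceph snapshot it belongs to */
	const struct cred	*who;		/* Credentials to write it back with */
};

/* Try to extend an existing region to cover a newly dirtied byte range;
 * if the range isn't adjacent or overlapping, the caller would allocate a
 * new region and append it to the tail of the per-inode list instead. */
static bool my_try_extend_region(struct my_netfs_dirty_region *r,
				 loff_t start, loff_t end)
{
	if (start > r->end || end < r->start)
		return false;
	r->start = min(r->start, start);
	r->end = max(r->end, end);
	return true;
}
```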
So I've backed off a bit there, because I tried to build it on top of maple trees and thought: that's too hard, let's just use a linked list at the moment and do more interesting things later. That's massively simplified it. I'm going to look at doing truncate next. With any luck we'll be able to get the write helpers upstream in the next merge window, but it's a bit tight; we might well miss that because RC6 is almost upon us, and because I'm here at the moment I can't do anything until I'm back. So it should be the one after that, certainly by the merge window after that, assuming no particular problems are found or objections thrown up. So the big benefit so far is just the simplification of the existing network filesystems? Yeah, because I've removed 5,000 lines of code so far. Some of that was FS-Cache, but also AFS and 9P, and more to come from CIFS. And when they take the write helpers as well, that's a whole other lot of several thousand lines gone. I'm just curious, I was trying to follow how you did it, and it seems that you broke the old implementation and then removed it completely. Oh, you're talking about FS-Cache? Yeah. So how does the end user experience this transition, exactly? I was trying to understand. Barring bugs, it should work the same before and after. But is there any transition from the old FS-Cache? Well, the FS-Cache change is already done upstream. But basically, with that change I ended up deleting 90% of the code anyway, so it just made more sense to start again and rebuild it from the beginning, because that makes it much easier for people to review. Right, I'm just wondering whether there were any upgrade issues, or is the FS-Cache cache completely wiped? Well, it wiped the cache, because it changed the cache format, but it's just a cache. Yeah. So it just changes the version number and the layout, and the old cache is blown away; but it's just a cache, it will be reloaded the next time you use it. It's a one-time hiccup. Yeah. Okay. So what's happening right now with Ceph? No support for FS-Cache at the moment? It's Ceph you mean? Yeah. We added support for FS-Cache with Ceph, but it didn't go in by RC1; it was RC3, was it? Something like that. But I've been working on, I have a set of patches to make Ceph use netfslib. As for local caching: you can use local caching without going through netfslib, you can talk to FS-Cache directly, and NFS does that. But it does it on a one-page-at-a-time basis, which kind of sucks. Because what I did with the FS-Cache changes was get rid of the old page-cache wait-list snooping, which I'm sure was missing events occasionally, but it generated so much logging there was no way to debug it, and I just switched to using kiocbs, which didn't exist when I first did it. And now it can do direct I/O directly between the cache and your network filesystem's page-cache buffer. It's just that when you've got a slew of pages, a whole run of pages, using readahead rather than readpages is a lot harder to do without more infrastructure, and I don't want to basically duplicate netfslib inside NFS as well. I'm being paid to be a gadfly here. One item I don't see on the slide is direct data placement, which is near and dear to the hearts of CIFS, NFS and 9P. Can you make any statement about how you'll handle RDMA transports? I heard that. Not at all at the moment. I've seen the RDMA code in CIFS; I didn't know there's RDMA code in Plan 9.
I have not touched NFS with netfslib at all. I don't have any hardware on which I can test this; I can probably borrow some from Red Hat. You don't need hardware: we have two software RDMA device drivers in the kernel today, one for iWARP, one for RoCE, and they work with standard Ethernet cards. I'm volunteering to help you, of course, but I would like to see a plan. All right, I'll try and come up with one. Okay, great. Thank you. RDMA is something I know I need to think about, but I need to work out a way to actually test whatever it is I do. I've heard that before. Jeff Layton on chat says: I don't see any reason why we couldn't support RDMA in netfslib, but that's sort of below the netfs layer; it'd be up to the filesystem to handle that, I think. When I'm doing a buffered read or write to or from the page cache, I create an iov_iter with the page cache pages in it and hand it to the network filesystem. Presumably it will do whatever it needs to do to do RDMA to or from those pages; I've told it where to find the pages. And if it's direct I/O, I set up a bvec pointing to the direct I/O pages and tell it: here you are, do something with those pages, please. I think this is a good point, by the way, to dive into a little bit more. One of the things that is noticeably absent from some testing is testing with emulated RDMA. In theory, in Azure, and probably in Google Cloud and AWS, you can fire up VMs with RDMA, but in practice I don't know of anybody that does that much; they tend to run their own dedicated hardware rather than fire up a VM for 20 minutes. But that means that the netfs changes aren't getting exercised there, and we need to test them on that, so I think running with the emulation is a good idea, at least in some of our test infrastructure. One of the things that jumps out at me, though, is adjusting readahead. It's not just the size of the readahead that would vary if you were using an RDMA connection; the latency changes too, right? So one of the things that I'm kind of struggling with is trying to figure out, sanely, what that algorithm looks like for adjusting the readahead. In a network filesystem you have a number of adapters. It's a little bit tricky in NFS, but NFS has a way of running with multiple adapters; with SMB it happens all the time, and we get a notification from the server if the server added or removed adapters. The bandwidth of the adapters is also advertised. If it's RDMA, the bandwidth is going to be more and the latency is going to be less. How many adapters we have, how many credits we have, we can throttle up and down, but we have to pick sane numbers, and we don't have all the data in the filesystem. We don't know how memory-constrained you are: do you really want to read 100 meg? Yeah, maybe you're running low on memory, I don't know. Well, the initial driver comes from the VM anyway; it says, I want you to get this much. And this was originally mostly about shaping it to fit things like Ceph's objects, the object boundaries. But it's also about matching the pieces to your rsize, your network rsize, because it allows you to cut a particular request up into sub-requests and match those to your rsize; 9P, for example, can't handle a message over a megabyte, and that includes the metadata, so we have to shrink the request down a bit. But it'll allow you to do byte-range requests, so you can start with three bytes and then do seven bytes.
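A crude sketch of the kind of slicing and queuing being talked about here and in the next exchange: cut a large request into pieces no bigger than the transport's rsize and cap how many are outstanding at once. All the names, and the cap itself, are made up for illustration; they are not netfslib's actual structures or limits.

```c
#include <linux/atomic.h>
#include <linux/minmax.h>
#include <linux/types.h>
#include <linux/wait.h>

#define MY_MAX_IN_FLIGHT	4	/* Illustrative cap on outstanding I/Os */

/* Hypothetical request container. */
struct my_netfs_io_request {
	wait_queue_head_t	waitq;		/* Woken when a sub-request completes */
	atomic_t		nr_in_flight;	/* Sub-requests currently outstanding */
};

/* Hypothetical: sends one sub-request to the server; on completion it
 * would decrement nr_in_flight and wake waitq. */
static void my_issue_subrequest(struct my_netfs_io_request *rreq,
				loff_t start, size_t len);

/* Cut [start, start + len) into pieces no larger than rsize and dispatch
 * them, keeping at most MY_MAX_IN_FLIGHT outstanding at any one time. */
static void my_dispatch_subrequests(struct my_netfs_io_request *rreq,
				    loff_t start, size_t len, size_t rsize)
{
	while (len) {
		size_t part = min_t(size_t, len, rsize);

		wait_event(rreq->waitq,
			   atomic_read(&rreq->nr_in_flight) < MY_MAX_IN_FLIGHT);
		atomic_inc(&rreq->nr_in_flight);
		my_issue_subrequest(rreq, start, part);
		start += part;
		len -= part;
	}
}
```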
I guess what's hard is that, in general, for a particular target you want to have a certain number of I/Os in flight, and that number could be quite large for some of these workloads. We artificially cap it at four in Azure, but other servers could have 20 adapters; you might want 50, 60, 80 requests in flight. There are certain workloads where that makes sense, but I guess what I'm getting at is that this is probably common code, but it's common code that we have to play with to figure out how many to have in flight. Yeah, there's some of that, but for that you also need the MM people on board, because they've got to do their part of it; it's not just a filesystem issue at this point. Yeah. Yeah, I mean, setting RDMA aside, I hear you, Chuck. I think this is a really good start; I think this covers a lot of ground, and maybe doing the RDMA stuff is a really good thing, but I think getting us converted over to this is a really tall order as it is, and then maybe at some point RDMA can live in here. Well, I think the RDMA ought to be relatively straightforward, because I'm just giving you: here's a set of pages, go and do something with them. It also makes Willie very happy, because it means he's only got one place to look for folio handling, not five or six or seven. What's going to be interesting is having a look to see whether this can be applied to FUSE. Yeah, I would love to see this applied to FUSE. Because, as I said, I have been asked if I can make FUSE work with this. I don't know yet. In particular, I've been asked about local caching for FUSE. The problem with that is that I need to get an index key so that I can match the data in the cache with the FUSE inode, and that probably has to come from user space somehow. I think there's some patch set to change the FUSE protocol to work with file handles instead of inode numbers, or whatever it is. Is it NFS-file-handle type things, you mean? Yeah, the same type. Yeah, well, if they're stable; the thing is that every time you mount it, it has to be the same file handle if it's the same file. As long as that's true, I can use that. And I would also need coherency data as well. I think I may have given Joseph the wrong impression. I'm not objecting to this; I agree that it's a good plan, and RDMA can definitely be phase two or three. I didn't mean to imply that everything has to stop and wait for that; I just wanted to make that clear. Yeah, because as I said, I've seen those RDMA-supporting bits and I need to make this work with them. Okay. Hold on, I've got Zhang on the call. Can you hear me on the on-site side? Since we're applying FS-Cache to local filesystems, which are different from network filesystems: local filesystems have their own exact on-disk formats in general, so instead of caching each individual file, as netfs does, we cache filesystem images themselves, to avoid duplicating the whole filesystem tree again. Oh, that's interesting. So currently we use the FS-Cache low-level APIs instead. As Dave suggested, these days I'm thinking that, yeah, it's different, so maybe the netfs model doesn't necessarily apply to local filesystems. So this is the EROFS people? Yeah, I think so. So maybe in the mid-term, if there's some local interest. Okay. So he's talking about maybe having the local caching stuff work for a local filesystem.
Yeah, because we've got some patches for EROFS, however you pronounce it, to talk directly to FS-Cache and cachefiles, so that it can have a user daemon round the back go and preload stuff into the cache, and then it just goes to FS-Cache to get it. But what they have is that each object in the cache is a bundle of files, and then they put some kind of loopback on top, on which the EROFS filesystem sits, and then they go through the bundles. And, as I said, I think handling RDMA in netfs is just a matter of making the read-issuing and write-issuing operations do the right RDMA operations. We might need a little support in netfs for it, but I don't foresee a lot being needed. Yeah. Also, if we move NFS over to using netfs, we would need to have a look at how to make pNFS work with it. I don't know how similar it is in its implementation, with objects stored all over the place, but I just haven't looked at pNFS yet. Maybe someone who's familiar with pNFS, anyone over there? They're all looking behind them. That's a name I've not heard in a long time. All right, it's 1:31. Let's hand this off to Ted for the local stuff. Appreciate it.