All right, welcome everybody. We have a CephFS talk, with two of the CephFS core contributors, Jeff and Patrick. Please welcome them and enjoy the talk.

Hi everybody. This is a talk about some work I've been doing for the last year or so. It's still quite leading-edge, so be gentle. (Sorry, let me just get the browser out of the way here.)

Just a little bit about us. I'm a longtime kernel dev who has moved to doing recent work with CephFS, and I just recently took over maintainership of kcephfs; Zheng wanted to move on to doing more work in the MDS, so I've taken over that part of it. Patrick is also a contributor, the lead person on CephFS these days; he joined Red Hat in 2016 and mostly shepherds the project along at this point.

What motivated this work is the truism that any time you're working with a network filesystem (NFS, CIFS, anything like that), metadata operations, and directory operations in particular, are generally pretty slow. If you're doing an open, an unlink, a rename, anything like that, you almost always do a synchronous round trip to the server: someone calls into the kernel, we dispatch an RPC to the server, we wait for the reply to come in, and only then can we return to userland. Those round trips are slow, and they affect a whole lot of workloads: untarring files, rsync, removing a big directory tree, and also things like compiling software. Basically anything you do that touches the filesystem is affected.

So first of all, why are local filesystems so much faster? The obvious reason is that they don't have a server to talk to.
They don't have to make that long round trip. There are also non-journaling filesystems that buffer their metadata mutations in memory; ext2 is the classic example, though it's not much in use these days. In most journaling filesystems the journal is pretty quick, so we don't worry too much about having to journal all this metadata to handle crash recovery: the writes to the journal can be batched, so you build up a transaction and then flush it out to the journal in one go.

The consequence is that those operations are not guaranteed to be durable unless you fsync, especially on the non-journaling filesystems. If you do a rename or an unlink and the box crashes before the data hits the disk, that operation may turn out never to have happened, even though you had already returned to userland. It turns out that in most modern journaling filesystems this isn't such an issue: they almost all write synchronously to the journal before returning to userland. But technically, you're supposed to fsync to ensure the operation has persisted to disk.

Now let me hand this part over. Okay, I'll try to speak up; that's hard for me. Just to give you all an introduction to CephFS, for those of you who don't know: CephFS is a POSIX distributed filesystem. It's the oldest storage application that runs on Ceph; it was the original use case for Ceph back around 2005. It's a cooperative filesystem, and of particular note, the clients have direct access to the object storage devices (OSDs). They're able to read and write all the file data blocks themselves; they don't have to go through any kind of metadata server. The server Jeff was talking about earlier is actually the metadata server (MDS). That's the centralized service (there can be more than one) that aggregates all the metadata mutations, journals them to RADOS in the metadata pool, and also manages the cache across all the clients, making sure the clients are consistent and their caches coherent. There's a capability mechanism the MDS uses to give clients the right to do things like read or write a file, or keep track of what entries exist in a directory, and that's all cooperatively maintained by the clients and the MDS. The clients are considered trusted in the CephFS model, so they're not going to misbehave in any way; notably, they have direct access to the data pool, but they're also expected to keep their caches coherent with the MDS. Jeff is going to focus in this talk on the RPCs at the top of the diagram, between the client and the active metadata server. So how does the MDS manage all this, and mediate between the different clients?
Well, it has a mechanism called the cap subsystem, short for capabilities. If you're familiar with something like NFS or SMB, a capability is very similar to a delegation or an oplock, but more granular. Caps come in several types: PIN, AUTH, FILE, LINK, and XATTR. A lot of those are fairly self-explanatory. PIN just ensures the thing doesn't go away, and I believe it also ensures that it doesn't float between MDSes, so the inode stays pinned to a particular MDS while an operation is going on. AUTH covers user ownership and mode. FILE is the big cap; I'll talk about that in a minute. LINK is primarily the link count, and XATTR covers the xattrs. They pretty much all have shared and exclusive varieties, so we can hand out shared or exclusive caps to a client to let it buffer or cache operations. The FILE caps are a little special: they have a whole bunch of extra bits. The way we express and track caps in the code is as a bitmask, and this part of the slide shows more or less how the bits are laid out for the different caps. The thing to notice about the file caps is that they're pretty extensive: we have shared and exclusive, of course, but there are also cache, read, write, and buffer bits.
There's also a write-extend bit (for appends), and then LAZYIO, which is sort of a weirdo thing to let a client avoid talking to the MDS so much. But here we're mostly talking about directory operations, and traditionally the MDS has not given out much in the way of caps on directories: it will hand out shared caps readily enough, but exclusive caps not so much.

So, to try to speed up directory operations, what we want is to start allowing the clients to do a bit more locally. To do that, we have extended, or really overloaded, the FILE caps to have different meanings on directories. In particular, we want to allow asynchronous create and unlink; those are the two we're starting with. We're also going to have the MDS hand out exclusive caps on directories. You'll notice, too, that we have a shorthand notation for how the caps are expressed.

Internally in the MDS, whenever it has to do a directory operation, it has to go gather a bunch of locks, to ensure other MDSes don't come in and try to do something at the same time. Zheng, sort of the lead developer on the MDS, has developed a new lock caching facility: the MDS can gather the locks for an operation on a directory and then cache them for later use if it needs to do another. Essentially, we only hand these caps out on the first synchronous create or unlink in a directory.

Now let's talk a little about dentry caching. When I say dentry, I mean a pathname component within the filesystem. Again, we don't want to always have to do a synchronous round trip to the server to do a lookup on a directory entry. To do an asynchronous directory operation, we need to know reliably that our cached information about the directory is correct. We can't go fire off an unlink and then find out later that the file didn't actually exist; that's not allowed. We have two mechanisms for tracking dentries. One is dentry leases, which come in positive and negative flavors. The other is caps: the MDS can hand out Fs (shared) caps on a directory, or exclusive caps (exclusive implies shared). For the latter, just having caps on a directory doesn't tell us anything about the dentries in it, so we either have to have done a full readdir on the directory, or we have to otherwise know the state of all the dentries in it. For instance, if we create a new directory, we know it's empty, and we can consider it complete. We actually use this today: it allows us to do lookups, even negative lookups, on a directory without talking to the MDS. If we know our information about the directory is complete and someone asks for a dentry we know isn't there, we don't have to ask the MDS; we can just say no, that doesn't exist.

So now let's talk about doing actual asynchronous operations, starting with what happens today. It's pretty typical, similar to NFS or SMB or anything like that. When someone calls unlink, for instance, we go down into the kernel and do it synchronously: we dispatch a call to the MDS, wait for the reply to come in, and only then return to userland. And that can be really slow.
Think about doing an rm -rf on a directory: we do a readdir to find out what's in there, then issue an unlink on each file, and each of those is a round trip. That's very slow.

Here's a diagram showing the procedure for a synchronous unlink. We do an open on the directory, which gets us information about the directory and capabilities for it; let's say we got exclusive caps. We do a readdir so we know what all the dentries are, and get a reply. Then we do an unlink and wait for its reply, then another unlink and wait for that, and so on, and finally we can do the rmdir at the bottom.

If we're going to do these asynchronously, we have a decision to make: do we want to delay transmitting them, or just fire them off as soon as we get them? When you're talking about asynchronous operations, it's natural to think about buffered I/O in the kernel, where we write into the page cache and eventually flush it out. The deal with buffered writes, though, is that any time you write, there's a pretty high probability of a follow-on write a little later that modifies the same data, so it's often advantageous to wait a while before flushing. Not so much with directories: workloads that repeatedly create and unlink the same dentries are pretty rare. So, at least at this point, we're operating under the assumption that there's not a lot to be gained by delaying them. As soon as someone calls the operation, we fire off the call and just don't wait for the reply. We may change this in the future: some workloads, like rsync, will create a file, write to it, and then rename it into place, and for those it might be more advantageous to wait, so we may consider doing this differently down the road.

So, an asynchronous unlink: how do we do it? First, we have to have exclusive caps on the directory, and we also need an unlink cap, which typically means we have to have done a synchronous unlink in the directory first. We also need to know the dentry is positive, that is, that the file exists. There's also a concept somewhat exclusive to Ceph: it has to be the primary dentry for the file. Ceph has a really strange way of tracking hard-linked files, so we're basically excluding hard-linked files from this, at least for the time being.

The idea is that we fire off the unlink call to the MDS, assume it worked, and don't wait for the reply. We fire it off, delete the dentry inside the kernel, and return to userland. The upshot is that if we're doing a whole bunch of these, we're shoveling them all out in parallel, and that really speeds up things like removing a directory tree recursively. Here's the diagram again: pretty much the same at the top, but at the bottom you can see we fire off lots of async unlink requests, the replies come back, and once all the replies are in, we can issue the rmdir to remove the directory.

Again, this is real bleeding-edge work, so these numbers may change, but for now I did some basic testing in a virtualized test environment on my home machine, creating 10,000 files in a directory. I should say that when we started this work, we started with unlink because it's easier; creates are quite a bit harder, and I'll go into why later. If we remove all the files in that directory synchronously, waiting on each one, it takes about 10 seconds on this box. With asynchronous dirops, less than a second. The catch is that we have to wait for all the replies to come in before we can issue the rmdir. Here I'm just removing the files, not the directory itself; if you remove the directory too, you'll find it blocks for a while. It's still faster than the synchronous case, because we're issuing all the unlinks in parallel, but you do notice a delay on the rmdir.

Here are some more numbers: histograms I created with BPF. (If you don't use BPF yet, you should; it's awesome.) These show the time spent in the CephFS unlink path; I recorded them in jiffies. Synchronously, these are all quite slow; the fastest are still in the 512k-to-1M bucket. With async dirops we're down into the one-to-two-thousand range. You see one outlier down here, and some others; I haven't gone and figured out why some aren't as fast. I think occasionally we get a situation where they go synchronous; again, we have to do at least one synchronous unlink in the directory before we can do async ones. ("Jeff, I think those are actually microseconds." "I thought it was in jiffies... maybe I'm off by a factor of a thousand. Maybe you're right. Okay, I stand corrected, though I did collect them in jiffies.")

We have some opportunities to improve this. Right now we do the rmdir synchronously, but we may consider doing that asynchronously in the future. And as I said, we have some outliers; I think what happens is that occasionally we end up doing something synchronously, and those operations get backed up behind the pile of async operations in flight, so we probably need to consider some throttling. We could also consider batching up the unlink operations: if we could batch a bunch of them and fire them all off in a single call, that might be more efficient. ("I'm not convinced on that. With the lock caching Zheng put in, it seems to me that's where most of the slowdown would be, so batching may not be as useful; but you may experiment and find a benefit." "It's probably the MDS journaling: we might expect the MDS to write batched operations to its journal more efficiently." "That's a good point.")

Okay, now let's talk about async create. The requirements: we need Dx and Dc caps; again, we've overloaded the FILE caps for directories. We also need a known-negative dentry in this case, of course, because if there's already a file there, we can't create on top of it. So we need either a negative dentry lease, or, if we don't have the lease, Ds on the parent directory (which we already have by virtue of having Dx) plus completeness. We also need a file layout.
As Patrick pointed out, file data goes directly to the OSDs, so the clients need to know where to write that data, and the file layout is what tells them. When we do that first synchronous create in a directory, we copy the file layout, because we know that any new file created in a directory will inherit the layout from its parent. We also need a delegated inode number: whenever you do a create, we're creating an inode and a dentry pointing at that inode, and we need to know what the inode number is. I'll talk about that a little more in a minute.

So again, we just fire off the create call immediately, set up the new inode, plug it into the dentry, move on, and return from the open. We always assume that a newly created inode gets a full set of caps, because typically, even on a synchronous create of a new file, we get back a full set of caps on it. And we always set O_EXCL in the call, because there should be no reason for the file to already be there; if it turns out it is, we want the call to error out.
We don't want to just open the existing file and screw things up.

So, inode number delegation. We need to know in advance what the inode number is going to be when we create the new inode, in order to hash it properly in the kernel so we can find it later, and also to allow writes into the file before the open reply comes back. So the MDS now hands out ranges of inode numbers in create responses: on that first synchronous create, the MDS will shovel a pile of inode numbers out to the client, which can then go use them. There's a new tunable in the userland MDS for this. The MDS already preallocates inodes for a particular client and attaches them to its session; what we're doing is delegating a pile of those to the client for its own use. This is currently tied to the MDS session, and I think we still need some work in this area: right now, if you lose the session, the inodes go away, and it's not clear to me how we'll handle the case where we've already fired off a request and it didn't work. Error handling on this is all still a bit sketchy.

So let's talk about performance. Again, we're creating 10,000 files in a directory, with a very simple shell-script loop writing to all of them. Without async directory operations it takes about 11 seconds on this box; with them, about half that. A nice improvement.

Again, histograms. (If you look at these with BPF you'll see nice text bars alongside; I chopped those off because I didn't have room on the slide.) Synchronously, these are all quite slow; most of them are in the 512k-to-1M range. With async dirops, we're down into the thousands. We do have some outliers here too. Again, I need to do some analysis to figure out why, but I think the situation is that we run out of delegated inode numbers; when that happens, the client has to go synchronous, and some of those calls end up backed up behind previous async calls and take a long time to come back. So we probably have some throttling work to do here as well.

The all-important kernel build test: make a directory, cd into it, untar a Linux tarball, and make. Without async dirops it's about five minutes to do the build; with them, we shave about 50 seconds off. That's roughly a 20% improvement; not bad.

Opportunities to improve this: we could allow in-place renames, and we may need to make rename asynchronous eventually. We could also batch creates, if we buffered them for a little while instead of firing them off immediately. And it would probably be nice to add other operations: mkdir, symlink, that sort of thing. And of course error handling, which is the bugaboo for this whole thing.

If we return early from an unlink or an open, what do we do if they fail? That's the big question. For creates, we could have already closed the file by the time the reply comes in; for small files in particular, we could have opened them, written to them, and closed them, and then the create reply comes in and we find out it didn't work for some reason. There's a bullet here about which failures are permitted by the protocol; Patrick, you put that in, do you want to explain?

Sure. Obviously there are a lot of different types of failures an unlink or open could hit. Part of the challenge with doing an asynchronous create or unlink is identifying which failures are entirely the responsibility of the client, failures it would not be permitted for the MDS to return on an asynchronous call, versus failures that may genuinely occur at the MDS, for example running out of space, which the kernel client would need to handle somehow. (Out of space, in practice, just doesn't happen for metadata mutations in Ceph unless you have a catastrophically full cluster, in which case you have many other problems.) There are probably not very many failures the kernel client can't actually handle itself, but we still need to go through all the different cases and make sure we're not missing anything. That's basically what that bullet point is for.

So again, when I started this work, I kind of hung the whole thing on this paragraph from the fsync(2) man page. It says that calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached the disk; for that, an explicit fsync() on a file descriptor for the directory is also needed. The upshot is that when you create a file and write to it, you also need to fsync the directory to ensure the dentry actually made it to disk. In practice, with most filesystems nowadays, you don't need to do that anymore. This was written, I believe, when ext2 was prevalent; nowadays almost all modern filesystems journal the create before returning to userland, so if the box crashes and comes back, the file will certainly exist.

Just to add on to that: you still see the remnants of very cautious applications written to fsync the directory, namely SQLite, which actually has some excellent documentation on the whole rigmarole of writing the database synchronously so that it survives any kind of crash or machine failure. They do actually fsync the directory file descriptor. But as many kernel developers will bemoan, it's actually fairly rare for applications to use fsync correctly.

Yeah. In most cases, fsync on a directory is usually pretty quick, because there's not much to be written back. But in this implementation, if you do an fsync on the directory, we will wait for all the outstanding async operations to come back, so you can use it as a barrier to ensure things actually did hit the MDS and do the right thing. So, more about error handling.
Currently, after a failed unlink, what we do is: we mark the directory non-complete, because we don't know what the heck happened; we invalidate the dentry that was there, to force the client to do a new lookup; and we set a writeback error on the parent directory, so that if you fsync it, you'll get an error back. After a failed create, we again invalidate the dentry (we should also mark the directory non-complete), and we set a writeback error on both the parent directory and the created inode.

This is an area I'm still exploring. One idea might be to propagate errors all the way up to the top of the mount, so that you could open a high-level directory, call fsync on it, and get an error back; if you see that error, you know something failed down below. In the modern world, where we're doing a lot of stuff in containers, spinning up temporary containers to do builds and such, that may be sufficient: if something fell over during the build, you can just throw that build away and start again. And we may need to consider new interfaces. Another idea might be to use syncfs. The kernel's syncfs has pretty lousy error handling right now; basically the only error you can reliably get back from it is EBADF, if you pass it a bad file descriptor. But we could use that as well.

And that's it. Questions? Jeremy, you had a question earlier.

Yeah, you kind of answered it later on, but I was wondering if the Ceph protocol has the equivalent of what SMB does, where you chain operations together. We have an asynchronous open with delete-on-close marked, and then a close, and you chain the things together and issue them together in one RPC; the server then processes them in whatever order it wants and you're just waiting for them to come back.

No, we don't do compounding.

More's the pity. It might be a future enhancement; I'd love to see it. Especially if you're doing an open, a lot of operations, then a close, the ability to implicitly sequence all of those, using the file handle from the open implicitly, is very powerful.

Yeah, I'd love to have that. The protocol wasn't designed with it from the get-go, so we don't have it today. Any other questions?

We're working on an open source backup software named Bareos, and I hope this question isn't too off-topic. GlusterFS has something named glusterfind, which can give us a list of files and directories changed since a certain point in time. Does CephFS have anything like that?
Not that I'm aware of. Anybody else, or Patrick, may know.

Okay, so CephFS has this concept of recursive statistics, which is like a stat on an inode except recursive in nature. For example, you can figure out how many files are under a directory tree just by looking at a specific extended attribute of the directory. One of the things you can look at is the version of the files, and that trickles all the way up to the root. So if you know a file has been changed recently, you can look at the recursive statistics of a directory to see that something under it has changed, and keep going down, examining as you go. It's not quite as efficient as the GlusterFS approach, where you get the entire diff and can see all the files at once, but the mechanism is there to do it yourself. In the future we do want to make it simpler, by building glusterfind-like support into CephFS and making it a first-class citizen of the filesystem. So: not quite, but you could do it yourself if you want to.

Any other questions? Jeremy again.

On returning errors, I thought that was interesting. The only errors I can see, other than the remote disk dying, are if someone renames the directory out from under you, or changes the permissions on the directory so that your operations would get access denied. I'm assuming you'd get a lease on the directory first, such that those would be blocked, or you'd get a pending notification before they happened; that's how you'd handle it, right?

Right, that's what those caps are all about. They're kind of similar to an oplock break, just on part of the metadata. Those changes would not be allowed while we held exclusive caps on the directory, and you can't do async operations unless you do.

Got it, thanks.

Any other questions?

Just to add on to that: we didn't really get into this in much detail, but you heard Jeff say several times "the first synchronous create" or "the first synchronous unlink". Our diagram was actually a bit misleading there, in that not all of the unlink requests are completely asynchronous. The first one has to be synchronous, so that the MDS can acquire all the necessary locks and issue any necessary capabilities for you to do further unlinks yourself. For creates, one reason is that you need to get the file layout, which tells you how to stripe the data blocks across multiple objects for the new file. And because file layouts can also be set hierarchically on a subtree, the synchronous first create ensures that no directory above you has had its file layout changed such that the layout on a new file should be something else. So the client can safely move forward, knowing the layout hasn't been changed out from under it by operations on a higher-level directory, and all of that is protected by the caps the MDS issues. Any other questions?

So this is all still experimental, but part of it is already in the code. When will it be available? Will this be in v16? Is it upcoming in one release, or is it a multi-year effort?

I think the userland part, the MDS part, is in; we had to get that in before we could even figure out whether this would be worthwhile, and I think it is, so we're going to keep pursuing it. The kernel bits: I have prototype code that works, of course, but it has some rough edges, and it won't make this merge window. So 5.7 would be the absolute earliest for the kernel part, but probably later. I have a feeling I'll want to do a cleanup of the syncfs code in the kernel first, and then turn around and maybe plumb the errors through that, and recommend we use it for the error handling. That will probably be a bit of an effort, because the syncfs code needs some work. Any other questions?

Just to add to the answer to that question: the MDS bits are actually going to be in Octopus, which will be released in March, so all the groundwork is there. The ceph-fuse client probably won't have support until the P release, which we're calling Pacific, so in 2021. But the kernel client, which is getting most of our focus right now, should hopefully have something merged into mainline within a few months, assuming everything goes as we expect.

Yeah, I think the unlink code can probably go in fairly soon. I almost had it merged for this coming merge window, but I decided to wait, because I'm going to have to rework some of the underlying code for the creates to be better, and I wasn't comfortable merging the unlink part until I felt better about what that part would look like. So it's not quite there yet, but maybe the later part of this year. My guess is that we'll have something in before too long, and it will probably be an optional feature, too. Right now I have it on a module parameter, but we may eventually make it a mount option or something like that.

I think we're out of time now, so if you have any questions, let me know.