We have Siddharth and Patrick up on stage for a talk about how to improve CephFS's metadata operations. Hey guys, I'm Siddharth. Today I'll be talking about how distributed file systems manage and distribute their metadata, and how CephFS handles that problem as well. Finally, I'll give an overview of ephemeral pinning, a new feature for CephFS that we are actively working on. So let me start my talk. The outline, as I said, is: how metadata is handled by a lot of file systems, how CephFS handles the metadata problem, and finally ephemeral pinning. Jeff and Patrick have already given an introduction to CephFS, so I'm not going to repeat much of that, but I'll briefly cover its components. First, you have the metadata servers (MDSs), which handle metadata transactions and issue permissions to clients, et cetera. Then you have the clients, which do file I/O as well as metadata operations. And finally, you have the backing object store, called RADOS, which is made up of object storage devices, OSDs. RADOS is the crux of CephFS, because that is where the actual storage happens. So one important question that can arise is: why do we need metadata servers at all? Why can't we let RADOS handle the metadata workload as well? This may seem intuitive at first, but it doesn't hold up. Upon analysis, you can see that metadata operations take up almost 50% of your total file system operations. Another thing is that RADOS object storage scales in a linear, straightforward fashion, but that is not the case with metadata. Metadata is fairly complex and hierarchical in nature, as you can see from the file layout, so it is not easy to scale. It therefore becomes essential to decouple your metadata from your file I/O and storage.
So now that we've established the importance of metadata servers, I'll come to how a lot of distributed file systems handle metadata — I mean file systems that use metadata servers. A very common metadata handling strategy is pure hashing, employed in file systems like Lustre and zFS. What happens here is that the client computes a hash of the path name of the file to determine which MDS the file — I mean, the inode for that file — is going to be on. For example, if I, the client, want to create a file called, I don't know, somefile.txt, I hash the path to that file and get a hash value. Meanwhile, each MDS has been delegated a hash range: you can see MDS1 is delegated one range, MDS2 the next, and MDS3 another. In the example, the hash value lies within the range handled by MDS2, so now the client knows it has to talk to MDS2. That, in brief, is how a hashing-based strategy works. One advantage you can see with this strategy is that requests are distributed almost evenly across the cluster, since the placement of metadata is pseudo-random. Another advantage is that if the file name is hashed as well — by that I mean not just the path to the parent, but the file name too — then heavy create activity in a particular directory does not create a hotspot of load on a single MDS. A simple example: if you have two files under /foobar called set1.txt and set2.txt and you hash them both, they will most likely give two different hashes, so those two requests are directed towards two different MDSs, and you get a better distribution in this case as well.
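As a rough sketch of the scheme just described — the hash function, MDS count, and even-range split here are illustrative assumptions, not the actual implementation of any of these file systems:

```python
import hashlib

NUM_MDS = 3  # assumed cluster size, for illustration only

def mds_for_path(path: str, num_mds: int = NUM_MDS) -> int:
    """Pick the authoritative MDS by hashing the full path, file name included."""
    # The hash space is split evenly into one contiguous range per MDS,
    # so mapping a hash value to its range reduces to a modulo.
    digest = hashlib.sha1(path.encode()).hexdigest()
    return int(digest, 16) % num_mds

# Sibling files under the same directory usually hash to different MDSs,
# which is what spreads heavy create activity across the cluster.
placement = (mds_for_path("/foobar/set1.txt"), mds_for_path("/foobar/set2.txt"))
```

The client can compute this locally, with no lookup table, which is exactly why the scheme needs no central directory service.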
Now, some disadvantages you can see in this kind of scheme. According to file system traversal semantics, you have to traverse from the root of the directory tree all the way down to the file. So with /foo/file.txt as an example, you have to go all the way from the root to file.txt, as detailed here. In this case we found that MDS2 is the MDS that offers the client the permissions to read file.txt, and if the client does have the required permissions, then MDS2 will grant it the capabilities to talk to RADOS directly for file I/O. An obvious result of this is that it brings in a lot of inter-MDS hops — you get hops between MDSs, and this in turn results in high network overhead. Another disadvantage is that with hashing-based schemes you lose hierarchical locality. By this I mean you lose any metadata locality benefit you would get from having a whole directory subtree in the MDS cache, because everything is uniformly distributed. Another disadvantage, faced by a lot of file systems, CephFS included, is that renames are expensive. If you have to rename a directory that is far up in the directory tree — say /a/b/c/d/file.txt is renamed to /f/b/c/d/file.txt — then you have to recompute all of the hashes underneath it all over again, and this is going to cost you in terms of performance. Another strategy, which I'm not going to go into in much detail, is Lazy Hybrid. Lazy Hybrid works similarly to a hashing-based strategy, but it differs in that you don't have to resolve each component of the path name: you directly hash the full path, and when the request reaches the particular MDS, that MDS stores, for that particular file, the effective permissions derived from the access control lists along the path — so you just need to check that.
So this implicitly solves the costly problem of traversing between MDSs. The disadvantage is that, like pure hashing, you lose hierarchical locality — there are no locality benefits in this case. Another disadvantage is that for renames and for permission changes you will have to traverse the whole path all over again, which is expensive, even though those operations are pretty infrequent. And finally, this is not exactly POSIX compliant, because you're going to have problems with distributed locking and such. Now we're coming to something close to how Ceph does it: subtree partitioning. In subtree partitioning, subtrees of the directory hierarchy are assigned to individual MDSs. In this diagram — just to note, the shaded nodes are directories and the unshaded ones are files — you can see that the orange subtree is assigned to MDS2, the gray one to MDS0, and so on. With subtree partitioning you get linear growth of your metadata cache along with the number of MDSs, which is pretty beneficial, and your cache utilization increases because spatial locality increases. Now, some fairly obvious advantages. First, you have good hierarchical locality: since you're caching (not exactly storing) entire subtrees in RAM, you don't have to traverse from MDS to MDS in this case. Next, it scales horizontally: as you increase the number of MDSs, you can assign subtrees to them and get fairly horizontal scaling. Another advantage is that renames are not as expensive as in hashing-based distributions, because here, if you are modifying the directory tree, you don't have to hop from MDS to MDS — you can do it locally within the same MDS alone. But you're still going to have some problems with renames, which I'm not going to get into.
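A minimal way to picture static subtree partitioning — the table contents and helper name here are invented for illustration, and real systems also handle delegation and migration:

```python
# Each entry maps a subtree root to the MDS rank that is authoritative
# for everything beneath it; a path is served by the MDS owning its
# longest matching subtree prefix.
SUBTREE_TABLE = {
    "/": 0,          # root subtree cached on MDS0
    "/projects": 2,  # e.g. the orange subtree from the diagram
    "/home": 1,
}

def authority(path: str) -> int:
    """Return the rank of the MDS authoritative for this path."""
    matches = [p for p in SUBTREE_TABLE
               if path == p or path.startswith(p.rstrip("/") + "/")]
    return SUBTREE_TABLE[max(matches, key=len)]
```

Because whole subtrees live on one MDS, a lookup deep inside /projects never has to hop between servers — that is the hierarchical locality being described.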
So, all this considered, naive subtree partitioning seems like a good choice of metadata management strategy for CephFS, but that's not exactly the case — you still have a few problems here. One problem is that this does scale in breadth, but not so well in depth: when the workload grows in depth, you're still going to have a hotspot of activity on that one particular MDS — or, if it happens on multiple MDSs, then obviously you're going to have multiple hotspots, which is not good either — and meanwhile you may have non-busy MDSs whose compute resources are effectively wasted. So CephFS uses a more dynamic version of subtree partitioning, called dynamic subtree partitioning. The problems I've talked about can be mitigated by using a metadata balancer that can export subtrees from one MDS to another based on the load factor — how much activity is going on in that particular subtree or portion of the directory hierarchy. This is achieved using a metadata counter: whenever there is activity on a particular inode or directory (CInode or CDir), its counter increments, and based on that counter the balancer decides whether the subtree needs to be exported to a different MDS or not. And since you're not explicitly storing the metadata on the MDS, it's not really difficult to migrate it between metadata servers — migrating between caches is not that difficult. Even so, there are cases where you get a performance penalty rather than a performance boon: on some workloads you get excessive migrations, which is definitely not good.
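The counter-plus-balancer idea above can be sketched like this — a toy model, not Ceph's actual balancer; the class names, decay factor, and threshold are all invented for illustration:

```python
class SubtreeLoad:
    """Per-subtree metadata load counter, bumped on each metadata op."""
    def __init__(self, name: str, mds_rank: int):
        self.name = name
        self.mds_rank = mds_rank
        self.counter = 0.0

    def hit(self):
        self.counter += 1.0

    def decay(self, factor: float = 0.5):
        # Periodically age out old load so the counter tracks recent activity.
        self.counter *= factor

def pick_export(subtrees, num_mds: int, threshold: float = 2.0):
    """Return (subtree, target_rank) if some MDS is overloaded, else None."""
    load = [0.0] * num_mds
    for s in subtrees:
        load[s.mds_rank] += s.counter
    avg = sum(load) / num_mds
    hot = max(range(num_mds), key=lambda r: load[r])
    if avg == 0 or load[hot] < threshold * avg:
        return None  # cluster is balanced enough; no migration
    # Export the busiest subtree on the hot MDS to the idlest MDS.
    victim = max((s for s in subtrees if s.mds_rank == hot),
                 key=lambda s: s.counter)
    target = min(range(num_mds), key=lambda r: load[r])
    return victim, target
```

The "excessive migrations" problem mentioned above corresponds to this kind of heuristic firing too often on a shifting workload, bouncing subtrees back and forth.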
So CephFS gives the option to override the metadata balancer by allowing the cluster admin to manually pin subtrees: in cases where the metadata balancer is not working well, the cluster admin can decide to pin a subtree to a particular MDS of their choosing. But even giving the option of pinning subtrees to the cluster admin is not exactly foolproof, because you can't always predict how the workload is going to behave, and you can't gauge whether the cluster admin has the knowledge to decide where to pin the subtrees — which MDS to choose. So, to mitigate and automate that process, we've come up with something called ephemeral pinning. This is a metadata distribution strategy with consistent hashing at its core. The reason we decided to use hashing at all is that hashing gives you a proper distribution of metadata — but you are not hashing independent directories in this case, you're hashing subtrees. I'll talk about that later; while I'm explaining this you might find it difficult to follow, but I have diagrams that explain it properly, so just hold on. To achieve this, we've come up with two export pins — two xattrs (extended attributes), in distributed file system terms. These are export ephemeral distributed and export ephemeral random. The cluster admin can choose to set either one of these xattrs on directory inodes, based on knowledge of a portion of the workload — of how the file layout or the workload is going to behave. If you have a fairly distributed workload, in the sense that the workload creates a lot of directories under a single parent directory, you can choose export ephemeral distributed; and when you have a workload that grows in depth, you can choose export ephemeral random. Setting either xattr overrides the metadata balancer for that particular subtree. So I'll give you a brief
explanation of this — don't worry if you can't follow it immediately; I hope the diagrams will make it clear. If you set export ephemeral distributed on a directory, then all the child directories — the child subtrees — get distributed across the MDSs using a consistent hashing strategy; the hashing is done on the inode number of each child directory. Export ephemeral random is a bit different from export ephemeral distributed, in the sense that it works hierarchically: when you set it on a directory, the whole subtree — the subtrees nested beneath it, as they get loaded — will get distributed to random MDSs, probabilistically. It's worth noting that the value of the xattr is that probability I just mentioned, and we usually make this probability as low as possible; I'll explain this on the next slide. So this is export ephemeral distributed: the parent directory lives on MDS1, and you have set the distributed pin on it. Now, when the workload generates directories under this parent directory, the hashing is done and the metadata is distributed almost randomly across the MDSs. So, for a fairly distributed, breadth-wise-scaling workload, knowing this will help in setting the value of the xattr and scaling metadata optimally. Here is a very small workload, just to gauge how the inode distribution came out across 3 MDSs — it didn't even take a minute to run — and you can see how the metadata is distributed almost perfectly across the 3 MDSs. That is using the export ephemeral distributed pin. Now, export ephemeral random is different. You have the parent directory; one thing to note is that whatever is not getting pinned and hashed is assumed to stay on MDS1. So it is assumed that the parent directory is on MDS1, and directory 1 is
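Both pin flavours can be modelled in a few lines — this is a toy sketch under stated assumptions (the hash functions, virtual-node count, and parameter names are invented; CephFS's real implementation differs):

```python
import bisect
import hashlib
import random

class MDSRing:
    """Consistent-hash ring over MDS ranks, as used by the distributed pin."""
    def __init__(self, ranks, vnodes: int = 64):
        # Each rank gets several virtual points on the ring for balance.
        self.ring = sorted((self._h(f"{r}:{v}"), r)
                           for r in ranks for v in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def owner(self, inode_number: int) -> int:
        # Distributed pin: each immediate child directory is placed by
        # hashing its inode number onto the ring.
        i = bisect.bisect(self.keys, self._h(str(inode_number)))
        return self.ring[i % len(self.ring)][1]

def random_pin(ring: MDSRing, inode_number: int, p: float,
               rng: random.Random, parent_rank: int = 0) -> int:
    # Random pin: a directory loaded under the pinned tree is itself
    # pinned only with small probability p (the value of the xattr);
    # otherwise it stays on its parent's MDS.
    if rng.random() < p:
        return ring.owner(inode_number)
    return parent_rank
```

With a ring like this, adding or removing an MDS rank only moves the keys adjacent to its ring points, which is the consistent-hashing property discussed at the end of the talk.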
on MDS1 as well, while whatever does get hashed goes, as you can see, to different MDSs. Directory 2 is not getting hashed here — and this happens because, as I said, directories are pinned probabilistically. The way you do it probabilistically is: you have a random number generator generate a number, and you check whether it falls under the particular probability that you set. In directory 2's case, that check fails; for directory 3, you can see that the condition is satisfied, and it goes to MDS2. So if you have a fairly depth-wise workload, this can work really well. Finally, let me talk about why we are using consistent hashing — and let me first say what consistent hashing is. Consistent hashing is basically a distributed hash table scheme, but unlike with a naive distributed hash table, you do not have to resize the entire hash table on cluster modifications — when you are scaling out or scaling down. CephFS has another added advantage here: you do not need to store all the data structures of consistent hashing in memory, because MDS ranks are arranged in ascending numeric order. If you have, say, MDS A, MDS B, and MDS C, then each MDS has an implicit rank: MDS A has rank 0, MDS B has rank 1, and MDS C has rank 2. Having that kind of negates the need to store the ring data structures in your memory cache. That is it — you can check it out at this link. I still have a lot of benchmarking to do, probably on request distributions, so check that out too. And that's it; this is my mail if you have any doubts. Time for questions. No questions? That's good. Thank you.