Hello, thanks for having me. I thought my talk was in an hour; that's why I kept you waiting, and I'm really sorry about it. Apparently my watch is still in Berlin. So, let's get to it. I'm going to talk about ZFS directory scaling.

Who am I? I've been a FreeBSD committer since 2018; you may know me as 0mp. I'm now serving on the FreeBSD core team, and most days I work on cool stuff with the folks at Klara. In general I poke around things like documentation, ports, rc scripts, tracing, and ZFS.

The outline of this presentation is as follows: first I'm going to tell you about ZFS and dive a bit into the implementation details of how it works. Then I'll pivot to directories, then we'll talk briefly about scaling, and I'll give you some numbers about how tuning directories in ZFS can work.

To give you some background on this whole work: this is benchmarking I did for a client at Klara. The orange line is what ZFS does with the defaults: when you write many files into a single directory in ZFS, the performance drops, as you can see. But if you apply some tuning, which I'm going to present to you, you can be where the green line is. Of course it depends on your setup, your disks, the rest of your tuning, but essentially there are some performance gains that you may try to get out of your system.

So let's talk about ZFS. What is it? It's a copy-on-write file system. It was developed at Sun in 2001, it was imported into FreeBSD shortly after, in 2008, and as of FreeBSD 13 the system uses something called OpenZFS, which is a variant, a fork. There are a couple of different implementations of ZFS in the wild, and OpenZFS works on both Linux and FreeBSD. This allows for more cross-pollination of ideas and more testing; it's a pretty cool initiative. So, what does ZFS do?
There are a couple of things. The most important one is maybe data integrity: every block you have on your system is checksummed, which is nice, and when data gets corrupted, ZFS can detect it. If it's not too much of a corruption, it can also try to correct it, or at least report to you that hey, something is wrong, you should really take a look at those disks.

It also offers data consistency. Even if your system crashes, you won't end up with a disk that is in an inconsistent state. Whenever you write to a disk in ZFS, your data is kind of saved on the side and waits for the checkpoint to come. When the checkpoint arrives, all your writes are made official, and now they are the new version of the state of what's on the disk. So you never have an inconsistency there.

And there is pooled storage: there's no need to partition disks in advance. You just give all the storage you have to ZFS, and then you can create partitions, grow them, shrink them; you don't have to think in advance about how you want to split your storage between different partitions.

Apart from that it does many other things, like snapshots, efficient remote replication, compression, encryption, deduplication, all kinds of cool stuff. It can do all those cool things because of the way it's structured; we get many features out of the implementation, which I'm going to tell you about.

I would like to give you a short introduction to what happens when you write data to ZFS. So, what do we have here?
We have the uberblock at the very top; this is the root node of this whole tree that ZFS consists of. At any given point you have one uberblock that is the active one, and it points down to this whole tree, which is all the data you store in ZFS. In this case the uberblock points, through many intermediate stages, let's say for simplicity to a pool, then to a dataset, then maybe some directory, and then we end up at the .profile file, which has some indirect blocks, and then we have the actual content, the data blocks. Let's say one of them contains "EDITOR=vi" and the other contains "PAGER=less". This is the actual data stored.

So what happens when you want to replace your editor vi with ed, because vi is too complex and you want something simple? You issue a write, and ZFS allocates a new block: it copies the old block, modifies it the way you want, and saves it on the side. Then the whole tree, all the parents, are updated, because now we have a new block, so the block above it needs a new checksum, because the content below it changed. So you go up, and then, oh, we have to go up again, and you have to copy the .profile dnode because of the checksums and so on, and as you go up you prepare all those new blocks.

What happens later is that when there is a checkpoint, you start pointing to the new blocks you just created in this writing period, and the new uberblock gets activated. From now on you have a new tree with all the updated data. That's how we avoid inconsistencies: first we write everything we need in this new checkpoint, and then, once we're ready, we just switch the root node, which is the uberblock, and we're good.

So, what is a file in ZFS?
It is an object, and an object is just a group of blocks. It's organized at the top by a dnode, which is something like an inode in UFS. Almost everything is an object in ZFS: files, directories, datasets. In this case, when you look at the file object, you have the file object in the red frame; the yellow frame is the dnode, which is the root of this subtree, and one of the things it points at is the block tree where the actual data is stored. Sometimes, apart from the direct blocks where the actual data lives, you also need indirect blocks, which help you address more than just a bit of data; the tree has to be bigger to store more data. Those are the parts of a file object that I would like to talk about today.

So let's take a closer look at files. What you can do in ZFS is inspect those objects. What we do here is create a half-gigabyte file on ZFS and then inspect it with zdb, which is like a debugger for ZFS.

Let's take a look at what's happening in this output. There are a couple of interesting things. For example, the blue highlight, fletcher4: this is the checksum algorithm, and you can change it to a different one. Different algorithms offer different trade-offs: some of them are faster,
some are slower, but you have a choice; depending on your workload, this may improve performance or change the characteristics of your system. The yellow highlight, for example, is the compression. ZFS compresses all your data by default, which is nice, because then when you read the data from the disk you have to transfer fewer bytes, so it makes things faster, and decompressing is usually quite fast. Of course, depending on your workload, you can either pick compression algorithms that are faster but compress worse, or pick one that is very CPU-intensive but compresses very well, if you want to save some space on the disk.

Then we move a bit lower, and the green highlight shows us the block tree. It says "Indirect blocks", and then L2: this is a level-2 indirect block; then we have some level-1 indirect blocks, L1; and then we have L0, the direct blocks, which is where the actual contents of our file are. In each of those entries you can see the address, the size, the checksum. zdb is a nice tool to know about if you want to explore ZFS a bit more.

For example, in the BSDs we have this wonderful file called flowers; if you don't know about it, you really should, it's very useful. So what we're going to do, just for experimentation, is set the record size to 512 bytes, so not too much, and then copy the flowers file into the dataset. Now when we inspect it with zdb and take a look at the output, we can see that we allocated exactly three data blocks, which is what we expected: if you look at the output of ls at the top, this file is one and a half kilobytes, so three blocks of 512 bytes should be enough to store it. That's cool. So let's see what we can do now.
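For reference, the demo just described might look roughly like this on the command line. This is a sketch: the pool and dataset names (tank, tank/demo) are assumptions, and the object ID placeholder has to be filled in from the zdb listing.

```shell
# Assumed pool/dataset names; record size of 512 bytes as in the demo.
zfs create -o recordsize=512 tank/demo
cp /usr/share/misc/flowers /tank/demo/
ls -l /tank/demo/flowers   # about 1.5 KiB, so three 512-byte blocks

# Look up the file's object ID, then dump its dnode and block tree.
zdb -O tank/demo flowers
zdb -ddddd tank/demo <object-id>
```

With enough -d flags, zdb prints the indirect-block tree down to the individual L0 entries, which is where the three data blocks show up.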
We can use zdb to extract the contents of this file to take a look. We use awk to parse out the exact address, and then we use zdb again with -R: we tell it, hey, on this pool called tank there is this address, so give me the contents of this block. As you can see, we get the same output from zdb as we get from head, which is nice. What we did here with the addresses and copying works; you can feel that you have a bit more control over what ZFS does. This and much more we can do with zdb to explore what's going on in there.

Now, directories. From our wonderful manual pages we can learn that directories "provide a convenient hierarchical method of grouping files while obscuring the underlying details of the storage medium", and that a directory consists of records, directory entries, each of which contains information about a file and a pointer to the file itself. So a directory is basically a map of file names to some identifiers. Let's take a look at how it's done.

We can do a couple of different operations on a directory. Today we're going to focus on pathname searching, so lookups, and name creation. I won't be talking about the performance of all the other operations; it's out of scope for today.

What is a directory in ZFS? It is an object, just like a file. The difference is that it has a different tree: it's not a tree of data blocks, it's a tree of ZAP blocks. But then, what is a ZAP? The ZAP is the ZFS Attribute Processor. It's a key-value store, a map: the keys can be strings, and the values can be strings, numbers, arrays of numbers, different things. It's very flexible. It's used primarily for directories, but that's not the only place where it's used in the code base. It's implemented as an extensible hash table, and it scales nicely up to hundreds of millions of entries in a single directory. It's a good data structure.
It's slightly complicated, but it's a good one. It's also called a fat ZAP, because there is also a micro ZAP, which is a much smaller data structure designed to remove some of the overhead of this complicated structure for smaller directories. A micro ZAP is just a single block, and it stores the mapping of file names to objects, so you cannot really store complicated things there; it's just strings to some IDs. It can only hold a maximum of 2047 entries by default, and keys are limited to 50 bytes, so if your file names are longer, you won't be able to use a micro ZAP. And values are limited to integers, but that's obvious.

A micro ZAP gets automatically promoted to a fat ZAP. Whenever you create the 2048th file in your directory, ZFS will change the micro ZAP into a fat ZAP for you and start using this whole complicated data structure, because it's getting ready to handle a lot of files. The same thing is going to happen if you have a long file name.

One of the problems is that if you make a mistake in your home directory, for example, and by some accident a script creates temporary files and you end up with a million of them in your home directory, ZFS will gladly do it and it will work. Then you can remove all those files, but the fat ZAP is going to stay there. So from now on, whenever you run ls, ZFS will have to traverse this whole extensive fat ZAP that you just created, and to get rid of it you will have to recreate the directory, basically, which is not exactly something you want to do with your home directory. It's a known limitation, let's say, and people are working on it; if you want to check the progress on the patch that is in flight now, look for OpenZFS issue number 14088.

So, scaling. This is going to be a very high-level introduction to what you can do to ZFS. There are a couple of different ways you can
So there are a couple of different ways how you can Change how ZFS behaves the first one is you set properties on data sets The other one ZFS create you can also set those properties during the creation of a data set Some of the properties can also be changed with CCTL and then depending on what property it is it influences The whole ZFS module that runs on your machine or it or like specific pools or data sets For some of the Tunables you have to set them before loading the module and then you do it with a loader for example So there are many different places sometimes you can get a bit confused where you can do it But in general the documentation is quite good some of the popular tunable that people Set or like start with when they're trying to Tune ZFS is our a time which is access time This is owned by default. So whenever you access a file It also gets Well, the a time gets updated so as the best has to write in a new block where the a time is, you know a new one This can impose a lot of overhead and if your application doesn't Depend on a time. It's a good idea to turn it off the record size so this one is a very How should I say Popular problem in when people deploy ZFS because their application is Writing in like different chunks in different sizes than then ZFS is storing data and this mismatch creates performance problems Well, usually because the test has to do more work than it should be doing So it's good to align it with your workload Another one that is primary cache you can change what you what you really want to cash in ZFS you can Either store like data and metadata inside or just metadata This makes sense. For example when your application does its own caching So that you don't cache data twice and then things like compression, right if you want to Have this trade-off of Spending more CPU time, but then having data more compressed to gain more storage. 
You can do that with the compression tunable. Now I'll switch to something more specific to directories. This is the first of two tunables I'm going to talk about: it's a way to control the maximum size of a micro ZAP, and the sysctl is called vfs.zfs.zap_micro_max_size. At the moment the default size for a micro ZAP is 128 kilobytes, and that's where those 2047 entries fit. If you want to store more, so if we want to delay the moment when the micro ZAP turns into a fat ZAP, you can increase this maximum size, for example to one megabyte, and then you can store slightly over 16,000 files in this single block. It was made a tunable fairly recently, earlier this year.

To give you some idea of the advantages and disadvantages of a larger micro ZAP: when it's larger, the directory object has fewer blocks, so when you're reading, you don't have to read a whole tree of fat ZAP blocks; you just fetch this one block and you're good, because now you know the whole structure of the directory. On the other hand, when you are writing to a directory, you no longer update small blocks; you update this one huge block. So if you write to your directory often, this can lead to overhead. There are always trade-offs.

The other ZFS directory tuning knob is the size of indirect blocks. It can be controlled with vfs.zfs.default_ibs; the default is 17, which means 2 to the power of 17, which is 128 kilobytes. It has been available on FreeBSD for a long time, but it was also made a tunable on Linux only recently. The advantage of having smaller indirect blocks is that you have to process fewer bytes when you're reading or writing.
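Both of these knobs are module-wide tunables rather than dataset properties. A sketch of setting them on FreeBSD; the 1 MiB value comes from the talk, the shift of 15 is just an example value, and whether default_ibs is settable at runtime may depend on your version:

```shell
# Raise the micro ZAP limit from 128 KiB to 1 MiB (roughly 16k entries).
sysctl vfs.zfs.zap_micro_max_size=1048576

# Indirect block shift: 2^17 = 128 KiB by default; 15 would mean 32 KiB.
# If it is not settable at runtime on your system, put it in
# /boot/loader.conf instead:
#   vfs.zfs.default_ibs=15
sysctl vfs.zfs.default_ibs=15
```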
But on the other hand, you may end up with a larger tree that you have to traverse. So again, trade-offs.

Let's see how those tunables behave when you actually use them and run some benchmarks; let's talk about some numbers. I prepared two benchmarks for you today. One is a lookup benchmark, where we recursively traverse the directories and process every entry that is there. The other one is create: we create a lot of empty files, just to update the directory structures and see how fast it gets. I used hyperfine as my benchmarking harness; it's a pretty cool piece of software if you don't know it, very nice for ad hoc experimentation. And the tunables you already know; those are the ones I introduced in the previous slides.

So, benchmark one. What we're measuring is the time to list the files of all subdirectories; I'm just using a very short C program that does ftw(3). We're going to focus on three parameters here: the files per subdirectory, the maximum micro ZAP size, and the indirect block size.

This is what happens when you have 16,000 files per directory. This number exceeds the default micro ZAP size, but it fits nicely into the one-megabyte micro ZAP, just before the fat ZAP kicks in. As you can see here, larger micro ZAPs increased performance, because we could stick to the micro ZAP for longer, so reading out the directory structure was only one block and not many; that's probably why we see the difference. The indirect block size doesn't really matter here, or at least not in this particular deployment.

Then we test 64,000 files, so we definitely exceed the limitations of both the default micro ZAP and the tuned micro ZAP, and as you can see there are pretty much no differences in performance.
It's just noise: once the fat ZAPs kick in, you don't really see the difference. One of the takeaways from this is that if you want to tune for your workload, you should really think about how many files you're going to store in your directories. There is no silver bullet that just makes everything faster every time; it has to be adjusted to your own workload and your system.

Let's take a look at one more example of the lookup benchmark. Here we have 16,500 files, so we exceed both micro ZAPs; the bigger one was exceeded just a couple hundred files ago. But we turn primarycache off, so we are not caching anything, which results in the indirect block size being very relevant in this situation. The micro ZAP size does not really matter anymore, from what we see, but the performance of retrieving smaller blocks from the disk really makes a difference. So, as you see, depending on the workload again, or how your system is designed and how much memory it uses, different tunables can do different things; they can have a different effect on your system.

Benchmark two. We're going to look at the writing case for directories. What I'm doing here is measuring the time to create files in a directory: I open a file, so now I have an empty file, and I sync it so that it gets sent to the disk. The parameters are the same, so let's jump to the results.

Here I'm running the benchmark with 2047 files, so it fits both the default micro ZAP and the large one, and there is no observable difference yet; that's probably because the system is too fast or something, so we can't really notice. Let's go further: when we have 16,838 files, we exceed the default micro ZAP, but it just barely fits into the tuned one. We already see some differences: larger micro ZAPs are more expensive for writing, because we overwrite them much more often and they are heavier; we need to write more bytes.
So it's slower, and this can be pretty disastrous for your performance, because now it's twice as slow as it used to be. So yeah, measure your system before you tune. However, when we increase the number of files to 64,000, the performance gap starts to shrink; the difference is shrinking, basically, which is a good thing. So again: measure your systems.

Now, the summary. There are two important tunables for directory scaling: the maximum micro ZAP size and the indirect block size. The tuning takeaways are: reads are faster with larger micro ZAPs and smaller indirect blocks, and writes may slow down when using larger micro ZAPs. As has been said many times, tuning depends on the system, so you should really measure the system before tuning it. But basically now you know which two tunables are relevant to this part of ZFS, so when you decide that, hey, this is relevant to my system, to tune how fast my directories are working, you can use those and see how it plays out. Do you have any questions?

Should I repeat the question? Yeah. So the question is whether I compared it to UFS. No, I didn't, but it's very interesting, and I thought about it. The reason I didn't was that it's very tricky to benchmark two different file systems against each other, especially since ZFS has so many layers of indirection and caching and so on. So I didn't do it because I wasn't sure I could deliver reasonable results.

The question is whether I measured globbing in a shell, so that I'm not looking up all the files, but a subset.
I Didn't do it But then when you're reading out of the rectory and you would when you're using globbing you I guess you You still have to read the whole Data structure just to see if you're matching anything So I think that would be the same The question is if the tunables are data set specific or they influence the whole system They influence the whole system from what I see Yeah, if it was otherwise, we would see at least, you know some a part of the city I would be a name of a data set or a pool. Ah, yes. Yes, of course. I wanted for the system to stay like I Wanted the directories to really be micro zaps and I check that with CDB after every So the question is if we can change this this limitation of micro zaps to not be 50 bytes, but maybe less or more there is no tunable that I know of and I Guess that what you can do you can just patch it out and see if you can fit more directories in there like entries there But that's like a manual operation. I would assume So you still have to read out the whole thing The interesting bit is that when you when you switch from a micro zap to a fat zap you just add one more block I mean the structure of everything changes But effectively when you look at the output of CDB you have one more block to deal with that answers your question. It's So is the question does it help to have sub directories or? Oh, yeah, nesting is very helpful if you can change your workload to Nest directories instead of fitting all the files into one directory This is one of the ways to make it. I mean then of course you have to do more lookups, right? Because you have to traverse the whole hierarchy But it's From what I saw this is usually the better idea Mm-hmm. Yes So the question is if ZFS has a special cache for names Ah for for the name like the name cash I Don't remember. I'm not sure if I ever knew So the question is if I have If I measured the the number of IO operations That were issued during the benchmarks. I Didn't measure it. 
I mean, I did it when I was doing the work for the client; I didn't do it for this presentation of the general intuitions of how to work with those tunables. But if you look at the very first graph, that is write throughput. So I'm not sure how to answer your question: I didn't do it, and I don't have any numbers I can share with you on that. It could be done.