And he has like 110, so at least I have half the slides that he has. I could have spoken about Yocto, if you know what I mean, but today, yes, it's going to be about file system performance with modern file systems. So I'm the founder of Bootlin. I like to participate in this conference, to spend time exploring things and sharing the results with you. That's the abstract, just for people reading the slides. I'm not a file systems expert; there was a great conference about that like one or two months ago. I'm just a regular embedded Linux engineer like you are, and I was given a significant amount of time by my company, Bootlin, to do some research on this topic, so I'm grateful to Bootlin for allowing me to do that. I actually gave a similar presentation, for those who were here, about 13 years ago, so a refresh was due, right? It was using the BeagleBoard, if some of you remember. So why this talk? Because rotating block devices are very rare now, especially in embedded systems. However, most of the file systems we're using were created when we still had rotating storage. We've had some evolutions since then, and new file systems have appeared as well, like F2FS and EROFS. So I wanted to benchmark those, see how they perform, and try to find what could be good solutions for your systems. However, it's hard to predict exactly how they will behave. File systems are a huge topic, so I'm just trying a few things. Don't take my results for granted: they are only valid for the hardware I tested, and I still had limited time, because we do projects as well, so I couldn't spend as much time as I wanted. I did the tests on MMC/SD, the most common type of storage, and I hope it's close enough to eMMC, though the performance could be different. But at least let's get started and see what we can learn. So the first thing you can check is whether a device is solid-state or not.
There's a nice file in /sys, populated by the block layer, that tells you whether a device is rotational or not. If it contains 0, the device is not rotational. This actually works for MMC/SD, but it doesn't work for USB mass storage, which, I guess, doesn't know what kind of device it is; that doesn't seem to be part of the USB specification, it's just storage. So let's look at the available file systems first. The oldest one, ext2, was introduced in 1993 and is still actively supported. It has low metadata overhead, and the module is small, so it's nice in terms of RAM usage and space. However, it's not very well suited for embedded systems, because if you abruptly shut down the machine, or you crash, you can have data corruption and metadata corruption. You need to run fsck, and that means someone has to look at the results, so that's not something that will allow a device to reboot autonomously. This problem was addressed by the introduction of journaling, for example in ext3, which was the successor of ext2, but it wasn't scaling well, so it eventually got deprecated. And by the way, you know that journaling reduces corruption and loss of information. I hope you agree with me. Now, the date range for ext2 is 1901 until 2038, because of 32-bit timestamps, and that's actually bad. It seems ext2 will have to go away: people using ext2 will have to migrate to ext4, with the right block size and the 64-bit mode for dates. That came as a surprise, because we talked about ReiserFS not being Y2038-ready, but that's the case for ext2 as well. So ext4 is the modern successor of ext2, introduced in 2006, with journaling but without the scaling limitations of ext3. It's still actively developed, but according to Ted Ts'o, who's one of the gurus of file systems, it was created as a stopgap based on old technologies. So it's not the greatest that we can do.
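As a small sketch of that check, the sysfs flag can be wrapped in a shell helper. The device name in the usage comment is only an example; adjust it for your system.

```shell
# is_rotational FILE: read a sysfs-style "rotational" flag file
# (e.g. /sys/block/mmcblk0/queue/rotational) and report the device type.
# A value of 0 means the kernel believes the device is solid-state.
is_rotational() {
    if [ "$(cat "$1")" = "0" ]; then
        echo "solid-state"
    else
        echo "rotating"
    fi
}

# Typical use on a target (device name is an assumption):
# is_rotational /sys/block/mmcblk0/queue/rotational
```

Remember the caveat from the talk: for USB mass storage this flag is not trustworthy, so treat the answer as a hint, not a guarantee.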
With the ext4 driver, by the way, you can support ext3 and ext2 at the same time, so you don't have to select ext2 and ext3 if you want to support those. Just keep the ext4 driver; I think it has an option to support those in a single driver, no need to have multiple ones. It supports transparent encryption, in case you didn't know; that was introduced more recently. There's no compression available in ext4, though. I checked the date range, so if your product doesn't need a warranty of more than 400 years, that's going to be fine. I checked that the minimum partition size to have a journal is eight megabytes, and you can have ext4 on a very small partition without a journal, a 256 KiB partition if necessary, with only 32 inodes. So you can use it on a very small partition. You create a file system with mkfs.ext4; all of them work like that. There's also XFS, which is also a journaling file system, dating back to 1994. It wasn't in Linux back then; it was developed for IRIX by Silicon Graphics. It's actively maintained, and developed by Red Hat now; it seems to be the default file system that they promote in Red Hat Enterprise Linux and in Fedora. So it's powerful, well developed, well maintained. Among the features: a variable block size if you want, direct I/O, online growth, things that could be nice in some systems. The minimum partition size is 16 megabytes, and you create a file system with mkfs.xfs. When I don't mention the upper limits of the partitions, they're so huge that it's going to be fine for embedded systems, like petabytes of storage. Next is Btrfs, which is not a journaling file system but a copy-on-write one. Well, English speakers say "better FS", "butter FS", or "B-tree FS". It's a modern file system with many features.
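To illustrate the small-partition point, here is a sketch that creates ext4 images in plain files on the host, one with the default journal and one without. It assumes e2fsprogs is installed; the sizes are the ones from the talk.

```shell
# 8 MiB image: large enough for ext4 with a journal.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1M count=8 2>/dev/null
mkfs.ext4 -q -F "$img"            # -F: allow operating on a regular file

# Very small image: drop the journal and limit the inode count.
small=$(mktemp)
dd if=/dev/zero of="$small" bs=1K count=256 2>/dev/null
mkfs.ext4 -q -F -O ^has_journal -N 32 "$small"
```

The same images can then be loop-mounted for inspection; on a target you would of course run mkfs.ext4 on the real partition instead of a file.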
It's like a dream file system for storage experts, I guess, because they can do whatever they want. It has great features: snapshots, compression, encryption. It's a big beast, but I guess you can tweak it to do exactly what you want. And the minimum partition size is 109 megabytes, so still not too big. One that was introduced later is F2FS, initially from people at Samsung. This one is specifically designed for solid-state storage, trying to take advantage of the fact that you have solid-state and not rotating storage, to maximize performance and also increase life expectancy. It actually tries to make most writes sequential, which works best on solid-state storage. You have transparent encryption in F2FS, with various encryption schemes, and support for setting that on a file-by-file basis: you can say, please encrypt this file with this scheme. It uses extended attributes for doing so. And the size limits are not huge: 16 terabytes is the maximum size for the partition, but still acceptable, I guess, and the maximum file size is about four terabytes, which is still big. Another one that's not so popular is NILFS, also known as NILFS2, introduced by some folks at NTT. It's maintained, people keep working on it; maybe not as actively as F2FS, for example, but it's still active and alive. This one treats the storage as a circular buffer, which is quite different from the others: new blocks are always written at the end. The nice thing is that you don't lose anything, you get automatic snapshots. You can roll back to any previous state of the file system if you want, so you can easily restore files that were deleted recently enough. That's fine, but there was a weird behavior when I tried it for the first time.
You can actually fill up the space if you don't clean up: because of the continuous snapshotting, even if you remove files, they're not really removed, because they're still in the history. So what you need is a daemon called nilfs_cleanerd that takes care of garbage collecting, reclaiming the space corresponding to removed files. So if you're using this one and you do a lot of reading and writing, it's good to run nilfs_cleanerd, and before you unmount, you have to stop that daemon as well. That's a bit unusual for a file system, at least from my perspective. It's supposed to be great at latency, minimizing seek time, and the best at handling many small files. You create your file system with mkfs.nilfs2, nothing special here. Now let's talk about the read-only file systems. There's SquashFS, which has been around for quite a while: in mainline since 2009, and available as separate patches even before that. It's actively maintained. It's fine for parts of the file system which are read-only: if you update your root file system in one shot, flashing the whole thing at once, you can have the file system completely read-only, if you don't want to make file-by-file updates. It's used a lot in live distributions on USB and CD. It supports several compression algorithms, as you can see here, and this one is supposed to give priority to compression ratio over read performance. It works well even for small partitions, like a 120K partition maybe, whatever works. You create the file system image on your PC with mksquashfs. There's no point in creating the file system on the target machine: you could create an empty partition, but since it's read-only, you couldn't add files to it, so oops. That's why you create it from an existing directory on the host machine. EROFS is the new kid on the block, in mainline since 2019.
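As a sketch, building a SquashFS image from a directory on the host looks like this. It assumes squashfs-tools is installed; the default gzip compressor is used, and the LZO choice discussed later in the talk is shown commented out, since not every build of the tools includes it.

```shell
# Build a compressed, read-only image from an existing directory tree.
tree=$(mktemp -d)
echo "hello" > "$tree/motd"       # stand-in content for the example

mksquashfs "$tree" /tmp/rootfs.squashfs -noappend > /dev/null
# With an explicit compressor, e.g. the LZO trade-off from the talk:
# mksquashfs "$tree" /tmp/rootfs.squashfs -noappend -comp lzo
```

The resulting image is what you would then write to the read-only partition on the target.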
Maybe it was in staging before, but now it's in the mainline tree, developed by people at Huawei, and used in their phones. This one gives priority to read performance over compression ratio. EROFS works by putting the compressed data into fixed 4 KiB blocks, so the blocks always have the same size, the optimal size for the storage, and there's no waste of space. That's unlike SquashFS, which takes a fixed-size block of data first and then compresses it into a chunk; thanks to compression, you don't know what size you're going to get, so you have variable-size compressed blocks with SquashFS. With EROFS, apparently, you also get random access to files and directories: you don't have to search sequentially for a file in a directory, which is better. It's also suitable for small partitions, and you run mkfs.erofs on your host machine as well, from a given directory. Some file systems we didn't test: ext2, because it will be obsolete in 15 years. JFS is supported, but legacy: nobody really uses it much, and the tools are not available everywhere. ReiserFS lacks support, and it's going away in a few years, according to discussions I've read. cramfs is supported, but legacy: who needs cramfs when you have SquashFS and EROFS? And bcachefs is new, but it's not in mainline yet, so please wait a few weeks or months, and invite me again. So let's first do some raw benchmarks. First, let me show you how an SD card is organized. Basically, an SD card is a microcontroller, a bit of RAM, and all this is controlling NAND flash. In NAND flash, what you have is erase blocks that contain multiple pages. The page is the unit for reading and writing, and when you want to write a page a second time, you first have to erase the complete erase block.
However, in SD cards, the erase unit is actually bigger than that, because you have limited RAM in the controller, so it's easier to erase multiple erase blocks at the same time. This is what we call a segment: a group of erase blocks that are handled as a unit by the firmware on the MCU. The typical size for this is 4 MiB, as we will see, and you have hundreds or thousands of those. So I took a few SD cards that I had; I wanted to find the best one for the benchmarks coming after that. The first step was to find the page size for those SD cards. It's pretty easy to do: I was writing a 64 MiB image, if I remember correctly, to the storage, in different block sizes with dd, starting with 512 bytes, then the next powers of two, 4 KiB up to 1 MiB, and even 64 MiB in one shot, a single write. What I found in most of the tests is that you get optimal performance when you write one page at a time, a page being 4 KiB on this card. If you look at the other cards, sometimes 1 KiB is the best, but with 4 KiB you always get the best performance. I guess they have optimizations for writes smaller than a page, like 1 KiB, for example. What I'm learning from that is that when you want to write an image to an SD card, don't use the default dd command, because it's going to use 512 bytes, the default block size for dd, and the performance is going to be three or four times worse than if you used 4 KiB as the block size. In the past, I was naive and thought that the bigger the block size, the faster it would be. That's actually wrong, according to those results: 4 KiB is enough. And it actually makes sense: you just write things sequentially, and as long as the blocks are big enough, like 4 KiB, you sequentially write those.
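The block-size sweep described above can be sketched like this. It is written against a plain temp file so it is safe to run anywhere; on real hardware you would point of= at the SD card device instead, which destroys its contents.

```shell
# Write 8 MiB with different dd block sizes and time each run.
out=$(mktemp)

timed_write() {  # timed_write BS COUNT
    start=$(date +%s%N)
    dd if=/dev/zero of="$out" bs="$1" count="$2" conv=fsync 2>/dev/null
    end=$(date +%s%N)
    echo "bs=$1: $(( (end - start) / 1000000 )) ms"
}

timed_write 512     16384   # 8 MiB in 512-byte writes (dd's default bs)
timed_write 4096    2048    # 8 MiB in 4 KiB writes: the page-size sweet spot
timed_write 1048576 8       # 8 MiB in 1 MiB writes
rm -f "$out"
```

conv=fsync makes dd flush to the device before exiting, so the timing reflects actual writes rather than data sitting in the page cache.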
And SD cards are optimized for this kind of scenario. So 4 KiB is the block size. Ultimately, the best card in terms of performance was this SD card from SanDisk, so that's the one I kept. That's the first time, by the way, that I scanned an SD card with a scanner, because with my phone camera I couldn't do it, it's too small. The next thing you want to do is infer the segment size. There's a tool for that, flashbench, from Arnd Bergmann, who wrote it originally around 2011, and there's a nice article on LWN called "Optimizing Linux with cheap flash drives". The idea is to find the segment size. You take some powers of two; you assume that the segment is maybe four megabytes, like here. Then you read right before the boundary, a few bytes away from it, you read on the boundary, and across the boundary, and you're supposed to see a change in performance, because the hardware has to switch from one segment to another one, so there's an extra cost for switching. And actually, in this case, when you do experiments with very small chunk sizes, the performance stays the same, and then it jumps at four megabytes all of a sudden. That was the intent of this tool, to help detect the segment size, and here it's obviously four megabytes. You read before, read after, and if you read across the boundary, it costs more because of the switch. So that's the tool you can use; it's packaged by distributions, so just install it, or compile it with Yocto. I made some tests here; it's a bit less obvious. On some SD cards it was easy with flashbench, you obviously had one result, and with other ones it was harder to interpret, but at least four megabytes turned out to be a good candidate.
It could have been two megabytes, maybe, because of the performance change, but using four megabytes instead of two never hurts: it's a close multiple of the segment size, and that's what matters, especially if it impacts the layout of the file system. OK, I made some raw read tests as well on SD cards, and you see the most spectacular case here: you get a real jump in performance when you start reading something like one megabyte at a time, or you can simplify and assume it's one segment at a time. So when you dump from an SD card with dd, it's good to use a block size of one megabyte or so; 1 MiB or 4 MiB is fine. So at last, the file system benchmarks. The strategy here is to compare ext4 with ext4, trying to find the best options for each file system, and then let the file systems compete with each other. With ext4, there's nothing special I could find: it's designed for block storage, and I didn't find anything to help with solid-state storage. The default block size is 4 KiB, which is good according to the tests we've made. And then you have profiles: if you run mkfs.ext4, it picks a profile defined in a configuration file according to the size of your partition or your image, and this is kind of optimal in terms of number of inodes, block size, et cetera. For XFS versus XFS, I didn't find anything, but XFS uses a 4 KiB block size, so that's good. Btrfs versus Btrfs: there are more options here, so it's interesting. You have a 4 KiB block size, so that's good. And mkfs.btrfs actually detects when you're creating the file system on an SSD, which is nice. In the past, when you used mkfs.btrfs on a partition which was solid-state, it created the file system with an option called "-m single", meaning metadata is not duplicated on the physical storage for extra reliability, as opposed to "-m dup", which writes metadata twice.
However, in recent versions of mkfs.btrfs, they decided to switch back to always duplicating the metadata, because the information I told you about, whether a device is rotational or not, is not trustworthy, so they prefer to be on the safe side. There's a document here that explains what this does. I made some tests, and indeed "-m single" has some impact on the tests I'm going to use when comparing file systems: mostly on write time, of course, as you can expect, and it reduces used space a little bit as well. It's good for performance, so we keep this option: Btrfs with "mkfs.btrfs -m single". You also have SSD mount options in Btrfs: "ssd", "ssd_spread", and "nossd". By default, it's "ssd" when the storage device is detected as an SSD, but for something like a USB key you could turn it on explicitly if you want. I compared "ssd" and "nossd", and I didn't see a noticeable change, so I'm not keeping it. I also tried "ssd_spread", which according to Arch Linux was a good option to use, but I got "no space left on device" errors; something weird was happening, so I couldn't really use that one. You have compression options in Btrfs, as I told you. That's at mount time, for new files I guess. You can use no compression, zlib, LZO, or zstd. If you look here, these are the tests with the various compressors, the blue one being the one without compression. Compression helped a little bit with read time, like a little bit here, but it significantly hurts the write time, even though it's acceptable with LZO, which is fast, not compressing very well, but fast. So ultimately, at least on my BeagleBone Black, with its one gigahertz CPU, it wasn't worth it to use compression here.
By the way, compression is smart anyway: it gives up if a file turns out not to be compressible. It tries, sees "ooh, I'm not getting much", drops it, and continues uncompressed. That's actually quite smart. If you really want compression, you could use LZO, because it's the best trade-off between performance and compression ratio, and if the compression ratio is really key for you, more than performance, use zlib according to our tests, not zstd. All this depends on what kind of file system activity you have on your systems. F2FS versus F2FS: this was a fun one as well. I struggled a bit to find out how to turn on compression in F2FS. You actually have to create the file system with the compression option, those two options, including extra attributes, and then you need to mount the file system with some mount options. With "compress_extension=*" you ask for all files to be compressed, because you could have only files of a given type compressed, or you could say, please compress this file, with attributes. And then you choose the compression algorithm here. Well, compression didn't provide any sizable benefits here: at least on my CPU and storage, it was almost always worse than without compression, except in the video write test, sequential, hard-to-compress data; I don't know exactly why. So it's not worth it, at least in my case. I also tried other options that were recommended by Arch Linux, like lazytime and garbage collection options, but they didn't help with write performance, except marginally in some cases. So, to be on the safe side, I eventually kept the default options for F2FS. The last one was NILFS2; I didn't find any mount option that was supposed to help.
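As a reference sketch, turning on the compression discussed above looks roughly like this. Device names and mount points are assumptions, these commands need root, and the F2FS file system must have been created with the compression feature enabled.

```shell
# Btrfs: compression is a mount option, applied to newly written files.
mount -o compress=lzo /dev/mmcblk0p3 /data

# F2FS: enable the feature at mkfs time, then request it at mount time.
mkfs.f2fs -O compression,extra_attr /dev/mmcblk0p2
mount -t f2fs -o compress_algorithm=lz4,compress_extension='*' /dev/mmcblk0p2 /data
```

For F2FS, compress_extension='*' asks for all files to be compressed; you could instead list specific extensions, or mark individual files via attributes, as described above.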
You have all the mount options here, and you can specify a block size when you create the file system with mkfs; the default values are good, 4 KiB for the block size and a segment size of 8 MiB. That's double the segment size we have on our SD card, but that should be fine, as we saw. Comparing SquashFS versus SquashFS: the idea was to prepare an image with the Raspberry Pi OS Lite file system, about 1.2 gigabytes with ARM binaries, so lots of small files, which can be compressed pretty efficiently. Then I compared the size, that's in blue, and the performance, that's in red. The best result I got was with LZO: a little faster than LZ4, not compressing as well, but a good trade-off in my opinion, the best performance. I tried other options, but they didn't help. There's an option for direct decompression to the page cache; I expected that to help, but it didn't, in my case. By the way, with SquashFS you also have options for multi-threaded decompression; having a single-core CPU, at least in these tests, I didn't try those, but they can probably help as well. Comparing EROFS versus EROFS, with the same file system contents: the best one was the compression option called MicroLZMA, with the maximum compression level. That's the one that gives you the best performance and size, so that's the one I chose. I also tried the big pcluster option. Here's a zoom-in, because otherwise you couldn't see: the best one is this one, effectively, with MicroLZMA and compression level 12. There's just a penalty on the machine creating the image, but that's not the embedded system, so we don't care. Now the real comparisons. If you compare the module sizes of those file systems, the loser is Btrfs: two megabytes of module size for ARM 32-bit, with dependencies, so woo.
So it's meaningful in terms of boot time. The module loading time is, of course, biggest with Btrfs, though it's not exactly linear with the module size. And then the mount time, once the module is loaded and you mount a partition: the worst case is NILFS2, and the second worst are F2FS and XFS, but those are acceptable, less than 0.2 seconds is fine. The big picture is module loading time plus mount time: that's what you get when you mount a root file system with each of these, so that's the boot time impact. The worst one is Btrfs indeed, so that's not a good choice for boot time. You could use Btrfs in another partition, where you store data typically, but as a root file system I wouldn't recommend it. In terms of used space, no surprise, SquashFS was the best one; you just run du and it tells you. SquashFS is the best one, EROFS is second. So if size matters and you can have a read-only file system, it's a good choice. Read time now. Here I'm reading the contents of the file system, the Raspberry Pi OS Lite one: I create a tar archive from the contents of the mounted file system and write it to /dev/null, so there's no write; it's an easy way to read all the files. The winner is F2FS among the read-write file systems, but the best one overall, as expected, is EROFS, which gives priority to read performance. The next one, of course, is SquashFS. The read-only file systems are still compressed and can pack the data very efficiently, so they win over read-write file systems, which have to keep space for future writes. Write time now: it's funny, because here it's kind of linear.
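The tar-to-/dev/null read test above can be sketched as a small helper; the mount point in the usage comment is an assumption.

```shell
# read_all MOUNTPOINT: read every file under a mounted file system by
# archiving it to /dev/null, and report how long the traversal took.
read_all() {
    mnt=$1
    start=$(date +%s%N)
    tar cf /dev/null -C "$mnt" .
    end=$(date +%s%N)
    echo "read $mnt in $(( (end - start) / 1000000 )) ms"
}

# Typical use (mount point is an assumption):
# read_all /mnt/test
```

For meaningful numbers on a target, drop the page cache first (echo 3 > /proc/sys/vm/drop_caches, as root), otherwise you measure RAM, not the storage.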
So the best write time, and here I'm copying an old Debian root file system, about 320 MB, five times to the partition: the best one is actually ext4. That's a surprise, but you can see it there: Linux 6.3, BeagleBone Black on ARM. Then I continue the test: I remove one directory out of five and copy the oldest one remaining to a new one, so it's erasing, reading, and writing, right? The best one, surprisingly, is NILFS2, and F2FS is close. From these tests, I'll try to highlight a few lessons learned, at least for my use case. The removal time is quite long; it's always weird how much time you need to remove a directory in Linux, probably because it needs to be done atomically, so it's a complex operation. The best one was by far NILFS2, so if your system is mostly about removing files, that's the choice. Then sequential write time: I'm copying a video, Big Buck Bunny of course, five times to the storage. I copy it to a ramfs first, so that I don't have any penalty reading it. Copying five times, and that's what you get: XFS was the best one. So it's difficult, because the winners are different every time. And sequential read time: here nothing seems weird, except NILFS2, which is slower; this is reading the video five times. And that's the big picture with the various tests I made. So I prepared a summary from all this. Ultimately, ext4 is still kind of the best default choice, the one you would pick before you make your own tests. ext4 is a trustworthy one, actually: it gets good results in most cases, sometimes the best results. So it's still a good default choice. That's interesting, at least in this case: a small, limited board with a single-core CPU. Then you see NILFS2 can have the best results in some cases, but it can also have the worst ones.
So it really depends on what kind of file system activity you have. Then you have XFS, Btrfs, and F2FS, which are close to each other. Let me summarize in the next slide. And of course, the read-only file systems do great: they just read, which is easier, so they are doing the best job there. Things to remember from those tests: ext2 is going away. EROFS is the fastest read-only file system, as expected. SquashFS is the best one for minimizing space, with still very good read performance. A good default choice is ext4: until you make your own tests, you can pick ext4 and get good results. F2FS, in my mind, seems to be the best second choice. Then you have Btrfs, which turns out to be bulky and complicated, but powerful; at least if you use it with "-m single", it's also a solid choice, except for boot time. And if you want to tune things, you have many more options, so it can be interesting; but you have lots of options in F2FS as well. And XFS is also a pretty good choice: easy to use and not too bulky, about the same module size as ext4. Right. Another thing is that compression doesn't seem to help with performance, at least when you have a single-core CPU. And as I said, NILFS2 can give you great results sometimes, but also sometimes the worst ones, so just test on your own system. Always try with your own hardware and your own applications: run your system tests and see which one behaves best. It's easy to switch file systems: just format the partition differently, adjust the mkfs options and the mount options, and that's it, it's transparent for applications. Some more lessons learned: when you flash an SD card with dd, the page size is the block size, the bs parameter of dd, that will give you the best results for copying. Use this and you will divide the copy time by four.
Don't exceed one megabyte; that doesn't help, it even slightly degrades performance as a block size. If you read from an SD card with dd, the segment size is one of the best block sizes for reading, so read with bs=1M or 4M if you want; bigger won't help. By the way, I tried "flashbench -f", which is supposed to find a special segment that's optimized for FAT: SD card manufacturers like to do that, so that when you benchmark with FAT and always write to the same location, it's efficient. But at least on the SD cards I had, I didn't find anything noticeable. Weird. And anyway, if your idea was to put the journal where the FAT area is, the journal being something you access very frequently as well: with ext4 and others it won't fit anyway, it's too big, like tens of megabytes for a big enough partition. So too big, sorry. But what you could do with the journal to improve performance is store it elsewhere: with these file systems, you can choose to have the journal outside of the block storage that you're using to store files. If you have very fast non-volatile memory or something like that, you could store the journal there. That's a possibility too. Of course, I ran out of time. I needed time to automate my tests, and it took time to resurrect the scripts I used 13 years ago. So I didn't have time to run the tests on ARM64 with a faster CPU and multiple cores; that's something I need to do next time. I haven't tested real random writes, like modifying random locations in random files, and random reads as well. And I only made tests on one SD card, not on eMMC, which could give different results. And I haven't tried USB, NVMe, SATA, et cetera yet. So it's one particular use case for embedded devices, but hopefully the mkfs and mount options I found can help in your projects.
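For the record, the external-journal setup mentioned above looks like this with e2fsprogs; both device names are assumptions, and the journal device must be formatted first.

```shell
# Format the fast device as a dedicated ext4 journal device...
mke2fs -O journal_dev /dev/fastjournal
# ...then create the main file system pointing its journal at it.
mkfs.ext4 -J device=/dev/fastjournal /dev/mmcblk0p2
```

XFS has a similar facility (an external log device), so the same idea applies beyond ext4.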
For resources, you have a great talk, quite old now, from Peter Chubb at linux.conf.au about SD cards, how they are organized and how they perform. And there's a nice write-up from Richard as well on EROFS versus SquashFS; is he here, maybe? Yeah, hi Richard. Thanks for the nice write-up. And now we maybe have a little bit of time for questions, not so much, I'm sorry. Oh yes, is there a microphone somewhere? No, go ahead, I'll repeat the question. [Audience question about power loss, partly inaudible.] Ah yes, so the question is about the reliability of these file systems in case you have a power loss? I haven't tried. I just know that on the SD card, you have a power pin that is longer than the signal pins, so when you remove the SD card, the card stays powered a little longer and should have a little time to complete the operations on the flash. But I don't know, I haven't tortured my SD cards to know how they would survive. But normally, with journaled file systems, it's quite robust. Not always, maybe you have experience. Yes. Okay, so we can talk after, I'll be around. Thank you. Thank you. Thank you.