First of all, I'd like to thank you all for coming, and I'd really like to thank the FOSDEM organizers. This is actually my first time presenting at FOSDEM, first time I've been at FOSDEM, and it's been a lot of fun. I really enjoyed myself, so thank you all for coming, and thank the FOSDEM organizers for inviting me. Please bear with me; this is a presentation I had to create on the fly, because my primary laptop got stolen at the Brussels train station. I had this cool demo I was going to show, showing how quick fsck was on a file system that I'd been using since July, except it got stolen. Wikitravel has all of this stuff about things you have to be careful about: pickpockets work in teams, they'll distract you, grab your laptop bag. I'm here to tell you it's all true. So yeah, I learned that the hard way. Fortunately I happened to be carrying a backup netbook that I was planning on using for crash-and-burn testing. I was actually planning on getting some development work done over the weekend, which didn't happen, but that's okay. So why don't we get started. First let me talk a little bit about some of the good things about the ext3 file system. It's probably the most widely used file system on Linux, and its code has therefore been around for a long time; people trust it, it's been well shaken down. Also very, very important, and this is probably one of those things that a number of file system efforts before didn't really pick up on: ext3 has an extremely diverse development community. We have developers from Red Hat, from Cluster File Systems (which has since been purchased by Sun), from SUSE, from IBM. That's really, really useful because it means, number one, you don't have to worry about what happens if one company decides it's time to cut back on its kernel development budget.
But it's also important because if a distribution wants to support a file system, they need to feel comfortable that they have people who understand it well enough to actually help their customers when those customers have problems with it. Historically, until very recently Red Hat did not have an XFS engineer on staff, and I don't believe it was a coincidence that Red Hat didn't support XFS. It's only recently that Eric Sandeen, a former XFS developer who's now also helping us out with ext4, joined Red Hat, and now they're going to be including XFS support; my understanding is at least in preview form in a RHEL update, and then they'll support it fully in the future. Again, it points out the fact that if a distribution doesn't have the developers, it's really not surprising that the distribution is going to be really hesitant about supporting something as critical as a file system. Another example of that would be IBM's JFS, which is a very good file system, and at the time it was introduced there were ways in which it was in fact far better than ext3. There was only one problem: almost the entire development team was at IBM. Red Hat and SUSE didn't have any engineers who were really familiar with JFS, and surprise, surprise, they were really hesitant about supporting it, and that's been a really big deal. So one of the things that I've told the btrfs folks, and I'm a big supporter of btrfs, I really believe it's going to be a great file system, although people who think it's going to be ready in the short term are probably a little too over-enthusiastic, and I'll talk more about that later, but one of the things I told them is: you've got to recruit people from across the Linux industry if you want to be successful, because those are just the realities of development. So there are also a couple of things that are not so good about ext3.
There are a lot of silly limitations; perhaps the most stupid one is the fact that we can only have 32,000 sub-directories. (Actually, that should be 32,000, not 32,768.) We have second-resolution timestamps, which is a bit of an issue given that computers are kind of fast now and you can compile a file in well under a second, which is a problem if you're using make, which tracks dependencies based on timestamps. And the 16 terabyte file system size limit is starting to be a real limitation. Perhaps the biggest problem with ext3 has been its performance limitations. Now, some of that has been deliberate. With ext3, we've always taken the position that we care a whole lot more about making sure the data is safe than about being fast, because people get really cranky when they lose data; that's probably the simplest way I can put it. It's one thing to win the benchmark wars, but first of all, many, many workloads are not even file system bound, right? So if you have a super fast benchmark result but in real life you're actually CPU bound, then it may not really matter, and if you lose your source tree, people get cranky. So we've historically been very, very conservative with ext3. But over time that started to become a real limitation, and so it was time to add new features and make ext3 into what I would call a modern file system. Now, this brings up an interesting philosophical question, which is: is ext4 really a new file system? It's certainly a new file system subdirectory; if you look in the kernel sources under fs/ext4 you will see a complete copy of the source code. This is source code that was forked in 2.6.19. But it's important to remember that the "ext" in ext2, ext3, ext4 stands for "extended". And in fact ext4, as far as the file system format is concerned, is actually a collection of new features that can be individually enabled or disabled.
Some of them are extents, huge_file, dir_nlink, and so on, and if you enable all of these features together you get the full effect of ext4. The ext4 file system driver in the Linux kernel supports all of these new file system features, but you can also take a standard ext3 file system and mount it as ext4, and it will work just fine. In fact, as of 2.6.29, which is the next stable kernel release (we're currently at 2.6.29-rc3), you will actually be able to mount an ext2 file system, which is to say a file system without a journal, using the ext4 file system driver. This was code contributed to us by Google that allows us to mount a file system without a journal with the ext4 code base. ext3 can't do this, simply because way back when Stephen Tweedie was developing ext3, we had forked the code base, and in order to make life simpler he had written the code such that the journal had to be enabled; if the journal was not present and you tried to mount a file system as ext3, ext3 would just refuse the mount. Google, as it turns out, wanted the advanced features of ext4, because extents and all the rest proved to have some really nice performance benefits for them. But Google doesn't believe in journals, because Google has the theory that if the system ever crashes, you wipe the hard drive and you recover from the other two redundant copies. And if you're going to do that, and you don't have to worry about fsck'ing the drive because after a crash you just wipe the disk and copy from a backup, then you don't actually need journal recovery, and they got a little bit of a performance boost by running without the journal. So in 2.6.29 you'll actually be able to enable all of the ext4 features but disable the journal, because Google was interested in running in that fashion.
It also turns out you can run with all the features disabled and mount a standard ext2 file system as ext4. You still get some of the performance benefits that don't depend on the file system format, but it's mostly interesting simply from a flexibility point of view. So why did we even bother to fork the code? Again, that was simply a matter of development stability. ext3 has a huge user base, including Linus Torvalds and Andrew Morton, and they would get cranky if their file systems got destroyed and their source trees were wiped out. We had a lot more opportunity to experiment if we simply forked the code base, because we didn't have to worry that we might accidentally trash some important people's data. At the same time, it also allowed us to do all of our development in the mainline tree, which was a big help. So that's what we did. But from a theoretical point of view, it's not really a new file system; we started with ext3 and added new extensions to it. On the user-space side, we use the same e2fsprogs to support ext2, ext3, and now ext4; you simply have to have a new enough version of e2fsprogs. So, is it a new file system? Is it not? It really depends on your point of view. It is definitely a new file system code base. I'm not even sure you could call it a new implementation; it's simply a more advanced version with new features. You pays your money, you takes your choice. So what's new in ext4? There are a huge number of new features. Probably the biggest one is extents, and then the changes to the block allocator, and I'll talk more about those in just a moment. Some of the other features we added simply address problems I've already alluded to; for example, we removed that really stupid 32,000 sub-directory limitation.
NFSv4 has a requirement for a 64-bit unique version ID that gets bumped whenever a file is changed in any way. They need that specifically so that they can do reliable caching. I'm not an NFSv4 expert; I think they have some kludges that make the caching less efficient if you don't have that feature, but the NFSv4 people really wanted it, so while we were in there we added the NFSv4 version ID. We also made a change in the category of stupid-simple fixes that were really easy to do once you're actually going to open up the code: we now store the size of the file, the i_blocks count, in units of the file system block size, as opposed to the POSIX-mandated 512-byte sector size, which is a mistake perpetrated by System V Unix. That gave us a very painless way of expanding the maximum size of a file from 2 terabytes to 16 terabytes if you're using a 4K block size file system. And if you're on an Itanium or POWER system and you use an even bigger block size, such as 16K or 32K, you get another couple of powers of two out of that. That was something very, very easy to do; we'll talk a little later about why we didn't expand it even further. It's something we could do, just something we didn't do this time around. We added ATA TRIM support. This is support that will be showing up in some of the new big storage subsystems that do something called thin provisioning, where you've got a large number of block devices that might not be completely full, and when you delete a file, the file system can tell the block device, "we're not using these blocks anymore, so you can use them for something else." It's also useful for solid state disks, for the same reason: they can do a better job of wear leveling if the file system can inform the solid state disk that these blocks are no longer in use. The support is in the file system.
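Just to make that arithmetic concrete, here's a quick sketch (plain Python, nothing from the actual kernel code) of why changing the unit that the 32-bit block count is measured in moves the file size limit:

```python
# Sketch: the maximum file size representable by a 32-bit block counter
# depends on the unit the counter is measured in.
def max_file_size(unit_bytes):
    return (2 ** 32) * unit_bytes

TiB = 2 ** 40
print(max_file_size(512) // TiB)     # 512-byte sectors (ext3):        2 TiB
print(max_file_size(4096) // TiB)    # 4K file system blocks (ext4):  16 TiB
print(max_file_size(32768) // TiB)   # 32K blocks (Itanium/POWER):   128 TiB
```

Each doubling of the block size buys one more power of two, which is where the "another couple of powers of two" comes from.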
The low-level code to actually send the TRIM commands to the devices has not hit mainline yet; my understanding is that the people in charge of writing that bit of code don't have hardware that actually implements the feature yet, or something like that. As far as I'm concerned, I have all the file system code hooked up; it's just not talking to any real devices yet. But it's something that I've been told is going to be a really big deal for solid state disks and for thin provisioning, so we have it in there already. Another thing we've added is checksums on certain bits of the metadata, specifically the journal and the block group descriptors. That's allowed us to reliably record in the block group descriptors what part of the inode table is actually in use and what part is not, which has allowed us to speed up fsck. So there have been a lot of little tiny improvements that we made while the patient was opened up for surgery, as it were, but probably the biggest one as far as ext4 is concerned is extents and, related to extents, the block allocator. So let's dive into that. How many of you are familiar with the indirect block map scheme that ext2 and ext3 use? Okay, some people are, some people might not be, so here's a quick review. Inside the inode, ext2 and ext3 have room for 15 block pointers: 15 32-bit block numbers. The first 12, slots 0 through 11 in the i_data array, are used to map direct blocks, so if your file is at most 12 blocks long (with a 4K block size file system, that's 48K), the locations of all of those blocks can be stored in the inode and you don't have to do anything else; it's all there, and it's just mapped. So in this example, the first 12 blocks of the file are located at block numbers 200 through 211.
Now, if the file is any bigger than that, there's no more room for direct blocks inside the inode, so we allocate an indirect block (in the diagram, the gray box in the second column) and we put a pointer to it in slot number 12 of the i_data array. If we're using a 4K block file system, an indirect block has room for 1,024 block numbers, and that maps the next range of blocks. If it turns out that that's not enough room, we insert a double indirect block (the light blue one), pointed to by slot 13, and the double indirect block points to up to 1,024 indirect blocks, each of which contains 1,024 block pointers for the file. Finally, if that's still not enough, slot 14 has room for a triple indirect block: the triple indirect block has 1,024 slots, each one pointing at a double indirect block; each of those has 1,024 slots pointing at indirect blocks, which then map the file's blocks. And 1,024 times 1,024 times 1,024 times a 4K block size is a really big number. So that's how indirect blocks work. Now, it turns out this scheme is incredibly inefficient for really large files.
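The addressing limits of this scheme are easy to work out. Here's a small sketch (assuming a 4K block size, so each map block holds 1,024 32-bit block numbers) of how many blocks each level of the tree contributes:

```python
def max_mappable_blocks(block_size=4096):
    """Total blocks addressable by the ext2/ext3 indirect map scheme."""
    ptrs = block_size // 4       # 32-bit block numbers per map block
    direct = 12                  # i_data slots 0-11
    single = ptrs                # slot 12: one indirect block
    double = ptrs ** 2           # slot 13: double indirect
    triple = ptrs ** 3           # slot 14: triple indirect
    return direct + single + double + triple

blocks = max_mappable_blocks()
print(blocks)                    # 1,074,791,436 blocks
print(blocks * 4096 // 2 ** 40)  # roughly 4 TiB of addressable data
```

Note this is just the map's reach; as mentioned above, the 32-bit i_blocks counter caps ext3 files at 2 TiB before the addressing scheme itself runs out.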
If you're using ext3 and you ever have to delete a huge ISO image, and it really doesn't matter whether it's a CD or a DVD ISO image (the DVD image will just take even longer), it will take a long time to delete. The reason is that the file system has to read all of those indirect blocks and then free the block pointers in the indirect blocks, the double indirect blocks, and the triple indirect blocks, and that takes a very, very long time. It's especially inefficient when you consider that the file system is actually going to fairly great lengths to keep files contiguous, so most of the time what you actually see in these indirect blocks are increasing sequences of block numbers: 200, 201, 202, 203, 204, 205. That's not a very efficient way of storing that kind of information. A much more efficient way is to use something called an extent, and an extent is just simply a way of saying: we're going to start at logical block zero, logical block zero is located at physical block 200, and that continues for 1,000 blocks. So if we have 1,000 blocks free on disk and we can allocate them contiguously to the file, then I only need a very small amount of room, room for three integers, to encode what previously would have taken 4,000 bytes to store: 1,000 entries, one for each block. Instead, we simply say that starting at logical block zero and going on for 1,000 blocks, we use the physical range starting at block 200. So this is what the on-disk extent format looks like. The extents work was actually contributed by Cluster File Systems, Andreas Dilger and company. They used ext3 as the back-end storage for their cluster file system, which they called Lustre, and they needed to get better performance out of it, so they enhanced their version of ext3 to have extents, and then they contributed that code back to us.
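To illustrate the encoding win (this is a toy model, not the actual on-disk ext4_extent structure): collapsing a per-block map into (logical start, physical start, length) triples turns the 1,000-block example above into a single extent.

```python
def to_extents(block_map):
    """Collapse [(logical, physical), ...] pairs into
    (logical_start, physical_start, length) extents."""
    extents = []
    for logical, physical in block_map:
        if extents:
            lstart, pstart, length = extents[-1]
            # Extend the current extent if both addresses are contiguous.
            if logical == lstart + length and physical == pstart + length:
                extents[-1] = (lstart, pstart, length + 1)
                continue
        extents.append((logical, physical, 1))
    return extents

# A contiguous 1,000-block file starting at physical block 200:
print(to_extents([(i, 200 + i) for i in range(1000)]))
# -> [(0, 200, 1000)]: three integers instead of 1,000 block numbers
```

A badly fragmented file degenerates to one triple per block, which is exactly the case discussed later where extents can lose to indirect blocks.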
Many, many thanks to the Lustre folks. So we stuck with this particular format, and what this format effectively gives us is 32 bits of logical block numbers and 48 bits of physical block numbers. A lot of people ask us, why didn't you go to 64? And the answer was, well, this is what Lustre was using, and we wanted to stay compatible with Lustre because, after all, they contributed a lot of really good code to us. The thinking was that at some point we would add support for an alternate data structure that would give us 64-bit logical and 64-bit physical block numbers; instead of a 12-byte structure it would probably be a 16-byte structure, and we have a version field in the extent header where we could indicate that this is a new version of the extent structure, and we may very well do that at some point in the future. It turns out that 48 bits is a very large number: you're already up to an exabyte with 48 bits of physical block numbering, and we wanted to get ext4 out there sooner rather than later, so we stuck with the very simple approach. Maybe at some point in the future we'll expand it, but for many, many users an exabyte is more than enough space for what they actually want to do. So this is what the extent map looks like. In the i_data field we store a small header structure, which indicates how deep the tree is and what version of the extent structure we have, and then the extents themselves. We can store a few of those extent structures in the inode body directly, and as it turns out, and I'll show you some numbers in a moment, the vast majority of the files on a fairly standard file system will in fact fit in three extents or fewer.
In fact, on the file system that got stolen on Friday, I'd been running ext4 since July. I think I had done a full backup and restore once or twice during that period, but it had been my primary file system for quite a while: 128 gigs, my primary laptop, running Ubuntu Hardy. I had several hundred thousand files on it, and I had maybe 700 or 800 files that spilled over to a single extent block, that is, a tree with a single leaf block, and that leaf block had room for 129 extents. And then I had a single file whose extent tree was deeper than that: one index node and then a number of leaf nodes. But the vast majority of the files, 99% of them, had all of their extents living inside the inode because they had three or fewer extents, and in fact something like 95% of them were encoded in a single extent, and that was because of changes in the block allocator. So for the more complicated files, and as I mentioned, on my file system, where I had been torturing the file system fairly badly while using it for normal work, I had exactly one file that looked like this: the inode points at an index node, that index node could store up to 129 pointers to leaf nodes, and each leaf node could store up to 129 extents. That one file was in fact a sparse ext3 file system image that I had been using for testing. The short version is that I had to deliberately create files that were deep enough to actually exercise the extent tree code, because in theory this tree can grow to two, three, four levels deep, but in practice you have to really torture the file system to get it to generate that. Most of the time you either have a single leaf node pointed to from the inode body, or, the vast majority of the time, you simply have one, two, maybe three extents in the inode itself, because the file is basically contiguous on disk.
So we can handle sparse files, and we can handle files where the file system gets very badly fragmented. In practice, I've been noticing that ext4's anti-fragmentation algorithms are good enough that, at least for my general workload, fragmentation is fairly rare. Now, I'm sure someone out there will have a workload that proves me wrong, and that's okay; we have an online defragmenter that we're hoping to get done, but it's not in mainline yet. Part of the reason why the code is so fragmentation resistant is the changes to the block allocator, and this is also code that was contributed by the Lustre folks, Alex Tomas and Andreas Dilger. The reason they needed it is that extents work best if files are contiguous. In fact, if your file system is really badly fragmented, so that you have a free block here and a free block there, it takes 12 bytes to encode each extent; so if you have lots of singleton free blocks in your file system, extents can actually be a less efficient way of encoding the block map data than simply using normal indirect blocks. In practice that doesn't happen, and one of the reasons why is the multi-block allocator. The multi-block allocator work, which, again, came from Lustre, gave us two things. Number one, it gave us delayed allocation, which means we don't actually allocate blocks for files until the very last minute, when either the application has explicitly requested the data be flushed out to disk with an fsync call, or the page writeback code has decided that it's time to push dirty pages out to disk. And then the other part is the multi-block allocator itself, which, when it allocates blocks, allocates them based on how much data needs to be written. The previous block allocator allocated a single block at a time, and about the only thing it knew was that the previous block was located at block N, so the first block it would try to find would be block N plus 1.
And if that wasn't free, it would try block N plus 2, and so on; it was actually fairly stupid. The multi-block allocator knows that we're going to be allocating a dozen blocks, or 200 blocks, and it will actually search for enough free space for the requested amount that we need to allocate. That's one of the reasons why most files actually turn out to be contiguous on disk, and it is, by the way, responsible for most of ext4's performance improvements. Now, there is one little gotcha with the delayed allocation code, and that is that what many application writers had gotten used to was ext3's ordered mode semantics. ext3's ordered mode semantics effectively said: before we do a journal commit, we will make sure that any blocks that have been allocated will in fact be written to disk before we slam the inode onto disk and do the commit. This guarantees that you never get stale data, right? Stale data could be a security problem, because it's previously written data that possibly belonged to another user, and you might be exposing that if the system crashes after the inode has been written to disk but before the data has been written. And so the default mode that most people used for ext3 was: force a journal commit every five seconds, and use ordered mode semantics. In practice, what this meant was that if you wrote a file and closed it, within five seconds it was guaranteed to be on disk, and a lot of people just sort of assumed that that was normal. Any system that does delayed allocation will not do this: it will not force the data out to disk, because we haven't even allocated space for the data on disk. So what ext4 would do under these conditions is: you might write a dot file, and some of these application programmers would rewrite a dot file without leaving a backup, so they would truncate it, rewrite the data, and close it.
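Here's a toy sketch of that difference (nothing like the real mballoc code, just the idea): the old allocator probes one block at a time starting from a goal block, while a multi-block allocator searches up front for a free run big enough for the whole request.

```python
def alloc_block_at_a_time(free, goal, count):
    """Old style: grab the next free block, one at a time, from `goal` on.
    A single in-use block in the way leaves the file fragmented."""
    got, b = [], goal
    while len(got) < count and b < len(free):
        if free[b]:
            free[b] = False
            got.append(b)
        b += 1
    return got

def alloc_multiblock(free, count):
    """New style: find a contiguous run of `count` free blocks up front."""
    run = 0
    for b, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == count:
            start = b - count + 1
            for i in range(start, b + 1):
                free[i] = False
            return list(range(start, b + 1))
    return None  # no contiguous run found

bitmap = [True] * 16
bitmap[6] = False                                  # one in-use block in the way
print(alloc_block_at_a_time(list(bitmap), 4, 4))   # [4, 5, 7, 8] -- fragmented
print(alloc_multiblock(list(bitmap), 4))           # [0, 1, 2, 3] -- contiguous
```

Delayed allocation is what makes the second strategy possible: by the time the allocator runs, it knows how big the write actually is.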
And if the blocks for the file had not actually been allocated, then no data would actually be pushed out to disk at commit time, and now we have to wait for the page cleaner to decide that it's time to write dirty data back to disk. The default for that is 30 seconds. And the page cleaner doesn't write out all the dirty data at once; it stages it out. So the data will start getting written to disk after 30 seconds have gone by, and if you've dirtied a lot of data blocks, it might take another 30 seconds before everything has been written out, because it doesn't want to overload the system, so it does the writeback in little chunks, spaced out over five-second intervals. If you turn on laptop mode to save your battery, that 30 seconds can get expanded to something like two minutes. So what ends up happening is that it can be a good two to five minutes before data that had been written by an application is actually on disk. So if your system crashes, you can actually lose more data. Now, this is in fact all legit, right? If you look at the POSIX specification, it essentially says that unless you call fsync, all bets are off. It was just simply that many application programmers were used to the behavior of ext3 and didn't bother to call fsync. Now, I say this worried a little bit that everyone will now use fsync a lot, and the reason is that in recent years I have noticed a disturbing tendency by application writers, some of whom may be in this room, to generate hundreds and hundreds of dot files. If I look under .gnome and .kde, I see hundreds and hundreds of individual dot files. Actually, the problem is that they don't contain huge amounts of data: they each contain something like three or four bytes of data, but there are hundreds and hundreds of files.
And if you call fsync on every single one, you will really pound your system with a hammer pretty badly, because it forces a lot of data to disk, and you'll be forcing a journal commit for every single one of those fsyncs. Probably the right answer is to use fdatasync; that's not going to be quite so painful, and it gives you most of the same guarantees. So if you're going to stick with using lots of these little individual dot files, each containing a few bytes, you probably want to use fdatasync. Or you might want to consider using SQLite or some kind of proper database, because it's very clear what's going on here: people decided that the Windows registry was evil, so they would do the anti-Windows thing, and instead you now have hundreds and hundreds of these little tiny files, which isn't such a bright idea either. But be that as it may, I know I cannot influence what application programmers choose to do; all we file system authors can really do is adapt to what the application developer community actually throws at us. So we may end up actually trying to store small files' data directly in the i_data array. The reason we didn't do that many years ago was that nobody was insane enough to use lots and lots of little tiny files; I actually took a look at one point, and there were very few of them, so we never bothered storing data in the i_data array. But these days it looks like a lot of application writers have lots of files that are under 60 bytes, so maybe we have to revisit that decision. All that being said, delayed allocation just sort of exposed this, because I've gotten one or two bug reports of the form: I was using ext4, the crappy NVIDIA driver crashed my system, and I had several hundred zero-length files in .gnome or .kde. I think I got one of each, so this is not a GNOME versus KDE thing; both desktops seem to be doing it.
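The pattern being recommended here, for anyone rewriting small dot files, looks roughly like this (a sketch; the file name is just for illustration):

```python
import os

def replace_file(path, data):
    """Rewrite `path` safely under delayed allocation: write a temp file,
    force its data to disk with fdatasync (cheaper than fsync, since it
    skips flushing non-essential inode metadata), then atomically rename
    it over the original, so a crash leaves either the old contents or
    the new contents, never a zero-length file."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fdatasync(f.fileno())
    os.rename(tmp, path)  # atomic on POSIX file systems

replace_file("settings.conf", b"color=blue\n")  # hypothetical dot file
```

The truncate-rewrite-close pattern criticized above is exactly what this avoids: there is never a moment when the only copy of the data exists solely in the page cache under the final name.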
And my first reaction was, oh my god, why do they have so many of these little tiny files, and why are they rewriting them all the time? Because that's got to be a performance hit right there. I'm not sure I want to know the answer, but if someone from the GNOME or KDE communities wants to tell me why you're apparently constantly rewriting hundreds of these files in the user's home directory, I'll get myself a good stiff drink first and then you can tell me. But it's one of the things we're looking at. One of the things we may end up doing is to have some heuristic where, if the file is small and we notice that it's replacing a file that was truncated or removed, we'll immediately allocate its blocks on close, which is less heavyweight than actually calling fsync. Eric Sandeen tells me XFS had to do something very similar: XFS apparently has a hack where if a file has ever been truncated, it implies an fsync as soon as you close it, and that's because there were so many application writers who got lazy, assumed they could just do that, and then XFS's delayed allocation hit them. So apparently this is not a new problem. Another interesting feature we have is something called persistent preallocation. This allows blocks to be assigned to files without having to initialize them first. The original use of this was for databases and streaming video files: if you know that you're eventually going to fill a gigabyte on disk, because you're going to be recording an hour of video and an hour of compressed video is about a gigabyte, you can tell the file system, please preallocate a gigabyte on disk, and the file system can allocate that space contiguously because it knows exactly how big the file will be. This can also be useful for packaging systems like RPM and dpkg: if you know how big the file is, the file system will be able to do a better job.
If you tell it, please preallocate me the space, then it can preallocate exactly as much space as it needs, and you can reduce fragmentation a little bit by doing that. Another interesting use of this is for files that grow via append: a log file is constantly being appended to, a Unix mail spool file is constantly being appended to, and if you know that's happening, one of the things you can do is simply preallocate space. If you know roughly how big the log file will get, you can preallocate the space, and then the log file will be contiguous on disk because you've preallocated it. Now, you can access this via the glibc posix_fallocate call, but there are two problems with posix_fallocate. Number one, if you happen to be on a file system that doesn't support preallocation, it will do it the old-fashioned way and simply write blocks of zeros, which is very, very slow. There are some cases where, if fallocate doesn't exist, you would rather the call do nothing, and glibc's posix_fallocate doesn't give you that. The other thing about posix_fallocate is that it always changes the i_size field, so if you look at the file using ls -l, it will show that the file is a gigabyte after you've preallocated a gigabyte on disk. If you use the raw Linux system call, you can get a hard failure if the file system doesn't actually support fallocate. More importantly, you can let i_size remain at the original size, and now you've preallocated the space on disk while i_size still shows that the file is zero-length, or whatever the original length was, and now you can do tail -f. tail -f will do the right thing, and as you append to the log file, the file grows into the preallocated space and i_size grows along with it.
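In C you'd call fallocate(2) directly; from a script the contrast can be sketched like this (the ctypes call into libc and the FALLOC_FL_KEEP_SIZE constant are Linux-specific assumptions, and the file name is just for illustration):

```python
import ctypes, os

FALLOC_FL_KEEP_SIZE = 0x01  # from <linux/falloc.h>: don't update i_size

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                           ctypes.c_longlong, ctypes.c_longlong]

fd = os.open("preallocated.log", os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)

# posix_fallocate (glibc wrapper): space is reserved AND the size is
# updated, so ls -l immediately reports the full megabyte.
os.posix_fallocate(fd, 0, 1 << 20)
print(os.fstat(fd).st_size)   # 1048576

# Raw fallocate(2) with KEEP_SIZE: reserve another megabyte of blocks but
# leave i_size alone, so tail -f sees the file grow only as data lands.
# Returns -1 (e.g. EOPNOTSUPP) on file systems without preallocation.
libc.fallocate(fd, FALLOC_FL_KEEP_SIZE, 1 << 20, 1 << 20)
print(os.fstat(fd).st_size)   # still 1048576
os.close(fd)
```

Since 2009, glibc has in fact grown a fallocate() wrapper, so the ctypes detour is only needed in languages without a binding.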
And that can be a very nice feature, which is why we've been pounding on the glibc folks to expose the raw Linux system call: it does a lot more than posix_fallocate().

So let me talk a little bit about performance charts. There's an old line about lies, damned lies, and benchmarks. The first thing I tell people before they believe a benchmark is to ask: is it fair, is it repeatable, and does it fairly represent the workload you actually run? A lot of times people will look at a benchmark and say, this is the file system I want to use, look how great it is, without asking whether that benchmark is even applicable to the kind of work they do. Remember what I said earlier: many workloads aren't disk bound or file system bound at all, which makes the comparison kind of pointless. One really good effort, which you can find at btrfs.boxacle.net, is done by Steven Pratt, a member of IBM's performance team. If people want an example of how to do benchmarking well, take a look at his site: he documents the hardware and software configurations used, and he tests multiple configurations. Here's why that matters.

This chart is large-file creates on RAID storage. Red is ext3; green is the ext4 development version (this is back in October; he has newer results, but I didn't have time to update these particular charts); blue is XFS; hot pink is JFS; and the last three are different, very early versions of btrfs. With one thread, you can see that ext3 in red is kind of low, ext4 is a whole lot better, almost as good as XFS but not quite, and JFS is a little bit lower. You look at this one and say, ooh, okay, that's pretty good, we're almost as good as XFS. This next one is the same test with 16 threads, and now you see that ext4 is still a whole lot better than ext3, but it's nowhere near where XFS is, and btrfs is down at the bottom.
Here's 128 threads. At 128 threads, XFS is way down there and ext4 is way up there. So if I'm going to sell ext4 as the best file system ever, which chart do you think I'm going to use? And that's just large-file creates. If we do large-file random reads, this is one thread, that's 16 threads, there's 128 threads, and you can see that in some cases ext3 is actually better than ext4. I don't know why; I suspect it has to do with changes in the layout algorithms that we can still fix, so there's still some tuning work we may need to do. Large-file random writes: you can see we're way better than ext3, at 16 threads and at 128 threads. Here are sequential reads. The main thing I want to get across is that these bars fluctuate wildly. This is why benchmarks can be highly misleading: if someone only shows you one chart, they're trying to sell you something.

For some reason, on the mail server workload, a mixed read/write workload that tries to simulate a mail server, ext4 does really well. I can't tell you why; it just does. Except at 128 threads, where the machine apparently crashed, and I can't tell you why that happened either. This was also last October. This next set is with a single disk, and one of the interesting things here is that btrfs is now way better than a number of the other file systems as we go through the various benchmarks. You can see that on some of these benchmarks, btrfs does very, very well on a single disk, while not doing so well on RAID. Again, this is last October: the btrfs file format had not been finalized yet, certainly not as of October, and they were still tuning it, so these results are a little bit unfair to them. You can see that here. And then here's the mail server simulation, where ext4 apparently walks all over the competition. But again, workloads matter, right?
So I'm not going to tell you that ext4 is better than all other file systems. On some workloads it does pretty well, and we still need to do some tuning work, but it's always useful to know where you stand.

Okay, here's something kind of interesting, because we didn't actually plan for it: a lot of the improvements we made to general write performance also made a huge difference for fsck on ext4. Looking at it, a lot of that is because we do far fewer indirect block reads compared to extent reads, and uninitialized block groups mean you don't have to scan the entire inode table when the inode table blocks aren't in use. These are results from the 128 GB file system on the laptop that was stolen, taken back in September. The two file systems held identical copies: I'd been using ext4 in production for about two or three months at that point, and I simply copied everything from my file system onto a fresh ext3 file system. So ext3 actually had an advantage over ext4, because it was a fresh, totally defragmented copy, whereas the ext4 file system had two or three months of use on it. And you can see that pass 1 of fsck took 17 seconds on ext4 versus 382 seconds on ext3. Take a look at the number of megabytes read: we went from over 2,300 megabytes down to 233, and that's where a lot of the speedup comes from. We simply need to read fewer blocks on disk, and we do a lot less seeking. We saved most of the time in pass 1 and pass 2. There's not that much difference in the directory reads; in fact, on this run ext3 took less time to read the directories, because they were contiguous thanks to the fresh copy, but ext4 still needed fewer reads because there were no indirect blocks. The net result is that you go from 424 seconds down to 63 seconds.
The general rule of thumb I've found is that a freshly formatted ext4 file system fscks somewhere between six and eight times faster. So take your ext3 fsck time, divide it by seven, and that's roughly what it will be under ext4.

If you want to use ext4, you need e2fsprogs 1.41, and I really recommend going to 1.41.4, because we fixed a whole bunch of ext4-related bugs there. You need at least a 2.6.27 kernel or newer; I strongly recommend 2.6.28 plus the for-stable patch branch. That stuff will hit the stable kernels fairly soon, it just hasn't yet; that was one of the things I was going to work on before my laptop got stolen, but that's okay. There is a 2.6.27 for-stable branch as well, and both of these will be sent off to the stable kernel maintainers soon. And of course you'll need a file system to mount. You can simply use a completely unconverted ext3 file system, and delayed allocation will help you, so you will get somewhat better performance just taking a completely unconverted ext3 file system. Or you can enable features such as extents, huge_file, dir_nlink, extra_isize (sorry, that should be dir_index actually) on a particular file system; if you enable uninit_bg or dir_index, you will have to force an fsck after you actually enable those feature flags. That will get you some of the performance of ext4, but only newly created files will use extents; the old files on the file system will still use the old indirect blocks. Or you can create a completely fresh ext4 file system and then do a dump and restore, and you'll get the best performance from that. It's up to you how you want to do things. If you just want to play around with ext4, you can simply leave your file system unconverted. One warning: at the moment, once you start converting to ext4, we don't have a good way of going back in time and unconverting.
So if you want to get involved, there's an ext4 mailing list, and there's the latest ext4 patch series: I have a git tree and I also have a patch directory, and at this point the git tree is probably the most up to date. We do have an ext4 wiki, at ext4.wiki.kernel.org. It still needs a lot of work; if someone would like to jump in, I would love some help. At the moment it's actually a little embarrassing: kernelnewbies.org's ext4 article is better than what we have on the wiki. So if somebody wants to help me improve the ext4 wiki, I'd really appreciate it. We have a weekly conference call; if you're really interested in diving in deep, contact me about that. And we have an IRC channel. And this is the ext4 development team. I'm probably missing a couple of people, but these are the people who've been working on it for the last couple of years, and they do a lot of hard work. I'm the guy who basically does QA, all the integration work, and then a lot of the user-space utilities. So with that, since I know I ran a bit over time, maybe I have time for one or two questions, and then I'll be happy to stick around and answer some more. Yeah, in the middle there?

Thank you. I was interested to know if there are any good solutions for the fsync problem that you mentioned. I'm involved in laptop-mode things, and the default is actually not two minutes but ten minutes, and we spend a lot of time getting applications to drop all their fsync calls, just because any one of those will spin up your disk, and there's no way to get rid of them.

I think the short version is that fdatasync seems to be a good compromise for now. We are looking into ways of solving the fsync problem, but it's in really tricky code. So we know about it; it's one of those things we'd love to fix, and it's on our hit list. Yeah, it's one of those little embarrassing bits that we really want to try to fix. Thank you. Any other questions?
Maybe I'll take one more.

In sort of a related question: for databases, and presumably for a lot of those other applications doing fsync, they don't necessarily need the sync to happen immediately. They just have to know when it has happened, or that it hasn't happened yet, so they can avoid sending a commit confirmation or telling some other application. So is there any sort of non-blocking fsync?

No. We've heard that one from the database people, but we should probably take it offline. And I know I'm really running over, so I'll be happy to stand in the hallway and take questions from people who are interested, but I don't want to make people late for their next talk. So thank you very much for your attention.