Tux1? Tux2 was actually kind of an anagram of ext2, so there never was a Tux1. And what happened to Tux2, well, that's a long story. I don't think I have quite enough time to go into that right now. If you want to know, you can Google for "evil patents". I believe I am still the number one hit on Google for "evil patents", in quotes.

So here we are. It's nice to be back. I was out of the kernel community for a little while and came back for the purpose of finishing work on Tux3. And as we all know, Linux needs another new file system, right? Everybody loves a new file system. You look forward to it with a mixture of excitement and dread: we get to have a bunch of groovy new features, discover some groovy new bugs, and usually have some wonderful flame wars on the Linux kernel mailing list. And somehow that eventually turns into progress.

So what am I doing writing a file system? I guess I just have to say that every Linux developer feels they have a file system in them, just like a novelist has a novel. Some of us finish our file systems, and they become great, important things deployed to the entire Linux world, which now consists of some 600 million machines. So we've got a rather large market for everything we do, and that means that when we satisfy that market with new technology, we should be awfully, awfully careful about just how we go about it. And I think I can reasonably claim to be pretty careful about that, because this project is really about 15 years old now, if you count from the beginning of Tux2.

OK, let's jump in. Why Tux3? Why even write a new file system? Why am I not implementing the latest Web 2.0 thing, or some cloud thing, where all the action is these days? Well, the fact is that local file systems are really the be-all and end-all of how a computer works, especially how Linux works. They run everything. You won't be having these wonderful new cloud platforms without really solid local file systems underneath. And we always expect a lot: we expect more and more out of the same hardware, just because that's the way Linux is. The local file system turns out to be one of the most important determinants of performance for just about everything we do. It certainly affects the reliability of everything you do: if your data goes away or gets corrupted, that's kind of the end of most projects you might get into. Also flexibility: how flexible is your file system? Can you move your data around from machine to machine? Can you, say, replicate it for backup just by waving your hands? And the answer to that is no. So what we want with the Tux3 file system is to improve that flexibility, especially by letting you just wave your hands and do magical things like having your data off safe in the cloud somewhere. That's not something we can do today. I mean, we can kind of make a duct tape solution with LVM snapshots and other things, but we can't do what we really should be able to do today. Let's do something about that.

But really, did I say why Tux3? I haven't said why Tux3 right now. We've got lots of file systems out there. We've got ext4, still chugging along. Part of that's my fault. No, I'm not sad to say it; I'm actually really proud of it. Fixing up the directory system to make it outperform a lot of the high-tech file systems from other companies and platforms was one of the things that allowed it to continue on to its preeminent position today, deployed on more than 500 million computers.
And then we have Btrfs, our great hope and answer to Sun's ZFS; it seems to be progressing well in that direction. And we've got enterprise file systems: XFS came from way back at SGI and knows a thing or two about scalability. So what is Tux3 doing in here? Well, Tux3 exists because some of us have old school beliefs, old school Unix beliefs, about very specific functionality: each component should do what it does best. We don't want to be a volume manager. We want to be a file system, and we want to be the best file system we possibly can. We want to raise the bar for data safety, for how we keep your data. You know, if you make some changes to a document, save your file, and hit the power, your data really should be there. There's no excuse for it not being there, and that's a test that has not always been met on Linux. There's no reason why we shouldn't be able to do that, and it's one of the things we're trying to do. Performance: the last word in performance has not been written in file systems. I know that because we have advanced the performance bar; otherwise, I'd just be guessing. Robustness and simplicity go hand in hand: if you want robustness, you had better have the simplicity. Hi, Natalie, nice to see you. And I think more than anything, we can sum it up with: we want to advance the state of the art.

So let's go and look at how we've done that, or whether we've done that, with a little bit of history. I'll start basically halfway through Tux3's history and leave out the Tux2 part. Let's start with Zumastor, which Natalie, a former colleague from Google, knows something about. There we developed an enterprise filer that had a requirement of snapshotting and replication. So I developed a thing called ddsnap, which was originally started at Red Hat and continued at Google, to do snapshotting. It was meant to be cluster snapshotting, but it worked perfectly well on a single node. So we used it, put ext3 on top of it, fixed bugs, which took a couple of years, and ended up with a pretty amazing enterprise NAS system. We also had NFS versions 3 and 4 on top of it with Kerberos authentication, and basically everything that that little company NetApp had. That kind of fell off the edge of the world for reasons that aren't completely clear to me, probably because I stopped working on it. But it was brought to a state of proven reliability.

It used a very simple snapshotting algorithm: copy before write. You have data on a disk, and somebody writes to it. The operating system picks up that write, intercepts it, copies the old data out to a snapshot store, and then lets the write proceed. This has the disadvantage that everything has to wait while we copy the data out. There are other ways to do it, called copy-on-write snapshots, which you've heard about: Btrfs, ZFS, and NetApp's WAFL use what I call a shared tree snapshot approach. Tux3 started because I discovered a new algorithm for representing snapshots that can store all the snapshotting information at the leaves of your file system tree instead of sharing file system trees. That allowed us to adopt an old school kind of file system design, where you have an inode table and a bunch of files descending as trees from that inode table. A very, very simple design that became possible because of this new snapshot technology. So we began working on it. I began working on it, coding it all up in user space; that was my idea, to start coding Tux3 in user space. It was pretty cool.
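To make the contrast concrete, here is a minimal sketch of the copy-before-write interception described above. It is illustrative C with hypothetical names (snap_already_copied, snap_alloc and friends are assumptions made up for this sketch, not ddsnap's actual interfaces):

```c
/*
 * Hypothetical sketch of copy-before-write snapshotting, not actual ddsnap
 * code.  Before an intercepted write is allowed to hit block 'block', the
 * old contents are copied out to a snapshot store.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t block_t;

struct volume;           /* origin device: handle, block size, etc. */
struct snapshot_store;   /* exception table: origin block -> snapshot copy */

/* Assumed helpers, named for illustration only */
bool    snap_already_copied(struct snapshot_store *snap, block_t block);
void    read_block(struct volume *vol, block_t block, void *buf);
block_t snap_alloc(struct snapshot_store *snap);
void    write_block_at(struct snapshot_store *snap, block_t where, const void *buf);
void    snap_remember(struct snapshot_store *snap, block_t origin, block_t copy);
void    write_block(struct volume *vol, block_t block, const void *buf);

/* Intercept a write to the origin volume. */
void cbw_write(struct volume *origin, struct snapshot_store *snap,
               block_t block, const void *newdata, char *scratch)
{
    if (!snap_already_copied(snap, block)) {
        /* The incoming write must wait while the old data is copied out. */
        read_block(origin, block, scratch);
        block_t copy = snap_alloc(snap);
        write_block_at(snap, copy, scratch);
        snap_remember(snap, block, copy);
    }
    /* Now the original write may proceed, in place. */
    write_block(origin, block, newdata);
}
```

The stall is visible in the sketch: the incoming write cannot proceed until the old data has been copied out, which is exactly the disadvantage mentioned above, and the reason copy-on-write designs redirect the new data to a fresh location instead.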
Then a fellow from a Nordic country, Tero Roponen, came along and turned that user-space code into a FUSE file system for us, which was pretty cool. He did that in about a day. So we now had two ways of running the Tux3 file system. At some point, I attracted a developer named Hirofumi Ogawa to the project, and he took on the kernel side of things. Originally, I was expecting to do that myself; well, frankly, Hirofumi is better at it than I am, and he just got to work. So we were able to divide those responsibilities and make progress. About the time we were doing that, Btrfs became a big thing. It was a big community thing. Even the ext4 maintainer, Ted, was lining up and saying yes, ext4 has reached the end of its life, Btrfs is the next thing, and so on. So that was good, and I went off and did other things during that time.

After about three years of doing other things, Hirofumi came back to me one day and said, I think there's something good here in Tux3, there's a lot of good stuff, and he wanted to continue it. So we agreed to continue and get back to work on it. Around about Christmas last year, we put together all the pieces up to that point. We had been missing something called the atomic commit, which is your actual reliability. You have no idea what you've got until you actually run it and see that it is reliable; you don't know how fast it is or anything. So, a little more than four years into the project, we put it all together, ran it, and discovered that we were something like six times slower than ext4 doing the same thing. However, then we noticed that our debug code was still on, that we were using a 512 byte block size, and that a bunch of other things weren't configured correctly. So we fixed all those and ran the test again. And lo and behold, we actually beat ext4 on the first benchmark, which was, I think, a run of fsstress that we were using as a kind of benchmark at the time. That really gave us a lot of encouragement. So we got down to the hard work of taking this working prototype and actually making it a file system.

So I'm going to talk a bit about what it is and try to put in your minds an impression of the size and shape of this thing and how it works. First thing: we really tried to use as many tried and true, proven techniques as we possibly could. One of those: Tux3 has an inode table. A lot of new file systems don't have inode tables; they are built more like databases, around some unified concept. Well, Linux is actually really organized around having a separate inode table and separate caches for files. That's the sweet spot of Linux design, and we decided to stick with it. We also used bitmaps, which is considered a little bit archaic. And we put our file names into directory files. So again, old school. We did some modernization too, of course, just like everybody does: every file system has adopted extents and B-trees, and most are using write-anywhere now, moving away from the journal model. And we came up with some genuine new innovations, which is, of course, why I'm here; without those, there would be no reason for Tux3.

Now let's look at those in a little more detail. Traditional elements: this is Benjamin Franklin on the slide. Well, he would have, if he could have gotten his power stable. OK, so Tux3 still has blocks. Cool; it means you can always allocate a block, and the structures that are required to point at blocks are smaller than memory pointers that can point anywhere. We have bitmaps. Go and look at Jeff Bonwick's blog about ZFS.
He will tell you why bitmaps are no good. We looked at that very critically and discovered holes in his arguments. It turns out that bitmaps are still a really good way to keep track of storage. Just a short illustration: one bitmap block maps 32,768 blocks (with 4K blocks), so about one 32,768th of your file system will be devoted to bitmaps. That should be OK. The inode table was a harder decision. It's very tempting to say, well, why should you do two lookups, a directory lookup and then an inode table lookup? But we stuck with it, and in practice it doesn't seem to slow us down much; we've got a lot of very efficient caching code around that. There are actually situations where it can be faster than an all-in-one structure: if you're doing a pure ls or something, just looking at the file names, it's faster to have your file names in a file than mixed up with your actual file attributes. In the tree, there is exactly one pointer to each extent. That's a big thing; that's what the version pointer technology allowed. It let us adopt a design where, in the file system's tree of trees, each extent has exactly one pointer coming to it. That simplifies checking it, simplifies updating it; lots of benefits. It generally simplifies things. And directories are in files.

OK, so we modernized some things, the same things that everybody else modernized. We have extents, and we are a little more compact than some other file systems' extents. We were working on a format that averaged about 12 bytes per extent, and we finally relaxed and went with a simpler format that was easier to debug: 16 bytes per extent. An extent can map a gigabyte, or however much you want, so it can really help reduce metadata size. We use B-trees in a couple of places: the files are indexed with B-trees, and the inode table is not just a flat table the way it is in the ext file systems, but a tree with pointers, running B-tree algorithms. We discovered some things about B-trees when we really got into it that were kind of surprising; it's not common knowledge, but B-trees are actually not a very good structure for updating. I'll get into that if we get some time to talk about the shard map index. We made everything about inodes variable: inodes are variable size, they have a variable number of attributes, every attribute is optional, and attributes can have different sizes. And we actually made that work efficiently. It sounds like it should be costly, but apparently it's not because, as I say, we're performing pretty well. And metadata position is unrestricted: write anywhere. I try to avoid saying it that way because it sounds too much like the Write Anywhere File Layout, WAFL.

So, stuff that is new in Tux3, the stuff that makes this project worth doing. It might have been worth just writing a cleaned-up file system, but that would be a marginal decision considering the amount of effort involved. We came up with this method of doing atomic commit called delta updates, which we think is better than journaling; why would you ever journal, now that this is invented? I'll get into more of what it is later. This allowed us to really raise the bar on consistency. We are doing the equivalent of ext3's data=journal: that is, we're recording all data to disk, and the order it is written in is controlled very precisely. We do that quickly and accurately, at very high speed.
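As a rough sketch of what that precise write ordering can look like, here is the general shape of a delta commit. The structures and helper names are hypothetical, chosen for illustration, not Tux3's actual code:

```c
/*
 * Conceptual sketch of a delta commit with made-up names.  The point is the
 * ordering: everything the delta needs reaches disk before a single small
 * commit record makes the whole delta valid at once.
 */
struct volume;        /* opaque: device handle, block size, etc. */
struct dirty_block;   /* data or metadata block redirected to free space */
struct log_block;     /* logical change records belonging to this delta */

struct delta {
    struct dirty_block *blocks;
    struct log_block   *logs;
    unsigned            seq;     /* delta sequence number */
};

/* Assumed helpers, for illustration only */
void write_dirty_blocks(struct volume *vol, struct dirty_block *blocks);
void write_log_blocks(struct volume *vol, struct log_block *logs);
void flush_write_cache(struct volume *vol);
void write_commit_record(struct volume *vol, unsigned seq);

void commit_delta(struct volume *vol, struct delta *delta)
{
    write_dirty_blocks(vol, delta->blocks); /* 1: new data goes to free space */
    write_log_blocks(vol, delta->logs);     /* 2: log blocks describing metadata edits */
    flush_write_cache(vol);                 /* 3: barrier: everything above is durable */
    write_commit_record(vol, delta->seq);   /* 4: one small write validates the delta */
    /* A crash before step 4 leaves the previous consistent state untouched. */
}
```

Because nothing in steps 1 and 2 overwrites live data, hitting the power at any point leaves either the old consistent state or the new one, never a mix; that is the sense in which this aims to match data journaling's safety at higher speed.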
So we get this seemingly impossible best of both worlds: the strongest possible data consistency and the highest performance. How do we do that? One of the ways is the asynchronous front-end and back-end concept that I first heard of in Matt Dillon's HAMMER file system. The front-end is basically your file system syscalls, the syscalls to read and write and so on; that all happens in cache. The back-end is the thing that updates the file system, and it runs completely asynchronously in Tux3. That means the front-end is never waiting for the back-end to get something written to disk; it just continues. I mean, there are a few cases where it does have to wait, but for the most part it doesn't.

Then there is log unify: we made creative use of logging. It's not like a logging file system; it's kind of a write-anywhere log. Instead of writing out a whole metadata block, like a bitmap or index pointers or whatever, we'll just append a message to the log that says that block should be edited in this way. And we do our delta commits, which are a whole bunch of blocks that have to be written together so that you move from one consistent file system state to the next consistent file system state. We include some log blocks in each one of those deltas, and we do several of those deltas before we actually go and change the file system tree that exists on media; that step we call a unify. So that's log unify. That's another one of our advantages, and it's good for quite a bit of performance.

One really neat thing it does: I think I have roughly one minute to describe how WAFL and Btrfs work. You have a file system tree and you want to change something in it. You want to leave that entire tree alone and have a new tree that points at all the old stuff plus the new piece of data that you wrote. In order to make that happen, you have to find the block that pointed at that piece of data and give it a new location, then find the block that pointed at that one and give it a new location, all the way to the root. That is recursive copy to root. And I think I actually took two minutes to explain it; I'll try to do better next time. With our log unify, we eliminate that. When we write, we are just like those other file systems: a non-destructive update. We never overwrite anything; we find a new place for it and make the metadata point at it. But we will not update the parent that points at that metadata. Instead, we write a log entry that says: change that parent, sometime. We will eventually update the parent in a unify. And we won't update all the parents in the unify either; we just go up one level. So our changes ripple slowly up the tree and eventually produce a new root for the entire file system. Just a neat, efficient way to do it.
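Here is a rough sketch of that idea, with invented names and structures rather than Tux3's real ones: when a block moves, we append a small promise to fix the parent instead of rewriting the whole chain of ancestors.

```c
/*
 * Hypothetical sketch of a logged promise replacing recursive copy-to-root.
 * When a block moves, we log "parent P, slot i, now points at N" instead of
 * rewriting P and all of P's ancestors right away.
 */
#include <stdint.h>

typedef uint64_t block_t;
struct volume;

struct log_redirect {
    block_t  parent;     /* metadata block holding the stale pointer */
    unsigned slot;       /* which pointer inside the parent */
    block_t  newaddr;    /* where the child now lives */
};

/* Assumed helpers, for illustration only */
block_t alloc_block(struct volume *vol);
void    write_block(struct volume *vol, block_t block, const void *data);
void    append_to_log(struct volume *vol, const struct log_redirect *rec);
void    update_pointer_copy_on_write(struct volume *vol, block_t parent,
                                     unsigned slot, block_t newaddr);

/* During a delta: move the child, log the parent edit, touch nothing else. */
void redirect_child(struct volume *vol, block_t parent, unsigned slot,
                    const void *childdata)
{
    block_t newaddr = alloc_block(vol);          /* never overwrite in place */
    write_block(vol, newaddr, childdata);

    struct log_redirect rec = {
        .parent = parent, .slot = slot, .newaddr = newaddr,
    };
    append_to_log(vol, &rec);                    /* a few bytes, not a whole block */
}

/* During a unify: apply the logged edits, going up just one level. */
void unify_apply(struct volume *vol, const struct log_redirect *recs,
                 unsigned count)
{
    for (unsigned i = 0; i < count; i++) {
        /* Rewriting the parent redirects it to a new location too, which
         * logs a new promise for the grandparent: changes ripple up the
         * tree one level per unify instead of all the way to the root. */
        update_pointer_copy_on_write(vol, recs[i].parent, recs[i].slot,
                                     recs[i].newaddr);
    }
}
```

Each unify turns some of those promises into real parent updates, which themselves generate new promises one level up, so the cost of reaching the root is spread over many deltas instead of being paid on every write.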
A couple of other things. I did the HTree index for ext3, which has been pretty much unbeatable over the last 10 years in terms of performance; it has given ext3 and ext4 a good leg up on the competition. But it doesn't scale all that well: when you go over a few million files in one directory, you start having issues. So while developing Tux3 I wanted to go back and see if I could do something about that. And quite amazingly, we did come up with an entirely new kind of indexing technology that is quite capable of handling a billion files per directory and meets or beats HTree in pretty much every way, except for a couple of easily quantifiable cases that are not that important. It generally, I hope, obsoletes HTree. You might even see it going back into ext4 if it performs to expectations. And version pointers, which I talked about briefly: well, I'm going to skip past this one. I already said something about it, and we have a few slides to get through.

So I'll just mention Jeff Bonwick's blog entry on this, which you'll find if you ever go searching for why free space should be mapped with extents. As the proof, he points out that if you have a heavily fragmented file and you've mapped it with bitmaps and you're going to delete that file, now you have to update zillions of bitmaps. But if you look at it critically, you'll notice that you have much bigger problems than your bitmaps if your massive file is that fragmented. So yeah, we looked at that in some detail and decided that bitmaps are still the best way to do things. One bit versus 16 bytes: that is a factor of 128 size advantage for a bit mapping a block. When your file system gets full is when you really care about all these bitmap blocks: you've got lots and lots of single blocks being mapped by bits, so ultimately the bitmap blocks win. Especially the way we do our log unify, where we don't actually write out a whole bitmap every time we change it; we just append a log entry, and sometime down the road we'll go and write out the actual bitmap block.

Allocation. This is where we are right now in Tux3: doing the allocation work. We've basically taken care of just about everything else that needs taking care of so that you can use Tux3 as a local file system. Allocation is an interesting challenge. For the most part, we can follow ext3's model, the Orlov allocator you've probably heard of. But there are some special challenges because we're non-destructive, copy-on-write; we have to keep moving stuff around, so all those heuristics have to be adapted for our special situation. That's in progress right now. Until it is actually completed, we're not going to be doing certain benchmarks, because our placeholder allocation strategy is just to allocate the next block, which Microsoft proved long ago to be a very bad strategy; it's what brought us defrag. So we'll continue; let me skip ahead. There's a lot to say about allocation, but I don't think we quite have time.

We've been through log and unify pretty well. There's one observation: our logging is a little different from the traditional logfs kind of logging, which was great for writing, because you're always just writing to the next available location, and really bad for reading, because your reads are horribly fragmented. We err on the side of no read fragmentation; we'll slow down our writing a bit to get less read fragmentation, and it turns out we don't have to slow it down very much. Our atomic commit is really a key element of Tux3. We found a way to do it that allows us to give full data safety at speeds faster than what a metadata-only journal can manage. And this is all about trying to achieve that state of instant off, where you can just hit the power at any time. That's really what I've been after since the very beginning with Tux2; it was about instant off and so on. We're actually getting a little bit closer to that. This is our asynchronous front and back separation. It's a very cool technique, made possible by a neat underlying technology, which I'm not really going to get into, called block forking: when somebody tries to write a block in cache that is already on its way to disk, we take that block out of cache and put another copy of it in its place.
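A minimal sketch of that, with made-up names and none of the real page cache interaction, just to show the shape of the idea:

```c
/*
 * Sketch of block forking with invented names, not the real Tux3 kernel
 * code.  If a front-end write hits a buffer that the back-end is currently
 * writing out, the buffer is forked: the in-flight copy stays frozen for
 * the delta being committed, and the writer gets a fresh copy that will
 * belong to the next delta.
 */
#include <stdint.h>
#include <string.h>

typedef uint64_t block_t;

enum buf_state { BUF_CLEAN, BUF_DIRTY, BUF_INFLIGHT };

struct cache {
    unsigned blocksize;
    /* lookup structures omitted */
};

struct buffer {
    void   *data;
    block_t block;
    int     state;
};

/* Assumed helpers, for illustration only */
struct buffer *cache_lookup(struct cache *cache, block_t block);
struct buffer *alloc_buffer(struct cache *cache, block_t block);
void cache_replace(struct cache *cache, struct buffer *old, struct buffer *replacement);

struct buffer *buffer_for_write(struct cache *cache, block_t block)
{
    struct buffer *buf = cache_lookup(cache, block);
    if (!buf)
        buf = alloc_buffer(cache, block);

    if (buf->state == BUF_INFLIGHT) {
        /* Fork: copy the data, swap the copy into the cache in place of the
         * original, and let the original finish its journey to disk. */
        struct buffer *fork = alloc_buffer(cache, block);
        memcpy(fork->data, buf->data, cache->blocksize);
        cache_replace(cache, buf, fork);
        buf = fork;                  /* the front end writes to the new copy */
    }
    buf->state = BUF_DIRTY;          /* joins the next delta */
    return buf;
}
```

The copy that is in flight stays frozen, so the delta being committed is a true snapshot of cache at the moment the back-end started, while the front end keeps writing into the forked copy that belongs to the next delta.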
Needless to say, that involved a lot of really arcane things that we did in the kernel to make it work. It does work now, pretty well. And it's very cool in the way it effectively lets us take a snapshot of cache, which considerably simplifies the process of getting a consistent update out to disk. It also keeps the front end from stalling.

So yeah, after all, it is about performance. That's the first thing you look for. You just assume that your file system on Linux is going to be reliable, and you judge it on performance. Inode attributes, already mentioned: variable everything. We wrote that code fairly efficiently; it doesn't slow us down and does not show up in CPU profiles. Scaling is a huge challenge, and this is where Tux3 brings a little something extra to the table. You can actually make a full file system that you can mount in a 16K volume if you use 512 byte blocks. If you use 4K blocks, it's more like 64K, something like that. So it really scales down to the smallest imaginable devices; it has a very simple internal structure and doesn't need very many elements to record its base data and metadata. And it goes up to an exabyte. Now, what is an exabyte? I did some back-of-the-envelope calculations and decided that an exabyte of disks today would be somewhere in size between a volleyball court and a basketball court worth of disk racks. So it's not really something that we can test in the foreseeable future. Who knows, maybe by the time the 2038 problem comes up, we'll actually be able to test and see whether Tux3 can really create a usable exabyte file system. But in the meantime, we'll just concentrate on making sure that every structure we use does scale: the B-trees scale, and everything.

Fsck is almost the number one issue facing the file system community now. If you ever do have to check your file system because you know it's got some corruption, it can take days; it's heading towards weeks. This is broken, and it's really something that we have to fix. We've got a little bit of a leg up on that with Tux3. We started working on our fsck, and it partially works now. We're looking at how to do it incrementally; that's still research. And in practical terms, what scale do we need to handle? We need to work well on the little devices, embedded devices, a DVD player; I've got a DVD player running Linux, and probably half the people in the room do now. And people will run it on the big Lawrence Livermore machines, where they run ext4 as the backing store for Lustre. As a Linux file system, we have to cover that entire spectrum in practice; there's no wiggle room there. Some of the file systems that you could describe as competitors to Tux3, Btrfs and XFS, don't really scale down that well, particularly Btrfs. That's one of the reasons we got back to work on Tux3.

OK, so I've got a couple of minutes here; I'm actually going to eat into my own question period a bit. I've got 11 minutes, to be precise. So let's look at something resembling a benchmark. It's just a linear write of one 4 GB file onto the file system. Tux3 writes it at 50.9 megabytes a second and reads it back from cold cache a little faster, at 60 megabytes a second. The actual raw speed of this device is 61 megabytes a second. So Tux3 is doing the write at about 96% of maximum attainable performance, and the read back is at 98%. And this is what it's doing.
This is a graph of write position versus time: position on the volume on one axis, time on the other, and here's our 4 gigabyte file being written. There's a little dot for every place where it has to seek somewhere else and write some data. Here you see the deltas: one delta, the next delta, the next delta. These little gaps here could be a performance issue; the write queue drains while we're doing our atomic commit in there, though we try to do that as quickly as we can. We're putting down our metadata as we write, and every now and then we do a unify. In this case, we're doing a unify every three deltas, just to exercise the unify. The name of the game here is to just write as much as you possibly can; knowing that you're not writing anything extra, you want to see that the write bandwidth is as high as possible. Another way of putting it is that we need to keep the write queue full.

Then we read that back, and you see the extra seeking beyond just the linear reads here. We read a couple of blocks here that got dropped up there by the last delta, and a couple more here. This was actually a bug: we had evicted those from cache. When we looked at this chart, we realized we were doing that and went and fixed it. So here we only did four out-of-line seeks for the whole read, and now it's down to two. That is really about as fast as it can be; again, keep that throughput as high as possible.

So that was a very simple performance situation, and obviously things get a lot more complex than that. I'm not going to do any more benchmarks today, but I will allude to some that we have posted. A few months back we posted one where we were faster than tmpfs, which drew some rather excited commentary. But we really did beat this very high-performance tmpfs, which is not a lot more than Linux's caching layer, and Tux3 was able to beat it while being backed by actual disk storage. It was able to beat tmpfs because our deletes and truncates run in the background in another process, and on tmpfs they don't. There's going to be more of that as we go. We don't see serious issues with hitting best of class in every benchmark at this point.

OK, shard map. I already spoke about shard map, and I'm already into my question time, so I'm just going to mention that this was an example of a real fit of inspiration. This is a problem I had been working on for 10 years, trying to find a way to fix HTree, and the answer came to me in one second: while having a shower and banging away at the problem, I just realized what I had to do. So shard map is, in memory, a hash table; on disk, it's a bunch of little FIFOs. This solves our update problem, and it also gives us very good cache performance. Shard map really needs a presentation all of its own. And the same with version pointers. It's a very cool technology, something like the weave structure that was the traditional form for CVS, or actually SCCS, for both of those. We came up with something like a binary form of that, which we think works really well for file systems.
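Before moving on to where the project stands, here is a toy sketch of the shard map idea as just described: a hash table per shard in memory and a small append-only FIFO per shard on disk. The names and layout are invented for illustration; the real on-disk format is Tux3's own.

```c
/*
 * Toy sketch of a shard map style directory index with invented names.
 * The index is divided into shards; each shard is a hash table in memory
 * and a small append-only FIFO on disk, so an index update is a cheap
 * append instead of a B-tree block rewrite.
 */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t block_t;
struct hashtable;            /* in-memory hash table, details omitted */

struct shard_entry {
    uint32_t hash;           /* hash of the file name */
    uint64_t where;          /* location of the dirent in the directory file */
};

struct shard {
    struct hashtable *mem;   /* in-memory form, rebuilt from the FIFO */
    block_t fifo_block;      /* current on-disk tail block of this shard */
    unsigned fifo_used;      /* bytes used in that tail block */
};

struct shardmap {
    struct shard *shards;
    unsigned nshards;        /* grows as the directory grows */
};

/* Assumed helpers, for illustration only */
uint32_t name_hash(const char *name);
void hashtable_insert(struct hashtable *table, uint32_t hash, uint64_t where);
void fifo_append(struct shard *shard, const void *data, size_t len);

void shardmap_insert(struct shardmap *map, const char *name, uint64_t where)
{
    uint32_t hash = name_hash(name);
    struct shard *shard = &map->shards[hash % map->nshards];
    struct shard_entry entry = { .hash = hash, .where = where };

    hashtable_insert(shard->mem, hash, where);   /* fast in-memory update */
    fifo_append(shard, &entry, sizeof entry);    /* cheap sequential append */
}
```

The append-only FIFOs are what address the update problem mentioned above: adding a directory entry touches one in-memory hash bucket and appends a few bytes to one shard's tail block, instead of dirtying and rewriting index blocks the way a B-tree update does.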
Progress. We've been working on Tux3 since 2009, about three years now. It's actually only a couple of years of real developer time, times two developers and a bit, so we've got about five man-years in it, and there will probably be more than 100 man-years in it before it's finished. Now, obviously, we're not going to do all of that ourselves. We're going to propose it for merge in a fairly usable state, and then people will come in and start contributing their talents. I went through a bit of that history: we actually got this working last Christmas, tested it, and wow, that made all the work before it worthwhile. At this moment, we are preparing to offer it for merge. We're solving the last three big problems, and we have declared that you have to actually be able to use it as a root file system; Hirofumi is running stress tests on Tux3 as the root file system right now in Japan. And of course, it will be use-at-your-own-risk for quite some time. There are the three big items. Memory map consistency was some fallout from the block forking: cool technology, but we had a really deep issue that we had to go and solve. That was done about two weeks ago; we came up with the solution. So merge should not be that far away. We'll offer it for merge, people get to kick at it, and we'll see what happens.

So I'll introduce you to the Tux3 core team: that would be me and Hirofumi Ogawa. There are other contributors, but we're the main ones. And I will thank you for listening and open up for questions.

Good question. The question is: is Tux3 intended to be a traditional hard drive file system, in other words, what about flash? So yeah, we're always cognizant of both, and what we discovered is that if you optimize for a hard disk, you are pretty much as well optimized as you can be for flash, assuming a flash translation layer. OK, no more questions. I thank you very much for attending. I'll see you later.