That song may go on for a while. I don't know. I can't remember exactly how it goes. This song may win some award for like, well, I don't know. There's probably a, I wonder what the best song is out there that actually has the fewest lyrics, right? Is there any awesome song where they just say one word over and over again? Or like two words? What's that? Part of like a rock star. Oh, OK. I've never heard that. Anyway, you guys can go figure this out at home. Like the best, you know, the Pareto optimal song in terms of the fewest number of words and most awesomeness. You can define that however you want. All right, so today we're going to pick up where we left off last time talking about how file systems handle failures. So I'll present a technique called journaling. And journaling is actually kind of an interesting approach, something that is applicable in other scenarios. It's actually something that's done by databases and other types of systems to recover from failure. So it's a general enough idea that it may be useful to you at some point in the future, and it's certainly been useful in the context of other non-file-system systems. And then we'll see how far we go. I don't think we're going to get through all this today. This is a lot of material. But at that point, we're going to talk about two classic file system designs. The first one, which I think we will get to today, is the Berkeley Fast File System. This is an oldie, but still sort of a goodie. FFS introduced a lot of ideas into the file system space that you guys are still familiar with, because they're still out there. So certain design decisions and certain features of modern file systems are sort of inherited from FFS. And then LFS, the log-structured file system, is a fun and really sort of controversial take on the idea of file system design, and a cool chance to see how a really radical redesign of file systems was motivated by a change in the performance characteristics of the underlying system. To some degree, this is what happens in systems all the time. The relationships between different parts of the system change, new ideas come into the space, we revive old ideas, certain things work that didn't work before, et cetera. So this is a case where the ratio between a couple of the components of the system in terms of performance changed, and people argued that this justified a different approach to how you do file system design. OK, please work on assignment three. The second part is kind of out. I think Scott's going to push some changes upstream later tonight, along with the targets for you guys to use for the second part of assignment three. We're also going to be a little bit more specific about some of the design requirements for your solution. Essentially, you cannot turn in DumbBM. We will do as much as we can to make sure that your solution is not DumbBM with a page allocator that works, or a souped-up version of DumbBM with like five hard-coded regions or something like that. Your solution also needs to support paging as opposed to segments, which is what DumbBM uses. And again, we will do our best to test those things. So if you are starting with DumbBM and thinking you're going to just rev up DumbBM a little bit and turn it in, please don't do that; we will try to make sure that doesn't work. Thank you, Ali. Someone keeps stealing my water bottles in the department. It's very odd. Anyway, either that or I'm losing them. I don't know. At first, I thought I was losing them.
Now I think that someone is walking off with them. I'll write my name on my next one with a big marker. Okay, so due to a miscommunication between Ali and myself, the midterms are not ready. We will have them for you later in the week. We need to take them over and get them scanned. If you're really desperate for your grade, come by while I'm in the office and I can tell you your score on the midterm. But we'll have those back. Overall, the class scores, unlike last year's disastrous midterm, which required me to spend like a whole half lecture building up people's confidence again, this midterm was not that bad. The median score was about a 33 out of 50. So that's kind of right where I like it. And we'll hand those back. Last year it was like a 13 or something out of 50. So that required some conversation. So I think this year's midterm was a little more appropriate. There was one perfect score this year, which is I think the first time that's happened. So that's cool. All right, any questions about the file system stuff that we've talked about already? I'm just gonna kind of plow onward. Okay, so remember the problem here is that I've got a bunch of things I need to change on the disk. Because of caching and other reasons, those changes sometimes make their way down to the disk at different points. And if the disk fails at some point where the on-disk data structures are not consistent, I can have various types of corruption. Some of the corruption can be really bad. It can cause entire files to disappear. It can cause their contents to disappear. I can leak space on the disk. I can mark things as allocated that aren't allocated, et cetera, et cetera. All of these problems are things that file systems try to avoid. So let's talk about another way of doing this. What we did last time was we said, you know what, we're just going to try to make sure that we understand the cache behavior. And when we're writing out parts of the file system that need to be written out consistently, we'll make sure they write through the cache and go to disk immediately. However, here's another way of doing things. So the observation in journaling comes down to the following. If I think about the disk itself, what operation on the disk is not atomic? Writing a bunch of disk blocks is not atomic. So regardless of whether I'm talking about flash or a spinning drive, if I need to make changes to a number of different parts of the disk, I have to make those changes sequentially, and there are gaps of time in between them, particularly on a spinning disk where I've got to move the heads around and stuff like that. So any operation that requires writing multiple disk blocks is not something that I can do atomically. And so if I'm interrupted in the middle of it, I can have corruption on the disk. So I can have an inconsistency. However, if writing multiple disk blocks is not atomic, what is atomic, or at least more atomic? What operation? So with multiple disk blocks, you can imagine on a spinning disk, I've got to move the head here, write one block, move the head over here, write another block, move the head over here, write another block. What would be potentially atomic in this situation? Yeah. Just writing one block? Yeah, so I write one block. So that block, to some degree, that block will either be completed or not. Now technically there's this tiny little gap of time there where I'm actually doing the write.
And so it's possible, if my failure happens at exactly the right moment, that I still might have like half the block written. But still, to some degree, that block just didn't finish writing, and the time gaps are much smaller because I'm not seeking around the drive. So what I can do is use that as a starting point to build up an approach that allows me to atomically store information about what was supposed to happen to the disk, the operations I was supposed to be doing, and then use that atomic log to check and reconstruct other parts of the disk. So let me also just quickly contrast this with the other approach. Does anyone ever run fsck on their drive? So there are tools out there that will check the integrity of a file system. You can take this tool, and one of the things you can imagine it would do is it goes through and it builds up a model of the tree that the file system is representing and makes sure that there are no disconnected inodes and things like that. And you can check a bunch of things. What's the problem with using a tool like this? Somebody make a guess about this tool's performance. What's that? Well, okay, so let's say conservatively this tool, in order to check the consistency of all the on-disk data structures, has to touch how many disk blocks? All of them. Every block that has something in it, whether it's an inode or potentially a data block, I have to touch those, and that's terrible. That just takes forever. So these tools are really slow. What's the problem with this? You could say no problem, who cares? I'm not gonna have to use this very often. But why would it be a problem? Because some of these old, I mean they did a great job, some of these old disk consistency tools, you can run them on the file system, and in like 12 hours it would complete and you'd have some information, and it might be able to fix some things automatically or whatever, but why is that such a problem? Yeah, it doesn't really matter how much it's loading the system. Yeah, Zach. The disk is offline usually during these times. Now eventually you had fancier versions of these that could run in the background, but just imagine that I have to take this system offline for half a day to fix this. So it's hard to get to four nines or five nines or six nines of availability on something like that. Now, for your own personal machine, who cares? I mean, this explains the enduring popularity of the Windows defragmentation tool. It might take seven hours to run, but it's really fun to watch. It's like a video game. Who cares? Unless your system is serving as some sort of hidden torrent node or something like that, no one is really affected by this. But when you run a server and someone says, oh, whoops, someone tripped over the power cord. Not only did the server go down, and do I have to bring it up again, but oh yeah, you have to sit there for half a day while I check the integrity of the file system. So not okay. This is not really good or acceptable. So how can we limit the amount of things we have to check? There's a technique called journaling, which is really common in modern file systems. EXT4 supports it. Most modern file systems support journaling as a way to improve recovery performance. So what do we do? We use the journal to track the changes that we are going to perform to the disk, and the journal keeps track of when those changes actually make it to disk and are fully committed. You can think of it this way.
So I record those changes in the journal, and remember, the journal is more atomic than making the changes themselves. So you can imagine that I can set up the journal so that a record in the journal fits inside a disk block, and then it's atomic to actually update the journal and say that I did something or I was going to do something. After a failure, I use the journal to figure out what is still inconsistent about the disk and bring it up to date. So rather than having to check the entire disk, which is what these old fsck approaches would have to do, the journaling approach allows me to focus on the parts of the disk that might have problems and make sure that they're in a consistent state. So let's walk through an example of how this might work. So here's an example of a journal entry, and this is obviously not what would be on the disk, although it would be funny if you implemented it this way. Here's what I was in the process of doing. I was going to create a file, and we know the multiple different operations that this requires. I have to allocate an inode, I have to find some data blocks, and then I'm going to put the inode in a particular directory. So this was something that I was doing. So before I actually go about making the changes to the disk that this would require, I write this down in the journal. Here's something I was in the process of doing. Then once all those operations are completed, so you can imagine, without the idea of a checkpoint, or being able to remove journal entries, the journal would just grow forever. What happens is once I know that all of those operations have been completed, I can mark that entry as completed, and when I have a whole batch of entries up until a particular point that are completed, I call that a checkpoint. So when I flush a certain number of changes to disk, the system can figure out all the journal entries where the on-disk contents are now consistent with what I was trying to accomplish, and those I can get rid of. Those are done. So this is called a checkpoint. A checkpoint is a point in the journal where everything before it is on disk, and not necessarily everything after it is on disk. Does that make sense? Some of the things after it may have made their way to disk, but not all of them. But the point of a checkpoint is everything before the checkpoint has been committed; all the on-disk operations required to reflect that state are done. So here's one way of doing this. Now, one of the things this allows is for these operations to happen more slowly. I don't have to flush all these things to the disk immediately and create a lot of IO. What might happen is, over time, I might do something like associate the data blocks, because the file might get used, but some of the other changes might linger in the cache for a little while before they actually get crossed off my list. But once I know that all of these operations are done, and all this data has made its way to the disk through the cache, I can mark this journal entry as complete, and I can put a checkpoint below it, or below all the other entries that are finished. Now what do I do when I recover? So let's say that the system crashes, doesn't shut down cleanly. How do I use the journal to guide the recovery process? What's the algorithm here? What's a good starting point? Yeah.
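To make that concrete, here is a rough sketch of what one of these journal records might look like if you sized it to fit in a single sector. The field names, the twelve-block limit, and the magic/checksum scheme are all invented for illustration; no real file system lays its journal out exactly like this.

```c
#include <stdint.h>

#define JOURNAL_MAGIC 0x4A524E4C   /* "JRNL": lets replay spot garbage entries */

/* Hypothetical "create file" journal record, padded to one 512-byte sector
 * so that writing it is a single (nearly atomic) block write. */
struct journal_create_entry {
    uint32_t magic;           /* valid-entry marker                          */
    uint32_t seq;             /* sequence number, increases monotonically    */
    uint32_t op;              /* e.g. a hypothetical JOURNAL_OP_CREATE       */
    uint32_t inode_num;       /* inode we intend to allocate                 */
    uint32_t parent_inode;    /* directory that will hold the new name       */
    uint32_t data_blocks[12]; /* data blocks we intend to allocate           */
    uint32_t done;            /* set once all the real writes reach disk     */
    uint32_t checksum;        /* lets replay ignore half-written entries     */
    uint8_t  pad[512 - 19 * sizeof(uint32_t)];  /* pad out to one sector     */
};

_Static_assert(sizeof(struct journal_create_entry) == 512,
               "entry must fit exactly one sector");
```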
Yeah, so I back up to the last checkpoint, because remember, the checkpoint guarantees everything preceding it in the journal is on disk. So I know that already. You know, I wrote that down before I crashed. So I back up to the last checkpoint, and then what do I do? Yeah, I use, this is called replaying the journal. I use the journal to check and make sure that the things that I was supposed to do are done. So I start at the last checkpoint, and I work forward through each entry, and what I do is I check the on-disk data structures to make sure they're consistent. Because remember, the checkpoint doesn't mean that none of the operations after it are on disk; it just means that not all of them necessarily are. So some of those operations might be done. So I might take this, and I might say, okay, I see that I already allocated inode 5567, that's already marked as in use. Okay, that had been flushed to disk. You know, oops, I forgot to associate these data blocks with the file, I'm gonna do that now to make sure this is up to date. Gonna add inode 5567 into the right directory so that it doesn't end up in lost and found, hanging out there. And once I get to the bottom of all my journal entries, I'm finished. And now the nice thing is the disk is up to date, it's in a consistent state, and I've touched as little of the disk as possible. Does this make sense? Any questions about this approach? Yeah. Is it possible for the journal itself to get corrupted when the power goes out? Yeah, so what happens if I have an incomplete journal entry? They're ignored, right? If I have an inconsistent entry at the end of the journal, like let's say I had a journal entry that spanned a couple of blocks, and so I only wrote out a couple of them or whatever, those just get ignored. So journals are not a magic silver bullet against data loss after a crash. I'm not gonna say it's impossible, because maybe somebody will make it possible soon, but it is very, very hard for a file system to guarantee that it will not lose data during a crash. But at least what the journal allows me to do is get back to a point where I know that the file system is in a consistent state. And that's really what I want. It's okay if maybe there was like one write or one append to a file that I didn't finish. Okay, so there might be a little data loss here, but at least if all the on-disk data structures are consistent, I have a good starting point for moving forward. So in this case, if I got halfway through here, actually what I would probably do is undo this journal entry, make sure that these changes are not on disk, because otherwise I would leak some of this information. What happens if the file system crashes before it actually writes the journal entry? What do I do in that case? I mean, nothing, right? That's data loss. There's no way to get around that. If I was about to do something and I was in the process of writing the journal, which I always do first, and I didn't even get that far, then sort of by definition that operation is gonna be lost. But again, that's okay. It is not the goal here to prevent all data loss after some sort of crash. That is impossible, right? Don't try to bend the spoon. What is possible, though, is to make sure that I can rapidly bring the file system into a consistent state where I can start up again.
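Here is a rough sketch of that replay loop in C, reusing the hypothetical journal record from the sketch above. All of the helpers (read_journal_entry, the redo_* calls, and so on) are made-up stand-ins for real file system internals; the point is just the control flow: start after the last checkpoint, redo each complete entry idempotently, and stop at the first torn or missing entry.

```c
#include <stdint.h>

/* Hypothetical helpers -- stand-ins for the real file-system internals.   */
struct journal { uint32_t last_checkpoint_seq; };
int  read_journal_entry(struct journal *j, uint32_t seq,
                        struct journal_create_entry *out);  /* 0 on success */
int  entry_checksum_ok(const struct journal_create_entry *e);
void redo_mark_inode_allocated(uint32_t ino);
void redo_mark_blocks_allocated(const uint32_t *blocks, int n);
void redo_add_dirent(uint32_t dir_ino, uint32_t ino);
void write_checkpoint(struct journal *j, uint32_t seq);

void replay_journal(struct journal *j)
{
    /* Everything at or before the checkpoint is already on disk. */
    uint32_t seq = j->last_checkpoint_seq + 1;
    struct journal_create_entry e;

    while (read_journal_entry(j, seq, &e) == 0) {
        /* A torn or never-finished trailing entry is simply ignored:
         * the operation it would have described never "happened".    */
        if (e.magic != JOURNAL_MAGIC || !entry_checksum_ok(&e))
            break;

        /* Each redo is idempotent: if the change already reached disk
         * before the crash, doing it again is a no-op.               */
        redo_mark_inode_allocated(e.inode_num);
        redo_mark_blocks_allocated(e.data_blocks, 12);
        redo_add_dirent(e.parent_inode, e.inode_num);
        seq++;
    }
    /* Metadata is consistent again; advance the checkpoint. */
    write_checkpoint(j, seq - 1);
}
```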
So one other thing I wanna point out here, which is kind of interesting. You might have noticed, when I wrote this journal entry, let's go back here, that the journal entry here reflects changes to the metadata: the free inode bitmaps, the free data block bitmaps, things like this. It does not reflect, like if I was doing an append or a write, the journal here does not reflect the contents of that operation. So in the model that I've shown you, this allows me to ensure that the file system metadata is up to date, but it does not guarantee anything about file operations that might have been going on. Like, there might be some data that hadn't been written to the file yet. And I think that's an okay design trade-off. It is more important that the file system metadata and data structures stay up to date, because if those start to get wonky, you have a huge problem. If I lose one or two writes that were going on at the time of the failure, that's all right, right? As long as the metadata is in a consistent state. The other thing, of course, is that while it's possible to reconstruct the file system metadata in a lot of cases by doing what I talked about before, where I scan the whole disk, there's no way to reconstruct the contents of the file. If you were writing a certain number of bytes into the file and they were in the cache when the system crashed, I have no idea what those were, right? So even if I touch every block on the disk, I still can't reconstruct the contents of the files that might have been lost during the crash, whereas in a lot of cases I can reconstruct the metadata. Yeah, so there were some early journaling approaches that actually journaled the data blocks themselves, which you can imagine is kind of sad because now every data block gets written twice, both to the journal and to the file. Most of the time you just don't do that. All right, any questions about journaling? Journaling is an approach that I suspect you can see how it generalizes to other types of systems. The big idea here is that I represent some complex change I'm going to make to a bigger system in one place that I can change easily. And then if I get stuck in the middle, I have something to look at to try to figure out what I was gonna do. So this happens in databases, this happens in distributed systems. This idea comes up in a lot of other places. I don't know if it came up here first, but it certainly is a very big part of file system design, even within just one disk. And you can imagine, if I was doing some sort of transaction in Google's data center that involves five different machines, this idea is very useful, because if one of the machines fails or whatever, I can kind of figure out what I was doing and make sure that everything is up to date. All right, questions about caching and consistency? Yeah. Yeah, good question. So the journal will create more disk traffic. That is undeniable, yeah. There are, I'm trying to remember, there are ways to cache parts of the journal, as long as the cache understands that the journal entries have to be written out to disk before any of the changes they describe, right? So that's the normal requirement: the entry has to be in the journal before I start making changes to the disk. So you can cache those, but you have the same trade-offs you have with other caching operations, right? It's possible that the contents are lost.
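Just to illustrate that ordering rule, here is a sketch of how a create path might be structured around it. Every helper here (plan_create, journal_append, flush_journal, and so on) is a hypothetical stand-in; the only thing to take away is where the two "must be durable" points sit relative to the metadata changes.

```c
/* Sketch only: all helpers below are hypothetical, not a real API. */
void create_file(struct journal *j, uint32_t parent_ino, const char *name)
{
    /* 1. Describe the whole operation in a journal record and make that
     *    record durable before touching anything else on disk.          */
    struct journal_create_entry e = plan_create(parent_ino, name);
    journal_append(j, &e);
    flush_journal(j);                  /* the intent is durable first    */

    /* 2. Now make the real metadata changes.  These can sit in the
     *    buffer cache for a while; the journal already records intent.  */
    mark_inode_allocated(e.inode_num);
    mark_blocks_allocated(e.data_blocks, 12);
    add_dirent(parent_ino, e.inode_num, name);

    /* 3. Only once all of those writes have reached the disk do we mark
     *    the entry done, so it can later be checkpointed away.          */
    flush_metadata();
    journal_mark_done(j, e.seq);
}
```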
You know, when you get into the world of IO, you get into weird stuff, right? Reliability and fault tolerance and the ability to prop a system back up are important enough that people have thrown a lot of fancy ideas at this problem. So for example, you have servers, well, first of all, what's the most basic solution to a certain category of the types of failures that would cause problems for the file system? What's the simplest thing I can do? Yeah, what's that? Eh, even simpler than replication. Cheaper and simpler than replication. Yeah. No, you guys are thinking, like, think around the machine. Like, what can I do in the data center to prevent a certain category of these errors? Make sure the power doesn't go out. I'm serious, you guys are giggling, but every data center needs to have a UPS on site, right? And a lot of the big ones have backup generators and all sorts of other fancy power supply systems and a lot of power cleaning. You thought the power that you got on the grid was good enough for these computers? No way, right? They have all this complicated processing to keep the power lines very stable. Yeah, so that's step one: just don't let the power go out. Then I'm okay. Then as long as some moron doesn't come and, like, punch my computer in the face, I'm okay. I can handle a lot of different types of failures, right? The things people brought up: replication; you have these fancy server hard drives that have battery-backed caches, just in case somehow the power goes out, they have an actual battery on them that allows them to finish writing things to the drive before they shut off, blah, blah, blah, right? So replication is certainly a big one too. I mean, now storage has gotten cheap enough that I just throw more storage at the problem. I have two and a half copies of everything. If you really look into it, if you wanna spend a couple of fun hours, start looking into, and you would think this would be an easy question to answer, how to reliably store data for like five or 10 years, cheaply. This is not a solved problem. You can spend a lot of money to have someone do this for you, but if you have huge amounts of photographs or videos and you actually want them to survive for a while, you think, oh, I'll just put it on a USB stick? Nope. Those only last a couple of years, blah, blah, blah. So, I mean, storage is hard. But there's a lot of exotic solutions out there, because storage is also really useful. Okay, so let's talk about FFS. Now we have to time travel back to 1982. How many people were born in 1982? Okay, yeah, I think it's me and Ron. Yeah, so old, right? This was included in the first BSD release in 1982, and to some degree, I think FFS and BSD grew up together, right? BSD was one of the first really effective and popular distributions of Unix, and it brought FFS along with it. There's a guy named Kirk McKusick who has been essentially the FFS developer now for, I don't know, forever. He's still working on sort of modern variants of FFS the last time I checked, and the whole file system was pretty much built by him. I'm certain that other people have contributed by this point, but it was pretty impressive. So FFS is still in use under a different guise, as something called the Unix file system. Has anyone ever used UFS? I've never used it. I don't know, maybe it's a BSD thing.
Anyway, so UFS is still out there, still being actively maintained, and, as you can imagine after 30-some years, it has a lot of cool features, a lot of modern features, but it's still something that people use. So FFS is kind of an interesting beast, because if you look at the system as a whole, there's a certain set of contributions that FFS made to file system design that have been really lasting. And then there are others that are kind of things that we got rid of as soon as we could, because they were really tied to a deep integration with the geometry of disks at the time. And those were the features that kind of didn't survive. Partly because disk geometry, I mean, disks have gotten faster, and flash makes geometry irrelevant. And to some degree, some of the geometrical stuff that FFS did will sort of blow your mind, but it's certainly not relevant anymore. But it's pretty cool, and this is kind of a neat example of somebody really making a deep and careful study of a particular piece of hardware and designing a system that was really tightly coupled to it. So imagine that you take the disk that we talked about and make everything slower. It spins slower, the heads move really slowly, there's not really much data on it. What are some of the different types of challenges that you might want to tackle with that drive? Remember, as a file system, you are creating an on-disk data structure to hold information. So on certain systems, you can just pretend that every disk block is the same as every other disk block. Once you get to a really slow disk, what are some of the things that you might want to keep in mind? Or the things that you might want to try to do? Or the features of disk geometry that you might want to exploit? Yeah. Okay, so that's an interesting point. Sean brings up a good question: why is the outside of the disk a better place to put things that are accessed more often? Yeah. That's not quite correct. I mean, the head is moving laterally, right? The tracks are an equal distance apart pretty much across the entire disk, I think. What's different about the outside of the disk? Yeah. So okay, that's a great point. The data density is constant. So the track at the very outside is the longest, which means you can store more data there. So that's good, but what else does it mean? Yeah, I mean, this is basic physics, right? The disk is rotating at a constant speed. So the head is passing over an entire track in one rotation. The inner tracks are shorter, meaning that there's less information on them. And that means the bandwidth from the inner tracks is smaller. Does that make sense? In one rotation, I see more data on the outside, and so I can pick up, I can write more information, but I can also read and write information faster because there's more going under the head. Do you guys remember that video? You might have noticed that most of that activity was happening at the outer edge of the disk. You didn't really see it spending a lot of time close to the inner part, and that's because, well, again, I hate to be categorical about things, but all sane modern file systems start allocating at the outer edge of the disk and work inward, because you hope that you don't have to use the inner parts. Did you have a question? So, but, you know, all this is sort of based on layout, right?
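Here is a tiny back-of-the-envelope version of that bandwidth argument. The numbers (7200 RPM, 1500 KB on an outer track, 750 KB on an inner one) are invented for illustration; the only point is that sequential transfer rate scales with how much data fits on the track passing under the head.

```c
#include <stdio.h>

int main(void)
{
    double rpm       = 7200.0;          /* rotations per minute (made up) */
    double rot_per_s = rpm / 60.0;      /* 120 rotations per second       */
    double outer_kb  = 1500.0;          /* data on one outer track        */
    double inner_kb  = 750.0;           /* inner track is shorter, so at
                                           constant density it holds less */

    /* In one rotation the head passes over an entire track, so the
     * sequential rate is roughly (track size) x (rotations per second). */
    printf("outer track: %.1f MB/s\n", outer_kb * rot_per_s / 1000.0);
    printf("inner track: %.1f MB/s\n", inner_kb * rot_per_s / 1000.0);
    return 0;
}
```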
So this goes back to the things that we were talking about before. So where do I put common data structures that I could be accessing all the time, like inodes? Here's the other thing to keep in mind, and this is a big difference. I should have said this before, but this is a big difference between FFS and LFS. What else is true in 1982? Right. Yeah, okay, so the drives are smaller, the data density is lower, everything's slower, but what else is true about this system? There's another part of the system that's really critical here to file system performance that's also changed quite a bit since then. Yeah, the buses were slower, that's okay. What's that? No, I mean, we just spent a couple of lectures talking about using something else to make the file system faster. What was that? Memory. So these systems had a lot, lot less memory. Now, memory has also sort of been subject to Moore's law, and so today I suspect the ratio of memory to file system size is probably different than it was back then. We had less memory, less memory for caches, and so remember what we were saying about file systems like EXT4: maybe it doesn't matter as much where you put the inodes, because the cache is gonna soak up a lot of your accesses to inodes. On a system in 1982, not true, right? I have to worry about where the inodes are on disk, because my cache is not very big, and I'm gonna spend a lot of time finding the inodes and potentially operating on them. And then you think about things like the relationships between files. So from a completely probabilistic aspect, if I touch a particular block in a particular file, what's the likelihood that I will next get a request for some other part of the file? And to some degree, you might argue that the probability is uniform over the entire file system. It's not quite true, and there are certain cases where I can get away with some clever optimizations if I can observe certain patterns of file access. All right, so let's talk about some of the common features that FFS introduced. We'll come back to the rotational planning, but let's talk about some of the things it introduced that you guys are familiar with. So FFS introduced sort of a modern large block size, which happens to coincide with the page size. FFS introduced the idea of trying to do contiguous block allocation on disk. This is important because things that are contiguous on disk are close to each other on disk, and so you can get those things more easily without having to move the heads around a lot. FFS introduced a bunch of features: symbolic links, file locking, long file names. They could be maybe not as long as you wanted, but long enough not to matter. And things like quotas for shared file systems. So these were all features that FFS introduced that are pretty much standard operating procedure now. All right, so what's close on disk? Remember, going back to our geometry stuff, there were two things that were slow. What are the two things that prevent things from being close on disk? Well, another way to think about it: if I'm in one spot on the disk and I want to go to another, there are two things I have to do. What are they? Well, they're both physical movement, right? There are two types of physical movement. So I need to move the head, so my head is moving laterally, and then I need to do what? Yeah, wait for things to come under the head.
So I need to wait for the disk to spin. Now, over time, the seek times have really, I think this is another one of these ratios that's changed over time. The seek times, moving laterally, have not changed very much, and I think also, as the tracks have gotten smaller, I suspect that it's gotten harder to center the heads over a specific track. Remember, not only am I flying over to the other side of the disk as fast as possible, but once I'm there, I have to make sure that I'm in the exact right spot, where the track is as small as possible, because the smaller the track, the more data I can put on the disk. So I think that's probably kept seek times from improving to the same degree. Now again, you would probably be happier with a disk that had a slightly slower seek time in exchange for 10 times more capacity. That's probably the trade-off that we've made. So here are some ideas that FFS introduced in this space. One is this idea of a cylinder group. A cylinder group is all of the data on every platter of the disk that I can read without moving the head very far. And it's not without moving the head at all; it's just without moving the head very far. So on each cylinder group, FFS creates almost like a mini copy of the file system. So it creates a backup copy of the super block. Remember, the super block has all this metadata about the file system itself. This is something that it's useful to have multiple copies of, because if I lose it, the file system is toast. So I create a copy on every cylinder group. I create a cylinder-group-specific header that has information about that cylinder group that's kind of similar to what you would have in the file system super block. I create the inode table and then a number of data blocks. Does anyone remember what this looks similar to? What does this remind you of? What's that? Somebody's mumbling something back here. No. What other file system, I mean, we looked at a file system before. These are like EXT4 groups, right? Remember, EXT4 had these groups, like zero, one, two, and each one of them had inodes, data blocks. I mean, this is kind of where this idea came from. And this is almost like I've taken one file system and, on each cylinder group, I'm creating a mini file system. The advantage to doing this, particularly in 1982, is that it really ensures that things are close to each other. The data blocks in the cylinder group are linked to the inodes within that cylinder group, and so that sort of guarantees that I can read the contents of a file without leaving the same cylinder group, and therefore I don't have to move the head very far. Ah, there we go. It's like its own little mini file system. So yeah, this is like EXT4 groups. We saw this before. And this is kind of the FFS stuff that will sort of just blow your mind, right? So the super block that FFS used, it would actually do some profiling of the disk and use it to do this really detailed rotational planning. This is not a problem that we have anymore, because bus speeds have improved, but it turned out that at the time, so somebody brought this up before and it's a great point, the old buses that connected the disks with the rest of the system were not fast enough to read or write data at the full speed that it could be read or written from the disk. Does that make sense?
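Before going further into the rotational planning, here is a very rough sketch of what one of those cylinder groups holds. The struct names, sizes, and counts are all invented for illustration; the real FFS on-disk layout is not literally a C struct like this.

```c
#include <stdint.h>

/* Placeholder types -- stand-ins for the real on-disk structures. */
struct superblock { uint8_t raw[512]; };   /* file-system-wide metadata */
struct inode      { uint8_t raw[128]; };   /* one on-disk inode         */

#define INODES_PER_GROUP 1024
#define BLOCKS_PER_GROUP 8192

/* Rough sketch of one FFS-style cylinder group: a mini file system. */
struct cylinder_group {
    struct superblock backup_sb;           /* redundant superblock copy,
                                              one per group             */
    struct {
        uint32_t group_num;
        uint32_t free_inodes;              /* per-group allocation      */
        uint32_t free_blocks;              /*   summary information     */
        uint8_t  inode_bitmap[INODES_PER_GROUP / 8];
        uint8_t  block_bitmap[BLOCKS_PER_GROUP / 8];
    } header;                              /* cylinder-group header     */
    struct inode inode_table[INODES_PER_GROUP];
    /* ...followed on disk by BLOCKS_PER_GROUP data blocks.  Keeping a
       file's inode and its data in the same group keeps seeks short.   */
};
```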
So I can sit there on a platter and I can pick up data at some rate, but I can't get it over the bus to the rest of the system at that speed, okay? So what does that mean? It means I can't actually read consecutive disk blocks. I might be able to read a few until the cache on the disk fills up, but if I have a big file, if I start reading it consecutively, zero, one, two, three, consecutive sectors on the disk, at some point I'm gonna have to stall and wait for the data to get to the rest of the system, into memory, and then pick things up again. Why is this a problem? Like, what makes this particularly unpleasant? Let's say I start off and I get zero, one, two, three, and then I can't read four yet because the cache is full. What do I have to do? Well, I know I want four next, but then what do I have to do? Let's say I can't read four yet, yeah. No, the buffer's full at this point. Yeah, Ron's got it. I need to go all the way around again, right? So I got zero, one, two, I'm ready for three, but I can't fit three into my cache, and these disks weren't spinning as fast as they are now, so now I've wasted this whole rotation waiting for three. So FFS, I'm giggling, this is just so awesome, right? FFS would actually measure this, and it would lay out files like this. It would say, okay, well, I know that I can't read things this fast, so rather than allocating blocks sequentially for the file, I'll make sure I allocate them with these gaps in between that allow me to read the file sequentially, all right? And this is one of those things that just makes you think, like, I don't know, I don't know what to say, all right? I'm happy disks aren't like this anymore, but on the other hand, if this is what you're given, this is what you have to do, right? I mean, sometimes engineers have to make things that are kind of nasty, all right? So, question: does any of this FFS stuff, particularly the geometry stuff, even matter anymore? And if not, why not, right? And this is a good thing to think about. So one thing I'll point out is FFS and a lot of these early systems dealt with low-level hardware interfaces that were, on some level, a little less clever than they are now. Someone came up to me after class, maybe a month ago, talking about, I think there was a question on the midterm about voltage scaling and frequency scaling for processors. There are CPUs that are coming out right now that actually do this for you. So they think that they're smart enough that, if you provide them with a workload, they're actually going to choose when to speed up and when to slow down automatically. And to some degree, this has happened on a variety of different types of pieces of hardware. Disks are not immune to this. So, you know, for three, four, five decades now, both software people and hardware people have felt a sense of responsibility about solving this particular problem: how do we make IO fast? The problem is that some of these solutions end up making things difficult for the other part of the equation. I mean, this is a hardware-software problem. You're not going to make it go away, but how do you expose information, and whose responsibility is it to do things, right? In general, the trade-off here, which is sort of ubiquitous in computing, is that you can make a piece of hardware that is very, very good at doing one particular thing, and it can do that thing extremely quickly.
It really doesn't matter what it is. At some level, there are certain software approaches that don't work well when you try to implement them in hardware, but that's usually because you need too much hardware. For most things, I can make a chip that does that one thing way, way, way faster than you will ever get it to run on a general-purpose processor, right? On the other hand, software is super flexible. It's easy to update, but it is always going to lose on performance. So today, I would argue, my take on this is that things have shifted very much in the direction of software. Why is that? Why would you make an argument that software is where the problem is at this point? Yeah, okay. Well, I guess we're a little bit worried about that. So maybe some of the hardware gains aren't there anymore, but what's happened? I mean, okay, well, remember, though, this is a trade-off here between performance and flexibility. So hardware allows you to say, you know what, in exchange for a lack of flexibility, I am going to gain really impressive performance. Why might things be leaning in one direction or another today, in 2016? Yeah. Okay, that's fair. I mean, people have different needs. What else? What's happened over the last 50 years? What's that? We need to create more problems for software engineers because we have more of them. I like that. Not sure that's completely, don't tell anyone else that. It's not quite what I'm thinking. Yeah, I mean, come on guys, hardware is so fast. Just so fast already. To some degree, find your computer engineer buddy and give them a big pat on the back, because they did a great job. They gave us these things that are so powerful we have no idea what to do with them. That's the problem now. The problem isn't like, oh no, my smartphone's not fast enough. I just don't think that's really a problem. And if it is, software is probably the culprit. So to some degree, you could argue, even if Moore's law is starting to flatten out a little bit, Moore's law happened over a period of multiple decades. And the rest of us, people who write software, I would argue, and I'm probably getting in trouble for saying this, but I think that if we had a decade where computers didn't improve in performance at all, nothing, just zero, no new chips, nothing faster, we're just stuck with the same hardware we have today, we could solve a lot of problems with that hardware. We don't need new hardware to solve a lot of the problems that we have. So anyway, there's my little rant. Let's talk about how this works at the disk level and how this can sort of frustrate some of the things that FFS was trying to do, and then we'll be done for today. So all the rotational planning that FFS was doing relied on the disk obeying the geometry assumptions that FFS was making. So if the things that FFS thought were close on disk weren't close on disk, this becomes a problem, right? So here's one way of saying this. This was sort of the early model of disks. You had sector numbers, and you could pretty much map those directly to where things were on disk. And tools and systems like FFS exploited this to do clever things, right? The advantage of this is that it provides the OS with a lot of visibility, and you can write pieces of software like FFS that rely on this information.
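As a toy illustration of that old model, here is roughly what the fixed block-number-to-geometry translation looked like. The geometry constants are invented, and real drives had quirks like 1-based sector numbering, but the point is that software could trust this arithmetic to tell it what was physically adjacent.

```c
#include <stdio.h>

#define SECTORS_PER_TRACK 63
#define HEADS             16   /* tracks per cylinder (made-up geometry) */

struct chs { long cylinder; int head; int sector; };

/* In the old model, a logical block number mapped deterministically to a
 * cylinder/head/sector position, so "consecutive" meant physically close. */
static struct chs lba_to_chs(long lba)
{
    struct chs p;
    p.cylinder = lba / (HEADS * SECTORS_PER_TRACK);
    p.head     = (int)((lba / SECTORS_PER_TRACK) % HEADS);
    p.sector   = (int)(lba % SECTORS_PER_TRACK);   /* 0-based for simplicity */
    return p;
}

int main(void)
{
    /* Two consecutively numbered blocks: under this model they really are
     * adjacent on the platter -- exactly the assumption remapping breaks. */
    for (long lba = 1000; lba <= 1001; lba++) {
        struct chs p = lba_to_chs(lba);
        printf("block %ld -> cyl %ld, head %d, sector %d\n",
               lba, p.cylinder, p.head, p.sector);
    }
    return 0;
}
```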
The problem, from the perspective of the hardware people, is that they're like, you guys don't know what to do with this information, and you write this crappy code that crashes all the time. I mean, give them credit: if your processor rebooted as many times as your OS does, you would be mad. Hardware just can't fail in the same way. So here's another approach, and to some degree this happens already today. Remember I said before that modern disks have these huge performance variations even within the same model? Modern disks can take a sector that's failed and remap it to another part of the disk. What this means is that assumptions that the file system is trying to make about where things are on disk get broken. So I thought that these two blocks were right next to each other because they're consecutively numbered, but it turns out that the disk has remapped one of them way off to the edge of the drive, and now every time I hit my inode table I'm doing these huge seeks. It's not good. And pushing more complexity into the device, to some degree the device, the little firmware that runs on the device, is closer to the hardware and can optimize performance better in certain cases. The problem here is that the operating system ends up doing dumb things. So a friend of mine, about a decade ago, did this interesting study where he looked at the performance of certain file system designs on RAID, which is a system that we're gonna talk about later in the week. And what he found out is that the file system designs were trying to do these clever things that were based on on-disk layout. So again, they were based on assumptions about locality. When the underlying disk breaks those assumptions, not only are they not helpful, but they are actually harmful. So he wrote a paper about it that was called "Stupid File Systems Are Better," because the idea was that these clever things the file system was doing to try to optimize performance were actually causing performance degradation. Okay, so I will just wrap up on FFS there. I'll put these slides online; you don't need to see the rest of this material. We will start on LFS, log-structured file systems, on Wednesday.