It's good this class is being videotaped because there's like five people here. All right, that's too bad, but it's kind of a fun lecture today. So we're going to finish talking about journaling today, which we started at the end of class on Wednesday. And then we have about a lecture and a half or two lectures on specific, interesting file system implementations. These are historical looks at two fairly important file systems. One is the Berkeley Fast File System, which we're going to talk about today, and which introduced a lot of what we now consider to be standard file system features. And then on Monday, we'll talk about log-structured file systems, which is another pretty interesting file system design point. The assignment three autograding is ready to go. Now, we might make some tweaks to it over the next few days, but everybody who's been waiting for the assignment three autograder to be done before starting: you don't need to wait anymore. So from here on out, you guys have, I don't know, four weeks. What's today? Today's Friday. Yeah, exactly four weeks to finish up everything, basically. So good luck. All right, so any questions about file system caching and consistency? On Wednesday we finished talking about caching. We talked a little bit about how caching behavior could potentially exacerbate consistency problems, and how the consistency guarantees that the file system is trying to provide might be more difficult to provide once we start caching things in memory. Any questions about this before we go on? All right, so just a bit of review. We talked about how file system operations typically have to modify multiple disk blocks in order to leave the file system data, and more importantly the file system data structures, in a consistent state on disk.
So how does caching, once I start using memory as a cache, exacerbate this situation? Tim, second row, yeah. And because it's in the cache, where is it not? Yeah, so remember, we thought about caching as extending these windows of time where potentially there are dirty blocks in the cache that aren't on disk. And so if those blocks are important to consistency, and the system fails during that period of time, then those blocks may never make it to disk, and the disk may be left in an inconsistent state. So first of all, what type of operations are really our enemy here? This will be something interesting when we come back on Monday and talk about log-structured file systems. What kind of operations are we worried about? Writes, right, because reads don't change things on disk by definition. Writes are modifications, so for reads, we can just throw a big cache in front of the disk and say we're done, right? But the writes have to make their way to disk if I actually want that data to persist. So when we talked about caching policies, we focused on what to do about writes, right? And we had two extremes in this inherent trade-off between safety and performance. What was one extreme, Paul? What's that? Yeah, write everything immediately, right? So we had this idea of just not buffering writes. And again, this is important: we still modify the buffer, because we want reads to be served from the buffer, but any time a buffered object is written to, we flush that operation immediately to disk, right? So this means that the amount of time during which there's a difference between the cache and what's on disk is minimized. What was the other approach, Peng? Yeah, so alternatively I can buffer all write operations in the cache for as long as possible, right?
So this is, at most, the length of time the block is in the cache, right? If I eventually evict a dirty block to make room for something else that I want to cache, I have to write it at that point. Or it could be when the system is shut down or the file system is unmounted or something, right? And we talked about the performance and safety trade-offs here. In the write-through approach, there's a one-to-one mapping between writes and disk operations. Here, we allow that ratio to go up, potentially as large as we can: we're going to absorb as many writes as possible into the cache, reducing disk traffic, but at the same time causing ourselves more problems on failures. And then we had a couple of different approaches to trying to balance these two goals. What was one of them? Mukta. Yeah, so we talked about how I could periodically write dirty buffers out to disk, maybe trying to do that when the system is idle, right? What that means is that there's a fixed period of time during which I could have some data loss or an inconsistency on disk. What was another approach, one that took a different tack? Yeah, so I might not cache writes to file system metadata: to the superblock, to inodes, to any of these important on-disk data structures that the file system needs to function. I might flush those immediately, right? And we talked about how there are these sync operations that can be run on file systems, or actually on individual files, that will force the file system to bring the disk into a consistent state that matches memory. So if you're really paranoid, or at certain times for various applications, this is important, right?
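The two write policies above can be sketched in a few lines of Python. This is a toy model for illustration only, not code from any real kernel; the `ToyCache` class, the dict-as-disk, and the counters are all invented here.

```python
# Toy buffer cache contrasting the two extremes from the lecture:
# write-through (flush on every write) vs. write-back (flush on evict/sync).

class ToyCache:
    def __init__(self, disk, write_through):
        self.disk = disk            # dict: block number -> data (stands in for the disk)
        self.cache = {}             # block number -> data
        self.dirty = set()          # blocks modified since last flush
        self.write_through = write_through
        self.disk_writes = 0        # count of actual disk operations

    def write(self, blockno, data):
        self.cache[blockno] = data
        if self.write_through:
            # Safety extreme: every write goes straight to disk.
            self.disk[blockno] = data
            self.disk_writes += 1
        else:
            # Performance extreme: just mark dirty, absorb repeated writes.
            self.dirty.add(blockno)

    def flush(self):
        # Called on eviction, sync, or unmount.
        for blockno in self.dirty:
            self.disk[blockno] = self.cache[blockno]
            self.disk_writes += 1
        self.dirty.clear()
```

With write-back, ten writes to the same block cost one disk operation at flush time; with write-through they cost ten. That one-to-ten ratio is exactly the safety-versus-performance trade-off being described, and the write-back window between `write` and `flush` is where a crash loses data.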
So if I'm writing an application that needs to be certain, at some point, that all the changes it made to a file are on disk, it can call sync, and that will make sure that the operating system flushes any dirty buffers associated with that file, right? All right, any questions on caching and consistency before we go back to talking about journaling? All right, so what motivated journaling was this observation: what's not atomic is modifying multiple disk blocks, right? I potentially have multiple on-disk data structures I need to change, the heads are bouncing around, I'm trying to get to all of them. And even if I don't buffer things in the cache, I can still have a problem here, right? Because when I'm in the middle of, say, appending something to a file, and I'm doing all these different things that have to take place for that to occur, the system might crash, and so I still might have this issue. On the other hand, writing one disk blocks is atomic, right, despite my problem with plurals in that sentence. Writing one disk block is potentially atomic, or it's as atomic as I can get on the disk. And so what journaling does is create a special area on disk, and you can think of this as just another on-disk data structure that the file system is maintaining, called a journal. The journal keeps track of all the changes that are being made to the disk. When I do something like add blocks to a file or allocate a new file, I write down in the journal the specifics of what I'm doing. And as I go along and those changes actually make their way to disk, when I've gotten to a point where the disk is in a consistent state, I can cross off all the things in my journal, and we call that a checkpoint, right? So here's an example. Let's say I'm creating a file, right?
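As a concrete illustration of the sync operations mentioned above, here is a minimal Python sketch (the `durable_write` helper is invented for this example) that forces an application's write to stable storage with `os.fsync` before continuing:

```python
import os
import tempfile

def durable_write(data: bytes) -> str:
    """Write data to a new temp file and force it to stable storage."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, data)
        # fsync returns only after the OS has flushed the dirty
        # buffers for this file down to the storage device.
        os.fsync(fd)
    finally:
        os.close(fd)
    return path
```

Until `fsync` returns, the bytes may exist only in the buffer cache; after it returns, a crash should not lose them (modulo the disk's own write cache, which comes up later in the lecture).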
What are some of the things I need to do in order to create a file in an existing directory? Greg, what's one? An inode. An inode, right, every file has an inode, okay? So I need to allocate an inode. What else do I need to do, Robert? Yeah, I'm going to put this file in some directory, right? So I need to change that directory's contents to map the relative path name to the inode. Okay, what else do I need to do? Allocate some disk blocks, right? So what does that involve? There's a couple of things in here. Yeah, first of all, I need to allocate some disk blocks and mark them as allocated, right? So the on-disk data structure that I'm using to track which blocks are available needs to be modified. That's three things. And then I actually need to associate those blocks with the file that I'm creating, right? So that's four things, and I've probably forgotten at least one, right? So here's what Greg pointed out: I'm going to allocate my new inode. And keep in mind that these journal entries are specific, right? I'm not writing just "allocate an inode"; I need to tell the journal which inode was allocated, right? So this means that I've done some operations on the disk. I've read the on-disk data structures to figure out where there's a free inode, and maybe I've actually marked an inode as used, maybe on the disk or maybe in memory, right? Then I need to find some data blocks, and I'm going to mark them as in use, which is another operation, and then I'm going to link them to this inode using whatever data structure the file system uses to do this, right? We talked about a few before, right?
And then I'm also going to modify the contents of a directory, in this case, we're saying the directory has inode number 33, to include this new file, okay? And that's it, right? So these operations represent a series of steps that I took to accomplish one thing, right? I was allocating this new file. One more question? When do these actually get added to the journal? That's a good question. I actually don't know for sure, right? But my guess is that the file system adds these as it goes along, because the journaling for each one of these things is probably done separately by the component that does it, right? So when I allocate an inode, the inode allocation routine records that this inode was allocated. When I find some data blocks, the routine that finds data blocks records those data blocks, right? But that's a good question. Actually, hold on, let me pause here. Any other questions about this? All right, so the idea is that what the journal records is changes that I'm making to the file system in order to accomplish something specific: a set of changes that are required to perform one operation, right? Remember, that was one of our challenges here, that all of those steps require changing at least one disk block, potentially multiple disk blocks. So let's say that at some point in the future, and this would happen, right, all of the changes that I've made are actually on disk. Do you have a question? Or are you just flexing your hand? Okay. So at some later point, all of these changes are on disk. What do I do in the journal at that point? Sir.
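The file-creation example above might be recorded in the journal something like this. The record format is entirely made up for illustration (real journals log block images or compact change descriptions), but it shows the key properties: each record names the specific change, and the whole operation is bracketed by begin/end markers so recovery can tell where an entry starts and stops.

```python
# A toy journal for the lecture's example: create a file with inode 567,
# data blocks 800 and 801, in the directory whose inode is 33.
# (Block numbers 800/801 and the file name are invented for the sketch.)

journal = []

def log(record):
    journal.append(record)

log(("begin", "create"))
log(("inode_alloc", 567))                     # mark inode 567 used in the inode bitmap
log(("block_alloc", [800, 801]))              # mark data blocks used in the block bitmap
log(("inode_link_blocks", 567, [800, 801]))   # point inode 567 at its data blocks
log(("dir_add", 33, "notes.txt", 567))        # add the name->inode mapping to dir 33
log(("end", "create"))
```

Everything between `begin` and `end` is one operation that must look atomic; a checkpoint is only ever placed between complete entries, never inside one.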
Yeah, so what a checkpoint indicates is that the disk reflects all of the changes that were made before that point in the journal, okay? And I'm not going to checkpoint in the middle of one of these operations, right? I'm only going to checkpoint in between operations that need to look atomic for the disk data structures to be consistent. So what would happen here is, at some point later, let's say the inode free map has been updated, the contents of this inode are on disk, the data block bitmap has been updated, and all those data blocks are linked correctly. All these things have happened, and the disk reflects these changes. What I can do is roll the journal forward and put a checkpoint in there, and I say: at this point in time, what I'm guaranteeing is that the on-disk information reflects the state of the file system up to this point, right? So what about stuff below the checkpoint? I've gone on and continued to do things, so I've got a bunch of other journal entries. What sort of assumptions can I make about things below the checkpoint? Right, they might not be on disk. I can't make any assumptions. It's possible that some of them are on disk, right? So it's possible, for example, that as I was going along before I made this checkpoint, I might have actually done the on-disk operations necessary to push some of these changes to disk, but maybe I've only done two and I haven't done one and three, right? And so I don't have a checkpoint. Maybe I've done one: maybe I've marked this inode as being allocated, so that change is on disk. But in order to have a checkpoint, all of the changes before the checkpoint have to be on disk, right?
Not just a subset of them. And after the checkpoint, again, I can have some mixture, right? Some of the operations might be on disk; some might still be in the cache, or things that haven't been committed yet. All right, so this is kind of where we left off on Wednesday. Now what do you guys think happens when I recover the file system? The file system has crashed, I've shut down uncleanly somehow, and I boot up again, and the file system mounts the disk. What does it do? How do I use the journal? Okay, so I go to the most recent checkpoint, right? Because that's what I know is consistent. So I'm going to find the most recent checkpoint in my journal, okay? But then, what's potentially going to be true about the journal? At some rate, and we'll talk about this in a second, I'm flushing changes to disk, but what's potentially still in the journal? Yeah, okay, well, let's say that the journal is not corrupted, right? Let's say I trust the journal; we'll talk about why I might do that in a second. But what's in the journal, potentially? I have a checkpoint, but then... okay, what's the best possible case, Alyssa? Nothing after your checkpoint? Yeah, essentially the best possible case is that there are no operations at all after the checkpoint, right? Which means that, for whatever reason, I didn't do anything to the file system between the last time I checkpointed and the time that the system failed. In which case, I don't have anything to do, right? I know the state of the file system is consistent; all the changes that were in process have been made. Okay, but let's say that we don't get that lucky. So if we don't get that lucky, what does that mean?
Yeah, if the journal after the last checkpoint is not empty, then what does it have in it? Yeah, it has operations that I was doing that weren't checkpointed before the system failed, right? And as Greg pointed out, those operations are what, Dan? Well, I mean, they could be anything, right? It could be a bunch of different changes I was making, but what can I assume or not assume about them? Jen, can we do better than that answer? Yeah, so some of them might have happened and some of them might not, right? And I don't really know which ones are which. All I know is that they didn't all happen, because if they had all happened, there would be a checkpoint there, right? So what do you think I start to do at this point? Yeah, so I start checking the operations after my last checkpoint and verifying whether they've taken place on disk. If they haven't, then I'm going to commit them to disk and keep going, right? My goal is essentially to clean the journal: I want to take anything that was in the journal and commit those changes to disk. Because again, the problem is that right now the file system is in an inconsistent state, right? Some things that come after the last checkpoint are on disk and some aren't. If I just kept going, I would have these problems that we talked about before, where I'd have an inode that had been allocated but isn't in any directory, or things like this, okay? So let's see what happens here. Here's my last checkpoint; this is where I start. Let's say I wrote down in the journal that I was going to allocate inode 567. How does the file system determine whether that already happened? How? Yeah, so I go to the disk and I pull up the inode bitmap and I say, is inode 567 marked as allocated? And if it is, I say, great, I'm done, you know?
So this operation is on disk already, so I say, okay, I don't have to do that, right? Now what do I do? Jen. Third row, Jen. What's that? Use the same checking process on number two? I like that answer, that's a clever answer. Yeah, it's the same checking process that I used with number one, just the specific check is different, right? So I would look in inode 567 and see if these data blocks are linked to that inode. In this case, let's say that they aren't. So now I've found something that needs to be committed to disk: for this operation, I actually have to make a modification to the inode, right? And then I can keep going. I can say, okay, does the directory with inode 33 contain inode 567 with this particular path name? If the answer is no, then I'm going to make that change as well. And at some point I'm going to get to the bottom of the journal, and at that point, what I've done is bring the disk into a state that's consistent with the journal, right? Okay, great. Well, the system will probably almost always fail between checkpoints, right? So I'm always going to start from the last checkpoint and move forward, okay? But I think what you were asking is actually something different. I think I have a slide on this. Okay, so what about incomplete journal entries? So this is on this question. Right as the file system failed, I was writing this down in my journal. I was making a change to the journal, but I only had two of these entries written, right? Remember, each journal entry has to be delineated with a start and end tag, so I know which operations constitute an entry. But what happens if the file system fails after entry two? Greg? Yeah, so essentially this goes back to my embarrassing Tom Clancy quote that I removed from the review slides today, which is: if you didn't write it down, then it didn't happen, right? So this journal entry is not complete.
And so what it means is that this operation just did not happen, right? So again, if I fail at certain points, there is going to be some state that's lost, and this would be an example. Essentially, I was in the middle of creating this new file, and that new file is just not going to exist in the file system after recovery. You know, tough. So I'm going to ignore these incomplete entries, because if I processed this one, I would clearly have a problem: I would have a file that had data blocks associated with it but was actually not in any directory, right? Which is one of the problems we talked about when we had failures that occurred in the middle of this process. But it seems like I've just played this long shell game with you guys, right? Because I claimed before that if I fail at certain points, there would potentially be some problems with consistency or data loss. And now I've introduced this journal, and we spent, we wasted, like a whole class talking about it, because I thought it was interesting or something. And now we have the same problem, right? Where if I fail at certain times, there are operations that might not be finished. What's better about this approach? Spencer. Okay, so that's very important, right? First of all, one of the nice things about journals is that journals highlight the changes to the file system that are important for consistency. They allow me to focus on exactly what needs to happen in order to bring the file system into a consistent state, right? Before journaling and other approaches to consistency, file systems had these incredibly laborious checks that they would do. How many people have ever run FSCK on an old file system, right? Sometimes FSCK would run for hours, right?
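Putting the recovery discussion together, here is a hedged sketch of what replay might look like, using the same invented record format as the earlier example: find the last checkpoint, drop any trailing entry that never got its end marker (the "if you didn't write it down, it didn't happen" rule), and redo everything else idempotently, applying a change only if the disk doesn't already reflect it.

```python
# Toy recovery for an invented journal format. Records are tuples whose
# first element is the kind: "checkpoint", "begin", "end", or an operation.

def recover(journal, disk):
    # 1. Start just after the most recent checkpoint.
    checkpoints = [i for i, rec in enumerate(journal) if rec[0] == "checkpoint"]
    pos = checkpoints[-1] + 1 if checkpoints else 0

    # 2. Group the tail into complete begin..end entries. A trailing entry
    #    with no "end" marker was never fully written: it "didn't happen".
    entry = []
    for rec in journal[pos:]:
        entry.append(rec)
        if rec[0] == "end":
            for op in entry:
                redo(op, disk)
            entry = []
    # Anything left in `entry` is an incomplete entry and is dropped.

def redo(op, disk):
    # Idempotent redo: check the on-disk state first, write only if needed.
    if op[0] == "inode_alloc" and op[1] not in disk["inode_bitmap"]:
        disk["inode_bitmap"].add(op[1])
    elif op[0] == "dir_add":
        _, dir_ino, name, ino = op
        if disk["dirs"].setdefault(dir_ino, {}).get(name) != ino:
            disk["dirs"][dir_ino][name] = ino
    # "begin"/"end" markers and unhandled record kinds fall through.
```

The `disk` dict here is a stand-in for the real on-disk bitmaps and directories; the point is only the control flow: checkpoint, completeness check, idempotent redo.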
And that's because it's actually going through the entire file system, potentially checking everything for consistency, right? You can imagine writing a tool that would walk the entire directory tree, take every directory, figure out what the inode is and all the files associated with it, check that the file sizes are correct, and try to match everything up, but that can take a huge amount of time, right? And one of the problems with those tools is that they're checking all sorts of things that usually aren't important to check after failures, right? So let me ask a different question. In one case, we're talking about failures that cause the file system to be in an inconsistent state because the system wasn't done making some changes to the disk. In what cases might you have to run FSCK, or I shouldn't say FSCK, because with journaling file systems FSCK uses the journal, but in what cases might I have to run a more extensive file-system-wide recovery tool? What kind of errors might not work so well with journals? Sam, can you think of something? Well, think outside of the checkpoint world, yeah. Yeah, if I have some sort of disk corruption, right? Let's say I have a sector that goes bad, and suddenly it's just full of garbage, or I can't read or write from it. In that case, I might need to touch every part of the file system to try to figure out what's happened, right? There might be directories that are just gone, files that have data blocks that aren't accessible anymore that I need to remove from the file system. So in that case, I might have to do a really big, ugly cleaning. But in most cases, when we have problems that are caused by failures and the inability to make all the changes permanent on disk, journals work really well, right? So let's talk about one other thing.
So what we've been talking about up until now is this idea of storing metadata updates in the journal, right? If you go back here, all of these are metadata updates: changes to inodes, allocation of disk blocks, changes to directories, which you can think of as metadata updates even if directories are implemented as files, right? So what about data blocks themselves? Let's say I'm performing a write operation and I want to record it in the journal. Or better yet, let's say I'm doing an append. The append has metadata updates associated with it: I need to get some new data blocks, I need to associate them with the inode, I need to change the inode size, update the modified time, et cetera. But what about the data blocks themselves? Am I going to journal them or not? What happens if I put them in the journal? Can I put them in the journal? Dan. Yeah, it's a big journal. Well, why, right? How many times is every data block being written? Anyone? Okay, so I'm asking this specific question: if I put data blocks in the journal, how many times am I writing every data block to disk? Right, yeah, I'll take "double" from that answer. I'm writing every data block twice, because I'm going to write it into the journal and then I'm going to write it into the place on disk where it's supposed to go, okay? So if I include them in the journal, that means I have to write them twice. The other nice thing about keeping them out of the journal is that the journal stays smaller, which makes it more likely that single entries in the journal are written atomically. We talked about what happens if I was in the middle of writing a journal entry and the system crashed, so I only had a couple of the operations.
If my journal entries are small and they fit within a disk block, then that's very unlikely to happen, right? Because I'm going to make a change to the journal, and I'm going to flush it all at once. And so unless somehow the disk is interrupted as it's in the middle of actually writing bytes onto the platter, chances are that that entry is either going to be in the journal or not, and I won't get half of it, right? But if I start including data blocks, then I have a really big journal, and it's possible that writes of journal entries are no longer atomic. So the other thing I can do is just exclude data blocks from the journal entirely, right? And the implication of this is that on failures, I'm able to preserve my file system data structures and maintain the consistency of the file system, but data might be lost. So if you're in the middle of doing some sort of big write, some of the changes that you've made are not going to be on disk. But I will have noted the fact that the file has gotten bigger, right? So let's say I do some massive append, and I fail in the middle of it. When the file system boots up, that file will be as big as it should be, but it just won't have some of the contents that you were in the process of writing. And I think that this second option is more common, if not canonical, because it means the journals are quite small, and it allows us to do these atomic updates. And again, journaling is already creating a little bit of extra overhead, because potentially every time I do something, I'm making one extra IO to update my journal. But if I start journaling data blocks, then I'm really adding a lot of extra IO, right? So I'm guessing that's probably why. All right, any other questions about journaling before we talk a little bit about caches and then go on? All right, so actually, I'm going to skip this. All right, any other questions?
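The double-write cost being described can be made concrete with some toy arithmetic (the block counts are invented for the example): with full data journaling, every block, metadata and data alike, is written twice, once to the journal and once in place, while metadata-only journaling writes data blocks just once.

```python
# Toy write-amplification count for the two journaling choices discussed.

def disk_writes(meta_blocks, data_blocks, journal_data):
    # Journal gets metadata always, plus data only under full data journaling.
    journal_writes = meta_blocks + (data_blocks if journal_data else 0)
    # Every block is eventually written to its real location on disk.
    in_place_writes = meta_blocks + data_blocks
    return journal_writes + in_place_writes

# Appending 100 data blocks while touching, say, 3 metadata blocks:
full_data = disk_writes(3, 100, journal_data=True)
meta_only = disk_writes(3, 100, journal_data=False)
```

Here full data journaling costs 206 block writes versus 106 for metadata-only: roughly double, which is why the metadata-only option is the common choice.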
Well, actually, hold on, I will say one thing. So we talked about disk caches. And all this slide is saying is that it's possible that stuff can also get lodged in the disk's own cache, right? I have told the disk to perform an operation, I've sent the data across the bus, but when the power is cut, the data is sitting in some sort of disk buffer waiting to actually be written to the platter, right? One thing I want to point out that's very slick, and some of you may have heard of this: imagine you're a disk drive, you're a spinning disk, you've got your platter going at 15,000 RPM, and suddenly the power goes out, right? Some disks actually do this very clever thing where they realize that they don't have any power, the power rail is cut, but they do have this platter that's spinning very, very rapidly. And what they do is they actually harness the rotational energy of the disk to perform some final operations that allow them to flush out some buffers, or retract the heads, or do some other things that a disk might want to do before it shuts down, right? So as the platters are slowing down, the drive is actually using that rotational energy to power a generator that lets it perform a few more operations right before it shuts down. I just thought I'd point this out; that's pretty slick, right? All right, any other questions about journaling before we go on? All right, so we don't have a huge amount of time, so we'll go through FFS pretty fast. The Berkeley Fast File System is old and crufty at this point, right? 1982. How many people here were born in 1982? Yeah, that's what I thought. Not even you guys? Oh wow, okay. I'm old. So this is Kirk McKusick. This is the guy who developed the Berkeley Fast File System. He's still working on FFS, you know, 30 years later, yeah.
This has morphed into what's called the Unix File System, UFS, and it's still in use: there are still vendors that ship systems with UFS, and Kirk McKusick is still hacking on FFS 30 years later. So that's kind of a fun project. And FFS was, in its day, very revolutionary as far as file systems go. Some of the things that it introduced have been quite lasting, and some of them, I think, were a little more ephemeral and are probably going away rapidly now with changes in disk design, particularly with flash, right? The less lasting features had to do with this really aggressive tailoring of file system performance based on an understanding of per-disk geometry. So FFS, in order to get good performance... this is 1982, right? Disks were even slower than they are now, probably much slower. And so what these guys did is they said, okay, we're going to use all of our knowledge about disks and how disks work to design a file system that tries to be as efficient as possible, by really making a detailed study of the hardware and designing a file system that fits it, while at the same time trying to be general purpose, right? So this isn't a file system that can only work with one particular type of disk, but it does collect a lot of information about specific disks and tries to use that to improve performance, right? So let's think about spinning disks. We've talked about this a little bit in the past, but just as review: what are some aspects of disk geometry that file systems might want to be conscious of, or exploit, or understand? Tim. Yeah, locality, right? All of these questions end up being about where, right? Where to put things. But locality is what Tim's identified. Why do I care about locality? Okay, yeah, that's part of it. But why else, Sam?
Yeah, remember, I have this slow process of seeking back and forth on the disk, okay? So when we talk about locality, what we're talking about is literally: where are the bits that correspond to this file on the platter, right? They are located somewhere on the platter. If you had some sort of very cool gadget, and I don't know what you would call it or whether it exists, you could put a little flag exactly where your file is on the disk, right? That would be kind of a fun thing to do. So FFS thought a lot about where to put inodes, where to put data blocks with respect to inodes, and even about trying to group related files together, right? We talked about this a little bit. And how do I actually determine which files are likely to be related? We're going to come back to that in a second. Some of the more lasting features of FFS had to do with addressing limitations of previous file system implementations, right? There are a lot of common features of file systems that you guys are used to, and have absorbed as just being normal, that weren't normal before FFS. FFS introduced 4K blocks, which, as we noted, are still essentially in use by file systems like EXT4. Early file systems had no way to allocate contiguous blocks on disk, which you might think is kind of important for locality, right? If I want to take a bunch of blocks on disk, associate them with the same file, and have those blocks be close to each other, I'd better have a way to actually allocate contiguous blocks. So FFS introduced a way to do this. And then all sorts of other things: symbolic links, introduced by FFS; file locking, the idea of being able to provide exclusive access to a file to a particular process.
Unlimited lengths for file names, right? Which, again, I mean, it's bizarre to me that it's 2013 and you guys still have these incredibly bizarro email addresses that are based on eight-character UNIX username restrictions, right? But at least you can name your files with your full name, right? Even if it has like 68 characters in it. And then per-user quotas on file systems. So there's this whole laundry list of features that FFS added. And again, this goes back to our discussions about disks, right? We talked about seek time as the main component of disk latency, right? Once I get the heads to the right place, the rotational delay to get the right block under the heads is pretty small, right? Especially as disks have gotten faster and started to spin faster, right? But moving the heads back and forth is really the thing that takes a long time, right? So FFS introduced this idea of cylinder groups, right? And we talked about this before. A cylinder group is, essentially, all the data blocks that are close to each other on disk, right? It's a group of tracks on the disk, on every platter, stacked top to bottom, right? That's all the data the disk can read without having to move the heads very far, okay? And some of this is gonna start to look very similar to EXT4, right? So each FFS cylinder group had its own copy of the superblock, right? Most of these were redundant, but we talked about the fact that the superblock has this really important information, and that's why I want some redundancy, right? So in FFS, every cylinder group has a superblock. Every cylinder group has a cylinder-group-specific header with some of the information that you would expect to find in the superblock. Every cylinder group had inodes and data blocks. And so on some level, each cylinder group is almost acting like its own mini file system, right? And this is the time to cue up the Mini-Me slide, right?
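Since each cylinder group acts like a mini file system with its own inodes and data blocks, the layout question becomes which group a new inode should go into. Here's a toy Python sketch of the kind of policy FFS-style systems use: spread directories across groups, and keep regular files near their parent directory. The function and the policy details are a simplified assumption for illustration, not the real FFS heuristics:

```python
def pick_group(free_inodes, parent_group, is_directory):
    """Choose a cylinder group for a new inode.

    free_inodes: free-inode count per cylinder group.
    parent_group: the group holding the parent directory's inode.
    """
    if is_directory:
        # Spread directories out: pick the group with the most free
        # inodes, so the "mini file systems" fill evenly.
        return max(range(len(free_inodes)), key=lambda g: free_inodes[g])
    # Keep regular files in the same group as their parent directory,
    # so a directory's files stay close together on the platter.
    return parent_group
```

So creating `/home/alice/notes.txt` lands the file's inode in `alice`'s group, while creating a new directory may land it somewhere emptier.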
So you can think about it this way, right? Instead of one big file system with stuff scattered all over the place, one way to improve locality is just to make the file system smaller, right? Essentially shrink the disk, right? Make the file system look like it's eight different file systems on eight small disks, with very, very small seek times within each disk, right? When I have to go to other disks, I still have some problems, right? But if I'm good about where to put things and I try to keep related files close together, then it might actually make the system look like I have a smaller disk, but a much faster one, right? Does that make sense? So, EXT4 redux, right? Remember those EXT4 block groups, right? These are a legacy of FFS, this type of layout. Just the idea that, hey, I don't wanna put all my inodes in one place on disk, I want them scattered across the disk, right? This is very, very much based on FFS, right? So, the FFS superblock also contained really detailed per-disk geometry that allowed FFS to try to do really, really good block placement, right? And this gets to the point where it's almost obscene, in terms of the lengths that they went to to try to understand this, right? So here's an example. I don't know if this is still true anymore, it may or may not be, right? But at least at the time, you had disks where the disk could actually read data off the platter faster than it could send that data across the bus to the rest of the machine, okay? So imagine I'm sitting there and I have a cache on the disk, but let's say my cache isn't big enough, right? And so essentially at some point the cache will fill up and I'll be stuck, right? So imagine I'm spinning around, I've got my heads in one place, and I'm picking up blocks one, two, three, four, five.
I'm throwing them across the bus, but the problem is the bus is saturated, and so pretty soon they start either getting stuck in the cache or I just start dropping data, right? Does that make sense? I can't actually read contiguous blocks on disk fast enough to get them back across the bus, right? So what do you think FFS did to address this problem? First of all, what does this mean? I can't read contiguous disk blocks on the same track and stream them back across the bus, because the bus is too slow. So what can I not do? What did I want to be able to do that I can't necessarily do, right? Or what sounded like a good idea that might not be a good idea, Greg? Well, yeah, so this would create hiccups, right? Where the disk would have to kind of flush the cache and then restart reading, right? But what did I wanna do before, right? Let's say I have a file and I'm trying to read from it in an optimized way, right? What was my goal originally? Robert, where do I want those blocks on disk? Right, and so in theory I wanted those blocks contiguous on disk, right? Next to each other, on the same track. So before I noticed this restriction, it sounded like a fantastic idea to just have all the blocks for my file right next to each other on the same track. As the disk is spinning, I can just grab them and throw them across the bus, and I get really, really great transfer time. So what's the problem here, Ahmed? Yeah, so what Ahmed's pointing out is that the problem is now the bus is in my way, right? If I have all those blocks contiguous on disk, that might be great. But as I start to try to read the file, the bus is gonna saturate. So let's say I get blocks zero, one, two, and then the bus is saturated, so what do I have to do? I'm on block three, but I can't read block three, I have nowhere to put it. So what am I gonna have to do?
I have to go all the way around the disk, right? It doesn't take forever, right? But I'm gonna have to essentially wait for block three to come under the heads again, and then I'm gonna read three, four, five, and then I'm gonna saturate again, so I could essentially just do this over and over again, and that's really inefficient, right? So again, it sounded like a great idea. So what Ahmed has said is that what FFS did is it incorporates this delay, and rather than laying out files contiguously, it will put the file into blocks zero, three, six, nine. It essentially starts putting gaps on disk that allow the bus speed to match the read speed, right? So there's two ways to react to this, right? One is to be like, that's really cool. The other is to be like, that's gross, right? It's just such detailed, per-disk knowledge that's needed to start doing layout at this level. But again, this was one of the optimizations that these guys were able to do, right? Just because of how much they knew about the disk, right? And so, as I pointed out before, it's possible that some of this stuff just doesn't matter anymore, right? And if it doesn't matter, it might be a good thing, or it might not be, right? So one interesting thing that's happened with disks is this: this sort of layout technique, what does it rely on about the disk? What assumption is the operating system making about disks and how disks operate that allows it to do this, bluntly? Well, not necessarily that all disks are the same, that would be different. But what assumption am I making about block naming? Yeah, no, no, go ahead. Okay, yeah, so I'm not planning for degradation here, which is an interesting point, but what else am I assuming? The operating system is telling the disk, hey, store things in blocks zero, one, two, okay? What assumption is it making about blocks zero, one, two?
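As a sketch of the interleaving idea, here's a tiny Python function that lays a file's logical blocks into every third physical slot around a circular track, giving the bus time to drain between reads. The interleave factor and track size here are made-up numbers, not real FFS parameters:

```python
def interleave_layout(num_blocks, factor, track_size):
    """Map each logical block i of a file to a physical slot on a
    circular track.

    With factor=3, blocks land in slots 0, 3, 6, 9, ... so the slow bus
    gets two slot-times to drain each block before the next one rotates
    under the head.
    """
    return [(i * factor) % track_size for i in range(num_blocks)]

# A 4-block file on a 12-slot track with an interleave factor of 3:
# interleave_layout(4, 3, 12) -> [0, 3, 6, 9]
```

With `factor=1` this degenerates to the contiguous layout that saturates the bus; a real implementation would also have to avoid two logical blocks colliding in the same slot once the mapping wraps around the track.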
Yeah, that they're close to each other on disk, right? So the idea here is we've been making this assumption all along and, you know, no one has had a reason to question it, right? But the operating system has to know something about where the disk puts things, right? Let's say that blocks zero, one, two, and three are scattered all over the disk in random locations, right? Then I'm not doing something clever here, I'm doing something stupid, right? Because I don't actually know anything about the disk geometry. So it turns out that modern disks actually start to violate some of those assumptions, right? The first thing that modern disks will do: when disks come from the factory, they usually have a bunch of bad sectors on them, right? Areas of the disk that don't work for whatever reason. I think that probably has to do with just the quality of the magnetic medium, right? So there's parts of the disk that it can't write to, okay? To simplify the operating system's life, what the disk will do is actually remap certain blocks to other parts of the disk, right? So, for example, I might have blocks zero, one, two, three that are all right next to each other. But maybe block four doesn't work. So what the disk has done, without telling the operating system about it, is that block four is now an alias for block 10,082, right? That's actually what happens, right? And the nice thing is the blocks still look consecutive to the operating system, so the operating system doesn't have to know about this. But the point is that if I think that zero, one, two, three, four is a good place to put contiguous data, it turns out that's not so true, right? Because I'm gonna get zero, one, two, three, and then, whoop, I've gotta seek all the way over to some other part of the disk to grab block 10,082, but the operating system doesn't know that, right?
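The drive-side remapping can be pictured as a tiny translation table that the firmware consults on every access, invisibly to the OS. This Python sketch uses the lecture's example of block four aliasing to block 10,082; the table representation is an assumption for illustration, since real drives keep this mapping internally in firmware:

```python
# Hypothetical firmware remap table: logical block -> spare physical block.
REMAP = {4: 10082}

def physical_block(logical):
    """Translate an OS-visible block number to where the bits really live."""
    return REMAP.get(logical, logical)

# The OS thinks blocks 0..4 are consecutive, but reading them actually
# touches 0, 1, 2, 3, and then seeks to 10082:
# [physical_block(b) for b in range(5)] -> [0, 1, 2, 3, 10082]
```

This is exactly why the OS's "blocks with adjacent numbers are physically adjacent" assumption quietly breaks: the numbering the OS sees is a fiction the drive maintains.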
This actually gets even worse when you start to talk about things like RAID. RAID arrays break all sorts of assumptions about where stuff is, right? And actually a friend of mine did a nice project where he showed that on RAID arrays, a lot of these assumptions that file systems make about locality can actually turn out to be counterproductive: the file system operates more slowly because it's trying to do clever layout, right? The other type of device that does a lot of remapping of block IDs is flash drives, right? On a flash drive, locality may not matter anyway, because I don't have these moving parts, right? But I just wanted to point out that the history of operating systems is kind of the history of this, I don't know, maybe it's a happy relationship, maybe it's an unhappy, troubled relationship between hardware and software, right? It's a constant renegotiation about who should do what, right? You guys are here, so you're in the forces of light, right? That's the software people, okay? And hardware people, you just have to keep in mind, right, hardware people have some advantages over you, right? They build stuff, and that stuff's fast, right? But it only does one thing, right? Kind of, right? Now we have more programmable hardware, so maybe that blurs the line, but then we'd be the ones programming it, right? They can build the programmable hardware and just give it to us to use. And in general, and this is a very, very rough dichotomy, software is more flexible but slower, right? Hardware's fast, right? Hardware will kick your butt when it comes to implementing algorithms and getting things done, right? But the problem is, it's done, right? It's fixed. And so when we look at disks, there's this long history of trying to figure out where things should be done, right?
So with the OS in control: the operating system, as you might expect, has a lot of visibility into how things work, right? It can see relationships between files, it understands consistency requirements, and this information may allow it to improve performance, right? The cons, especially as seen from a hardware perspective, are that operating systems are slow and buggy, right? I mean, hardware people just fundamentally don't think that software people ever test their code, right? And actually, to some degree, we never test our code to the degree that hardware people do, right? Because with hardware, you ship a product, right? If that disk has a bug in it and doesn't work, you don't get to say, oh, I'm gonna push an update tomorrow, you know? No, it's gone. That thing's in the hands of people on Newegg and they are bashing you, right? You're down to one egg or something. So that's not good. And again, when we talk about block layout, this is kind of what we're talking about, right? The OS determines where blocks go, and then there's the disk's side of it. With the disk in control, hardware has the advantage of knowing more about itself, and the buffers and caches that hardware can provide are closer to the disk, which may improve performance. On the other hand, if hardware starts taking responsibility for things without understanding the requirements that the operating system understands, it might violate those requirements, right? So this battle, if you wanna call it that, this negotiation, this friendly argument, it's still going on. I won't go through these slides, I've been talking about them, right? This is kind of the stereotypical view of the hardware and software communities as they're trying to work together, right? Let's see, I'm not gonna go over too much more, but FFS is still going on, and I just wanted to show you this.
So this is Kirk McKusick's house. I don't know him personally; he'd be someone whose acquaintance I would appreciate making, because he has a huge wine cellar, and he also has, and I think he had this way before it was cool, it's still not cool, but he has this incredibly well-instrumented house. This is actually live information from his house; let me see if I can pull it up again, we'll see if anything's changed. This is live data from his house, see? Yeah, see, the hot tub just went up by 0.3 degrees, right? So he has all these sensors throughout his house. He has the wine temperature sensor, right, which of course is very important to monitor. This is pretty cool, right? He lives in Berkeley, and apparently there's a wine of the day. I think he also has a database showing how many bottles of everything are available, right? I don't know how you keep that database up to date, right? Especially after you start drinking some of the things that are in it, right? So maybe that's a consistency problem that he's working on. Okay, so on Monday we'll talk about log-structured file systems, and I will see you then.