Ladies and gentlemen, ladies and gentlemen, ladies and gentlemen. All right, Malik, good morning, join us. All right, so today we're going to start the first of two case studies where we're going to look at a single file system design. And as I was preparing today's slides, I realized that actually in many ways we've been talking about this file system, the Berkeley FFS file system, for the last two weeks, because so many features that were introduced by FFS are now really kind of, you know, features that are standard in many other file systems, right? And so I tried to give this lecture a little bit of a historical tone to it, because this is one of those lectures where I don't feel like I'm really teaching sort of bleeding edge material. I mean, FFS was being designed, I have the date on there, like in the early 1980s, right? So when some of you guys were just a twinkle in your parents' eye, and I was like three years old, Kirk McKusick was designing FFS. So again, I mean, what we'll try to do is we'll try to look at this, we'll try to see how is it a response to the technology of its day, what features in FFS have survived, what features were dropped, you know, why is this relevant? And then there is one continuing evolution of disk design that we're going to discuss a little bit today, because I think it provides kind of a neat look into an ongoing and interesting tension that plays out between the hardware and software communities, right? Namely, you know, where do certain features get implemented, and how does where those features get implemented have implications for the design of software, right? So we'll talk a little bit about modern disks and what modern disks do that the disks back in 1982 didn't, right? All right, so you guys, how many people are done with assignment two and have turned in their last copy? Okay, that's good. So assignment two is pretty easy, right? Like not terribly hard, you know, just kind of a walk in the park, pretty easy stuff. 
Great, right, because assignment three is on its way. And assignment three is like, I don't know, I don't know how to put it, like assignment three is the mountaintop for this class, right? That's where we're stopping this year, because assignment three is really, in my opinion at least, the most fun assignment, it's the most difficult assignment, and it'll occupy you for the rest of April. So, how many people are taking Steve's class, the distributed systems class? Okay, so I guess the distributed systems class has a couple more assignments left, and people have been thinking, oh no, like there's a lot more to do in that class. We only have one more assignment, but that's not to say there isn't a lot to do, right? But I think you guys will find assignment three really fun. Assignment three gives you guys a lot more rope on some level, and you can use that rope however you want, and we'll try to keep you from using that rope in certain unfortunate ways, right? And what's a good way to use rope? To lasso the assignment, and reel it in, and tie it up, and make it submit to your will, right? And we'll give you a lot of rope to do that, right, as opposed to other things. Okay, and then last, like, I haven't actually gone to the forum, but have people been uploading t-shirt designs? Is there a t-shirt design area that's been established? No, well, get started, guys. I mean, you guys had so much fun. I mean, now you guys have a couple days off between assignment two and assignment three, and there were these great, you know, 421 memes that were being posted. Some of those look like they were fodder for t-shirt designs, right? So keep in mind when you're designing t-shirts, it's kind of like standard stuff. You know, big graphics don't work well on t-shirts. You know, things that require like 16 colors, probably not the greatest thing, too. So maybe start off with something a little simpler, okay? Any questions about sort of course logistics? Yeah? 
That's the question I didn't want. Right, so, yeah, so that's a great question, and the answer is as soon as we're done grading things, right? And clearly that's the answer, but, so assignment zero is almost done. Like, I have some stuff up on the development site where you can see your assignment zero grades, and I just didn't want to start moving that to the main site while people were still working on assignment two, right? But now that most of the assignment twos are in, I feel free to break things for a couple hours. So the website might go down actually tonight as I move some things across, right? But we will have something up where you can go in and you can see the marks that you've gotten. And then we need to keep working on assignment one and assignment two, right? Things are coming. I understand the concern. I guess we've passed the resign date, so people who were really nervous maybe made a decision about that. And again, I mean, this is the one part of the class that is kind of still a work in progress this year, right? I'm sorry to subject you guys to the guinea pig aspect of having to wait for the automatic grading to be done, but that's just kind of where we are, right? So I'm sorry. I know that's not a good answer, but look, I'm working as hard as I can and the TAs are working as well to get the stuff done, so just bear with me. Does anybody else want to rub it in? You gave us an extra week, sir. What's that? You gave us an extra week. I did give you an extra week, right? So why don't you guys give me like an extra six months, right? Is that a fair trade? We'll meet again in December and I'll hand out all your marks. We will have marks by the end of the class. And again, this course is supposed to be hard. You guys are finding out that it's difficult. The grading in the class is not going to be harsh. I expect to give out a lot of very good grades in this class. I know you guys are working hard. 
I've been seeing people really slaving over these assignments. That's the point, right? I mean, you guys are learning something. It's frustrating. If the assignments are easy, then you really shouldn't be taking the class, right? I mean, if you just wrote it up and there's no problem, then why are you here? I mean, you're not really learning much, right? If you're struggling with the assignments, then you're starting to learn some things and you're probably learning them the hard way, right? Which is, of course, on some level the most frustrating way to learn things, but also the way that might actually have an impact, right? So when you've spent 24 hours running down some stupid little silly C memory allocation problem, right? These things always seem silly when you're writing it. It's like, oh, whatever, I'll just allocate this thing and it'll be fine. And then when you don't do it correctly, then maybe the next time you start to allocate things, you think, maybe I should look at that a little bit more carefully and make sure I'm doing the right thing. So anyway, I know that it's frustrating, but don't be worried about the grading. The grading will work out okay. All right, any other questions about logistic stuff? Okay, so Monday we talked about file system caching and consistency, particularly consistency achieved via journaling. So does anyone have any questions on this material? Any questions on consistency? Good morning. I feel like every class I have, like there is some percentage of the class, maybe it's like 20% of you that show up every time, and then it's like a random grab bag of other people. So it's kind of fun, like some of you guys I haven't seen for a while, so it's like welcome back. I'm glad you're still in the class. All right, so any questions on caching and consistency? Okay, so let's talk. We talked about where we put the cache, right? 
We did a little bit of a design exercise where we talked about what are the implications if we put the cache above the file system layer, right? Above the file system. If we put the cache way up there, then what are the objects that we're caching? What are the things that are in the cache? Anybody? Files, files and directories, right? Files and files, really. Directories are files, so files and files, right? And then what's the interface to the cache, the file system cache, if we cache at this level? I mean we're operating on files and directories, so what's the interface that the cache has to support? Open, close, you know, read, write, right? The standard UNIX file interface, right? Okay, so this was option one. Option two is that we put the cache below. We put it down kind of right on top of the disk, implemented in software in the operating system, and maybe actually we can even write one cache that spans multiple devices, but still at the device level, right? So what does this cache, what are the objects that are in this cache? Disk blocks, right? I mean that's what we see at this level, right? That's the disk interface. And then what's the interface? Read block, write block, right? I mean these are the operations that the cache has to support, okay? So what were the, I don't have this up on the slide, so you can help me remember, what were the advantages of having the cache cache files and support the file interface? What was that going to potentially allow us to do? Or what was that going to expose us to that was going to be helpful? Anybody over here? This side of the room? Carl? Okay, well metadata is an interesting answer. Let's come over here, Alex, Keith. So it sees file operations, right? So at some level it makes more sense, right? When I'm below the disk, when I'm at the disk block level, all I see are these block numbers. And those block numbers don't necessarily mean anything. 
The disk block cache may have no way of knowing that block 42, block 89, and block 158 all correspond to the same file, and actually they're sequential, right? So maybe I want to start to prefetch some additional sequential blocks in that file, right? When I'm at the block level, it's very difficult to see that, whereas if I do it at the file level, I can see, okay, I read the first 4k of foobar and then the second 4k of foobar and the third 4k, so you can start to see things, right? So when I'm operating up here, I have some more information. I'm going to go back to the drawback to having things above the file system. The file system may not see certain kinds of operations, right? So if I'm caching reads and writes, the whole point is that I don't want to pass those things down to the file system, right? But the file system might want to see those things. Why? What might I end up doing if I hide read and write calls from the file system, particularly writes? What might happen? What might the file system not be able to do? A process made a bunch of writes to a file, and yeah, so the file system may want to actually provide some consistency guarantees. It might want to have a journal that it needs to maintain, right? It might want to flush those writes to disk if the process has asked it to do that. So this gets in the way, a little bit, of consistency in certain cases. What about below the file system? To some degree, the pros and cons are similar, right? The pros are reversed and the cons are reversed, right? I'm trying to remember. Oh yeah, what's another pro of this approach? I think, Carl, you started to go there. You offered a word that was... So right, if I'm down at the block level, I can cache all the file system metadata structures that the other cache doesn't see, right? So the other cache doesn't see changes to things like the superblock, right? Whereas this cache will see that it's just another block on disk, right? 
The superblock, the inode tables, the bitmaps used to allocate data blocks and inodes. They're all just blocks, right? And so this cache sees everything and can cache it, okay? So remember that a real enemy here is the fact that file system operations usually require modifying multiple blocks on disk before those operations can be considered complete, right? Before the file system is in a consistent state. So how does caching exacerbate this situation? Why does caching kind of get in the way of consistency? What does caching do? So let's say I have an operation that requires that I modify three different blocks on disk, right, John? Right, so I've got these three blocks, right? And let's say I'm doing like a create or something and I need to change the directory, allocate the inode, and allocate a data block and associate it with the inode. So maybe there's actually like four or five blocks I need to modify on disk. The point is that normally I would do that operation and they'd be sent down to the disk immediately and there'd be some window of time there where they were still in flight, and if I disconnected the power one of them might not get finished, but when I start to let things lodge in a cache then I'm increasing that time window, right? So it's possible that one of those operations gets flushed from the cache, but the other four are still sitting there, right? And then the power goes down and so now I've got a situation where I've got an entry in a directory for a file that contains an inode pointer but that inode isn't even marked as allocated, right? So that starts to be kind of gross, right? And that might mean that when I reboot the system I have to do some pretty long and tiring consistency checks. So how many people have ever run FSCK on a file system? So there were these tools, and actually one of the guys who wrote FFS, who we're going to talk a little bit about today, I think wrote one of the first versions of FSCK. 
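To make that "widened window" concrete, here's a toy Python sketch. All the names and block labels are invented for illustration (this is not a real file system API): a create dirties several blocks in a write-back cache, the cache flushes only one of them in whatever order it likes, and then the power goes out.

```python
# Toy model of how a write-back cache widens the crash window for a
# multi-block update. Block names are illustrative, not a real FS layout.
disk = {}          # block name -> contents actually on "disk"
cache = {}         # dirty blocks still waiting to be flushed

def cached_write(block, data):
    cache[block] = data            # lodges in the cache, not on disk

def flush_one(block):
    disk[block] = cache.pop(block) # only this one block reaches the disk

# A create() touches several blocks: directory entry, inode table, inode bitmap.
cached_write("dir",       "entry: foo -> inode 7")
cached_write("inode_tbl", "inode 7: size 0")
cached_write("inode_map", "inode 7 allocated")

# The cache flushes blocks in whatever order it likes...
flush_one("dir")

# ...and then the power goes out: everything still in the cache is lost.
cache.clear()

# The directory now points at inode 7, but the inode was never marked
# allocated on disk -- exactly the inconsistency described above.
print(disk)
```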
So there are these file system tools and you can imagine, I mean you can essentially walk the file system and you can do all these consistency checks to make sure that everything lines up and stuff like that, and there are tools to do this, right? People who have run FSCK, what's the problem with this? It's slow, man. I mean the disk is really slow. So now I've got this big disk and FSCK, there could be anything wrong with it, right? If FSCK runs it just has to assume the worst. And so it's got to walk through the entire file system, look at everything, double check everything here and there. So it's generating all this IO and it's slow, right? Especially on older file systems, FSCK would take forever to run, right? So it's kind of like, okay, well, if the power goes down and the machine crashes you can recover the file system, right? But it might be tomorrow by the time you're finished, right? Or at least a couple hours, and when you're talking about machines that are trying to maintain high uptime that can really hurt, right? So, you know, it's one thing if I can recover the file system, but it's another consideration, how long is that actually going to take, right? How long is, you know, cartoons.com going to be down and the little kids are going to be crying and whining and not be able to, you know, to use the cartoon web or whatever, right? So that's a consideration, okay? All right. So lastly we talked about journaling, right, which is kind of a modern approach to maintaining file system consistency, okay? So what is journaling? Fundamentally what am I doing when I journal a file system? So I'm essentially using a special data structure on the disk called the journal and I'm using that to track changes to the file system that are not fully finished, right? They might be partially finished, right? Depending on the caching policy, parts of them may be done, right? 
But what I'm doing is, when I make a change to the file system, I record it in the journal before the change is finished, right? And as things start to filter down to disk, some of those changes might happen. So what do I do after a failure, right? So I fail, right? I've been writing things to this journal. I come up again. How do I use the journal to help me during failure? Somebody from this side of the room. John. Right, so, exactly. So essentially it's called replaying the journal, right? So I start at the last point when I knew the file system was in a consistent state. I'll refine this in a little bit. And then I kind of roll forward, right? And I say, okay. I had marked in the journal that I was going to do this thing and it required these five operations on disk. And so I look on disk and I say, is inode 567 allocated? Yes, check, okay. Is it linked in this other directory? No, okay, I need to do that, right? So I use the journal to roll forward and get to a consistent state, right? And when I find the last complete journal entry, that's where I stop, right? So it's possible that if I was in the middle of writing a journal entry when the failure occurred, that there could be a little bit of stuff that's left undone, right? But the probability that I'm in the middle of writing the journal is probably much lower than the probability that I was in the middle of writing five or six different disk blocks in an unjournaled system, right? So I've reduced the amount of state that I'm going to potentially lose. And I've made it also a lot easier, right? Instead of checking the entire file system, assuming that anything could be wrong, the journal kind of gives me... One of the things the journal allows me to do is to focus on the things on the disk that are likely to be wrong, right? So the journal kind of says, okay, here's when I last knew it was consistent, here's the things I was in the process of changing, and so these are the things I need to check, right? 
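The replay idea described above can be written down as a little loop. This is an illustrative model, not real FFS or ext3 code, and the entry format is invented: entries without a commit record are discarded, and each committed write is checked against the disk and redone only if it's missing.

```python
# Toy journal replay: roll forward through committed entries, checking
# each recorded write and redoing it if it never made it to disk.
def replay(journal, disk):
    for entry in journal:
        if not entry.get("committed"):
            break                          # partial tail entry: discard it
        for block, data in entry["writes"]:
            if disk.get(block) != data:    # "is inode 567 allocated? yes, check"
                disk[block] = data         # missing on disk: redo the write

disk = {"inode_567": "allocated"}          # this write made it before the crash
journal = [
    {"committed": True,
     "writes": [("inode_567", "allocated"),          # already done: skip
                ("dir_home",  "link foo -> 567")]},  # missing: redo
    {"committed": False,                             # crash mid-entry
     "writes": [("inode_568", "allocated")]},
]
replay(journal, disk)
print(disk)   # the committed entry is completed; the partial one is dropped
```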
So rather than old FSCK, which has to check the entire disk and assume that anything could be wrong, journaling file systems use the journal to narrow things down. Does this make sense? Yeah, yeah. What's that? You cannot do the form of your... If it gets full? Is that what you're saying? Oh, fails, right. So we talked about if I'm in the middle of writing a journal entry, right? Then on replay, I just have to discard the rest of that entry, right? So it's possible, for example, that I was going to create a file and I started to write the journal, and what will happen when I come up is that file won't be created, right? And that's okay, right? I mean, part of the point here is that, you know, between the time the process gets it in its mind's eye that it wants to create a new file, and the time that actually happens on disk, there is a delay there in which a failure can cause that operation not to complete, right? What I'm trying to do is I'm trying to do two things. First of all, I'm trying to shorten that window as much as possible, right? So now, rather than having to write out a bunch of blocks synchronously, I just have to make a change to one data structure on disk, right? So that just might be touching one disk block. So that's nice, right? The second thing is on failure, I can recover much faster, right? So a journal allows me to do two things. First of all, it shortens that time window during which a failure can cause some loss of state. Second, it makes it much faster to get the file system back into a consistent state, right? Or to make sure the file system is in a consistent state, right? So yeah, again, I mean, this is not a panacea, right? Like, from the moment you call open in your process and you pass it the O_CREAT flag to have it create a new file, like, from the minute that happens, there is a latency there in which if I unplug your machine, that won't happen, right? There's no way around that, right? Because the disk is slow, right? 
And I have to make multiple modifications. Does that make sense to people? This isn't a cure-all, but it does allow us to do a much, much better job, both of preventing failures and of recovering. And then finally, so how do I... What do I do when I buffer... When I actually flush the... Let's say I call sync, or let's say for whatever reason I'm sending out to disk, what do I do in the journal? Do you guys remember what this is called? Right here. Anybody, yeah. Checkpoint, right? So I add a checkpoint to the journal, I just write a little note in the journal and I say, at this point, everything that I've written above is on disk. And at some level, at that point, I can discard those journal entries, right? I really don't need them anymore, because when I fail, I'm going to roll back to the last checkpoint and work forward, right? The journal up to the checkpoint describes the disk, meaning that I can discard anything before the checkpoint. Okay. All right, so I really blew through this at the end of the last class. So let's talk... As I pointed out to you, one of the nice things about this is I've taken five or six operations to different disk blocks and I've reduced them to one small write, potentially, for metadata updates, right? Because you can imagine that I can create a little data structure that I can write to my journal that describes a create operation that's quite compact, right? I don't know how big journal entries are, but they're small, right? All I have to do is I have to say, what's the inode number that I'm using to create this file, and maybe put some information about where the file was going to be, what directory was it going to be in, what's the impact, right? But what about the data itself? Does anyone remember there were two different approaches to data blocks, right? Because the data blocks are big, right? 
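Checkpointing can be sketched the same way. Again this is a toy model with invented fields, not a real journal format: once everything before some point in the journal is known to be on disk, that whole prefix can be discarded, and replay only ever starts from the last checkpoint.

```python
# Toy checkpoint: drop every journal entry whose writes are known to be
# on disk. Replay after a crash only needs what survives.
def checkpoint(journal):
    # Entries already flushed to disk are fully described by the disk
    # itself, so the journal never needs to replay them again.
    return [e for e in journal if not e.get("flushed")]

journal = [
    {"op": "create foo", "flushed": True},    # on disk: discardable
    {"op": "create bar", "flushed": True},    # on disk: discardable
    {"op": "append baz", "flushed": False},   # still in flight: keep
]
journal = checkpoint(journal)
print(journal)   # only the in-flight entry survives the checkpoint
```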
So I'm going to create a new 16K file; the metadata required to describe in the journal that I'm going to create that file and the data blocks that I'm going to use might be, I don't know, 64 bytes or something, right? But the data is still 16K, right? There's no way to get around that, okay? So what am I going to do with that data? What's the first approach? Or what's one approach? Either one. So I'm using the journal to track changes, right? The data blocks are kind of changes, right? Like the data blocks mean that there's data on disk that wasn't there before, right? So what's one thing that I can do with those data blocks? I can include them in the journal, right? That's one way to do it. And then when I flush the data blocks to where they're actually going to go, I'm going to write them again, okay? The upside of this is that the data blocks are in the journal, meaning that on recovery I can potentially make sure that they get put where they're supposed to go. But on some level, what am I doing? I'm just writing those data blocks twice, right? I'm writing them first to the journal, and then later I'm going to write them where they need to go, right? So this is kind of gross. What's the other option? I can include them in the journal. Alternatively, I can what? I can exclude them from the journal, right? And that means that it's possible that when I replay the journal, I create a file foobar of size 16k on disk, but its contents are just junk, right? Or zeros or whatever, right? So I might have a note in the journal that says, I created a file foobar of 16k in this directory, and when I replay the journal, I can create that file again, but if I don't have the data blocks, if the data blocks were never flushed, then I don't know, what am I going to put in there, right? I can give you a file with four data blocks associated with it. I just don't know what's in it, right? 
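Here's a toy comparison of the two policies just described. The entry format, names, and sizes are all invented for illustration; real journals (for example ext3/ext4's data=journal versus data=ordered modes) are more subtle. The point is just that metadata-only journaling recovers the file's structure but not its contents.

```python
# Toy model of the two data-block policies: journal the data (written
# twice) versus journal only the metadata (contents may be junk on replay).
DATA = b"\xab" * 4096            # one 4K data block of real file contents

def journal_create(include_data):
    entry = {"meta": "create foobar, inode 7, block 42"}  # small descriptor
    if include_data:
        entry["data"] = DATA     # payload goes into the journal too
    return entry

def replay_create(entry, disk):
    disk["inode_7"] = "allocated, points at block 42"
    # With the data in the journal we can restore the contents; without
    # it, block 42 holds whatever junk/zeros were there.
    disk["block_42"] = entry.get("data", b"\x00" * 4096)

disk = {}
replay_create(journal_create(include_data=False), disk)
print(disk["block_42"] == DATA)   # False: the file exists, its contents don't
```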
On some level, though, this is probably okay, right? Because keeping the file system data structures consistent is a lot more important to the file system. I shouldn't say it's a lot more important. I mean, data is fundamentally what file systems do, right? But if you gave file systems the choice between losing data and corrupting their own structures, they would choose losing data, right? It's not a choice that they would like to make, but if you gave them that choice, they would choose to lose data, right? Because, you know, again, if I at least know that that file exists and I know what data blocks are associated with it, then that's great, right? If, you know, I have those data blocks there, but they're not attached to an inode, they're floating around, and now they're just marked allocated, but they'll never be freed. Now I've got a bigger problem, right? So this is what I do. Okay. Okay. Caching and consistency, any questions? I'm glad we went over the data block stuff again, because it's clear that I went through that too fast, all right? Okay. So let's talk about FFS, all right? Again, kind of a historical journey into BSD. How many people have used a version of BSD? Okay. How many people have used code from BSD? Okay. Everybody put up your hand. Everybody raise your hand. How many people have used code from BSD? Trust me, everybody in here has used code from BSD, right? It is incredible. How many people have used Windows? Okay. You've all used code from BSD, right? How many people have used a computer? Like, I'm serious. So I was looking last night doing some research for this, and there's some website that's just determined, I don't know who makes these determinations, that 4.3 BSD is, I don't know how they describe it, the most significant software project ever. Right? 4.3 BSD. Right? And this guy Kirk McKusick was really involved in that. So BSD, right? 
BSD stands for Berkeley Software Distribution. So for a period of time, there was kind of a legendary group at Berkeley called the Computer Systems Research Group, CSRG, and one of the things, among many other things they did, was they developed this version of UNIX, right? And again, if you look at the significance and the impact that BSD had, it's really incredible, right? I mean, BSD invented sockets, you know, standard socket interfaces, things like this. The TCP/IP code that BSD used is one of the things that has essentially spread into almost every other operating system. So many people have grabbed it. So BSD had a very permissive license, which is one of the reasons why, for example, Windows was allowed to use their code, right? Apparently, I read somewhere that if you take a lot of binaries and you just do a string search for University of California, Berkeley, you can find all of this... Again, this stuff's everywhere, right? So pieces of 4.3 BSD have just landed all over the Earth, right? In almost any computer that you use, right? And it's kind of cool. So this guy, Kirk McKusick, is responsible for writing FFS. He wrote FFS, right? Maybe he took a class like this, and maybe they had a file system assignment and he just decided he wanted to keep going... probably not, right? But he wrote a file system, like, by himself, okay? So that's kind of cool. And it turns out that FFS is still around, right? So there's something called UFS, the Unix File System. It's still in development. It's still in use; I think FreeBSD 9.0 is still using it. And a bunch of the BSD derivatives are still shipping with it. I read on the Wikipedia page that the PlayStation 3 uses UFS 2 as its file system. I have no idea why. Some kernel weenie is clearly working for Sony. But anyway, so this is still around. 
He's still working on it, actually. And I was watching some stuff online of him coming to talk two years ago about combining some of the work that they did on consistency with journaling. So this is kind of a neat story, right? Let's see. So again, FFS made a lot of contributions to file system design. A lot of the things that we think about, a lot of the terminology that we've already been talking about really goes back and is rooted in FFS, right? In the design of FFS. You know, FFS is kind of like the dawn of modern file systems. It's like when the first proto file system climbed out of the swamp and started to walk, you know, in search of food and a mate, right? So before that, it was like these weird things that we don't want to remember, right? Like, before FFS there was just something called "the file system," right? Anyway. Kind of broken. So some of the contributions of FFS are still around and were really lasting. Others were less so, right? So in my opinion, kind of the less lasting features were this really serious, in some ways elegant and in some ways kind of gross, attention to improving performance by exploiting disk geometry, right? And we'll talk about this. FFS had all these really, really funky features that were designed to really try to... I mean, at the time, right, this kind of made sense. Disks were really, really slow. I mean, disks are still slow, right? But Flash is helping and disks have improved a lot. Disks do a lot of their own buffering now. But at the time, disks were even slower, right? And so you kind of, like, wanted to get as close to the disk as possible and understand as much about the disk as possible to try to improve performance, right? And this is what they did, okay? So go back to what we talked about, I think it was like two weeks ago, right? About disk geometry, right? And we've been hinting about this. So what are some geometry related questions? 
So let's say I'm taking you back to 1980 and I'm giving you this thing that spins and it's got heads that move really slowly and stuff like that. What are you going to start to think about in terms of optimizing your file system based on what you know about this device, right? Like, you know, I'm telling you like, you got to get in bed with this device, you know? Like, take this, put it under your pillow at night, you know, like spend time with it, hang out, go get coffee, like get to know this thing, right? And then how are you going to use all this knowledge to improve the file system? What's one thing that you might think about? Yeah, yeah, so yeah, this is, so that's one great idea and actually it's not even up on the slide, right? Where's data coming off the disk fastest, right? Outside, right? Now, it's kind of interesting, because let me come back to that, because I think that's true and not true on early disks, because early disks had some other problems that might have made that not as effective as you might want it to be. So, but what other things, right? So that's a great start. So I want to think about where is the disk, like how fast is the disk spinning at different points, right? What else? Right, right, I don't want to move the head, right? So, you know, where do I put things? Like where does stuff end up on disk? So, and what are the data structures that we're talking about here? It's just the standard file system stuff, right? Where do I put inodes, right? And particularly, what two structures am I going to do a lot of seeking back and forth between? Inodes and data blocks, right? What's the canonical way of reading data from a file? Find the inode, use the inode to find the data blocks, go to the data blocks, read some data from the data blocks, maybe make some modifications to the data blocks and then go back to the inode, you know, update the inode, go back to the data blocks, right? 
So this is one area where I would expect to see a lot of seeks back and forth. And the closer I can put these things together, the smaller those seeks will be, right? Okay, what about trying to identify related files? Again, like we're really getting down inside this thing, okay? We want to, you know, what are related files? Where should I put related files? And then, and kind of how do I identify related files? What files are likely to be related? What about patterns of disk access can I use, or file system access can I use to co-locate files, right? Let me ask, I mean, just based on your understanding of file systems, what do you think is one group of files that is likely to be read all at once? If you asked, if you gave me a file system, if I gave you a file system and I said, come up with, you know, a group of files that is likely to be accessed together frequently. Yeah, I'm thinking about something a little, okay, so maybe files of the same type, that's an interesting observation, what else? What about the file system layout could I use? Same directory. So anytime you do LS, right, what happens? It opens the directory, it uses the directory to find the inodes of the files in the directory, and then it's going to read every one of those inodes, right? So just because of how file system hierarchies are set up, those files get accessed together a lot, right? And that's something that FFS observed and tried to use. Okay, so the second batch of things that we won't spend as much time talking about is that FFS just, you know, again, a lot of things FFS introduced you're kind of used to, right? So we've talked in the class about, you know, block size, right? We said the sector size on disk is 512 bytes, you know, FFS introduced the idea of these larger blocks, right? Larger blocks are more efficient. They allow me to lay out files more contiguously on disk. We'll talk towards the end of the class about some of the drawbacks here. 
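The `ls` pattern mentioned above can be sketched as a toy access trace. The layout and names here are invented for illustration: one directory-block read fans out into an inode read per entry, which is exactly why co-locating a directory's inodes pays off.

```python
# Toy model of why same-directory files are a natural unit of locality:
# listing a directory reads the directory block, then every inode it names.
directory = {"a.txt": 101, "b.txt": 102, "c.txt": 103}   # name -> inode number
inode_table = {101: "size 10", 102: "size 20", 103: "size 30"}

def ls_long(directory, inode_table):
    # The access trace an `ls -l`-style listing generates.
    reads = ["dirblock"]                 # one read to get the entries
    for name, ino in directory.items():
        reads.append(f"inode_{ino}")     # then a stat() of each entry's inode
        _ = inode_table[ino]
    return reads

# Every listing touches all of these blocks together, so putting them
# near each other on disk turns many long seeks into short ones.
print(ls_long(directory, inode_table))
```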
This is kind of our standard fragmentation versus performance trade-off, right? Early file systems essentially had problems where it was difficult to allocate blocks that were near each other on disk, simply because of the data structures that they used to track which blocks were free. FFS fixed this, right? FFS had free lists that were ordered, allowing it to easily allocate blocks that were close together on disk. And finally, there was just all this stuff that you guys take for granted: symbolic links, right? File locking, so a process can get exclusive access to a file to make changes. Longer file names, right? The original Unix file system had a max file name length of 14 characters or something dumb, right, and FFS lifted that. There were still some file systems that a lot of you guys had to use for long periods of time that had these sorts of dumb restrictions, right? Like file name length, I mean, come on, you know? It's hard enough to name files, right? Now you've got to tell me I have to do it in 14 characters. I mean, that's cruel, okay? And then user quotas were also something that was added in FFS. Simple stuff, like a simple thing that you would want in a file system to make it more usable, right? Especially in a multi-user environment, I want to prevent, you know, Michael's PDF stamp collection from taking over the entire disk, right? And preventing me from storing my important slides for class, okay? So FFS added these features. Okay, so we've talked a lot about seek time, right? This is the major source of disk latency, the big enemy here, right? Moving the heads laterally like this, this is a big problem, right? And FFS addressed this. But FFS took it to the next level, right? FFS didn't just address seek times, right?
FFS actually also would do calculations of rotational latency in certain cases and use those to try to optimize file access patterns, right? So again, they took it further: the seek time stuff has stayed with us, right? The rotational latency stuff, I think, is pretty much ignored now, right? But again, this is how intimate these guys got with their disks. So, I think I've introduced this concept before. Who remembers what a cylinder group is? Or first, what is a cylinder? You started with the right motion, right? Yeah, Ben. Essentially, it's all the data that I can read from the disk without moving the head. So it comes from tracks on multiple platters, right? Potentially tracks on both sides of multiple platters, essentially forming a cylinder going down through the disk, right? And in FFS, I also think they actually expanded this a little bit, where cylinder groups could contain multiple tracks that were close to each other, right? So the idea is: all the data I can read without moving the head much, right? Certainly without blasting the head from one side of the disk to the other, right? Moving the head a little bit here and there, that's okay, right? Moving the head in big jumps is less okay, okay? So in FFS, right, what did they do? This was kind of a cute idea, I think. Every cylinder group, okay? So a cylinder group is some partition of the disk, again, consisting of tracks that are close together on multiple platters, right? The stuff I can read without moving the head much. It has a backup copy of the superblock. I don't know if they actually used this copy of the superblock. Maybe for reads and not for writes, right? It's intended to be a backup, right? Remember, the superblock is a really important and delicate data structure and I don't want to lose it. So I keep backups everywhere, right? Every cylinder group has a backup copy. It has a cylinder-group-specific header that has superblock-like information for that cylinder group, okay?
So it's kind of its own little block of the file system. It's got an inode structure that stores the inodes for the group. It has data blocks. What does this sound like? This is like a little file system, right? On one cylinder group, right? So without leaving the cylinder group, I can allocate an inode, right? I can associate some data blocks with the inode. If I'm lucky, the directory that I'm going to put that file in is in my cylinder group, so I can update the directory and the inode without moving out of that cylinder group. So again, it's almost like little mini file systems, right? File systems inside file systems, right? And, you know, of course, when you say mini anything, now you have to show this picture, right? So, yeah, it's like its own little mini file system, right? Got a bunch of them running around. And again, if I'm good, what this does, right, is it places a premium on locality, right? If I'm good, I can pretend that what I really have is a really fast file system, right? Of size, big file system divided by 64, or whatever number of cylinder groups I have, and I can just live within that, right? And then from time to time, I make a trip to another cylinder group to do some other things, right? So this just occurred to me, but there's an interesting analogy here to Facebook that I'll just throw out there, because I think maybe this is another system design principle that, I don't know, might show up on a test or something. So, divide and conquer, right? This is sometimes seen as an algorithms principle. It's also a system design principle, right? So one of the things that allowed Facebook to perform well when it started was that Facebook isolated users into schools, right? So Facebook had a server for Princeton with all the Princeton students on it, right? And most of the Facebook operations, the searches and other things, happened local to that machine, right? Or local to that little slice.
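That school-by-school partitioning can be sketched in a few lines. The shard layout and names here are hypothetical, purely to show the divide-and-conquer principle, not Facebook's actual design:

```python
# Users partitioned by school: the common case touches exactly one shard.
shards = {
    "princeton": {"alice": "profile-a", "bob": "profile-b"},
    "harvard": {"carol": "profile-c"},
}

def local_lookup(school, user):
    """Common case: one shard, one cheap dictionary probe."""
    return shards[school].get(user)

def global_lookup(user):
    """Rare case: has to visit every shard -- the expensive cross-school path."""
    hits = []
    for users in shards.values():
        if user in users:
            hits.append(users[user])
    return hits
```

A local lookup touches exactly one shard; a search across schools has to visit all of them, which is exactly the operation you want to make rare.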
If you try to do a global search over all the schools, then they had to go up a level, right? But that wasn't something that they made easy to do in the early days, right? So on some level, you know, when Friendster was taking five minutes to load a page, right, Facebook used this really natural and elegant partitioning to keep most stuff very local, right? So it's almost like this idea of having little mini file systems, right? They had a little mini Facebook, right? And as long as you were just looking at people at your own school, which is what most people were doing, you didn't have to pay the cost of these expensive cross-school lookups, right? You just did stuff locally. Now I have no idea what they do. And I looked, and, so remember we looked at some of the superblock stuff from ext4. I don't think these are actually old FFS-style groups, because you can see where the block bitmaps are. They're all kind of very close to each other on the disk, right? If this were FFS, they'd be spread out, right? Because I would take a big chunk of the disk and use it for this group and a big chunk for the next one, right? But this terminology is still with us, right? And on some level, some of these ideas are still here, right? I mean, each one of these groups has its own inode table, its own inode bitmap, its own data blocks, et cetera, right? How those are laid out on the disk now, I really don't know, right? But that terminology goes back to FFS, I believe. All right. So I just want to talk a little about this rotational placement stuff, because this is kind of wild, right? So the FFS superblock contained really, really detailed disk geometry information: how many heads, you know, the speed the heads could move, the rotational speed. I think they did measurements on the disk at runtime to collect this information, so it's really, really detailed, okay?
And the reason they were doing this is they were trying to do rotational block placement planning, right? Now the question is, why do I need this, okay? And the answer is that on early disks, the speed at which the heads could read information off the disk exceeded the bandwidth over the bus or through the cache or something, right? Essentially, if I sat there spinning on one track, reading all the data, I couldn't transfer it fast enough back to the operating system, right? I couldn't do reads at full bandwidth, right? And so what would happen? If I was trying to read a bunch of sequential blocks, I couldn't read 0, 1, 2, 3 off the same track as I was going around, because I would read 0 and then it would take some time to get 0 over to the operating system, and by the time I finished, I would be over block 3 or something, okay? So FFS actually incorporated this into their file layout, right? They would calculate this and figure out where to put things, because otherwise what happens, right? So let's say I lay out a file in 0, 1, 2, 3 and those blocks are consecutive on the track. But let's say I actually can't read the track at full bandwidth. So what happens if I'm trying to do a sequential read? I'll read block 0, I transfer it, now I'm over block 3, so what do I have to wait for? I have to wait for 1 to come all the way back under the head again, so I've got to do a whole other rotation before I can grab block 1. So what FFS did is they said, okay, well, we'll calculate how fast you can read the data and we'll space stuff out on disk appropriately, right? So again, my reaction to this was some weird mixture of, wow, that was really cool and I can't believe they did that, right? And I'm sort of glad that we don't have to do this anymore. Disks kind of prevent us from doing this now, right? Okay, so now it's 2012, 30 years later, and the important question is, does this stuff really matter anymore?
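The spacing calculation FFS was doing can be approximated with back-of-the-envelope math: given how long the system needs to digest one block, figure out how many block slots rotate past in the meantime, and space consecutive logical blocks that far apart. The disk parameters below are invented for illustration, not real FFS constants:

```python
import math

def interleave(rpm, blocks_per_track, consume_ms):
    """Slots to skip between consecutive logical blocks on a track."""
    ms_per_rev = 60_000 / rpm
    ms_per_block = ms_per_rev / blocks_per_track
    # While we digest one block, this many further slots rotate past the head.
    return math.ceil(consume_ms / ms_per_block)

# Hypothetical 3600 RPM disk, 16 block slots per track,
# 3 ms to hand each block over to the operating system:
step = interleave(rpm=3600, blocks_per_track=16, consume_ms=3.0)
# One revolution is ~16.7 ms, so ~1.04 ms per slot; digesting a block
# takes ~2.9 slot-times, so skip 3 slots between consecutive blocks.
```

With these numbers, by the time the system has finished with one block, the next logical block is just arriving under the head instead of having already gone past, so a sequential read never waits out a full rotation.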
And I think that the important way to think about this is, first of all, if not, why not? And second of all, if not, is it a good thing that it doesn't? And I feel like this is a good time to stop and again consider the ongoing struggle for the fate of the computer systems universe between hardware people and software people, right? So the hardware guys have these very, very fast things that they can throw at problems that are incredibly inflexible, right? The software people have these really great flexible software things that they can change at a moment's notice that tend to be really slow, bloated, buggy, broken, et cetera. Those are terms provided by my hardware friends. So when we think about disk performance, right? It was clear that disks were a problem, right? Disks are slow, and a lot of operating system slowness and latency was being caused by disk access. Well, it could be caused by those people too, right? But a lot of it was disk access. So the question is, whose problem is this, right? And it's a problem that everybody tried to solve, right? So FFS invested in one answer, using all this really detailed information about the disk, right? And we'll talk on Friday about a completely different approach to improving disk performance, also driven by information and observations about file system and disk usage, right? But between hardware and software, this is kind of a battle, right? So what happens if I put the operating system in control, right? Well, the pros are the operating system knows everything, right? It's operating the system. So it knows about multiple users. It knows about workloads. It knows about consistency requirements. I mean, if the operating system doesn't know about it, it's probably not important, right? The cons are that the operating system is slow, right? And it also has bugs. So that's one approach, right? Then there's the alternative approach, putting the device in control.
And the first approach, the operating system in control, is to some degree what FFS was exploiting, right? It was exploiting the fact that disks did what they were told, right? When the disk told you, this is my layout, it told you where things really were, right? And what things were close to each other. And when you told the disk, put this block here, it said, okay, I'll do that, right? And I'll write it right now, okay? So back then, disks were these very subservient creatures that tended to obey operating system instructions, okay? Unfortunately, hardware people kept working on disks, right? And as disks evolved, they got more and more rebellious, right? It's kind of like my dog. They got willful. They did things their own way. They stopped listening so carefully to the operating system. And essentially the reason they did this is they said, you know what? I know more about me than you do, you know? I know me well. All you know is you have some read-block, write-block interface, but you don't really know me, you know? You don't know where things are on me. You don't know all the things I can do. You don't know all the special features I have that the hardware guys have been working really hard on. So yeah, like, stop trying to tell me what to do. You don't even know me. Sounds like a 14-year-old, right? And again, I have all these special features that you don't know about, and they're really close to the disk, right? They're really close to me, right? They're right here. They're not way up there in operating system land, across the slow bus; they're right here, right? But of course, to the degree that this struggle is happening, devices trying to do their own thing frequently causes the operating system headaches, because I told the device to flush that block to disk, but the device was like, I'm not going to do that right now. I'll wait, you know? That doesn't seem like a good idea to me.
I mean, that just, you probably aren't really serious about actually wanting that data to be on the platter. So, you know, power goes out and now the operating system is like, oh, I flushed that block and the disk is like, what block? Not good, right? So, but I mean, this is the kind of relationship that I think disks want, right? Like, oh, disk, you know, you're so clever. You figure out what to do with it, and then disk is like, yeah, that's fine. I'll take care of it. You just go about your other things. And on some level, like, this is, you know, this is caused by this interesting misperception. Well, I shouldn't say it's a misperception. That's probably true, right? I mean, all these things are true, right? So, to me, you know, hardware engineers seem to spend all this time implementing features that they think are important while at the same time hiding information that I need to do my job. And to my hardware friends, I seem to spend massive amounts of time writing terrible, awful, gross code that never works and then whining about features that they should implement and can't implement in hardware anyway, right? Hardware has limitations. Hardware can't do anything. And so this is kind of this battle. So let's talk, should I do this at all? We've got two more minutes. So just a couple more FFS features that I think are kind of a little, a final little grab bag. I think it's important to point out that FFS has continued to evolve, right? Into UFS today, and UFS has a lot of fun features to it, right? So block sizing, right? What's the right size for blocks on disk? Small blocks, right? So who remembers, we had this whole conversation before, right? If you substitute pages for blocks, right? And what was nice about small pages? Less internal fragmentation, right? On disks, small blocks frequently lead to more seeks, right? 
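The internal-fragmentation side of that trade-off is simple arithmetic: a file's last block is usually partly empty, and bigger blocks waste more of it. A quick sketch, with the sizes picked for illustration:

```python
def wasted_bytes(file_size, block_size):
    """Bytes lost to internal fragmentation in the file's last block."""
    remainder = file_size % block_size
    return 0 if remainder == 0 else block_size - remainder

# A 2 KiB file in 16 KiB blocks wastes 14 KiB -- 87.5% of its allocation.
waste_big = wasted_bytes(2 * 1024, 16 * 1024)
# The same file in 1 KiB blocks wastes nothing, but now spans two blocks,
# which is the seek-count side of the trade-off.
waste_small = wasted_bytes(2 * 1024, 1024)
```

Bigger blocks waste more of the final block; smaller blocks waste less space but spread a file across more blocks, which is where the extra seeks come from.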
Because files have more blocks, and those blocks are more likely to be in more places, and it's more likely that I have to bounce the head around looking for them, right? And of course, large blocks mean more internal fragmentation, meaning that if I have a 16k block size, a 2k file is wasting 14k of space, right? But fewer seeks, okay? On modern disks, I think a lot of file systems have gone to larger and larger block sizes simply because who cares about space, right? Like you guys all have these massive hard drives and, I mean, no one really cares, right? Although actually my wife's laptop hard drive has filled up and it's frustrating because I have no idea what to do, right? I haven't deleted anything in years, so we're going to have to figure out how to address the situation. Okay, FFS tried at some point, I'm not sure they do this anymore, some clever things about co-locating inodes and directories. So accessing directory contents means seeking to find all the other inodes that are contained by that directory, and hey, why don't we try just jamming the inodes for that directory into the directory itself, right? So there were versions of FFS that actually stuffed the inodes into the directory file, right? Rather than having a directory entry contain a file name and an inode number, now the directory contains a file name and the inode itself, right? And so that might solve this problem, and it creates some other problems. Probably one of the more interesting features, and I wish I had more time to talk about this, I wanted to talk about it in today's lecture and it kind of deserves half a lecture just for itself, is a consistency feature called soft updates, right? Soft updates are a very, very clever, very sophisticated approach to maintaining consistency in the face of multiple file system changes, and they are somewhat compatible with journaling. In fact, the thing I was watching Kirk McKusick talk about was journaling soft updates, which is a new UFS feature that they've added that, as far as I can tell, uses journaling to address some of the consistency problems that soft updates struggle with, right? So journaling gets 90% of the problem, soft updates gets the other 90%, maybe we can combine them to get 99%, right? So I just wanted to point this out, right? So, you know, Kirk McKusick, this incredible file system developer, has been active in the file system community for a long time, has a really cool webpage, seems like a really neat guy, loves his wine, right? So this is his house status webpage, and I was actually refreshing it last night, and these numbers are all live, right? So the numbers change. I think this wine cellar number is supposed to be pretty stable, right? But you can see, he has this incredible wine collection. Here's the wine of the day, wine 2856. So based on his numbering system, he seems to have a large collection of wine. And I've heard this. He also tracks the number of bottles he has left of any particular wine, which is kind of cool. And then, you know, here's this other information about his house, including the temperature. You can see it's 48.8 degrees in Berkeley, California right now. Not that different from Buffalo at this time of year. There's that famous quote attributed to Mark Twain: the coldest winter I ever spent was a summer in San Francisco. All right, so on Friday we're going to talk about the log-structured file system. This will kind of be the end of our file system unit. LFS is a fun thing to talk about. It's a radically different approach to file system design. It generated a huge amount of controversy within the file system community, including a big, long, no-holds-barred back-and-forth battle between some really, really well-known file system developers.
And there's a lot of fun pieces to it. So we'll talk about it Friday. And I look forward to seeing you then. Good luck on assignment three when it comes out, which will be soon. If you haven't finished assignment two, get it done. See you on Friday.