So, today we're going to finish up our discussion of more general and generic file system design features. We're going to finish talking about caching, and in particular some cache design choices that affect the file system's consistency. And then, once we've created this problem of a relationship between caching and consistency, we'll look at one general-purpose way to solve it, which is a technique called journaling.

We have regular office hours this week; everybody is back on duty. We will tell you when the midterms are available, I'm just waiting to hear from the print shop. As soon as we get them back from the scanner, we will have them, and you can pick them up during office hours. I would also encourage you, if you are upset about the grade you received on a particular question, here's what to do to maximize your chances of success. The first thing is, come up with a coherent argument about why your answer was not graded correctly. That involves looking at the solution and looking at some of the grader feedback. Without that, you're really not going to make much progress. The second thing is, the online solutions were updated earlier today. There are now overall statistics about the exam, as well as per-question statistics and some feedback provided for some of the questions by the TAs that graded them. You should be able to use that to figure out which TA graded a particular question. So if you have a doubt about it, or you feel like you have an argument that the question was not graded correctly, please come and meet with that TA. That should be possible; all the TAs do a number of office hours a week. Approach the right TA and make an argument. We will be flexible if there are mistakes that get made during grading. Of course those mistakes, I would argue, tend to cancel out when applied over the entire class and probably also when applied over your entire exam. But we will certainly address questions about grading. Please keep in mind that TAs are also working on helping you with assignment three and other things. Most importantly, come with an argument when you come in to ask for points. If you come in and say, I want more points on this question, you're not going to get very far. And yeah, you guys are making some progress on assignment three, up to a 40, so at least you're headed in the right direction. That's four or five, maybe ten points more than last week.

Let me talk about what we're going to do for the rest of the semester. What I did last year is I got to the end of my slides and then we talked about some research papers that were related to the topics in the course. What I'm going to try this year is interleaving those papers earlier, at points that are topically appropriate. So on Wednesday this week, we're going to talk about two file system designs that are quite distinct from each other. They make very different design choices, and so they'll be fun to talk about together. And then on Friday, we're going to talk about RAID. How many people have heard of RAID? All right. So RAID is the thing that you might have used or configured, or you've probably used machines that themselves use RAID without you knowing it.
So we'll talk about RAID. The RAID paper is posted on the website, and you might also want to look at other online resources about it. RAID is a billion-dollar technology, so there's all sorts of information online about it, but please do try to look at the paper. This is the first paper we'll discuss in class, and since it's our first time, we'll talk a little bit about how to approach something like this. The RAID paper is actually great for this because it's 26 pages long, although it's not really that long: it's formatted in a strange way, so it goes fast. It's also fairly didactic in a way, but it can be difficult to penetrate if you're new to this. So we'll talk about ways to approach research papers on Friday, but please take a stab at reading the RAID paper before you come to class. And then going forward, we have a mix of topics to finish the class. We're going to talk about virtualization, and we'll talk a little bit about OS design. As we go, I'll be dropping in papers, maybe one a week, for us to look at, so you get a chance to see some really modern, cutting-edge operating system designs and features. The RAID paper itself is from the late 80s, so it's not modern and cutting edge, but it's a good paper to practice on and learn how to read these. Any questions about the rest of the semester? This will be fun. You will be surprised that at this point, with the knowledge you've picked up in this class and with your previous experience, you can actually sit down and get the gist of what these papers are about. It takes a little bit of practice, but you can, and it's kind of fun, because there are still some very interesting ideas out there in the research community about how to build better systems.

OK. So let's talk a little bit about how we organized data blocks. We talked about a couple of different ways to structure our data block index, and our final data structure was based on what observation about files? Multi-level indexes were based on what sort of distribution of file sizes? Most files are small: small configuration files, little preference files that are used by various programs. Lock files are sometimes zero size; their existence is what they're there for. But some files can get really big. And so that drove this idea of the multi-level index, where the inode stores some pointers to blocks, which we call direct blocks, and that's how I find the first n blocks of the file. What about the next blocks in the file? How do I access those? So for small files, all the data blocks might be linked from the inode itself. Actually, on Wednesday we'll talk about FFS, which had a sneaky trick where for really small files it just jammed the contents right into the inode itself, so there were no data blocks at all. But for the file system we've been describing, there's always at least one data block. That data block might be linked directly from the inode. But once the file gets big enough, what do I start to do? I can only link so many direct blocks off of the inode itself, and that would prevent the file from getting big. So I start allocating blocks that are not used for data. Instead, they contain pointers to other data blocks. So I introduce one level of indirection.
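To make that concrete, here's a minimal sketch of how a lookup by file offset might walk the direct pointers and then one level of indirection. The constants and the `read_block` helper are hypothetical, not the code of any particular file system; the point is just the extra disk read that indirection costs.

```c
#include <stdint.h>

#define BLOCK_SIZE    4096
#define NDIRECT       12                              /* hypothetical: 12 direct pointers in the inode */
#define PTRS_PER_BLK  (BLOCK_SIZE / sizeof(uint32_t)) /* pointers that fit in one indirect block */

struct inode {
    uint32_t size;
    uint32_t direct[NDIRECT];  /* block numbers of the first NDIRECT data blocks */
    uint32_t indirect;         /* block full of block numbers, or 0 if unused */
};

/* Hypothetical helper: read one disk block into buf. */
void read_block(uint32_t blkno, void *buf);

/* Map a byte offset within the file to the disk block that holds it. */
uint32_t block_for_offset(const struct inode *in, uint32_t offset)
{
    uint32_t idx = offset / BLOCK_SIZE;

    if (idx < NDIRECT) {
        /* Small files: the pointer is right in the inode. */
        return in->direct[idx];
    }

    idx -= NDIRECT;
    if (idx < PTRS_PER_BLK) {
        /* One extra disk read to fetch the block of pointers. */
        uint32_t ptrs[PTRS_PER_BLK];
        read_block(in->indirect, ptrs);
        return ptrs[idx];
    }

    /* Past this point we would consult doubly (and triply) indirect blocks. */
    return 0;
}
```

Adding more levels just repeats that last step: each level of indirection buys a much bigger maximum file size at the cost of one more disk read per lookup.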
So now my inode has some pointers that point directly to blocks, and some pointers that point to blocks that are used internally and that themselves contain pointers to data blocks. This allows the file to grow up to a certain size. And then what do I do after that? This would still limit the maximum size of the file, and potentially I want it to be bigger. So what's the natural extension of this idea? Nobody wants to try to say it? So I have a pointer to a block full of pointers. The next thing I do is allocate doubly or triply or quadruply indirect blocks that just add additional levels of indirection. So now I have a pointer to a block which contains pointers to blocks which contain pointers to data blocks.

OK. So we talked about caching, and this is a nice segue into what we'll do today. Operating systems use memory both for pages, to support address spaces, and as a cache for the file system. These two uses of memory are constantly at war with each other; they're constantly conflicting. And the operating system is making a decision at runtime about how much memory to allocate to the buffer cache and how much to allocate to process pages. What can happen if I overallocate memory to the buffer cache? In certain cases, this can produce great performance. What is an example of when this would produce really good performance? Yeah. OK. So that's when it's bad; that's when things go wrong. If this creates too much memory pressure, then the system can start to thrash, because essentially what I've done is artificially lower the amount of available memory on the system by overallocating memory to the buffer cache. But when does this work well? Give me an example of a situation where this would be the right decision, a hypothetical scenario. Yeah. So imagine I have a web server that's serving static pages, and maybe the website's large, so those pages are stored all over the disk. Essentially what that web server is doing, it doesn't have a lot of internal storage; it's really moving data from the file system out over the network. And over time, if that's the only thing running on the machine, that might cause the buffer cache to get quite big. There's not a lot of internal data used by a web server, and it requires a lot of file access. Another application that's like this? Very common to have running on machines; you've probably used them in the past. Yeah. Backup? OK, so that's a great point. If I'm backing things up, I'm doing a lot of file access, and I'm moving those files around or doing a diff against a previous backup or something. What's another standalone application that would potentially create an enormous amount of file access? A database. Databases have sometimes played tricks where they actually use mmap to map files into their virtual address space, but if they didn't do that, they would be generating potentially an enormous amount of disk activity over a large enough area that a large buffer cache would be appropriate.

And so this is the other scenario: if I have a lot of paging use, a lot of address space use, but little file system activity, or if the file system activity is confined to only a few files, then it makes more sense to not allocate as much space to the buffer cache as I otherwise would. What places a runtime limit on the size of the buffer cache? Can the buffer cache grow without bound for any particular workload?
So if I had a particular program and I kept allocating space to the buffer cache, at some point its performance is not going to improve anymore. Why is that? What's the maximum useful size of the buffer cache? What's that? No. Well, at that point I'm in deep trouble, because I don't have any memory for anything else, certainly. But there's a smaller, tighter bound on the maximum useful size of the file system buffer cache. Yeah. Well, remember, the file system buffer cache is storing data blocks that are potentially in use. So the total size of all the files that are open or have been accessed recently places a reasonable bound on the useful size of the buffer cache. If all of my I/O is going to one small file, adding buffer cache doesn't help, because there's nothing else to cache. Everything's already in the cache; everything that can hit in the cache is hitting in the cache. I have a 100% cache hit rate, and once I get to 100%, adding more memory to the cache doesn't help.

So we talked about where the buffer cache is located on the system. We discussed putting it above the file system, where it would cache whole files, and below the file system. And where did we decide was the right design choice here? You must have had a fantastic weekend; wish my weekend had been like that. Anything that happened on Friday feels like it was years ago. Where should I put the buffer cache? Below, 50% chance of being right. There we go. Below. Why? And so we'll come back to why. But what does the buffer cache store if I put it below the file system? This is a short-term memory question, because I think I said it just two minutes ago. It's below the file system; what is stored in the buffer cache? What do file system operations eventually manipulate? Disk blocks. Blocks on disk. And the nice thing about having it below the file system, or one of the nice things, is that I can cache data structures that aren't part of any file: inodes, superblocks, other on-disk data structures that I would never see if the cache were above the file system. And as you might imagine, these kinds of file system data structures can be very, very hot, in that they're accessed frequently. So remember, to translate a path name, where do I start? What's the first step of translating a path name? I start at the root inode. So what do I have to read? I have to read the root inode in order to translate every path on the system. So where do you think the root inode is, pretty much from the moment I start the system? In the cache. The root inode also changes very infrequently; you very rarely add or remove directories from the root of your system. And so the root inode is just a cache's dream: it doesn't change very often, and it's used constantly. And I would argue that probably a lot of the top-level directories end up in the cache too, because they're used so often during path name translation. Once I get down into the leaves of the tree, I don't necessarily have a high enough use rate to justify pinning them in the cache, but the top-level stuff is going to be in the cache most of the time.

All right, so any questions about caching, path translation? OK. So now we've talked about how to translate names to inodes. We've talked about how to link data blocks off of inodes. We've talked about how to improve performance. And so now we're at the point where we're going to discuss some consistency issues.
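Before we get into that, here's a rough sketch tying the last two ideas together: the path-translation loop, and why the root inode and the top-level directories are effectively always sitting in the buffer cache. The helper names (`bufcache_get_inode`, `dir_lookup`) and the root inode number are made up for illustration; real implementations do more locking and error handling.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

struct inode;   /* on-disk inode, reached through blocks held in the buffer cache */

/* Hypothetical helpers. */
struct inode *bufcache_get_inode(uint32_t inum);               /* cache hit, or a disk read on a miss */
uint32_t      dir_lookup(struct inode *dir, const char *name); /* child inode number, 0 if absent */

#define ROOT_INUM 2   /* many Unix file systems reserve a fixed inode number for the root directory */

/* Translate an absolute path like "/usr/local/bin/gcc" one component at a time. */
struct inode *namei(const char *path)
{
    /* Every absolute path starts here, which is why the root inode
       is effectively pinned in the buffer cache. */
    struct inode *cur = bufcache_get_inode(ROOT_INUM);

    char copy[256];
    strncpy(copy, path, sizeof(copy) - 1);
    copy[sizeof(copy) - 1] = '\0';

    for (char *name = strtok(copy, "/"); name != NULL; name = strtok(NULL, "/")) {
        uint32_t inum = dir_lookup(cur, name);  /* reads the directory's data blocks (also cached) */
        if (inum == 0) {
            return NULL;                        /* component not found */
        }
        cur = bufcache_get_inode(inum);
    }
    return cur;
}
```

OK, so now, on to consistency.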
Because at the end of the day, the file system isn't really that useful if it doesn't reliably store data. You might have a file system that was extremely fast and also had really cool features, but if your files kept getting lost, if you were working on assignment three late one night and you finally got it to work and it passed all the tests, and then you went to bed, and you woke up and it wasn't the same anymore and you were back to getting 40 out of 100, you would be sad. And you would not use that file system anymore.

And this has happened fairly recently. I should be careful about impugning anybody on video, but there was a very popular file system called ReiserFS that used some fancy data structures. Throughout its history, though, it was plagued by some consistency issues. And that made certain people, like the incredibly conservative people who maintain some of the big computer systems that you guys use, very reluctant to use it for production systems. Because that is the kind of thing that makes system administrators wake up in the middle of the night in a cold sweat. Maybe back in the 90s this was more of a problem, but it's one of the things they're terrified of: that they're going to come in one day and somehow the whole file server will have died permanently, and that data will be gone. That would be catastrophic for really any institution. Even our department would probably take a long time to recover from the complete and utter destruction of the main file server that serves Timberlake and other things. If that went away, we'd have problems, a lot of problems. So consistency is an issue.

Now, on the other hand, we just talked about these tricks that I'm going to use to make the file system faster. And what I want to point out here is that there's a tension between these two goals. So how does caching exacerbate consistency problems? The whole idea of the cache is that I'm using memory to store file system contents. What can go wrong? Yeah. Yeah, the cache is in volatile memory; the disk is permanent. The disk will be there if the power goes out or if you unplug it or something bad happens. The memory is not. The system suddenly reboots or there's a power outage, and the memory loses its state pretty much immediately. And so now that I've introduced a volatile component into my file system, I need to think really carefully about how to make sure that the file system can stay consistent. And in certain cases that we're going to talk about here, it's possible that there will be some data loss. That's sometimes an OK price to pay for the overall file system metadata and structures being accurate. It would be better, for example, to lose a little bit of data than to lose the superblock and a lot of my inodes and have no idea where anything was on disk anymore. At that point, you have to take it to some forensic disk specialist who charges you lots and lots of money and spends months trying to piece things together. That's what you would do if you were desperate; most of us would probably just throw it out and start over.

And remember the other challenge here, going back to when we talked about file system challenges originally: operations on the file system, and now that we've talked a little bit about the internals you can see why this is the case, always require modifying or accessing multiple parts of the file system and changing multiple data blocks on disk. So here's an example: creating a file.
So what do I need to do to create a new file in an existing directory? This is pretty similar to an example we did a week ago. What are the things I need to do, in any order? I want to create a file in an existing directory. What's one of the things I need to do? Find space for what? There are a couple of things I need. In order to have a file, I need what? Yeah, I need an inode. Remember, for every file there is an inode. So I'm going to need an inode. Assuming this file is not empty, what else am I going to need? Data blocks. Assuming I want to be able to find this file, what else am I going to need to do? I'm going to need to modify the data blocks in the directory. So I'm going to actually have to find the directory and modify it as well.

So: I've got to allocate an inode. When I do that, I'm going to mark that the inode is in use, otherwise somebody else might end up using it at the same time. I need to allocate the data blocks; same thing, I need to mark that those data blocks are in use so that I can reserve them for this file. I need to modify the inode and create any data structures I might need to associate the data blocks with the inode, which could include indirect blocks or doubly indirect blocks or whatever I need. Then I need to modify the directory file to associate this inode with a particular relative path name. And then I actually need to write the data blocks out. I have these data blocks allocated, but I need to actually alter their contents so they reflect the information that the person is trying to write to the file.

The problem here, of course, is that if I get stuck in the middle of this process, if I'm powered down or the system fails when only a mixture of these things are done, and I'm not careful, I can create consistency problems in the file system itself. So the question is, why does caching matter? This could always happen: even if I had no cache, even if every time you accessed a file I had to do all these things against the disk, it could still happen that I leave the file system in an inconsistent state. But why does caching make this problem worse? Spring fever or what? You guys are quiet. So the problem caching creates depends on when writes happen. For reading, it's not a problem, right? Reading and caching are typically very friendly, because an item that's been read into the cache hasn't changed from what's on disk. Writing is the problem. When I write an object, ideally, in order to take advantage of the cache, I'd like the write to not have to go to disk every time. And so what can happen is that there's an object in the cache, like an inode, or the data blocks for a directory, that has been updated by an operation but has not yet made its way to disk. So the cache and disk have inconsistent contents. And now if I unplug the machine, those changes are not on disk, and so they're gone. Again, this can always happen, because I always have to do a bunch of these operations at the same time, quote unquote, in order to update the file system safely. But once writes start to lodge in the cache, I have these longer windows of time when there's an inconsistency between what's on disk and what's in the cache, and now I have a bigger target for my power outage. Before, I had to be really lucky to catch the file system right in the middle of doing a couple of disk operations.
Now I've got these long windows of time during which I can unplug the machine and cause some sort of problem. So let's go back to the example and think about what kinds of inconsistency can result if the system is interrupted during this operation, or if one of these things doesn't get done.

So what happens if I allocate the inode but don't complete the rest of these operations? Let's say I get through step one and then the machine crashes. What inconsistency would there be if all I finished was allocating the inode? Yeah, so I have an inode that's incorrectly marked as in use. Now, is there a way to detect this? People are shaking their heads. It turns out there is, right? How can I enumerate every inode that's in use on a particular file system? What do I have to do to do that? It's a good exam question: I give you a file system, I want you to enumerate every inode that is in use, and then I'm going to compare that with my on-disk data structures. So what do you have to do to enumerate every inode on the system? Yeah, I have to walk through the entire directory tree and record the inode for every directory and file. But that'll work. So, how many of you, I don't know, file systems have gotten so much better that it's possible none of you have had to experience this: how many of you have ever had to run a file system consistency check on a machine, one that took a long time? Yeah, so back when I was a youth, back in those days before people had improved these things, it was bad enough when your machine suddenly failed. Your website is getting hit, you've just been Slashdotted and your traffic is skyrocketing and you're super excited, and then the machine goes down, and you're like, oh no. What do you want to do? You want to turn it on and have it come right back up. No, that's not what would happen on those old systems. They would say, oh, it's time to run a file system consistency check; why don't you go on vacation for a couple of days and come back later? Because the disk is really slow, and imagine I have to walk through every directory. I might have to look at every data block on the disk to check these sorts of things. So it's usually possible to fix these things, but we also want to be able to fix them fast. And when we come back later in the lecture to talk about journaling, we'll see why that helps.

Same thing with data blocks, a similar problem. If I mark them in use but don't associate them with a file, then I might leak them. So in both of these cases, what would happen if I didn't fix the problem is that there would just be a dangling inode, one fewer usable inode than I should have, and the same with data blocks. And just like with inodes, if I want to, I can enumerate every data block that is actually in use on the system. And that's even worse than the inode check: it takes even longer, and I have to touch even more things on the system.

So what happens, same thing with, oh, OK, I like this one. Sorry, this should actually go with number four, right? If I don't get the inode associated with the directory, what I have now is a dangling file. I might have a complete file, with an inode and data blocks, but it's not a part of any directory, and it's marked as being present. Now, if I'm deleting a file, it's possible that I could just remove it from the directory and be done; I have a file I can clean up later.
But it's also possible that I have a file with contents in it that I want that's not linked into any directory. Does anybody know how some modern systems handle this problem? Many of you may have seen this directory on your own OS 161 virtual box. It's a special directory with a very appropriate name where the file system puts these: it's called lost+found. So you might wonder, what is the lost+found directory for? It's for cases where the file system has discovered a file that's not linked into any directory. Where should it be put? No idea. So in those cases, the file system throws it into that directory, and it's now your problem to figure out where that file actually belongs. So just in case you were wondering what that directory was for, now you know. And of course, if I don't actually write the data blocks out before the file system crashes, I can have data loss, and that's bad too.

OK. So now let's think about the policy in the cache. Remember, reads are always safe, so the policy for how I cache data really only applies to writes. Suppose I want to keep things as consistent as possible. Again, this is not sufficient to prevent every kind of failure, but if I want to minimize the chance of failure, what policy should I apply in the cache when data blocks are modified? What should I do? I'm not going to repeat it. What am I going to do with that data block? Yeah, so I always push the writes immediately to disk. Now, I should fix that: it's not that I don't buffer the writes at all. I can still change the block in the cache. Why would I, if I'm going to write it to disk anyway, why would I also modify the block in the cache? It seems like extra work. So that you can read it from the cache? Yeah, so it's up to date in the cache when I do a read. That's why I modify the cache even during writes, even if I'm going to write it to disk immediately. The term we use for a cache with this property, where things are immediately pushed to disk, is a write-through cache. You can think of it as the writes just going straight through the cache: they update the cache contents, and then they are immediately queued to the disk. This ensures that modifications to the file system reach the actual disk as soon as possible.

Now, let's say you want your file system to fail. What policy in the cache should you apply in that case? He already knows the name. This is what's called a write-back cache. In this case, you essentially buffer the writes in the cache, without updating the disk, for as long as possible, until one of several things happens. It's possible that the block is evicted from the cache, removed to make space for something else; at that point, you have to write it to disk. It's possible that the file system is unmounted, or the machine powers off safely; in that case, you also have to write things to disk. But the write-back approach essentially produces, remember we talked about this before, there's a period of time whenever I cache writes when the contents in memory differ from the contents on disk. For a write-through cache, that period of time is as short as possible. For a write-back cache, it's as long as possible, and so this creates the largest chance for some sort of failure. And this is probably pretty obvious: for performance, the write-back cache is great. It minimizes the number of actual modifications to the file system, to the disk, that I have to do.
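Here's a minimal sketch of the two policies side by side, assuming a hypothetical buffer-cache entry with a dirty flag and a `disk_write` helper. A real cache also handles read misses, eviction policy, and locking; this just shows where the disk write happens in each case.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

struct cache_entry {
    uint32_t blkno;
    bool     dirty;              /* cache copy is newer than the disk copy */
    char     data[BLOCK_SIZE];
};

/* Hypothetical helpers. */
struct cache_entry *cache_lookup_or_insert(uint32_t blkno);
void disk_write(uint32_t blkno, const void *buf);

/* Write-through: update the cache copy (so later reads are fast),
   then push the block to disk immediately. Short inconsistency window. */
void write_block_write_through(uint32_t blkno, const void *buf)
{
    struct cache_entry *e = cache_lookup_or_insert(blkno);
    memcpy(e->data, buf, BLOCK_SIZE);
    disk_write(blkno, e->data);
    e->dirty = false;
}

/* Write-back: just update the cache copy and mark it dirty. The disk only
   gets updated on eviction, unmount, or an explicit sync, so many writes
   to the same block may end up costing a single disk write. */
void write_block_write_back(uint32_t blkno, const void *buf)
{
    struct cache_entry *e = cache_lookup_or_insert(blkno);
    memcpy(e->data, buf, BLOCK_SIZE);
    e->dirty = true;
}

/* Called for each dirty entry on eviction, unmount, or sync. */
void flush_entry(struct cache_entry *e)
{
    if (e->dirty) {
        disk_write(e->blkno, e->data);
        e->dirty = false;
    }
}
```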
So imagine I have 50 different writes. For a write-through cache, all 50 of those writes generate disk activity. For a write-back cache, there's only one update to the disk, and maybe it even occurs at some point later. So the write-back cache allows me to amortize as much write activity in the cache as possible, whereas the write-through cache doesn't amortize any writes at all.

OK. So of course we can do something that's a bit of a mixture of these approaches, and there are a variety of ways to find a middle ground. One is to immediately write out important file system data structures: things like inodes, directories, and the superblock that's got all that really high-level information. Those I might not want to buffer in the cache at all; those pass immediately through. Data, on the other hand, I might be willing to keep in the cache for a little while. Because it turns out, when we started talking about file systems, we thought, well, the most important thing for a file system is to reliably store data. It turns out that's not exactly true. The most important thing for a file system is to keep its own internal data structures up to date; your data is a little less important. So if you reboot and you get an old version of a file that doesn't have a couple of changes you made, that's probably OK. What matters more is that I preserve the integrity of the entire directory tree, and that I don't have a huge portion of it fall off because I forgot to update one directory. Imagine I move a subtree from one directory to another: it's possible that the whole subtree ends up unlinked from the file system if there's a failure at the wrong moment. That would be bad; you could potentially have a lot of files suddenly go missing. And there's also, to some degree, an API provided by the operating system that allows processes to request that the file system be synced, which flushes all dirty buffers out to disk, or to request that a particular file be flushed to disk.

OK. Any questions up to this point about consistency and write caching policies? OK, so let's finish up today by talking about a different approach to this problem, one that's used in most modern file systems available today. Part of the problem here is that there's something that's not atomic. Remember, when we talked about a write, there were all these different operations to perform. And fundamentally, from the perspective of the file system, what's the underlying primitive here that's not atomic? Yeah. Well, modifying disk blocks. If I think about old spinning disks, with that write I have to change a bunch of different disk blocks, and those disk blocks are located all over the place. So even if I push them all down to the disk at the same time, it's possible that some of them get done and some of them don't, and that's not what I want. On the other hand, what is more atomic from the perspective of the file system? If writing multiple disk blocks is going to require multiple operations, what's the primitive I can use to get some degree of atomicity out of the disk? Yeah, writing a single disk block. So somehow, if I could take these complicated operations that traditional file systems required, modifying a bunch of different things, and compress them into a single operation, and use that, it's possible that I can do a better job of providing consistency guarantees, even in the face of caching.
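As a quick aside on that sync API: from a process's point of view it looks like the standard POSIX calls below, where `fsync` asks that one file's dirty buffers be pushed to disk and `sync` asks for everything system-wide. Error handling is kept minimal here.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "data I really do not want to lose\n";

    int fd = open("important.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        return 1;
    }

    /* After this write, the data may only be sitting in the in-memory buffer cache. */
    if (write(fd, msg, strlen(msg)) < 0) {
        close(fd);
        return 1;
    }

    /* Ask the OS to push this one file's dirty blocks out to disk before we go on. */
    if (fsync(fd) < 0) {
        close(fd);
        return 1;
    }
    close(fd);

    /* sync() requests that all dirty buffers system-wide be written out. */
    sync();
    return 0;
}
```

OK, so back to that observation about single-block writes being the atomic primitive.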
So this leads to an approach that's generally known as journaling. Journaling takes a lot of different forms, but I'm going to give you a high-level overview of how it works. What journaling file systems do is maintain another on-disk data structure called the journal. The journal records pending changes to the file system. So any time I need to perform a number of actions atomically, or, if you've taken database systems or distributed systems, a transaction, I first write those operations to the journal. Following a failure, I can use the journal to quickly figure out which parts of those operations were completed and which parts have not yet been done, and I can use it to roll the system forward quickly to a known good state. Keep in mind our little discussion from before: there are usually ways for a file system to determine, by exhaustive search over the entire disk, that its data structures are consistent. But that's not what I want to do, because following a crash, I want to be able to come up quickly and keep rolling. I don't want the system down for hours while this really slow checking process takes place. So you can think of journaling as an optimization that allows the file system to recover more quickly.

OK, so here's an example. I'm in the process of creating a file, and here's my journal entry for this particular operation. What the journal entry does is tell the system the different things that need to happen for this operation to take place. In this case, I need to allocate an inode, I need to grab some data blocks, and I need to add the inode to a particular directory; I would probably also need to write down what the relative path name is. So that's it. Now, writing this to the journal, which is a separate data structure, does not mean that I've completed all of these operations. It's possible that some of them are still being cached in various places. And so now let's talk about what we do when, OK, sorry, I was confused by the title of the slide. Eventually, what will happen is that all of these operations will actually be propagated to the disk. The file system is keeping track of which parts of the journal entry have been completed. Once everything the journal entry describes is on disk, then I move on, and what I do is I create a checkpoint. The checkpoint is the point in the journal up to which everything is consistent; past the checkpoint, I don't know. All the entries above the checkpoint can really be deleted: they're finished, they're on disk. Anything below the checkpoint, I don't know, I'm not sure; some of it might be done, some of it might not be done. So once all of this is done, then I can put a checkpoint below this entry. Now, in real systems, of course, there are a bunch of different entries in the journal, and I'm completing different parts of them at different times. But the thing to keep in mind is that I can only put a checkpoint below an entry if all of the steps have been completed for all the entries above the checkpoint. Does this make sense? What the checkpoint indicates is where I need to start looking when I'm trying to figure out whether or not the system is in a consistent state.
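To make this a bit more concrete, here's a rough sketch of what an on-disk record for this create operation might look like. The field names and sizes are invented for illustration; real journaling file systems (ext3/ext4's jbd2 layer, for example) use more general block-oriented records, but the idea of a small, self-describing entry that fits in one block is the same.

```c
#include <stdint.h>

#define JOURNAL_MAGIC   0x4a524e4cu   /* "JRNL": lets recovery recognize a valid entry */
#define MAX_NAME        28
#define MAX_DATA_BLOCKS 4

/* One journal entry describing a "create file" operation. It is kept small
   on purpose so the whole record fits in a single disk block and can be
   written (more or less) atomically. */
struct journal_create_entry {
    uint32_t magic;                        /* JOURNAL_MAGIC if this entry is valid */
    uint64_t seq;                          /* position in the journal, used for checkpointing */

    uint32_t new_inum;                     /* inode being allocated, e.g. 567 */
    uint32_t dir_inum;                     /* directory it gets linked into, e.g. 33 */
    char     name[MAX_NAME];               /* relative name within that directory */

    uint32_t nblocks;                      /* how many data blocks were reserved */
    uint32_t data_blocks[MAX_DATA_BLOCKS]; /* which data blocks were reserved */

    uint32_t checksum;                     /* lets recovery detect a torn, partial entry */
};
```

Everything recovery needs in order to re-check the real on-disk structures is in that one block.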
So what I do during recovery: I start at the last checkpoint and go through all the completed journal entries, and I use them to figure out which on-disk data structures I need to check, and what I might need to do to them to bring them up to date. So let's say that this was the entry below my last checkpoint. Here's my checkpoint, and what this entry indicates is that I wanted to allocate a new file. It tells me some information about where the file should be, what it should be called, what directory it was in, again with the relative path name, and what data blocks should be part of it. Now, remember, at this point the system isn't sure which parts of this have been completed and which haven't, so what do I do? I know that everything from here on back is done, but I don't know about this entry. So what do I need to do to process this journal entry and bring the system back up to date? What's the first thing I need to do? Actually, I can do these in multiple orders. Yeah, I redo them. That's it. Well, I don't necessarily have to redo them; I just have to make sure that they're done. So for example, I need to go to my inode allocation structure and make sure that inode 567 is allocated, because it should have been allocated. Then I need to go to the data block data structures and make sure that those data blocks are marked as in use and that they're associated with that file, that the file's own data structures point to those data blocks. Then I need to go and read inode 33 and make sure that inode 33 maps a relative path name to inode 567, which is this new file I was creating. And it's possible, like I said, that some of these operations had completed and been flushed to disk, and others had not. So OK, in this case, I had already done the first step. Oops, I hadn't actually updated the inode's data structures to point to the data blocks, so it's good I wrote that down; now I can do that. Didn't do this one either: the directory, inode 33, didn't have this entry in it. And now at this point I'm all caught up. And so the benefit of journaling from the perspective of recovery is, you know, even if there are hundreds of entries in the journal, it's still way faster than looking through the entire file system and doing those huge, holistic consistency checks. Essentially, the journal allows me to focus on the parts of the file system that might have problems. It identifies exactly which parts of the file system could be inconsistent, so I can focus my efforts there. And when you recover a journaled file system, frequently you don't even notice. A lot of journaling file systems, every time they boot up, check the journal and replay it; it's not even a special operation. And if there was some sort of crash, the recovery happens extremely quickly.

OK. So what about, well, I'll just finish this since we're low on time. The nice thing about journal entries is that they fit inside a data block, and so I can write a journal entry atomically. If for some reason I have a journal entry that spans two data blocks and I find an incomplete journal entry, I have to ignore it, because I don't have enough information to know what the whole operation was. So the nice thing about the journal is that for small operations, like updates to metadata, I can represent them in a way that lets me write them inside a single data block.
And normally I can get a whole journal entry into maybe one or two data blocks that I can write out really quickly. And now I have a nice pseudo-atomic record of what I was trying to do, and I can use that as part of the recovery process. The data blocks, on the other hand, are interesting. The data blocks, by definition, are potentially too big to write out atomically, and there are two solutions to this. One is that we actually do write them into the journal, and that allows us to replay those updates too, but it would require writing every data block twice, which is not fun. The other alternative is that we just never write data blocks to the journal. And what that means, again, is that after recovery, it's possible that your file system data structures are up to date, but the contents of the files may not have been perfectly preserved. So, sorry. OK, so I'm going to skip the little aside. Well, I have time for it. OK, never mind.

So the final thing I'll point out here, that makes this even more interesting, is the fact that the hard disk itself has a cache, as if you thought we didn't have enough caches already. So not only am I caching file system contents in memory, but the hard drive itself is caching blocks. And this can be a problem for any of these consistency approaches. You buy a hard disk, it advertises a certain size cache, and you're like, oh, that's awesome. But how can that create another problem for consistency from the operating system's point of view? Let's say the disk is super clever and has a big fat chunk of memory there, and what that memory allows it to do is accept data from the OS and say, oh, I'm finished with that write. But it actually hasn't finished the write; it's just in the cache. It's still working on getting the heads over to where they need to go, but it reports the write as complete. So what's the problem here? Well, it's not even atomicity. It's just that the operating system now doesn't really know when stuff is actually on disk. Just handing it to the disk isn't good enough, because the disk may have it in its cache, the power may go out, and the operating system may say, I'm pretty sure I wrote that block, and the disk is like, I don't know anything about that block.

There are some really clever ways to address this problem, and this is my favorite one. You might wonder, why does the 1 terabyte hard drive that I buy for my laptop cost $100, and the one that I buy for my server cost $5,000? And the answer is, I really don't know if there's a good reason for that. But the server drive has a few more features, and one of the features it might have is a battery backup system, so that even if you remove power from it, it will still make sure that its cache gets to disk. It has enough battery backup that if the power goes out while there are some writes still cached internally, it can make sure those get to disk before it shuts down. There were also, I think, some drives that actually harvest the rotational energy of the disk platters themselves. I've taken all this effort to spin the drive up, and you can imagine that as the drive spins down, there's a fair amount of rotational energy there. The drive can use that to finish those last couple of operations, even after it loses power.
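From the operating system's side, the usual defense is to issue an explicit cache flush (a write barrier) at exactly the points where ordering matters, for instance between writing a journal entry and checkpointing it. Here's a rough sketch, with `disk_write` and `disk_flush_cache` as hypothetical stand-ins for whatever the block driver actually issues (on ATA drives, something like a FLUSH CACHE command):

```c
#include <stdint.h>

/* Hypothetical driver helpers. */
void disk_write(uint32_t blkno, const void *buf);
void disk_flush_cache(void);   /* returns only once the drive's own cache is on the platters */

/* Commit one journal entry and then checkpoint it, without letting the
   drive's write cache reorder the two steps across a power failure. */
void journal_commit(uint32_t journal_blkno, const void *entry,
                    uint32_t checkpoint_blkno, const void *checkpoint)
{
    disk_write(journal_blkno, entry);
    disk_flush_cache();        /* barrier: the entry is durable before we move on */

    disk_write(checkpoint_blkno, checkpoint);
    disk_flush_cache();        /* barrier: the checkpoint is durable too */
}
```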
OK, so next time we will talk about the Berkeley Fast File System, which is a very old design, but a really canonical and important one that gave birth to a lot of features you're familiar with. And then, for fun, we'll talk about a totally different, radical file system design called the Log-Structured File System. So I'll see you guys on Wednesday.