Thin crowd today, what's going on? Guess I'm just getting more and more boring as the semester goes on. That's all right. That's how it feels up here too. All right, so today we're going to talk about caching. So we've discussed how file systems provide core file functionality: translating path names to inodes, finding stuff based on name, and then finding disk blocks based on an offset into the file. So one of the core things that we do to try to make file systems work well, well, the disk is really slow. I mean, the disk is sort of out in Halley's Comet land slow. If registers are Mercury and main memory is the Earth, well, main memory is probably more like Pluto, and then the disk is way off in some other galaxy. So we try to make the disk look fast. And this is true even for modern disk technologies. I mean, flash is faster in many ways than spinning disks, but still way, way, way slower than main memory, way slower than the caches that are close to the cores, like the L1 cache, and obviously way, way, way slower than registers. So we're going to try to make a slow thing look fast, and we're going to do that in a typical way. And then we'll talk, if we have some time at the end of the day, about journaling, because the tricky thing with using memory as a cache is that it doesn't always interact all that well with our attempts to make sure that the file system stays in a consistent state. Because the more stuff we hold in memory, the fewer things go to disk, and if the system crashes suddenly, that can leave the disk in a place where it's difficult to recover. So we'll talk about how we do that. Okay, so just to understand what's going on for the next few days. So today we're going to try to do caching and consistency, along with journaling. We'll see how far we get. Friday, if we make progress today, we'll do a couple of papers on classic file system designs: FFS, which is really crufty and ancient by now, but which introduced a lot of core file system features, and then LFS, which was a really controversial paper in the file system community and introduced a radically new design for file systems. When we get that done on Friday, on Monday we'll talk about RAID. So for the RAID class, what I'm going to ask you guys to do is actually look through a research paper. I know this is a little scary, but it's a classic paper, one of the classic papers. I don't know how many times it's been cited, but it's probably in the gazillions by now. And then in class we'll talk a little bit about how RAID actually works. We'll go through the paper together. Okay, so what's our classic approach to making a slow thing fast? So the disk is slow, right? The nice thing about the disk is that it's persistent. When I put data there, the data survives from reboot to reboot. It survives when the power is disabled, but disks are also slow. Again, this includes today's flash drives. So what's our canonical technique for making a slow thing look faster? Caching, right? So I put a small, faster thing in front of a big, slow thing. What is the small, fast thing I'm going to use? So when we did this for memory translation, what was the small, fast thing that we used? The TLB. So the TLB acts as a cache of recently used address translations, which means that I don't have to ask the kernel every time I need to translate a virtual address to a physical address. So that makes things a lot faster.
In this case, what is the fast thing that we're going to use? What is the cache here? What's the thing we're going to use as a cache? What's the slow thing? The slow thing is the disk. So what's the cache? Memory. Memory, yeah, memory of various kinds. Also, just to make sure that you understand how complicated this actually is: has anyone ever bought a disk? Ever purchased a disk? Never? Wow, well, you have some exciting experiences ahead of you in your life. Like, I just bought two disks today. Really, no one's ever bought a disk? Okay, good. You guys are just slow today. I get it, okay. You guys are like the disk. I send a command, I gotta wait a while for data to come back. Okay, so I'm gonna make you faster. One of the things you'll notice about disks today, has anyone ever looked at a disk's specifications carefully? What do disks typically have themselves? A cache. So big drives that you buy have their own internal caches. So while the disk is on, the disk has its own little memory cache. And sometimes those caches can be quite large. So it's actually caching data internally. How that cache interacts with the rest of the system is complicated and not something we're gonna talk too much about. But anyway, remember, computer systems are just a series of caches. So there's even one on the disk itself. So we're gonna use a cache. The thing that we use to cache stuff is memory. Now, this is interesting though, right? Because we're using memory. What else do we use memory for? RAM, what else do we use RAM for? I mean, this is kind of weird, right? What else do we use the TLB for? The TLB is used to cache address translations. And what else? Anything else? No, the TLB is exclusively, well, I've said something wrong, I guess. The TLB is exclusively for caching those translations. What else do we use memory for? Come on, you gotta give me something today. I know, it's because it got cold again. It's gonna get warmer, just wait, okay? Come on, we'll get there, it's gonna be May. The trend line is clear, right? There might be some local dips. What else do we use memory for? Like, remember all the stuff we talked about in the last couple of weeks? Process pages, kernel data structures, the heap for processes, code the processes are running, all sorts of other stuff. So modern systems have this interesting balancing act to do, where they have memory and they wanna put that memory to use, but there are two competing uses for the memory, among others, actually, but these are probably the two primary ones. One is to use it for process memory, like process pages, things that would be in my address space, and the other use is as a cache for the file system. So on modern systems, one of the reasons sometimes if you've left your computer on overnight, or you've had it run some sort of really big processing task, like maybe you had it index the entire disk, you come back later and the system is sluggish for a couple of minutes. One of the reasons that sometimes happens is that over time, because you haven't been using it and processes haven't been running, the operating system has decided to take all the memory, or a lot of the memory on the system, and convert it into a cache for the file system, because there's a lot of ongoing file activity.
So when you come back and start using the machine again, even though there's a lot of memory, it doesn't feel like there's a lot of memory for a minute, until the machine sort of catches up with the fact that, oh, okay, there are programs running again, and it starts to move the needle in the other direction. So at runtime, Linux, and I suspect Windows too, are making decisions about how much memory should I use to cache file system contents and how much memory should I use for process pages. And Linux, okay, so let's talk about what happens. So let's say I have a big buffer cache. So the memory that we use to cache file system contents is something we refer to as the buffer cache. This is a cache for the file system. When the file system cache gets really big and I don't have a lot of main memory left to use for other stuff, what happens? What's that? Well, I'm gonna swap more quickly. What is that gonna cause to happen? So file accesses get really fast, because a lot of file accesses hit in the cache, assuming that I have a large enough cache, but programs suffer because programs take higher page fault rates. So this is essentially like running the same set of programs on a machine that has a smaller amount of main memory. The only difference is all the file operations that the programs are doing get faster. On the other hand, what happens if I have a really small buffer cache and a lot of main memory? Yeah, so in this case what happens is that processes run pretty well, page fault rates go down, but every time I need to use a file, those accesses are really slow. And obviously getting the balance here is really tricky. Linux has this, maybe this is gone in newer versions, but at least on many older versions of Linux, they had this parameter called swappiness, which I always thought is a fantastic parameter name, because, like, I don't know. I mean, Linux has very few parameters like this, but this is a number, between zero and 100, that is essentially a hint to Linux to tell it how much it should prefer one use of memory, that being process pages and other address space contents, versus the other use of memory, that being the buffer cache. Linux still reserves the right to make its own decisions about this, but this is sort of a parameter that tells it how aggressive it should be about trimming process pages so it can make more room for the file buffer cache. All right, any questions about this? Does this make sense? I know we haven't really talked about how the buffer cache works yet, but that's what we're gonna start doing. But you can kind of think about it, you know? Yeah? It's zero through 100, an integer. But what was your question? Yeah? No, that's the point. You can't stop it from using memory as a cache, and you clearly can't stop it from using memory to cache address space contents, because you wouldn't be able to run anything, right? So this is just a parameter that kind of gives it a hint about where that balance should be, right? And specifically, I think what it affects is how quickly, so remember we talked about how, over time, when a process hasn't run for a while, the kernel will actually go to the process and try to pull pages back, right? It'll say, okay, I'm trimming your pages to make room for other stuff. Swappiness, if I remember correctly, controls how aggressive the kernel is about doing that.
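(If you're curious, on Linux this knob is exposed through procfs as /proc/sys/vm/swappiness; you can just cat that file from a shell. Here's a minimal sketch, purely for illustration and not something we'll need for the course, that reads the current value from a C program.)

    /* A minimal sketch: reading the current swappiness value on Linux.
       The kernel exposes it as a small integer in /proc/sys/vm/swappiness. */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/sys/vm/swappiness", "r");
        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        int swappiness;
        if (fscanf(f, "%d", &swappiness) == 1)
            printf("vm.swappiness = %d\n", swappiness);
        fclose(f);
        return 0;
    }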
Now, any pages that I recover that way, assuming there's file system activity, can be reused for the buffer cache. So again, some of my students have been doing data processing on their machines, processing a couple terabytes worth of data; if you run really, really heavy file system workloads like that, the best thing to do at that point, as long as no one else is trying to do anything, is to use as much memory for a cache as possible, right? As long as you have a reasonable page fault rate for what's left. But what happens is, after you do that for a few hours or overnight, the system can get very, very sluggish until you start moving the balance the other way. Does that make sense? Cool. All right, so now let's talk about where this cache goes. So caches always have to be somewhere, right? Caches are usually in between the big slow thing and the users of that big slow thing. But when we talk about where to put the file system cache, there's sort of an interesting design question here that controls what the cache sees and doesn't see. Okay, so here's a design, well okay, so this is sort of a diagram that's actually taken from OS 161. So, OS 161, like other operating systems: remember, the file system is fairly decoupled from the operating system. I can have an operating system that is using multiple file systems at once. I was having a conversation with somebody after class and they were asking, well, how does this work when I boot a recovery CD, right? If you boot Windows off a recovery CD, that can clearly read the disk. So one way to think about file systems is that a file system is an agreement between a piece of software that implements the file system and an on-disk data structure. So if I have an on-disk data structure in a part of the disk that corresponds to an ext4 file system, as long as I have software that understands how to read and write that file system, I can modify it, I can take that disk out, I can stick it into another computer and I can modify the files on that computer, right? As long as I have the software required to interpret that file system. That's how thumb drives work, right? Thumb drives typically have a file system called FAT32; it's a really terrible, crappy file system, but it's been around for a long time and people use it because it's very portable. So I can take a thumb drive that has FAT32 on it, I can stick it into a Windows machine, I can stick it into a Linux machine, I can stick it into a PC, and all of those machines have software on them that allows them to interpret that on-disk data structure, make changes to it, find files and stuff like that. Okay, but one of the things that operating systems are required to do to support all of these different file systems is build these translation layers. So the VFS layer, I mean, you guys, does anyone really have an idea about how VFS works? Who wants to be brave and raise their hand? Okay, at least, so when you guys implemented read and write, right? There was something kind of weird about what you had to do there, what was that? How did your read and write system calls work? Hopefully I'm not exposing your solutions by asking this question, right? I mean, I don't think this is all that particular; I would have told you this in office hours. So what are the corresponding calls that read and write make in OS 161 to the VFS layer? I mean, basically read and write collect a little bit of information and then make a call to what? VOP_READ. VOP_READ, is VOP_READ a function?
No, it's a macro. Does anyone know what it does? Yeah. Yes, ooh, C function pointers. Anyway, does anyone know what a C function pointer is? Has anyone done any programming with C function pointers? Yeah, you don't need C++, man. You can do everything C++ does with C function pointers and a huge amount of bravery and stupidity, okay? So this is about as close as C gets to object orientation, right? Namely, I can have a function pointer, and I don't remember exactly how they work internally, to be honest with you, but I think it points to a place in memory and I can make function calls through it. So, given this diagram, why do I use a function pointer to do these reads and writes? What does this allow? What's actually happening here? Yeah, exactly. So this, again, is as close as C gets to object orientation. So the VFS layer defines an interface that the file systems allowed to run on OS 161 are required to implement. When you make a call through that macro, what happens, in a simplified form, is that the macro uses the vnode to figure out what kind of file system this file lives on, and then it passes the call down to the right file system. So up until this point, you guys have been using the emufs file system, which allows you to access files on your local drive. Had we made you do assignment four (you guys are doing pretty well this year, so it's not too late, maybe we'll add that toward the end of the year), you would have had a chance to work with the simple file system, which is implemented entirely inside OS 161 and uses a virtual disk. But this is, so this is the idea. Essentially, again, remember, everything comes back to interfaces. The VFS layer provides this interface; to implement a file system on OS 161, you have to implement most or all of that interface, right? You certainly have to implement things like read and write and stuff like that. Does this make sense? So this is how this works internally. So when you make the VOP_READ or VOP_WRITE call, depending on the information that's stored in the vnode that you pass in, it could be calling an SFS function that does the read or it could be calling an emufs function that does the read, and I can add an arbitrarily large number of additional file system implementations, as long as they conform to the interface that the VFS layer provides. Okay, we're never gonna get through today's lecture because I just talked too much about that. Okay, so here's one option for where to put the buffer cache. So in this case, the buffer cache sits directly below the VFS layer. Remember, I have this VFS pass-through layer that aggregates all the calls to the different underlying file systems, and so this represents what we call in systems a narrow waist. This is one place where I can put a cache, and the cache then sees a lot of the operations that it needs to see in order to do things. So that's one place, and the other place I can put it is down here at the bottom. So all of the file systems, this is another place where these file systems come together, all the file systems on the system that I care about caching data for; there might be network-based file systems that have their own issues. But there are two pieces of commonality to all these pieces of file system software. The first one is they have to support this VFS interface. The second one is they read and write blocks to a disk. Does that make sense?
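(As a quick illustration of what "supporting the VFS interface" means mechanically, here's a simplified sketch of the function-pointer dispatch behind a macro like VOP_READ. This is just the shape of the idea, not OS 161's real declarations; the struct and field names here are illustrative.)

    /* Simplified sketch of VFS-style dispatch using C function pointers. */
    #include <stddef.h>
    #include <sys/types.h>

    struct vnode;   /* forward declaration */

    struct vnode_ops {
        int (*vop_read)(struct vnode *vn, void *buf, size_t len, off_t offset);
        int (*vop_write)(struct vnode *vn, const void *buf, size_t len, off_t offset);
    };

    struct vnode {
        const struct vnode_ops *vn_ops;  /* filled in by whichever file system owns this vnode */
        void *vn_fsdata;                 /* file-system-specific state */
    };

    /* The "macro" just dispatches through the table: */
    #define VOP_READ(vn, buf, len, off) \
        ((vn)->vn_ops->vop_read((vn), (buf), (len), (off)))

    /* An SFS vnode would point vn_ops at a table whose vop_read is an SFS
       read routine; an emufs vnode would point at an emufs one. The caller
       never has to know which file system it's talking to. */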
What happens in the middle is completely specific to each file system, but all the file systems support a common set of operations listed up here, and they're implemented by modifying disk blocks. So they use the disk block interface. So there are two options for where I can put the buffer cache: I can put it underneath the VFS layer, where it has to support that interface, or down here at the bottom. Okay, so if I put it underneath the VFS layer, then the buffer cache has to support those operations. So the other thing to think about when you implement a cache is: what are the operations the cache has to support? So what's gonna happen here is that before I pass these calls down to the corresponding file system software, I'm going to allow the buffer cache to see the call and potentially return cached information. Does that make sense? So some of those calls are not gonna make it down to the file system. They're gonna hit the buffer cache and return immediately, because the buffer cache is gonna say, hey, I've got the contents of that file in memory right here, I don't need the file system software to go to disk. So if I put the buffer cache at the top, the buffer cache can actually cache entire files and directories. That's all it sees. That's interesting. And the interface I have to support is the same as the VFS layer's, or at least part of it. Maybe I don't support open, maybe I don't support close, but probably read and write. Those are the heavy hitters. Those are the areas where I'm actually gonna have some win by returning data from the cache, yeah. Yeah, I mean, you probably need a configuration parameter so that when you mount a file system, it can avoid these caches, right? Probably the other way to do this is to actually have the cache here in a file-system-specific way. That's another way to do it, right? So rather than putting it right here, I put it at the very beginning of the file system's own operations, right? That's another way to do it, yeah. There are some file systems that can't be cached, either because they're not real file systems, or they have semantics that the cache doesn't understand, or whatever. Yeah, so that's a good point. Okay, questions. Did someone have a question? Or am I hearing things again? Okay, so let's give an example of how this would work. So opens, I need to pass down to the underlying file system. Reads, I look for the contents of the read in the cache. So I have a file system, I have an inode, and I have a spot in the file, and that should be enough information to identify the block in the cache, if it exists, and return it if I have it. If not, then I pass it down to the underlying file system. Writes: if the file isn't in the buffer cache, I can pass the write down. And I also wanna load things into the buffer cache at this point. So file system caches are effective at absorbing both read and write traffic. But when I see a write to a data block that's not in the cache, it's frequently useful to hoist that data block into the cache so that future accesses hit the cache. Does that make sense? So when I see a write, I read the block from disk, make the modifications in memory, and then write it back immediately, or wait. And we'll talk about the policies that determine how long I wait to write things. Close, I can flush things from the cache when files are closed. Maybe I do, maybe I don't, if I need space. Okay, this doesn't make sense. I don't know why the slides are being weird.
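(To make the "cache above the file systems" option concrete, here's a rough sketch of what the read path just described might look like, with the cache keyed by (file, offset). None of this is OS 161 code; cache_lookup, cache_insert, and BLOCK_SIZE are hypothetical helpers, and VOP_READ stands in for passing the call down to the underlying file system. It also ignores reads that span more than one block.)

    /* Sketch of a read path for a cache that sits above the file systems. */
    #include <stdbool.h>
    #include <string.h>

    struct cached_block {
        struct vnode *file;       /* which file this block belongs to */
        off_t offset;             /* block-aligned offset within the file */
        char data[BLOCK_SIZE];    /* BLOCK_SIZE: hypothetical constant */
        bool dirty;
    };

    int cached_read(struct vnode *file, off_t offset, void *buf, size_t len) {
        struct cached_block *b = cache_lookup(file, offset);   /* hypothetical */
        if (b != NULL) {
            /* Hit: serve the read from memory, never touch the file system. */
            memcpy(buf, b->data + (offset % BLOCK_SIZE), len);
            return 0;
        }
        /* Miss: pass the call down to whichever file system owns the file,
           then remember the result so future reads hit. */
        int err = VOP_READ(file, buf, len, offset);
        if (err == 0)
            cache_insert(file, offset, buf);                    /* hypothetical */
        return err;
    }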
So the nice thing about this is that this file system cache has more semantic information about what's going on. When we get down to the next alternative, keep this in mind. So I see a read as a read, not just as a random group of data blocks that are being accessed. There are a couple of interesting cons here. So one problem with this approach is that there's a lot of information I never see. Remember all that stuff on disks that's not file contents? The inodes, data structures, stuff like that. We talked a few classes ago about how some of that stuff might be a really, really great candidate for caching. Inodes, for example. Inodes get read all the time, particularly the root inode. Why does the root inode get read so much? Constantly, yeah. Every path name starts at the root. So the root and probably the next level of directories under the root, those inodes get read and those directory files get read constantly. So I want those in the cache. In fact, it probably makes sense to just cache those immediately as soon as the file system boots. They're also very, very rarely modified. I mean, when's the last time you added a directory to root? You can, by the way. You can put your own stuff in there if you want to. You're not gonna get in trouble. Unless you're trying to do it on a machine that's not yours. Anyway, so yeah. So that's stuff I want in the cache. And in this approach, I can't cache that stuff, because I never see that information. Remember, when I call open, there's all sorts of metadata that the file system is using to translate the path to the inode. But I never see any of that if I'm caching stuff at the top, because all that activity goes on between the file system and the disk itself, using the disk block interface. So this is the problem with this approach. There's another issue, which is that there are certain consistency guarantees that are hard to provide up here, because of the calls that I'm actually seeing. Okay. So let's think about the other option. So the other place to put the cache is below the file system. So what do I cache here? I know, I should have pulled the diagram back. If I put it below the file system, what are the objects in the cache? Yeah. So I can cache anything that's on the disk, but what are the objects in the cache? What do I name them? They're disk blocks. So above the file system, I have to think about files, names, locations. Below the file system, it's just disk blocks. If I see a disk block accessed, I pull it into the cache and I name it based on the disk block number it came from. At some level, this is a little bit simpler. What's the buffer cache interface? Above the file system, I had to think about supporting open, close, read, write, whatever; what do I support here? Read block, write block, that's it. There's no open, there's no close. So at some level, I've lost information about what's going on. So rather than understanding that a series of disk accesses corresponds to a read, all I see is a bunch of read blocks and write blocks. I have no idea what caused them, okay? The nice thing about this, as Ashish pointed out, is I can cache everything here. Everything that's on disk can be cached. That includes all the file system metadata, and a lot of that metadata is really valuable to cache. And this is one of the reasons why it's done this way frequently.
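(And here's the same kind of rough sketch for this option: a cache that sits below the file systems and is keyed purely by disk block number. Again, just the shape of it; bcache_lookup, bcache_evict_and_reuse, disk_read, and BLOCK_SIZE are hypothetical names, not a real driver API.)

    /* Sketch of a block-level buffer cache: the whole interface is just
       read-block and write-block, keyed by block number. */
    #include <stdbool.h>
    #include <string.h>

    struct buf {
        unsigned blocknum;
        bool dirty;
        char data[BLOCK_SIZE];    /* BLOCK_SIZE: hypothetical constant */
    };

    int buffer_read_block(unsigned blocknum, void *out) {
        struct buf *b = bcache_lookup(blocknum);      /* hypothetical hash lookup */
        if (b == NULL) {
            b = bcache_evict_and_reuse();             /* may write back a dirty victim first */
            disk_read(blocknum, b->data);             /* hypothetical disk driver call */
            b->blocknum = blocknum;
            b->dirty = false;
        }
        memcpy(out, b->data, BLOCK_SIZE);
        return 0;
    }

    int buffer_write_block(unsigned blocknum, const void *in) {
        struct buf *b = bcache_lookup(blocknum);
        if (b == NULL) {
            b = bcache_evict_and_reuse();
            b->blocknum = blocknum;
        }
        memcpy(b->data, in, BLOCK_SIZE);
        b->dirty = true;   /* when this actually reaches the disk is the policy question below */
        return 0;
    }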
And it also allows the file systems to see operations. So even if I'm gonna cache the result of a read, the file system still sees that the read took place, because the operation hit the file system implementation first before it got to the cache. And that's good in certain cases. That gives file systems some visibility that they need in order to provide certain consistency guarantees, which is good. The cons are that I've lost information at this point. So at this point, some of the semantics of things are missing. And this is a really common trade-off. So again, I mean, why are we talking about file systems? File systems are boring. They're not evolving very rapidly. But this trade-off is the same when we talk about web API requests in a data center. When the request comes in, I have a lot of semantic information about the request and what it's trying to do. And by the time it gets down to sending packets between two machines or doing local disk operations, that's gone. I mean, every time you load mail from Gmail, that corresponds to some number of network operations and disk operations and other things that go on in Google's data centers. But as that request goes down the stack, information is being lost along the way. I know the most at the top and I know the least at the bottom. Okay. And this, caching below the file system, is what modern operating systems do, for a variety of reasons, partly because it's so valuable to cache metadata. You really want to cache metadata. There's a huge win to caching inodes and other on-disk data structures. Okay. Oh, we're not gonna do review. Yeah, sorry. Okay. How did this get in here? Okay. All right. Okay, here we go. All right, any questions about this stuff before we go on? Okay, so let's talk about the cache. So now, okay, so this seems simple. It used to, yeah, I don't know. I don't really miss assignment four that much, but I do miss the buffer cache part of assignment four. We used to have you guys implement a buffer cache for the simple file system. The cool thing about it is, has anyone had that experience with an algorithm, where you implement an algorithm and it makes something go much faster than it did before? You know, before you were doing something dumb. That's kind of cool, right? Only one person has had that experience? Okay, good, thank you. I was about to say, we need to change how we teach algorithms. Buffer caches are kind of the same way, right? Especially because the OS 161 disk is so slow, so it's like, whoa, that thing suddenly really worked. Yeah, caching is good, caching works. And this is a good place where you can really see the effect of it. All right, let's talk about consistency. So the first thing to realize is that there's sort of a direct trade-off between performance and consistency when it comes to file systems. How does caching, how does memory caching cause consistency problems, at a basic level? Any write you do is stored in the cache, but you haven't been writing it to the disk, and sometimes it takes a long time before you decide to write it to the disk. Right, so caching, particularly caching writes. Can caching reads cause consistency problems? It's hard to make an argument that it can, right? I'm sure there's some weird case in which it can, but reads are pretty safe. So caching reads, allowing reads to return from memory caches, is usually a pretty easy win. Writes, on the other hand, are problematic.
When I hold a write in the cache, so when a write hits the cache and then doesn't go to disk immediately, what happens is there's a period of time in which the contents of memory and the contents of the disk are out of sync. So memory has information that hasn't made its way to disk yet. So why is this a problem? Yeah, yeah, so if there's a power outage, someone trips over a cable or something crashes, the machine reboots, whatever. I mean, on a reboot you're gonna flush these caches, but if there's some sort of unexpected event that causes the data in memory never to make its way to disk, then it's possible that the disk is corrupted in some way. And so the big goal here, I mean, in certain cases I can just make everything go to disk, but the goal here is not necessarily to make this problem go away entirely, because we want to hold writes in the cache; that improves performance a great amount. The goal here for modern file systems is usually to make sure that even if there's some data loss, the file system's data structures are consistent. So for example, it's okay to lose a few of the modifications you made to your OS 161 source code. It's not okay to lose inodes, right? Or to lose a whole directory, or to have bitmaps that get corrupted so that I think that some of the file system is in use when it isn't, or to have files vanish, right? Like, if I'm in the middle of moving a file from one place to another, if the system crashes, it's okay for it to be in the first location, it's okay for it to be in the second location. It's probably even preferable for it to be in both locations, although that's a little weird, right, and we might need to fix that, but it's not okay for it to be gone, right? It shouldn't vanish. So these are the kinds of consistency problems that we try to prevent. The other thing, and the thing that makes this difficult, which I forgot to mention, is that, remember, going back to reads and writes and other types of file operations, these touch a lot of blocks. You know, if reads and writes were typically confined to modifying one block on the disk, that might be okay, but I have to modify a bunch of blocks. And so now, maybe when the system crashes, some of those modifications have been written to disk and some haven't. And making sure that things go out to disk in consistent groups is another thing that we're trying to do here. So that either the entire operation happens or didn't happen, but it doesn't kind of half-happen. Like, I'm not halfway in between. So yeah, we talked about this before. So in this process, if I fail here, for example, then I have a case where I have data blocks that are marked as in use that are actually not linked to any file. And so the capacity of my disk went down a little bit. Okay, so as I just pointed out, anything that modifies multiple blocks can potentially leave the system in an inconsistent state if a crash occurs and not all of the modifications make their way to disk. And this can happen even without caches, right? I mean, even without a cache, I still have to make a bunch of different changes to disk blocks. So on old disks, it was like: change block one, run over here, change block two, run over here, change block three, oops, crash. And suddenly I've only made three of the six modifications that I wanted to make. So this can happen even without a cache. I don't want to make caches the sole culprit here. But caching exacerbates the situation by increasing the time span.
So think about there being a window of time where the failure has to happen. So if I fail at just the wrong moment, I can always cause some sort of consistency problem. But as I start to hold objects in the cache and delay modifications to disk, I lengthen that period of time. So if I hold things in the cache for long, long periods of time, there are long intervals where the disk is out of sync. And if I'm not careful about how I do that, any crash within that interval will cause a problem. Okay, do you guys want to go over this? We went over this before. This is sort of what can happen in each one of these cases if there's a failure. So this is a write, right? A write to a new file. So I have to find an inode. If I crash and this doesn't make it to disk, then there's an inode that's incorrectly marked in use, which means I'm down one inode. If I crash after updating the data blocks but without updating the directory, I have some dangling data blocks. This is a fun one. This is how stuff ends up in lost+found, right? So I have an inode that points to data blocks, but that inode actually hasn't been linked into a directory yet. So I have a file that exists nowhere, right? So that's not good. And then if I don't write the data blocks, I have some data loss, right? But again, from the perspective of the file system, I care a lot more about consistency than I do about data loss. Okay, any questions? Yeah. What if you do, like, a check of the disk afterwards? Yeah, I mean, this is what fsck does. Has anyone ever run fsck before? Yeah, so there are programs that can take a file system, remember, this is just an on-disk data structure, and as long as I understand the semantics of it, I can check it for consistency. So for example, I can go through all the directories on the system. Let's say that I want to enumerate every inode that should be in use. I basically go through all the directories and I find all the inodes that get linked to. And so now I have this set of inodes that I know are in use, and I can compare that with the set of inodes that are marked in use in the inode bitmaps, right? Now, this operation, fsck, I'm going through the entire disk, doing all these consistency checks. Does this sound like something that's fast? Yeah, so another goal of modern file system designs was: look, stuff's gonna fail. Who knows why? Doesn't matter. The goal here is not to never fail, because that's impossible. The goal is to recover quickly after failures. So again, let's say you're working at Amazon and their entire data center goes down, okay? You go in and you push a button and you turn it on again. Like, someone just flipped the data center off switch, right? So you just go turn it on again, okay? I'm cool. That sounds good, right? Like, the data center's on again. Except if I have to run fsck on every machine, don't worry, it's gonna be on CNN, right? At that point, days go by, you know? So outages are not the issue, right? Outages that can be recovered from quickly are fine. Imagine that your system could reboot instantaneously. Would you notice blue screens? It'd be like a little flicker: blue screen, whoop, there it is again, right? Who cares? So coming up quickly after failures is really important. That kind of thing that we just talked about, where I have to enumerate an entire disk and do these huge holistic consistency checks, those things are terrible. They take forever, they're slow. Are they good?
Yeah, I mean, I can do a really, really hard check of the entire disk that way. Do I wanna do that when the system has already crashed and I've got customers yelling at me because, you know, my uptime's going down? I'm losing those nines. I had six nines, now I have five nines, another hour goes by, I have four nines. Yeah, no, I don't wanna do that, right? Okay, so the safest approach here is, remember, we have this problem with writes. Writes are what cause this problem. Reads are not as big of a deal. So the safest approach is simple: don't buffer writes in the cache. Does this mean that my cache is completely ineffective? What's that? Well, okay, let me ask you a different question. In this case, why do I keep writes in the cache at all? What's the difference between writing back writes immediately, syncing them to disk as soon as they happen, and not caching writes at all? So what happens if I don't cache writes at all? Yeah. Okay, so I think we're moving in the right direction here. So if I modify the block in the cache and then write the contents to disk right away, what other operations does that help? Yeah, Steven. Well, remember, every modification is gonna go to disk right away. So this is not helping writes. There's one other operation that it could have possibly helped, yeah. Yeah, future reads of that disk block can hit in the cache, right? If I don't ever buffer written blocks in the cache at all, it means that reads don't benefit from those modifications. So modifying in the cache and then writing it to disk means that if I have 50 reads that follow that, they get the latest data from the cache without having to go to disk. So would you keep the writes in the cache? Yeah, that's what I'm saying, right? So, and I'm also gonna pull blocks into the cache even if they're being written, right? So when I see a write to a block, if it's in the cache, I modify it and make sure that modification gets to disk immediately. If it's not in the cache, I read it into the cache, modify it, and then write the modified result back. But I wanna keep writes in the cache because it helps reads. Now, in this case, this is called a write-through cache because, when you think about it, writes essentially pass right through the cache. Writes pass "write" through the cache; that's the fun thing to say. So there's no buffering of writes in the cache; they immediately go to disk. Now, on the other hand, so this is the safest approach because, remember, essentially what we're fiddling with here is the delay between when things get to the cache and when they actually get to disk. That delay is what creates the potential for consistency problems. So if the safest thing to do is to write blocks immediately, what's the least safe thing to do? Okay, so I have to write them at some point. That's always my favorite answer: just never write them. Yeah, I mean, that's an interesting approach. That would be, like, awesome: I bought this disk but I can never write anything to it. That would be a very frustrating disk. Do you guys remember this? It was like a few years ago, there was some scam where people sold these, did you guys hear about this? They sold these flash drives and they said they were like a terabyte or something. They're like a thumb drive. Maybe they didn't say that because that's too stupid. No one would be like, oh, a $2 terabyte thumb drive. Actually, some people would, right?
But anyway, so they sold these thumb drives and they claimed the capacity was like a terabyte. The capacity was actually only a gigabyte. But they were clever, right? So what they would do is they would store your first gigabyte, right? And then I think they were even more clever, in that they would keep storing new stuff but they would just start throwing out crap, right? So imagine you have a disk that you think is a terabyte but it actually only has the latest gigabyte of stuff. That was this flash drive. Anyway, they made some money out of it because, again, they kind of worked. That was the clever thing, right? They worked at first. So that gave them like a few months to sell these things before people realized, wait, hold on, this thing actually doesn't store any data, or it doesn't store more than a certain amount. Okay, so I do need to write them at some point, but what's the longest I can take before writing them? Yeah. When they get evicted. So a write-back cache buffers all operations as long as it can. So until I have to write things, at some point when I remove the block from the cache. This is like removing a page from memory. So a good mental model for a buffer cache is your memory for processes. When I pull it in and make modifications to it, I write it out later. I write it out to swap, right? So in this case, I keep all the operations until I need that block for something else, and then I sync it with disk before I evict it, right? And obviously, for performance, the approach that's gonna minimize the number of writes is this one, as long as I do a good job of choosing what's in the cache. The one that's gonna be the safest is the first one. And so normally we do something in the middle, and there's a variety of ways of splitting the difference between the two of these. So one way is to try to make sure that I immediately write out critical file system metadata. So any changes to on-disk data structures like in-use bitmaps, inode tables, whatever, those get written immediately, because I wanna make sure that those are on disk. Modifications to data blocks that have file content in them, those I can wait on, no big deal, right? Again, a little bit of data loss, not a huge problem. Having serious consistency problems on my file system because I didn't write out key data structures is a problem. And typically file systems also provide a way for processes to expose some of their semantics. So one of the things that gets interesting here is that different programs that are using the file system have their own consistency guarantees that they're trying to provide. And I can support that by making it possible, through the system call interface, to ask the file system (and I'm pretty sure this is always a request, not a demand) to ensure that the content for a particular file is on disk. So this is something called sync. And I think, well, sync is supposed to sync the entire file system; fsync allows me to sync one file. So it's a process saying: dear file system, I would like you to make sure that all of the content for this file is on disk. Thank you. And of course there's overhead to doing this. All right, any questions about this? So this is the role of caching here. Okay, we're doing good.
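(Here's what that request looks like from a process's point of view; a minimal sketch using the standard POSIX calls, where write() may only land in the buffer cache and fsync() asks the kernel to push that file's contents out to disk. The function name save_important is made up for the example.)

    /* Minimal sketch: ask that a particular file's contents reach the disk. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int save_important(const char *path, const char *data) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, data, strlen(data)) < 0) { close(fd); return -1; }
        if (fsync(fd) < 0) { close(fd); return -1; }   /* blocks until the data is on disk */
        return close(fd);
    }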
Okay, so here's another sort of fun, modern technique; these are tools that have been incorporated into modern file systems. A lot of these ideas sort of emerged in particular file systems and then spread to be used by a bunch of different file systems, because they're pretty useful. So let's think about a different way to try to make file system operations atomic. So part of the problem that we're dealing with here, and this is actually a bigger problem than I've identified, because remember, the disk has its own memory cache? If I'm trying to write multiple disk blocks, I may not know when exactly they get to disk, because they may be wedged in the disk's on-drive cache. So rather than trying to make multiple disk operations atomic, on some level, if writing multiple disk blocks is not atomic, if I can't be sure that all of these are on disk, what could be atomic? So again, writing multiple disk blocks requires running all over the place, and it's very possible that I can get interrupted halfway through, something can crash, whatever. In contrast, how do I make sure that a modification to a disk block either happens or doesn't happen? Writing one disk block. So I write one disk block, and as long as the disk has a way to tell me when that's done: if I fail before I get that write done, it didn't happen; if I fail after, it did happen. So this is atomic, based on the definition we were using earlier in the semester. So this suggests the technique that's known as journaling. So here's how journaling is going to work. When we make modifications to the file system, we're going to write down, in a very specific spot, what modifications are required to various file system data structures. That then allows us to keep some of those modifications in the cache. When the modifications are actually made, we're gonna sort of check things off of the list, and then periodically we're gonna update the journal to make sure it reflects all the operations that have actually made it to disk. Let's walk through an example. So, again, there's a new file system data structure known as the journal. The journal tracks pending changes to the disk. So one of the goals of journaling is to make it very fast to recover from failures, because what the journal allows me to do is, when the file system recovers from a failure, or when the machine reboots and the file system is reloading itself, it can use the journal to figure out what it was in the middle of doing and identify incomplete operations that it might need to complete or, in certain cases, undo. So let's walk through an example of this. So this is the create; this is sort of creating and writing data to a new file. So here's, in kind of a silly format, what I would write in the journal. Obviously there are data structures for doing this. So I put this in the journal, which is going to be on disk. I say: I'm going to allocate this inode, and I need to be very specific about what I'm doing, I'm allocating an inode for a new file. I've got some data blocks that I've allocated to that inode to store the contents. I'm going to put the inode into a particular directory, and I would have to write down both the name and the inode number in that directory, and that's it. So this is a single journal entry that sort of reflects creating a file and associating some new content with it. Does that make sense? Okay. Now, as these changes get pushed to disk, what the file system is doing is updating the journal.
So it's keeping track of journal entries that have been completed, and when everything up to a certain journal entry has been completed, we refer to that as a checkpoint. So that means that the file system at that point knows that all those changes have actually made their way to disk. So for example, now, these operations can be performed in the cache. Over time, eventually, once I know that these changes have been written back to the actual disk, I can check off these things. So once inode 567 has been updated in the inode table, once the data blocks have been associated with that inode, once the directory for inode 33 has been updated with the proper name and everything, I can cross these things off and I can essentially remove this entry from the journal, because now I know that all of these changes are actually on disk. Does that make sense? Yeah. Gosh, I have no idea. Yeah, I mean, potentially, I mean, journaling, okay, so that's a great point. Journaling and checkpointing is not an idea that's confined to file systems anymore. That's a great point. You can really apply this idea to anything. What it means is you maintain a compact data structure that's easy to write out. So remember, the journal's on disk. The difference here is that the journal is written in one place, and so updating the journal can be done by updating single data blocks one at a time, whereas all these changes are gonna be all over the place on disk. And so this sort of allows me to get the benefits of having a cache where I hold objects longer, without the consistency problems. Okay, let me go through this example and then we'll be done. So, on recovery: let's say I am loading the file system, and this happens every time you mount a file system that uses a journal. They never assume that they shut down cleanly. They're always gonna look at the journal that they were maintaining from the last time they were mounted. So when you mount the file system, what it'll do is it looks through the journal, and if there are journal entries that haven't been checkpointed, it's gonna update the on-disk data structures to complete those operations. So any journal entry that's left after the last checkpoint means that there were some changes that I needed to make that did not make it to disk, or I'm not sure they made it to disk, because maybe I rebooted or I crashed before I could update the journal. So in this case, for example, let's say I have this entry in my journal. This entry is after my last checkpoint. The nice thing is it's very fast to confirm that I did these things. So for example, I can look and see if inode 567 is allocated. If it is, cool, I must have done that already, okay? I can look to make sure that these data blocks are associated with the inode. In this case, maybe that operation didn't make it to disk, and so now I can make sure that it does. So now I update the on-disk data structures appropriately. Again, same thing here: I need to associate this inode with the directory so that the file is linked into the rest of the directory tree. And then I'm done. So by using the journal entries that are still marked as uncheckpointed, the file system can identify precisely the operations that need to be completed to bring the file system into a consistent state. Does this make sense? The nice thing about this is it confines the amount of work that I have to do when I reboot to a very small number of operations.
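(As a rough sketch of what a journal entry for that create-and-write might hold, and what the replay at mount time does with uncheckpointed entries, here's the shape of it in C. The struct layout and the helpers, inode_is_allocated, allocate_inode, ensure_blocks_assigned, and ensure_dir_entry, are all made up for illustration; real journals record this in a much more general form.)

    /* Sketch of a journal entry for "create a file and write some data",
       plus the replay loop run at mount time. All names are illustrative. */
    #include <stdbool.h>

    struct journal_entry {
        unsigned inode_num;        /* e.g. 567: the inode being allocated */
        unsigned data_blocks[4];   /* data blocks assigned to that inode */
        unsigned dir_inode;        /* e.g. 33: directory to link the file into */
        char     name[64];         /* name to record in that directory */
        bool     checkpointed;     /* true once everything above is known to be on disk */
    };

    void replay_journal(struct journal_entry *log, int nentries) {
        for (int i = 0; i < nentries; i++) {
            if (log[i].checkpointed)
                continue;                       /* fully on disk already, nothing to do */
            /* Each step checks whether the change already made it to disk
               and redoes it only if it didn't, so replay is safe to repeat. */
            if (!inode_is_allocated(log[i].inode_num))
                allocate_inode(log[i].inode_num);
            ensure_blocks_assigned(log[i].inode_num, log[i].data_blocks);
            ensure_dir_entry(log[i].dir_inode, log[i].name, log[i].inode_num);
        }
    }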
So rather than having to check the consistency of the entire file system, like we were talking about before, I can just make sure that I complete any uncheckpointed journal entries. Okay, we'll start with this on Friday and review it again, and then talk about FFS and LFS, maybe.