OK. Thin crowd today. I guess there's an assignment due or something. These must be the people who are done. Actually, there's a lot more of you than I thought. Maybe the rest are sleeping. All right, so today we're going to finish up talking about general file system features. We'll talk about caching and consistency. If you remember from earlier, there's this tension between making file systems fast, which involves storing their contents in memory, and ensuring that they're consistent after crashes and other kinds of failures. Those two goals, which are both important, are in direct conflict with each other. So we'll start off talking about how we make file systems fast by throwing memory at the problem, and later today, or maybe Monday, we'll talk about some consistency approaches to making sure they can survive various types of faults. OK, so we're working on getting the mid-semester grades together. Those will include assignment one and all the assignment twos, up to later this afternoon when I rerun the script. They will probably not include assignment three, or at least not the first part, though maybe I'll go back on that, mainly because technically it's due at 5 PM. And then there are the mid-term grades, which Ali is almost done with. You can tell Ali's been growing his beard out; he doesn't even have time to shave, he's been so busy reading your papers. We should have the mid-terms back Monday or Tuesday next week for you guys to pick up. I'll probably be sending emails to people I'm concerned about, particularly if you haven't turned in any assignments at this point. I already emailed the people who didn't show up for the mid-term, committing that particular type of academic suicide. So if you get an email from me, it means we're concerned about your performance in the class. I know people are still working on assignment two.
There are still some assignment two submissions coming in, and there were definitely cases where people did fairly well after the deadline. But anyway, OK. So we're getting to the point in the semester, and I like this point in the semester, where I'm going to start weaving in some papers. I know this is a frightening idea: reading. And not only reading, which is itself a bit of a challenge maybe, but reading potentially dry technical papers about various things. But we'll try to make this as fun as possible, and I think you guys will actually enjoy it. It gives us a chance to talk about some contemporary topics, and at this point you've built up a bit of a library of knowledge about these systems, so you'll be able to appreciate some of these ideas. Certain things, like RAID, which we'll talk about next week, are so well-established by now that you probably understand a little bit about them already. Toward the end of the semester, when we start talking about some other things, we'll get into ideas that you're probably not as familiar with. So what we'll do today is get through as much of the material as we can. Monday, or maybe Monday and Wednesday, we'll talk about a couple of canonical file system designs. We'll look at FFS, which is a very old, crufty file system from back in the days when we really had to worry a lot about disk geometry and other features of spinning disks. And then we'll look at something called log-structured file systems, which are a pretty different, fairly radical approach. That should be kind of fun. And then later in the week, Wednesday or Friday, we'll start talking about RAID. When we talk about RAID, I'll actually assign you some written material to look at, maybe the original RAID paper, which is super ancient by now. I mean, the copies have clearly been scanned five or six times.
But it's a cool paper. I think it's neat to see, because there are not very many research papers that give birth to an entire billion-dollar industry. So that's kind of a fun thing to look at. OK, any questions about the plan going forward? Obviously, the first part of assignment three is due today. The next part is due two weeks from today. Please don't stop. People are doing really well on assignment three so far, so just keep going. Don't be like, oh, great, let's take a week off, because the other parts are potentially more challenging. But not that bad; I think you guys will do OK. All right, so we talked about file systems. We talked about on-disk data structures. The problem remains that the disk is a slow device. So what is our one-size-fits-all strategy for making slow things look fast? Somebody already has their hand up, yeah. Use a cache, right? What is a cache? Who can define a cache in the most general way possible? Yeah? What's that? Fast is a good word to use. Is a cache necessarily memory? What's that? Oh, one lookup. Oh, man. I love these answers. Not necessarily. What is a cache more generally? Yeah? It's a small amount of faster stuff that I use to hold a partial view of the contents of a larger, slower thing. Look, registers on the CPU are, in their own way, a cache. The L1 cache is obviously a cache; it's in the name. But caches don't have to be memory. A lot of times they are memory, or some form of memory, and you might call them memory or whatever. But a cache is just a smaller thing that's faster. The TLB is a cache, right? It's not really memory; it has different properties. More generally, a cache takes a small piece of something that's fast and uses it to hold part of the contents of a bigger thing that's slow. The usual reason we use a small thing that's fast, rather than a big thing that's fast, is what? Why not just use a big, fast thing? That seems awesome. Yeah? Too expensive.
Yeah, exactly. Although, actually, maybe in one of the papers we'll read this year... So you guys didn't live through the previous transition, but at a certain point in time, there was this thing called tape. Tape was very slow, and you can imagine what lookups were like. On a disk, to seek back and forth, I'm moving the heads around and the thing is spinning. On a tape, I mean, how many people have ever had a tape player? OK, so you remember, right? Seeking on those things was terrible. Fast forward, am I hearing the right song? Fast forward again. It was bad. So then we went to disks. And there are now people who are proposing memory-only architectures. There's a group at Stanford that's been working on this for many years. Why use disks at all? They're too slow. They break, they're such a pain. Why don't we just build a memory-only cloud? And they've been doing this, and you can imagine it's very fast. So anyway, we make the big thing look faster by putting a small thing in front of it. In the case of the file system, where the big slow thing is the disk, that small, fast thing is memory. Now, this is kind of interesting. The traditional name for the memory we use for this purpose is the buffer cache, or the file system cache; buffer cache is a pretty widely used name. What's interesting about using memory for this purpose? It's not as simple as saying, no problem, I'll use memory as a buffer cache. Compare this to those other caches: the hardware processor caches, that's all they do, they just cache the contents of main memory. Memory, on the other hand, what is interesting about using part of it for a cache? There's an interesting trade-off here that I don't necessarily have with some of those other caches. What else is in memory? I've told you I'm going to use part of memory to cache the contents of the disk, but what else do I use memory for? Yeah, like as memory, right? For process pages and as a cache.
So there's this trade-off on modern systems. Some of the memory is used for process pages, which store address space contents and code and things like that, the stuff you're working with for assignment three, right? That's one very common use of memory. But another part of memory is used to cache the contents of the file system, and these two uses are in conflict with each other because there's only a fixed amount of memory on the system. So what would be a case, and you may have seen this effect on a system you've used, where I'd want a really, really large buffer cache? Where I'd want a lot of the memory caching file system contents, and I'd end up with only a small amount of memory left for process pages? What kind of workload would benefit from this? Does anyone feel like they've ever caught their system in this state? If I put a lot of memory to work for the file system cache, the file system speeds up, but I may start thrashing in the virtual memory system, because I don't have as much memory left and I have to kick things out to disk a lot more often. Has anyone ever come back to their system after it's been sitting overnight and found it a little sluggish for a minute or two? Has that happened to anybody? One of the things that might have happened overnight is that a program ran, and there are programs on Linux machines that index the entire file system. They go through the entire file system and build a database of all the paths on the system, and that creates a lot of IO. You might also have other indexing tools that look at the actual contents of the files.
So overnight, what happened is that this one program ran, doing all this file IO, and the operating system over time devoted more and more memory to the file system buffer cache to make that process fast. When you come back and start interacting with your system, a lot of the pages that were being used by your interactive programs, your browsers and your code editors and so on, have all been swapped to disk to make room for this big buffer cache. So when you start using the machine again, it's slow for a minute while those things come back in, right? On the other hand, if I do the opposite thing and keep a small buffer cache, I reduce the amount of swapping I have to do, but file accesses are potentially slow. Operating systems are constantly making this trade-off between these two potentially useful ways to use memory. This is probably my favorite kernel parameter of all time: Linux has a value called swappiness. That's literally what it's called; I didn't make that up. Swappiness controls how the kernel makes this trade-off. I don't exactly understand how, but it's a value you can set between 0 and 100, and at one end it says prefer to use memory for the file system buffer cache, while at the other end it says prefer to use memory for pages and address space contents. Now, it just says prefer here, so there's some trade-off that Linux is making that you are parameterizing with this value. I don't think there's any way to force it to do only one or the other, but you can tell it how to weight this decision. All right, so now I've got some memory that I'm going to use to make the file system fast. What do I want to put in this memory? How do I use the cache here?
So there are two models, and we'll discuss both of them, for where the buffer cache goes in the file system hierarchy. Imagine that this is my file system; this is actually sort of based on OS 161. Operating systems typically have a virtual file system interface, which OS 161 uses, and that is what makes this possible. You might wonder, how do operating systems support so many different file systems? They do it by creating an interface that all the file systems have to implement. To some degree, file systems are an example of the kind of modularity you can achieve behind a well-structured interface. So in order to be a file system on OS 161, or on Linux, or probably on Windows, there's a list of functions that you have to implement. As long as you implement those functions, I can treat you like a file system and make sure that the file system system calls all work properly, et cetera. The virtual file system interface on OS 161 defines the set of functions, and then various client file systems implement them. On OS 161 right now, you have emufs, which is how you access the working directory that you run sys161 in. If you didn't have that, you wouldn't be able to load programs or really do anything with the system. And there's also something called the simple file system, which is part of the fourth assignment for the class, which you guys are mercifully not required to do. Now, emufs obviously works a little differently, because it's accessing this weird file system that's outside the simulator, which is not exactly normal. On a normal system, all the file systems use the same underlying disk interface, which is what we've talked about before: just read and write blocks.
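This function-table idea can be sketched in miniature. Here's a hypothetical Python version (the real OS 161 interface is a set of C function pointers, and all the names below are invented for illustration): any object that implements the required operations can be treated as a file system by the rest of the kernel.

```python
from abc import ABC, abstractmethod

class VFS(ABC):
    """The contract every file system must implement (hypothetical names)."""
    @abstractmethod
    def open(self, path): ...
    @abstractmethod
    def read(self, fd, nbytes): ...
    @abstractmethod
    def write(self, fd, data): ...
    @abstractmethod
    def close(self, fd): ...

class RamFS(VFS):
    """A toy in-memory file system satisfying the interface."""
    def __init__(self):
        self.files = {}    # path -> bytearray of file contents
        self.fds = {}      # fd -> [path, offset]
        self.next_fd = 3   # 0-2 reserved, as is conventional

    def open(self, path):
        self.files.setdefault(path, bytearray())
        fd = self.next_fd
        self.next_fd += 1
        self.fds[fd] = [path, 0]
        return fd

    def read(self, fd, nbytes):
        path, off = self.fds[fd]
        data = bytes(self.files[path][off:off + nbytes])
        self.fds[fd][1] += len(data)      # advance the offset
        return data

    def write(self, fd, data):
        path, off = self.fds[fd]
        contents = self.files[path]
        contents[off:off + len(data)] = data
        self.fds[fd][1] += len(data)
        return len(data)

    def close(self, fd):
        del self.fds[fd]
```

System-call code that dispatches through `VFS` never needs to know which concrete file system it's talking to, which is the modularity point being made above.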
And when I format the file system, when I partition my disk drive, what I'm really doing is breaking up the disk into a portion that one file system can use and a portion that another file system can use, just breaking it up logically. So one place I can put my buffer cache is right below this virtual file system layer, where it caches the results of calls that are made down to the file systems. That's one option. The second place I can put it is down here, right on top of the disk interface. And here, what is it caching? If I put the buffer cache here, what's in the buffer cache? Disk blocks, yeah. At this point, the buffer cache doesn't know anything about high-level file abstractions. If I put it up here instead, what's in the buffer cache? Files, right? So up here I'm caching whole files; down here I'm caching the underlying disk blocks. There are pros and cons to both approaches. If I put it above the file system interface, the buffer cache has to support the same interface as the underlying file systems. Because how do caches work? The cache sees the call, say open, first, and then decides whether it has enough information to complete the call without using the file system. If so, it does; if not, it passes the call down and then has to do something with the result, right? So if I put it at the virtual file system level, I have to support the same interface as the underlying file systems, and I'm caching entire files and the contents of directories. Here's a rough mapping of how a cache at this level would work. Open: there's not really much a cache can do about open; I just need to pass that through. What do I do with read and write? These are kind of my bread and butter, right? This is where I'm trying to make things fast.
How does a cache handle a read? What's the best case here? Yeah. In the best case, the file's already in the cache. And maybe I should have said this before: the only way caches are going to make this whole process faster is by reducing the number of disk operations that have to happen. That's pretty much it. There are some other games you could try to play, but fundamentally, caches work by making the disk do less work. I'm trying to reduce the number of seeks and reads and block operations that the disk itself is doing. If the file's in the buffer cache when I get a read, I'm in good shape. That means something else must have loaded it before, and if the right part of the file is in my buffer cache, I can return the contents without touching the disk. So this is good; I just saved myself some disk operations. If it's not, I have to let the file system handle the operation and then add the contents to my cache. Same thing with write: if the file's in my cache, I need to make sure that I update the cached contents. At some point I also need to modify the contents on disk, and we'll talk about that in a second, because that's where consistency comes into play. But why is it important that I update the cache as well? A common error when doing a first pass at this is to not handle writes at all. What happens if I just make write a pass-through, and just pass the write down into the file system? How will that break things? Yeah. If I'm modifying a file that's in my cache, I have to update the contents in the cache; otherwise, the next read is going to read stale data. Close: I can remove things from the cache, and maybe at that point I update the contents on disk.
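Here's a minimal sketch of that file-level cache logic, with an invented interface (a `read(path)`/`write(path, data)` file system underneath). The comment in `write` marks the line whose omission causes the stale-read bug just described.

```python
class FileCache:
    """Whole-file cache sitting above a file system (illustrative only)."""
    def __init__(self, fs):
        self.fs = fs          # underlying fs: read(path) / write(path, data)
        self.cache = {}       # path -> cached file contents
        self.hits = self.misses = 0

    def read(self, path):
        if path in self.cache:
            self.hits += 1                 # served without touching the "disk"
        else:
            self.misses += 1
            self.cache[path] = self.fs.read(path)   # fill the cache on a miss
        return self.cache[path]

    def write(self, path, data):
        self.cache[path] = data   # update the cache too: skipping this line
                                  # would leave stale data for the next read
        self.fs.write(path, data) # pass the write down to the file system

class DictFS:
    """Stand-in for the real file system layer."""
    def __init__(self):
        self.disk = {}
    def read(self, path):
        return self.disk[path]
    def write(self, path, data):
        self.disk[path] = data
```

With this structure, the second read of a file hits in the cache and never reaches `DictFS`, and a write followed by a read returns the new contents rather than stale ones.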
So, oops, there's a bug in the slides today. The pros here are that the buffer cache sees operations at the file level, and there are times when that's helpful. Can you think of an example of a case where seeing a file operation helps? Remember, the below-the-file-system cache will only see disk blocks. So what's an example where the cache might do something smart because it sees a file operation? Propose one way to optimize a cache at this level. Give me an optimization that you could launch via open. Remember, when open was called, we didn't do anything; we just passed it through. What could I do in open? Yeah, so for example, when open is called, the cache might say, you know what? I bet that file's about to be read from or written to. I'm going to load some of the contents, or maybe all of the contents if the file is small, into the cache. Then the next time read is called, and usually I'm opening a file because I want to do something to it, the read can return from the cache rather than from disk, right? So that's useful, and the only reason I can do it is that I'm seeing these file operations. In a minute, you'll see that at the disk level, I can't do things like this. The real problem with this approach is that, well, there are two issues. One issue is that there are consistency guarantees that the file system might want to make that the cache prevents. For example, if the cache caches a write and doesn't pass that information down immediately, the file system may not have a chance to update the structures on disk that it needs to. The other limitation is more interesting and potentially more problematic: at this level, I can't cache file system metadata. Remember, we've divided blocks on disk into two categories.
There were the data blocks that actually had file contents in them, and then there was everything else: all these on-disk data structures that the file system uses to name things and allocate space and so on. Why is it a problem that I can't cache those kinds of structures? From a usage perspective, why would you be worried about a cache that couldn't cache file system metadata? Yeah, it's not a permission issue. Yeah. Okay, that's fair, but that's not what I had in mind. Yeah. What's that? No, it's not a question of flushing; at this level it has to do that anyway, that's fine. Let me give you one example of a data structure that this cache cannot cache: the inode tables. It's not a very good cache if it can't cache everything, and it can't cache some things. Why might that be a problem? Every operation that the file system does that hits an inode table has to go directly to disk. Why is that a problem? What's that? It changes a lot. Yes, and it's used all the time. Some of those file system data structures get used constantly. Every time I read or write a file, I'm probably updating the inode. And in this caching model, all of those updates happen below the cache, so they can never hit in the cache. That's the big problem here: file system metadata. One of the reasons it's okay to have inodes stuck at fixed locations on the disk, in file systems like ext4, is that those parts of the disk just tend to live in the cache. They get cached early on when you start using the file system, and they never get evicted because they get used all the time.
You might be thinking, well, I've got to constantly move back and forth between the inode and the data block, but you don't, because the inode's in the cache. At least when I'm doing reads of the inode, which is pretty common, I don't have to find the inode on disk; the inodes are in the cache almost all the time. All right, so below the file system interface, I'm caching disk blocks. The buffer cache interface here is at the read-and-write-block level. I have to mimic the interface of whatever is below me, and in this case what's below me is the disk, so my buffer cache interface is reading and writing blocks. The pros here are pretty much what we just described. I can now cache everything: file system data structures, anything that's on disk, which is everything. The other nice thing is that I never hide information from the file system. The file systems above me are now, by definition, going to see every operation, even things that hit in the cache. So if they want to do certain things to other parts of the disk, or if they have consistency guarantees they need to meet, I'm not stopping them. The problem, of course, is that once I get down to the disk block interface, there's no notion of a file. That nice thing I wanted to do before, where I said, oh, the file's about to be opened, I'm going to read in some of the contents: I can't do it here, because all I see is, read from block 2062, and I have no idea what that means. I've lost a lot of information. But this is typically what modern operating systems do. They have disk block buffer caches, I think mainly because of the metadata; that's super important. The other interesting thing, of course, is, let's go back to the example with open. How can I implement that prefetching here?
What I want to happen is, when I see an open, I want to preemptively read some blocks into the buffer cache under the assumption that they're about to be used. How could I implement that when the buffer cache is at the disk block level? You guys are ready for the weekend; I can tell, it's very quiet in here today. What can I do? You're the file system designer, and you're thinking, eh, this is terrible: the cache is way down there, and all it sees is disk blocks. What could the file system do? Yeah, so essentially the file system could read those blocks preemptively. The file system knows where those blocks are, so it could issue read requests for them. Or a better way might be to extend the buffer cache interface a little and allow the file systems above it to suggest blocks. Rather than read, which indicates that I need the contents now, I might allow the file system to say, by the way, if you're not busy, here are some blocks that I think might be useful for you to get, right? And then the buffer cache can decide whether or not it wants to load them. We might see this again in a minute. Okay. Oh, oops, sorry. I don't know why this slide is here; clearly we're not going to do a review at this point. Okay. So now let's talk about how caching interacts with consistency. You now understand a little bit about how I use memory to make the file system fast. Why can this also make the file system unsafe? Okay, so what's in the cache can be different from what's on disk. And then what? Yeah. And then what's in the cache? Poof, it's gone, right? Maybe some user just grabs the disk and ejects it.
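Going back to the suggestion idea from a moment ago, here's a minimal sketch of a block-level cache with that extended interface. All the names are invented: `read_block` is a demand fetch that must return data, while `suggest_blocks` is only a hint, which the cache is free to act on or ignore.

```python
class BlockCache:
    """Disk-block cache below the file system (sketch; names invented)."""
    def __init__(self, disk, capacity=64):
        self.disk = disk          # exposes read_block(n) -> bytes
        self.capacity = capacity  # max number of cached blocks
        self.cache = {}           # block number -> block data
        self.disk_reads = 0       # how many times we actually hit the disk

    def read_block(self, n):
        """Demand fetch: must return the block, from cache if possible."""
        if n not in self.cache:
            self._fill(n)
        return self.cache[n]

    def suggest_blocks(self, blocks):
        """Hint from the file system: these blocks may be needed soon.
        The cache may prefetch them or ignore the hint entirely."""
        for n in blocks:
            if n not in self.cache and len(self.cache) < self.capacity:
                self._fill(n)

    def _fill(self, n):
        self.cache[n] = self.disk.read_block(n)
        self.disk_reads += 1
```

A file system's open routine could call `suggest_blocks` with the file's first few block numbers, so that the subsequent reads hit in the cache, recovering some of the prefetching ability that was lost by moving the cache below the file abstraction.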
I mean, you've probably seen these warnings over and over, these angry messages you get from the operating system: why did you eject the disk without telling me? People have seen this before? Systems still do this. It's amazing, right? I just eject the disk anyway. I'm like, I don't care, right? But then you get this error message, and it's silly, right? Because it's like, I already did it. I'm not going to go plug it back in. What are you talking about? I'm sorry. I'm sorry that we have this kind of relationship. You're always mad at me about doing stuff that seems normal. But this is kind of why, right? You just created this completely unpredictable event, and the file system might have been in an inconsistent state. How many people have ever had this cause a problem? Oh really? Oh man, I should have asked that question earlier. It's never caused a problem for me; that's why I keep doing it. But in the case where I was modifying something, where I had some dirty stuff in the cache, if you come along and suddenly yank that disk out, I'm in trouble. Not only was the cache different from what was on disk, but what was on disk might itself be inconsistent. Going back to our view of the file system as the delicate and complex data structure that it is, implemented by writing things to the disk, this can cause all sorts of funny things to happen. So let's imagine that I'm creating a new file in an existing directory. What do I have to do? I have to allocate an inode and mark it as in use. I have to allocate data blocks for the new file and associate them with the inode. I have to add the file to the directory: the directory contents get a new mapping from the name of the file to the inode that I created in the first couple of steps. And then I have to write the data blocks, both for the directory and for the file.
So you probably didn't think it was this complicated. When you run touch blah in a directory, all of this has to happen. We went through this before, but just to remind you: if I get halfway through this and suddenly the disk is ejected, or the power goes out, whatever, there are a bunch of different things that can go wrong. I think I pointed out the lost and found folder before. How many people have one on a file system somewhere? Well, you all do, because if you're using our VM, there is a lost+found folder. Why does that folder exist? Anyone know? Yeah. Okay, so give me an example of how a file would end up in the lost and found folder, using this example. Show me where the failure happens. At what point do I fail in a way that would cause that? The association step? No, but close. Essentially, step four. Think about the file system as a tree. Part of creating a new file is linking the new file name into the tree. If I don't do that, what I have now are two trees: the normal tree, and this node hanging out over here, disconnected from everything. In certain cases this can happen because of failures, and the system can see, huh, there's this file over here with inode 65,000, but when I walk the file system top down, I don't find inode 65,000. So where do I put it? Lost and found, right? That's where I put it, because I have no idea where it's supposed to go. I might know its short name, the name you gave it in the directory, but I have no idea where it's supposed to be in the directory hierarchy, because something failed and I didn't link it up properly with the rest of the tree. So that's where stuff ends up, in certain file systems, when I fail in a way that prevents the file from getting linked onto the rest of my DAG.
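The orphaned-file scenario above can be simulated. This toy model (all names and the dictionary layout are invented for illustration) runs the creation steps from before and can "crash" partway through; crashing before the directory entry is written leaves an allocated inode that no directory reaches, which is exactly the kind of node an fsck-style checker relinks into lost+found.

```python
class Crash(Exception):
    """Simulated power failure / disk ejection."""
    pass

def create_file(fs, dirname, name, data, crash_after=None):
    """Create a file via the steps from lecture; optionally crash mid-way.
    fs is a toy on-disk state: {'inodes': {}, 'dirs': {}, 'blocks': {}}."""
    steps = 0
    # 1. allocate an inode and mark it in use
    ino = max(fs["inodes"], default=0) + 1
    fs["inodes"][ino] = {"blocks": []}
    steps += 1
    if crash_after == steps: raise Crash
    # 2. allocate a data block and associate it with the inode
    blk = max(fs["blocks"], default=0) + 1
    fs["blocks"][blk] = data
    fs["inodes"][ino]["blocks"].append(blk)
    steps += 1
    if crash_after == steps: raise Crash
    # 3. add a name -> inode mapping to the directory (the linking step)
    fs["dirs"].setdefault(dirname, {})[name] = ino
    steps += 1
    if crash_after == steps: raise Crash
    return ino

def orphaned_inodes(fs):
    """Inodes that are allocated but reachable from no directory:
    these are what a checker would move to lost+found."""
    reachable = {ino for entries in fs["dirs"].values()
                 for ino in entries.values()}
    return set(fs["inodes"]) - reachable
```

Crashing after step 2, before the directory entry exists, leaves one orphaned inode; a run that completes leaves none.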
Does that make sense? All right. We've talked about this before: pretty much every operation I might need to do, even something as simple as creating a new file in an existing directory, requires modifying a bunch of blocks. Think about the case of rename. How many people have ever moved a file before? Really? Please raise your hand. Just do me a favor. I mean, I've seen some of you use the shell, so maybe it's possible that you've never moved a file, but I'm pretty sure you have. It turns out that rename is pretty complicated, right? Talk yourself through rename and think about all that can go wrong. In theory, when you move something, it's supposed to stop being at the first place and appear at the second place. But with a failure, it can end up in neither place, or it can end up in both places. Both of those are failure outcomes, and both are possible if I fail at various points, right? Which one is better? If you had two file systems, one of which would fail such that the file was in neither directory, and the second of which would fail such that the file was in both directories, which would you prefer? Both. I'll take both, right? At least I can find the file. Okay. Now, there are two ways that caching exacerbates these consistency problems. The first one, and you'll discover this sort of effect when you do the swapping part of assignment three, is this. When you ejected that disk like you weren't supposed to, there were these periods of time during which the on-disk file system was in an inconsistent state. The longer those periods are, the more likely it is that you'll do something dumb, or that something bad will happen, or something will fail, while it's in an inconsistent state.
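The two rename outcomes can be sketched directly, assuming (hypothetically) that rename is implemented as two separate directory updates. A crash between the two steps is what decides the outcome, and the ordering decides which outcome you get: link-then-unlink leaves the file in both directories, unlink-then-link loses it entirely.

```python
class Crash(Exception):
    """Simulated failure between the two directory updates."""
    pass

def rename_add_first(dirs, src, dst, name, crash_between=False):
    """Safer ordering: link into the destination, then unlink the source.
    A crash in between leaves the file in BOTH directories (recoverable)."""
    dirs[dst][name] = dirs[src][name]   # step 1: add to destination
    if crash_between:
        raise Crash
    del dirs[src][name]                 # step 2: remove from source

def rename_remove_first(dirs, src, dst, name, crash_between=False):
    """Dangerous ordering: a crash in between loses the file entirely."""
    ino = dirs[src].pop(name)           # step 1: remove from source
    if crash_between:
        raise Crash
    dirs[dst][name] = ino               # step 2: add to destination
```

This is the "I'll take both" answer: the duplicate entry is annoying but the file is findable, whereas the other ordering can leave an entry in neither directory.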
If I write everything out to disk as soon as possible, I keep those intervals as small as possible, and the likelihood that a failure causes a problem is small. But once I start caching things, those intervals get longer, and so the probability that you eject the disk between when a change hit the cache and when I actually got around to writing it to disk is larger. That's a problem, and that's one way caching exacerbates the situation. The other way caching can sometimes exacerbate the situation is when I have four or five different operations that all have to get done together. Caching can spread out the times at which they actually get down to the disk, and at every point in between, I'm in an inconsistent state. So when I have things that are essentially transactions, that all have to get done at the same time, delaying them and letting them sit in the cache can create this problem. Okay, so, yeah, we just went through this. Does anyone have any questions? I'm just going to skip through this; we did it before. These are just examples of different ways in which, if I skip one of these steps but do some of the others, I can have problems. So what's the safest approach to caching? First of all, what part of the buffer cache do I not have to worry about? What part will never cause a problem? Yeah? Reads, right? Reads, because they do not modify state, cannot cause any sort of consistency problem. I'm only worried about modifications. So as far as read caching goes, I've got no problems. I bring things into the cache and leave them there as long as I can, and reads can hit in the cache from here until Sunday without causing any issues. So that's good; I've eliminated half of the problem. Now let's talk about writes. What's the safest approach to caching writes? What's that? Yeah: don't buffer them.
Now, I can still cache them, right? I still want to keep the contents in the cache so that the next read gets up-to-date contents from the cache. But what this means is that every time I modify a block in the cache, I write that block to disk immediately. That makes sure, again, that I'm limiting the time between when the block was modified and when those modifications made it to disk, so this leaves me with the fewest time spans where I can have problems. This approach is called write-through caching, because you can think of writes as passing through the cache: a write never lingers in the cache, it's always reflected immediately to the disk. What's the most dangerous way to buffer writes? Let's say you're really interested in data loss and corruption — I'm sure some of you are, right? What would you do with writes? Okay, never updating the disk is never going to work. I mean, I guess if you're just a total nihilist and you don't care about the file at all — but if you never write to disk, there is no scenario where that stuff is ever going to get there. Did you guys read about that little scam some company ran, where they sold these four-gigabyte flash drives — or, I don't know, maybe they were four terabytes, some ridiculous amount of storage — for like 25 cents? The thing was, you'd plug it in, it would show you a directory, and it would say: I've got four terabytes of space. How do you think they worked? Four terabytes of space is expensive, and not something you're going to fit on a flash drive yet, I don't think. So how do I create this four-terabyte flash drive and sell it to you for 10 cents?
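A minimal sketch of the write-through policy, in Python — the class and its methods are invented for illustration (a real buffer cache lives in the kernel and operates on device blocks), but it shows the defining property: reads are served from the cache, while every write updates the cache and the "disk" at the same time, so the disk is never stale.

```python
# Hypothetical write-through buffer cache. The "disk" is modeled
# as a dict mapping block numbers to block contents.

class WriteThroughCache:
    def __init__(self, disk):
        self.disk = disk
        self.cache = {}

    def read(self, blockno):
        if blockno not in self.cache:            # miss: fetch from disk once
            self.cache[blockno] = self.disk[blockno]
        return self.cache[blockno]               # hits never touch the disk

    def write(self, blockno, data):
        self.cache[blockno] = data               # keep the cache up to date...
        self.disk[blockno] = data                # ...but write through immediately

disk = {0: b"old contents"}
c = WriteThroughCache(disk)
c.write(0, b"new contents")
```

The cost is that every one of your writes is a disk write, which is exactly the performance problem discussed next.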
Well, actually, how do I make it for 10 cents so I can sell it to you for a dollar and still make money? How does it work? How do I fake it? Look, you can't assume people are super dumb, right? So what do I do? Clearly the thing has to misreport the amount of storage it has — that's step one. When I plug it in, it has to look like a four-terabyte drive. But how do I handle writes to it? Just drop them. Yeah. Essentially, these things claimed to be four terabytes but actually had something like four megabytes on them. So they would hold four megabytes of stuff, no problem. You'd get it and think, wow, what a good deal. I start putting photos on it; I'm happy for two days. And then when I try to write past a certain block, it just drops everything, right? So that sort of file system we're not that interested in, okay? But what's the most dangerous approach if I actually do want to persist data? What do I do? How long do I keep writes? Yeah — I hold writes in the cache until the block is evicted. The block could be evicted because I need space for blocks that are more active, it could be evicted because the file is closed, whatever. But I just keep things in the cache and don't write them to disk for as long as possible. Why is there a trade-off here at all? Why not just use a write-through cache? Why are we even talking about this? What is the trade-off? The cache would be slow? No, because the write happens after the cache is accessed, right? Oh, hello — I'm amazed that took that long today. That wasn't me this time. So what's the trade-off here? The second approach will improve performance. Why? Yeah? Less I/O. Less I/O, because — let's say I do a thousand writes to the same block of a file. In the first case, I have to write all of them to disk immediately.
So I do a thousand disk writes. In the second case, I do one. That's awesome, right? Just write it once. The second approach is called a write-back cache. And there are different levels of this. For performance, obviously, I want the second approach, because it lets me eliminate as many disk operations as possible. For safety, I want the first approach, because it keeps the file system as consistent as possible. What else could I do? How can I gain some of the benefits of both approaches? Yeah — limit the time, right? So real caches will say: okay, I can delay data writes, but only for a bounded period of time. The other thing to observe, which I forgot to mention, is that file system metadata — remember, I'm caching at the block level here, so all the file system metadata flows through this cache too — that stuff is super important. So for the metadata, I may not want to buffer writes at all. I'm still caching it, but I don't buffer the writes; I have those writes go right through the cache. With data blocks, I can keep them dirty in the cache for a little while. From the file system's perspective, it's a lot more important that its own internal data structures stay consistent than that you not lose some of your data. That may surprise you, but if the file system data structures get corrupted, you're in real trouble. Years ago, I was building up an MP3 collection in college — I was pretty excited about this. I bought this 20-gigabyte hard drive, which was huge at the time, and I was putting all these songs on it. My own little MP3 collection; it was pretty exciting. And at some point I did something dumb and overwrote part of the superblock of this drive. So it was toast, right? I was sad, because it used to be very hard to get MP3s. You guys don't realize that you live in the future.
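The write-back idea can be sketched the same way — again, the class is made up for illustration, not a real kernel interface. Writes only mark the cached block dirty; the disk is updated when the block is evicted (or explicitly synced), which is exactly why a thousand writes to one block can cost a single disk write.

```python
# Hypothetical write-back buffer cache. Dirty blocks reach the
# "disk" (a dict of block number -> contents) only on eviction.

class WriteBackCache:
    def __init__(self, disk):
        self.disk = disk
        self.cache = {}
        self.dirty = set()
        self.disk_writes = 0          # count actual disk I/Os

    def write(self, blockno, data):
        self.cache[blockno] = data    # buffer the write in memory only
        self.dirty.add(blockno)       # remember it must reach disk eventually

    def evict(self, blockno):
        if blockno in self.dirty:     # write back only when the block leaves
            self.disk[blockno] = self.cache[blockno]
            self.disk_writes += 1
            self.dirty.discard(blockno)
        del self.cache[blockno]

disk = {}
c = WriteBackCache(disk)
for i in range(1000):                 # a thousand writes to the same block...
    c.write(7, f"version {i}")
c.evict(7)                            # ...become one disk write, of the last version
```

The hybrid policy from the lecture would simply route metadata blocks through a write-through path while letting data blocks sit dirty like this, bounded by a periodic flush.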
So I spent a lot of time and energy in college, when I should have been doing other things, finding these MP3s online and downloading them. It was all legal, I assure you. I was sad that I'd lost all this information, so I spent like 50 bucks on a computer program that claimed it could recover my data. I was pretty desperate, so I tried it, and it found all this stuff. I was really excited — it found thousands and thousands of MP3s. And then I noticed something interesting: it hadn't found all of them. Maybe I started with 10,000 songs, and when it was done it had 9,950. I thought, that's fine, whatever — those 50 songs I can download again; it'll take a few more days. Then I started to listen to the songs. Where do you think those other 50 songs were? Little snippets scattered all over the place, right? You're listening to Britney Spears and suddenly there's a quarter of a second of some other song. So I had to throw the whole thing away. It was very sad. But those are the problems you can get into when your file system data structures get corrupted. All the data was there — if I had sent it off to some forensic disk person, they probably could have recovered it. But the file system data structures were gone, so the file system didn't know where anything was. If you ever get to the point in your life where you have to hire one of those forensic data recovery people, you know you've had a very bad experience. It's always a sad thing. Back up your data. And there's also an interface for controlling some of this stuff. If any of you have ever used sync, you can actually tell the system: I want all the dirty buffer cache blocks from this file written to disk. Let's see where we are. I think we're going to stop here today.
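That flush interface is visible from ordinary programs. Here's a small Python example using the real `os.fsync` call, which is the per-file cousin of sync: it asks the kernel to push that file's dirty buffer cache blocks down to the device before returning (the file path here is just a temporary file made up for the demo).

```python
# Forcing buffered writes to disk: first drain the user-space file
# buffer into the kernel with flush(), then ask the kernel to write
# its dirty cached blocks for this file to the device with fsync().

import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")
with open(path, "w") as f:
    f.write("important data")
    f.flush()                  # user-space buffer -> kernel buffer cache
    os.fsync(f.fileno())       # kernel buffer cache -> disk, before we proceed
```

Without the `fsync`, the data could sit dirty in the buffer cache for a while after `write` returns — which is exactly the window we've been worried about all lecture.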
We will talk about journaling on Monday. Good luck wrapping up assignment three and have a great weekend.