All right, good morning. I've been drifting later and later, so today I thought I would actually start at 9 AM and give everybody their money's worth, all 10 of you guys that showed up here. I hope that there's an asymptote to this curve that's not zero, because the attendance seems to be dropping, but I hope there's some basement. Or maybe the bus is just running late today. So it's Friday, and we're going to try to finish up file system basics today, so that next week we can talk more about the specifics of a couple of different file system implementations. This part I think has spread out a little bit more than I anticipated, which is fine. I'd rather go slowly and make sure everybody understands things. But next week, I want to talk specifically about a couple of file system implementations, and then maybe some stuff on data recovery and crash recovery. So today, we're going to get through basic file system operations. How do we locate data blocks? How do we associate data blocks with a file so that it can grow and shrink? And then finally, we'll talk a little bit about caching. Where does the cache go? What's the interface to the cache? And a brief bit of discussion of the implications for file system consistency. No announcements today; I think everybody knows what's going on. Assignment two is due on Monday. I'll remind you, you guys have five late days for assignment two and assignment three combined. Five total late days. So I suggest that you use those late days; I mean, there's no point not using them. The question is, do you want to use them on assignment two, or should you wait and use them on assignment three? If you're done with assignment two, fantastic. You've got five late days in the bag waiting for you for assignment three, where you will probably need them. So anyway, any questions about the assignments, course logistics, anything? We've gotten some of the old lectures up online.
I think maybe we're like one or two behind at this point, but I'll work on that this morning, so they'll be up; we'll be current as of today. So we'll have all the old slides and video online. Questions about course logistics? All right. So on Wednesday, we talked about inodes and about how we implement directories. Questions about this material? Any questions about inodes, inode structure, inode location? Directories: what are directories? How do we implement directories? How do we find things on the file system? We'll go through an example of path name translation today, so we'll get a little bit more reinforcement of this stuff. So who remembers? Remember, file systems are systems, and systems like numbers. Systems don't like names. Systems don't like characters. Systems like numbers. So inodes are the index nodes that store information about every file on the file system. But when I request an inode, I do that by requesting an inode number. So how does the file system translate the inode number into an inode structure? How do I find the inode structure if I give the file system an inode number? So, super, right. So the superblock is part of the answer. But what's the general approach? Where do I store the inodes? They could be spread across the disk, but if I give you an inode number, you need to be able to translate that into a what? Into a disk block or a location. So essentially, I create all my inodes when I create the file system. When I format or initialize the file system, I create all my inodes, and I put them in locations that are well known. Either the locations are stored in the superblock or they're in the same spot every time. So you can imagine the simplest approach to this would just be to put all the inodes at the very, very beginning of the partition, starting with block two or something, as a big array of inodes.
That way, I could translate an inode number very, very easily into the data block and the particular piece of the data block that stores the inode. But what's the consequence of this approach? I allocate all my inodes at format time, and so what can happen? You missed that class. What about down here? No guesses? What can happen? I allocate all my inodes right up front. Oh, you're shaking your head. I'm drifting over here to the Sabres corner. Are the Sabres in the playoffs? If they win tonight? Oh, wait, OK. Capitals. All right. No idea. Scott? Right, so there's a fixed number. Well, there's two consequences, right? One is that when I format the file system, I have no idea where the data blocks associated with the inodes I'm creating are going to be. So it's possible that those data blocks might not be located near the inode, right? And that could be a problem. Why? What's that going to create? No clue. Sean? Right, so I have the disk head. Yeah, the disk head's going to be seeking back and forth between these inodes and the data, right? So how do I access the file? We're going to talk today about how I find data blocks, but the first step is I find the inode for the file. So I find the inode, and then the data blocks are over here. And maybe I updated the data blocks, and then I need to update the inode. So now my disk head is going back and forth, and you can hear it going, and it's bad, right? Remember, we don't want to move the heads. The heads are slow, OK? And then what's the other consequence? Scott alluded to this, right? I've got a fixed number of inodes, meaning what? Ben? If I have a fixed number of inodes, what can happen to the file system? And what would that mean? I can't create any more files, despite the fact that I still have space on the disk, right? So if I run out of inodes before I run out of data blocks, too bad, right?
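As a minimal sketch of that fixed-array approach, here's the arithmetic for turning an inode number into an on-disk location. All of the layout parameters here (block size, inode size, where the table starts) are made up for illustration; a real file system records them in the superblock:

```python
# Sketch of translating an inode number into an on-disk location,
# assuming a fixed inode table that starts at a well-known block.
# All layout parameters are invented for illustration.

BLOCK_SIZE = 4096        # bytes per data block
INODE_SIZE = 256         # bytes per on-disk inode
INODE_TABLE_START = 2    # block number where the inode array begins
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE   # 16 inodes per block here

def locate_inode(inum):
    """Return (block number, byte offset within that block) for inode inum."""
    block = INODE_TABLE_START + inum // INODES_PER_BLOCK
    offset = (inum % INODES_PER_BLOCK) * INODE_SIZE
    return block, offset
```

So with these made-up numbers, inode 0 lives at the start of block 2, and inode 16 is the first inode of block 3: the translation is pure arithmetic, which is exactly why the fixed layout is so easy.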
Your file system may have a lot of data blocks that are still left unallocated on the disk, but it doesn't matter. The file system is out of the structures that it needs to create new files, and so you're done, OK? All right, directories, right? So directories are just special files that contain a series of path-name-to-inode-number mappings, right? And I don't know what else is on this slide. Oh, OK, here we go. A description, and that would be a set of directories here, right? And I'm showing you, if I list the contents of the directory by inode, I can see the mapping between the relative path names and the inode numbers, OK? Any other questions about this? Somehow this was all that we did on Wednesday, although it was a fun class. Any other questions about this? We can keep going. All right, great. OK, so let me go back. I know I showed you this stuff on Wednesday, but let's just review it a little bit. So remember, the file system superblock stores all sorts of interesting information about the file system, and there are tools on Linux that will allow you to access this information. So we looked a little bit at the quite detailed output of the debugfs show_super_stats command, right? This reads the superblock, parses information out of it, and puts it in a human-readable format, right? And again, this tells you all sorts of things. It tells you what the volume name is, where it's mounted, what kind of file system it is, where the inodes are, what the inode size is, what the block size is, et cetera, et cetera, right? So this is kind of a neat view into the internals of how these things actually work, right? And then here's another page of it. It shows you, in particular, where the different data structures are on disk, right? So my free block bitmap, which I need in order to figure out which data blocks are free so that I can allocate them. It shows you where that is, and it shows you where the inode bitmap is.
What do you think the inode bitmap is used for? John. So the block bitmap is used to allocate blocks; therefore, the inode bitmap is used to allocate inodes, right? So when I'm creating a new file, I'm going to look in the inode bitmap to figure out which inode structures are unused, right? Then my inode table starts at a certain block, and this kind of shows you the statistics for this particular group, right? And you can see here there are no free inodes in this group. Now, I don't know actually where these groups map to on disk, but it's interesting here. You'll see that both group 1 and group 0, and I wish I had shown more groups here, are completely full, right? There are no free inodes in either one of these groups. And so what I'm wondering is, are these groups in a good spot on disk, meaning that EXT4 is allocating from them first, right? So who remembers, based on the disk geometry, what's a particularly good spot on disk? Spinning disk, right? What is the part of the spinning disk that I might want to use first? The outer edge. Why? Carl. Well, no, no, no, I think that doesn't matter, right? Because what's true on the outer edge of the disk? Constant density of data storage, right? One way to remember it is, there's more data on a track on the outer edge of the disk because the track is longer, right? But it takes the same time for the platter to make one revolution whether the head is on the outside or the inside of the disk. Ergo, when I'm on the outside of the disk, I see more data per revolution than I do when I'm on the inside of the disk, right? Meaning that the achievable bandwidth of the disk is higher on the outside edge, right? Does that make sense, Carl? Well, but again, if I pack everything out to the outside of the disk, that's OK, right? Because remember, seek time is always lateral, right?
So if I can get everything out to the outer edge, then I'm only seeking on the outer edge. If I get everything out to the inner edge, then I'm only seeking on the inner edge, but my bandwidth is lower, right? OK, yeah, yeah. So this volume is actually not full, right? But my question is, what's the EXT4 block allocation strategy? And what I'm saying is, I think based on this evidence, and I wish, again, I had shown a couple more groups, it looks like the first blocks being allocated on this particular volume are being allocated from the lower-numbered groups, right? So again, there were 64 different groups. I wish I had shown some other ones, but maybe I'll do that for next time. OK. All right, so again, when I call open, the process hands a path to the file system, OK? And the file system needs to translate that path to an inode number. Remember, the file system thinks in numbers, right? So this could be a file or a directory. I'm trying to open it, and I need to figure out what's the inode number corresponding to this path, OK? So for assignment two, you guys aren't bothering with this, right? You just pass this into vfs_open, and you're done, right? But what is vfs_open doing? What's the process of translating this path into an inode number, right? So there is a unique inode number on the system for /etc/default/keyboard, right? This is a file. I don't know why I chose this file, maybe just because it wasn't that deep, and it was an actual file rather than a directory, all right? So what's the first step? What's the first thing I need to do? How do I bootstrap this process? Who remembers? I'm going to go through a process of translating pieces of this path name to inode numbers, right? Find the root. And how do I find the root? Inode number two, right? Remember, in order to bootstrap this process, I put the root directory at a well-known inode number, right?
So any EXT4 file system will have the root directory as inode two, okay? And that's how I bootstrap this process. Otherwise, I'm stuck, right? Otherwise, I need an inode to translate the path, but I need a path to produce an inode, right? So this is what helps me get started, okay? Let's start with inode two, right? So I use this fixed, agreed-upon inode number, two. Now what do I do? I have inode number two, which corresponds to the root directory. What's the next step? What am I going to use that inode number to do? Well, right, so I expect that inode number two points to a file that is a directory, right? But what am I actually going to do? I've got an inode number; now I can give the file system an inode number, which is what it understands. But what's the next thing I need to do here? What's that? So I need to find the inode for etc, right? But that's a good point. So why etc? Why not etc/default/keyboard? What's the other baked-in assumption that you guys are operating under when you're solving this problem? What's that? Well, what special semantics are built into this file name? It's an absolute path. You guys are getting closer. Again, you guys are so programmed that you don't even realize: what? It's hierarchical, and again, what sort of special formatting is going on here? The forward slash is being used as a separator. So the forward slash is what standard Unix hierarchical file systems use to divide components of the path, right? Is there any reason why a forward slash is necessary for this purpose? What else could I use? What's that? Anything, right? I mean, I could use a period, right? I could use an underscore. I could use a caret, right? This is just what we're used to looking at, right?
It does mean, however, that using a backslash, or sorry, a forward slash as part of your file name is problematic, right? Because that special character has a special meaning to the file system, right? So if I used a period, then I wouldn't be able to use periods in the actual path name components, right? But anyway, okay, I need to figure out what the inode number is for etc, right? I know the inode number for the root, which is two, okay? So what am I going to do, John? Okay, yeah, so this is close enough, right? I'm gonna use inode number two to retrieve the contents of the root directory, right? I tell the file system, give me the contents of the file with inode two, right? And I expect that inode points to a directory, and so the contents of that directory are going to be a series of path-name-component/inode-number pairs, okay? And so yes, I'm gonna open this directory, I'm gonna read in the contents and I'm gonna parse the contents, and it turns out there's a special structure to directory files, as you would imagine, that is understood by the file system. And I'm gonna look for the path name component etc, and if I find etc, then there will be an inode number associated with it. What happens if etc is not in there? Then this file doesn't exist, right? If the path name component isn't in the directory, then there's no file on the system with this name, right? Okay, so now what do I do? Let's say I do find etc and it's got this inode number 393218. Now what? What's that? Where do I look for default? I'm trying to get you guys to go step by step here, right? So yes, I'm going to look for default, that's the next path component. But what do I do? I have this inode number 393218. What do I tell the file system? I read the contents of the file with inode number 393218. And again, it's another series of path-name/inode-number pairs, right? And I look for default, right?
And I keep doing this, going down the path, translating path components to inode numbers until I exhaust the entire string, right? And then what I've produced is an inode number corresponding to /etc/default/keyboard, all right? And let's see here, if I go back and show you this... well, I didn't go this direction here, but this gives you an idea of how this works, right? I mean, I start with two. Let's say I was looking up /home/challenge, right? I would start with two. I would list the contents of two, right? I would see that 393219 is home, and then I would go into home. And of course, again, I'm asking for the file by the file name here, but in reality, the file system would be looking it up by the inode number, right? Because the file system doesn't understand file names, right? All right, maybe I actually have a... nope, okay. So this is how I translate paths, all right? Any questions about path name translation? This is not a terribly hard thing to do. What happens if this path is not an absolute path? What happens if it's a relative path? I have to work with the current working directory, right? So if the process asked me to open ../etc/default or whatever, right? I perform the same process, but instead of starting with the root inode, I start with the inode that corresponds to the current working directory of the process, right? What do I do with dot and dot-dot in path names? How do I treat those? Does anybody know? What's that? No, no, dot-dot is the parent, but do I need to do anything special when I translate those, Ben? Yep. Yeah, and it turns out, I think, that if you go back here and look at this... well, this isn't showing dot and dot-dot, right? But if I printed off this entire directory, dot and dot-dot are just in there. They're just path name components, right?
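The component-by-component lookup described above can be sketched with a toy in-memory directory tree. Every inode number here is invented, and real directory files are on-disk binary structures rather than Python dicts; this just shows the translation loop, including how a relative path starts from the current working directory instead of the root:

```python
# Toy path-name translation: each directory maps name components to
# inode numbers, and the root directory lives at the well-known inode 2.
# All inode numbers here are invented for illustration.

ROOT_INO = 2

# directories[inode number] -> {component: inode number}
directories = {
    2:   {".": 2, "..": 2, "etc": 100, "home": 393219},
    100: {".": 100, "..": 2, "default": 200},
    200: {".": 200, "..": 100, "keyboard": 393218},
}

def namei(path, cwd_ino=ROOT_INO):
    """Translate a path to an inode number, one component at a time."""
    # An absolute path starts at the root inode; a relative one at the cwd.
    ino = ROOT_INO if path.startswith("/") else cwd_ino
    for comp in path.split("/"):
        if comp == "":                  # skip empty pieces from slashes
            continue
        entries = directories[ino]      # "read the directory file for ino"
        if comp not in entries:
            raise FileNotFoundError(path)   # component missing: no such file
        ino = entries[comp]
    return ino
```

With this toy tree, `namei("/etc/default/keyboard")` walks 2, then 100, then 200, and produces 393218; note that dot and dot-dot need no special handling at all, because they're just ordinary entries in each directory.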
They're just a special string that the file system interprets as pointing either to the parent inode of the directory or to the directory itself, right? So there's nothing really that special about them. The only thing that the file system might do with dot and dot-dot would be what? What could you not do to dot and dot-dot, in order for things not to go a little bit weird, right? What's that? You can't remove them. So if you try to execute a remove command on them, the file system won't let you take dot and dot-dot out of the directory, right? Carl, you're looking skeptical. So, okay, we talked about this, right? We talked before about wanting the hierarchical file system to not have cycles, right? And that is true, and that is something that's enforced when you move files around, right? So one of the things that happens is when you move a file on the file system, if the file system detects that you've created a cycle in the directory structure, it will fail, right? And the reason for that is what we talked about before. We want files to have unique names, some unique canonical name that points to that file, right? Once I introduce cycles, then that requirement is gone, okay? Yeah. How do I handle symbolic links? Okay, shoot, I forgot to put a slide in for this. Okay, so there are two linking tricks, right? So how many people have used links on UNIX? I think you guys probably have done this. How many people have used symbolic links? How many people have used hard links? How many people know what a hard link is? Okay, so one of the things we can do, right? Remember, my directories just contain translations between a path name component and an inode number, okay? So let's say I want the same inode number, the same file, to appear in two different directories. What's one way I could do that?
I could just have the inode number in two different directories, right? So I could have /etc/default/keyboard point to inode number 1000, and I could have /foo/bar/keyboard also point to inode number 1000, right? So that's what's called a hard link, right? What it means is that when the file name is translated by the system, what it gets out is the same inode number. And so the contents of those hard-linked files are identical. It is the same file on disk, okay? But what is one consequence of this, right? So I'm translating the path name to an inode number, right? Does an inode number make any sense when I cross file systems? Let's say I have two different partitions that are formatted with the same kind of file system. Does inode number 1000 mean anything on the other file system? No, an inode number is always relative to the file system that you're on, right? They're created at format time. And what this means is that hard links cannot cross file systems. So if you have two different file systems mounted in different places, you can't create a hard link that spans them, right? Because hard links are just by inode number, right? A simple way to fix this problem is what's called a symbolic link, right? So this is what you asked about. A symbolic link is a special file that simply contains a path name, right? The whole contents of the file are just a path name. So what do I do? Again, I wanna use this as a link; I wanna use it to allow one file to point at another file. So how do you think I translate this? Let's say /etc/default/keyboard is a symbolic link, okay? So I translate etc, then default, then keyboard, and finally I'm at inode number 394692, and it's marked as a symbolic link by the file system. I open it and I get a new path name. Now what do I do? How do I open the file? I mean, basically I just open that path name, right?
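Here's a toy contrast of the two linking tricks, again with invented inode numbers and paths: a hard link is just a second directory entry carrying the same inode number, while a symbolic link is a separate inode whose contents are another path name that gets fed right back into the lookup:

```python
# Toy illustration of hard links vs. symbolic links. A directory maps
# names to inode numbers; everything here is invented for illustration.

inodes = {
    1000: {"type": "file", "data": b"keyboard config"},
    1001: {"type": "symlink", "target": "/etc/default/keyboard"},
}

directories = {
    2:  {"etc": 50, "foo": 60},
    50: {"default": 51},
    51: {"keyboard": 1000, "kbd-link": 1001},  # kbd-link is a symlink
    60: {"bar": 61},
    61: {"keyboard": 1000},                    # hard link: same inode 1000
}

def lookup(path):
    """Resolve a path to an inode number, following symlinks as we go."""
    ino = 2
    for comp in path.strip("/").split("/"):
        ino = directories[ino][comp]
    node = inodes.get(ino)
    if node and node["type"] == "symlink":
        # a symlink's contents are just another path: look that up instead
        return lookup(node["target"])
    return ino
```

Both /etc/default/keyboard and /foo/bar/keyboard resolve to inode 1000 (that's the hard link), and the symlink /etc/default/kbd-link gets there too, after one extra round of translation. Since the symlink's target is only a string, nothing stops it from naming a path on a different file system, which is exactly why symlinks can span mounts while hard links cannot.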
So if I give you a file that's a symbolic link to open, the first step is you do this translation and you get an inode number, and you read in the contents of that inode and you identify it as a symbolic link, and it's just another string. And all you do is you pass that string right back into open, okay? This string can be anything, right? It can be a relative path name. It can be an absolute path name. It can be a path name to something that's on a different file system. So there are really very few semantics associated with these, other than it's a file name, and that's how it's interpreted, right? So for example, one of the things that can happen with a symbolic link is if you move it, it might break. So if I create keyboard as a symbolic link to ../foo/keyboard, which meant that /etc/default/keyboard would point to /etc/foo/keyboard, and then I move that link, it breaks, right? Because it's a relative path name, right? It's relative to where I am, and dot-dot might point to something else, right? All right, so I should have put this on a slide. I'll do that next time and get it onto the slides, okay? So we've talked essentially about how we do open, right? Today the goal was to talk about file system operations. So we've talked about how we do open: open takes in a path name and wants to locate an inode number, okay? How do we do read and write? Now I have an inode that I know that I'm operating on, and how do I do read and write? So essentially what read or write is doing at the file system level is it's taking an inode number, which the file system is going to remember after you call open, and it's giving you an offset, okay? So what do I need to do with this offset? It's right on the slide; I shouldn't have asked that. So one way of thinking about this is that open translates path names to inodes.
Read and write translate offsets to data blocks within a given inode, right? So I give you an inode, I give you an offset, and the goal of read or write is to find the portion of that file that corresponds to that offset, which essentially means find the data block that corresponds to that offset, right? And any other data blocks that are involved: if I do a read or write that spans multiple data blocks, I need to find all of the data blocks involved so I can either load the contents or modify those contents, right? But one way to think about this is simply translating an offset to a data block within that file. All right, and there are several different ways of doing this, right? Essentially the problem we have is, how do I associate data blocks with an inode, right? We just talked about how to find inodes. Now, how do I find the data blocks that are associated with that file? So here's one solution, right? We're gonna talk about three different ways of doing this and then we'll compare them to each other, right? And actually this ends up looking a little bit similar to something else that we've talked about; I'll ask you guys if you remember in a minute, okay? So, first solution: the inode contains a pointer to the first data block associated with the file, and I just maintain a linked list. Every data block contains, in some sort of wrapper, a pointer to the next data block, and maybe also to the previous data block, so that I can walk through the file either forwards or backwards, okay? Does everybody understand how this would work? What do I do if I want to append to this file? How do I do that? I add another data block to the end and I just link it into the linked list, right? Pretty simple, okay? So what is one of the pros of this approach? It's simple, right? Quite simple, right? I mean, if you can implement a linked list, you can implement this.
It might not be the simplest (the simplest might be the next one), but it's simple, okay? And the inode doesn't have to store that much information about the file contents. It just stores the pointer to the first data block, okay? What's the problem with this approach? How long does it take? Let's say I give you an offset. How do you translate that offset? What do you actually have to do? You potentially have to walk a large portion of the file contents just to look up an offset. That doesn't sound like a fire alarm; it's not loud enough. All right, so translating an offset requires walking this linked list, and that's slow, right? So this is a potential con: slow lookups, right? So here's another solution. I use a flat array. The inode contains an array of data block pointers, and essentially I allocate it when I create the inode, and then when I allocate data blocks, I just add them to this array at the appropriate index, right? So what are the pros of this approach? Right, so it's very easy to translate my offset. It's also simple, right? And offset lookups are really fast, right? I essentially take the offset, divide it by the block size, which in this case would be 4K, and that gives me the index into my array that I'm going to use to find the data block. What is the biggest problem with this? So the size of the inode, right? If I use a small inode, I've got a small maximum file size, because I just can't put that big of an array in the inode. And if I wanna support the largest possible file on the system, let's say I wanted the largest file to be like four exabytes or something like that, which new file systems support, okay? If I do that, then it means I need to have this gargantuan array in every inode, right? And what do you guys remember about most files on the system? Are most files huge? No, most files are small.
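A sketch of the two offset-translation schemes side by side, with invented block numbers: the linked-list scheme has to chase next pointers one block at a time, while the flat array just divides the offset by the block size:

```python
# Contrast of two ways to translate a file offset to a data block.
# The on-disk block numbers (11, 29, 73) are invented for illustration.

BLOCK_SIZE = 4096

# Solution 1: linked list. Each entry is (block number, index of next
# entry, or None at the end of the file).
chain = [(11, 1), (29, 2), (73, None)]

def lookup_linked(offset):
    """Walk the chain one block at a time: cost grows with the offset."""
    idx = 0
    for _ in range(offset // BLOCK_SIZE):
        idx = chain[idx][1]            # follow the "next" pointer
    return chain[idx][0]

# Solution 2: flat array of block numbers stored in the inode.
flat = [11, 29, 73]

def lookup_flat(offset):
    """Divide by the block size to get the index: O(1)."""
    return flat[offset // BLOCK_SIZE]
```

Looking up byte 9000 lands in logical block 2 (9000 // 4096) under either scheme; the difference is that the linked walk had to touch every earlier block's pointer to get there, while the array goes straight to the answer.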
That's why EXT4 can get away with an average file size of 16K: because most files are small, right? There are some big files, but a lot of files are small. I wish I had a graph of this. There's probably a good one I could pull up; maybe I'll do that next time to give you guys a sense of what the distribution of file sizes is, because this is important for file system design, right? And again, because of this huge difference between the largest possible file and the smallest possible file, either a large portion of the array might be unused if I support huge files, or I can't support huge files and I have a smaller maximum file size, right? So again, our observation here is that most files are small, but files can get big, right? We want to support very, very large files. Older file systems had these irritating limitations. For some reason FAT32 is still around; I don't know why. How many people use FAT32 or are forced to use FAT32? So FAT32 has a maximum file size of, anyone know? Four gigs, right? I'm sure when they designed FAT32, they were like, four gigabytes? Who would ever have four gigabytes of data in a single file, right? Well, now everybody does, right? And it's a real pain in the butt to work with FAT32 because you end up having to split things across multiple files, and it's just kind of a disaster, right? So newer file systems: again, I can't remember what the max file size supported by EXT4 is, but I think it's actually in the exabytes, right? So it's massive, right? At the same time, I do wonder if we'll ever have exabyte files. That's just me; I'm just gonna put that out there, right? And probably, you know, 10 years from now, I'll be saying, some people thought we'd never have exabyte files, right? And what do you put in them? Okay. Maybe by that point, we'll all be watching movies that'll be like 18 hours long, in like 10 billion by 10 billion resolution or something.
Okay, so the way that we support this on most file systems is through what's called a multi-level index. With a multi-level index, the inode stores several different types of block pointers. The first type are what are called direct block pointers. And these, as they sound, point directly to data blocks, okay? Remember, average file size, 16K. That means with a 4K block size, with just four direct pointers, I can already get up to the average file size, right? So that's good. That means that a lot of the time, all the blocks that are necessary to read a file have pointers that are stored in the inode itself, meaning that access time is very fast. I also, however, allow the file to grow bigger by storing some pointers to blocks that themselves contain pointers to data blocks. I call these indirect blocks. Don't worry, I have a picture. I just like putting this up here, because it's fun. Because it gets even better, right? I also store a couple of pointers to blocks that contain pointers to blocks that contain pointers to data blocks, all right? I call these doubly indirect blocks. And I can also store triply indirect blocks, right? So pointers to blocks that contain pointers to blocks that contain pointers to blocks that contain pointers to data blocks. I won't try to say that one again because I'm just gonna get tongue-tied. I think that was right, all right? So it ends up looking like this, right? This picture is just showing one level, right? So this is just an indirect block. So here's my inode, and I stored a couple of direct pointers, meaning that the inode has a couple of pointers to actual data blocks. But because this very, very bizarre inode only stored two of those, I wanted to let the file grow bigger. And so I added another special data block here that just has more pointers. And so as I add data blocks to this file, what I do is I fill up this indirect block.
And then what happens when this indirect block gets full? What do I do then? Okay? So then what I would do is allocate another block down here, a doubly indirect block, and have it point to other blocks with pointers, and I would start allocating from those, right? And I would essentially keep doing this until I ran out of space again, and then I would allocate more. And then maybe I would allocate the block that points to the block that points to the block. You see where this goes, right? So this is this nice, clever way of, again, being able to efficiently support small files, right? Most files will have all of their data blocks pointed to within the inode itself. But that one-exabyte file that you just created, I can still support that file, right, through doubly and triply indirect blocks. So the pros here are obviously that the index scales with the size of the file, which is nice, right? That's what we want. And offset lookups are still pretty fast, right? Especially in the common case, right? So in the common case where I'm just doing offset lookups within the direct blocks, they're very fast. It's essentially O(1), right? I'm looking up in this array. Once the file starts to get bigger, what starts to happen? What do I have to start doing in order to translate offsets? I have to keep following these pointers, meaning that I have to read more blocks, right? So for example, a one-byte read from this file, if it's in the first, you know, 16K, is gonna cause me to read the inode and then go directly to the data block. Once the file gets big enough that I need an indirect block, then I have to read one block, two blocks, three blocks, right? And if it gets really, really big, then I have to read one block, two blocks, three blocks, four blocks, right? So the number of blocks I have to read to get one byte of data out of the file goes up slowly, which is good, as the file gets bigger and bigger. All right, so that's nice.
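That slow growth in the number of reads can be sketched by counting blocks. The geometry here (two direct pointers, 4K blocks, 1024 pointers per indirect block) mirrors the two-pointer picture on the slide rather than any real file system's layout:

```python
# Sketch of offset translation under a multi-level index, counting how
# many block reads (beyond the inode itself) it takes to reach one byte.
# The geometry is invented for illustration.

BLOCK_SIZE = 4096
NDIRECT = 2                      # direct pointers stored in the inode
PTRS = BLOCK_SIZE // 4           # 1024 four-byte pointers per block

def blocks_to_read(offset):
    """Blocks read to fetch the byte at `offset` in the file."""
    blk = offset // BLOCK_SIZE   # logical block index within the file
    if blk < NDIRECT:
        return 1                 # data block only
    blk -= NDIRECT
    if blk < PTRS:
        return 2                 # indirect block + data block
    blk -= PTRS
    if blk < PTRS * PTRS:
        return 3                 # doubly indirect + indirect + data
    return 4                     # triply indirect chain
```

So a byte covered by the direct pointers costs one read past the inode, a byte reached through the indirect block costs two, and so on: the cost climbs by one each time the file outgrows a whole level of the index.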
And again, I mean, I can keep my small files small, and keep the metadata associated with them small, but I can allow files to get really huge, okay? And you're right, I didn't really come up with any cons of this approach; there probably are some, all right? Okay, so questions about, oh, goodie, I've got time to get through caching. Questions about storing file data? And now I'll come back to my earlier question. What did this start to look like for a minute? Page tables, all right? So we had a similar data structure design problem when we designed page table structures, right? But with page tables, we came up with what I would say is a different solution, right? So why did we come up with a different solution for page tables? Because the number of pages in use by a process is not a fixed size, but the address space that we have to span with our data structure is a fixed size, right? And there's the sparsity. Files don't have this problem, right? If I store four gigabytes of data in a file, I don't store it sparsely into a 16-gigabyte container, right? I store it into a four-gigabyte container, right? So it's a little bit of a different problem, and that means we came up with a different solution. John? Yeah, yeah, yeah, and man, I didn't put in a slide for this, but I wanted to talk a little bit, I guess I mentioned this last time. So that's exactly right, right? And extents are another way that file systems address this issue, right? So I mentioned this last time, and it's a feature of newer file systems; ext4 actually uses extents. So if you print off the information about files using debugfs, you can see that they have extents. And yes, one big goal of file systems is to allocate things contiguously whenever possible, right? And that's because, you know, if I need to read 16K from a file and it's all on one track, I'm great.
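As a quick illustration of the extent idea, here's a hypothetical sketch that collapses a per-block pointer list into (start, length) extents. The representation is made up for illustration; it is not ext4's actual on-disk extent format, but it shows why contiguous allocation makes the metadata tiny:

```python
def to_extents(block_numbers):
    """Collapse a list of data-block numbers into (start, length) runs."""
    extents = []
    for b in block_numbers:
        if extents and extents[-1][0] + extents[-1][1] == b:
            # Block continues the previous run: just extend its length.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # Gap in the allocation: start a new extent.
            extents.append((b, 1))
    return extents
```

A perfectly contiguous 1000-block file collapses to a single extent, while a badly fragmented one degenerates back toward one extent per block, which is exactly the seek-heavy case the lecturer is warning about.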
If those four data blocks are in four different locations, you know, I'm on the struggle bus at that point, right? It's gonna be slow. All right. Okay, so now we've talked about all this stuff about on-disk data structures, right? But the real stinky thing about file systems is we don't want to use the disk at all, okay? The disk is slow, slow, slow. I mean, disks are getting faster, flash is helping, but spinning disks are still orders of magnitude slower than other places where we can store information, right? So how do we do this? We've talked about some design principles. What's the canonical way that we make a big slow thing look faster? We use a cache, right? We put a smaller, faster thing in front of it and use that thing as a cache so we can avoid using the big slow thing. And in this case, also with file systems, the other thing we can do is avoid using the big slow thing inefficiently, right? So we'll get to that maybe in a later class. But using a cache allows us to store information temporarily and defers the point where we actually have to make a decision about how things are gonna end up on disk. And that deferral can help the file system better allocate on-disk data structures. Okay, so in the case of the file system, again, we're gonna use memory. And the memory we use to cache file system data is referred to as the buffer cache, right? And buffer caches produce this really dramatic improvement in performance, right? One of the things I guess that is sad about the fact that we've trimmed assignment four from this class is that a portion of assignment four used to be to implement a buffer cache for the simple file system in OS 161. And one of the reasons that, you know, I decided to eliminate this particular assignment is that the process of implementing a buffer cache shares a lot of features with assignment three, right?
So I didn't think there was really a lot of new technical detail there. However, implementing the buffer cache just produces this massive boost in performance, right? So when people actually got it to work, it was like, whoa, you really noticed, right? I mean, a test that would take 10 seconds before ran in like 0.1 seconds, right? So it's really, really dramatic because disks are so slow, right? All right, but I'm gonna be calling this thing the buffer cache, right? I don't know why it's called that, but that's what it's called. It is a cache, and it's caching, you know, buffers. It's buffering data for the file system, right? Okay. But you might say here, and this is an interesting observation, right? Well, aren't I already using memory for something? What am I using memory for? I'm using memory for memory, right? Like I gave processes access to memory, you know? So why, now I've got this buffer cache that wants to use memory, right? So yeah, operating systems provide memory to processes, which use it through the standard memory interface using standard instructions and virtual address translation. We talked about all that good stuff. Now I'm also gonna use memory to cache file data, and I'm gonna do that to improve performance. But, you know, these two things are in competition, right? I've got a fixed amount of memory on the system, and now I've got two uses for it. Before, I had this fantastic world where the only things competing for memory were processes competing with each other. Now I've made things more complicated because I have the buffer cache competing for memory with the memory management system itself, right? And you can imagine what happens here. So let's say I'm a file system person, and the file system is the most important thing on the system, and the disk is really slow, and so I'm gonna allocate this huge buffer cache and leave only a small sliver of memory on the system to use as main memory.
What's the problem? What will happen here? Well, again, I can use swapping, I can use other tricks. I don't have to run out of memory, but what's gonna happen? I'm gonna swap more, right? I might start thrashing in my memory subsystem. The interesting thing here is that if I limit the amount of main memory, what starts to happen? I have to use the disk to swap, right? So I was trying to avoid using the disk by making the buffer cache really big, right? But now the main memory system is hammering the disk because it's gotta swap all the time, right? So this is not good, okay? Okay, so now let's say I'm the memory subsystem maintainer and I'm saying, okay, the buffer cache isn't that important, small buffer caches work pretty well, now I need lots of memory. So what's the problem here? File access is slow, right? File access might be really slow, especially for certain workloads. So this decision about how to partition memory between the file system buffer cache and the memory subsystem is usually not made statically, right? This is something that is allowed to evolve over time. And most systems have an integrated buffer cache and memory management subsystem, meaning that there is one pool of memory. It's not like I say at startup time, okay, memory system, you have 50%, file cache, you have 50%. Based on the access patterns of the system, the buffer cache is allowed to grow and shrink, and the amount of main memory is allowed to grow and shrink. And on Linux, there's this parameter that I've always loved, I mean, whoop, whoop, swappiness, right? What does that even mean? So it turns out, apparently what it means, and this is one of these little black-box magic numbers, is that it controls how aggressive the kernel is in pruning unused process memory pages, right?
So when we talked about memory management, we talked about the fact that operating systems don't just trim when they run out of memory; they also trim constantly. They're constantly looking for unused pages, pages that haven't been used for a while. They're running the clock algorithm, they're spinning the hand, and they're writing those pages to disk and potentially evicting them preemptively, right? We talked a little bit, when we talked about memory, about the fact that this happens because I wanna prepare myself in case an application starts and I need a big gob of memory. The other reason that it happens is that the file system buffer cache can use the pages that I've evicted. So if the memory system has a bunch of unused pages and it can evict them, then the buffer cache can use that memory to buffer files, right? So this is another reason this happens, and on Linux, this swappiness parameter controls how aggressively the operating system is doing this, right? And essentially, hence, it controls to some degree the balance between main memory usage and the buffer cache itself. And you can control this on your system, and I don't know what effect it has. It's probably set to some reasonable default. But for certain cases, there are times when you might wanna set this differently, depending on if you understand the features of your workload. So the other question with the buffer cache is where to put it. So I haven't put up this slide before, but I think we should stop and talk about it for a second because you guys are starting to see some of this, right? So we talked about how the file system interface on Unix and Linux-like systems is designed to allow multiple file systems to be used. And the way it's done is by providing a virtual file system interface that presents the appearance of all these different calls that the file system could use, and then having individual file systems implement that interface themselves, right?
So when you call VFS open, what happens is that that call to VFS open is translated to either a call to SFS open if you're using the OS 161 simple file system, which you're not. So actually it's a call to MUFS open. And there could be an arbitrary number of different file systems here, right? So Isaac could decide that he wants to do assignment four and write a file system himself, and he could write Isaac FS right here, and then in October when he's finished, he could come show it to me, right? So again, having this virtual file system interface allows an arbitrary number of file systems to be implemented, and the way that you implement a file system is you implement this interface, right? You implement the VFS interface, and then VFS will send you the calls that are appropriate depending on what kind of file system is mounted, right? So when I mount a simple file system on the system, the simple file system code starts to receive opens and closes that are appropriate to its files, right? At the end of the day, though, all the file systems that are using a disk are using this block-level disk interface. So the simple file system is gonna decide what kind of data structures it's gonna use, where it's gonna put things on disk, but at the end of the day, it's reading and writing disk blocks, right? And MUFS, well, I don't know about MUFS, maybe that's not true, because MUFS is only using the underlying file system. This might be a lie. But in general, if I had file systems that were using a real disk device, rather than running on top of another file system, they would all be using this low-level disk interface, right? They're using different parts of the disk, but they're all reading and writing disk blocks. So there are two choke points where I can put a cache, right? So my goal is I wanna write a single buffer cache that can benefit all the file systems on the system transparently.
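The dispatch pattern described here can be sketched roughly as follows. The class and method names are made up for illustration, and real VFS layers dispatch through per-vnode operation tables rather than the toy path-prefix lookup used here, but the shape is the same: one generic front end, many interchangeable implementations behind it:

```python
class ToyFS:
    """Stand-in for one concrete file system (an SFS, an 'Isaac FS', ...)."""
    def __init__(self, name):
        self.name = name

    def open(self, path):
        # A real file system would translate the path to an inode here.
        return f"{self.name} opened {path}"

class VFS:
    """Generic layer that routes calls to whichever file system is mounted."""
    def __init__(self):
        self.mounts = {}                  # mount point -> file system object

    def mount(self, point, fs):
        self.mounts[point] = fs

    def open(self, path):
        # Find the longest mount point prefixing the path, then dispatch.
        point = max((m for m in self.mounts if path.startswith(m)), key=len)
        return self.mounts[point].open(path)
```

Implementing a new file system then just means supplying another object with the same interface; the VFS layer never needs to change.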
So one approach here is I could have every file system implement its own cache. And you know, that would be terrible, right? Because now you're pushing all that work onto the file system developers when you could consolidate it at the system level, right? So where are the two choke points here where I could put my cache? Two places in here where things come together, yeah. So right, so I could put it down here, right? Between the disk and the file system. And where's the other place? There's somebody over here. So I could put it up here, right? So I could put it either under the virtual file system layer or over it, right? And if I put it here, here's the implication, right? If I put it up here, what's the interface to the buffer cache? It's the file system interface, right? It's open, close, read, right? Like those are the calls that the buffer cache would see, right? Because what would happen here is I would make the call to the VFS layer, and it would pass it to the buffer cache. The buffer cache would try to perform those operations out of memory, and if it failed, it would pass them down to the individual file systems, right? So this is one approach. And the other approach, as has been pointed out, is I can put it down here at the disk level, okay? And what that means is that it's gonna use the disk interface; it's gonna read and write blocks. So, you know, my file system operations are gonna go through the file system implementation, and when it says read block, it's gonna say read block 15 or whatever. The buffer cache is gonna try to find that block in memory, and if it can find it, it will return it from memory. If not, it's gonna pass those calls through to the disk, right? John. No, no, no, so yes, devices themselves also will do caching. But the caching we're talking about here is done in the system itself, right? So John's exactly right, I should have mentioned this.
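That second option, a cache at the block level, can be sketched like this. It's a minimal LRU sketch under assumed interfaces (anything with a `read_block(n)` method counts as a disk); a real buffer cache also handles writes, dirty blocks, and pinning, all of which are omitted here:

```python
from collections import OrderedDict

class BufferCache:
    """Block-level cache sitting between the file systems and the disk."""
    def __init__(self, disk, capacity=64):
        self.disk = disk                  # assumed: object with read_block(n)
        self.capacity = capacity          # max number of cached blocks
        self.blocks = OrderedDict()       # block number -> data, in LRU order

    def read_block(self, n):
        if n in self.blocks:
            self.blocks.move_to_end(n)    # hit: served from memory
            return self.blocks[n]
        data = self.disk.read_block(n)    # miss: go to the slow disk
        self.blocks[n] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict least recently used
        return data
```

Because it speaks only in block numbers, this one cache transparently benefits every file system on the system, which is exactly the consolidation argument made above.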
When you buy disks, disks have small caches on them that they use themselves. So it's possible, you're right, it's possible you could send a request to the disk and the disk might have that block cached as well. So the disk might not actually do a seek; the disk might actually fulfill it out of its own cache, right? Operating systems are a series of caches, right? Cache, cache, cache, cache, and then you might actually do something, right? I don't actually know how disk devices use those caches. That would be really interesting to find out about. My guess, from what I've read, is that a lot of times, let's say a disk is trying to seek to one block or one sector. So it moves the heads, and then it's sitting there waiting for the sector to come under the head. Hey, I've got data flowing under me. Why not just read it and put it into my cache, right? So in cases where the drive is operating and it's waiting for something to happen, it's in a position to read data. When I'm seeking, I can't read data because the heads aren't stabilized over a track. But once I've stabilized the heads on a track and I'm just waiting for the data to come underneath the heads, if I've got a little bit of memory there that I can read data into, why not just read those blocks into memory, just in case they're the next thing that gets requested? But there's probably smarter ways of doing it, okay? But yeah, so we're not talking about the disk-level caches. We're talking about a system-level cache. All right. Okay, so if we cache, let me see how much time I have, I might have to abort. Okay, I'm gonna take two more minutes. If we cache above the file system, what are we caching? Again, open, close, read, write. Like what are the things that would be in the cache? Carl, you're looking like you know the answer and you're just, well, yes, yeah. No, no, but like, semantically, what are the things that we're caching, right?
Like the process is gonna call open, the process is gonna call read, and the cache is gonna satisfy that request. So the cache is operating on what? It's operating on files, right? So I call read, and the cache needs to be able to find the file that that read is associated with and return some of the contents, okay? The buffer cache interface, as we talked about, is the file system interface. It's open, close, read, write, right? Let me see. I'm gonna come back to this on Monday because I'm out of time and this is too much to go through in the rest of class. So Monday we'll finish this. This is like another two or three minutes just talking about the implications of putting this in different places, and then we'll talk about the Berkeley Fast File System and how it locates data structures: essentially, again, state-of-the-art 1970s design, all right? Have a great weekend, guys. I'm sure you have nothing to do this weekend other than enjoy the sunshine and go out for a drink with friends and watch a couple movies and sleep in and things like that. So have fun, and we'll see you on Monday.