Welcome everybody to 162. We've been talking about file systems, and we were actually going through some case studies last time of some real file systems, and I would like to continue that. But first I wanna set the context to make sure we're all on the same page here.

So on the left side of this diagram, we have the user interface that's above the system call level, where you open and close files and so on and read and write them using a path name. That path name is resolved by a directory structure which ultimately finds an i-number, which is just an index into the inode structure on the disk, and there may be several of them; I'll show you that in a moment. But that i-number then gives you enough information to find a structure that then points out which of the data blocks are part of that file, and ultimately they're on the disk. So this discussion that we've been having, including the FAT file system and the Berkeley UNIX file system, et cetera, is all about these structures that somehow map from the byte-oriented file paths that you're used to at the user level down into individual blocks on disk, and puts it all together so that it looks like a file system.

Okay, and so we had this, for instance: we talked about the FAT file system. This was the simplest example of that, where we have a whole bunch of disk blocks which are linearly addressed. Okay, one, two, three, four, five, six, seven. The file allocation table, or FAT, basically has a one-to-one mapping with each block, and all it does is provide a way to link blocks together into files. So here's an example where the file, i-number 31 in this case, represents a file whose first block is here, and then that points to the next block, which points to the next block, which points to the next block, and so we have four blocks, each of which might be say 4K or what have you in size, and that's a complete file. Okay, and so that's one very simple index structure, and this is one that's lasted since the 70s, basically. So that's a pretty long time, and it's in all of your cameras and USB keys and so on that you pass data around with.

And of course, there are certain things that are important here that we talked about, like where is this data structure, the FAT, which is essentially just an array of integers? Where is that stored? It's stored at a well-defined place on the disk, which is the beginning of the disk. Usually there's a couple of copies of it to handle errors, but there's a well-defined place on the disk that the file system defines and that the operating system knows about.

So were there any questions on the FAT file system before I move on? So that's a good question: can you usually format a USB key to some other file system? Yes, you can often format them to other file systems. The reason that most cameras and other devices use the FAT file system is it's so simple that it's easy to put into firmware. So sometimes when you do that formatting that you talk about there, you might only be able to plug it into a bigger machine like a Linux box or something that is running a different file system, rather than a camera that's running the FAT file system. That's a good question. All right, any other questions?
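(As an aside, here's a minimal sketch in C of the FAT chaining just described. The constants and the in-memory array are made up for illustration; a real FAT lives on disk and has several reserved entry values. The idea that each entry names the next block of the file is exactly this:)

```c
#include <stdint.h>
#include <stdio.h>

#define FAT_EOF    0xFFFFFFFFu   /* hypothetical "last block of file" marker */
#define NUM_BLOCKS 1024          /* total data blocks on this small volume   */

/* One FAT entry per data block: entry i names the block that follows
 * block i in whatever file block i belongs to, or FAT_EOF at the end. */
static uint32_t fat[NUM_BLOCKS];

/* Walk the chain for a file whose first block (its "file number" in this
 * scheme) is `start`, printing each block that belongs to the file. */
void print_file_blocks(uint32_t start) {
    for (uint32_t b = start; b != FAT_EOF; b = fat[b]) {
        printf("block %u\n", b);
    }
}

int main(void) {
    /* A four-block file starting at block 31, like the lecture example. */
    fat[31] = 200; fat[200] = 7; fat[7] = 512; fat[512] = FAT_EOF;
    print_file_blocks(31);
    return 0;
}
```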
So the next file system we talked about: we started with the 4.1 BSD file system. This is a slightly different picture than I showed you last time, but it's the same idea. And here the inode, or index node, has some metadata at the top, such as: what are the mode bits? Is it read, write, execute? What are the owners? Timestamps for modifications? How big things are, and so on. And then the important part here is this multi-level tree, where there's some number of direct blocks. In the original BSD file system there were 10 direct blocks; that later got expanded to 12. But those direct blocks are really pointers from within this inode structure to a block on disk. And what does a pointer mean? It means the number of that block on disk, all right? And then the singly indirect pointer is a pointer to a block that has a bunch of pointers in it. So the direct ones you can follow directly to get to those data blocks; the singly indirect you follow to a block which then gives you pointers to a set of blocks. And in the original BSD there were 1K blocks and four bytes per pointer, and so in some sense there are 256 data blocks reachable through that indirect block. Doubly indirect: basically you have a block which points at blocks which point at data, okay?

And so, for instance, you can very easily see that since there are 10 direct pointers, if you wanted to go for block number 23, well, the first 10 of them are direct blocks. For the next set of them, the next 256, you'd actually have to read the indirect block first and then the data block. And so there's two block reads, okay?

Yeah, and so the question here was: what was the design decision to go from 10 direct blocks to 12, what was the story there? Basically that was based on data. They decided that was the right thing to do at the time. Sometimes decisions are made for reasons that aren't always the greatest, but that one I believe was actually made because they figured they needed a couple more direct pointers there.

So you should be able to do this kind of calculation on a test, for instance. Like, how about block number five? Where's that? Well, we know the direct blocks are the first 10 of them, so block number five would be zero, one, two, three, four, five, right here, okay? And then block 340: that's gonna be into the doubly indirect ones. And so you're gonna have to read the doubly indirect block, then a singly indirect block, and then the data, okay?

So the pros and cons of this scheme were really that it's relatively simple. And as we talked about last time and the time before, it basically supports small files really well, but it also handles large files. Now, the question about how do we know that 340 is in the doubly indirect block range? Well, because these data blocks you see here are all linearly laid out. And so the first 10 blocks are in the direct blocks, and then there are 256 more through the singly indirect block, so that takes us up to 266 blocks. Block number 340 is clearly larger than 266, and so that's why we're getting into this range over here. And it's not large enough to get past the doubly indirect region into the triply indirect one. Does that answer your question?

So there is no metadata in any of these pointers. These pointers are just pointing to data, okay? All of the metadata is in the inode itself, and so that's important. You can kind of think the inode is the file, or the file is the inode, because it has all of the metadata about who can read and write it, et cetera, and all of the data is pointed at. And everything to the right of the inode is just raw data blocks, with nothing else other than either data, which is just binary, or pointers to data blocks, okay?
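(To make that test-style calculation concrete, here's a small sketch in C using the original 4.1 BSD parameters: 10 direct pointers, 1K blocks, 4-byte pointers, so 256 pointers per indirect block. The function name is made up for illustration:)

```c
#include <stdio.h>

#define NDIRECT        10    /* direct pointers in the original BSD inode  */
#define PTRS_PER_BLOCK 256   /* 1K block / 4-byte pointers                 */

/* How many disk reads does it take to reach logical block `n` of a file
 * (not counting the inode itself, which is already in memory)? */
int reads_for_block(long n) {
    if (n < NDIRECT)
        return 1;                          /* direct: just the data block        */
    n -= NDIRECT;
    if (n < PTRS_PER_BLOCK)
        return 2;                          /* singly indirect block + data       */
    n -= PTRS_PER_BLOCK;
    if (n < (long)PTRS_PER_BLOCK * PTRS_PER_BLOCK)
        return 3;                          /* doubly + singly indirect + data    */
    return 4;                              /* triply indirect chain              */
}

int main(void) {
    printf("block 5:   %d read(s)\n", reads_for_block(5));    /* 1 */
    printf("block 23:  %d read(s)\n", reads_for_block(23));   /* 2 */
    printf("block 340: %d read(s)\n", reads_for_block(340));  /* 3 */
    return 0;
}
```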
Now, the downside of this. Well, when reading a file, the question is: how do you know what the next block to look at is? Well, simply, if you remember, the open file description that you get when you do a file open keeps track of what your offset in the file is. And so once you know the offset, what byte you are on, then you can divide by the block size and that'll tell you what block you want. And then it's an easy mapping, just like we did here: block 340 is always down in here. And so that piece about how you know what the next bytes or the next block is, that's because the next-byte pointer is kept in the open file description that sits behind the file descriptor you got when you did an open, okay? So you can look back to lectures from about a month ago where we talked about that in more detail, all right?

So the downside of this is there's nothing in here, and there's certainly nothing in the FAT file system, that says anything about making this perform well. Ideally what we'd like is that successive data blocks are laid out on the disk on the same track or close-by tracks, in order to make things really high speed. BSD 4.1 didn't do anything to help with that. And so as a result, you'd format the file system from scratch and everything would be nice and fast, and then over time, as you wrote files and deleted files and created files and so on, things would get progressively slower, and that's because locality would get scrambled on the disk itself.

And so the BSD follow-on, the fast file system, which is what we were talking about when we ran out of time last time, did a lot of work to try to make things perform fast and well. Essentially they kept this layout of the inode, although they had a couple more direct blocks, and they made the data blocks larger. And so let's look at this a little bit. This is the fast file system. There's a paper that I put up on the resources page that you can take a look at that talks about this fast file system. It's got the same inode structure, modulo a little bit. One thing I said last time that I was incorrect on, sorry about that, I had a typo on my slides: basically the block size in the original system was 1,024, and that went up to a 4,096 minimum, although there were also options to have slightly larger blocks. The paper is up there; you can take a look at it.

And there's a number of performance and reliability optimizations that were done in the fast file system. So as I'm gonna show you, rather than putting all the inodes on the outer tracks, they're distributed throughout the disk. There's bitmap allocation instead of linking things in a free list. And if you have a bitmap where a one means in use and a zero means free, you have a much better ability to take a look at the bitmap as a whole and say, oh, here's a big string of zeros, free blocks, that I'm gonna start a new file in, so there's the ability to lay things out well, okay? And so part of what was done in the fast file system was an attempt to actually allocate files contiguously, and to address some other performance issues like skip-sector positioning, which I'll tell you about in a little bit.

And one of the interesting things that you might or might not realize is that the fast file system is forced to always keep 10% of all of the data blocks in reserve, meaning that when the disk is 90% full, it appears to be 100% full. Keeping 10%, it turns out, is a high enough number of free blocks that the likelihood of finding a big string of empty blocks together on a track is much higher. And so it turns out that 10% is an important aspect to getting good performance out of the file system, okay?
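(Two of the little calculations in there are worth seeing in code form: the offset-to-block mapping driven by the open file description, and the 10% reserve check. This is a minimal sketch with made-up numbers, not any real implementation:)

```c
#include <stdio.h>

#define BLOCK_SIZE 1024          /* original BSD block size                 */
#define RESERVE_FRACTION 0.10    /* FFS keeps roughly 10% of blocks back    */

/* The open file description tracks the current byte offset; dividing by
 * the block size gives the logical block to fetch next. */
long offset_to_block(long offset) {
    return offset / BLOCK_SIZE;
}

/* Report the disk as "full" once only the reserve is left, so ordinary
 * allocations never eat into the last 10% of blocks. */
long blocks_available(long total_blocks, long free_blocks) {
    long reserve = (long)(total_blocks * RESERVE_FRACTION);
    return (free_blocks > reserve) ? free_blocks - reserve : 0;
}

int main(void) {
    printf("offset 348660 lives in block %ld\n", offset_to_block(348660)); /* 340 */
    printf("usable blocks: %ld\n", blocks_available(100000, 9000));        /* 0: looks full */
    return 0;
}
```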
So the first thing I said was they changed the inode placement. If you look at the early UNIX file systems and the Windows FAT file system, et cetera, the file headers, the inodes, were all stored in a special array on the outermost cylinders. And it was a fixed-size array, so the number of inodes was fixed at the time you formatted, and each is given a unique i-number. And so you can say for every i-number there is an inode, which means there is a file associated with it. That's okay, except you can imagine it's got some pretty big performance problems, because the inode is potentially stored far away from the data. And it's also got some reliability problems, because if the disk head were to crash on the outer cylinders and trash all the inodes, you've effectively lost all the files, even though all the data is still okay. Okay, so there's a number of reasons why this particular layout wasn't good. And so I've put them down here as problems one and two. So inodes all in the same place means you can potentially lose a lot of, or all of, your data when you lose the inodes. And when you create a file, you don't really know how big it is, and so there isn't any really good way to handle the layout of the blocks relative to tracks when you sort of shove all the inodes in one place.

Okay, and so let's take a little bit of a look at what they did instead. So in the fast file system, they divided each platter into a whole bunch of groups: block group zero, one, two, et cetera. And so rather than, for instance, putting all the inodes on the outer track for the whole platter, what you do instead is you have some inodes and a free-space bitmap for each block group. Okay, and what's good about that is that now the inode associated with a given file can actually be in the same block group as that file. Okay, and if you choose to have a directory with files in it, both the inode for the directory and the inodes of all the files in that directory can all be in the same block group, along with the data. And so things like ls, or ls -l or whatever, that give you the metadata of all of the files in a directory, can run very fast because of the locality that we've gotten out of this, okay?

And so the file volume is divided into block groups, each of which is a set of tracks, and so moving the head back and forth within a block group is not that bad. And so we've got data blocks, metadata, and free space all arranged within a single block group. And just by going to a block group with some extra space, you have the ability to lay things out very efficiently. Okay, all right, I think I said all of that.

So when reading, let's see. So furthermore, here's the way the layout algorithm works. If you remember, one of the problems that we have with the UNIX interface is that when you open or create a brand new file, it's empty; the file system doesn't know how big you want it to be and it only figures that out on the fly as you're writing to it. So what the fast file system did was, for small files or files that have just been created, it fills holes in the local block group. And then when the file crosses a certain threshold, it goes to a whole other block group and finds a big string of empty blocks to continue on. And so there are these thresholds that actually cause you to go to another block group. Okay, and the feeling there is that if you're running a big sequential read and every so often you have to switch to another block group, that's okay, because you're still getting relatively high performance with these occasional hops, rather than having to go back and forth and back and forth for every block, okay? And again, it's important to keep about 10% free in order to make this work, okay?
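(Here's a rough sketch of what one such block group holds. The struct, the sizes, and the field names are hypothetical, since real systems compute these regions' offsets rather than declaring one literal struct, but the grouping is the point: the inodes and free-space bitmaps live right next to the data they describe.)

```c
#include <stdint.h>

#define BLOCK_SIZE        4096
#define INODES_PER_GROUP  1024
#define BLOCKS_PER_GROUP  8192

struct inode {                     /* simplified fixed-size inode             */
    uint16_t mode;                 /* permission / type bits                  */
    uint16_t uid;                  /* owner                                   */
    uint32_t size;                 /* length in bytes                         */
    uint32_t mtime;                /* last-modified timestamp                 */
    uint32_t direct[12];           /* direct block pointers                   */
    uint32_t indirect;             /* singly indirect block                   */
    uint32_t dbl_indirect;         /* doubly indirect block                   */
};

struct block_group {
    uint8_t      block_bitmap[BLOCKS_PER_GROUP / 8]; /* 1 = in use, 0 = free  */
    uint8_t      inode_bitmap[INODES_PER_GROUP / 8]; /* which inodes exist    */
    struct inode inode_table[INODES_PER_GROUP];      /* inodes near their data */
    uint8_t      data[BLOCKS_PER_GROUP][BLOCK_SIZE]; /* the data blocks       */
};
```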
So, good question: is a block group the same as a cylinder group? Well, a block group is what you see on a surface. And if you remember, there are a whole bunch of platters on top of each other. And so if you take the block group and you go through the whole stack, that's a cylinder group. Okay, does that make sense? So the cylinder group is the set of rings of a block group going through all the platters. And you remember the reason that we talk about it that way is because all of the different heads in the head assembly move together as a group, and so in some sense we move onto a cylinder. And so then a cylinder group would be a group of cylinders that are all small movements of the head apart. Okay, good question.

Now, the summary of the layout pros: for small directories you can fit all of the data, the headers, et cetera, in the same cylinder, with no seek for small directories and small files. So that works out well. The file headers, the inodes, are actually much smaller than a block, so you can get several of them at once, and that really optimizes for doing directory operations that use all of the files in a directory. So that works really well. And then last but not least, and this is certainly discussed in the paper and it's an important side effect: by putting the inodes close to the data, what that means is if the head crashes (that's where the head touches down on the spinning disk) and takes out a track or even a whole block group, all the other files are fine, because their inodes are safely next to their data in other block groups. So this is a big reliability advantage as well, okay? Yeah, you could think of it that way. So again, think of a cylinder group as all of the block groups on top of each other on the different platters, which are double-sided.

So, just to say a little bit about the allocation. If you remember, there's a bitmap per block group, and for instance this is just one big bitmap; it's stored in a set of blocks at the beginning of the block group. There's just a set of bits together that tell you which blocks in the block group are free. So we have ones that are in use, and we have a couple of free ones. And so if we write a couple of blocks in a file, the file system finds them very quickly, because it can just look at the bitmap, okay? And then when it's writing a large file, it's easy to figure out which blocks are available, again by looking at these long strings of zeros. And when you get past a certain threshold, as I mentioned, you go to another block group and pick a big string of zeros to write on, okay? And these are basically the heuristics, heuristics to keep the overall speed of the fast file system high, even over time as you delete files, create files, and write files, okay? Good.
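(A minimal sketch of that bitmap scan, with made-up sizes, might look like this: find a run of consecutive zeros, which are free blocks, long enough for what you're about to write. A real allocator also honors the 10% reserve and per-group thresholds.)

```c
#include <stdint.h>
#include <stdio.h>

/* Bit i is 1 if block i is in use, 0 if it is free. */
static int bit_in_use(const uint8_t *bitmap, long i) {
    return (bitmap[i / 8] >> (i % 8)) & 1;
}

/* Find a run of `want` consecutive free blocks; return the index of the
 * first block in the run, or -1 if no such run exists in this group. */
long find_free_run(const uint8_t *bitmap, long nblocks, long want) {
    long run_start = -1, run_len = 0;
    for (long i = 0; i < nblocks; i++) {
        if (bit_in_use(bitmap, i)) {
            run_start = -1;                /* run broken by an in-use block */
            run_len = 0;
        } else {
            if (run_start < 0)
                run_start = i;             /* a new run of zeros begins     */
            if (++run_len == want)
                return run_start;
        }
    }
    return -1;
}

int main(void) {
    uint8_t bitmap[4] = { 0xFF, 0x0F, 0x00, 0xFF };   /* blocks 12..23 free */
    printf("run of 8 free blocks starts at %ld\n", find_free_run(bitmap, 32, 8));
    return 0;
}
```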
So the last thing I wanted to give you an idea about is the rotational delay problem, which is an interesting one for old systems. The issue is that, if you remember, as the disk rotates, the head is picking up all of the various sectors that are in the track that it's on. And in some of the older systems, back in the beginning of the fast file system, you'd read a block or a series of sectors off the disk into memory, and you'd have to work on it a little bit before you went to read the next thing, and by the time you got back to read the next thing, it had actually passed under the head, and so you had to wait for a whole new revolution to get the next block. And so if you're not careful, everything slows down: you've done such great placement on the track, but because the blocks are too close to each other on the track, you keep missing them. And so what the fast file system did is what's called skip-sectoring, where it calculated roughly how much time was needed. And so, for instance, all of these magenta blocks on a given track are all part of the same file, and you put that extra space in there so that as the disk is rotating, you grab a sector, you process it, and when you're ready for the next sector, it's just coming underneath your disk head. That was called skip-sectoring, and they implemented it as part of the fast file system paper.

Today, of course, as I implied a couple of lectures ago, we actually have a whole bunch of RAM on the controller, and so what happens on today's disks is you just read the whole track into a RAM track buffer, and then for subsequent reads it doesn't matter how long it takes to get back to them; you can pull them out at high speed without worrying about the physical rotation, okay? And so this is a good example of something that was solved back in the original fast file system days that has been made obsolete by smart disks.

Now, there's a question in the chat of whether the fast file system gets used anymore. Yes, its descendants are basically in the Linux ext2 and ext3 file systems, the BSD versions of the UFS file system, and so on. So the descendants of that code are all still used, okay? So this is not just a historical artifact, except for the rotational delay fix. So it's a useful file system to know about, okay? So modern disks plus controllers do all sorts of things that I mentioned: not only do they do full-track buffering, they also run the elevator algorithms, and to a large extent they figure out which blocks are bad and hide that even from the operating system, in some instances by transparently mapping good blocks in over bad ones.

So the pros of this fast file system, which was in the 4.2 version of BSD: it's very efficient storage for both small and large files, and that just comes from the structure of the inode. It has good locality for small and large files (that's because of the way the block groups were divided up and inodes were spread about), there's good locality between metadata and data, and you don't need to do any defragmenting to get performance back, unlike earlier versions of the file system. The cons are that it's still pretty inefficient for tiny files, because, for instance, if you think about it, a one-byte file actually requires both an inode and a data block, okay? So let's look at this for a second. You know, it's surprising, but let me just go back to this. If you look at this layout, it doesn't matter how small the file is: you still have to have all of the inode structure and then you have to have a complete block for the data. And so if you have only a few bytes in your file, it's extremely inefficient, okay?
And that's just one of the consequences of this layout. Now, this does do quite well for small files in general, but for really, really small files it's not so good. And there's always this inode separate from the data, okay? So we can do something about that, which is what was done in NTFS, and I'll tell you about that in a second.

I did wanna do a little administrivia first, which is: we're done with the grading. I know I said it was gonna be out early; I think it came out over the weekend. We had a higher mean this time, 55, with a standard deviation of 15. So that's about standard for this class. And as we've talked about before, there's a historical offset here of 26. So you can take your grade, whatever you got, add 26 to it, and take a look at the various bins that we put up on the website to sort of get an idea what grade you got on that exam. There's no class on Wednesday, so I guess you won't be hearing me on Wednesday, and you can take that time to take a breather and get outside a little bit. The other thing that I wanted to mention, and again I've said this before: make sure, if you've got any group issues or if you have a group member that's MIA, that you let us know when you do project evaluations, and that your TAs are well aware of what's going on. Okay, that way maybe we can reach out to them and help in a situation where you couldn't otherwise, or maybe we can get you all together to talk. We'll try to make sure that project three is smooth, and it's certainly important for us to know that when it comes to awarding points for the project at the end. Okay, I don't think I had any other administrivia. Anybody have any questions? I know that the regrade requests and so on are still going on, so, okay.

So I guess with that I'm gonna move on now. So this issue that we had with the standard kind of indexed file system is this idea that the inodes are separate from the data, which actually works pretty well for most sizes of files. The question is, could you do something different? And, just to answer the question from the chat earlier, this idea of whether the fast file system is even used at all: the answer is yes. The descendants of the fast file system have found their way into Linux; the ext3 file system is one that is pretty standard in Linux these days. And there's also, in FreeBSD, a variant of the original fast file system as well.

But here's an example of the block groups laid out in ext2 or ext3. I'm gonna keep them together for a moment, because they're effectively the same thing as far as this layout goes. So here's our block groups. We have a group descriptor table that's kind of along with the superblock at the beginning of the disk, in a well-defined place. The superblock describes information about the file system as a whole. If you look in that descriptor table, you can figure out where block group zero is, et cetera. The other thing that Linux has got is that at the time you format a new file system, you can pick the size of the blocks: you can make it 1K, 2K, 4K, 8K; 4K is pretty standard. Very similar to 4.2 BSD, with 12 direct pointers. If you look here, for instance, you could say, well, what if we wanna create the file /dir1/file1 in ext3? What happens there is you gotta find the root directory.
So you go to a special spot on block group zero, and in that inode table you say, well, inode number two is where the root directory is, and that inode points to, say, block 258 for the actual data of the root directory. You look up dir1; that says, oh, that's gonna be inode 5033. You look down at inode 5033, which is in block group two, for instance. In there, you start looking at the data, and it points to, say, block 18431, which is the data for dir1. And you look in there, and you can create file1 as an entry in that directory and allocate a block for its data. Okay, and so the way to follow these pointers, what you should think about, is that every pointer here represents a block number that's being pointed at from some other block, and that's how we get the structure of the file system itself. Now, the difference between ext3 and ext2 is that ext2 is like the original file system; ext3 adds journaling on top of that in order to give us a level of reliability. And if we get that far today, I'll tell you more about journaling too. Okay. Okay, questions? All right.

So, if you remember the directory abstraction, I do wanna say a tiny bit more about that as well. So you have the root directory, it's got the /usr directory inside of it, and that points to, say, /usr/lib4.3 and /usr/lib; those are separate directories, and inside, this directory points at an actual file, et cetera. So directories themselves are basically specialized files in a specialized format, and they're lists of file name to file number mappings. Okay. And there's a bunch of system calls that actually interact with directories directly. So for instance, open or create of a file with a file name actually traverses this directory structure to figure out which one of these subdirectories you're gonna put the new file name in. There are mkdir and rmdir system calls for making and removing directories in a given place. There's also link and unlink, which can add or remove just a link. So potentially, if this particular file foo has two different names in the directory structure, you could unlink one of them or you could create a new one. We'll talk a little bit more about link and unlink in a second.

So, the question of whether the kernel itself is stored outside the file system: it depends on what you mean. The kernel itself is certainly on the file system. Okay, so there's special boot code that just knows enough about the file system to pull the kernel in off of a special /boot directory, for instance, in the root directory. It's not terribly intelligent, but it knows enough to read through that to pull the kernel into RAM. And then once that's been bootstrapped, it starts booting and it can do the rest of the file system. So yes, the kernel is actually in the file system, which is an interesting catch-22 when you think about it, because you have to make sure that the boot code that you load from some well-defined place on disk has enough information and knowledge on how to interpret the file system to just pull the kernel itself in. Yeah, that's a great question. So libc provides a bunch of support, like opendir and readdir. You guys should take a look at these calls.
They basically allow you to actually open a directory and scan through it for a bunch of file names, to find out what all the files are in that directory, or whether they're directories instead of files. You can do that. There's a set of calls in libc that are there for you. I don't know if any of you have actually used them yet, but I've used them many times in the past.

So what's a hard link? I wanted to tell you a little bit about hard links. This is a mapping from name to file number in the directory structure. So a hard link is really just a directory entry, okay? But I'll show you why I call it a hard link in a second. So for instance, in this directory, /usr has a lib4.3 name in there, and it matches up with an i-number, which is the inode for this next directory. And so a hard link is really a name-to-i-number mapping that's inside of a directory, okay? And the first hard link is made when you do create, and you can actually create extra hard links, thereby giving this file multiple names in the name structure, with the link system call or the ln user-level command, okay? And you can only do that typically if you're a superuser. Unlink will remove a link, and if as a result you've got a file that's sort of floating in space and disconnected, that will effectively delete the file, because all of its resources will be freed up at that point, okay?

And so that leads into an interesting question of when the file contents can be deleted. So the answer is: this lib4.3/foo is a file with some stuff in it. It can be deleted if there are no links left to it and nobody has it open. So once you open a file, you also have a link to the file, but that one's in memory. And so if you have a process that has opened a file, and then you go delete it out of the file system, that file is still gonna stick around long enough for that process to read or write it. And it's only when it's closed, at that point, if it doesn't have a hard link in the directory structure, that it goes away. So that's a little bit of weird behavior.

I'll tell you, last term when I was teaching 162, our first midterm, like the one you guys had, used some code that we wrote to produce Google forms. And we had a de-scrambling thing, so that when a student would ask a question about whatever their 1.4 was, it would tell us what their 1.4 really was for us. And then any time we put out a correction, it went back to everybody who had that scrambled thing, okay? That was great and it worked pretty well, except the server we were running it on: its log filled up. And so the server crashed, and then we couldn't reboot it, because even after we had tried to delete all of the data in the log, it was still being held onto by processes that still had it open. And it took too long for us to repair it, and the midterm ended without corrections for the last, like, third of the midterm. So that was a bad scenario.
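(To make those deletion semantics concrete, here's a small sketch using the real link and unlink calls, with a made-up file name: a second hard link gets created, then both names are removed while the process still has the file open, and the data only really goes away once the last descriptor is closed.)

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Create a file and give it a second name (a second hard link). */
    int fd = open("foo", O_CREAT | O_RDWR, 0644);   /* first link: "foo"  */
    write(fd, "important stuff\n", 16);
    link("foo", "foo_alias");                        /* second link        */

    /* Remove both names.  The inode and data stick around anyway,        */
    /* because this process still has the file open through fd.           */
    unlink("foo");
    unlink("foo_alias");

    char buf[16];
    pread(fd, buf, sizeof(buf), 0);                  /* still readable!    */
    printf("%.16s", buf);

    close(fd);   /* link count 0 and no more opens: now it's really gone   */
    return 0;
}
```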
All right, so in contrast to hard links are soft links. This is a soft link, or a symbolic link; on some operating systems they call them shortcuts. And this is gonna just map one name to another name, without it being an actual name-to-i-number mapping. So instead it's really a name to file name mapping. Okay, so in a regular directory, a hard link is a file name and a file number, and that's supported directly in the file system. A symbolic link says, well, if you get to this point in the directory and you look something up by this file name, what you get back is just another file name. And so rather than a direct pointer to one directory down or whatever, with a soft link you can basically point it pretty much anywhere, because you're saying, well, replace this name with this complete name, which could be an absolute name, so you can end up in another file system or what have you. And ln -s is typically how you make symbolic links, okay? And the OS looks up the destination file name each time a program accesses it through the link. And so this lookup could fail. Whereas with a hard link, there's always a file behind the name, and that will never fail, because if there's some file that's not pointed at by anything, it'll go away, and so you'll never get a "file doesn't exist" problem there. But with a symbolic link, it's possible that you find a file name that maps to something that doesn't exist. So these symbolic links are much more convenient for producing trees of file names that are sort of part of build packages and stuff. All right, so that's the difference between a hard and a soft link. And there are a number of kernel facilities that will go ahead and use symbolic links as if they're real links. So for instance, if you do open and you give a long string, part of resolving that file name may actually go through different symlinks, and that'll work fine because that's been set up to interpret the symbolic links properly.

All right, now let's look at one last thing about directories; I just wanted to show you this one more time. What if we're opening /home/cs162/stuff.txt? So the first thing is we're gonna have to find the i-number for the root inode. That's configured in the kernel; we'll say it's two, for instance. We're gonna read that inode two from its position in the inode array. So remember, in each block group there's gonna be an array of inodes. We pull that out, we examine the inode, we find the first block, and we start working our way through. So we take that inode, we take the block, we scan through it until we find home mapped to another i-number, say 8086. Okay, and then we look up 8086 for /home. Okay, and that's gonna be another inode structure which is gonna give us another block which we can look up. And then yet a third one. And then finally, when we get to /home/cs162, we look up stuff.txt. And last but not least, stuff.txt is pointing at the inode which is actually the file. And so what I've got in green here is reading the file. So every directory is a file, just like files are files. And what you're doing is traversing your way through the directory structure until you actually get to the file of interest. Okay, all right. And this little thing I have in the lower right here actually represents the block cache, which I'll talk about in a little bit.
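(Here's a rough sketch of that lookup loop in C. The in-memory "file system" and the i-numbers other than 2 and 8086 are completely made up; the point is just the shape of it, where each path component turns a name into an i-number, which names another inode, until you reach the file itself.)

```c
#include <stdio.h>
#include <string.h>

/* A toy in-memory "file system": each inode is just a short list of
 * directory entries (name -> i-number).  Inode 2 is the root, as in the
 * lecture; the other numbers here are invented for illustration. */
struct dirent_toy { const char *name; long inumber; };
struct inode_toy  { struct dirent_toy entries[4]; };

static struct inode_toy inodes[10000] = {
    [2]    = { .entries = { {"home", 8086} } },       /* /           */
    [8086] = { .entries = { {"cs162", 9001} } },      /* /home       */
    [9001] = { .entries = { {"stuff.txt", 4242} } },  /* /home/cs162 */
};

/* Scan a directory inode's entries for `name`; return its i-number or -1. */
static long dir_lookup(long dir_inumber, const char *name) {
    struct inode_toy *dir = &inodes[dir_inumber];
    for (int i = 0; i < 4 && dir->entries[i].name; i++)
        if (strcmp(dir->entries[i].name, name) == 0)
            return dir->entries[i].inumber;
    return -1;
}

/* Resolve an absolute path one component at a time, starting at inode 2. */
long namei(const char *path) {
    char buf[256];
    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    long inumber = 2;                                  /* root i-number     */
    for (char *p = strtok(buf, "/"); p; p = strtok(NULL, "/")) {
        inumber = dir_lookup(inumber, p);              /* name -> i-number  */
        if (inumber < 0)
            return -1;                                 /* missing component */
    }
    return inumber;
}

int main(void) {
    printf("/home/cs162/stuff.txt -> inode %ld\n", namei("/home/cs162/stuff.txt"));
    return 0;
}
```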
And the last thing I wanted to mention is: remember, everything in an operating system is a cache. And because everything in the operating system is a cache, we can cache all of this stuff; that's what you're seeing here. But we can also cache the translations in a name cache. And so, assuming nothing changes in this directory structure, we have a name cache which says that, well, /home/cs162/stuff.txt maps straight to that inode. And the intermediate pieces are also stored in a hash table, so we can very quickly look that up in memory, assuming that nothing in the underlying file system has changed. And so that cache of names, called the name cache, makes looking things up subsequently, like for instance if we wanted /home/cs162/otherstuff.txt, much faster, because we wouldn't have to traverse our way through those directories.

Okay, what happens when the array runs out of space? You mean the name cache, I'm assuming. So the great thing about the name cache is, since it's just a cache, it doesn't matter if you throw things out; you can always get them back later. If, on the other hand, you're talking about the inode arrays, and there are several of them, yeah, okay, so there are several inode arrays throughout all the different block groups: if you run out of inodes, you can't make any more files. And I've had some file systems in the past that I've made where there were so many little files that we actually blew out the set of inodes. And at that point you basically can't make anything new. So it doesn't matter how full the actual data portion of your file system is; that's it, game over.

All right, so one thing that's a little bit unfortunate, which I haven't really shown you here in great detail, is that if you have a really long directory with a zillion files in it, standard UNIX forces you to go linearly through the directory to find the file you want. It's not indexed in any way; it's just linear, and things happen to be there in the order in which they were put into that directory. So that's pretty inefficient. Systems like FreeBSD, NetBSD, and OpenBSD actually have the ability to swap in a B-tree-like format for the data inside a directory to give you much faster lookup possibilities, okay? But that's optional, and Linux doesn't do that, and a lot of early UNIX systems didn't do that either.

So let's now talk about Windows NT, or NTFS, because this is a little different. This was the "new technology" file system. It's the default on modern Windows systems. This was kind of what came in after the FAT file system was developed and then rejected as too flaky and really not reliable enough for a big, heavily used file system. So NTFS came along, and what's interesting, and very different from the BSD file system we just talked about, is that we have variable-length extents rather than fixed blocks. So in the file system we were just talking about, all the blocks were the same size, call them 4K. Inodes were smaller; typically an inode is like 128 bytes, so there's several of them in a block. In NTFS, there's the possibility of most of the disk space being laid out in variable extents, where you sort of have the first block and then the length, and you'd follow it along a track to get all of the data represented there. And so you could really represent a chunk of data that was many tracks' worth of data that way. And what's the internal portion of that file system? Well, instead of the FAT table like we looked at, or the inode array, you actually have the master file table, which is like a database, okay? And like a database, it has entries of a maximum one-kilobyte size, and each entry is essentially a sequence of attribute-value pairs, but one of the things you can have is data. And so you can have an attribute-value pair which represents data, and the value is the actual data. And so this is an interesting twist, because it allows you to have both the metadata describing who can read and write the file and the data itself all in one chunk, in a one-kilobyte entry, unlike what we've been talking about with the fast file system, okay?
And so every entry in the MFT has metadata and either the file's data directly, or a list of extents, or, for really big files, pointers to other MFT entries with more extents, okay? And rather than worrying about whether you got all of that, I'm gonna show you here in some pictures. So here's an example of the master file table. It's like a database, like I said. And so there's a series of these records, and these records have within them both metadata and potentially pointers to longer extents, okay? And block pointers basically cover runs of blocks now, instead of individual blocks. Okay, and this is actually similar to what Linux did in ext4 as well. And the other interesting thing about NTFS is that when you create a file, you can give it a hint as to how big the file is gonna get, so that it can pre-allocate a big chunk of contiguous space for you. So this has the ability to be higher performing under some circumstances. And it also has journaling for reliability; we'll discuss that a little later.

So here's an example master file table. Here's one entry. There's standard info, which is one of the attributes you can have, and that's basically the stuff that we put in the inodes for the BSD file system: things like create time, modify time, access time, who's allowed to access the file, et cetera. There's the name of the file, which is included in this record. So that file name is kind of like what's in the directory, right? So this is the actual file name. And then there's a bunch of data, which could be resident, and all of this could be together in a single one-kilobyte master record. And so as a result, it's very much more efficient for small files than the BSD file system is, because everything's all together; you don't have to have both the inode and the data.

Now, if we get bigger files, then what we can do is, rather than just having the data in this part of the MFT record, we can start having pointers and links to bigger extents that are spread throughout the disk. Okay. Now, hopefully those of you that remember when we discussed fragmentation back in the memory days, when we were talking about virtual memory, can see this coming: since we're not using blocks that are all the same size, but rather extents that are of variable size, now all of a sudden we have this problem with potential fragmentation when we start allocating and freeing, allocating and freeing. And so that can be a problem, okay? And so on an NTFS file system, you can actually start getting some fragmentation over time. Here's an example of a very large file where we have some of the master file records pointing at other master file records, and then each of those individually pointing at extents. And so you can make really, really, really big files if you want, again at the potential expense of fragmentation problems. Here's an example of a huge fragmented file, okay? With lots of extents spread all over the place. One of the problems, once the space becomes fragmented, is you can no longer have big extents; you have to have lots of small ones. And so when you finally get an extremely fragmented file system, things don't perform well at all and you have to go through and defragment.
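(Here's a rough sketch of what an extent-based record might look like in C. The field names and sizes are hypothetical, not NTFS's actual on-disk format, but they show the two ideas: a pointer that covers a run of blocks, and small data living "resident" inside the record itself.)

```c
#include <stdint.h>

/* A run of consecutive blocks on disk: "start here, go for this many". */
struct extent {
    uint64_t start_block;   /* first block of the run                     */
    uint32_t length;        /* how many consecutive blocks it covers      */
};

/* A simplified, hypothetical MFT-style record (about 1 KB in the real
 * thing).  Small files keep their bytes right in the record ("resident"
 * data); bigger files switch to a list of extents; huge files would
 * chain to further records holding more extents. */
struct file_record {
    uint64_t create_time, modify_time, access_time;  /* standard info      */
    uint32_t owner;
    char     name[64];                               /* human-readable name */
    int      is_resident;                            /* data inline or not? */
    union {
        uint8_t       resident_data[512];            /* tiny file: data here */
        struct extent extents[32];                   /* bigger file: runs    */
    } u;
};
```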
So, other little things about NTFS: the directories are B-trees by default. The file number for a file is really its entry in the master file table. And the master file table always has file names as part of it: the human-readable name, the file number of the parent directory, et cetera. What's kind of interesting as well is that if you have multiple names, hard links, for the same file, they can all be in the master file record. So looking at one record, you can know all of the individual names and what directories they're in, as well as the data.

Okay, now let's talk about memory mapping for a moment. So memory-mapped files are a different way to do I/O. We've been talking about that interface where you open a file and then you read and write it and then you close it. This involves multiple copies into caches and memory, and system calls, et cetera. What if, instead of open, read, write, close, we just map the file into memory, just like we map other stuff into memory when we were talking about virtual memory, and then you just read and write memory locations and you implicitly page the file in and page it out and so on? And so file manipulation suddenly looks like memory reads and writes, okay? And this is a well-defined interface. Executable files are treated this way when you exec a process: what happens is you actually just point the virtual memory at where that executable is on disk, and then it'll start faulting parts of the code in as they're needed, okay?

So here's how that works. If you remember (by the way, this is a slide you've seen, but let me remind you of how virtual memory works, since I know we've passed midterm two and so everything you learned in the first two-thirds of the class is now fuzzy), we basically have an instruction access that tries to get looked up in the MMU. If we're lucky, we find an entry in the page table and that goes ahead and lets us do the access. On the other hand, if there isn't a page table entry and we get a page fault, then what happens is we get an exception, we go into the kernel, the page fault handler takes over, it starts scheduling a read from disk, which can take time. And then eventually, when that's finished, it updates the page table, we go back on the ready list, we get rescheduled, and we try it again, and this time we succeed. So that's virtual memory. So why not do the same thing with a regular file?

So here's the idea. You use a call from the library called mmap, which goes to a system call. And what it does is it says, well, here's a file: map it to a region of the virtual address space, which really means we create a set of virtual address space mappings that point at the file. And now we go ahead and read, and if we try to read from the file region here in blue and nothing's mapped, we'll get a page fault just like we did for virtual memory, and get an exception. Now what happens is we read a part of the file into DRAM and then we get to retry, and at that point we go forward and read the contents from memory; the mappings are set up, and we just read from memory and we get stuff out of the file. So what's neat about mmap is it actually lets us take a file, put it in memory, and now all of a sudden we're accessing it as if it were just data in memory, not on the disk. Now, if you were to look up mmap (you could do man on it, whatever), here's a variant of it, where what you do with the mmap call is you give it an address in your virtual address space where you wanna put the file, you give it a length of how much of that file you want, and then some other flags, and it basically lets you map a file into a specific region.
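(For reference, this is the standard POSIX declaration; the lecture's example passes in a file descriptor from a prior open along with the protection and mapping flags.)

```c
/* Declared in <sys/mman.h>:
 *
 *   addr   - where in your virtual address space to place the mapping,
 *            or NULL/0 to let the kernel choose
 *   length - how many bytes of the object to map
 *   prot   - PROT_READ, PROT_WRITE, PROT_EXEC, or PROT_NONE
 *   flags  - MAP_SHARED or MAP_PRIVATE, plus options like MAP_ANONYMOUS
 *   fd     - an already-open file descriptor, for file-backed mappings
 *   offset - where in the file the mapping should start
 *
 * Returns the address of the new mapping, or MAP_FAILED on error.        */
void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);
```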
Now, if you don't care where it is in the address space, then you just put a zero in there, and it'll return, as the void * result, the address where it decided to map it in your virtual address space. Okay, this is perhaps close to what you're supposed to do for project three. This is used both for manipulating files and for sharing between processes.

So I wanted to show you an example here. Here's some code that is very simplistic, but if you notice, I've got a static global here called "something" that's set equal to 162, and I've got some things on the stack. And so what I'm doing with printing to the console is saying where the data is; well, I'm just telling you where the static part of the data is, so that's the address of "something". The heap is at whatever I get back from malloc if I malloc a byte, and then the stack is at the address of the mfile variable, and that kind of tells me where the stack is. And then what I do is print that stuff out, and then I open my file, which is an argument, and assuming that everything is good, I get down here. And notice what I did for my mmap: I said find me an address, I don't care what it is; that's what'll come back in mfile. I say length 1000, I say allow reading and writing, and then these flags here are basically saying go ahead and map this as a file, and here's the file descriptor I've already opened, okay?

And so if I run this thing, notice what happens. It tells me where data, heap, stack, and mmap are, so mmap here is the address that comes back from our map call, and notice how the mmap region is low in memory. It's not high in memory like the heap or the stack, which is interesting, right? Oh, and by the way, it prints those things out and then it tells me what was in the file "test", and we can back this up, okay? But notice, once I've printed out what was in that file, notice how I did that. I just said printf, "mmap is at", and that gave me this guy, and then just by doing a puts of the string mfile to the screen, it printed out all the contents, okay? So this is line one, this is line two, this is line three. All of those contents got printed out on the screen just by printing the string that was at that variable, okay? So this line here, innocuous as it is, this puts line, is actually doing a read from the file and printing it on the screen, just by pretending that that string was already in memory, okay?

And then this is where it gets a little amusing. We go 20 characters in and we write over it with something, by string-copying this string over that spot, and then we close the file descriptor, which is gonna flush everything out, and we return. And when we cat "test" and see what's in there, notice: if we were to count 20 characters in, we would see that starting at byte 20 is the "let's write over it" portion. And so we've actually literally written over that part of the file, simply with a string copy, okay?

Now, we could also think of this as a way to share. So we could have a file in memory, and we can map that file into different address spaces. Doesn't matter, it could be at the same or different places in each virtual address space, and once we've done that, now we can share, in shared memory, with the file as kind of a backing store for what we've got, okay?
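(Here's a small sketch that reconstructs roughly the demo being described. The file name, the "let's write over it" string, and the exact flags are stand-ins rather than the instructor's actual code, but the shape is the same: print the segment addresses, mmap the file, puts the mapped string, strcpy over byte 20, close.)

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int something = 162;                       /* static data segment           */

int main(int argc, char *argv[]) {
    char *mfile;                           /* lives on the stack            */

    printf("data  is at %p\n", (void *)&something);
    printf("heap  is at %p\n", malloc(1)); /* one byte, just for its address */
    printf("stack is at %p\n", (void *)&mfile);

    int fd = open(argc > 1 ? argv[1] : "test", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Let the kernel pick the address; map 1000 bytes read/write, shared,
     * backed by the file we just opened. */
    mfile = mmap(NULL, 1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mfile == MAP_FAILED) { perror("mmap"); return 1; }

    printf("mmap  is at %p\n", (void *)mfile);

    puts(mfile);                                /* "reads" the file by printing memory */
    strcpy(mfile + 20, "let's write over it");  /* "writes" the file at byte 20        */

    close(fd);                             /* changes get flushed back to the file */
    return 0;
}
```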
Now, you can kind of see that the file is a little bit extraneous here in some sense, because maybe we don't care what's in the file; we're just using it as an excuse to set up this channel in shared memory. And it turns out there are ways of setting up something called anonymous memory. So we used MAP_FILE in the earlier example I gave; you can instead map anonymous memory, which means do this kind of setup but without the file, okay?

All right, so different processes have different address spaces, yes? So take this part of the file: it goes to a different part of virtual address space two than it does in virtual address space one, given the way I've shown this. This is still shareable. The reason it's shareable is not because the virtual addresses are the same, but because data I write here shows up there, as data that this guy can read. Yeah, and again, notice the irony of what you said there about this being "through the file": it is through the file, but it doesn't even have to go to disk for this to happen, okay? Now, if you wanted to do this for real, (a) you might wanna use anonymous memory if you really didn't care what the backing store was, but the other thing is you're gonna wanna find out what address you got allocated in one of these and try to use that in the mmap on the other, so that you can actually align the virtual address space portions as well, and then you can actually have shared lists and other things in that shared memory. All right, good.

So the kernel has to keep this interface going where the user portion thinks that everything's bytes while the disk is in blocks, and so we've got that mismatch to start with, and basically the kernel has to pull things off of the disk and put them into memory to do that matching. So if I'm gonna read four bytes at the beginning of a file, it's not gonna read four bytes off disk (that's not even possible); it's gonna read a whole block of 4K, put it in memory, and then give you the first few bytes. And the good thing is, if I keep reading, I don't have to go to disk again until I run out of that block. So it seems like maybe we ought to start talking about caching here, so that multiple processes, for instance, can share data that's come off of the disk. And again, because in operating systems, as I said, everything's a cache.

So the buffer cache really is this generic cache of blocks in memory that's separate from virtual memory. These are not the blocks that we choose to use to help map virtual addresses; rather, this is a set of blocks purely used as a cache, and it can hold things like data blocks and inodes and directory contents, et cetera, for future use. And it can also have dirty data. So if you write a block in a file, it can actually have that data sitting in there before it goes back to the disk, okay? And so the key idea here is we're gonna set aside some DRAM to help exploit locality by caching disk data in memory, and really help us, okay? Name translations, so mapping from paths to inodes; disk blocks, mapping from block address to disk content; et cetera. And as I mentioned, this is called the buffer cache, and it's really memory used to cache kernel resources, including disk blocks and name translations, and it can have dirty data, okay? So let's look a little bit at this. I just wanted to give you an idea. So here's our disk surface that we had earlier. The buffer cache really is a set of blocks in memory.
There are some state bits associated with it, and because it's a cache, some of these blocks might be free or some of them might have been invalidated. And really, if I were to think abstractly about what I've got in my cache, I could say: well, I've got some data blocks (I'm gonna think of them as being here), I've got some inodes, I've got some directory data blocks, and I've got the free bitmap, which is where the set of all free blocks is kept. And so this is a cache of the disk, but specifically for access through the file system, all right? And for instance, when we have file descriptions of open files that are associated with process control blocks and file descriptors, those are really pointing at inodes which are locked down in the cache, so that when I go to read from the file, I can immediately find which blocks I need to read from, okay? And so this file system isn't really direct access to the disk; it's supported on top of the buffer cache, and we pull things in and out of the buffer cache as we need them, okay? And that's really how we do that mapping between byte-level operations (and even operations on the inode) and the block-level interface of the disk.

So for instance, let's suppose I'm trying to do an open operation. I'm gonna assume I've got the inode for a directory, my current working directory, that's already open here; I'm pointing at this. And so this current working directory is an inode I've pulled in previously. And what I'm gonna do is try to look up some other file name relative to that, so that I can do an open. And so what I'm gonna do, wash, rinse, and repeat, is try to load blocks of the directory, search through there to find the next directory pointer, then load blocks from the next one, and so on. So this is sort of a recursive process, and in this buffer cache we have to, for instance, mark a block as transiently in use when we start using it; then we can pull in data from the directory, and now that's cached here. And then I can search through it to find a name-to-i-number mapping, okay? And now I can look up that i-number to start reading the data. So what I'm gonna do is put aside a marker here saying this part of the cache is in use, I'm gonna read the data in, and now I've got an inode cached and I can map that to a file description. And so now I have my open file, and its inode is locked in memory, okay?

And then, from that point on, I can do reads. For instance, well, I've got the inode, so I've got data block pointers; I can pull the data blocks into the cache and then use them. And I'm gonna traverse the inode (this thing in green is an inode like we talked about on previous slides), and it's gonna let me know which blocks I need to get next; I pull them into memory and now I can access them, or maybe they're already in memory, okay? And typically this buffer cache has got a hash table that lets me find blocks in it very quickly, and that's how I can figure out whether I've already got the blocks in the buffer cache or whether I have to pull them off of the disk. And of course, for writing, what I've got here is I might actually have a dirty block, which basically says this data has been updated relative to the disk and I can't get rid of it until I've written it back to disk, and so this buffer cache also has to keep track of what's dirty and what's not, okay?
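(Here's a rough sketch of the kind of hash-table lookup just described. All the names are hypothetical, and a real buffer cache also handles locking, reference counts, and the "transiently in use" states, but the get-or-read-and-insert shape is the core of it.)

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define NBUCKETS   1024

/* One cached disk block, chained in a hash bucket keyed by block number. */
struct buf {
    uint64_t    blockno;            /* which disk block this holds          */
    int         dirty;              /* modified since it was read in?       */
    struct buf *next;               /* hash-chain link                      */
    uint8_t     data[BLOCK_SIZE];
};

static struct buf *buckets[NBUCKETS];

/* Stand-in for the real driver call that reads a block off the disk. */
static void disk_read(uint64_t blockno, uint8_t *data) {
    (void)blockno;
    memset(data, 0, BLOCK_SIZE);    /* pretend we fetched it                */
}

/* Return the cached copy of `blockno`, reading it from disk on a miss. */
struct buf *bcache_get(uint64_t blockno) {
    struct buf **chain = &buckets[blockno % NBUCKETS];
    for (struct buf *b = *chain; b; b = b->next)
        if (b->blockno == blockno)
            return b;               /* hit: already in the cache            */

    struct buf *b = calloc(1, sizeof(*b));   /* miss: bring it in           */
    b->blockno = blockno;
    disk_read(blockno, b->data);
    b->next = *chain;               /* insert at the head of the chain      */
    *chain = b;
    return b;
}

/* A write just modifies the cached copy and marks it dirty; the block
 * only goes back to the disk later, when it's flushed or evicted.          */
void bcache_write(uint64_t blockno, const void *src, size_t len) {
    struct buf *b = bcache_get(blockno);
    memcpy(b->data, src, len < BLOCK_SIZE ? len : BLOCK_SIZE);
    b->dirty = 1;
}

int main(void) {
    bcache_write(42, "hello", 5);          /* dirties block 42 in the cache */
    return bcache_get(42)->dirty ? 0 : 1;
}
```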
So this is implemented entirely in the operating system, in software. It's not like the memory caches or the TLB; it's a little different because it is in software, and we always have to enter the kernel to do file operations, as opposed to when we were talking about virtual memory, where we had to do reads and writes from the hardware, and so we needed a hardware interface to keep that fast. Blocks go through transitional states between free and in use, being read from disk, being written to disk, et cetera, so that as multiple processes are all reading and writing the same data, they have to be careful to make sure that they don't stomp on each other or take away a block that's there waiting for data to come back from disk. There are many different purposes for this, as I've already mentioned, and when a process exits, things may stay in that cache indefinitely unless they've actually been flushed out, okay?

So what do we do when we fill it up? Well, at that point we need to start freeing up blocks, and of course we all know that if they're clean we can just throw them out, and if they're dirty we have to write them back first, okay? So what's our replacement policy? Well, we could do LRU, and in fact most folks do. You can afford the overhead of full LRU, because we can link blocks together and know what the oldest one is and the most recent one and so on, and because we always have to enter the kernel anyway, the number of instructions to do full LRU is small relative to the overhead of having gotten into the kernel already. And this works very well for all sorts of things, name translation and so on. It fails if you ever have an application that scans through all of the files on disk. You should try this sometime just for the heck of it (don't do it on your friend's computer while they're trying to use it), but if you say find . -exec grep foo {} \; (the dot is the current directory, and the -exec part runs grep foo on each file), this will go through all of the files in the subdirectories and grep them, okay? And there you're gonna blow out the cache if you have LRU, and so some operating systems give you the ability to say, for these following file accesses, do it just once: don't even bother putting them in the cache, because I wanna keep the things that are in the cache there.

So how much memory should we put in this cache? Remember, this is separate from the backing store that we use for virtual memory. Too much memory in the buffer cache and you won't be able to run many applications, because you don't have enough memory left for virtual memory; too little and file-system-heavy applications run very slowly, because there's not enough caching at the disk level. And so the real answer is you adjust this boundary dynamically between the buffer cache and virtual memory, and that's pretty much the way modern operating systems work. There was a time, when I first started building kernels, where you actually had to set a constant at build time to figure out how much memory to put in the buffer cache versus leave for virtual memory, and fortunately that's dynamically figured out now.

So once we've got a cache like this, now we can start thinking, well, maybe I should try to avoid cold misses, right? And how do I do that? I can do this with prefetching, okay?
And so the key idea here, of course, is to exploit the fact that most common file access patterns are sequential and to prefetch subsequent disk blocks. And so most variants of Unix, and Windows as well, basically say that if I read a disk block, I'll read the next couple into the buffer cache, and as a result I get far fewer times where I'm held up by having to wait for the disk, okay? And the other good thing is, even when you've got a bunch of prefetching from a bunch of different processes, if you have a well-operating elevator algorithm then all of those accesses can be rearranged and the head movement can be managed so it scans its way through the disk, all right? And so prefetching isn't as bad as it seems, because what happens is all of those prefetches from different processes get re-ordered automatically. How much to prefetch? Well, if you do too much then you're gonna start kicking things out of the cache unnecessarily, and with too little you have a lot of seeks, and so usually it's a couple of blocks for the automatic prefetching.

So, delayed writes. The buffer cache is a write-back cache; writes are termed delayed writes here, and what does that mean? It means I do a write to a file, it sits in the buffer cache, and it's not necessarily pushed back immediately to disk, okay? So a write copies data from user space to the kernel buffer cache and returns to the user quickly; seems good, okay? A read is fulfilled by the cache, so reads see the results of writes. So it doesn't matter that I haven't put the data on the disk yet: the reads have to go through the buffer cache as well, and so any data I've just written I read back as if it had been put on disk, and from the standpoint of the interface I can't tell the difference, okay? That transparent access through the cache is a good thing, but you start wondering, I'm sure, as you're sitting there: when does the data from a write actually reach the disk? And the answer is, well, clearly if the buffer cache is full and we need to evict something, that will get it there, but we also wanna be flushing periodically, because the more dirty data we leave in the cache, the more chance there is that we'll lose some data. So in fact, even if the cache is being used very well, and we have lots of processes all writing the same disk blocks and those blocks are staying in the cache, that may be fine from a performance standpoint, but if the system crashes we just lost a whole bunch of data. And so there's a periodic flushing going on in any system that has delayed writes, so we don't have to wait to run out of space in the buffer cache; in fact, we typically periodically flush it out, okay? And that's actually about a 30-second time frame as a default in a lot of Unix-style operating systems. I'll mention that in a moment. So the advantages, of course, are that you return to the user very quickly without writing the disk. The disk scheduler has enough interesting writes, for instance, that it can reorder them and do a really good job, when it decides to finally put them on disk, of not moving the head too much. So that's good. We might be able to allocate multiple blocks at the same time, so this is also good. So if you think about what I told you earlier: you open or you create a brand-new file and you start writing; the file system doesn't know how big your file is gonna be.
Well, with the buffer cache you can actually let those writes accumulate in the cache, and you can defer even finding physical blocks for that data until it's time to flush it out. And at that point you can make sure you have a long enough run in some block group somewhere to handle all of the data you've just written, rather than dribbling it in one block at a time and then trying to find a big run. And there's an amusing side effect of this: you've been doing plenty of builds over the years, over the last term I mean, where you've done make, and what happens there is all of these files get created and deleted and created and deleted. An amusing side effect of the buffer cache and delayed writes is that some of these very temporary files may never even need to go to disk, because they're created and deleted before anything gets flushed out. So these are the advantages of delayed writes.

Now compare the replacement policies. In demand paging it's really not feasible to do LRU, as we discussed, because you'd have to adjust the bookkeeping on every memory read or write, in hardware, and so we use an approximation like not-recently-used or the clock algorithm. In the buffer cache, LRU is okay because we only enter the buffer cache when we're actually trying to do disk reads or writes. So that's a little different management of those two pots of memory. The eviction policy, of course, is that when we're doing demand paging we evict a not-recently-used page when memory is close to full. In the buffer cache, we wanna be writing these dirty blocks back fast enough that we don't lose any data, but not so fast that we don't get some of those advantages I was just telling you about. And so that's always a little bit of a trade-off, but you can imagine that if you're paranoid about your data, which a lot of people are, maybe this is just not enough. So this idea that we're gonna flush every 30 seconds means that when you crash, you might have lost the last 30 seconds of your information. Flushing every 30 seconds is therefore certainly not a foolproof way of keeping everything around. And even worse, if the dirty block was for a directory, you could lose all of the files you've just created in that directory in the last 30 seconds, just because you didn't flush the directory out. So that seems pretty bad. So metadata, like directory data, is even more sensitive to being lost than the data itself. Okay, so the file system can get into inconsistent states and all sorts of bad things happen. So the takeaway from this discussion is really that file systems need recovery mechanisms and ways to protect the information; even as we're trying to use a cache to give us good performance and a lot of other benefits, we need some way of preventing loss of data. And with this idea of flushing every 30 seconds, you can say I'll flush every 15 or I'll flush every 10; it's still not quite enough under a lot of circumstances. Okay, and it depends on how sensitive you are to data loss, but maybe we need something additional to what we've got so far.

So that leads me to talk a little bit about ilities. There are three ilities that I like. One is availability, which is the one you've probably heard a lot about, and this is the probability that the system can accept and process requests. It's often measured in nines of probability; so, for instance, being up 99.9% of the time means three nines of availability. Okay, the key idea here is independence of failures, and that the system can accept and process requests.
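Just to put numbers on those nines, here's a quick back-of-the-envelope sketch: three nines works out to roughly nine hours of allowed downtime a year, and five nines to about five minutes.

```c
#include <stdio.h>
#include <math.h>

/* Back-of-the-envelope: how much downtime per year does a given
 * number of "nines" of availability actually allow? */
int main(void)
{
    const double minutes_per_year = 365.25 * 24 * 60;     /* ~525,960 minutes */

    for (int nines = 2; nines <= 5; nines++) {
        double availability = 1.0 - pow(10.0, -nines);     /* e.g. 3 nines = 0.999 */
        double downtime_min = (1.0 - availability) * minutes_per_year;
        printf("%d nines (%.3f%% up): about %.1f minutes of downtime per year\n",
               nines, 100.0 * availability, downtime_min);
    }
    return 0;
}
```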
However, one thing you probably didn't know is that availability doesn't mean the thing works properly; it just means it responds. Okay, and so availability is something that you often hear quoted: oh, this is great, I've got five nines of availability. Okay, but is it actually still working? And so that leads to two other things that I think are very important to point out in this space. So durability is different from availability. Durability says that certain data will never be lost. Okay, the durability of data is the ability of the system to recover it under a wide variety of circumstances, and it doesn't necessarily imply availability. So it's different. Okay, and I like to think of the pyramids in Egypt: for centuries, there were all these interesting hieroglyphs on them and nobody knew how to read that data. But boy, that data was secure, because it was carved in stone, literally, and didn't go away. So it was highly durable, but it wasn't available to anybody. And then of course the Rosetta stone was found, which let people read it, and so now the data was available again. So if you're ever trying to explain the difference between durability and availability, I think the pyramids are a good example there. The third thing is really what people mean, I think, when they wanna brag about availability: they really wanna brag about reliability, which is the ability of a system or component to perform its required functions properly. Okay, and there's actually an IEEE definition along those lines. It's certainly stronger than availability. It means the system's not just up, but it's working correctly. And so it also includes things like availability, security, fault tolerance, durability, et cetera. And so really the interesting questions for me in file systems are, first, durability, because once data is lost it's never recoverable, so that's really important, and then also reliability: is the system reliable or not? And that 30-second flush we were talking about is a mechanism for performance. You could almost say it's a mechanism that maybe improves availability, because there's less work going on there, but I'm not really sure I would say it that way. It's really not a good mechanism for durability, and it certainly isn't a good mechanism for reliability. It's a very simple heuristic. And so what I'd like to do, and we're not gonna get through all these slides today, but what we're gonna do this time and next Monday is talk about how to get durability and reliability out of a file system.

Let's talk about durability for now. So how do you make a file system more durable? Well, one thing is you gotta make sure that the bits, once they're on the disk, don't get lost. And so disk blocks, disk sectors essentially, have Reed-Solomon coding on them, which means that if I'm gonna write a four-kilobyte block of data, I'm actually writing more data than that on the disk, and that extra data, or redundancy, together with the original 4K of data, makes it possible for me to recover from bit errors that happen in the middle of that block. Okay, and it basically allows us to recover data from small defects in the media. And if you think about when we talked about shingled recording, and how close the tracks are, and one terabit per square inch, I don't know if you remember all these numbers we talked about a couple of weeks ago, it's really easy for noise and local heat and whatever to cause read errors.
And so you absolutely have to have really good error correction codes on the disk in order to recover your data. That's operating all the time, okay? And that's just part of the disk design. The second thing is you wanna make sure that writes survive in the short term. When we write stuff to the buffer cache and leave it there, that doesn't necessarily make it durable in the short term, because a crash will immediately remove it. So if we start getting really paranoid, we could either abandon delayed writes entirely, which is gonna have a huge performance hit, or we could do something like battery-backed-up RAM, which is called non-volatile RAM, or flash, et cetera, that's actually associated with the file system, where we put things until they get pushed out to disk, okay? And a lot of hybrid disks these days actually have flash on the disk, and as a result you can do a really quick write to the flash memory on the disk and it will worry about eventually getting it onto the spinning storage. So that's another approach to making sure things are durable in the short term, okay?

So once we've got Reed-Solomon codes, and then we make sure the data that's written but not yet on disk is stable, that's a good start. But now we have to start worrying about, for instance, what if the disk fails, okay? So how do we make sure things survive in the long term? Well, we need to replicate, to keep more than one copy, and the important thing is to do this in advance of failure. So we could put copies on one disk; then we have different copies on the same disk, but the problem is if the disk fails, that didn't help us much, right? We could put copies on different disks, but if the server fails, maybe the disks fail too. We could put copies on different servers, okay? But if the building's struck by lightning, that doesn't help us. We could put copies on servers on different continents, or we could have a copy in our archival store on the moon that we beam up there with a laser or something. There are many ways to deal with this, but what we wanna do, when we're really trying to be paranoid, is put our copies out in places that have independent failure modes. So the problem with different servers in the same building is that if the building got struck by lightning and it fried all the servers in that building, those aren't independent failure modes, okay?

So, to back us down from worrying about lightning for a second, I'm sure you've heard about RAID; I just wanted to remind you. One type of redundancy that's very easy to use these days is what's called RAID 1, that's from the original Patterson naming scheme, which is disk mirroring. And so the idea is that every disk that we have in the system actually has a partner that we put the same data on, okay? That partner is often called the shadow disk. And so every time you write, you actually write two copies of the file system blocks and they go to both disks. And what's great is, from an I/O standpoint, I can read back at twice the bandwidth I had before, because I've got two copies, okay? This is the most expensive way to get redundancy at the disk level, because we need 100% extra data storage. And the bandwidth is sacrificed a little bit on the writes, because we have to keep our two copies of the file system synchronized and so on.
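To make the mirroring idea concrete, here's a minimal sketch of RAID 1 at the block layer, assuming hypothetical per-disk read and write primitives; it's just an illustration of the policy, not a driver.

```c
#include <stdint.h>

/* Sketch of RAID 1 (mirroring) at the block layer. disk_read and
 * disk_write are hypothetical single-disk primitives that return 0 on
 * success; disk 0 is the primary and disk 1 the shadow. */
#define BLOCK_SIZE 4096

extern int disk_write(int disk, uint64_t blockno, const uint8_t *data);
extern int disk_read(int disk, uint64_t blockno, uint8_t *data);

/* Every logical write goes to both disks, which is where the write
 * bandwidth (and the 100% space overhead) is paid. */
int mirror_write(uint64_t blockno, const uint8_t *data)
{
    int a = disk_write(0, blockno, data);
    int b = disk_write(1, blockno, data);
    return (a == 0 && b == 0) ? 0 : -1;
}

/* Reads can be spread across the two copies, roughly doubling read
 * bandwidth; alternating by block number is one simple policy, and we
 * fall back to the other copy if the preferred one fails. */
int mirror_read(uint64_t blockno, uint8_t *data)
{
    int preferred = (int)(blockno & 1);
    if (disk_read(preferred, blockno, data) == 0)
        return 0;
    return disk_read(1 - preferred, blockno, data);
}
```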
Reads can be optimized, and recovery is fairly simple, because if a disk fails you just replace it: say the pink one here failed, I just put a new pink one in and copy from the green one over to the pink one, and as soon as that copy is done, I'm back up and ready to go. And one of the ideas people use sometimes is what's called a hot spare, which is a disk that's just sitting there in a powered-down mode, ready to go as soon as a disk fails, okay? And if you buy anything bigger than a laptop these days, a desktop or a small server or whatever, RAID 1 is something that you can easily order from Dell or from whoever you buy your computers from; they just put in two disks and set it up and it just works. And I've gotten saved by that many times over the years, where I have all this configuration and a disk failed on a brand-new machine; I just called them up and they sent a new disk out, I plugged it in, and it was as if nothing bad ever happened. So that's kind of cool. The downside of this, of course (just give me a couple more moments here and we'll be done), is that it's got a hundred percent overhead.

So another option is RAID 5. And RAID 5, again, the five doesn't have anything to do with the number of disks; that's just Patterson's naming scheme. But in this instance here, we take a set of disks, and I'm showing you five disks here, which doesn't have anything to do with it being RAID 5. What you notice is that at any given time we take blocks from four of the disks, say block zero on disk one, block zero on disk two, block zero on disk three, block zero on disk four; they're all XORed together to produce a parity block that's on disk five, and we do that for every block. And so now this is much more efficient from a storage-overhead standpoint, because my overhead is only one out of five disks, as opposed to mirroring, where my overhead is one out of two disks. And all of these blocks together are called a stripe. They're all written, potentially, at the same time, and we get increased bandwidth on writes, and on reads, potentially, if we do this right. This green block is obtained by taking all of these data blocks and XORing them together to produce the parity. And you notice that I rotate the parity through the disks, and the reason is that the parity block is kind of a high-contention point. If I'm trying to overwrite a small amount of disk block two, I actually have to read the parity, read the disk block, write back disk block two, and write back the new parity. And so the parity gets used much more heavily than the other disk blocks, and so we just rotate the parity through, okay? And we can destroy all of the data on one complete disk and get it back. How does that work? Well, we just say, oh, I lost everything here; so now if I put a new disk in there, how do I get back all that data? It turns out I can just XOR D0, D1, D3 and P0 together and I get back D2. And the way I've described this is as if it's all in the same box, but we could actually spread this across the internet and have each one of these disks in a different cloud storage area, and that would make for very stable data, all right? So that's the idea; any questions, okay? Has everybody seen the RAID technology before? I think they talk about that in 61C maybe. Good. So what I wanna close with is I wanna tell you that RAID 5 just isn't enough, okay? And in general, all of these RAID schemes are erasure codes, which means they only handle failures where I know for a fact that a given disk is dead.
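Just to spell out the XOR arithmetic in that example, here's a minimal sketch for one stripe of four data blocks plus a parity block; the 4+1 layout is simply the one from the picture.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of RAID 5 parity arithmetic for one stripe: four data blocks
 * plus one parity block, matching the 4+1 example above. */
#define BLOCK_SIZE 4096
#define NDATA      4

/* Parity is just the byte-wise XOR of all the data blocks in the stripe. */
void compute_parity(const uint8_t data[NDATA][BLOCK_SIZE], uint8_t parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (int d = 0; d < NDATA; d++)
        for (int i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= data[d][i];
}

/* If exactly one data disk is erased, XORing the surviving data blocks
 * with the parity block gives the lost block back, e.g. D2 = D0^D1^D3^P0. */
void reconstruct(const uint8_t data[NDATA][BLOCK_SIZE], const uint8_t parity[BLOCK_SIZE],
                 int lost, uint8_t out[BLOCK_SIZE])
{
    memcpy(out, parity, BLOCK_SIZE);
    for (int d = 0; d < NDATA; d++) {
        if (d == lost)
            continue;
        for (int i = 0; i < BLOCK_SIZE; i++)
            out[i] ^= data[d][i];
    }
}
```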
And how do I know that a disk is dead? Well, either the motor doesn't spin up, or the error correction codes on all of the data on that disk are failing. And so we just know the data's gone and we can't recover it. That's called an erasure, and these codes are erasure codes. So what does that mean here? It means in this instance I erased this whole disk, and so when I reconstruct that data, I don't try to get anything off that disk; instead, I XOR the other four disks together to get it back, because this one is effectively erased. And so these are all erasure codes. And today RAID 5, which can handle the loss of one disk, is not enough, because new disks are so big that while I was doing that recovery process, by the time the recovery was done, I might have had another failure in the meantime. And so you actually need something like RAID 6 or higher with today's disks, which allows two disks to fail, okay? And I'll talk more about this next time, but notice that what we've got here is durability: how to make sure the bits, once we've got them, are stable. We haven't talked about reliability, which is gonna be more interesting and will get us into transactions as well.

So in conclusion, we talked a lot about file systems: how to transform blocks into files and directories, and how we optimize for file sizes and access and usage patterns. We try to maximize sequential access by finding big runs of empty blocks. And the OS protection and security information, all of that metadata, is typically in the inodes, and so it's associated with the file, not with the directory, okay? So the file is defined by the inode. We talked about naming, which is actually working our way through the directories to find the i-number of the file we're interested in, and each directory could either be linear or it could be a B-tree of some sort. We talked about 4.2 BSD's multi-level index scheme, which is currently used in Linux and several others, and we also talked about NTFS as an alternative. We talked about file layout and how to do free-space management in block groups. We talked about memory mapping with mmap. We talked about the buffer cache, which can contain dirty blocks that have to be written back properly. And we talked about multiple distinct updates; well, actually, let's leave it at that for now. So I think I will wish everybody a great holiday on Wednesday, and we'll see you all back on Monday.