Good morning, everybody. All right, it's Monday morning. Over the weekend there was an event here. How many people participated in the hackathon? Raise your hands. I know some of you who participated aren't raising your hands. Now leave your hands up if you were in the group that won the hackathon. So let's give these guys a little round of applause, as if they needed it given all the free stuff they're going to get. So if you're here next year: it was really a very cool event. It was pretty amazing what people were able to build in 24 hours, and how little sleep people were able to survive on.

All right, so today we're going to continue talking about files and file systems. On Friday I answered some really incredible questions, so we slowed down a little bit. Today we're going to continue looking at where the different parts of the file that you're used to interacting with come from, where some of those semantics come from. Then we'll get into hierarchical file systems. We'll unpack a little bit about naming, things that maybe you guys haven't thought about in terms of why hierarchical file systems are set up the way they are. And then finally we're going to talk a little bit about the design goals of file systems and what makes file systems different from each other. If you've used most computer systems, you're probably pretty used to what file systems are like, and the differences between them may seem small and cosmetic. It turns out that they're all providing you a fairly familiar, similar interface in terms of hierarchical naming, directories, and so on.
There's a certain set of features that people have gotten used to. But under the covers, when it comes to how those things actually map onto blocks on disk, that's where the differences are, and that's where the interesting design is going on. As we go forward, we're going to talk about several different file systems that have been designed very differently to solve different problems, in response to different characterizations of I/O and different file access patterns.

Okay, so we are seriously trying to catch up on grading. My wife is in Boston for the next three days, so I'm going to be living the old bachelor lifestyle, which will involve a lot of grading. When I was in graduate school, I guess I would watch a lot of television when my wife left town, but now that I'm a faculty member I have to catch up on grading. Hopefully this will be done in the next few days, and once it starts to happen we're going to return a lot of things to you, because there are a lot of things that are very, very close. I know I've said this like 50 times, but it will be true at some point before May. We're getting to where I'm starting to hit my own set of deadlines here.

Okay, any questions about the stuff we talked about on Friday? We talked a little bit about files. I had some great questions about the file interface. What did we talk about?
We talked about file basics, and then we talked a little bit about Unix file semantics as far as relationships between files and processes. So, any questions about this material? Questions about files? Going once, going twice. Okay.

So what does a file have to do, at least, to be useful? Our late arrivals in the back of the room, both late and in the back, so destined to answer review questions: why don't you tell us, what does a file have to do to be useful? Store data, okay. And then what's the other thing? Okay, so persist the data. It has to store data persistently. That's what processes expect. What else? What do they have to be able to do with that data? They're going to store some data, and then later they're going to want to find it. So a file needs to reliably store data, and it also has to be able to be located. There has to be some handle, and this has to persist as well. The handle can change if you move a file or rename a file, but there has to be some way for the process to request the data that it stored.

We also talked on Friday about file metadata, and we talked about some interesting design decisions when it came to storing file metadata. Other than the contents that are in a file, what other information about a file might we want to know? Okay, file size. Permissions, I heard. Anything else? We talked about attributes, and we talked about file systems that actually provided flexible attributes, but what's one set of attributes that a lot of standard file systems provide without any special support? What about time? Access times and modification times: when was the file created, last accessed, or modified. Permissions: who is allowed to do what to the file, or what processes are allowed to do what to the file. And then we talked about some other things.
Okay, so I presented open and close. This is the portion of the Unix file system interface that deals with establishing relationships between processes and files. Before a process can use a file, it has to call open. When it's finished with a file, it calls close. And these are interesting, because on some level, for example, does a process have to close a file? How many people have ever done file I/O with Python or any language? How many of you always remember to close all your files when you're finished with them? All right, so what normally happens if you forget to close the file? When does the file get closed? When the process exits, we would hope.

So close is interesting, because close is kind of a hint to the system that I'm done using the file. I mean, it is binding: I have to be done using the file, and I can't use that file handle anymore. But on some level, processes aren't required to call close. It's just nice if they do. What's one reason that I might have to call close? Okay, right. Actually a lot of operating systems, and even a lot of modern languages, won't flush the contents of the file until then. So close also sends a signal to the operating system that any data the operating system has cached, which we'll talk about later, should be written to the file, right?
So that's the point at which there's no point caching information for that process about that file anymore, and so I can flush the cache to disk. But what's another reason that I might need to close a file? You guys are in the middle of implementing assignment 2, so you should know: what might a process run out of, causing it to have to close files? File handles, right. Most operating systems don't provide processes with an infinite number of file handles, and if you're trying to provide processes with an infinite or near-infinite number of file handles on assignment 2, my suggestion is don't do that. That's a really bad idea.

When I took this class, my short-lived partner for assignment 2 was trying to allow processes to open 2^32 files, potentially. And a lot of the code that he sent me via email for assignment 2 was dealing with overflow: what would happen if the process tried to open 2^32 plus one files? Because that would really be a critical situation. We would wrap around to zero, and we had to handle this very carefully all over the place. Never mind that there was no way a process was going to get even 2^10 files open before your little operating system crashed. But this is the little design quagmire he had gotten stuck in.

Okay, but again, why do we want processes to provide these hints to the operating system about when they're using files? What can I do if I know when a process is about to use a file and when it's done using a file? We already talked about one thing, which Carl pointed out: when it's done using the file I can flush the cache and write those contents to disk persistently. Okay, what else? Why might it be good to know when files are open?
Carl, you answered the last question. You want to take a guess? Why might the operating system like to know what files are open? So that's possible. The idea of open is that I'm sending a signal to the operating system that I'm going to use that file, but why do this? Why not just have read and write take file names? That would work, right? Okay, so I can use the process's stated intention when it opens the file to make sure that it's doing what it said it would do with the file. If a process opens a file read-only and then tries to make a write, I can stop that from happening. But what's more fundamental here? Why do I care?

I can improve performance. I know which files on the system are likely to be used. I know that before a process reads or writes to a file, it has to open it. So it has to tell me, hey, I'm about to use this file. And so I can keep a smaller circle around the number of files on the system that are actually in use, and I might be able to do some things performance-wise in the cache, or in other parts of the system, to optimize those files for access, because those are the ones that are actually going to be used.

All right, performance. And then what's the other reason? What else might I be able to do by forcing processes to open and close files? Right, so I might have some semantics associated with open. For example, I might guarantee a process exclusive access to a file when it opens it, meaning that another open on the same file is going to fail. Because processes have to call open, I can grant that process exclusive access to the file for some period of time, right?
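Two of the points just made can be seen from user space in a small Python sketch: close as the moment buffered data is pushed out, and open as a declaration of intent that the system then enforces. The file name `letter.txt` and both helper functions are made up for illustration. Note also that the buffering shown here is Python's own user-space buffer rather than the kernel cache, so this is an analogy for the idea, not a demonstration of the OS mechanism.

```python
import os
import tempfile


def buffered_until_close(path):
    """Return the on-disk size observed before and after close()."""
    f = open(path, "w")                    # default buffering: small writes stay in memory
    f.write("dear mom")                    # 8 characters, held in Python's buffer
    size_before = os.path.getsize(path)    # nothing has been flushed yet
    f.close()                              # close flushes the buffer to the file
    size_after = os.path.getsize(path)
    return size_before, size_after


def write_refused_on_read_only(path):
    """Open read-only, then attempt a write; report whether it was refused."""
    fd = os.open(path, os.O_RDONLY)        # declare intent: reading only
    try:
        os.write(fd, b"oops")              # violates the declared intent
        return False
    except OSError:
        return True                        # the system rejected the write
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "letter.txt")   # hypothetical file name
    before, after = buffered_until_close(path)
    refused = write_refused_on_read_only(path)
    print(before, after, refused)
```

On a typical CPython build one would expect the size to be zero before the close and eight bytes after it, and the write through the read-only descriptor to be refused.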
Okay, and then I pointed out that some network file systems don't do this, for fairly good reasons. Any other questions about the file material that we covered on Friday? All right, going once, going twice.

So let's continue on. Remember, there were three sources of your usual idea about what a file is. One is what I think of as the basic job of a file, which is to hold data persistently and to be able to be found. The other (you picked a good spot today, buddy) part of a file's semantics comes from Unix, from the Unix file system interface. That's just how Unix has decided to offer processes access to files. There's nothing universally true about it. It could be different, and on Windows I'm guessing that the file system interface is different. I don't use Windows and I don't know what that interface looks like, but I'm guessing that it's somewhat different. And then the third place that we get a lot of what we're used to thinking about files is from hierarchical file systems, which we're going to get to today.

But let's keep talking about Unix semantics. One thing you guys are finding out in assignment 2 is that in Unix semantics for read and write, the offset, the location that reads and writes are performed at, is implicit. Read and write don't take a location parameter. They operate on a location that is saved for each process. And again, this is not a requirement. This is simply a convenience. It's the kernel saying, hey, by the way, I'll keep track of where you were reading and writing in the file, because I know that, for example, a lot of writes are sequential. A common way of writing to a file is just writing the file starting at byte zero and finishing at byte n, right?
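A minimal sketch of the implicit offset, using Python's thin wrappers over the Unix calls. `os.read` takes no position and continues from wherever the previous call stopped, while `os.pread` (available on Unix-like systems) takes an explicit position and leaves the saved offset alone. The file contents and the helper name are invented for the example.

```python
import os
import tempfile


def sequential_then_positional(path):
    """Contrast implicit-offset reads with an explicit-offset read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        first = os.read(fd, 5)       # bytes 0-4; the saved offset moves to 5
        second = os.read(fd, 5)      # no position argument: continues at byte 5
        again = os.pread(fd, 5, 0)   # explicit offset 0; saved offset is untouched
        third = os.read(fd, 5)       # continues at byte 10, after `second`
        return first, second, again, third
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.bin")          # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"aaaaabbbbbccccc")
    result = sequential_then_positional(path)
    print(result)
```

The sequential reads walk through the file without the caller ever naming a position, which is the convenience described above.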
Reading a file in that way is also fairly common. But again, this is not something fundamental about files. This is just something that Unix does to make things convenient. So let's say I wanted to implement this, for whatever reason, on something where the operating system didn't store the file offset. Where else could I store the file offset that would be fairly transparent to processes? What else could remember the file offset from call to call? Ben?

Well, okay, this is a great claim. So what's wrong with storing this information in the vnode? Why might I not want to have the offset in the vnode? It would be global, and I don't really want a global offset. I want a per-process offset. If multiple processes have the same file open, then unless the opens are shared explicitly because of fork, I don't necessarily want to share the offsets. If I open a file and I didn't duplicate the file handle somehow, then I don't necessarily want to see the effect of other reads or writes on that file object.

So rather than going down, what about going up? What if I went up the stack? What's that? Well, okay, again, you guys are talking about kernel data structures. I just said the kernel is not going to track where the last read or write happened. Where can I store that information? I could store it in the C library or something. Now, that would make it difficult to share after fork. Not impossible, but difficult. It would change the semantics of how it was shared after fork. But again, this is not really a requirement. This is a convenience, and it's also of course done expressly because of fork, right?
Because the semantics of file handles after fork allow processes to do IPC, and if the offset's not shared that becomes more difficult, right?

All right, this is a low-energy room today. That's it: everybody get up, stand up. No, I'm serious, actually stand. All right, stretch a little bit. I know it's Monday. It got cold again; it was feeling like spring and then it got cold. And clearly people have just started assignment 2, I don't know. All right, now let's get back to it.

So again, we use open and close to establish relationships. Now we're just going to go quickly through the Unix file interface. Read and write are for performing reads and writes on a particular file that I have open. I think you guys understand this stuff. You're implementing it, so I'm just going to scoot through it really quickly. And then lseek is the way that I adjust the file pointer explicitly. Read and write adjust the offset implicitly, as a side effect of the call; lseek is how I do it explicitly. Again, you're building it, you're intimate with it, so I'm not going to bother.

What call haven't I covered here? dup. Why not cover dup? I don't think dup really has anything to do with manipulating files. dup has to do with a process's view of files. When I was thinking about this, it's like, okay, dup is kind of a nice way of manipulating the way that processes look at files, but dup doesn't have any effect on the file itself, whereas all these other calls could potentially do that. Open and close of course don't really have much effect on the file either, but at least they're somehow involved with maintaining a view of the file.
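The difference between a per-open offset and a global one, and what dup does to a process's view, can be sketched with three descriptors on the same file: two independent opens and one duplicate. This is an illustrative sketch with invented names, run from a single process for simplicity. The shared-after-fork case behaves like the dup case here, but that is stated as background rather than shown.

```python
import os
import tempfile


def offsets_demo(path):
    """Show which descriptors share a file offset and which do not."""
    a = os.open(path, os.O_RDONLY)     # first open of the file
    b = os.open(path, os.O_RDONLY)     # second, independent open of the same file
    c = os.dup(a)                      # duplicate of `a`: same open-file state
    try:
        os.read(a, 4)                                  # implicit offset of `a` moves to 4
        pos_a = os.lseek(a, 0, os.SEEK_CUR)            # query the offset without moving it
        pos_b = os.lseek(b, 0, os.SEEK_CUR)            # separate open: still at 0
        pos_c = os.lseek(c, 0, os.SEEK_CUR)            # duplicate: follows `a`
        os.lseek(c, 10, os.SEEK_SET)                   # explicit move through the duplicate
        pos_a_after = os.lseek(a, 0, os.SEEK_CUR)      # `a` sees the move too
        return pos_a, pos_b, pos_c, pos_a_after
    finally:
        for fd in (a, b, c):
            os.close(fd)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.bin")                 # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"0123456789abcdef")
    positions = offsets_demo(path)
    print(positions)
```

The independent open keeps its own position, which is why an offset stored in one global per-file structure would give the wrong behavior. None of these calls changes the file's contents, which matches the point that dup only affects the process's view.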
Hi. All right, so, fine. We talked a little bit about the file on its own. Now we've talked about the relationship between processes and files and how Unix semantics come into play there. So now let's talk about a file system, a whole system of organizing multiple files. One file by itself is not really that interesting, but files together start to make something that's a really powerful way of storing and querying data.

The simplest requirement for the file system goes back to our requirements for the file. I have to be able to find the file. I want to give the operating system a name and I should get contents back. And that means that the name has to be unique across the entire system. Seems kind of obvious, right? So let's say that I have a file system, and I'm trying to keep my names unique, and I'm writing a series of letters to various people. This is pretty easy. I start off with a letter to mom, I write a letter to my wife, I write a letter to Choo-Choo in dog language, I write another letter to my wife, I write a third letter to my wife.

So again, you guys are so brainwashed by hierarchies that you probably think this is just dumb. Why not just use a hierarchical namespace? Well, there were some early file systems that didn't use a hierarchical namespace, or had very limited support for hierarchical namespaces. What you have then is a flat namespace. Imagine you had one directory on your computer and every file had to go in there, and every file had to have a unique name. When I was writing these slides I thought it'd be really interesting to count the number of files that I had on my laptop and figure out how long the file name would have to be if it had to be unique, right?
How long would the file name have to be to cover the entire number of files on my system? So I started running find in my root folder, and after about half an hour I got bored and killed it. But there are a lot of files on your system, a lot of them are small files, and to some degree the hierarchy starts to provide some form of organization and semantics. But keep in mind that hierarchical file systems were a new thing at some point, maybe in 1962 or something. At some point we started to get this directory model, and it's useful to think about why this happened.

The big thing that hierarchical file systems allow is for you to view only a portion of your entire computer at once. And this is a powerful idea. It's used in a lot of ways when we design systems, and maybe it's another design principle that we're getting to: don't give the user a view of all the features at once. Don't give a user a view of everything in your system at once. Allow them to organize the way that they look at the system. Can you imagine if you had to store everything in one directory and run ls? You'd run it and it would sit there churning away for five or six minutes. Then you would pipe it into less and you'd have to figure out where things were. Now I can keep my directories any size I want. A directory can have two files in it, or it can have 2,000 files. It really depends on what I want to do. But it allows users to create, essentially, views of the underlying files, views that are hopefully organized by something meaningful, right?
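Going back to the flat-namespace question of how long a unique name has to be: the lecture never arrives at a file count, so the sketch below uses a purely hypothetical one million files and a 36-character alphabet (lowercase letters plus digits) as stand-ins. It computes the shortest name length that leaves enough distinct names to go around.

```python
def min_name_length(num_files, alphabet_size):
    """Smallest length L such that alphabet_size ** L >= num_files."""
    length = 0
    capacity = 1                  # number of distinct names of the current length
    while capacity < num_files:
        capacity *= alphabet_size
        length += 1
    return length


# Hypothetical inputs, not measurements from any real machine.
needed = min_name_length(1_000_000, 36)
print(needed)
```

Under those assumptions only a handful of characters are needed, which suggests the real cost of a flat namespace is less about name length and more about what the transcript emphasizes: one enormous listing with no meaningful views.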
So here's an example of how I've organized my letters. One requirement that we do impose is that every file should have a canonical name. There should be one name on the file system that references that file. And I guess that's not actually true all the time, so I'm glad I have that disclaimer. But you can imagine that we start from a place where every file has one unique name, so there's a one-to-one mapping between names on the file system and file contents. At some level you can think about a position that the file has within this hierarchical file system that I've built.

So let's think about what it means now that I have this idea of location. Location is associated with the view that I just described. I'm giving you a view of the file system, and that view is associated with you being in some particular place, some location on the file system where you can see other files, a certain set of files that's associated with that location. So what does this mean? Well, first of all, I need to be able to move around. I have to be able to navigate the file system somehow. And I might want to navigate in a relative way, relative to where I am, meaning that my locations now might include pointers to other locations. I know this seems really obvious, but bear with me, because it's kind of cool to think about how all this stuff fits together.

The other thing that's interesting here, that again we don't necessarily think about because it's so natural, is that location and names are bound together, right?
So when I give you a full path name to a file on my computer, not only is that name guaranteed to be unique across the entire system, but it's also associated with a location. And again, you guys have just absorbed this Kool-Aid over years and years, so that when you get up in the morning and look at yourself in the mirror and you're all pink, you think you look normal. But this is really interesting, and the way that these things get coupled is kind of nice. It's very elegant.

All right, so here's an interesting question that maybe somebody can help me with. Why are file systems normally organized into a tree? A single root, or maybe, depending on the file system, a set of roots, some sort of direction within the file system, and acyclic: no cycles. Why are most file systems organized this way? What's that? Easy searching? Okay, I hadn't thought of that. I don't know if that's true or not, but it's possible. What else? The single names thing. Okay, so we're getting closer. This goes back to our unique name requirement.

So I have a clever example here. Here's my file system. The green things are locations, or directories, in the file system, and this is a file. My question is: what is the name of this file on this system? How many possible names does this file have? There are two ways to answer this question. One is you can write a function that generates all possible names for this file. The other is you can say this question doesn't even make sense, because a name relative to what? So there's no name for this file. This file is struggling now. Okay, so you might say, well, I need a starting point, right?
I need somewhere to start. You haven't given me a point of reference from which to generate canonical names. So let's say now I give you a root. "You" is now your root. So now what is the name of the file? How many possible names are there? No, there are many: "you used to love me you used to love well", "you me love to used you me love well", and so on. So I still have a problem here. I've got a root, but what do I have in this file system that's making this difficult? I have a cycle. Okay, so now I'm going to break the cycle, and now I can say the name of the file is "you used to love well".

And this is why we do this. This is why we have a root, and this is why we don't allow cycles in our file system: because we want files to have at least one canonical name. A canonical name is a name that's not relative to anything else. It's relative to the root. So here the canonical name is, again, "you used to love well".

What's one relative name for this file? What's one relative name that starts at the root node? You guys are used to Unix. Give me another name for this file, another path that resolves to the same file. It starts at "you". How about this: you, used, to, love, me, well. That works, right? Here's another relative name. This one starts here: love, me, dot dot, dot dot, love, me, dot dot, well.

So anyway, I thought this was fun. This is based off this poem, which I like. It's a sestina. I'll leave it up for just 10 seconds so you guys can read it. It's quite cute. But anyway, this was the generator for this example. Any questions about this stuff? I know this stuff seems really obvious, and maybe it's good that it's obvious. It's good you guys have absorbed it. But I wanted you guys to see the reasons for some of these things. Any other questions about this? Yeah. All right, so okay, this is a good point, right?
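The canonical-versus-relative naming in this example can be sketched with Python's `posixpath` module. The tree layout below is an assumption reconstructed from the spoken description, since the slide itself is not in the transcript: a root directory `you`, a chain `used/to/love` beneath it, and inside `love` both a directory `me` and the file `well`. The `resolve` helper is hypothetical, and `normpath` works purely on the text of the path, which is only equivalent to real resolution when no links are involved.

```python
import posixpath

ROOT = "/you"                            # assumed root of the example tree
CANONICAL = "/you/used/to/love/well"     # the one name relative only to the root


def resolve(start_dir, relative_path):
    """Lexically resolve a relative path against a starting directory."""
    return posixpath.normpath(posixpath.join(start_dir, relative_path))


# A path from the root that detours through `me` and back up with "..".
via_me = resolve(ROOT, "used/to/love/me/../well")

# The longer relative name from the lecture, assumed to start at `to`.
from_to = resolve("/you/used/to", "love/me/../../love/me/../well")

print(via_me)
print(from_to)
```

Both relative names collapse to the same canonical name once the `..` steps are applied. With a cycle in the graph there would be no such collapse, because paths could keep going around forever, which is the argument for a rooted, acyclic tree.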
So some early email clients adopted this sort of hierarchical directory structure. Now, there are a lot of people who have attacked, not this poem, but hierarchical directories in general, and said: why is this a good way of organizing information? How many people have a file cabinet at home where your file folders have subfolders that themselves have subfolders that can have pointers to other folders? Most people that aren't weird don't organize things like this. So it's not clear that a hierarchical file model, despite the fact that we've been programming it into you guys for decades now, is necessarily a great or natural way of organizing information. But there were early email clients that said, okay, people like folders, and we'll have folders with the same set of semantics that we're used to with hierarchical file systems.

What's different about Gmail labels? You can organize your Gmail labels in a way that makes them look very much like a hierarchical file system, so how do they differ? Right, right. So, for example, if you click on a label and then you click on another label: in a traditional hierarchical file system, you would see two different sets of email. But in Gmail, labels can be applied to any email, and emails can have multiple labels. When I click on a label, what I'm viewing is the set of all (keep calm) the set of all emails that have that label. So mail can very naturally be in several places at once. You can imagine building a file system that works that way, where rather than associating a file with a unique location, people just put a label on it, right?
And then the way that I access the file system is I choose a label, and that label shows me the files that have it. You could probably implement something like that on top of normal files. But again, it's a really interesting question whether hierarchical file systems even matter anymore, period, given the way that most people access their computers now. If you're not weird and old-fashioned like me, you just use Spotlight, or on Windows, does Windows still have the dog, the file dog or whatever? Speaking of dogs, where is my dog? Is he over there? Okay, good. So anyway, search is pretty big.

Okay, so let's switch gears, now that I've bored you to death talking about things that are obvious, and let's talk about some things that are less obvious. What are the design goals of a file system? Up till now we've been talking about the types of things that a file system is going to try to accomplish. One is that I need to efficiently translate file names to file contents. Names are the handles that processes use when they open files, so one of the things I need to do is path name resolution. I need to take the name and figure out what contents a process expects when it uses this name. I need to support changes to files: file content changes, which cause file attributes to change, and the file location could change, or really the file name, given that names are coupled to locations in a fairly direct way.

I guess one thing I should have put on here that I didn't is that preserving data should be goal number one. Do not lose stuff. I should have put that on there. I'll fix it. So job zero is: don't lose things. The rest of these are features that come after storing data reliably, right?
If you fail a move, that's okay. If you attempt a move and you lose data, that's not okay. So really, on some level, keeping file contents secure is number one.

The file system is also going to try to make access to files as efficient as possible. This improves performance on the system. We know that the disk is slow, and the rest of the system in many ways is frequently bottlenecked on the disk, so the file system is going to do its best to optimize access to single files. And then the other thing I'm going to do, if I'm a clever file system, is think about file layout and about how to optimize access to multiple files. Can I observe relationships between files that might mean I want to put them in particular places on the disk? There are certain clusters of files that might often be accessed together, and if I can do this in a clever way, especially if I can exploit some layout features of the disk (assuming I know them, which we talked about), then I might be able to do this better.

And then finally, surviving failures. There are two things I want to happen when the computer shuts down or you unplug it. This actually happened to my machine during the hackathon. Towards the end, someone just yanked the power cord out of it, and it was okay. Here's the thing: it didn't come back up with the file system having forgotten where all my files were. Now, there were probably some file operations going on when the machine was suddenly disconnected from the power source that didn't finish. But it's more important that we fail a couple of operations that were in progress and are able to come back and still see a consistent view of the files, right?
So this is really crucial. What we're going to talk about in a second is that file operations usually involve modifying multiple different structures. One of the things that can happen is that part of these modifications finishes and part doesn't, and so I have the file system in an inconsistent state. Certain file systems, when that happens, just punt. They say, I can't find anything anymore, my superblock is corrupted or my inode table is broken, sorry, here are a few files back, and the rest of them are kind of gone forever. So it might be hard. But I want a way to survive failures that allows me to either immediately see a consistent view of the file system, or rebuild a view of the file system that allows as much as possible of what was happening when the machine failed to have been moved to disk, to be visible on disk.

The file systems that we're going to talk about for the next few days all support these classic features. They support persistent state, and they use hierarchical namespaces. So this is the part where, rather than talking about stuff that seems obvious to everybody because you guys are used to using it, we'll talk about stuff that's happening down at the disk block level, and that's I think more fun, because you guys probably don't know about that. The difference is in how these things are accomplished.

When we get down to the disk level, you can really think about file systems as storing two types of data. The first type is the stuff that's in files. That's some of the more important stuff to store: the contents of the file. Those get stored in something that we call data blocks, and data blocks, as I just said, contain file data, right?
So if data blocks contain file data, what do index nodes, or inodes, contain? Directory index? They contain not file data, right? Anything that's not file data: any metadata that the file system needs to translate names, to locate different parts of the file, to allow files to grow and shrink, blah blah blah, all that stuff gets stored in these metadata or index nodes, right? And so directories... no, actually, directories, it turns out, at least on the file systems we're going to talk about, are actually implemented as regular files, right? They just happen to be regular files with this very special format. Okay? All right. So again, come back to our question. These are our design goals. What makes file systems different, right? So if file systems are all going to provide this hierarchical namespace, and they all need to support the standard UNIX file system interface, since we're talking about UNIX file systems, what's different about them, right? What can possibly be different, right? Anybody, any ideas what makes these things different, right? Or where would you see the differences? Let's say that I gave you two file systems, okay? Two disk partitions of identical size, two different file systems storing the exact same data, right? The file hierarchies are identical, okay? And let's say I asked you: here are two file trees that are identical. Are the file systems the same or not? Where would you expect to see a difference? Well, right, that's one thing, but again, where would you look to find a difference, right? What would be the easiest thing to look at to see if these file systems were different? And let's say the histories of these file systems were identical, like I've had two disks attached to my computer and I've just been mirroring every file operation, right?
So where would you look to see if these two file systems are the same file system or not? No, no, no, don't cheat, don't cheat. I'm asking a more fundamental question. What are files? What do file systems do fundamentally that makes them different? They're translating file operations into operations on what low-level thing? Disk blocks, right? At the end of the day, what information gets stored where on disk, that's the difference between file systems, even if they're supporting identical file trees, right? Two identical file trees, same operations, same order: if the file systems are identical, the on-disk structures are going to look identical. If the file systems are different, the on-disk structures are very, very different, okay? So on-disk layout is on some level the lowest place you could look. But the source of the difference is in the different data structures the file systems use, right? So how do they translate names? How do they allow files to grow and shrink? Where do they put updates to files? Right, if you write over a portion of a file, where does that write go? And so this is the stuff that we'll start to talk about over the next couple of days. You know, what happens at the disk block level, right? That's where things are different, okay? And then the other big source of difference that's pretty interesting is how do they prepare for and recover from crashes? And there was a lot of work in this area over a couple of decades. And now the file systems that you use on most modern systems do a way, way, way better job of this than old file systems did, right? And they do it in a very nice way that again allows me to, you know, my machine is disconnected from the power, I push the on button, and boom, it's just there, right?
There's no laborious, long process that has to run to rebuild the file system. Everything just appears where I expected it to be, right? That's pretty nice. Actually, it took us a long time to get here, so it's a nice story that we'll talk about a little bit. We'll talk about different ways of surviving failures, and again, a lot of it comes from preparation, right? How do I prepare for failure? What do I do so that when the computer comes back on, I have some state that allows me to recover and produce a consistent view of the files, okay? So what's hard about this, right? What's hard about file systems? Well, on some level you can think about a file system as just this big data structure, right? It's this big, complicated data structure that might have multiple little data structures embedded in it, right? A lot of file systems use, you know, B+ trees or various types of efficient data structures to locate file contents, to translate names, etc. And so you have this big, complex data structure, and making changes requires touching lots of different parts, right? So this is kind of the consistency nightmare, right? You guys looked at this stuff when we did the synchronization assignment, right? Now I've got all these changes to make to different data structures on these slow disks, right, meaning that any change takes a long time. You know, I've got to modify this block and I've got to modify that block, and then I need to write some data, and I need to update that. And the problem is failure can happen at any time, right? So the window over which a failure can occur is so much bigger for file systems because disks are so slow, right? So I'm giving you a big target. I'm giving you a lot of time to pull out the plug when I was in the middle of doing something that required me updating multiple parts of the file, right?
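As a rough illustration of that consistency problem, here is a minimal sketch in Python. The names (`Crash`, `allocate_and_attach`, `leaked_blocks`) and the two in-memory lists are hypothetical stand-ins for on-disk structures, not anything from a real file system. The sketch only shows that when an update touches two structures and is interrupted between them, the structures end up disagreeing.

```python
class Crash(Exception):
    """Stands in for the power being cut (hypothetical, for illustration)."""


def allocate_and_attach(bitmap, file_blocks, crash_between=False):
    """Two-step update: mark a block as used, then attach it to the file."""
    block = bitmap.index(False)   # first free block in the toy bitmap
    bitmap[block] = True          # structure 1 updated
    if crash_between:
        raise Crash("power lost after updating the bitmap")
    file_blocks.append(block)     # structure 2 updated
    return block


def leaked_blocks(bitmap, file_blocks):
    """Blocks marked in use that no file owns: the trace of a torn update."""
    return [b for b, used in enumerate(bitmap) if used and b not in file_blocks]


bitmap = [False] * 8   # toy free-block bitmap, assumed 8 blocks
file_blocks = []       # toy list of blocks owned by one file

allocate_and_attach(bitmap, file_blocks)   # completes normally
try:
    allocate_and_attach(bitmap, file_blocks, crash_between=True)
except Crash:
    pass   # the machine "comes back up" with whatever was already changed

print("file owns blocks:", file_blocks)
print("in use but unowned:", leaked_blocks(bitmap, file_blocks))
```

A checker like `leaked_blocks` has to scan every structure to find the disagreement, which hints at why rebuilding a consistent view after a crash can be slow without some preparation beforehand.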
So let's go through an example here. And I know I haven't introduced too many of the on-disk structures yet. We're going to get there, but I just want to illustrate why this is difficult, right? So let's say I have a process and it wants to write data to the end of a file. I'm going to do an append, right? This is a pretty common operation, right? What does the file system have to do to allow this to complete, okay? Well, first I need some space, right? The file is getting bigger, okay? So I've got to find some disk blocks to use to put the new content, right? And there's probably a data structure on disk somewhere that tracks where those disk blocks are and whether or not they're in use, right? So here's my first thing: find empty disk blocks. And once I've found them and I'm taking them, I better mark that they're in use, because if I don't, somebody else might come grab them before the write's finished, right? And that would not be good, okay? Once I've got those disk blocks, now I have to somehow associate them with the file that the process is writing to, right? So these disk blocks need to go from being unallocated to associated with that file. So the next time a process opens the file, it better find those disk blocks. If it doesn't, then I can copy the data into them, but, you know, no one else is ever going to see the append, including the process that's about to do it, right? So now the file is growing. I've got to make the file bigger. I've got to associate those disk blocks. And now this is potentially another structure I have to update, okay? I probably don't want to calculate the file size every time it's requested, so I probably have some metadata I need to update, right? The file size is getting bigger. I'm writing 4K to the end of it. So there's another write I have to do. Now finally, you know, after I've done all that, I can actually copy the data, right? And these things aren't necessarily in order, right?
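The logical steps of the append described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular file system does it: `ToyFS`, its fields, and the 4096-byte `BLOCK_SIZE` are all hypothetical, everything lives in memory, and each append starts in a fresh block rather than filling the tail of the last one.

```python
BLOCK_SIZE = 4096  # assumed block size, matching the 4K write in the example


class ToyFS:
    """Hypothetical in-memory stand-in for the structures an append touches."""

    def __init__(self, num_blocks):
        self.bitmap = [False] * num_blocks   # which blocks are in use
        self.blocks = [b""] * num_blocks     # the data blocks themselves
        self.inodes = {}                     # name -> {"blocks": [...], "size": n}

    def create(self, name):
        self.inodes[name] = {"blocks": [], "size": 0}

    def append(self, name, data):
        inode = self.inodes[name]
        chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

        # Step 1: find empty blocks and mark them in use.
        free = [b for b, used in enumerate(self.bitmap) if not used][:len(chunks)]
        if len(free) < len(chunks):
            raise OSError("no space left on toy disk")
        for b in free:
            self.bitmap[b] = True

        # Step 2: associate the blocks with the file.
        inode["blocks"].extend(free)

        # Step 3: update file metadata (here, just the size).
        inode["size"] += len(data)

        # Step 4: finally copy the data into the blocks.
        for b, chunk in zip(free, chunks):
            self.blocks[b] = chunk

    def read(self, name):
        inode = self.inodes[name]
        return b"".join(self.blocks[b] for b in inode["blocks"])


fs = ToyFS(num_blocks=16)
fs.create("log")
fs.append("log", b"a" * 4096)
fs.append("log", b"b" * 100)
print(fs.inodes["log"])
```

On a real disk each of the four steps may be a separate write to a different location, which is where the ordering and failure questions come from.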
Certainly I need to get disk blocks before I do anything else, right? Because otherwise I don't have anywhere to put the data. But some of these other things can go in multiple orders, right? Maybe I adjust the size after the write is completed, right? And again, from the perspective of a process, or another person, another process using this file, all these things kind of have to happen synchronously, right? So if I get interrupted and I've got some disk blocks, another process shouldn't be able to use those disk blocks, right? And if two processes are doing appends in parallel, you know, they have to be ordered, right? One of them needs to finish and then the other one has to finish, right? So there's all these different ordering constraints and difficulties here. And remember the other thing too, the thing that keeps file system designers up at night: the system can fail at any point, right? Like, at any point, you can hit the power button, or your dog can trip over the power cord, or whatever, right? And, you know, you might be here, you might be here, who knows, right? So you have to be prepared at any point for failures, and essentially be able to recover and have some idea of what happened, right? And again, at the disk level, these are all these asynchronous operations, right? So depending on my disk structures, I have a data structure that stores which blocks are in use, and that's got to be updated. I've got a file structure that stores which blocks are associated with the file, and that's got to be updated. I've got some metadata for the file that might be stored somewhere, and that's got to be updated. And then I do actually finally have to write the data, right? So there's all these different parts that have to be updated. And part of the fun of file system design, as you can imagine, is, like I said, on-disk layout matters, right?
Especially when you have spinning disks, because moving the heads around all over the place doesn't work well. So now I've got this fun game to play, right? Where do I put this stuff on disk, right? Where does it actually go? And that's what a lot of file systems spend a lot of time thinking about, okay? All right, so I'm actually done on time for once, even a little bit early. So next time we're going to talk about path resolution, and we'll talk about the actual mechanics of growing and shrinking files. How do the file structures actually allow things to get bigger and smaller? And then we're going to start talking about some of the principles of the Berkeley Fast File System. This is kind of a classic file system that gets discussed in operating systems classes. It's very, very location based, but there are some nice design principles embedded in it, so we'll talk about that. There's one more minute if anyone wants to ask questions, otherwise you guys can just... I would have a scenario you had... Yep, yep. Those are all logical steps, asking for that space? No, no, no, this all happens when I do a write. A process calls write, and this is all stuff that... what I tried to separate out here are the things that have to happen to potentially different parts of the disk. Right, what are the logical steps? I've got to find some space, I've got to associate that space with the file, I've got to update maybe some file metadata like the access time and the modify time and things like this, and then I've got to actually do the write. And all these steps, because they're all making modifications to disk blocks, take time. They might mean that the drive head is bouncing all over the place. The bitmaps are on the inside of the drive, the files are on the outside, the free blocks are somewhere else, you know what I mean?
So from the perspective of the process, this is all supposed to look synchronous. I call write, and when the write finishes, the file is bigger, right? And the file system has to make sure that those blocks are allocated and all this other stuff, right? It doesn't write anything until it makes sure that it... So I guess the point is, you're right, a lot of this stuff can get stuck in the cache, right? A lot of it might hit the cache, right? And we'll talk about this. So one classic thing that file systems do, and that the operating system does to improve file system performance, is it holds data in the cache. So you do a write and you think that that write is going to disk, but what actually happens is that the write is in memory somewhere and the operating system is waiting to write it to disk. And we'll definitely talk about caching, right? But at some point, the write has to actually go to disk if it's a file and it's going to behave like a file that you're used to. And a big performance trade-off in file systems is how long do you hold data in the cache? Because the longer you hold it there, the more likely it is that it will never get to the disk. Because if it's in the cache and you turn off the power, that stuff's gone, right? Like, the memory ain't coming back, right? So again, we've talked a little bit about this before. At some level, the fastest file system is the one that never writes anything to disk until either you run out of cache or the machine shuts down. It just holds everything in memory for as long as possible. And then when you shut down the machine cleanly, right, it's like, oh, okay, now I gotta... you can imagine, right? When is the disk useful? When the machine is off, right?
That's when the disk is doing something useful: it's storing data persistently. When the machine is on, it's just a slow pile of junk that you're trying to get things on and off of. So if you can use it as little as possible while the machine is running, that's the best way to improve performance. It does have implications for failures, but that's a good point. Any other questions about this stuff? All right, I will see you guys on Wednesday.
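The caching trade-off from the end of the lecture can be sketched as a tiny write-back cache. This is an illustrative model only: `WriteBackCache` and its methods are hypothetical names, and a plain Python dict stands in for the disk.

```python
class WriteBackCache:
    """Toy write-back cache: writes stay in memory until flushed."""

    def __init__(self, disk):
        self.disk = disk    # dict: block number -> bytes, stands in for the disk
        self.dirty = {}     # writes held in memory, not yet persistent

    def write(self, block, data):
        self.dirty[block] = data        # fast path: no disk I/O at all

    def read(self, block):
        if block in self.dirty:         # newest data may only be in memory
            return self.dirty[block]
        return self.disk.get(block)

    def flush(self):
        """Push every cached write to the disk; returns how many were written."""
        count = len(self.dirty)
        self.disk.update(self.dirty)
        self.dirty.clear()
        return count

    def power_loss(self):
        """Memory is gone; returns how many writes never reached the disk."""
        lost = len(self.dirty)
        self.dirty.clear()
        return lost


disk = {}
cache = WriteBackCache(disk)
cache.write(0, b"hello")
cache.flush()                  # block 0 is now persistent
cache.write(1, b"world")       # still only in memory
lost = cache.power_loss()
print("on disk:", disk, "lost writes:", lost)
```

The longer `flush` is postponed, the fewer disk writes happen while the machine is running, and the more writes a call like `power_loss` would throw away.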