 All right, everybody. Let's get started. Today, it's Friday. You hear it for Friday? Who likes Friday? Who doesn't like Friday? So it's a judgment day. What's that? So it's a judgment day? Judgment day. What is judgment? All right, so today we're going to talk about files. And today is one of those lectures that as I was writing and I was thinking, this all feels really obvious, but I think it's worth talking about. Partly because files are one part of the operating system that I think most people think that they understand. They think that they know they've used them before, so they think they have some idea about how this stuff works. But today what we're going to try to do is kind of carefully unpack the different components of the file abstraction, where they come from, so that you guys can kind of understand what part of this stuff, like this big dead zone right here. It's like there's a nuclear waste spill or something and people are drifting. I can kind of drift over here. So anyway, so we're going to go through the parts of the abstraction and kind of in a detailed way and try to pinpoint where they actually come from. So what things are kind of like basic to a file itself? What things that we might think about as being part of the file are actually UNIX file semantics? What types of things associated with files are actually associated with kind of hierarchies and hierarchical file systems, which is a certain way of organizing files. So I'm going to try to like split these things up, put them in their respective boxes, and you guys can see that we have a lot of flexibility in terms of how we build and implement file systems. So Judgment Day, you heard it here first. Judgment Day is April 2. Assignment 2 is due on April 2. That's an extra week. So I decided to extend the deadline. Does that make me a bad person? I don't think so. 
So look, I know this is not the first time this has happened and there is one deadline which will not move, which is the class is going to end on like May 2 or something. And I can't move that deadline. So Assignment 3 is kind of, you guys are probably thinking now like every time he gives an assignment he always extends the deadline. And so far that's been true, but I'm just saying that this pattern is going to run out of steam in the next iteration. So what will happen is that Assignment 2 is going to be due a week later than it would have been. That's on Monday. Assignment 3 is going to be shortened by a couple of days and it's going to be set up so that your five late days should take you exactly to the last day of reading period. And that's just as far as I can push things. That's as much time as I can give people in order to do these. The thing that I think that has happened that is good is that people have started to work on the assignment and they've discovered that it's not that easy. So how many people have everything working? Where's Isaac? He had everything working like two weeks ago. OK, so this assignment takes some time. And you have a little bit more time to work on it. But now you guys have some momentum built up. Don't waste it. Don't go home and be like, woo-hoo! Spring break part two, let's just sleep for a week and then we'll get up and finish the assignment. No, the assignment, I think most of you guys, if you keep working at the pace that you started to establish over the last couple of days, you might actually finish the assignment by April 2. So this class, I think that this assignment sneaks up on people a little bit. The next assignment should also, well, hopefully, will not sneak up on you after you've done this assignment. The next assignment is harder and bigger, but you'll have more time and you'll be ready because you'll have done this and you'll see, oh, this stuff's not easy. 
There's two more announcements that didn't make it up onto the slide. One is simply please follow instructions for contacting the course staff. I don't want to have to talk about this every day, but I sent an email and I'm still having issues with people emailing me directly, not emailing the course staff, sending us questions that should really go up on the forum, et cetera, et cetera. So, oh, second thing about that, so there's actually three things. If you talk to me about something or to the course staff or we communicate over email and we give you a clarification about the assignment, please, by all means, feel free to post that to the forum. That's what that's for, right? If I had more time, I would try to do more work on the forum and maybe I should be answering questions there rather than in my office, but when I tell you something, it's not supposed to be privileged knowledge for you alone, right? I'm offering clarifications on the assignment, so be the nice person, be a big person and share those with your classmates and maybe when they're the person who's getting the special hints, they will reciprocate. So, anything I tell you, anything the course staff tells you is definitely postable on the forum. People have asked about this, right? The third, there was a third thing. Oh, right, the third thing, when you approach the course staff for help, please be specific about what your problem is, right? So we get people writing in and it's like, oh, you know, like, I ran my kernel and it crashed. Can you help me? No, you know, I mean, there's almost an infinite number of reasons. Or, I was working on fork and I ran it and it crashed. No, right, like you've narrowed down the search space from 10 billion possible causes to one billion possible causes, right? Like, this stuff's hard and there's a lot of different things you can do wrong and once you guys start to work on these assignments, you're building your own private universe, right? 
So I don't know what you did, I don't know what lines of code you wrote and like, you know, if your kernel panics, what would be a useful thing to send to the course staff if you need help? Anybody? A trace? A trace? What about just the output of the panic itself, right? Like, when the kernel panics, it prints some very, very useful information that you can use for debugging purposes and you can tell us those things, right? Or if you have an error, cut and paste the error message into your email so we have some idea of what's happening, right? Don't describe it in English, that doesn't work, right? So anyway, the more specific you can be, the more likely that you're not gonna get a response to your question that's kind of like, good luck, you know, because there's really nothing we can do. I mean, there's so many different possible root causes for these things and without sitting down and looking carefully at stuff, this is the other reason why it's great to come to office hours because you can come in, you can bring your laptop, you can bring your code, you can sit down with the course staff and you can go line by line and look at things, right? Debugging these sorts of problems by using error messages or panic messages alone is itself almost impossible. I mean, we might be able to give you some general idea of where to start looking but we can't solve your problem without seeing what you've done, right? There's all sorts of things that you could have done wrong, all right? Okay, so on Wednesday we talked about disks, the basics of disk operation, parts of the disk, locations on the disk, how the operating system reads and writes things to disks, sources of latency, what makes disks different from other parts of the system. So any questions about the material that we covered on Wednesday? Really exciting stuff. We watched some movies on the interweb. It was kind of cool. No movies today, unfortunately. Any questions about this? Okay, so let's do some review. 
So, Manuja, what is a platter? Part of the disk, you weren't here. Well, it's, I mean, yeah, I mean, you're close, it's... No, it's the disk itself. It's the physical object. And disks are composed of multiple of these, right? It's a circular flat disk on which magnetic data is stored. Usually on both sides, coated with some thin layer of magnetic material and the disks read and write data to both sides of a platter and disk are usually composed of multiple platters, right? That's a whole stack of, you know, 8, 10, 12, I don't know how many platters. Okay, spindle, Jason, you were gone. So I won't pick on you. Spindle. Yeah, it's the thing that... Wrong finger. It's the thing that runs up and down through the platters, that actually spins the platters, okay, the drive shaft. And then, you know, this is what actually applies the power to the platters causing them to rotate, all right? What about the disk head? We'll go over here. Disk head. Doesn't remember, Jimmy. Right, it's the actual, I guess it's an actuator and a sensor that reads and writes data onto the magnetic surface. It reads data by detecting the magnetic information that's been encoded on the platter and it writes data by changing that information, right? And again, it's sitting there skimming, skimming is the good word, right? Not floating, skimming the surface of each platter, right? As the platters rotate underneath. Okay, good. What about locations on the disk? These are a bit harder. Track. Who can describe what a track is, Ben? Right, so if the head stays at one location, as the disk spins underneath it, a track is the, you know, the strip of the disk that the head sees as the disk makes a single rotation, right? So think of a lane on a racetrack, running around in a circle around the platter at an equal distance from the spindle, okay? What about a sector? Anybody wanna place a guess on sector, John? Right, so it's like a piece of pizza thinking about the hackathon. 
So how many people are doing the hackathon, by the way? Super excited about the hackathon. So there's a hackathon tonight sponsored by the CSACM Society. It's gonna run for 24 hours. There's like five square meals. I think they will all be pizza, right? I'm looking forward to pizza for dinner, pizza for a late night snack, pizza for breakfast, pizza for lunch, and pizza for dinner, right? I mean, what else do you eat while you're programming? Anyway, so think about a slice of pizza. That's what a sector's like, right? All right, what about a cylinder? Cylinder, right back here, late arrival. The question is what is a cylinder? In the context of a disk, right? Not the abstract, idealized, Aristotelian cylinder. Okay, so it is a stack, so it is a vertical thing, but it's not a stack of platters, that would be the whole disk. What's a cylinder? It's a stack of tracks, exactly, right? So if a track is a circular piece on one platter, then a cylinder is the stack of tracks running vertically through the disk, right? And again, why do we care about cylinders? Who remembers why we care about cylinders? So if I leave the head in the same position, I could potentially read all of the data encoded on a single cylinder without moving the heads, right? Because I can read it from every track on every platter, top to bottom, right? So depending on how many heads I have, I might be able to read all of that data without moving the heads. I might even be able to read some of that data in parallel from multiple heads at the same time as the disk rotates underneath the heads, all right? Any questions about this, Carl? Well, okay, so that's a great question. So yes and no. Man, that's a really good question, actually. I should have had some slides about this. So here's essentially what Carl's question is. It's a really, really good question. So remember that the interface that's provided by the disk to the operating system is a block interface. 
Blocks are usually 512 bytes of data and the disk numbers those blocks sequentially starting with zero, all right? So roughly speaking, the interface between the operating system and the disk is, write this 512-byte chunk of data at block Y, right? Some block on the disk. Now, here's the question. Let's say I wanna do some sort of layout on a spinning disk that utilizes some of these location-based properties. What has to be true about the block mapping that the disk provides? Anybody wanna, what is with this light? It's like the twilight zone over here. Okay, anybody wanna, Carl, do you wanna continue on with answering your own question? Right, right, so there's two questions there. One is, is there any relationship between the block ID that the disk provides to the operating system and the location on the disk? If there's no relationship, then I might as well give up on all this cylinder-based, sector-based, track-based stuff. It's just, it's over, you know? Like, because I don't know. I tell the disk, hey, store that piece of data in block 16, and then I have a related piece of data, and I say, store it in block 17, and it turns out that those blocks are, you know, on different tracks, like on different sides of the disk, they have nothing to do with each other. Malin? I don't know what SCSI calls it. I mean, yeah, each one of these interconnects has its own universe of terminology, but at some level, you can, again, you could think about block-based addressing as being supported in some form by any disk. Right? Whatever it's called, yeah. I'm gonna call it a block, just because I save a little bit of my precious breath. But logical unit number, right? Or we could call it a LUN, I guess, that sounds weird. Anyway, so if there's no mapping at all between the block IDs that are exposed by the disk and where the block is on the disk physically, then I might as well give up on all of these location-based techniques, okay? 
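As a rough illustration of the block interface described above, here is a minimal Python sketch. The class name `BlockDevice` and its methods are made up for this example, and a real disk driver is far more involved; the only things taken from the lecture are the 512-byte block size and the sequential numbering starting at zero.

```python
BLOCK_SIZE = 512  # bytes per block, the size mentioned in the lecture


class BlockDevice:
    """Hypothetical in-memory stand-in for a disk exposing numbered blocks."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        # Blocks are numbered 0 .. num_blocks - 1 and start out zero-filled.
        self._blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]

    def _check(self, block_id):
        if not 0 <= block_id < self.num_blocks:
            raise IndexError("block id out of range")

    def write_block(self, block_id, data):
        """Store exactly one block of data at the given block id."""
        self._check(block_id)
        if len(data) != BLOCK_SIZE:
            raise ValueError("must write exactly one full block")
        self._blocks[block_id] = bytes(data)

    def read_block(self, block_id):
        """Return the block previously stored at the given block id."""
        self._check(block_id)
        return self._blocks[block_id]


dev = BlockDevice(num_blocks=32)
dev.write_block(16, b"a" * BLOCK_SIZE)
dev.write_block(17, b"b" * BLOCK_SIZE)
```

Note that nothing in this interface says where blocks 16 and 17 physically live, which is exactly the question being raised.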
Now normally, there probably is some kind of mapping, but as Carl pointed out, what disks will do to hide failed sectors is they'll remap those sectors, right? So, you know, blocks 14, 15, 16, 17, 18 will be next to each other, and then 19 will be way over somewhere else, because 19 was a failed sector that the disk has remapped to some other part of the disk, and then 20, 21, 22. So, it's possible that most of the sectors, most of the blocks have a nice, you know, some nice physical relationship to them, but there's some small subset that don't. And in that case, you can imagine that these location-based layout techniques will work pretty well, right? But one of the cases where this really breaks down is in RAID, right? And we're gonna talk about RAID maybe next week, maybe the week after, but RAID is a technique for improving the redundancy of drives, right? But in general, anything I do at the lower layer that breaks that connection between block ID and location on the disk renders all of this complex stuff completely moot, and I actually had a good friend of mine who wrote a paper when he was in graduate school that was called Stupid File Systems Are Better. And his thesis was exactly this. He said, we've got all these complex file systems that do all this layout based on positioning on the disk. And yet, now we have these devices that completely break any sort of relationship based on block ID. And so what happens is, the file system does all this crazy stuff to try to put files close to each other on the disk or try to put them on the outer edge of the disk or try to put them in cylinder groups, and the disk is like, ah, whatever, you know? And they just end up randomly placed, right? So his thesis was that in certain cases, it's better to implement something quick and dirty and simple in the file system layer because these sorts of assumptions about layout are being broken. All right. 
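The remapping of failed sectors can be sketched in a few lines of Python. This is only an illustration: the class `RemappingDisk`, the spare area starting at physical sector 1000, and the identity mapping for healthy blocks are all assumptions made for the example, not how any particular drive works.

```python
class RemappingDisk:
    """Hypothetical model of a disk that hides failed sectors by remapping."""

    def __init__(self, spare_start=1000):
        self._remapped = {}             # block id -> spare physical sector
        self._next_spare = spare_start  # assumed location of the spare area

    def mark_failed(self, block_id):
        """Redirect a failed block to the next free spare sector."""
        self._remapped[block_id] = self._next_spare
        self._next_spare += 1

    def physical_sector(self, block_id):
        """Healthy blocks map to themselves; failed ones live elsewhere."""
        return self._remapped.get(block_id, block_id)


disk = RemappingDisk()
disk.mark_failed(19)
layout = [disk.physical_sector(b) for b in range(14, 23)]
print(layout)  # [14, 15, 16, 17, 18, 1000, 20, 21, 22]
```

Most neighbouring block IDs are still physically adjacent, so location-based layout mostly works; only block 19 ends up far from its neighbours.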
What if it's true that you're bored to that separate network in the alley is it totally different program or different? So right, so I mean, I'm not actually sure exactly what the question is, but on some level, if you wanted to write a program that read and wrote raw disk blocks, you can do that, right? And the idea is that you would probably write it above the disk device driver, right? Which would itself know how to talk to the disk and there would probably be some other code in the operating system that would be there to drive the, whatever the interconnect is, whether it's IDE or SATA or whatever, right? SCSI, you know, I mean, all these things have their own details, yeah. Yep. For the new partition and the block number, or is it the same number that's not in it? I think that, so that's a great question. I don't know the answer to that. My guess would be that the disk just understands block IDs. And so the partitions are actually done by writing some information into a spot on the disk. And I think that that information, I mean, if you use, like, fdisk, do you know the answer to this question, Ben? Thank you, save me. So, what happened? Uh-huh. Right. See, I knew that. F5 for the 100th partition, too, sir. Right, so I think the answer to the question is that the blocks aren't renumbered within the partition. The disk stores information about what partitions exist and what blocks are inside each partition. They support dynamic partitioning, and I had a personal experience where I had to do a static partition and I do a hard partition. Then I realized that my hard disk did not support static partitioning. Is there any difference between what's on the disk and what's on the disk? I'm gonna elect to refuse to answer this question because I don't know enough about it, right? So here's my experience with partitioning, and Ben can correct me if I'm wrong. 
Most drives, I think, most drives going back many years have supported some degree of static partitioning, right? Something that's happened recently is you've been able to dynamically resize partitions, but that's as much a function of the file system as it is of the disk. So, for example, if I have a partition, I don't want to spend too much time on this because it's kind of going off on a little bit of a tangent, but I have a partition that's n blocks, and the file system has built all its structures on that. And now let's say I wanted to make it bigger, right? Well, there's two things I have to do. One is that I have to rearrange the partition on the disk, but the other thing is the file system has to adjust its own internal structures to account for the extra blocks, right? Making partitions bigger is usually easy. The challenge could be shrinking, right? So let's say I have a file system and I've told it you have n blocks to use, right? And maybe all the blocks are in use, maybe it's only using half of them, right? But now I'm gonna say, well, I'm gonna reduce your partition by a third. So now the file system has to reorganize itself to leave me a big bunch of contiguous space. So anyway, disks are messy, ooh, gross. Okay, anyway, so we're still in our review and like halfway through class, so let's keep going. These are awesome questions, Carl's question in particular is really cool. All right, so why are disks different, right? So let's go through this quickly. What's the biggest difference about disks from every other part of the system that we've talked about? It's the only thing that moves. Spinning disks move, right? What about speed, right? Disks are the "blank"-est part of the system. Slowest, right? Disks are big, disks are slow, right? And then disks are also integrated as devices, right? Devices which provide their own low-level interface that the operating system builds a file system interface on top of, right? 
And I plan to talk a little bit about the file system interface today, okay? So let's go through how we read or write to a disk, all right? So what's the first step? What's the first thing we have to do? Well, that's what we're doing. That's the whole process, right? Make the block where you want to write. Okay, so let's say I know the block I want to write. The file system knows the block IDs, right? What's the first thing I have to do? That's the second thing. What's the first thing I have to do? Issue the command, right? Gotta tell the device what to do, okay? And once I tell the device what to do, the device will select a head. It tries to select the best head to perform this operation. Disks can have multiple heads, they might even have multiple heads on an arm, right? So there's some algorithm that goes into selecting which head to use, okay? Now, what's the next thing? You had it. Move the head, right? So now I've got to move the head. This is one of the largest sources of disk latency, right? I'm moving, I'm actually, something is moving within your computer. Something moves in there, it's moving. All right, and then what's the third thing? Jimmy, do you remember? Read time? Read time, this is kind of all read time. That's what I forgot, John. I've just thrown the head over to the right track and now what have I got to wait for? I've got to let the head settle and stabilize on this very, very, very, very tiny track, right? So I've got this little strip of data and I've got to let the head kind of settle and get to the point where it's stabilized enough that I can read data and believe that it's not bogus. All right, fourth thing, what else? Yep, time to wait for the platters to rotate underneath my head and then finally, just transfer the data, right? So I've got to actually read the data up into the head and then it's got to be spit out over the interface back to the operating system, all right? 
Okay, and then the IO crisis, why do we have this slowly emerging IO crisis that's been going on for 20 years, right? Disks are getting what but not what? Bigger but not faster, right? So we've encouraged people to store all this crap on their disks but we're not actually getting much better, or rather, disks have improved only slowly as far as being able to move that valuable personal information back and forth to the parts of the computer that process and use it, okay? All right, any other questions about disks? Got some great questions before. I think Ben is just adjusting his eye in some way. You're not asking a question? Okay, cool. All right, so again, so today we're gonna try to sort of like split up some of the ideas that go into files and file systems, right? Because I realized as I was starting to think about this that my own ideas are a little bit confused and it's kind of fun to think about files. We're so used to, you know, how they work and we think we know how they operate but let's try to think about why, right? Why in a really fundamental way, okay? So we'll start out with just like what is a file at the most basic level, right? What is its function in the system and what are the minimum requirements for a file? The only things we really have to be able to do, okay? Then we're gonna talk about information that file systems store about files, right? So files themselves store information but file systems also usually store some other information about files, right, we'll talk about what that is. This can be referred to as file metadata or in other ways, okay? Then we're gonna talk about establishing relationships between files and processes and that's really where the UNIX file system interface starts to come into play, right? Because what is the system call interface? It's how the process communicates with the kernel, right? The kernel has implemented a file system that is in control of the disk devices. 
If the process wants to use it, it has to talk to the kernel through the file system interface to do that, right? So we'll talk about the UNIX file interface and some of the things that we sometimes think of as associated with the file abstraction that are really just, you know, conventions that are produced by the way the UNIX file interface works, okay? And then finally we're gonna talk a little bit about file system organization, right? Again, so much of this stuff seems totally obvious to you guys, you've just absorbed it, right? And you feel like it's the most obvious thing in the world but sometimes that's the thing that's the most interesting to think about, okay? All right, so at minimum, right? What does a file have to do in order to be useful? Anyone? It has to contain data, right? So it's gonna store information and not only store information, reliably store information, okay? And actually the reliably part is kind of hard, right? You know, a lot of the work that goes into file system design and a lot of the bugs, I mean, the canonical bug in file systems is that somebody loses data and when people lose data, they get sad, right? And when people lose data, they get sad and they get angry, right? Margo Seltzer, who used to teach this class, had somebody tell her this quote, which is, you know, the worst system bug is the one that not only makes somebody sad and angry but gives them enough time to just call, well, you know, in the old days, call you up on the phone or now flame you over email or hunt you down in a chat room to really complain about the mess that you made of their life, right? And data loss tends to do that, right? Because while your computer, you know, is at the special technician who you're paying $1,000 an hour to try to recover your MP3 collection, you have a lot of downtime, right? And you can find the person who screwed up your file system and take revenge, right? 
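The steps just listed (issue the command, move the head, let it settle, wait for rotation, transfer) can be added up into a simple latency estimate. The sketch below is only a back-of-the-envelope model: the function name, the parameter values, and the choice of half a rotation as the expected rotational wait are all illustrative assumptions, not figures from the lecture.

```python
def access_time_ms(seek_ms, settle_ms, rpm, nbytes, transfer_mb_per_s,
                   command_ms=0.0):
    """Estimate the time for one disk access by summing the lecture's steps."""
    rotation_ms = 60_000 / rpm       # time for one full revolution
    rotational_ms = rotation_ms / 2  # assumption: wait half a turn on average
    transfer_ms = nbytes / (transfer_mb_per_s * 1_000_000) * 1000
    return command_ms + seek_ms + settle_ms + rotational_ms + transfer_ms


# Hypothetical numbers: 8 ms seek, 1 ms settle, 7200 RPM, one 512-byte block.
total = access_time_ms(seek_ms=8.0, settle_ms=1.0, rpm=7200,
                       nbytes=512, transfer_mb_per_s=100)
print(round(total, 2), "ms")
```

With numbers like these the mechanical steps (seek, settle, rotation) dominate, and the actual data transfer is a tiny fraction of the total.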
And then the other thing a file has to be able to do is we've got to be able to find the file, right? So if I just, if I had a file system that was like, oh, here, here's this container, store some information in it. And then I never, it vanished, right? And the file system was always like, oh, I've got your data, man, I've got it. But you were like, where? It's like, no, I've got it. I really do, I really do, you know, say the magic word, right? So it has to be able to be located, right? So, and we do that via a name, right? And again, this stuff seems really, really basic, but these are kind of the basis for file system operation. And these names have semantics. And I kind of hate that I put up a real name there because that even has semantics that aren't just from file naming, right? But I have to be able to name the file. There has to be some string handle, some sort of identifier that I give to the file system that causes it to retrieve the data that I previously stored in this object, right? Okay, I don't know why that, okay. And so again, at minimum we expect that the file contents should not change unexpectedly. So if there's not a change to the file, the contents shouldn't just change into some different value, right? Or worse, vanish or whatever, right? And then also, when the process requests that the file contents change, they should change. And they should change to what we want them to be, right? And this part is actually, itself, kind of hard, right? When I was doing this again, I was kind of curious about file system data loss because I was kind of hoping, oh well, this was a problem in the 80s and 90s, but we've solved all these problems now. And now no one ever loses data because of file system problems. And then I found this, when I Googled this, it was like on the first page from three weeks ago. Serious file system corruption and data loss caused to other NTFS drives by Windows 8. So it's like, woo-hoo, you know? 
File system data corruption alive and well. I shouldn't pick on Windows. I mean, there's plenty of other systems that have this problem, right? But file systems are complex. They're big complex blobs of code. And, you know, they kind of do the wrong things sometimes. And that's sort of sad, right? The real bane of the existence of most file systems are failures, right? And the failures that can really affect file systems are when the system shuts down unexpectedly, right? Or loses power or whatever. And the file system is in the middle of doing something to the disk. Remember, disks are slow, right? And so in the time it takes certain disk operations to complete, a lot can happen. Like you can trip over the power cord to your machine that has a broken battery and the thing can shut down, right? And the other thing that happens here is that some of the things that the operating system does to try to improve file system performance that we'll talk about later, end up making file system crash handling more precarious, right? And the simplest thing to keep in mind here is that memory is fast but transient. Power goes out, it's gone. The disk is slow but reliable. So one of the typical ways of making file systems faster is to cache stuff in memory, right? But what happens to that stuff in memory if the power suddenly goes off? It doesn't ever make it onto the disk, right? So this is one of these kind of classic trade-offs here. If you want your system, if you want a really fast file system, just store everything in RAM, right? And then wait till shutdown and then write everything to disk all at once, right? You know, I mean that would be an awesome file system, be blazingly fast, right? But you better really hope and pray that you can get to shut down cleanly every time, right? All right. So what else might we wanna know about a file? So a file's got some data in it and I need to be able to locate it, right? 
So I need to be able to give the operating system a handle or a name and retrieve that data. What else might I want to know about a file? Okay, permissions, right? We haven't talked about security in this class but most of you guys know that systems have users and groups and other ways of essentially figuring out what processes are allowed to do what to the file, right? So can I read from it? Can I write to it? Can I alter the contents? Can I change the name? You know, there's a whole bunch of different things about the file that I might be able to change and essentially what I wanna know is when a process makes a call through the UNIX file system interface, should that call succeed or fail, right? So that's good. What else might I wanna know about a file? Anything else? Size, okay, yeah. So there's some statistics about the file, like size. What else? What about things associated with time? Access time, when was the file created? Has it been modified recently, right? So there's some when things here. Again, there's some who things in terms of permissions, process permissions, things like this. And then there's this other interesting question, which we're gonna do a little design exercise on here in a minute, of what about other kinds of file attributes? What do you mean? What compatible? If I say file has dot ABC format or something. Yeah. Windows supports it, but the Mac cannot support it. So that, so all right. So there's this other, yeah, there's, yeah. So file extensions, file name extensions are kind of gross, right? And I don't actually talk about them here because they're just, they're kind of a kludge, right? So files do have extensions typically and the extensions can kind of be thought of as hints to the operating system about which application should be used to open that file if the user requests it, right? But I don't want to get too much into that because that's kind of kludgy and there are better ways to do it. 
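The minimum requirements described so far (store data, find it again by name, change only when asked) fit in a very small sketch. The class `TinyFileStore` below is purely hypothetical and keeps everything in a dictionary; it says nothing about how a real file system lays data out on disk.

```python
class TinyFileStore:
    """Hypothetical minimal 'file system': named containers of bytes."""

    def __init__(self):
        self._files = {}  # name -> contents

    def write(self, name, data):
        """Replace the contents stored under this name."""
        self._files[name] = bytes(data)

    def read(self, name):
        """Return exactly what was last written under this name."""
        if name not in self._files:
            raise FileNotFoundError(name)
        return self._files[name]


store = TinyFileStore()
store.write("notes.txt", b"first draft")
unchanged = store.read("notes.txt")      # no write in between: same contents
store.write("notes.txt", b"second draft")
changed = store.read("notes.txt")        # a requested change takes effect
```

The two expectations from the lecture show up directly: `unchanged` equals what was stored, and `changed` reflects the explicit update and nothing else.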
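The trade-off between fast, transient memory and slow, reliable disk can be illustrated with a toy write-back cache. Everything here is an assumption made for the example: the class `CachedStore`, the explicit `sync` method, and the `power_loss` method that simulates a crash by discarding whatever was still only in memory.

```python
class CachedStore:
    """Hypothetical write-back cache in front of a slow but durable disk."""

    def __init__(self):
        self.disk = {}   # survives a power loss
        self.cache = {}  # fast, but gone when the power goes out

    def write(self, name, data):
        # Fast path: only memory is touched.
        self.cache[name] = data

    def read(self, name):
        if name in self.cache:
            return self.cache[name]
        return self.disk[name]

    def sync(self):
        """Push everything cached in memory out to the disk."""
        self.disk.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        """Simulate an unclean shutdown: memory contents vanish."""
        self.cache.clear()


fs = CachedStore()
fs.write("a.txt", b"synced before the crash")
fs.sync()
fs.write("b.txt", b"still only in memory")
fs.power_loss()
survivors = sorted(fs.disk)
print(survivors)  # ['a.txt']
```

The less often `sync` runs, the faster writes look and the more data is at risk, which is the "store everything in RAM until shutdown" extreme taken to its conclusion.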
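On a UNIX-like system, much of this per-file information (size, permissions, owner, timestamps) is visible through the `stat` call. The short Python sketch below uses the standard library's `os.stat`; the helper name `describe` and the temporary file are just scaffolding for the example.

```python
import os
import stat
import tempfile
import time


def describe(path):
    """Collect a few pieces of metadata the file system keeps about a file."""
    st = os.stat(path)
    return {
        "size": st.st_size,                    # how big is it?
        "mode": stat.filemode(st.st_mode),     # who may do what, e.g. -rw-------
        "owner_uid": st.st_uid,                # who owns it?
        "modified": time.ctime(st.st_mtime),   # when did the contents change?
        "accessed": time.ctime(st.st_atime),   # when was it last read?
    }


# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name

info = describe(path)
os.remove(path)
print(info)
```

None of these values are stored in the file's contents; they are kept by the file system alongside it.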
But let's go through an example where I might want to store some other information about a file, right? Information that really belongs with a file, but maybe I don't want to store in the file itself, right? Yeah, this is metadata or attributes, right? So think about MP3s. MP3s are a great example for this, right? So an MP3 stores audio data, right? That's really what should be in the file is whatever audio samples, encoded audio samples are required for an MP3 program to produce a waveform that sounds pleasing to the ear or displeasing depending on what kind of music you like, right? But what other sort of attributes might I want to store along with an MP3 file? Like the title, you know? This isn't a weird extra piece of information. What's the name of the song, right? What else? The artist, the band name, right? Or maybe the date the song was recorded or the album was released or whatever, right? So these things are kind of interesting, right? And we have three options about where to put them, right? What's one option? What's one place that I can put this information so that I can find it when I use the file? In the file, right? So one place is just stick it in the file itself, right? And now, rather than a file which has this nice property of just being a set of MP3 samples, now I've got some other metadata in there that's kind of becoming a database, a little weird, but okay, that's one place and that has some pluses and minuses. Where else can I put it? I think you said, right? So I can have a separate database file, right? Where I store a bunch of information and maybe that file is only used or accessed by one application, maybe multiple applications, but I can always split off this information and store it in another separate file, right? Or you could even have a small file of metadata per MP3 file, right? That starts to sound really gross, right? But actually I guess album covers are stored that way a lot. 
Like the album cover is stored as a single image file for an entire directory of other files. Carl, were you gonna ask a question? So there's one other option here. Does anyone know? Well, the third place. So you guys have gotten kind of the first two, the obvious ones, right? Yeah? So okay, so in the directory. I would say that's either in another file, if it's in the directory, or in the file. You know, that's essentially another file, right? We think of the directory as another file. What's that? But I mean, what does that mean? It's just gonna be in another file on another machine, right? So it's really in another file. So there were these file systems, including some very famous file systems, that supported what were called attributes, right? And attributes are essentially key value pairs that I can associate with any file in the system. And I can query them using database-like queries, right? So for example, what I would do here is I would say to the file system, you know, store the attribute title for this file and store this value in that attribute, right? The MP3's title, right? And again, there were file systems that actually provided an interface allowing the system to query. So if I wanted, for example, all the titles for all the songs in a certain directory, there was a fast and effective way to write that query and it would return that metadata associated with those files, right? Yeah. So where the attributes are, you know, to the process, who knows, right? The file system stores them somewhere on disk, right? Well, and not in a file that's visible through the file system, right? Like, you're right, it's going to be on the disk in a disk block somewhere, right? But we need to be able to distinguish between disk blocks that hold file contents and the rest, because there's a lot of disk blocks that store information that are not directly visible to processes or to you as a user, right? 
Because they're used to store file system metadata of all sorts of different kinds and we'll talk more about exactly how they could implement it on Monday. Yeah, it's not really part of indexing. And again, most of the systems that you're associated with don't do this, right? It's too bad because it's kind of an elegant feature, right? But let's go through this design exercise just for a minute because I think this is kind of an interesting, good example of how to think about system design trade-offs, right? So if I store the information in the file, if I jam it in the file, what's a pro to this approach? What's a good thing about it? Yeah? Everything's in one place. Everything's in one place and what does that mean, right? What does that mean about that extended information? It can be modified, what else? That's true and so what happens if I move the file? It goes with it. If I move the file to another machine, it goes with it, right? If I copy the file, it goes with it. So this is nice, right? It's just part of the file contents, right? What's a con? Well, that can happen with the file contents itself. The file's a little bit bigger, but the data's gonna have to be somewhere, right? Well, yeah, that's actually a good point. The permissions end up being the same as the rest of the file, so I can't protect the metadata separately, so that's an interesting point. But what else? What's another problem here? What about the applications that use this file? Well, as has been pointed out, every application that uses this file has to know about this metadata and understand the format. So if I write an MP3 player, and MP3s do have this, it's called an ID3 tag, this is kind of the way that this is actually done with MP3s, I think. But if I have my MP3 player and it doesn't know about the ID3 tag, then it's gonna try to interpret that data as audio samples, right? 
It's probably gonna sound really weird, right? So this essentially creates a coupling between two things that maybe I don't want to couple, right? Every application now has to be able to know about this, and who's gonna decide what this format looks like, right? And if it changes, every application that plays MP3s has to understand and react to that change, right? So this kind of, you know, it's getting a little gawky, right? All right, so in another file, again, this is also done for MP3s. Your iTunes database, for example, stores, I'm assuming, some information about how many times you've played the file, whether or not you like the song or whatever, things like this, okay? And that's stored in a separate file. So what's a pro here? Think about some of the cons that we already looked at. Yeah, Luke. So right, it's iTunes specific, right? iTunes is coupled to that file, but no other application is coupled to that file. So as long as iTunes, you know, understands how to parse that file, I'm good, right? So yeah, it can be maintained separately. And if I have five different MP3 players, then I can maintain five different per-application databases, and I'm good, and every application can set up its database how it wants, right? What's the con of this approach? Well, the information has to be somewhere, right? Well, right, so the information doesn't move with the file, and usually what's worse is that all this information gets jammed together, right? So if I give you a copy of my MP3, you don't know how many times I've played it, right? And if I wanted you to know that, which I probably don't, find out how many times I'm playing my Katy Perry's greatest hits or whatever, then I would have to embed that information into the file before sending it to you, which would require parsing this big database and pulling out just this one piece of information, right? So that's a little gawky, right? Okay, so, attributes, right? 
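To make the first two options concrete, here is a minimal sketch, not anything from the lecture slides. It embeds a title and artist using the ID3v1 layout (a fixed 128-byte block at the end of the file that starts with the bytes TAG) and keeps a play count in a separate per-application table. The helper names add_tag and read_tag, the dictionaries standing in for two machines, and the play count of 42 are all made up for illustration.

```python
import struct

# ID3v1 layout: "TAG", title(30), artist(30), album(30), year(4), comment(30), genre(1)
ID3V1 = struct.Struct("<3s30s30s30s4s30sB")   # 128 bytes in total

def add_tag(audio, title, artist):
    """Append an ID3v1-style tag to raw audio bytes (illustrative helper)."""
    tag = ID3V1.pack(b"TAG", title.encode(), artist.encode(),
                     b"", b"", b"", 255)      # 255 is used here to mean "genre not set"
    return audio + tag

def read_tag(data):
    """Return the embedded title and artist, or None if no tag is present."""
    if len(data) < ID3V1.size or data[-ID3V1.size:-ID3V1.size + 3] != b"TAG":
        return None
    _, title, artist, _, _, _, _ = ID3V1.unpack(data[-ID3V1.size:])
    return {"title": title.rstrip(b"\x00").decode(),
            "artist": artist.rstrip(b"\x00").decode()}

audio = b"\x01\x02\x03\x04"                   # stand-in for encoded audio samples

# Option 1: metadata inside the file. Option 2: metadata in a separate database.
mine = {"song.mp3": add_tag(audio, "Example Title", "Example Artist")}
my_play_counts = {"song.mp3": 42}             # hypothetical per-application database

# Giving someone a copy moves only the bytes of the file itself.
yours = {"copy.mp3": mine["song.mp3"]}
your_play_counts = {}

print(read_tag(yours["copy.mp3"]))            # the embedded tag came along with the copy
print(your_play_counts.get("copy.mp3"))       # the play count did not
print(len(yours["copy.mp3"]))                 # a player unaware of the tag sees 132 bytes of "audio"
```

Both trade-offs show up here: the tag travels with the bytes, but any program that does not know the last 128 bytes are a tag will treat them as samples, while the play count stays private to one application and gets left behind on a copy.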
So I don't have a great example for attributes because a lot of file systems don't support it. One of the reasons that got me thinking about this was, does anyone know about BeOS? So this is like kind of really part of computer systems hacker lore, right? So BeOS was this relatively famous, I mean, I think when you look at fame versus people who actually use the thing, BeOS probably beats out any other operating system because I think BeOS had like four users, right? But there's a lot of people, including you now, who know about BeOS, right? And BeOS was actually a really elegant design in some ways, and one of the things that was popular about BeOS among the four people who used it was the fact that its file system supported these extended attributes and actually made them really easy to query and store, right? So now again, let's say that the file system stores this information for me, what's the pro here? What's nice about this? Well, I can change it, I need to be able to change it, right? I mean, if I decide to fix the title of the song, then I need to be able to change it, yeah, John. Right, so there's some degree of centralization that happens, but I think that the best thing here is that this is maintained by the file system. And the BeOS file system actually had this whole database-like structure that it would set up so that again, queries across file system metadata were really fast, right? So if I wanted to, for example, find the title of all the songs in my, you know, like when you use iTunes and you type in a little query and it's supposed to filter your library, that could be done very rapidly, right? Across the file system itself, right? A con is that, like the separate database, this doesn't necessarily move along with the file, right? Now that could be a good or bad thing because another system might want to store its own metadata, right? 
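Here is a toy model of the third option, with the caveat that it is only a sketch of the idea and not how BeOS or any real file system implements it. The class AttributeStore, its method names, and the example paths are all invented. Attributes are key value pairs attached to a path, and the store keeps an index so that a query does not have to scan every file.

```python
from collections import defaultdict

class AttributeStore:
    """Toy model of file-system-maintained attributes (hypothetical interface)."""

    def __init__(self):
        self._attrs = defaultdict(dict)   # path -> {key: value}
        self._index = defaultdict(set)    # (key, value) -> set of paths

    def set_attr(self, path, key, value):
        old = self._attrs[path].get(key)
        if old is not None:
            self._index[(key, old)].discard(path)   # keep the index consistent on update
        self._attrs[path][key] = value
        self._index[(key, value)].add(path)

    def get_attr(self, path, key):
        return self._attrs.get(path, {}).get(key)

    def query(self, key, value):
        """All paths whose attribute `key` equals `value`, answered from the index."""
        return sorted(self._index.get((key, value), ()))

    def values_in(self, directory, key):
        """Attribute `key` for every file directly under `directory`."""
        prefix = directory.rstrip("/") + "/"
        return {p: a[key] for p, a in sorted(self._attrs.items())
                if p.startswith(prefix) and "/" not in p[len(prefix):] and key in a}

fs = AttributeStore()
fs.set_attr("/music/a.mp3", "title", "Song A")
fs.set_attr("/music/a.mp3", "artist", "Band X")
fs.set_attr("/music/b.mp3", "title", "Song B")
fs.set_attr("/music/b.mp3", "artist", "Band X")

by_artist = fs.query("artist", "Band X")
titles = fs.values_in("/music", "title")
fs.set_attr("/music/a.mp3", "title", "Song A (fixed)")   # fixing a title updates the index too
stale = fs.query("title", "Song A")
print(by_artist, titles, stale)
```

The file contents never change in this model, so a player that only wants audio samples is not coupled to the metadata format. But the attributes live in the store, so they only follow a file to another system if that system also knows how to carry them.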
Part of what you'll find out as we talk about file systems is that there's this tension that's been established over many years about where to store certain pieces of information. So this is a great example. It's one of the reasons why we've talked about it at length, right? But file systems, you know, on some level, you can think about a file system as just a key value store. That's all it is, right? It's a key value store that happens to be set up to store relatively big values. And yet over time, file systems have started to collect and provide access to other pieces of information. And there's always this question of, does that piece of information belong in the file system itself? And should it be maintained by the file system? Or should it be maintained separately by applications, right? Applications usually say, this is a really useful piece of information that every other application will want access to and it'll be so much faster if I put it in the file system. And the file system developers usually say, that's a completely bogus statement and you're the only person who's ever gonna use or query this attribute, so forget it, right? So this is this constant tension that comes up. All right, so, let me see how we're doing on time. So let's talk a little bit about establishing relationships between files and processes. We won't get too far into the Unix semantics; we'll finish this on Monday, all right? So there's two portions. If you think about the file system interface in Unix, right? So now we're shifting to Unix file system semantics, although some of this is common to other systems because some of these ideas are useful, right? The Unix file system interface, which you guys are busily implementing, right? Really has two, you know, two parts to it, right? One part is the mechanics of actually reading and writing to files. The other part is the process of establishing relationships between files and processes. 
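The two parts of the interface can be seen in a few lines. This is a sketch using Python's thin wrappers around the Unix calls (os.open, os.read, os.close); the temporary file and the variable names are made up for the demo.

```python
import os
import tempfile

# Set up a throwaway file to work with.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello, files")
os.close(fd)

# Part two of the interface: establish a relationship between this process and the file.
fd = os.open(path, os.O_RDONLY)

# Part one of the interface: the mechanics of reading. The descriptor remembers the position.
first = os.read(fd, 5)       # bytes 0 through 4
rest = os.read(fd, 100)      # continues where the previous read stopped

# End the relationship. After this the descriptor is no longer valid for reads.
os.close(fd)
try:
    os.read(fd, 1)
    usable_after_close = True
except OSError:
    usable_after_close = False

os.remove(path)
print(first, rest, usable_after_close)
```

The read calls never mention the file name. They only name the relationship, the descriptor, that open set up and close tore down.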
Now, you guys are, again, you're so used to this, right? But why do I even need to have calls to open and close? What's the point? I could just pass you a file name with my read. I think that sounds really elegant, right? Like I wanna read from a file, I tell the operating system here's the file name, here's the number of bytes I wanna read, here's the position, go for it. Why not do that? I mean, can we at least all agree that that's a valid alternative? Like that would work, right? Like you could do this and it would work, right? So the question is, again, why do I have these calls that establish relationships, right? Cause that's what they do. Open says, I think I might use this file, right? No guarantees, it's just a hint. And then close says, I am done using this file. Like close is actually a little bit more dramatic than open, right? There's no guarantee that by opening a file I will ever perform another operation on that file. But if I close a file, there is a guarantee that I cannot perform an operation without calling open again, okay? So again, why establish these relationships? Right, so that's great, right? So one of the reasons, and it's a really great point, I hadn't thought about it this way, but one of the reasons is multiplexing, right? So files are shared resources. The operating system is trying to multiplex them between different processes. And by a process declaring that it's going to use or stop using a file, I can attach semantics to it. Now not all systems do this, right? We'll talk about some networked file systems in particular that do not do this, right? But by saying I have this file open, it might mean that another process trying to open that file will not be allowed to. It really depends on the semantics that are supported by the call to open, right? Oh, I gave away the other one. So one of the reasons is performance related, right? 
If a call to read can come in for any file, from anyone in the system, at any time, then the process of trying to improve file system performance, you might think, might get more challenging, right? Cause now it's kind of like, I gotta be ready for anything, you know? Like the ninjas could come from any direction, right? Whereas, I'm gonna use a ninja turtles reference here. If I force processes to open files, I can fight the bad guys like they always do in the movies, namely one at a time, right? Or I shouldn't say one at a time, but in small groups, right? I can say, okay, the process can only have, you know, 32 files open, these are the only files that it can use, and any sort of optimization I'm gonna do to improve performance, you know, I only have to worry about this set of files, right? So it gives the operating system some more information about the files that are in use on the system that might be useful to improve performance, right? And then the second one, as was pointed out earlier, is that I might want to provide guarantees or exclusive access based on this relationship, so I can use this to provide certain multiplexing semantics that might be useful to processes, like exclusive access to a file, okay? And again, so some file systems, particularly, how many people like to use the shared computer accounts that you guys get through CSE, like Timberlake or whatever? Do people log into that on a regular basis? Really? I just logged into Timberlake for the first time the other day. What's that? I know, I know, oh man. I just have to be special. Well, I logged into Timberlake the other day and it was butt ass slow. So you guys should be happy that you're not. What's that? Everybody's on it. It's morning. Thankfully, I'm not on it, right? I was like, I couldn't remember. I passed it, it was terrible. I was like, oh well, okay, I guess I don't use it that often. 
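Going back to the two reasons for open and close, here is a toy model of what the operating system gains from them. It is a sketch only: the class OpenFileTable, the exclusive flag, and the process IDs are invented, and the limit of 32 is just the number used as an example above. Because every use of a file has to be declared, the table always knows the small set of files in use, and it can refuse an open that would violate exclusive access.

```python
class OpenFileTable:
    """Toy model of the per-process bookkeeping that open and close make possible."""

    MAX_OPEN_PER_PROCESS = 32   # example limit, matching the number mentioned above

    def __init__(self):
        self._open = {}         # pid -> set of paths that process has open
        self._exclusive = {}    # path -> pid holding exclusive access

    def open(self, pid, path, exclusive=False):
        held = self._open.setdefault(pid, set())
        if len(held) >= self.MAX_OPEN_PER_PROCESS:
            raise OSError("too many open files for this process")
        owner = self._exclusive.get(path)
        if owner is not None and owner != pid:
            raise PermissionError("file is held exclusively by another process")
        if exclusive:
            if any(path in paths for p, paths in self._open.items() if p != pid):
                raise PermissionError("file is already open elsewhere")
            self._exclusive[path] = pid
        held.add(path)

    def close(self, pid, path):
        self._open.get(pid, set()).discard(path)
        if self._exclusive.get(path) == pid:
            del self._exclusive[path]

    def files_in_use(self):
        """The only files the OS needs to optimize for right now."""
        return set().union(*self._open.values())

table = OpenFileTable()
table.open(1, "/data/log", exclusive=True)
try:
    table.open(2, "/data/log")
    second_open_allowed = True
except PermissionError:
    second_open_allowed = False

in_use_before = table.files_in_use()
table.close(1, "/data/log")
table.open(2, "/data/log")       # succeeds once process 1 has declared it is done
print(second_open_allowed, in_use_before, table.files_in_use())
```

Notice that the exclusive hold only goes away when process 1 calls close. If process 1 simply disappears without closing, the file stays locked in this model.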
But anyway, so some network file systems, right, that you guys are probably familiar with because I'm guessing that Timberlake probably runs NFS or something similar. Some network file systems don't bother with this relationship, right? So NFS doesn't provide any semantics associated with opening a file, right? And particularly, it doesn't provide exclusive access. So I shouldn't say that. I should say NFS pre-version four, right? So NFS is a very common network file system and until the latest, greatest version, which a lot of people still don't deploy, NFS didn't provide any guarantees about opening. So why is that? Some of you people are taking Steve's distributed systems class, so he's probably got you all paranoid about all these things that can happen over the network. So it's up on the slide, but you guys can either read the answer or tell me in your own words. What would be the problem? Let's say I grant exclusive access to a process, to a network client that opens a file. What can go wrong? No, but let's say that I can verify that the process has permissions to open the file. I grant the exclusive access and then what can happen next? What's the problem? What's that? Transaction problem. Well, transaction problem specifically. It drops about at least one. Like your computer shuts down, right? And the network file system is giving you exclusive access, but you're gone, right? Like you're off the network, you've closed your computer or something. And so in these cases, network file systems, actually some of the more mature ones, do a lot of work to try to make this actually work, right? Because it can be very difficult to provide any sort of locking for files when you have clients that could potentially die, right? All right, so on Monday, we're gonna continue talking about UNIX semantics as far as file location. 
We'll talk about hierarchical file systems and we'll start getting into how these things actually get implemented on disk in disk blocks, all right? Have a great weekend. If you're doing the hackathon, good luck and enjoy the extra week to do assignment two.