I hope you guys enjoyed your long weekend of working on assignment three. So today we're going to continue talking about files and file systems. We'll talk about files, and about some things that you guys are probably very familiar with but maybe have never thought about from a systems design perspective, particularly related to paths and how files are named. And then we'll get to the point where we start to talk a little bit about what makes file systems hard. And that will be sort of a preface to the next few lectures where we'll talk about the details of how certain file systems work. All right, so yeah, there we go. That's what I just said. So we're still working on grading the midterms. I think the TAs made quite a bit of progress over the weekend, so I suspect they will be done later this week. Once the TAs are finished, we take them over, we get them scanned. Once they're back from the scanners, you guys can come pick them up and try to get a few points back here and there or whatever. So once the midterms are available, we will let you guys know, and we'll also update the solution with statistics about the exam overall and about each question. And because the TAs are busy grading the exams, we're just doing ninjas and office hours this weekend. There's one office hour Thursday from five to six that's normally only staffed by the TAs, and so that one gets dropped when we drop the TAs from office hours. All right, I think that's it. Work on assignment three. You guys have like a monthish, maybe a month plus this week or something. So it may feel like a lot of time, but hey, go ahead and finish it and then you'll be done. And then you can brag to me or to other people about how I gave you way too much time to do the assignment. But there's no way that you can brag to me about that unless you actually do the assignment. So if you're done next week, maybe next week we'll start putting up before class anybody who's actually finished.
So a little bit of incentive for people. Go ahead and finish it and then you can just relax, right? Focus on your other less interesting courses and stuff like that, whatever. Okay, any questions about disks from last time before we do a little bit of review? So the big thing to take away from disks, and from what we'll review in a second, is simply to keep in mind some of the physical properties of disk drives that the operating system and file systems in particular have to cope with, right? Okay, so parts of the disk. Who remembers, who can define a platter? What is a platter? What's that? Yeah, it's a circular part of the disk. It rotates. Why do I care? What's important about the platter? Yeah, Isaac. Yeah, it's where the data is stored. So the platter has a magnetic medium on it. That's where the data is read from and written to. So this is how I store data permanently on an old hard disk drive. I'm actually making physical changes to the magnetic properties of the platter's surface at a particular location. What about the spindle? Part of the disk is the spindle. I guess you didn't grow up with nursery rhymes about witches and spindles and stuff. Yeah, it's the center of your spinning wheel or of your disk in this case, right? It's the drive shaft where the platters are mounted. What about the head or the heads? What part of the disk is that? Yeah, this is the sensor and actuator that, when I'm reading data, is sensing the magnetic properties of the magnetic material. And when I'm writing data, it's actually altering those properties, right? Okay, locations on the disk. And these will start to become important, particularly when we talk about the Berkeley Fast File System. So define a track. Yeah, it's like, it's one, you know, I don't know, it's a track, right? It's like in Formula One, right? It's a big, boring circular track that runs all the way around the disk at the same radius from the spindle, right?
And the reason why the track is important is because the track defines all the data that one head can read without moving. So by just waiting for the disk to spin so that other locations come under the head, without repositioning the heads, one head can read all the data on one track. What about a sector? Anybody remember what a sector is? Okay. Yeah, it's one slice of pie, right? Actually it's not a piece of the track, it's a piece of the entire disk. So imagine cutting the disk up like a pizza, right? Each one of the slices is a sector. And then the cylinder. You wanted to raise your hand, you want to give it a try? Close. So the cylinder describes a location on the disk. It does have to do with multiple platters, yeah. Right, so the way to think about a cylinder is it's all the data that all the heads can read without the entire arm being repositioned. So it's all the equivalent tracks on every platter, vertically through the disk. So that's all the data that could be read or written from the entire drive without moving the disk arm, okay? So we talked about some of the differences in disks that give rise to file system design. So what's the difference in kind? There's a way that old hard drives are unlike every other part of the system. Yeah, they move, yeah. So old hard disk drives moved, they always moved. Even those old ones that you put those big disks into, right, those big floppy disks, those got spun around internally, right? So the internal hard drives have a similar thing like that. So they move, right, parts of the disk move, and nothing else in your computer moves. Difference in degree: disks are very slow, partly as a function of the first difference. So when we start to look at how to make disks fast, making sure that the entire system isn't as slow as the disk is really important, because the disks are quite slow. And what's the difference in integration? Yeah. It's not quite it.
So integrated into the overall system in a different way: what can you do potentially to a disk on your computer that you cannot do to memory or the CPU? I've called on Isaac too many times, yeah. I can, well, unplug it, plug and play, right, I can remove it. And I can potentially remove it in the middle of operation, which of course, as you would imagine, creates some problems with trying to make sure that the data on the disk is in a reliable state. But disks are integrated as devices, whereas memory and CPU are sort of core parts of the system itself. And so the way that we build file systems is that the disk itself provides a very simple interface, allowing you to read and write blocks of data. And then everything else on top of that is software. And we'll start to talk a little bit about some of the ways that software works today. Okay, so sources of slowness when I'm accessing the disk. So what does it mean to issue the command? What does this involve? This is the first step, yeah. Yeah, telling the disk. Remember, the disk interface is pretty much read or write a block of data, where a block could be 256, 512, 1024, some number of bytes of data. That's the interface that the disk exposes. And part of that is a location, so read this block, where I have numerically ordered blocks. You can think of that as being the disk interface. So the first thing I have to do is tell the disk what to do. And that takes a little bit of time to get out to the disk itself. Then what's the next thing that happens? Yeah, yeah, so I actually have to move the arm to the appropriate place on the disk at some point. And somebody last time asked about what are called disk scheduling algorithms. So how does the disk figure out what to do in what order? And I realized that that's not something I cover in this class.
I'll be happy to talk about it on Piazza if the person who had that question or anybody else wants to know about it. But we don't really talk about that. It's not super interesting, but it does happen, right? So the disk at any point in time may have multiple requests from the operating system that it knows about and is trying to handle. And for how it moves the arm between those requests in order to try to minimize the seek time, there are algorithms for doing this, right? So now I have to move the heads to the right point. What is the settle time? Yeah, yeah, I have to wait for the head to stop vibrating or stop shaking, or really for the head to be centered on the track, so I'm certain that I'm reading the right information. Then I have to wait for the data to actually rotate under the heads, because who knows where it is by the time that I've actually moved the heads to this point. And finally, I need to read the data, probably initially into an internal buffer on the disk, and then transmit that data back to the operating system. So these are the steps here. What's the slow part? What are the slow steps? Seek and settle. Everything else I can optimize through just better electronics, right? Wider buses, higher clock rate buses, et cetera. But moving things back and forth is always going to be a source of slowness with physical drives. And we pointed out that there was a lot of consternation at a certain point in the systems community because we were shipping machines that had much bigger disks, but those disks weren't getting proportionally faster. So we were giving people more and more storage, and that storage was getting slower and slower. I should be careful about this: slower proportionally to the rest of the system. So clock speeds are skyrocketing for processors and the disk is sort of poking along. It's not getting slower.
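To put rough numbers on the steps just listed (issue the command, seek, settle, wait for rotation, transfer), here is a minimal sketch of a back-of-the-envelope access-time model. The function name `access_time_ms` and every figure passed to it are made-up assumptions for illustration, not measurements of any real drive; the only fixed relationship is that, on average, the data is half a rotation away from the head.

```python
def access_time_ms(seek_ms, settle_ms, rpm, transfer_mb_per_s, request_kb,
                   command_ms=0.1):
    """Break one random disk request into its steps, in milliseconds.

    All inputs are hypothetical illustration values supplied by the caller.
    """
    # One full rotation takes 60,000 ms divided by rotations per minute;
    # on average the target is half a rotation away from the head.
    rotation_ms = (60_000.0 / rpm) / 2.0
    # Time to stream the requested bytes once they are under the head.
    transfer_ms = (request_kb / 1024.0) / transfer_mb_per_s * 1000.0
    return {
        "command": command_ms,
        "seek": seek_ms,
        "settle": settle_ms,
        "rotation": rotation_ms,
        "transfer": transfer_ms,
    }


# Made-up numbers for a spinning disk reading a single 4 KB block.
parts = access_time_ms(seek_ms=8.0, settle_ms=1.0, rpm=7200,
                       transfer_mb_per_s=100.0, request_kb=4)
total = sum(parts.values())
for step, ms in parts.items():
    print(f"{step:>8}: {ms:7.3f} ms ({100 * ms / total:4.1f}%)")
```

With inputs like these, the steps that involve physically moving or waiting on hardware dominate the total, while the command and transfer steps, the ones better electronics can speed up, are a small fraction.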
The disk is getting faster through better engineering, but it's not getting faster as fast as other parts of the system. And so ratios between the disk performance and other parts of the system are changing. All right. Any other questions about disks? OK. So on top of this sort of weird, physical, potentially problematic device, we are supposed to build stable, reliable storage. And that starts out with files. And today I want to talk a little bit about files. So of all the parts of the system that you guys use on a daily basis and feel some degree of familiarity with, I would argue that the file is probably the one that's at the top. So before you took this class, maybe you guys knew that there was some sort of thread scheduling going on. And you knew a little bit about some of the other abstractions. But this is something that you guys use on a daily basis. You create files, you move them from place to place. And so you feel like you know something about them. And to some degree, that's true. But there are some design choices that you guys have internalized here that I want to identify so that we can talk about how file systems provide some of those properties. So hopefully this isn't too super boring. I promise once we get to actually talking about how file systems are designed and implemented, it'll be a little bit more interesting. I just need to make sure I go over this because, again, just because it's obvious to you doesn't mean that you have seen some of the internals here. OK. So today we'll talk a little bit about what we expect from files. So what the file abstraction needs to provide. We'll talk about ways in which we associate other types of useful information with files. Because frequently, files go hand in hand with other information that we need to know in order to make the file contents themselves useful. So we'll do a little design exercise when we talk about how this happens.
We'll talk a little bit about the semantics associated with the file system interface. So at this point, you guys are pretty familiar with it, having implemented some of these system calls. And so we won't spend too much time here. But I just want to sort of remind you about part of what the file system interface means. And we'll talk a little bit about why certain calls exist. So for example, why do I need to open a file in the first place? Why can't I just read from a path? Why do I open and close files? Why do those calls happen? And finally, we'll start talking a little bit about one of the things that file systems do, which is organize multiple files in an effective way. So this is part of the file system design requirements: laying out files on disk so that I can optimize access to them. All right, so at the minimum, what is a file? What are the two things that a file has to do in order to make it useful to you or to an app or to anyone? There are two properties I would argue that a file has to satisfy. Yeah. What's that? Well, OK, I don't know about encode, but it has to store stuff, right? We expect files to be a container for some sort of arbitrary information that we can put into them. And we expect them to preserve their contents. So we expect their contents to change in a predictable way. What's the other thing? Well, user permissions are sort of a feature here, I would argue. There's one other thing about a file. Yeah. Yeah, you better be able to find that file, right? If I gave you this nice abstraction that stored data, but once you store data in there, you can never find it again, it's not super useful, right? So I need some way to name the file so that I can find it later, and I want to be able to name that file in a meaningful way, right? And how I name files has a lot of implications, right? So I'm gonna date myself here, but how many people were around sort of in the days of CompuServe?
Do you remember CompuServe email addresses? There we go. I can tell the people who are old in the room other than me. So I'm pretty sure it was CompuServe, I may be wrong, but one of the earlier email providers decided to have email addresses where, if you looked at a CompuServe email address, it was this string of digits, right? Because I guess somebody at the company thought, well, hey, I mean, people have phone numbers, right? A phone number is this sort of meaningless 10-digit string that you are forced to use by, I don't know, Sprint or your local phone company. So why not just give them an email address that looks like a phone number, right? Just 10 digits. 10 randomly assigned digits from CompuServe, right? How popular do you think those email addresses were? I would rather have first name, dot last name, dot, you know, woof, right, at gmail.com than some meaningless string of numbers, right? So how you name things matters. I think there's some famous computer systems researcher who has identified naming as one of the canonical problems that's faced by every computer system, right? So how you do this matters. And we'll talk about some of the earlier ways to name files that didn't work very well in a few minutes. Okay. So when it comes to how file contents change, we expect them to change predictably. So when we change the file contents, we expect them to change in the way that we've asked, and we expect them to not change at other times. So if I haven't made modifications to a file, I expect the file to have the contents that I wanted, right? This seems simple, but this continues today. I mean, you can find more recent examples of this. There are plenty of examples of modern file system designs that don't meet this requirement. So you might think, eh, you know, like spinning disks have been around for a thousand years.
You know, Jeff already said that all these smart people worked on them because it was a really hard problem. And so it's 2015, and you know, like this must be a solved problem, but it's not, actually. And we'll talk a little bit about why this is so hard later. But there are still file systems out there that will lose data, right? Or where data will change unpredictably, or where files will vanish, never to be found again. So hopefully this gives you some sense that this is actually a really hard problem. Not that there were dumb people that worked on it, right? The fact that we still have file system bugs after 30 or 40 years of work on them is just a sign that this is challenging, right? And there are some cool things to learn. All right, so why does this happen? Why do we have failures? What are some of the types of things that file system designers stay up at night thinking about? Power outages, right? So this is one thing that can put your system into some sort of bad state, right? So you trip over the plug to your desktop, right? Suddenly your server goes down without warning. The UPS that you bought for $1,000 doesn't actually work. Maybe you never turned it on properly or whatever. And so the machine suddenly loses power. With newer machines, it's probably also true. I mean, how many of you have seen those very angry error messages that you get from most systems when you unplug something improperly? You know, Mac throws up this really scary message. Don't do that ever again. You need to tell me when you're about to unplug something. But why, right? Why do you need to do that? It's because the system is not really eager for you to suddenly do something to it that might cause the on-disk state to be corrupt. Has anyone ever corrupted a drive that way? Just curious. Really? Okay. It didn't work afterwards, okay.
I wanna try that sometimes, just like, you know, have some IO going to the drive and just unplug it repeatedly over and over again and see if I can actually get it to fail, right? It will, right? I mean, it's guaranteed at some point you will catch it in the bad state, but it's never happened to me before. So those error messages are always a little funny. What's that? And part of what we'll talk about later too that part of what makes this challenging is the fact that because the, so what is one of our major solutions to making slow things look fast? Use a cache. So what am I gonna use to cache the contents of the file system? What's the smaller, faster thing I have available? Memory. So what's wrong with using memory to cache the contents of the file system? What's that? No, the speed gap is great. That's what I want. Yeah. Yeah, what happens to the contents of memory when I unplug your machine unexpectedly? Boom, gone. And so that's actually one of the things that makes this hard, right? Is I want to use memory as a cache because it's fast, but the fact that that memory can suddenly cease to exist is a problem. Okay. So let's talk a little bit about file metadata. So one of the things that you guys probably haven't thought about when you use files is other information about the file that the file system might store or that you might want to know. So give me some examples of file system metadata. So when we describe metadata, we're talking about things that don't necessarily belong in the contents of the file itself, but are useful information that you might want to know about the file. Yeah. Yeah, when the file was created, when it was modified last, things like this. So there's a whole series of timestamps associated with files. What else? Yeah. What's that? Size, yeah. Yeah, very, yeah. Before you transfer that file over the network or try to post that picture to Facebook, it's helpful to know how large it is. What else? Yeah. Permissions. 
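The general metadata named so far (timestamps, size, and permissions) is what the `stat` system call reports on Unix-like systems. Here is a minimal sketch using Python's `os.stat`; the helper name `describe` and the temporary demo file are assumptions made for illustration. Only the modification and access times are shown, because what counts as a "creation" timestamp differs between platforms.

```python
import os
import stat
import tempfile
import time


def describe(path):
    """Collect a few pieces of general file metadata for one path."""
    st = os.stat(path)
    return {
        "size_bytes": st.st_size,                  # how large the file is
        "modified": time.ctime(st.st_mtime),       # last content change
        "accessed": time.ctime(st.st_atime),       # last access
        "mode": stat.filemode(st.st_mode),         # e.g. "-rw-------"
        "owner_uid": st.st_uid,                    # who owns the file
    }


# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello, file system\n")
    demo_path = f.name

info = describe(demo_path)
for key, value in info.items():
    print(f"{key}: {value}")

os.remove(demo_path)
```

None of these values live in the file's contents; the file system keeps them alongside the data and hands them back on request.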
So normally file systems provide a permission mechanism that allows you to control who can alter, view, or move things, who can do what to a particular file. Okay, so you guys have come up with some good ones, right? What about other file attributes that are a little bit more specific to how the file is being used? So these are all very nice general attributes, but what about other things? Okay, so that's another good example of information that the operating system might want to store about general files, right? But let's say I asked you for metadata about a particular type of file, right? Maybe metadata about a movie. What else might you want to know about files that are videos? Yeah, what's that? Okay, so yeah, actually that's a great example. Thumbnails, right? Thumbnails are weird, right? I mean, technically thumbnails are metadata. Frequently they're stored as separate files, which is a little weird and kind of confusing. What else might I want to know about a video? Yeah? Yeah, something about the format, how has it been encoded? What else? Get somebody over here. Okay, so that's another great example of general file metadata. How do I open this file? That's typically stored outside of the file, right, by the operating system: if I just double click on the file in my file folder, what application is launched to try to open the file? But what else? Come on, movies have other attributes, yeah. What's that? The bit rate, right? How high quality is this file that I downloaded, this completely legal file that I downloaded from BitTorrent, right? Is it really the 1080p that the BitTorrent file said it was, right? What else? Movies have names, right? Like the title, right? What is this movie, right? Who are the other people associated with it, right? Things like this. So there's a lot of potentially type-specific information, right? So as another example, with an MP3, you could think about the title, artist, the date, right?
The date on which it was recorded, you know, the label that it's on, a link to the artist's website, blah, blah, blah, right? So there's all this sort of information. And for a lot of the stuff that you guys came up with first, you would argue that the file system should store it for all files because it's generally useful, but this information is much more file-specific, right? And so where do we put this kind of type-specific information? So what are some of the options? One place is I can put it in the file, right? So clearly I can jam this information into the file itself, but where else could this information go? Yeah, here we go. So Windows user, right? In the registry, right? In some sort of registry, right? So I'll generalize that as saying into some other file. And then there are also file system designs that include support for what are called attributes, right? So general attributes, and you can think of attributes as just arbitrary key value pairs that I can associate with a particular file. The nice thing about attributes is they're structured in a way that I can run things like queries on, right? And there are file systems that supported this feature, and there probably are still file systems that support this feature. So what are some of the trade-offs here? So if I put it in the file, what's an example of this for MP3s? There is a specific way of doing this that was directly designed for MP3s. Maybe you guys don't use MP3s anymore, you know? It's not cool. Yeah, so there's something called an MP3 ID3 tag. I don't know why it's ID3, maybe ID2 and ID1 didn't work out so well, but anyway, it's ID version three. And this is information that I think is in some sort of header in the file itself. Well, it's jammed right into the file.
Now, what's a nice thing about this is that at this point, if you take that file and put it anywhere, if you put it on a USB stick and move it to another one of your machines, that metadata follows the file around. What's a con with this approach? If I put it in the file, what does that require? What's that? Okay, the file size is getting a little bit bigger, but you're gonna assume that this information takes up some amount of space regardless of where I put it. So what's another problem with this? Yeah. So the file system's not aware of it, that's true. So it's possible that maybe I'd have a little bit more of a powerful interface if the file system knew about this stuff, but what's another problem? Let's say tomorrow the MP3 consortium or whoever it is that makes these decisions comes out with ID4, right? The next version of ID3. What potentially is that gonna cause to happen? And let's say that I'm a maintainer of a tool that plays MP3s, because certainly there aren't enough already, but mine is written entirely in Perl. So what do I need to do now? Oh, forget backwards compatibility, right? The problem is I need to go change something. So the users of the file all have to be aware of this metadata. I don't know what would happen if you tried to play an MP3 file and your player just tried to play the ID3 tag from the file. It would probably sound like a little bit of static at the beginning of the file, but that would be annoying, if every time you switch files there's this little bit of static, and simultaneously your MP3 player didn't know stuff that the other MP3 players knew because it didn't interpret the tag. So what about in another file? So who can give me an example of this? Who uses any sort of modern media program? How many people use iTunes? Really? Oh, okay, so you have Spotify people or whatever. I have no idea how Spotify works, right? Sure, it's magical.
Anyway, so let's say you use iTunes, like these old, kludgy programs that your parents still use for listening to MP3s, right? So iTunes has a bunch of dedicated files that it uses to store file metadata. And you can have a lot of fun with iTunes by moving these files around or corrupting them or doing other things, at which point iTunes gets totally confused about what's going on and sort of tries to re-import things or just gives up and dies or whatever, right? You know, if you go into those directories you're not supposed to go into on the Mac and just sort of start playing with things. Sometimes that's the only way to get iTunes to start to behave again, so anyway. But the point is that iTunes stores this information somewhere. It's in a separate file. It's not stored in the file system itself. And to some degree, this is nice, right? Because each application can now have its own metadata that it associates with the file. So iTunes, for example, seems to have or maybe has access to some proprietary database of album artwork, which it downloads and stores with the file. That's kind of nice. iTunes may have access to other information about the file, how to buy it in iTunes or other sorts of metadata that it can store separately. What's the con? Going back to the example of the ID3 tag. What's the con here? What's the drawback? What does not happen when I move these files from place to place? Yeah? Yeah, so the problem is that the metadata is now in a separate file, which has to be sort of synced independently. It also doesn't follow the file around, which may or may not be a good thing. The other advantage of this approach is that the file that stores the metadata is potentially accessed a lot more frequently than the other files. And so you may be able to do better file system caching by organizing all your access to that metadata into one file rather than spreading it among a bunch of different files.
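The trade-off between the first two options, metadata inside the file versus metadata in some other file, can be sketched in a few lines. The embedded half uses the ID3v1 layout, a fixed 128-byte record beginning with `TAG` that is appended to the end of the file (the later ID3v2 format instead puts a header at the front). The sidecar half is a hypothetical JSON "library" keyed by path; its format and all helper names are assumptions made for illustration, not how iTunes actually stores anything.

```python
import json
import os
import shutil
import struct
import tempfile

# ID3v1 record: "TAG", title, artist, album, year, comment, genre = 128 bytes.
ID3V1 = struct.Struct("3s30s30s30s4s30sB")


def write_embedded(path, title, artist):
    """Append an ID3v1-style tag to the end of the file itself."""
    tag = ID3V1.pack(b"TAG", title.encode("latin-1"),
                     artist.encode("latin-1"), b"", b"", b"", 255)
    with open(path, "ab") as f:
        f.write(tag)


def read_embedded(path):
    """Return the embedded title and artist, or None if there is no tag."""
    if os.path.getsize(path) < ID3V1.size:
        return None
    with open(path, "rb") as f:
        f.seek(-ID3V1.size, os.SEEK_END)
        marker, title, artist, *_ = ID3V1.unpack(f.read(ID3V1.size))
    if marker != b"TAG":
        return None
    return {"title": title.rstrip(b"\x00").decode("latin-1"),
            "artist": artist.rstrip(b"\x00").decode("latin-1")}


def write_sidecar(library_path, song_path, metadata):
    """Record metadata in a separate (hypothetical) library file."""
    library = {}
    if os.path.exists(library_path):
        with open(library_path) as f:
            library = json.load(f)
    library[song_path] = metadata
    with open(library_path, "w") as f:
        json.dump(library, f)


def read_sidecar(library_path, song_path):
    """Look a song up by path in the separate library file."""
    with open(library_path) as f:
        return json.load(f).get(song_path)


workdir = tempfile.mkdtemp()
song = os.path.join(workdir, "song.mp3")
library = os.path.join(workdir, "library.json")
with open(song, "wb") as f:
    f.write(b"\x00" * 256)          # stand-in bytes, not real audio

meta = {"title": "Example Song", "artist": "Example Artist"}
write_embedded(song, **meta)
write_sidecar(library, song, meta)

# Move the file, as if copying it to a USB stick.
moved = os.path.join(workdir, "usb_stick_song.mp3")
shutil.move(song, moved)

embedded_after_move = read_embedded(moved)
sidecar_after_move = read_sidecar(library, moved)
print("embedded:", embedded_after_move)
print("sidecar: ", sidecar_after_move)

shutil.rmtree(workdir)
```

After the move, the embedded tag is still readable from the file, while the sidecar lookup finds nothing because the library only knows the old path. In exchange, the sidecar approach never required the reader to understand the tag format inside the audio file.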
So the last option here is to put it in attributes associated with the file. And I only bring this up because this was a really elegant design idea that came out of an early system that was called BeOS. So if you go back in time, about 30 years, well, actually, I'm not that old. So maybe if you go back in time 15 years, shall we say, there were all of these cool, weird operating systems out there. And some of them are still around. Has anyone ever tried to use Plan 9? That's got to broaden your horizons a little bit, especially now that you have VirtualBox. I mean, who cares? You don't need to install it on your main machine. Just get VirtualBox up and running. You can play with these weird old research operating systems. Some of them are really cool. So BeOS had all these really neat features. For example, I don't know why they optimized for this, but on BeOS, you could play six movies simultaneously, really smoothly. You could not do this on any other system. You try doing this on Ubuntu and it would get all weird and laggy. But on BeOS, it was like, I can watch seven movies at the same time. Again, I have no idea why this was a use case that they sort of promoted, because I don't know anyone who does that, right? Except maybe if you're like an octopus and you have separate brains and they can watch a lot of movies. Anyway, one of the things that BeOS introduced was this idea that I could associate attributes with the file. That means that the file system can cache those attributes and potentially improve access to them. It also means I can provide common interfaces to querying those attributes, almost in sort of a database-like fashion, right? So for example, if you use a tool like iTunes that organizes information about music and you run searches, those searches are actually being done by iTunes internally using its own metadata.
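The idea of attributes as queryable key-value pairs attached to files can be sketched with a toy in-memory model. This is purely illustrative: the `AttributeStore` class, its method names, and the sample paths are invented here, and nothing below reflects the actual BeOS interface or how any real file system stores attributes on disk.

```python
class AttributeStore:
    """Toy stand-in for a file system that keeps attributes per file."""

    def __init__(self):
        self._attrs = {}  # path -> {attribute name: value}

    def set_attr(self, path, key, value):
        """Attach one key-value pair to a file."""
        self._attrs.setdefault(path, {})[key] = value

    def get_attr(self, path, key):
        """Return one attribute of a file, or None if it is not set."""
        return self._attrs.get(path, {}).get(key)

    def query(self, **conditions):
        """Return the sorted paths whose attributes match every condition."""
        return sorted(
            path for path, attrs in self._attrs.items()
            if all(attrs.get(k) == v for k, v in conditions.items())
        )


fs = AttributeStore()
fs.set_attr("/music/a.mp3", "artist", "Example Artist")
fs.set_attr("/music/a.mp3", "year", 2015)
fs.set_attr("/music/b.mp3", "artist", "Example Artist")
fs.set_attr("/music/b.mp3", "year", 1999)
fs.set_attr("/video/c.mp4", "artist", "Someone Else")

same_artist = fs.query(artist="Example Artist")
recent = fs.query(artist="Example Artist", year=2015)
print(same_artist)
print(recent)
```

The point of the sketch is only that the search runs against one shared store rather than inside each application's private metadata.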
If you had something like BFS, you could actually run those searches right on top of the file system itself. It's kind of cool. And of course, the problem with this is that this also doesn't necessarily move with the file, particularly if you move the file between different file systems. So this is now really tied to the file system that you use. All right, so hopefully that gives you a little bit of a taste of some of the design decisions that file system developers have to make. So now let's talk a little bit about the file system semantics that the Unix file system introduces, yeah. So, okay, it's a great question. How are file system attributes different than storing it in another file? There's no other file, right? The attributes are stored internally by the file system. So somewhere on disk there is your data and then there is other metadata associated with the file. For example, the things that we've talked about previously like the timestamps and ownership permissions and things like that. And BFS allows you to add information to that area. So there's no separate file. The file system will just allow you to tell it things about the file that it will then remember, and give you some nice interfaces for access. Does that make sense? All right, cool. Okay, so one thing I wanna point out here, and we've talked about read, write, lseek and all these things, so I'm not gonna linger here. But one thing I did wanna point out is the semantics of open and close. Because this is sort of important when we start to think about some of the optimizations that file systems can do. And again, you might think, well, other than establishing sort of a symbolic name for the file, why do I use open and close? Why do I have these ways of allowing the process to tell me I'm using this file right now and then I'm done using the file? You don't really have to call close at all, right? Until you run out of file handles, which many applications may not.
You do have to call open, of course, because the interface sort of forces you to. So the question is what's the benefit of having this information being recorded, right? Why does it help the file system to know that a particular process is potentially in the process of using a file? Why would a file system want to establish these type of relationships? Why do I want this other little piece of information? It's clearly not necessary, right? Hopefully, my midterm question convinced you of that. There is no need to call open. I could just pass a path name to every read and write that I perform and that would work fine. Why would a file system want to know something? What might this help me optimize or how might this help the file system improve performance? Yeah. So that's a good point, right? So close is kind of helpful, right? Because close indicates a point at which the process is telling the file system and the OS, I'm done. I'm not using this file anymore. And so the degree that I'm doing any caching or other things, that gives me a moment where I can tear down some of those data structures, make sure that the on-disk contents are synchronized and do other things, right? So that's fair. But why open? I guess if you have close, you have to have open. That's not a very good answer, right? Why open? And again, why create, you know, why force the process to express this temporal period of time when it's going to be using a file? What might open allow me to do? Yeah, I've, yeah, how about that, right? Open gives me a hint that certain files are in use. If I look across the entire system and I look at all the files that different processes have open, that's a group of files that the file system may want to perform special performance optimizations for. Because those are, by definition, the only files that can actually be used by processes using read and write. 
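To make the "just pass a path name to every read" alternative concrete, here is a minimal sketch that compares it with the usual open, read, close sequence. The `read_at` helper is a hypothetical stand-in for such a stateless interface (underneath it still has to open the file each time, since that is what the real system calls require), and the chunk size and demo file are arbitrary choices.

```python
import os
import tempfile


def read_at(path, offset, length):
    """Hypothetical stateless read: name the file and position every time."""
    fd = os.open(path, os.O_RDONLY)      # path is resolved on every call
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        return os.read(fd, length)
    finally:
        os.close(fd)


def read_all_stateless(path, chunk):
    """Read a whole file with the stateless call; the caller tracks the offset."""
    pieces, offset = [], 0
    while True:
        data = read_at(path, offset, chunk)
        if not data:
            break
        pieces.append(data)
        offset += len(data)
    return b"".join(pieces)


def read_all_with_open(path, chunk):
    """Read a whole file the usual way; the system tracks the offset."""
    fd = os.open(path, os.O_RDONLY)      # path is resolved once
    try:
        pieces = []
        while True:
            data = os.read(fd, chunk)    # continues where the last read ended
            if not data:
                break
            pieces.append(data)
        return b"".join(pieces)
    finally:
        os.close(fd)


payload = bytes(range(256)) * 10
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    demo_path = f.name

stateless = read_all_stateless(demo_path, chunk=100)
with_open = read_all_with_open(demo_path, chunk=100)
print(stateless == with_open == payload)

os.remove(demo_path)
```

Both versions return the same bytes, which is the sense in which open is not strictly necessary. The difference is what the system gets to know: in the second version there is a window between open and close during which the file is known to be in use.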
So in order to use read and write, being forced to call open gives the file system a nice hint about what files are likely to be in use, right? And I could certainly use that to do caching, right? I could also use it to do read ahead. So when I see an open for certain files, if I have a bit of idleness after the open, like let's say for an MP3 file or a movie file, what might I want to do right after open? I did all the work to get my disk head over to the file in order to perform the open. While I'm there, what else might I want to do? It's an MP3 file. Yeah, like grab a piece of it, right? Because what's the next thing that's gonna happen? I mean, maybe you're gonna skip ahead to your favorite 30 seconds of the song, but it's unlikely. Most of the time you're gonna play the song from the beginning. So I can use this to do some read ahead. I also need this to provide certain types of semantic guarantees to processes. So there's a way to use files to enforce inter-process locks. And in order to do that, I need to have the concept of exclusive access. So I need a process to be able to open a file in a way that ensures that nobody else has it open. So that property I also need open to provide. Now on the other hand, if you're taking Steve's class or you learned a little bit about networks, a lot of network file systems don't even bother with the idea of open, simply because what's the point? There's no point tracking these sorts of relationships if there's a high probability that your client is gonna die or just get disconnected from the network or other things. And so early versions of NFS didn't have any concept of open at all. And if you looked at how open was implemented, it was just a no-op. It didn't do anything. Okay, so let's, okay, I'm gonna skip this. We know how this works. Got this, not gonna go through this. You guys remember this, right? We did this like a month ago, right?
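To make the point about using files for inter-process locks a little more concrete, here is a minimal sketch of one common idiom: a lock file created with O_CREAT and O_EXCL, so that the create fails if the file already exists. This is an illustration, not necessarily the exact mechanism the lecture has in mind, and the names try_lock, unlock, and demo.lock are made up for the example.

```python
import os
import tempfile


def try_lock(path):
    """Try to take the lock by creating `path` exclusively.

    Returns True if this caller created the file (and so holds the lock),
    False if the file already existed (someone else holds it).
    """
    try:
        # O_CREAT | O_EXCL: the open fails if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def unlock(path):
    """Release the lock by removing the lock file."""
    os.remove(path)


# Hypothetical lock file in a fresh temporary directory.
lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")

first = try_lock(lock_path)   # nobody holds the lock yet
second = try_lock(lock_path)  # lock file already exists
unlock(lock_path)
third = try_lock(lock_path)   # lock is free again
unlock(lock_path)
```

The existence check and the creation happen as a single step inside open, which is what makes this usable as a lock. A check-then-create done as two separate calls would leave a window for another process to slip in between them.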
This is the file system interface, which this year you guys have finished implementing, so you know how it works, right? Any questions about this? Okay, so let's talk about names and the process of naming files. So in order to identify the contents that I've stored in a particular file, it has to have a name. And if I want to have more than one file, then those names need to be distinct. And so what's the simplest way of doing this? The simplest possible way. Just forget all of the things you guys have learned and absorbed from 20 years of using a particular type of file system, yeah. I like that, random, right? That's even worse than what I'm about to talk about, yeah. But imagine every time you used an application, you went to the save dialog. It didn't actually ask you for anything. It just put up a number, right? It was like, here's the number of this file, right? Write this down if you ever want to find the file again, right? And you got to keep your own little list of like, okay, so my final version of assignment three that's gonna score me 100 out of 100 is this long number, and you'd hope that the machine didn't crash while you were writing it down or you didn't write it down wrong, because then you were pulling up some other thing that was broken anyway, yeah. So you could do that, right? That would work. That would be terrible. No, I don't think you'd have many users of your file system unless there are people that are sort of intentionally computer masochists or something. There probably are people out there like that. Anyway, so okay, random is one way, but what else? What's a slightly more user-friendly way? Yeah. Position on the disk, wow, okay, we're going backwards, right? That is definitely less user-friendly, right? And it's worse than that because, as we'll talk about, the files aren't stored in a particular position on the disk. They're stored in lots of places and so you'd have to keep track of all that.
At that point, you're playing file system, right? You don't want to do that, right? That would be bad. So I could have one number, I could have a list of multiple numbers. Can we go back towards things that humans like? Yeah, okay, forget numbers. The number thing's played out, yeah. Sequence numbers, yeah, yeah. A timestamp, ooh, it's even better. Still a number, technically. A username. So you only get to have one file, right? Okay, I would say that's close enough, right? So here's an example, where I can have what's called a flat namespace. Every file has a name. That name is a sequence of characters. If I was using a very early system, I might have some limit on how many characters I could use, which would make this even more interesting. But let's say I don't. Let's say it's just an arbitrary sequence of characters. And so here's my first file. Here's my second file. Here's my third file. Here's my fourth file. Here's my fifth file. And there were early file systems where every user had a flat namespace. So imagine you're on Timberlake and you can't create a directory. You have one directory and you have to store all your files in there. People probably came up with some really cool creative naming strategies. But actually, a lot of them probably didn't. They just had a mess. So this was bad. We didn't like this. And again, like I said, you guys are so used to this stuff, right? You can't even imagine this being different, right? Maybe the next virtual box that I put out for this class will have a flat namespace until you get done with assignment zero or something, at which point you can unlock new features like hierarchical directories. So at some point people started to say, okay, there's this nice concept of having a hierarchical directory. But really think about what this is. These are different contexts that I've created on the disk.
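As a rough illustration of the difference between a flat namespace and directories as separate contexts, here is a small in-memory sketch. The class names, the directory names /work and /home, and the file name notes.txt are all made up for the example, and this is a toy model rather than how any real file system stores names.

```python
class FlatNamespace:
    """One global table: every file name must be unique system-wide."""

    def __init__(self):
        self.files = {}

    def create(self, name, data):
        if name in self.files:
            raise FileExistsError(name)
        self.files[name] = data


class DirectoryNamespace:
    """Each directory is its own context with its own table of names."""

    def __init__(self):
        self.dirs = {}

    def create(self, directory, name, data):
        entries = self.dirs.setdefault(directory, {})
        if name in entries:
            raise FileExistsError(directory + "/" + name)
        entries[name] = data

    def list(self, directory):
        # Only the names in this one context are visible.
        return sorted(self.dirs.get(directory, {}))


flat = FlatNamespace()
flat.create("notes.txt", b"first")
try:
    flat.create("notes.txt", b"second")
    flat_collision = False
except FileExistsError:
    flat_collision = True

tree = DirectoryNamespace()
tree.create("/work", "notes.txt", b"first")
tree.create("/home", "notes.txt", b"second")  # same name, different context
listing = tree.list("/work")
```

In the flat version the second notes.txt is rejected, while in the directory version both exist and a listing of /work shows only the entries in that one context.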
And then of course, what's the really nice thing about hierarchical directories, about directories in general? What's the best thing about directories? What's that? They're organized. That's not always true. Okay, I can certainly have disorganized directories. Yeah, that's okay. So I can have multiple files with the same name. It turns out in Microsoft Excel, you can't open those files at the same time. Little known fact, which is really weird, but yeah, I can have the same file name in multiple directories, which is cool. But fundamentally, when you guys use the user interface, imagine if every time you typed ls, you saw every file on the system. Imagine if you had one folder window and you were trying to navigate and that folder window had every file on your system. So the real nice thing about directories is they reduce cognitive load because they allow you to see only a portion of the files in the system at a time. That's really the nice thing about them. Your system probably has tens of thousands of files. Run find when you get home later and see how many files your system has. It's probably at least five figures, if not six. And they're all over the place. Some of them, you will never open and you'll never even go anywhere close to that directory. But the nice thing about having directories is I can focus on the ones that are important to me at any particular time and potentially, if I'm good at it, I can organize them so that I see the files that are important. So here's one way of organizing my collection of letters, and people can use this to organize things in ways that they want. Ideally, every file on the system has a name, a single name. Although we'll talk later about nice file system features that relax this requirement, and they're fairly straightforward to implement. But the idea is every file still has a unique name. That name now just has other semantics to it. What's that? Oh, I had two. Oh man, that's terrible. Yeah.
Clearly there's a bug on the slide. Yeah, that should be three. Well, that would be the same letter. Okay, so now that I've introduced this idea of having location and having context, I have to figure out a way to navigate, right? And this introduces this whole idea of being able to move around the file system, go up, down, sort of in and out. And again, the location is sort of inherently tied to the name, right? The name describes the location within this hierarchy. And the structure that you guys are navigating, when you navigate your hierarchical file system, is something that's known as a tree, right? It has a root and it has leaves. And so now we are gonna do this incredibly interesting exercise where we think about why this is the data structure that's been used, right? So I pick a single root, and the canonical name of the file is expressed by locating the file along the shortest path from that root. But why do I need to do this? Remember, the goal is that each file has a single name. So imagine I have this file, right? This is my file and this is my directory structure, right? So this directory has two entries in it, used and me. This directory has to and you. This directory has used and love, and this directory has well, me and to. So what is the name of this file? Does the file have a name? Does this file have a unique name given this file system? Well, no, because I could have well over here too, right? That would be legit, right? Okay, so again, now I picked a root, right? What's the name of the file now? So before I didn't have a root, right? So without a root, I don't even know where to start, right? I can start here and I can call the file to used to you love me, love well, right? Or I could start here and I could call it you used to love well, right? Without a root, I can't even get started, okay? But now I have a root, so can I solve this problem now? Why?
What's the name of the file, right? So it could be used to love well or me love well. Both of those get me to the file. Or I could just spin, right? In here like three or four times and then come out, because why not, right? So what I really need is not only a graph but a graph that doesn't have any cycles in it. So now what's the name of the file? Used to love well or you used to love well, right? So now that I have a root and a graph that doesn't have any cycles in it, I can uniquely and appropriately name this file, right? Any questions about this? Using a tree, that's something else for next year. Cycles in your file system, just for fun, right? Or a file system that takes you to a different root directory every time you start. Why not? Yeah, so when you develop the canonical name of a file, symlinks are not considered, right? They provide sort of an alternate name. Okay, so on Wednesday we'll pick up here and we'll start talking about on-disk data structures and what makes it difficult to keep file systems synchronized.
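The naming argument above can be sketched in a few lines of code. This is a toy model under stated assumptions: the graphs below are hypothetical and are not the ones on the lecture slide, the node labels (root, d1, d2, d3, file) are made up, and names_for is an illustrative helper that lists every way to reach a file from the root within a bounded number of steps.

```python
def names_for(graph, root, target, max_len):
    """Return every path of at most max_len entries from root to target.

    `graph` maps a directory id to a dict of {entry name: child id}.
    Ids with no entry in `graph` are treated as plain files.
    """
    found = []
    stack = [(root, [])]
    while stack:
        node, path = stack.pop()
        if node == target:
            found.append("/" + "/".join(path))
        if len(path) < max_len:
            for label, child in graph.get(node, {}).items():
                stack.append((child, path + [label]))
    return sorted(found)


# A tree: every directory and file is reachable through exactly one parent.
tree = {
    "root": {"you": "d1", "me": "d2"},
    "d1": {"love": "d3"},
    "d2": {},
    "d3": {"well": "file"},
}

# Same layout, but directory d3 is also listed inside d2 (two parents).
shared = dict(tree, d2={"love": "d3"})

# Same layout, but d3 contains an entry that leads back up to d1 (a cycle).
cyclic = dict(tree, d3={"well": "file", "back": "d1"})

tree_names = names_for(tree, "root", "file", 6)
shared_names = names_for(shared, "root", "file", 6)
cyclic_names_6 = names_for(cyclic, "root", "file", 6)
cyclic_names_8 = names_for(cyclic, "root", "file", 8)
```

With the tree there is exactly one name. With a shared directory there are two, and with a cycle the number of names keeps growing as the length bound is raised, which is the "spin three or four times and then come out" case.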