Sort of groovy, huh? It kind of puts you in a Friday mood. Okay, it's Friday. How about that? I guess everybody here is just done with assignment 3.1. Is that why you're here? I should have canceled class. I don't know, man, standards of quality are dropping up here. I'm going to be like, hey, it's a snow day. We had no snow days this year, right? Snow day today.

All right, now let's talk about file systems. Any questions before we get rocking? You want a snow day? We had a snow day? Yeah, really? Oh, I wasn't here, I think. I didn't have a snow day. So look, obviously it's about me, right, and not having a lecture. No, no, you had a snow day. I missed that.

Okay, so today we'll start talking a little bit about one of the distinctive features of modern file systems, which is hierarchical naming. We'll try to unpack some of the implications of that which you might not have thought of, and then we'll start talking about on-disk data structures. So again, here's the challenge. What makes file systems so interesting is: how do I build a reliable data storage system
that has some of the features and properties we've been talking about, on top of this flat array of disk blocks? And what really distinguishes file systems from each other? If I gave you two systems, one running ext4 and the other running another Linux file system like ZFS, how many people think they could tell the difference without using some sort of utility? I mean, cd works, stuff works. So from a user perspective the operation of these file systems is not particularly different. There are some feature differences, and there are still file systems that are adding new features, but the differences come in how file systems lay things out on disk and in the on-disk data structures. Okay.

Obviously, yeah, 3.1 is due in like two or three hours. A lot of people have submitted already, so I think people are doing okay. Assignment 3.2 is due two weeks from today. That's not a huge amount of time for that part of the assignment. It's significantly more difficult. How many people have finished 3.2? All right, so these are the people that should be volunteer TAs now, right, for the next two weeks. We've just increased the ninja staff by three. That's awesome. Are you guys done with 3.3? Okay. Yeah, and you guys are good in there. They're trying to do things that are hard, too, which I'm pretty impressed by.

Oh, what's that? Don't use goto, right, exactly. Actually, we spent about 20 minutes after you left figuring out what happens when you use goto with local variables. We can talk about it later. It's weird. It's actually not weird, I mean, goto is very simple, but in general, if you're using goto in your code, please don't, because when you come to get help from me, I will be confused and you may not get help.

All right, so we started talking about the Unix file interface.
We went through this quickly at the end of class last time. These are ways that I can establish relationships between files. And I think by now, after you've implemented assignment 2, you understand the semantics of this particular interface: how I decide that I want to use a file, how I actually read from and write to that file, and then how I can reposition the file pointer, which is normally updated implicitly. If you think about it, dup2 really isn't a file system system call. If you remember when you implemented dup2, it doesn't actually use any of the underlying VOP calls, because this is really more about the mappings that I'm presenting to the process. It doesn't have anything to do with the file system.

Okay, so let's talk about file organization. We know that in order to refer to a file I need a name. That was one of the things that files had, and it was a useful property. The file name is actually the full path to the file. Early file systems did weird things with names. For example, there were some early file systems where every user had a directory, one directory basically, and inside that directory it was up to you to create some sort of naming scheme that would let you name all the files you were going to use. So here's a letter to my mom, a letter to my wife, a letter to my dog. And this starts to get difficult, right? Imagine if I forced you to put all the files that you have on your system in one directory. Maybe you know somebody who does that, and that's sort of their organizational strategy. But this gets difficult, right, and sort of nasty.
So these flat namespaces just don't work very well, and we decided to start using hierarchical namespaces. Hierarchical namespaces are not just a naming feature. They're also a usability feature. Again, if I put all of the files on the system in one directory, and remember, at this point in time we're actually finding things by looking through that directory. You don't have these helpful things like Spotlight. I'm finding files by looking through the directory, looking at the file name, and trying to use that as a hint to figure out what the contents are. Once I allow you to create directories, you can organize things a little bit differently, and I can put related things in one place. So here's an example of a directory naming scheme.

Now, every file on the system needs a canonical name. When you open the file, the system has to be able to name it. And obviously this is important, but you don't always see it. A lot of times you may open files with the same name. I might open foo.txt and I might open foo.txt, and those might be different files, but the names of those files are actually distinct. Why? Yeah, I'm in a different folder.

It's always amused me, and I'm going to use another chance to complain about Windows products. Has anyone noticed this? Excel will not open the same base file name in two different directories. So if I have form.xls in one directory and form.xls in another directory, you can't open those at the same time. Okay, that makes me want to kill myself, right? It's so broken. This is 2017, right? Anyway, I won't bore you with the details, but frequently, when I am forced to use Excel spreadsheets, which means someone is already pointing a gun to my head, a lot of times they'll be named something like form.xls, right?
Which is in a directory that indicates, anyway, whatever, you guys don't care. But it's super broken, right? If any of you guys work at Microsoft and you fix that, I will pay you a thousand dollars. I'm completely serious about this. I'm saying this on camera. If you fix that and ship it to me, I will pay you a thousand dollars. Yeah, I don't know. I don't care about that stuff, right? I mean, it's bad enough when you have to use those products. Trying to use them in the cloud is like, yeah, I hope not. It has always struck me as one of the worst bugs in a deployed piece of software. I don't use Outlook, though, so maybe I'm missing out. Okay, anyway.

So obviously, in order to make this usable, I have features like relative navigation. I have this idea of being in a current working directory. This is something that's actually exposed through the system call interface, and it allows me to translate and look up names relative to a particular location. So it's kind of a weird idea, right? I started off with this idea that I wanted to be able to store and name content, and now I actually have this notion of a location on the system. That's sort of odd.

This matches up to some degree with our human intuition about stuff. How many people in here are organized in some way, at least slightly? Okay, you don't have to all raise your hands. That's okay. I've seen some of my faculty colleagues' offices, right? So you put some stuff over there, you put some other stuff over there, all the underwear goes in one drawer, and stuff like that. That's kind of the same concept that we're providing to users through these hierarchical file systems, right?
I have this idea that a file is in a particular place, despite the fact that this is just an artifact of the naming and of the fact that I can navigate the system in this way. So in order to get this to work, I implement hierarchical namespaces. That means there's a unique name for the file, and that name maps to one location, right?

Now, normally the way I do this is I establish a root. I guess there's no reason that there has to be a single root to the file system, but it usually makes sense. So the file system namespace ends up having a single root, which corresponds to slash. And actually, I think in BSD there are ways to have multiple namespaces. That's what con: is for. Anyway, what you guys are used to on Unix-like systems is a single root, and then I have essentially a tree hanging off of it. Why is that? Has anyone ever thought about this? Yeah?

Right, so there are actually two things I want. First of all, the graph is sort of a DAG, so there's one root and then it's acyclic. Has anyone ever tried to do this? I can't move a directory into a subdirectory of itself, which would not only detach it from the tree but also create a loop in this thing that I've created. And the reason is, oops, sorry. The reason for this is that I want to be able to provide a unique name for each file.

So let's say this was my file system. This is a graph. It doesn't have a root and it's not directed. So what's the name of a particular file? Let's pick this one. What's the name of this file? Remember, the file has to have a unique name. I mean, I have lots of different options, right? It could be "used to love well". It could be "used you me love well", right? It could be "used to love me you used to love well", right?
So the first thing I need to do is choose a root. Now, once I've chosen a root, I still have a problem. I still have two names, and I can still loop here, but at least I have a starting point. Then I need to disconnect the cycle. At this point, this file actually has a unique name, which is the shortest path from it to the root. Does this match up with your intuition about how you guys use file systems? Yeah?

I mean, symbolic links will complicate the situation slightly, and I'm happy to talk about them, but even on systems with symbolic links there is still a canonical name for the file. Normally the programming languages you use have a way to give you that. You say, I want the absolute path of this file, and that will discard symbolic links and collapse things. So it's the shortest path from the root to whatever file I'm trying to name. Yeah?

Hard links are not treated differently. Hard links and symbolic links are pretty much the same. There's a small difference in how they're implemented. Let me come back to that later. Yeah, I mean, there's a big difference in terms of how they work, but on some level, from a naming perspective, you could think of them as the same thing. There has to be a path to the file that ignores hard links or symbolic links, right? Okay.

And so this allows me to produce a canonical name, which in this case is this particular path. I can also produce lots of relative names. Relative names in this context means names that include dot and dot-dot. This is another feature of Unix-like file systems. I have two special directories that are present in every directory: dot and dot-dot. So what does dot refer to? The current working directory. What does dot-dot refer to? The parent. Right, so I can get from where I am to the parent.
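To make the idea of collapsing dot and dot-dot into a canonical name concrete, here is a minimal sketch in Python. It is purely lexical: it assumes there are no symbolic links, and the function name `canonicalize` is made up for this example rather than taken from any real file system.

```python
def canonicalize(cwd, path):
    """Lexically turn `path` into an absolute name, collapsing "." and "..".

    Assumption for this sketch: no symbolic links, so the name can be
    resolved by looking at the path components alone.
    """
    if not cwd.startswith("/"):
        raise ValueError("cwd must be an absolute path")

    # An absolute path ignores the current working directory entirely.
    start = "" if path.startswith("/") else cwd

    components = []
    for part in (start + "/" + path).split("/"):
        if part == "" or part == ".":
            # Empty parts come from repeated slashes; "." means "stay here".
            continue
        if part == "..":
            # ".." moves to the parent; the parent of the root is the root.
            if components:
                components.pop()
            continue
        components.append(part)

    return "/" + "/".join(components)
```

Called as `canonicalize("/home/alice", "../bob/./notes.txt")`, this yields `/home/bob/notes.txt`: many relative spellings, one canonical name.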
I mean, why do I need dot? Has anyone ever thought about that? What's dot for? What's one of the uses of dot? Yeah?

Right, so has anyone ever tried to run a program that's located in your current directory when your current directory is not in your path? If you just type the program name, it doesn't execute. It says program name not found. Why is that? Yeah? Okay, so this is a security feature. Why is it a security feature? Yeah, exactly, right. So here's what I can do on a shared system. I can put a version of ls into /tmp, a shared directory that a lot of people can go into, and then if I can get you to go into /tmp and try to run ls, that will remove your entire home directory or something like that. This is true. I could do this.

So I'm just going to ask this question, turn around, and close my eyes. Does anyone have dot in their path? I hope nobody's hands are up. All right, don't do that. It's super unsafe. People are like, oh, I'll put dot in my path, it makes this a lot more convenient, right? Yeah, okay. By the way, you should go check in /tmp. I put a file in /tmp that tells you the account number where there's a million dollars waiting for you. You just have to go in there and find it. So go into /tmp and run ls, everyone who has dot in their path. Okay.

But dot-dot is obviously useful in another sense. Dot-dot always allows me to get to the parent directory, regardless of where I am. So this is really useful when you're navigating the file system. Dot is not as useful for navigation, but it has other purposes. This is a cute poem. You guys can look at it later. I just used it for this example.

All right, any questions about the file abstraction at this point? We're really sort of done talking about files now. We're going to start talking about some of the ways that file systems actually have to implement these features. Yeah? Yeah, so that's a great point.
So the drive designation, and I think there are other file systems where you'd have labels, right? So a single root may not necessarily be required. C and D and those drive letters, yeah. I mean, technically those usually refer to different file systems, I think. For example, I can have multiple disks on my machine that each have separate file systems. On a Unix-like system, what do I typically do to support that? I don't use drive letter, colon, whatever. What do I do instead? Well, I mean, I still have to mount drives on Windows, but where do they get mounted?

To boot the system, I need some sort of root file system that contains the kernel and all that other stuff. Other file systems I might want to mount from other drives. Say I add a drive to my system, and I want to put a new file system on it and store stuff there. Where does that get attached? On Windows that might be E: or something. What's the typical way of doing this? Yeah, I mount it so it's part of the same tree, right? So one directory, or subdirectory, or sub-subdirectory of my tree may actually lead me to a new file system that lives on a different disk. And so translating that path means I have to find that file system, cut off the rest of the path, and pass it to that file system, so that it sees the path relative to its own root. Yeah, that's a good point. Typical stuff.
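Here is a rough Python sketch of that translation step: find the file system mounted closest to the path, cut off the mount point, and hand the remainder to that file system. The mount table, the mount points, and the names `resolve_mount`, `rootfs`, and `datafs` are all hypothetical, invented for this illustration; a real kernel tracks mounts differently.

```python
def resolve_mount(mount_table, path):
    """Return (file_system, path_within_that_file_system) for an absolute path.

    `mount_table` maps mount points (absolute directory paths) to whatever
    object represents the mounted file system. The longest matching mount
    point wins, so a file system mounted deeper in the tree shadows the
    one above it.
    """
    best = None
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/")
        # Match the mount point itself or anything strictly beneath it.
        # Comparing against prefix + "/" keeps "/mnt/data" from matching
        # an unrelated sibling such as "/mnt/database".
        if path == mount_point or path.startswith(prefix + "/"):
            if best is None or len(mount_point) > len(best):
                best = mount_point

    if best is None:
        raise LookupError("no file system mounted for " + path)

    # Cut off the mount point; what remains is relative to that
    # file system's own root.
    remainder = path[len(best.rstrip("/")):] or "/"
    return mount_table[best], remainder
```

With a table like `{"/": "rootfs", "/mnt/data": "datafs"}`, the path `/mnt/data/photos/a.jpg` is handed to `datafs` as `/photos/a.jpg`, while `/etc/passwd` stays with `rootfs`.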
All right, so file system design goals. Now that we understand the file abstraction, here are some of the things that file systems actually have to be able to do.

One is to look up file names. Today file names have these nice properties. They can include arbitrarily long chains of subdirectories, weird characters, whatever. Given a file name, I need to be able to find the contents associated with that file. So when I look something up by path, whether it has relative components or whatever, the file system has to be able to translate that into whatever information it needs to find the contents associated with that file.

I also have to allow the contents of files to change. So obviously I need to allow editing. I need to allow the size of files to grow and to shrink efficiently. Keep in mind that file sizes can vary over multiple orders of magnitude. So give me an example of a tiny little file that there are probably dozens of on your machine. Oh, yeah, that's a great point, actually, those stupid little metadata files that get created. I always delete those, but then they reappear again. Yeah, little meta files that are created by file indexing systems. What else? A small file, a couple of lines. __init__.py. Yeah, empty files. I like that. Stupid empty files required by certain programming languages. Yeah, .gdbinit. Okay, you guys are thinking like computer scientists, like weird people, right? Normal stuff, like configuration, all these little configuration files. Windows has this thing called the registry. Does it still have the registry? Okay, anyway, so Windows gets away with this with its own solution to the problem, which is nightmarish, but a lot of other systems have tiny little configuration files sprinkled all over the disk.
Yeah, like Vim. So there are all these tiny little files on the system. Now give me some examples of huge files that you probably have a couple of lying around, I'm sure legally obtained. Legally obtained torrents. That's my favorite answer to this question, right, legally obtained torrents. I'm sure, like copies of Ubuntu, for example. Yeah, exactly, that you downloaded using the torrent because you were nice. Yeah, movies can be multiple gigabytes. So these tiny little configuration files could be one kilobyte, maybe just a few hundred bytes, and then I have files that can be multiple gigabytes. And now, with big data sets, I might be running scripts across multi-terabyte files. I mean, files get big, and so this is an interesting design challenge.

I also need to allow something else about the file to change, and that's its name. So I need to be able to rename files effectively, and this is actually trickier than you might think. If you start to think about it, you might see why.

File systems also do a variety of things to try to improve performance. This is something that's done at the layout level, and we will talk about old file system designs that really tied themselves very closely to the structure of spinning disks, which is obviously something you wouldn't do today, because you're not sure what type of system you're going to be deployed on. But a lot of the innovation when it came to file systems came out of battles fought over performance, and a lot of the changes had to do with where I put things. So I can try to optimize access to single files. And if I notice that certain files are related to each other, I might want to put those files in the same place. Why would I do this?
Okay, there are games I can play with optimizing access to single files. You might think about how that works. But what can I do with multiple files? Think about this on an old spinning disk. If I can see that there are two files that are always accessed together, what might I want to do with them, or with their contents? Yeah, what's that? Well, I don't want to waste space. Remember, I have old, crappy, small disks. Yeah, what was that? Yeah, remember, on old spinning disks, where things are physically located matters. So if I have files that get accessed together a lot, and I can actually move them to the same spot on the disk, then once I get the heads close to one file, the heads are close to the other file. Happily, we don't have to play as many of these games anymore on newer drives.

One of the biggest file system design challenges is surviving failures of a variety of different kinds, right? I'm sure you guys have seen those irritating messages when you plug in a flash drive. It's like, do not unplug me without pushing this button. I always do it anyway, right? And then my favorite thing is that you get another message where it's mad. It's like, you didn't do what I told you. I'm like, I don't care. I already did it. It's too late. Clearly I'm not learning. You've shown me that message hundreds of times. But this is why you're not supposed to do that. The file system is like, wait, hold on, there were things that you wanted me to save that aren't on disk yet, and you just unplugged me. So dealing with unexpected events like this, whether they're user-generated, or power failures, or drive failures. Has anyone ever dropped their laptop before? No one's ever dropped their laptop?
Yeah, I mean, think about spinning disks. That disk is sitting there spinning at like 10,000 rpm. You drop it, at some point it hits something, and then all hell breaks loose, right? So when we start talking about failures and other sorts of problems, a lot of this has to do with maintaining some degree of consistency. Let's say the power goes off in your data center. It's probably okay if there are a few saves that don't happen. What's not okay is if the file system somehow gets itself scrambled internally and can't find anything ever again. Or, alternatively, if I have to run a program that takes days to finish to clean up the disk. So when we get to talking about failure, maintaining some type of consistency across failures and being able to recover quickly are core file system design goals. Okay.

Common features. The file systems we're going to talk about support the idea of files, including attributes, permissions, file names, hierarchical file names, with no real restrictions on those. Although, does anyone know something weird about the Mac file system? I've always been hoping that someone will be able to answer this question for me. Does anyone use a Mac? Has anyone noticed something odd about the file names on your Mac? Yeah, oh, I think you can have spaces in a valid file name. I wouldn't suggest using them, though. Okay, that's not what I'm thinking of. There's something even more basic. At least on older versions, and I think this is still true, Mac file systems are case insensitive. I know there is a case-sensitive option, but it won't boot on top of it.
I tried this. Anyway, so the default Mac file system thinks capital FOO and lowercase foo are the same file. Why this is done makes no sense to me. Clearly there's more code required to get that to work, right? You see what I mean? If I just compared the bytes, that's easy, but now I actually have to figure out, oh, I don't know who did this or why. Right, like what is capital upside-down B? Is it capital B upside down, or do you flip it right side up? I have no idea. I just don't know. So anyway, that's always boggled my mind. Somebody did this. First of all, I think it's bizarre. Second of all, it's actually more work to implement.

Yeah, that seems to be true of your generation, I will point that out. You guys seem to think it's totally cool to capitalize any word in any sentence at any point for emphasis. Okay, don't do that, by the way. I don't know where you're getting those ideas. Capitals and lowercase are different, right? They're used differently. They're appropriate at different points in time. Anyway, I'll stop ranting about this.

So this is the file feature set that we're going to talk about. We will not discuss how to implement a case-insensitive file system. Okay, so now we're actually talking about on-disk data structures. Any questions before we get to this point? Okay, so let's start to think about this. Your job is to implement this file system.
You're given an array of disk blocks. Those disk blocks are maybe 512 bytes in size. The only thing the disk understands is numbers, so the disk will read and write chunks of data to a numbered block. Okay.

The first thing to think about is the following. Broadly speaking, there are going to be two types of disk block. You guys have been working on assignment 3.1, so you're sort of mentally prepared for this. It's a lot more complicated than your coremap, but it's similar in a certain way. So if I were able to just go down to the disk and start pulling blocks from it at random, I would find, broadly speaking, two types of blocks. What are they?

Now, if I'm going to store data on the disk, clearly that identifies one kind of block. So there is a block that has what in it? Data from a file, right. I will pull blocks, and hopefully, assuming that the file system is close to full, a lot of the blocks I pull will have data in them from a file. So I'll be able to find a file on the file system and say, okay, this 1K comes from this part of the file. Awesome. Is every block on the file system like this? No. So what are the other blocks? Yeah, Mike? Well, who knows, it doesn't matter, right? The other blocks do not have file data in them. They have some type of data structure that the file system is using internally. So this is sort of like your coremap. Your coremap is in memory, it's taking up memory, and it's used to manage the rest of memory. So it's a data structure that's competing with the rest of memory for space.

Has anyone noticed this when you format a drive? Do you guys have a mental model for what happens when you format a drive? Does anyone know? No? Yeah, I know you know this, right? What does it mean to format a drive?
You have to do it, right? You pick format, and it gives you these choices, and then something happens. If you're on Windows, sometimes it takes days, for reasons I don't understand. Quick format, right. So there are two interesting things that happen when you format. One thing is, you bought this four-terabyte drive, and you format it, and then you run df or you look at the space, and what happened? There's less space. Where did that go? I paid good money for that, you know. What's actually happening? What's the format actually doing? No, pretend there's nothing on this drive. The drive is zeroed front to back.

So formatting writes any on-disk data structures that the file system needs to bootstrap itself. Again, think about your coremap. The on-disk data structures for a file system are way more complicated, but they still take up space. So that's what the format is actually doing. It's putting data structures on the disk, and that's why it's file system specific. That's why the format for NTFS is different from ext3 or whatever. The data structures are file system specific. They have to do with the implementation. That's what the format actually does. That's also why you lose a little bit of space, because the format is taking over parts of the disk for data structures that the file system is going to use later. Yeah?

Yeah, probably, it depends. Has anyone ever used mkfs for ext? It's like mkfs.ext4. There are parameters that you can use, so some of how the file system structures are built can be modified at format time. We'll actually come back to that, because there are some interesting results. Yeah, does it screw it up?
No. Well, let's put it this way. If you format a drive that already has data on it, it is unlikely that you will be able to find that data easily. Let's say I take an NTFS drive and I reformat it with ext4. Remember, a lot of the data is still there, but the data structures that the file system needs are gone. This is another reason why people say that before you give away a machine you should fully wipe the disk drive. If you just format it with ext4, all it does is change a few blocks on the disk. And there are people out there in the world, I can't believe that there are people like this, but there are, because we know there are bad people in the world, who will actually buy that disk and run a program that basically looks at every disk block and looks for things like credit card numbers, for example, which are actually pretty easy to identify. People do this on eBay. They'll just buy drives and scan them looking for Social Security numbers and stuff like that, because you didn't format it in a way that destroyed all of the data. And sometimes you actually have to overwrite the data multiple times, because magnetic disks have this kind of memory to them. Anyway, this all gets very strange and scary, I guess.

But broadly speaking, there are two types of disk blocks: data blocks and index nodes. The data blocks contain data that corresponds to a file. The index nodes, or inodes, contain a variety of different types of on-disk data structures. Inodes are the most common, but there are other types of on-disk data structure that we can also talk about. Okay.

And I already pointed this out, so I don't need to go through it in depth. The thing that makes file systems different is these on-disk data structures, and not only their format, but where they go and how they're used. That's what distinguishes NTFS from other file systems like ext3. So where does stuff get put? How do those things get used?
And how do I do crash recovery? That also has to do with putting things on disk and using those data structures after failures and stuff like that. Okay.

So again, as I've been harping on, what I'm really doing here is trying to build this big, complex data structure that has to do all this stuff on top of a very, very simple data structure, which is the disk block array. And what's hard about this is that doing things that may seem simple to you, for example moving a file from one directory to another, can potentially require modifying lots of parts of that data structure. And like in any other system, if I get caught halfway through that, it's possible that I leave the system in some sort of inconsistent state. So if I'm not careful about how I access the file system and it crashes halfway through, I can have files that are totally disconnected from the tree.

Does ext4 still have this, the lost+found directory? Did anyone ever notice this before? At the root of some Linux-based file systems is a directory called lost+found. I'm not kidding about this. Sometimes people find things in there. And the reason that things end up in there, it's not funny, okay, this is a problem, is that they got disconnected from the tree. And then you run a program, and it's like, wait, I found this file over here, and it's kind of hanging off in space. What do I do with it? I don't know what the name is, so I put it into this directory and I hope that you know where it came from. That is literally the failure model for some file systems that you guys use right now. How often does that happen?
Hopefully not very often. If you find a lot of files in lost+found, I would suggest that maybe your disk is having some issues. But again, that's the other thing that makes this hard: the disk substrate can fail. Right, so you may have a sector on your disk that just stops responding to commands, and so all the data that was in there is suddenly gone. So I can have files that lose data. I can also have parts of the file system's data structures that are affected by this.
Okay, so let's think about a write. We'll just walk through one example before we finish up today: just writing to a file. So what are some of the things that have to happen to write to a file? And I'm writing data to the end of the file, so this is implicitly going to increase the file's length. I'm appending data. What are some of the things that the file system has to do to accomplish this? What's that? Well, I've already got it open. Okay, we'll talk about path name translation, which is itself kind of interesting, but I'm assuming I already have it open. But what do I have to do? I have new data, who knows where it came from, but I want to write it to the file. Right, I'm appending to the file. But on disk, think about the stuff that has to happen, the changes that have to get made to these on-disk data structures. Yeah? Okay, well, I may not have to move stuff around, but I certainly have to find somewhere to put this data, right? I have new data; I need a new empty spot on the disk to put this data in. So that's one thing I have to do: I have to find an empty block. In order for this to complete, there has to be at least one disk block somewhere sitting around that hasn't been used, and so I have to find that. So that's your coremap allocator right there. That's just step one, right? That's all your coremap allocator does: it just finds a page.
I have to find an empty disk block. You have some idea of what that involves. I have to somehow link those blocks with the rest of the file, because next time I open this file and I read it from start to finish, I need to find this data. So I need to be able to find the new data that's being written here. If this doesn't happen, then the data may be on disk, but it will never be located again. What else has to happen? Yeah, I might have to adjust the size of the file, so that the next time I run ls or something like this, the correct size is reported. So this may be located somewhere else. Then I actually have to write the data itself. I haven't done this yet; all I did is allocate space. Now I actually have to move the data into the right spot on the disk.
Now, from your perspective as a programmer, all this stuff has to happen essentially at once. But let's think about it here. So what could potentially happen if the file system fails after step one? What sort of inconsistency can I have here? What's that? Yeah, my disk capacity just went down, and there's this phantom block that is not associated with any file, and I will never get that block back. It's gone. It's still on disk, it's sitting there with zeros in it, probably, but it's been marked as allocated in a way that it will never be able to be deallocated. So even if I truncate this file to zero and close it, that disk block's never coming back. Okay.
What happens if I fail after this? May not be the worst case of all. Yeah? Yeah, so the problem is how do I distinguish between failing here, and failing here, you know, and actually completing it. Now, the metadata should be easy to fix, right? Next time I look at the file I might say, oh, okay, well, I opened the file and I found 1200 bytes, but the metadata says it's only a hundred bytes. I'll just fix that. That's not that hard to do. But how do I tell? Maybe the user wrote zeros into the end of the file.
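The four append steps and the crash cases walked through above can be sketched as a toy in-memory model. This is only an illustration under stated assumptions: `ToyFS`, `Inode`, the `crash_after` parameter, and the single-file layout are all hypothetical names invented here, not how ext4 or any real file system is implemented.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4096  # assumed block size for this toy model


@dataclass
class Inode:
    """Toy per-file metadata: a length and the list of data blocks."""
    size: int = 0
    blocks: list = field(default_factory=list)


class ToyFS:
    """Hypothetical single-file file system used to illustrate partial appends."""

    def __init__(self, nblocks):
        self.bitmap = [False] * nblocks  # True means the block is marked allocated
        self.data = [b""] * nblocks      # contents of each disk block
        self.inode = Inode()

    def append(self, payload, crash_after=None):
        """Append at most one block of data, optionally 'crashing' after a step."""
        assert len(payload) <= BLOCK_SIZE
        # Step 1: find an empty block and mark it allocated.
        # (list.index raises ValueError if no free block is left.)
        blk = self.bitmap.index(False)
        self.bitmap[blk] = True
        if crash_after == 1:
            return
        # Step 2: link the block into the file so it can be found again.
        self.inode.blocks.append(blk)
        if crash_after == 2:
            return
        # Step 3: adjust the file size stored in the metadata.
        self.inode.size += len(payload)
        if crash_after == 3:
            return
        # Step 4: actually write the data into the block.
        self.data[blk] = payload

    def leaked_blocks(self):
        """Blocks marked allocated that no file points to (the 'phantom' blocks)."""
        return [b for b, used in enumerate(self.bitmap)
                if used and b not in self.inode.blocks]
```

Calling `append(b"hello", crash_after=1)` on a fresh `ToyFS` leaves one block in `leaked_blocks()`, which is the lost-capacity case. Crashing after step 2 leaves a block attached to the file while the recorded size is still zero, and crashing after step 3 leaves a size that claims data the block never received.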
So how do I tell whether or not that data is legitimate? It's attached as part of the file, but I don't know whether the data is legitimate here. You know, obviously if I fail, there's data missing from the end of the file, which is not good. So it just gives you a sense of all the different things that can go wrong here and what makes this hard.
And there's another interesting performance implication here too, right? There's some data structure that's used to determine what disk blocks are available. That's what I used in step one. Then there's some data structure that's associated with the file itself that allows me to find all of the data blocks that are associated with the file. So that's a second data structure. There's a data structure that has metadata about the file, and that might be the same one as I used in number two. And then there's the actual disk blocks that I'm copying stuff into. So at minimum here, I've touched three different data structures that are probably located in three different places on the disk. And so from a performance perspective, I started off with the heads in one place, I moved the heads somewhere else, and then I had to move them somewhere else again, and all this just to do an append. This is not a complicated operation. All right, any questions at this point? Yeah, I just said that. Cool.
Okay, let's keep going. So we're going to investigate how file systems do two things, right: translate paths to look up inodes and, sorry, find data blocks. So when I open a file, I need to translate the human-readable string that I was given into some information about the file, and we call this an inode. Each file has one inode. Then, starting from there, I need to be able to locate all of the data blocks for that inode. So we're not going to talk specifically in class about certain on-disk data structures, like the free block bitmap, for example. You guys can imagine how that works, right?
But we are going to talk about these two challenges, path name translation and finding the data blocks associated with a file, because this is where there are some interesting design choices, or some things that you guys just really ought to know about, okay? So, oh, okay, fine: allocating free inodes or blocks. Yeah, not really. And then we're going to use some ext4 examples, since that's the modern, practical one.
One note about inodes. Has anyone ever run df and noticed that there's an inode count that's provided? So another question, and this may be an experience that not very many of you have shared: does anyone have a file system that ran out of space despite the fact that there was actually a large amount of space left on the disk? Why did that happen? Okay, awesome. Well, someone is going to learn something today. Yeah, so this has happened to me, and it's super irritating. So there's one inode per file. ext4 creates all the inodes when it formats the disk. So this comes back to your question about, you know, can I screw things up if I format incorrectly? It depends. The number of inodes that ext4 creates is based on assumptions about how large files are, right?
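The inode count mentioned above can also be read from a program. The sketch below uses Python's `os.statvfs`, which reports both block and inode totals for a mounted file system, and adds a small piece of arithmetic showing how a fixed inode budget interacts with file size. The helper names and the 16 KiB bytes-per-inode figure in the usage note are assumptions chosen for illustration, not values taken from the lecture.

```python
import os

BLOCK_SIZE = 4096  # assumed file system block size for the arithmetic below


def inode_usage(path="/"):
    """Report inode and byte capacity for the file system holding `path`."""
    st = os.statvfs(path)
    return {
        "inodes_total": st.f_files,
        "inodes_free": st.f_ffree,
        "bytes_total": st.f_blocks * st.f_frsize,
        "bytes_free": st.f_bfree * st.f_frsize,
    }


def inodes_at_format(disk_bytes, bytes_per_inode):
    """Hypothetical format-time rule: one inode per `bytes_per_inode` of disk."""
    return disk_bytes // bytes_per_inode


def data_bytes_when_inodes_run_out(n_inodes, file_bytes, block_size=BLOCK_SIZE):
    """Data space consumed if every inode holds a file of `file_bytes` bytes."""
    blocks_per_file = -(-file_bytes // block_size)  # ceiling division; 0 for empty files
    return n_inodes * blocks_per_file * block_size
```

As a worked case under those assumptions: a 1 TiB disk formatted with one inode per 16 KiB gets `inodes_at_format(2**40, 16384)` inodes, which is 2**26. If every file is 1 KiB and occupies one 4 KiB block, the inodes are exhausted after 2**38 bytes (256 GiB) of data blocks are in use, leaving most of the disk unusable for new files.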
If I create too many inodes, the inodes consume the entire disk and there's no actual room for content. If I create too few inodes, I can't create very many files. So when you create an ext4 file system, it uses a default that's supposed to represent an average file size on modern systems. The problem is what happens when you have lots of really, really small files. So you can try this: if you create billions of tiny little files, you know, like zero-byte files or even small 4K files, what will happen is the file system will run out of these data structures before it runs out of data blocks. And it can be very confusing, because it'll say something like no space on disk, and you're looking and you're like, there's terabytes left on disk, right? But what happened is it ran out of these data structures. And there's maybe a way to fix this, I guess, but I'm not sure. I didn't know of one at the time, so we ended up having to move all the data off, reformat the disk, and copy things back. Okay.
So let's talk about some terminology that we need to understand to talk about the rest of this. So these are disk units. A sector is the smallest unit of data that the disk allows me to read or write. So the disk does not address, or we don't assume that the disk allows us to address, every byte. Instead, these are sort of like pages.
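Because the disk is addressed a whole sector at a time, changing a single byte turns into reading the sector, patching it in memory, and writing the whole sector back. The following is a minimal sketch with a toy in-memory disk; `ToyDisk`, `write_byte`, and the sector size constant are assumptions made for illustration and do not correspond to a real driver interface.

```python
SECTOR_SIZE = 256  # assumed sector size for this toy disk; real devices vary


class ToyDisk:
    """Hypothetical disk that can only transfer whole sectors."""

    def __init__(self, nsectors):
        self.sectors = [bytes(SECTOR_SIZE) for _ in range(nsectors)]
        self.reads = 0
        self.writes = 0

    def read_sector(self, n):
        self.reads += 1
        return self.sectors[n]

    def write_sector(self, n, data):
        if len(data) != SECTOR_SIZE:
            raise ValueError("the disk only accepts whole sectors")
        self.writes += 1
        self.sectors[n] = bytes(data)


def write_byte(disk, offset, value):
    """Change one byte at a byte offset using read-modify-write of its sector."""
    sector, within = divmod(offset, SECTOR_SIZE)
    buf = bytearray(disk.read_sector(sector))  # read the whole sector
    buf[within] = value                        # alter the byte in memory
    disk.write_sector(sector, buf)             # write the whole sector back
```

A single call to `write_byte` costs one sector read and one sector write, which the `reads` and `writes` counters make visible.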
This is the granularity at which I can address the disk. So I say to the disk, write sector 10,042, here are 256 bytes. This does mean that to modify one byte on the disk, frequently I have to read the sector, alter the byte in memory, and write out the whole sector again. Right, just, you know, disks at the time were not byte addressable.
A block is a unit that's chosen by the file system. Blocks are a larger unit that are multiples of sectors, and this is the smallest unit that the file system is actually going to write at one time. Because for performance reasons, and given the size of files, it's typically not worth writing a single sector. If I'm going to modify the file, I might as well do it in bigger chunks, like 4K.
Now, extents start to become a file-system-specific feature. This is something that's used by ext4. So extents are groups of blocks. You can think of extents as an even larger chunk of a file. When I need data blocks for a file, one way to do it is to just grab 4K blocks wherever I can find them on the disk. Why might I not want to do that? So again, as soon as I write to the file I grab one 4K block, and then if I write another 4K I grab another 4K block, and if I write another 4K, another 4K block. I just keep doing this. So you're saving your 4-terabyte movie, and I grab any block that's available on the disk, 4K at a time, and when I'm done I have a bunch of 4K blocks that are sprinkled all over the disk. Why is this not necessarily good? Yeah? Yeah, reading it. Think about that poor disk head when you're trying to read this. You know, it's running here and there all over the place. And this is particularly true, I think, as disks have gotten bigger: file systems now are willing to trade off space efficiency for contiguity. Contiguity, is that how I say that? Yeah, I think so. Contiguity, right?
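One way to see the cost of scattered 4K blocks is to collapse a file's block list into runs of adjacent blocks, which is roughly what an extent describes. The sketch below is an illustration only: `to_extents` and the `(start, length)` tuple format are assumptions made here, not ext4's on-disk extent format.

```python
def to_extents(blocks):
    """Collapse a file's block numbers (in file order) into (start, length) runs."""
    extents = []
    for b in blocks:
        if extents and extents[-1][0] + extents[-1][1] == b:
            # This block directly follows the previous run, so extend the run.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # Not adjacent to the previous block, so a new run starts here.
            extents.append((b, 1))
    return extents


# A file whose blocks were grabbed wherever space happened to be free.
scattered = [907, 12, 4410, 13, 88]
# A file given 256 adjacent 4K blocks (one megabyte) in a single allocation.
contiguous = list(range(2048, 2048 + 256))
```

With these inputs, `to_extents(scattered)` yields five one-block runs, so a sequential read repositions the head for every block, while `to_extents(contiguous)` yields a single run covering the whole megabyte.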
So I am willing to waste a little bit of space here and there so that I can put things next to each other. And so when I allocate space for a file, I'm not going to get 4K at a time. I might get 4K at the beginning, because I'm hoping the file is small, but once the file gets to a certain size, I'm going to start getting big chunks. I'm going to say, okay, I'm going to get a megabyte for the file. I'm going to find a bunch of blocks that are all next to each other, and I'm going to save all of the data for the file into that contiguous spot. This is just to try to improve performance. Okay, just said that. Great. Okay.
So in ext4, one... Yeah? Sorry, right. So when files are created, I don't necessarily allocate a lot of space right away, right? This is one of the design challenges for file systems. And actually, extra credit for anyone who goes and does this, people have done this in the past: find a program that will give you a distribution of the file sizes on your machine. I would be very interested in seeing it. Right, so on your personal laptop, run a program that just, you know, shows me the distribution as a histogram of the file sizes. Because what I suspect is that your computer has a lot of small files, unless maybe it's Windows, because it stuffs a lot of that in the registry, which is weird. Macs and Unix machines typically have a lot of small files, and then a long tail where you have some big files way out there, right? And there may be interesting bumps in the distribution out there. So if you have a lot of music files on your computer, there may be an interesting bump kind of around MP3 file size. There may be an interesting bump around photo file size or movie file size. But yeah, so this is one of the things that file systems will do: they'll say, when you open the file, I'm not assuming it's going to be a terabyte in length, right? I'm assuming it's going to be small, but as the file size grows,
I start to make assumptions that it's going to get bigger. All right, I'm almost out of time. I've got 20 seconds left. Okay, let me just get through ext4 inodes. One inode per file. The inode itself is just a little data structure, one per file, 256 bytes per inode, and so I can get one of these per sector, or 16 per block. Inodes, as you might expect, are allocated together. So there are parts of the disk, and we'll come back and talk about this next time, where there are bunches of inodes that contain information about hundreds of files, where the data for the files is in other places, but the metadata for the files is all stored together. The ext4 inode contains information that allows me to locate the file contents, the permissions that you're used to on Unix systems, and all these timestamps that you're used to seeing when you do things like ls. And the inodes are named and located by number. There's a flat address space for inodes that starts at zero and goes up to some large number that depends on the size of your disk, and they are all created when you format the disk.
So this is kind of a neat command that shows me information about a particular inode, and this is essentially dumping the data structure. So I see the inode number. There's a type, and we'll come back and talk about directories. The mode you guys are used to, all of these timestamps, and then information about the data blocks. Inode 2: does anyone know what inode 2 is on ext4? It's root. I don't know why it's called 2. Maybe because zero was an old one and one was taken, I don't know. Anyway, okay, we'll come back and we'll start here on Monday. Good luck finishing. Well, if you guys are done with assignment 3.1, go help somebody else finish assignment 3.1.
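The inode arithmetic from the lecture (256 bytes per inode, 16 per 4K block, located by number) can be sketched as follows, together with a way to look at the same kind of per-file metadata from user space with Python's `os.stat`. The helper names, the single contiguous inode table, and numbering from zero are simplifying assumptions made for this sketch; real ext4 spreads its inode tables across the disk, which this ignores.

```python
import os
import stat
import time

INODE_SIZE = 256                             # bytes per inode, as stated in lecture
BLOCK_SIZE = 4096                            # assumed block size
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE  # 16


def locate_inode(inode_number, table_start_block=0):
    """Return (block, byte offset in block) for an inode in one flat table.

    Hypothetical layout: a single inode table starting at `table_start_block`.
    """
    block_index, slot = divmod(inode_number, INODES_PER_BLOCK)
    return table_start_block + block_index, slot * INODE_SIZE


def describe(path):
    """Show inode-style metadata for a path: number, type, mode, size, mtime."""
    st = os.stat(path)
    if stat.S_ISDIR(st.st_mode):
        kind = "directory"
    elif stat.S_ISREG(st.st_mode):
        kind = "regular"
    else:
        kind = "other"
    return {
        "inode": st.st_ino,
        "type": kind,
        "mode": oct(stat.S_IMODE(st.st_mode)),
        "size": st.st_size,
        "mtime": time.ctime(st.st_mtime),
    }
```

For example, `locate_inode(17, table_start_block=100)` lands in block 101 at byte offset 256, since inode 17 is the second slot of the second table block. `describe("/")` reports the root directory's inode number as seen by the operating system, which depends on the file system actually mounted there.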