All righty, welcome back to Operating Systems. So we survived the midterm, ish. That Discord thing I said about curving it down? Yeah, that was a joke. I said it in the morning before I'd even scanned the midterms. Your midterms are scanned now, they're all on Crowdmark, and the TAs are grading them as we speak. Hopefully they'll be back to you sooner rather than later, Monday at the latest, ish. I don't know, they're grad students. They're like you, so they'll probably be grading at midnight or whenever I'm asleep. Who knows. But at least two of them are grading at this very moment, so hopefully it'll be back to you soon. The midterm solution is also posted if you want to check and pre-yell at me, I guess. If anything looks bad, you can let me know and preempt the TAs so I can instruct them. But I told them to be nice, so hopefully they're nice and everyone does OK. All right, so we're going to continue the course by talking about file systems. On our usual POSIX file systems, so Linux and macOS-ish (they kind of follow this, kind of not), there is actually a defined standard for what directories should exist. It's called the FHS, or Filesystem Hierarchy Standard, because we are great at naming things creatively. What it says is: at the top you have a root directory, and all other directories descend from that one. Within the root directory it defines some directories that must be present, and their names signify what should be in them. There's a bin directory, which holds binary files: programs you want to execute, things like ls, wc, what have you. There's a directory called dev, which stands for devices, so there'll be things in there that look like files but actually represent your hard drives and other hardware on your machine. There's etc, which is generally configuration files. And then there's a home directory.
And the convention is that each user on your system gets their own home directory to use. My username's john, so I'd have a directory called john, and in it I can put a bunch of stuff, for example todo.txt. There's also another one called mnt, short for mount: that's where things go that you plug into your system. If I plug in a USB drive, it should get a directory created here, and within that directory is whatever's on the USB drive. You'll figure out what mount actually does in lab six, where you'll also be writing the data that goes into a file system and creating your own files. So we've kind of used directories before: each process has a working directory, the directory that process is currently in. If my working directory is /home/john, then there are relative paths, which are relative to that current working directory: where do I go from here to find todo.txt? And there are absolute paths, which start from root. So if I'm in /home/john, anyone want to tell me how I can access todo.txt, and what would the relative path be? Yeah: todo.txt, or dot slash todo.txt. We've used dot before, hopefully. We'll see exactly what it means in a little bit, but dot usually means the current directory. What about the relative path from /home/john to the USB drive? Yeah, dot dot. Dot dot is usually your parent directory, so dot dot brings me to /home, then slash dot dot brings me to root, and then I can go into mnt slash usb. That's my relative path, ../../mnt/usb, which in this case is longer than the absolute path, which is just /mnt/usb. So that's the answer for that. Any questions? All right, so the special symbols are dot, which again is the current directory, and dot dot, the parent directory.
So, does anyone know how you make hidden files on Linux? Yeah: a dot and then a name. Anyone know why that is, or the history behind it? This is actually an instance of a bug that became a feature. Every directory contains a dot and a dot dot entry, and typically you don't want ls to show them because you always assume they exist. So ls just shouldn't show you dot or dot dot. You'd imagine the code might be: if the name is dot, don't show it; if the name is dot dot, don't show it. That's two if statements, and whoever wrote ls thought that was inefficient. Screw that: instead of two if statements, I can do one. The one if statement was: if the name starts with a dot, it could be dot or dot dot, so just don't show it. And if that's all your code is (if the first character is a dot, don't show it), well, guess what: if a file name is a dot followed by anything, it won't be shown either. So hidden files are just a dot and a name because someone back in the day optimized away an if statement and decided not to show anything beginning with a dot. It was a bug, but the bug quickly became a feature: people took advantage of it, and as soon as people start using something, you can't really take it away. That's why hidden files on Linux start with a dot: it was a bug, now it's a feature. Any questions about that fun trivia? Well, maybe fun for me, not for you. Another fun symbol is the tilde, which is always the current user's home directory; it'll also be in an environment variable called $HOME. Relative paths are calculated from the current working directory, which you can typically get through a system call (the getcwd wrapper) or through an environment variable called PWD, as in print working directory.
So now we have to talk about what file descriptors actually are, because I've kind of been lying to you, or at least not telling you the full story. Today we get the full story of what a file descriptor is. Usually when we open a file, we access it sequentially. If the contents of the file are "hello world" and we do a read system call for the first few bytes, we'll get "hello"; do another read system call and we'll get "world", because it's all sequential, all in order. So far we don't know how to do things out of order, and similarly for writes: each write system call just picks up where the last one left off. If we write "hello" and then write "world", the file contains "hello world", because it's all sequential. But hey, they're files; you can randomly access them if you really want. You can read and write at any point in the file, and we don't know how to do that yet, so let's learn. We have our open system call: it takes a path, then some flags, then some permissions in the mode. The flags say things like: can I read and write this file? Can I only write it? Can I only read it? It kind of looks like the permissions in your page table entries. There's also a special flag called O_APPEND, which makes every write system call go to the end of the file. By default, if you open a file and do a write system call, it starts writing bytes at the beginning of the file. So if I have a file that says "hello world", open it, and write "howdy" to it, it's going to say "howdy world" instead of "hello world", because I overwrote the first few bytes. If I give it the flag O_APPEND, whatever I write just goes to the end of the file.
So that's one way to modify the position in the file. The other way is a new system call, with the usual C wrapper, called lseek. lseek takes a file descriptor, and the system call changes the position of whatever open file that file descriptor represents. It takes an offset, a number of bytes, and then the third argument, whence, is the important one: it's basically a C enum that tells you how to use that offset. If whence is SEEK_SET, the offset is how many bytes from the beginning of the file: SEEK_SET with offset zero means you're changing the position to the beginning of the file; SEEK_SET with offset 16 changes the position to byte 16. SEEK_CUR takes whatever the current position is and changes it by the offset: if I'm in the middle of the file and use SEEK_CUR with offset minus two, it backs up two bytes; with plus two, it goes forward and skips two bytes. And finally SEEK_END is relative to the end: SEEK_END with offset zero means go to the end of the file, minus one means one byte before the end, and so on. One note: seeking to before the beginning of the file gives you an error, but seeking past the end is actually allowed; reads out there just return end-of-file. Any questions about this? Because it's going to get weird again; everything gets weird in this course. Yeah. Well, again: FD is the file descriptor, and internally it has a position, where it's going to read or write bytes next. The offset combined with whence tells you how to change that current position: you move it based on the offset.
So if, oops, if I do SEEK_SET, the offset is relative to the beginning of the file: SEEK_SET with offset zero means move zero bytes from the beginning, so that's the beginning of the file. And no, that internal position is tracked per open file; it's something the file descriptor's entry tracks. So if I did a read system call and got the whole contents of the file, I could lseek back to the beginning, do another read system call, and get the whole file again. All right, we saw back in lab one how to access the directory API: we did an opendir, which gave us a DIR pointer, but really it was a file descriptor under the hood. This is how we used it, but once we get to lab six you'll see how that's actually put together, and that will be the joy of lab six. Uh-oh. All right, so this is what file descriptors actually are. Each process has a file table in it. In this case I have two process control blocks, one for each process; call them process one and process two. In process one, this little table here is called the local open file table: these are all the files this process has open. What I wasn't lying to you about before: I said a file descriptor basically points to a file. "Basically", because the file is only one of the things it points to; it also keeps track of the position within the file that this file descriptor represents. So in process one, index zero, file descriptor zero, points to this entry in the middle, which lives in the global open file table tracked by the kernel: all the open files across all the processes on your machine. What each entry keeps track of is the current position, the flags you opened it with, and a pointer to a vnode. What the hell is that? Well, it represents a virtual node.
A vnode is basically anything you can read or write bytes from. A vnode can point to a file, so a file is one kind of vnode, but it could be other things: pipes, sockets, anything behind a file descriptor that I can read and write bytes to. In fact, we'll see the actual data structure the vnode points to when it's a regular file: something called an inode, which we'll see next lecture. Also in this picture, file descriptor one in process one points to a different entry in the global table, with a different position, different flags, and a vnode pointing to file B. And then process two's file descriptor zero could point to a new entry, with its own position, its own flags, and a vnode that points to the same file. Each of these processes is independently using the file with its own position; but you can imagine that in the case of forking, that position might be shared and weird things might happen. Any questions about this? Yeah. Positions are the offset in the file where you currently are. Yeah, so a file descriptor is basically an index into this local table, and what's in the table is a pointer to a global open file entry, which has its own position, flags, and vnode. That's kind of OK-ish; we'll see more examples. Yeah, the global table can grow and shrink based on how many files are currently open, because it keeps track of what every single process on the machine has open. Every single thing. So we'll clear this up with more examples, but yeah: the file descriptor is just an index into your local table, each entry of which points into the system-wide global open file table, which we creatively call the GOFT, and each GOFT entry holds that information: the position, the flags, and a pointer to a vnode.
That vnode is something that supports reads and writes: pipes, sockets, regular files; and we'll see the underlying details of how you store an actual file. So now we have to talk about what happens during a fork, because odd things happen. I guarantee you'll run into this issue once you start making new processes and then accessing files, so it's good to talk about it now, before it becomes a problem later. The same fork rules apply: the process control block for the new process is an exact copy of the parent's, and everything in it gets copied, including the file table. Specifically, if we only care about files, the local open file table gets completely copied into the clone, or the new process, or the child, whatever you want to call it. That means both process control blocks' local tables point to the same global open file table entries, which means that if one process reads and updates the position, well, guess what: they're both sharing that same position, so the other won't be able to read that part of the file unless it seeks backwards or something. Here's what that looks like after a fork: if process one's control block had an entry at index zero, then when we create the second process control block, it also has an entry at index zero, pointing to the same entry in the global open file table. They're sharing the position, and through it they point to the same file. So what impact does this have? Say the file has two lines in it, and both of these processes do a read system call: what's going to happen?
Yeah: because they share the same position, and we don't know which one executes first, whichever one happens to execute first may read the contents of the entire file and update the position to the end; if the other process then tries to read, it'll probably get end-of-file, because the position is at the end of the file. Yeah, read updates the position. And with partial reads, say there are two lines and we wrote it so that each process reads one line, we don't know what will happen: one process may read one line and the other process the other line, or one may read both and the other none. All we know is that across the processes, the contents of the file get read once in total, possibly broken up between them. Any questions about that? This will cause you difficulties if you share a file between processes and assume you can read the entire thing in both of them, because that won't be true. Yeah: whenever you call open, it makes a new global open file table entry. We'll have an example, so we'll see that.
So yeah, these are some gotchas, and I guarantee you will encounter them; this is why you're taking this course. If someone comes to you and says, hey, I forked and one of my processes can't see the entire file, you can tell them: that's because they share a position, so across both processes the file only ever gets read once in total. And it gets worse once you remember lseek, because that changes the position too. So, like in real life, you can screw with your parents: create a new process, and the child can do an lseek system call that resets the position back to the beginning or somewhere, and now you've messed with your parent, maybe making it overwrite its own data. You might not want that. This is also why I was harping that you should close any file descriptors you're not using: you don't want to hand them to your child, because the child might start screwing with them, and you do not want that. If you don't want a child messing with a descriptor, close it before you exec and they can't touch it. Otherwise, here's how you can test how well someone wrote a program: just go through all the file descriptors and lseek them to random spots and see what happens. If you can screw up the parent at all, that's something they'll never be able to debug. That's what we call job security. All right, let's see if we know what we're talking about. Say I have a single process with no file descriptors open, for whatever reason, and it does this: open("todo.txt"), then fork, then open("b.txt"). For each process, how many local open file entries do they have, and how many global open file entries exist? What's their relationship? What happens in this case in the global
table? Sorry, so: both of them have two local entries. And how many global? Three global? Well, how many entries are in the global table? All right, I heard a two and a three. Anyone else? Yeah: three total in the global, and two in each local. Let's go over it real quick. Say process 100 comes in. It has no file descriptors open, so open gives it the lowest available, which is zero. So it opens file descriptor zero, and we get a new entry in our global open file table with its own position, its own flags, and, sorry, its own vnode, and that vnode points to a file called todo.txt. Open always creates a new global open file table entry; this is the only open call so far, so it creates one entry, and that's what file descriptor zero refers to. Then it forks, creating process 101, a copy of the parent at the time of the fork. So process 101 also has file descriptor zero in its local table, pointing to the same global entry: they're sharing a position. At this point I don't know which process executes next; say, for the sake of argument, it's process 100. After the fork these processes are independent, so when process 100 calls open, a new entry appears in its local file table, probably at index one, pointing to a new global entry with its own position, its own flags, and a vnode pointer, and this one points to b.txt. Now when 101 executes its open, it's independent too: it opens file descriptor one, gets a new entry in its own table, and that corresponds to another brand-new entry in the global open file table with its own vnode. So it ends up looking like this, and both processes 100 and 101 can read the full contents of b.txt, which sometimes might be what you want. You can see how a very subtle change causes your program to behave very differently. So, if you're intending for both processes to read the entire contents of file b
.txt, will doing it like this, with the open before the fork, make that happen? Will both processes be able to read the entire contents of b.txt? No. So that's a very slight change that causes very different behavior, and I guarantee you'll run into it at some point in your life if you're using fork, so remember this. All right, any questions about that? So now we know the entirety of what a file descriptor is. That's it. There's a nicer picture than my chicken scratch on the slides. All right, now we need to store a file on our SSD or something like that, and we need to explain how we actually do that. Your SSD contains pages; those pages, the big chunks of data we use on the SSD, we'll call blocks as a very generic term. All the blocks are the same size, maybe 4 kilobytes, maybe 8. That's what we use to store our files, and we need to be able to describe how to find the contents of a file on the SSD. In this picture I have three files: a green file, a red file, and a blue file. If I'm storing them essentially like arrays, making sure all of a file's blocks are contiguous, how do I describe where the green file is? How would I describe an array? An address: its starting block. What else do I need? Its size. That's a pretty concise description of a file: for any file I can tell you where it starts and how big it is, and because it's all contiguous, that's all I need to tell you. Any problems with storing files that way? Yep: is this the same mechanic as allocating memory?
Yeah, it's very similar to allocating memory, but instead of bytes we're dealing with blocks; same idea. If we wanted to make the red file bigger and we weren't smart about it, we'd just obliterate some blue blocks and lose data. If we want to be safe about it, we can't grow the red file in place; we'd have to move it somewhere else and then grow it. Maybe for a memory allocator that isn't a big deal when the allocation is small, but what if that red file was a gigabyte, like the lecture recording? It gets a bit bigger, I have to move a gigabyte; it gets a bit bigger again, now I have to move 1.2 gigabytes, then 1.4, and you can quickly see how this is pretty bad. So this is not how we store files. Any questions about that being a fairly bad idea? And yeah, it's related to memory allocation, which we'll also talk about. The nice pros: it's space-efficient to describe any file, because with everything contiguous you just need the starting block and the size in blocks. You also get fast random access: because it's all contiguous, you can get to any byte in the file by calculating which block it's on. If all my blocks were 8,000 bytes and I wanted byte 10,000, a quick calculation says block 0 holds the first 8,000 bytes, so byte 10,000 is on block 1; and you can do that calculation for any offset, really fast random access, which is good. The problems are due to fragmentation, which I think is the first time we've hit this term. Fragmentation applies any time you have storage and you're wasting it; it basically just means wasted storage, and typically it's relative to a block. Internal fragmentation means I'm wasting space within a block: if my blocks are 8 kilobytes or so and I'm storing a file that's only 12 bytes, I'm wasting 8 kilobytes minus
12 bytes: the rest of the block is wasted, and I can't use it for anything else if I only deal in whole blocks. So internal fragmentation is space wasted within a block, and external fragmentation is space wasted between blocks. For instance, if I made the red file smaller by one block, and this little block here is now free and unused, there's a case where it might be completely useless: if I never have a file that's exactly one block, I can't use that block for anything, so it's essentially wasted space I can't do anything with, which is also not great. That wasted block is what we call external fragmentation. Any questions about the difference between those? It'll come up again when we talk about this next scheme, because the same thing applies. So we can use our friend the linked list; a lot of you like linked lists. If I describe the green file in terms of a linked list, I could put a pointer on each block, making each block's usable space a little smaller, and each block just tells me where the file continues. That seems like it could work. In this case, what is the only information you need to give me in order for me to find everything about the green file? Yeah, the start. You say this file starts at block 0, and I can figure out the rest from there: I can follow the pointers, this is its block 1, this is its block 2, its block 3, and so on. So it's space-efficient to describe a file, you just need the starting block; files can grow and shrink at will; we still have the internal fragmentation problem, but there's no external fragmentation, because one block is as good as any other. I can just point to it and use it; nothing has to be contiguous. So that's great. But would there be any performance problems if each pointer lives on its own block? Well, even if I'm using a solid state drive, I'm also kind of
screwed, because this kind of sucks: to follow the pointers, you load a block at a time into memory. If I want the fourth block of the file, I load the first block into memory, find the few bytes holding the pointer, then load the next block (it won't be cached), then the next, then the next; it's terrible. What could I do that's slightly better? Yeah, not as complicated as that; I could do that too, but I want a band-aid fix that makes this slightly better in terms of the cache. Store the pointers on the first block? Close. Yeah, block size and cache size matter, but my problem is that I'm visiting multiple blocks. What about this: if my problem is that all the pointers are on different blocks, shove all the pointers next to each other. Would that work? Seems like a band-aid solution, and we like band-aid solutions. That's exactly what a file allocation table is, and it's the first file system design that was actually used. Instead of the pointers being stored on the data blocks themselves, it makes one giant table of pointers, and the way you read the table is: the entry at index zero is the pointer for block zero, that block's next pointer. I've just moved the pointers into a different data structure so they're all beside each other. Now I just look at the table: block 0's next is block 6; after that, the next block is block 2; and these entries are close together, hopefully all cached, so I can follow the chain on and on until the file eventually ends. To describe a file it's the same as before: I tell you which block it begins at, and you use this table to follow all the next pointers. That's it. This table also has to be stored on the disk. Problems
with this table? What happens when you run out? This table can get kind of big; in fact it's as big as the hard drive requires, because it has to hold a pointer for every block, so how many pointers you need depends on how big the drive is. That's the big drawback: buy a bigger disk, and suddenly the table is bigger and you're spending more space on pointers, even for blocks you're not using; you have to allocate them anyway. But guess what: who uses Windows here? Have you used this as a file system, a file allocation table? Have you ever seen the words FAT32? What, you've never heard of it? USB drives are sometimes formatted like that. Anyone else heard of FAT32? Some people have. FAT32 is file allocation table, FAT, not trying to be clever, and 32 is the size of the pointers: the pointers are 32 bits, so they can point to up to 2 to the 32 things. How many pointers you need depends on the size of the disk. If you're using Windows, this is the file system it uses for your boot partition, because it is very, very simple; the drawback is that a bigger partition needs a bigger table, but your boot partition typically isn't that big. So Windows actually uses this to boot: this is where your BIOS will find the Windows kernel to load. Yeah, the number of blocks; the table's size is the number of blocks, because you need a pointer for each block. And yeah, if it's not the full disk: a partition just means that instead of one file system covering the complete disk, a file system covers a smaller part of it. Any questions about that? So FAT32, file allocation table, the first one actually used; your USB drive may or may not be formatted with it; 32 is the size of the pointers it stores, 32-bit pointers, so they can point to up to 2 to the 32 different blocks; each entry just stores the index of a block. So you
can make the table as big as you want, but each entry is a 32-bit pointer. So yeah, your boot drive in Windows, the system partition, is FAT32. To see it on screen, go to, what is it called, Disk Management, I think; select your drive and it'll show you all the partitions and what file system is on each. It will say either "System" or "FAT32"; if it says System, that also means FAT32, it's just a blessed version your BIOS knows about. Yeah, that's also related to the famous 4-gigabyte limit on FAT32, though that limit is on individual file sizes, and it actually comes from the 32-bit file size field rather than the block pointers, so maybe there's a different reason for it than the one I gave. All right, so that's FAT. It uses linked lists, and linked lists are bad for caches; here we're still hopping through a linked list for random access, which is kind of bad. If linked lists are bad, what's another data structure computers like that's much faster? A heap? Kind of. An array? Yeah, an array is simple: why follow a chain when I can just have an array? For each file, have its own array that says: at index zero, here's the block it points to; your block zero is whatever is there. That's exactly what indexed allocation is. In this case I'll use one block, and on that block, same idea, I'll store pointers; but this red block holds the pointers only for the green file. I've kind of flipped the problem around and turned the chain into one giant array for that file. So this red block has all of the pointers for the green file, in order: the green
files block one points to block six on the disk green files block two points to block two on the disk so on and so forth so any questions about that just makes random access a bit bigger yeah so the link list all that table would need to be stored on a block or across several blocks depending on how big it is needs to be stored somewhere so in this case they would just probably be at the beginning and be gigantic so in this case instead of that will be more flexible and to describe our file now we just have to say hey for this file what block contains all of the pointers for this individual file and other than that we get faster random access so any questions about how that works what does oh yep so the red block in this case would not contain any other useful information just contain a bunch of pointers does this look somewhat familiar oh yeah so the red block would just represent the contents of one file like the actual contents and you will get intimately familiar with this because essentially you're creating these you'll be creating these in lab six what does this also look very similar to page tables hey guess what you already learned it so it shouldn't be that bad so in this case if it looks like a page table well we can have some scenario like this so we're using one index block it's like just one block to store the pointers we'll assume that there's no other information with it it's just a bunch of pointers and usually that is the case for most file systems of course with file systems you can just make up whatever you want but anyone that's used will have just pure pointers so in this case if my block sizes on my disk are 8 kilobytes and a pointer to a block is only 4 bytes well what is the maximum size of the file I can support with this and it will look very similar to two page tables yeah I was leaving it up because I'm just writing the same thing this is the rear case where I didn't screw it up alright so in this case first question is how many pointers 
can I fit on a block? So what's 8 kilobytes in powers of 2? 2 to the 13. 2 to the 10 is, sorry, this is bytes, so 2 to the 10 is 1 kilobyte, and 8 kilobytes is just 2 to the 13, right? 2 to the 13. So what about 4? Let's make it a bit easier on ourselves. What's 4? 2 to the power 2. So how many pointers can I fit on a block, then? 2 to the 11. So if each of those pointers points to a block, how big could my file be? OK, so we have 2 to the 11 pointers to blocks, and then we just multiply by our block size, and that is how big our file could be. Our block size is 2 to the 13, so that's 2 to the 11 times 2 to the 13, which equals 2 to the 24. Does that seem big? Yeah... that is actually only 16 megabytes. That's not very big. Would you be happy if your file system could only support files of size 16 megabytes? Yeah, maybe. Maybe if you used a computer from when I was born. Jesus, that makes me sad. Yeah, it might have been all right back then, but nowadays it's probably not good, because that's essentially an image of a cat; probably an image of a cat is bigger.

So what could I do to make this support larger files? Hint: think about page tables. Levels, yeah. I could have another level: a block of pointers that points to another block of pointers, and if I need bigger files, I just tack another level on. But, unlike your virtual memory system, well, same as your virtual memory system, the more levels you have, the slower it gets, and with this we kind of want to only pay for what we use. So we're going to do something a bit smarter than what we did for page tables.

So, to summarize: we looked at how files are stored on disk, at least we began our journey on it, and we learned more about what a file descriptor actually is. API-wise, I can open files and change the position to read and write at; we didn't really know about the position before, but it's explicit now. Each process has its own local open file table; that's what your file descriptors are, they're indexes into this table. And there's a global open file table, but if you fork, you might be sharing the position. Then, in terms of actually storing files, there are some allocation strategies: contiguous, linked, FAT, which we saw was the first one actually used, and indexed, which is not actually used as-is because we have a problem: our files can't be that big. But we'll fix that in the next lecture. So just remember, pulling for you, we're all in this together.