All right, so let's get started. So this is lecture 14 of Computer Science 162. So we're going to continue the discussion about file systems. So as a quick review, here's the performance that we looked at last week for hard drives. So magnetic hard drives on top and solid state drives here. With magnetic hard drives, remember we have this issue of we have to move physical parts. So we have to incur time for seeking and rotational delays in addition to the operating system time and the time for our controllers, and then the media time, the time to actually read it off of the drive. So in general, we'll get the highest performance out of our drives when we place things close together, when we minimize the amount of seek or the amount of rotational delay. In contrast, with a solid state drive, we don't have any moving parts. So our read times are now just a software time for queuing, a hardware time for the controller, and then a time to read out of the actual NAND flash and transfer over the bus. So very, very fast performance. An order of magnitude better than hard drives, or even higher for enterprise SSDs. Writes, however, remember, are very complicated because we can't write a page that's already been written. We have to erase it first. So the drive needs to be able to find free blocks so that it can write new data. And so in the background, it's constantly going to be doing coalescing and garbage collection to try and make sure there's a constant availability of free blocks so that you can write as fast as possible. The problem here is that writing and erasing are destructive to the drives because they require high voltage. And so you try to do some wear leveling, which is to spread the writes out. Now, ironically, the coalescing that you're doing is actually increasing the number of writes that you're doing to the drive internally. So if you write four kilobytes to the drive, because of coalescing, that four kilobytes is going to move all around the drive over the lifetime of the drive while it's on. Even if I don't write that data or touch it or even read from it, it still will be moved around by the drive. This introduces a problem because SSDs are an incredibly competitive marketplace. It means people cut corners and make errors on quality assurance. And if that happens, then one of the times that you're doing this garbage collection operation, boom, you can end up corrupting the data or losing the data. Now, I posited this as something hypothetical and said, oh, I've run into this problem and one of my colleagues ran into the problem, and what happens the day after lecture but Apple publishes a firmware patch for Toshiba SSDs in 2012 MacBook Airs. And if you read the paragraph in the middle here, it says that there might be an issue that may result in data loss, i.e. there's a bug, probably in the garbage collection algorithm. And so you should immediately apply this patch, and if it bricks your drive, then Apple will replace it, data not included. So I'm a huge, huge supporter of SSDs. I think over the next five years, on the consumer side we're gonna see hard drives disappear and everything is gonna be flash-based. That said, I just wish the engineers would take a few extra minutes and make sure that they're not introducing bugs when they're introducing the latest cool feature or performance enhancement. Yes? Yeah, that's a very good question. So when consumer hard drives first became popular, did you see similar problems?
Absolutely, so I can remember some of my earlier laptop drives were very sensitive to temperature, and we bought a bunch of laptops for my group and we lost eight drives in the first six months. And it was because of poor thermal design of the laptop, and also the drive was very thermally sensitive. So absolutely, I think this is sort of the teething pains for SSDs. It's a very, very competitive market. Some of the biggest vendors out there are losing huge amounts of money and about to go under, and that's causing them to lower prices, which is causing everyone else to sort of race to the bottom on pricing. Okay, hopefully we won't see more of these in the future, but I can guarantee you, unfortunately, it's gonna happen. Good backups. Another public safety announcement for making backups. All right, so what did we talk about with file systems? Well, there are three goals that we have. Maximizing sequential performance, because that's the most common access pattern, but also supporting efficient random access, because that's how we do paging. Your swap file: we need to read a random location in that swap file, we don't wanna have to read the entire swap file. And then we wanna make it easy to manage the files, grow the file, shrink the file, and other kinds of operations: renaming, linking, and so on. So the last one we looked at in the last lecture was the Microsoft DOS file allocation table (FAT) file system. This is the most prevalent file system out there. It's on Android smartphones, it's obviously on Windows computers. If you use an SD card or a USB stick, most likely it's formatted with a variant of the file allocation table system. So how does this work? We have a file allocation table that contains one entry for every single data block that we have on the device, and we create a file by linking from the directory entry: the directory entry basically contains the ID of the first block, and then that entry in the table contains the ID of the next block, and that entry contains the ID of the third block, and so on and so forth. So that's how we link things together. The links are stored in the file allocation table, not in the actual data blocks. So some of the properties: sequential access for devices like disk drives and floppy drives is gonna be very expensive, especially if the file allocation table's not cached in memory, because you're gonna have to seek back and forth. And random access, rather, is always going to be very expensive, because nothing is next to anything else when it comes to these blocks. There's no guarantee that you're gonna have contiguous allocations, which means lots and lots of seeks and rotational delay, which means the worst possible performance for your drive. Now, if you use this on a USB stick or SSD or on an SD card, you're not gonna see the seeks, so performance will be the same; everything's equidistant.
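To make that chain-of-links idea concrete, here is a tiny sketch in C. It is purely illustrative, not the real on-disk FAT format: the in-memory table, the block numbers, and the end-of-chain marker are all made up, but the shape is the same, the directory entry gives you the first block and each table entry gives you the next.

```c
#include <stdio.h>

/* Toy illustration of a FAT-style chain (not the real on-disk format).
 * fat[i] holds the block number that follows block i in a file,
 * or EOC (end of chain) if block i is the file's last block. */
#define NBLOCKS 16
#define EOC     -1

int fat[NBLOCKS];

/* Walk the chain starting at the file's first block (found in its
 * directory entry) and print every data block in file order. */
void list_file_blocks(int first_block) {
    for (int b = first_block; b != EOC; b = fat[b])
        printf("data block %d\n", b);   /* a real FS would read this block here */
}

int main(void) {
    for (int i = 0; i < NBLOCKS; i++) fat[i] = EOC;
    /* A 3-block file scattered as 4 -> 9 -> 2: the links live in the
     * table, not in the data blocks themselves. */
    fat[4] = 9; fat[9] = 2; fat[2] = EOC;
    list_file_blocks(4);
    return 0;
}
```

Notice that reading the file "in order" can still mean jumping all over the disk, because nothing forces blocks 4, 9, and 2 to be adjacent; that is exactly the seek problem just described.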
Okay, any questions? All right, so goals for today. We're gonna talk about some more file systems, and then we're gonna talk about naming and directories, and naming and directories should be very familiar, since everybody in this room has created files with names and created a hierarchy of directories. Okay, so just as we have multiple levels of page tables and things like that, we can use a similar kind of structure for how we organize our files. And so the first system that did this is Unix 4.1 BSD. Does anybody happen to know what BSD stands for? I hear Berkeley Software, I hear Berkeley Standard. Berkeley Software Distribution. Any idea where that was developed? Stanford, no. Not the Berklee College of Music in Boston; here. Evans Hall. So the key idea for 4.1 BSD was to make small files very efficient but also support large files. So here's how it worked. You have a file header. So this is the data structure that defines a file. So it contains a bunch of information like the mode of the file, so these are the access control bits, and the owner of the file, and the timestamps for when the file was last accessed, when it was created, when it was last modified, the size of the file, and then some other metadata, and then a bunch of pointers. This header is called an inode in Unix. And these pointers here, this is a fixed size table, but not all these pointers are equivalent. In particular, the first 10 pointers here point directly to data blocks on the disk. So these would be block IDs for blocks on the disk. The next pointer, this 11th pointer, is called a singly indirect pointer, and it points to a block of pointers to data blocks. That's why it's called singly indirect, because there's one level of indirection. So this gives us 256 more data blocks that we can reference. The next pointer, the 12th pointer, is a doubly indirect pointer. So it points to a doubly indirect block that contains 256 pointers to singly indirect blocks, which each contain 256 pointers to data blocks. And if you want a really big file, the 13th pointer is a triply indirect pointer, which points to a triply indirect block that has 256 pointers to doubly indirect blocks. Each one of those doubly indirect blocks contains 256 pointers to singly indirect blocks. Each of those contains 256 pointers to the actual data blocks. So this means we can have really large files, 16 gigabyte files. Well, that's not really very large today. You can have Blu-ray files that are 25 gigabytes or more. But at the time, this was many orders of magnitude larger than the biggest hard drive. The biggest hard drive at the time was probably in the tens or hundreds of megabytes. So 16 gigabytes was viewed as, we'll never create files that big. Now people create files that are hundreds of terabytes in size and have file systems that manage many petabytes worth of data. Okay, but for the time, it was probably the right decision. Again, this is one of those things where you want to go back and periodically revisit. So the file allocation table: originally that file system was designed for floppy drives, right? 1.44 megabytes worth of data. And now it's used for memory sticks that have 64 gigabytes of data. And to make that work, they actually had to make the block field larger. So there's a new version of the file allocation table system called the FAT32 file system. So here, one of the big advantages, yes, question? Yes. Okay, so the question is, what is the inode? So the inode is just this table, okay? So this table contains a set of pointers, right? And then the rest of this is the rest of the file metadata structure, okay? Which tells us where to find the blocks. But the inode itself is just this entry, okay? Everything else here is blocks that are stored on the disk. The inodes are small, so we can actually put a bunch of inodes into a single disk block. Yes? Yeah, that's a really good question. So why have all these different pointers, right? Why not just have the triply indirect pointer?
Just say this points to a hierarchy of blocks. Or not do that, yeah. Okay, so the one comment was it would make it more difficult to calculate how many entries we actually had. So sure, we'd have to figure out some kind of system for knowing which parts of the tree were filled in and which parts of the tree were not filled in. Exactly, so if our file is small, then we don't need all of these additional data structures. If our file is less than 10 data blocks in size, then this is all we need to reference the file and to know exactly where to find all of the file's data blocks, okay? So again, the idea here was we wanted to make it really efficient for small files. So if your file is less than 10 data blocks, the inode itself is all you need to know where to find all of the components of the file. And if we have that cached in memory, then reading a block in the file just requires looking at the thing in memory to figure out where to go. No disk access is required aside from the actual data access. Yes? Ah, so why not make it larger than just triple? You could, sure, absolutely. So the reason why you have a cap is because these are fixed size. And by making them fixed size, you can pack them into a table; they were originally stored in a table on the outside edges of the disk. And you knew exactly how to index: given an inode, or an inumber rather, you could index directly into the disk and figure out exactly where to find the inode. If they were variable size, then you wouldn't know where one began and the next one ended. Any other questions? Okay, so with this, we can reference files up to 16 gigabytes in size, and we grow dynamically. So if we have a small file, we don't need a lot of the data structure. It's only when the file grows that we sort of organically start adding more of these indirect blocks and filling it in as the file gets bigger and bigger. Okay, so let's look at some examples. And that'll hopefully make this even clearer. So if we want to access block 23 of our file, and we'll assume the file header has already been loaded into memory, so that we already have the inode cached in memory when we open the file, how many disk accesses is that gonna require to read block 23? Any ideas? Two, absolutely correct. So one to read the singly indirect block, because this is more than the first 10 blocks, and then one to actually read the data. Okay, so two accesses to read block 23. How about block five? One, right? Because it's gonna be one of these pointers right here, and it's already in memory, so we can just go and read the block. Let's get a little bit more complicated. How about block 340? Three. It's more than the first 10, it's more than the next 256, so that covers the first 266. And so we have to read the doubly indirect block, and then we have to read the singly indirect block, and then we can actually read the data. Okay? So pros and cons here. So the advantage here is it's very simple, relatively speaking. Yes, question? So the question is, why is it that these pointers only point to a single block? Because if they pointed to more blocks, you'd have to be consistent. You'd have to always say, okay, a pointer's gonna point to two blocks, because otherwise, how would I know whether this pointer pointed to one block, or two blocks, or four blocks, or some number? So there'd have to be a count associated. Either have a count associated with each pointer, or the pointers would have to point to something fixed size. Again, everything here is made uniform and fixed size for simplicity.
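Going back to the block 23 and block 340 examples for a moment, here is a small sketch that turns them into arithmetic. It assumes the numbers from the slide (10 direct pointers in the inode, 256 pointers per indirect block) and assumes the inode itself is already cached, so we only count indirect-block reads plus the data-block read; the function name is made up for illustration.

```c
#include <stdio.h>

#define NDIRECT   10            /* direct pointers in the inode   */
#define NINDIRECT 256           /* pointers per indirect block    */

/* Number of disk reads to fetch logical block `n` of a file,
 * assuming the inode is already cached in memory. */
int reads_for_block(long n) {
    if (n < NDIRECT)
        return 1;                           /* data block only                  */
    n -= NDIRECT;
    if (n < NINDIRECT)
        return 2;                           /* singly indirect + data           */
    n -= NINDIRECT;
    if (n < (long)NINDIRECT * NINDIRECT)
        return 3;                           /* doubly + singly + data           */
    return 4;                               /* triply + doubly + singly + data  */
}

int main(void) {
    printf("block   5 -> %d reads\n", reads_for_block(5));    /* 1 */
    printf("block  23 -> %d reads\n", reads_for_block(23));   /* 2 */
    printf("block 340 -> %d reads\n", reads_for_block(340));  /* 3 */
    return 0;
}
```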
You can always make a file system more complicated. There are a lot of file systems out there like that. So another advantage is files can grow pretty large. 16 gigabytes was ridiculously large at the time. Today we'd say 16 gigabytes, well, you know, that's not very big. And small files in particular are very inexpensive and very easy to access, because we're just using the first set of blocks, the direct blocks, or in the worst case, we go to a singly indirect block. The disadvantages, however: where do these data blocks live? All over the disk. So lots of seeks, and of course, every time I say seek, I also mean you incur the rotational delay. And if you have large files, you could potentially be doing four IOs for each read. If you're doing random IO to a very large file, you could potentially be walking down the tree, and depending on what kind of caching you're able to do of the tree structure, much like address translation, you're gonna pay a very high cost. So this was one of the most popular file systems for quite a long time. But they recognized that there were these flaws, and in the next version, they tried to address them. Some of them at least. Okay, so 4.2 BSD has the same file header and the same triply indirect blocks, but they incorporated a bunch of ideas from the Cray-1's operating system, the Demos operating system. How many people have actually seen a Cray-1? Okay, you should take a trip down to the Computer History Museum down in the South Bay, and in their lobby, they have a Cray-1 and you can actually sit on it. It's designed in the shape of a C because that made the backplane of it as short as possible. And the cooling equipment, the liquid cooling that it used, was actually in the bench that's around it, and they covered the bench in leather and everything, because they thought, you know, here's the world's first supercomputer, we don't want it to be this big scary machine. We want it to be very inviting, and oh, you can sit on it. I have no idea how many people actually got to sit on a working Cray, but you can go and sit down on a non-working Cray. It's a really cool museum to go down to if you haven't been to it before. So what did it incorporate from the Cray? A couple of things. One is using a bitmap instead of having a free list. So a free list is really easy to maintain. You have a free block, you just pop it on the free list. You need a free block, you pull it off of the free list. But then when you're allocating a file, you end up with blocks scattered all over the disk, and that's not gonna be good for performance. With a bitmap, now when I need to find blocks, I can just look in the bitmap, right? And then I can find a zero. That gives me an unallocated block. If I find a run of zeros, that gives me a contiguous range of unallocated blocks, and that's a good thing. So that leads into the second thing, which is they then tried to allocate files contiguously. So we look for these runs of zeros and allocate within those runs of zeros.
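Here is a minimal sketch of that scan. For readability it uses one byte per block rather than one bit, and the names are made up; a real bitmap packs eight blocks per byte and lives in dedicated disk blocks.

```c
#include <stdio.h>

/* Toy free-block map: one byte per block for readability
 * (a real bitmap packs eight blocks into each byte).
 * 0 = free, 1 = allocated. */
#define NBLOCKS 32
unsigned char used[NBLOCKS];

/* Return the first block of a run of `want` contiguous free blocks,
 * or -1 if no such run exists. */
int find_free_run(int want) {
    int run = 0;
    for (int b = 0; b < NBLOCKS; b++) {
        run = used[b] ? 0 : run + 1;
        if (run == want)
            return b - want + 1;        /* start of the run */
    }
    return -1;
}

int main(void) {
    /* Mark a few blocks as allocated so the map looks like Swiss cheese. */
    used[0] = used[1] = used[5] = used[6] = used[7] = used[20] = 1;
    printf("a run of 4 free blocks starts at block %d\n", find_free_run(4));
    return 0;
}
```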
They also did two other interesting techniques. One is they reserved 10% of the disk space. We're gonna see this is very much like the last lecture, where your SSD reserves some amount of space to try and improve its probability of having a free, erased block when you need a block to write to. And then another technique that they did is called skip sector positioning. So we're gonna talk about those in the next couple of slides. Now, when we create a file, we have this big challenge of how big is it gonna become? You could ask the programmer or ask the user. I have a feeling they probably won't, in most cases, be able to give you an answer. You're creating a log file. How big is it gonna be? So this really makes it challenging to figure out, okay, well, how much contiguous space should I allocate for this file? So the solution that they came up with was just to find a range of free blocks, which again is really easy with the bitmap, because I just scan through the bitmap looking for a run of zeros. As soon as I find a run of zeros, I turn around and allocate my file at the beginning of that range of free blocks. Now the file can grow, and those blocks that are allocated are gonna be contiguous. Yes, question? Sure, so the question is, if we've got a bunch of file creations going on at the same time, could we end up with them all on top of each other in these free ranges? In some cases, that's actually a good thing. So for example, if I'm creating a bunch of files in a directory, then it's really good to have the files in a directory be located close together, because then when you do a grep of the directory, or in general, people tend to access the files that are in a directory together. When you load your project files, you're loading them all from the same set of directories. So if they're all together, then your drive doesn't have to do lots of seeks. If they're not in the same directory, then you can simply pick a random spot in the bitmap and look for a run of zeros. Okay, so now when you hit the end of a range, when your file is growing and you hit the end, you just look for another successive range. So what's gonna happen is your file may not be 100% contiguous if it's big, but you're gonna have contiguous segments to the file, and that's a huge win, because those contiguous segments are gonna have much higher performance. Okay, and this gets to the second part, which is they also try to store files from the same directory together. So put them either on the same track or in the same cylinder group. That way, I don't have to do any seeks in order to read additional files. Now the challenge here is I need to be able to find runs of zeros. So that makes an assumption that my disk's allocation sort of looks like Swiss cheese: big holes in it, not little teeny tiny holes. If my disk is empty, that's always gonna be the case. If my disk is full, that's not gonna be the case. There won't be many of these holes. So we kind of want our disk to be somewhere in the middle. And so that raises the question, how full are our drives? How many people have more than 20% free on their hard drive? Oh, you guys don't store anything. I put a half terabyte drive in this and I've got like five gigabytes free at any given time. If you look at most systems, the disks are always full. So about 10 years ago, I looked at our total department storage in the EECS department. It grew from 300 gigabytes to one terabyte in a year, so it more than tripled in one year.
That was probably the introduction of a new version of Office or something like that. Now we actually have many tens of terabytes. I think we have on the order of 60 terabytes worth of data in use. Here's actually a graph from our director of IT of our usage over the last, what was it, eight years. And you can see we had relatively moderate growth rates. And then disks became really cheap. And so we bought a whole pile of disks. And so we actually wanted people to use them. So we dropped the rates and we actually changed the way we did billing. And you can see we entered a phase of somewhat exponential growth, right? So on the axes here we have time, and this is the amount of storage that was in use. Project is our research groups. Home is our users' home directories, not home pages, home directories rather. And IMAP is from when we used to have a departmental IMAP service. In 2010 again we restructured the rates, and you can see the slope of the curve. The rates went down. The rates always go down, they never go up. The slope got sharper; research groups started storing and saving more and more data, and people started saving more in their home directories. And then again last year we dropped rates, and again you can see we're growing even faster in our storage, okay? So this is gonna be a challenge. You're always gonna find that people keep more and more things, especially when you drop the price. No incentive to throw anything away. Now, so what do you do when you do have a system where the drive is full, the file system becomes full? Well, one approach that you could take, and one of the systems I actually used as an undergraduate did this, is to make an announcement that disk space is running low. It was the software engineering class. I'm there late at night working away on my project, and all of a sudden, boop, up pops this message saying we're running out of disk space, please delete some files. Now if you're working on your project, what is the first thing that you're gonna do if you see a message like that? You're gonna save, right? And it's a race, because you know that there's an ever decreasing amount of free space available. So you wanna make sure that your save completes successfully, and not, oops, you only saved half your project and the rest got lost. So as soon as that message popped up, Pavlovian, you were like, immediately save my file. So this doesn't work, right? Because when you say that disk space is low, people just use it all up and it all disappears. Now maybe the solution is you buy a larger drive or something like that; at least for me that's never worked. Every laptop I've gotten, I've doubled the size of the drive in it, and yet I never have any space available. So I think for many people that's gonna be true, and so we'll just assume drives are full. So what's the solution? So one solution is to lie. So the operating system is basically gonna say we're not gonna let the drive get full. We're gonna reserve a portion of the drive, and that means that if we look at how many blocks are free in the bitmap, we're never gonna allow allocations if they would take our count below the reserve. Now how much is a good reserve? In practice it turns out like 10% is good. So UNIX typically reserves 10% of the disk. Now there's another benefit that you get. If your drive fills 100%, the system administrator can't log in, because the login process writes the syslog and other files. And if it can't write those, it won't complete the login process.
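The reserve itself is just a check in the block allocator. Here is a sketch of the policy, with made-up bookkeeping fields and function names; the point is simply that ordinary processes are refused once free space would dip below roughly 10%, while root is allowed through.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative free-space policy: refuse allocations once the count of
 * free blocks would drop below the reserve, unless the requester is root.
 * The structure and field names here are made up for the sketch. */
struct fs_stats {
    long total_blocks;
    long free_blocks;
};

bool may_allocate(const struct fs_stats *fs, long want, bool is_root) {
    long reserve = fs->total_blocks / 10;           /* keep ~10% back   */
    if (is_root)
        return fs->free_blocks >= want;             /* root may dip in  */
    return fs->free_blocks - want >= reserve;       /* others may not   */
}

int main(void) {
    struct fs_stats fs = { .total_blocks = 1000, .free_blocks = 105 };
    printf("user asks for 10 blocks: %s\n",
           may_allocate(&fs, 10, false) ? "ok" : "denied");   /* denied */
    printf("root asks for 10 blocks: %s\n",
           may_allocate(&fs, 10, true)  ? "ok" : "denied");   /* ok     */
    return 0;
}
```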
So this is yet another good reason to not allow your drive to really become full. So root processes are allowed to ignore this and still continue. Question? Yeah, so the question is, and this is a really good question, why do we do this? So the reason why we do this is because it increases the probability that we'll find big holes. So by forcing you to have 10% of your drive be free, we're more likely to find runs of zeros as opposed to individual blocks that are free on the disk. And if we can allocate within a run of zeros, we get better performance, because accessing that file is now gonna be a seek-free access as opposed to requiring some seeks. Yeah, question in the back? That's correct. The 10% that's reserved is not in and of itself contiguous. It's not like we say, okay, we're not gonna allow you to allocate any data above logical block number five million. It's just saying that we're not gonna let you allocate if we don't have that 10% reserve in our count of free blocks in the bitmap. So why not just tell the user that they're using more than 90% and they're gonna start seeing fragmentation? How many users do you think would really pay attention to that message, or even understand what that message means? They wouldn't, in either case, understand it or pay attention, most likely. And so this is a way of making sure we get good performance. Again, the same is true with your SSD. If you don't fill up your SSD, there's a greater probability you're gonna have free erased blocks, but people fill up their drives, store lots of music and stuff on them. And so by enforcing it below the user's control, we're trading off some cost, making the drive effectively a bit more expensive, for contiguous allocations and better performance. So it's a reasonable trade-off to make. And most file systems make some trade-off along this line. But it was controversial when it was first introduced, because now you're saying, I'm gonna pay 10% more for my already expensive disk drive for this sort of claim of performance. But they were actually able to demonstrate you got much better performance in the long term by having 10% be reserved. Okay, so other things we need to deal with. So we've talked a lot about seeks, but rotational delay is equally important, because rotational delay is of comparable cost to a seek. So we'd like to avoid any kind of rotational delay. The problem is we can miss blocks because of the rotational delay. The issue is this: we do a read. So I read a block, now my application does some compute, and then it says give me the next block, or give me the next thousand bytes. And so the OS goes to read the next block. Well, by the time it makes that request to read the next block, the drive, which is spinning continuously, has passed that block. Now we have to wait for it to rotate all the way back around before we can read it. And so we can get into this kind of cadence where we read a block, wait a rotation, read a block, wait a rotation, read a block, wait a rotation. So very poor performance. So the first solution is something called interleaving, or skip sector positioning. So the idea is that if these are our sectors on the disk, rather than storing our blocks one after the other, we're gonna alternate them. So first block zero will go here, then block one, then block two, then block three, then block four, five, six, seven. So when you get a new drive, you put it into your computer and you profile it to see, if you did a little bit of compute, how quickly you could read the next block from the disk.
How far would it have rotated? And based on how far it rotated, you'd set this interleave factor. So I just did an interleave of one, but you could have an interleave of two or three or whatever was appropriate, depending on how fast the drive was and how fast your processor was. Okay? So, yes, blocks are bigger than sectors. So the question is, can we find contiguous runs? So the file system is allocated in terms of blocks. The disk deals with sectors, which are smaller. And sometimes it's one to one, or it's two sectors for every block. It depends on the file system and depends on the drive and the size of sectors. Now, where sectors are much larger on modern drives, I think they're like four kilobyte sectors or maybe even 16 kilobyte sectors, the block size is large and the block size is typically equal to the sector size. Okay, so this is one approach, and the downside of this approach is every time I change my processor, so if I take my drive out of a computer and put it into a new computer because I wanna transfer over my data, I need to re-perform this calculation, because if that new computer is faster, then maybe I can use a different interleave. And similarly, every time I buy a drive, the drives are probably getting faster and faster, so I may have to use larger and larger interleave factors if I'm not also updating my processor at the same time. So another alternative is to just prefetch. So now when the application says give me bytes zero through 1024, I read the first block, if my blocks are one K in size, and then I automatically read the next block. So now when the user comes back and asks for the next block, bytes 1025 to 2048 rather, I already have those in memory. So that's prefetching. But the downside of prefetching is if we do it in the operating system, we're taking away from other requests that might wanna use the disk. So it's a trade-off. Anytime you prefetch, you may get better performance, or you may get worse performance if you're too aggressive about the prefetching. Instead of prefetching the next block, you decided to prefetch the next 10 blocks. What if I don't use those? That's wasted time on the drive when it could have been busy servicing other requests, because it has a queue of requests. So the other alternative, and what modern drives do, is it's actually done by the drive itself. So it has a track buffer, and when you go to read a particular sector or block of a track, it just reads the entire track into memory. So just as it's spinning around, it just stores it into memory for you. So if you do that, then you're gonna read it directly out of the RAM of the drive. So modern drives have RAM in addition to having a spinning drive. Was there a question? All right, so also, these file systems were designed at a time when drives were dumb and all of the intelligence and the logic was controlled in the operating system. But modern drives actually do everything. They have track buffers. They actually reschedule the queue of requests. They do elevator scheduling. If there are bad blocks, they automatically detect them and remap those bad blocks out of a pool of backup blocks. And so again, there is the risk that errors can be introduced, because all of this is running in software. So same as with SSDs, there's that risk. Any questions?
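Circling back to the skip sector idea for a second, here is a toy sketch of how an interleaved track could be laid out. It is purely illustrative: real formatting was done by the controller, and the interleave factor would come from the profiling step described a moment ago.

```c
#include <stdio.h>

/* Sketch of skip-sector (interleaved) layout for one track.
 * With an interleave of 1 we leave one physical sector between
 * consecutive logical blocks, so the disk has rotated past exactly
 * one sector while the host does its little bit of compute. */
#define SECTORS 8

int main(void) {
    int track[SECTORS];
    int interleave = 1;                 /* sectors skipped between blocks */
    int step = interleave + 1;

    for (int i = 0; i < SECTORS; i++) track[i] = -1;

    int pos = 0;
    for (int block = 0; block < SECTORS; block++) {
        while (track[pos] != -1)        /* slot already taken: slide forward */
            pos = (pos + 1) % SECTORS;
        track[pos] = block;
        pos = (pos + step) % SECTORS;
    }

    /* Physical order around the track: 0 4 1 5 2 6 3 7 for an
     * interleave of 1 on an 8-sector track. */
    for (int i = 0; i < SECTORS; i++) printf("%d ", track[i]);
    printf("\n");
    return 0;
}
```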
Okay, so some administrative notes. We had an exam. It was a bit longer than I had anticipated. It was only the second time in teaching this class that I've had an exam that was too long. But in the end, actually, most people did quite well. So we just posted on Piazza about the solutions. You can look at the solutions on the sections and old exams page. And on PandaGrader, you can actually look at your graded exam. Our mean was 69, or actually 70. That's probably about four points lower than I would have liked, but it's still not bad. A median of 72, standard deviation of 14 and a quarter, which is slightly higher than is typical. Usually it's around 10 or 11. What does this mean? If you scored more than two standard deviations below the mean, you should go and see your TA, or you should see me or Professor Canney, because you're not doing very well in this class, and you're in serious danger of not passing the class. We have lots of resources to help you. We have four TAs with office hours. We have two professors, so we have four hours of faculty office hours. So there's no reason that you should do poorly in this class. So again, if you're down here, you definitely need to talk to one of the course staff as soon as possible so we can help you. Now, PandaGrader makes it really easy to ask to have an exam question regraded. Push that button. If you push that button, which you have to do by Friday at midnight, we're gonna regrade your entire exam. We were really lenient in interpreting correct answers, especially for problems like problem number six, and also problem two. If it looked like you had a correct solution, we gave you a lot of credit. And we do this because we really don't want to have to deal with 166 regrade requests of six problems. So if you ask us to regrade your exam, we're gonna regrade the entire exam, and we're gonna grade to a very strict interpretation of the correct answer. My experience is you gain a few points, you lose a few points, and in the net you typically lose points, because we were really lenient. You'll see when you go on PandaGrader and look through the entire exam, you'll see that we were very lenient. If we could find a valid solution in your answer, we tried to take it. That said, it was a lot of late hours for the TAs and staff, and so we may have made mistakes. If we really did make a mistake, then please do bring it to our attention and we will fix it. There's an anonymous course survey on SurveyMonkey. Please fill that out. That's your way to provide feedback, and yes, we know the exam was too long. But otherwise, don't worry, midterm number two will not be as long. In-class exams are typically five questions, not six. If you have constructive feedback, please provide it to us, because we'll try and make changes based on your feedback. With that, are there any questions about the exam or anything else? Okay, so quiz. You haven't had enough tests in this class. All right, so we've got five quiz questions to think about. First one, with the file allocation table, pointers are maintained in the data blocks. Second question, the UNIX file system is more efficient than the file allocation table for random access. Third question, the skip sector positioning technique allows reading consecutive blocks on a track. Fourth question, maintaining the free blocks in a list is more efficient than using a bitmap. And our fifth question, in UNIX, accessing random data in a large file is on average slower than in a small file. So think about these while we take our five minute break. Okay, so let's get started.
First question, with the file allocation table, pointers are maintained in the data blocks. How many people think that is true? Okay, and how many people think that's false? That is in fact false. The file allocation table is where we maintain these pointers. Question number two, the UNIX file system is more efficient than the file allocation table for random access. How many people think that's true? Okay, how many people think that's false? That is in fact true, right? Because the UNIX file system is going to try and make things be as contiguous as possible, whereas with the file allocation table, things are scattered all over the disk. The skip sector positioning technique allows reading consecutive blocks on a track. Okay, how many people think that is true? And how many people think that is false? The answer is true, right? Because it's doing this interleaving, it allows you to have consecutive reads, continuous reads rather, without having to wait for things to rotate around. That's a good point. I mean consecutive from the point of view of the file system. I'll have to change this for next year. Okay, it does not allow you to read consecutive physical blocks on the disk, but consecutive logical blocks. Question number four, maintaining the free blocks in a list is more efficient than using a bitmap. How many people think that is true? And how many people think that's false? Okay, just like memory allocation: much more efficient to use the bitmap rather than a free list. Same exact problem as doing your memory allocation. And our last question: in UNIX, accessing random data in a large file is on average slower than in a small file. How many people think that is true? Okay, and how many people think that's false? Okay, that is indeed true, right? Because for a large file, we're going to have to read through this hierarchy in order to figure out where the actual data block is located. In a small file, that's all gonna be cached in memory, because that's all gonna be in the file header. Yes? Ah, so the question is, is there ever a case where accessing data in a large file is faster than in a small file? Only if in the small file the blocks were not contiguously allocated, and in the large file they were completely contiguously allocated; then yes. But if you took a file as a subset, then I don't think that would be the case. Okay, so yes, question? Ah, so the question is, is there ever a case where it's better to maintain a list instead of a bitmap? I don't think so. The bitmap is a very compact data structure, and with a list, you'd have to store block IDs or something like that. And then finding contiguous blocks is very easy in a bitmap, whereas it would be much harder to do in a free list. So for memory management, memory management in the operating system is all done using bitmaps. Okay, so how do we actually access files? All the information we need to know about a file is going to be accessible through its inode. This is its file header. Inodes are global resources. They are logically stored in a global array indexed by a number. Since it's an inode, that's an inumber. All this predates Apple. Now, once you've loaded in this inode, you know how to find all of the blocks of the file, because you know where all of the indirect blocks, singly, doubly, and so on, are located and where all the data blocks are located. Now, that's inside the operating system, but remember, we have applications and users. So how does a user actually ask for a specific file?
So one way would be just to specify the inumber. So I want file, you know, open file 14553344 for me. Yeah, that's probably not gonna be a good operating system. An alternative is to give a name. And then we just have to map, in the operating system, this name to the inumber. So this is indirection. Another approach is an icon, right? Just point and click. That's how Apple made its money initially: it introduced a graphical user interface for the first Macintosh. Everybody else was using these primitive, not Windows, I'm sorry, DOS environments with a command line, having to remember the names of files. The Mac came along, and a three-year-old could just sit there and click on the file that they wanted to open. So again, we still have an indirection issue. We have to map from an icon to a name, ultimately to an inumber. So this is naming. And more formally, naming, or name resolution, is the process that the operating system uses to translate from some user-visible, intelligible, understandable name into some system resource. And we're gonna see this again and again and again. I don't remember my machine at MIT by 18.26.4.9; I remember it by rover.lcs.mit.edu. So we're gonna see this everywhere. Okay, now for files, we're translating from these strings or from these icons into inumbers and ultimately inodes. Now, we can extend our file system. There's no reason why our file system has to be on a single machine. It can be a distributed or even a global file system, in which case now we're gonna convert from a string or an icon into a physical machine name and some inumber on that machine. So indirection is very powerful, very, very powerful concept. Now, we take our names and we organize them into directories. So everybody knows all of this. Directories are just a relation that we use for naming. It's a table that maps file names to inumbers. That's it. Just these tuples are all we really have in a directory. There's also some metadata and stuff like that. Now, how do we actually construct these directories? We just store them in files, because that's reuse. Easy way to do it. They have to be very quickly searchable, so we could either store them as a list or as a hash table, right? Most file systems use a list. Not very efficient if you have large directories. So you may notice if you put all your files, or a large number, like a few thousand files, into a single directory, that things are kind of sluggish, and that's because all of these directory operations are just going through the list. And so that's a case where you may wanna create some hierarchy and create some subdirectories. We'll get to that in a moment. Typically, directories are cached in memory so that they can be searched much faster, rather than constantly going back to the disk. But again, if you have a very large directory, you're gonna spend a lot of time going back to the disk. Now, how do we modify the directories? Well, originally, an application could just open the directory file and modify it. Why is this a really bad idea? A couple of reasons why. But why do you think this is a really bad idea? Yes. Ah, what if the person's program fails halfway through while it's writing? So what if it has a bug? You've now corrupted the directory and lost all your files that were in that directory, or some number of them. So that's a very good reason. That's a very important reason why we might not want to let applications do it. What's another reason?
Exactly, you could maliciously or accidentally corrupt the directory. More reasons. What are other reasons? Yeah, exactly. The user program has to understand exactly how the file system is organized and all the metadata bits and how inumbers are used and everything. And if I want to change my file system from a FAT file system to FFS, I can't without having to go back and change all of the programs that depend on running on the FAT file system. And there are a lot of file systems out there. And so that would mean either I have to write a program that could target all those file systems, or I use system calls. So I put all that complexity into the operating system. What's another reason? Yes, sure. We might not want to allow programs to be able to rename certain files. Also, we might not want to allow them to change the inumber to point to a system file, because that might let them read or write that file. So we want to control what they're able to do and enforce permissions. And the easiest way to do that, again, is to put it into the operating system, make it a system call. So we get portability. We get correctness, assuming the file system is implemented correctly. We avoid malicious users or users who inadvertently corrupt the data structures. We can control through permissions what you can rename and what you can point at. So a lot of power that we get by putting it into the operating system. Now, when you create a file, the operating system allocates a new inode for you and then puts that file in a particular directory. So a file always has to be created within a directory. Now, how are our directories organized? They are organized hierarchically. This seems obvious, but in the 70s, it wasn't. In the 70s, your mainframes with their direct attached storage devices, their DASD drives, used volumes. What was a volume? A directory, and there was no nesting. So if you wanted to store your files in a different organization, you'd have to create more volumes and then put the files into those volumes. So the introduction of hierarchy makes it a lot easier. So our entries now in a directory can either be files or they can be directories themselves. And we now name our files by an ordered set. So slash, our top level directory; the programs directory in that; the p directory in that; and our file, list. So again, everybody should be very familiar with this, because we use it every day. Now, the structure of our directories. It's actually not strictly a hierarchy; it's more of a graph that can be cyclic or acyclic. And we have entries that can either be soft or hard links. So a hard link means that in the directory entries, so the directory entry here for book, we store the same inumber in both directories. That's a hard link. So it means both of those directories are pointing at the same directory file. We can do the same thing with files. We can make hard links to a file. So this unhex file here can be the same file. So this is a hard link here. So the same inumber is stored here as is stored here. This means we have to do reference counting, because if we delete the file here, we can't actually delete the file, because there's another link to it. So we have to reference count as soon as we introduce hard links. That introduces complexity in our file system. So to avoid that, we have soft links. These are also called shortcuts or other sorts of things in many operating systems. And all it is is you create a special file that has a path like this in it.
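On a POSIX system you can create both kinds of links yourself with the link() and symlink() system calls. Here is a tiny sketch; the file names are made up.

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Hard link: a second directory entry carrying the same inumber,
     * so "copy.txt" and "orig.txt" are the very same file. */
    if (link("orig.txt", "copy.txt") < 0)
        perror("link");

    /* Soft (symbolic) link: a little file that just stores the path
     * "orig.txt"; it breaks if the target is moved or renamed. */
    if (symlink("orig.txt", "alias.txt") < 0)
        perror("symlink");

    return 0;
}
```

If you then run ls -li on orig.txt and copy.txt, they show the same inumber and a link count of 2, and removing one of them just decrements that count; alias.txt, by contrast, stops resolving as soon as orig.txt is renamed or moved.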
So the advantage of hard links is if I move this file around, I decide to move unhex into this directory, I can still get to it from this name. With a soft link, I can't, because the soft link contained this hard path all the way down to the file. So if I move the file to a different directory, soft links break. So that's where "Windows cannot find the shortcut" pops up when you go to click on something, because you moved it behind the scenes. But the advantage of soft links is you don't have to reference count, and soft links can cross file systems. So I can have a soft link on my machine that points to one of the instructional machines over NFS, over the network file system. I can't do that with hard links. Hard links have to be within the same file system, because I'm storing the same inumber, and inumbers are unique to a particular file system. Everybody understand that? A very important difference. In most operating systems, most people end up using soft links, these shortcuts. Really simplifies things, yeah. That's great. In both cases, we're not making a copy of the file. Hard link and soft link, we're not making a copy. The difference is just how we reference the file. With a hard link, we're storing that inumber, so it's pointing to the same inode. With a soft link, we're just storing a name in a file, actually a full path in the file. So every time we go to a soft link, we have to resolve that full name into the actual inode, which makes it brittle. Okay, so the question is, on Windows, when you create a shortcut, does it create a relative or absolute path? That's a good question. I think it creates an absolute path, but I'm not sure, because I think if you move a directory that contains shortcuts, they'll still work. Oh no, actually, yeah, I think they'll break. I think if you move a directory that contains a shortcut, it may break those shortcuts, or no, if the shortcuts are beneath that directory, because you've changed the name effectively. Hard links, which you can create in Windows, though it's not very easy, typically you have to download a third-party program, those will work no matter how you move the file around. They call them junction points in Windows, in NTFS. Okay, so name resolution is how we take a logical name and we convert it into an actual physical resource, like the file or a directory. So we're just gonna traverse a succession of directories until we actually find our target file. If it's a global file system, this may mean actually traversing machines around the globe in order to find our file. All right, so if we wanna resolve slash my slash book slash count, so count is our file, how many disk accesses does it take to do that resolution? A lot. So first, we read in the file header for root. That's located at a fixed spot on the disk, because we have to have some way of bootstrapping the process. Then we read in the first data block for root. And that's gonna be this table of filename and inumber pairs. And then we look through that list until we find my. Once we find my, we now know its inumber, which means we can look up its inode and we can read in the file header, the inode, for my. Then we read in the first data block that's in that file header for my, and we search for book. Once we find the inumber for book, we can then load in its file header, the inode for book. We're gonna then read in the first data block of book and search for count. And then once we have the inumber for count, we can read in the inode for count. Now we are finally ready to actually read the blocks in count.
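Here is that walk written out as a toy in-memory version, just to make the count of reads explicit. Every structure, name, and number in it is made up for illustration; a real file system would be reading each of these directories and inodes off the disk, which is exactly where the seven accesses come from.

```c
#include <stdio.h>
#include <string.h>

/* Toy in-memory "file system" to count the reads in resolving
 * /my/book/count.  All names and numbers are invented; a real FS
 * reads these structures from disk blocks. */
struct dentry { const char *name; int inum; };
struct inode  { const struct dentry *entries; int nentries; };

static const struct dentry root_dir[] = { { "my",    2 } };
static const struct dentry my_dir[]   = { { "book",  5 } };
static const struct dentry book_dir[] = { { "count", 9 } };

static struct inode inodes[16];
static int reads;                       /* simulated disk accesses */

static const struct inode *read_inode(int inum) { reads++; return &inodes[inum]; }

static int lookup(const struct inode *dir, const char *name) {
    reads++;                            /* read the directory's data block */
    for (int i = 0; i < dir->nentries; i++)
        if (strcmp(dir->entries[i].name, name) == 0)
            return dir->entries[i].inum;
    return -1;
}

int main(void) {
    inodes[1] = (struct inode){ root_dir, 1 };   /* "/" is inode 1 */
    inodes[2] = (struct inode){ my_dir,   1 };
    inodes[5] = (struct inode){ book_dir, 1 };

    int inum = 1;                                /* start at root */
    const char *path[] = { "my", "book", "count" };
    for (int i = 0; i < 3; i++) {
        const struct inode *dir = read_inode(inum);   /* file header */
        inum = lookup(dir, path[i]);                  /* data block  */
    }
    read_inode(inum);                            /* finally, count's header */
    printf("resolved to inumber %d in %d reads\n", inum, reads);
    return 0;
}
```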
Now, if your system had to do this every time you opened a file, it would be incredibly expensive and incredibly slow. So instead, what we do is we actually have, effectively, a per-address-space pointer, which is your current working directory. And with this current working directory, you've basically already resolved, so if I'm in book, you've already loaded in the file header for book, and you've probably cached the first data block for book, which means successive operations on files in book are gonna be very fast. So if I do an ls of the directory, I already have this data block in memory, so I'm just gonna be able to list things out of memory instead of actually having to go to the disk. So it's a little shortcut that saves a lot of overhead. Yes? Ah, so that's a good question. The question is, if we have a program that's really data intensive, creating lots of files, should we create all of those files in the same directory? So there's an advantage: if we create them all in the same directory, then we have this advantage of them being stored in these data blocks for that directory, the directory file. But there's a counterpoint to that. What would be the counterpoint, the disadvantage? So say a program creates 10,000 files. So we have to rearrange this stuff. What do you mean by rearrange? So our current directory would remain the same. The data blocks would now get bigger. We'd have to have more and more data blocks to store all those entries of name to inumber pairs. What else would happen? Yeah, so we're gonna end up having to read more and more blocks, because remember, our directory book is stored in a regular file. So if that file gets big, we're gonna have to walk a lot through the disk, resolving all those indirect blocks and doubly indirect blocks, if we make our directory really, really big. There's another issue that's gonna happen here. It's on this slide. It's an assumption that we're making right here. We're making the assumption that directories are typically very small, so we can search them linearly. So if we go and we put 10,000 files into the same directory, we'll get some advantages, because the data structures for book will all be blocks that are relatively close to each other. But the downside is searching through that directory is going to take a long time. Doing an ls on that directory, or even deciding to load a file, is gonna require iterating through all of those entries until we get to the entry we're looking for. So there are trade-offs. This is why you'll typically find that programs that create a lot of files will create, you know, an A directory and then put like 1,000 files in that directory, and a B directory and put 1,000 in that, and a C directory and put 1,000, and so on, to try and not have directories with too many files. Okay, but it means resolution gets more expensive. Trade-offs. Okay, now, where do we store the inodes? Well, originally, in early versions of Unix, and in the file allocation table scheme, they were stored on the outermost track. Why? Because then, given an inumber, you knew exactly where to find the inode: it was a fixed-size table in a known location. But that's not near the data blocks. So if we're gonna read a small file, we're gonna have to seek all the way out to the outermost tracks, then seek to where the first block is located, then seek back to read the table again, and seek back to read the file, and back and forth and back and forth.
And if we're listing the files in a directory, we're gonna be seeking all over the place. So that was not very efficient. The other issue is that it's fixed size. So when you format a drive, there is actually a parameter for most file systems as to the maximum number of files that you're gonna create, and that specifies the number of inodes that are created. This in general is more than enough for most people, but if you're storing a lot of little files, like you have a web server or something like that, then you may need to raise that number, if you have a very, very large number of small files. Now, where do we actually store them today? We store them near the data blocks. In fact, we try to store them in the same cylinder group as the rest of the data. So this makes things like an ls on a directory run very, very fast, because to do an ls, we actually have to look at the inodes to figure out what the last modification time was, so we can print that out when we list the directory. The advantage is that by putting this file header close to the data that you're storing for that directory, it makes all the operations run much faster, and these file headers are really small. So for small directories, we can fit all the file headers, the data, and the actual directory itself on the same cylinder. Zero seeks, just rotational delays and switching heads. So we're maximizing our performance. Also, even better is the fact that because the file headers are typically much smaller than a disk block, we can store many of them in a single disk block, so we can just do one read and read a whole bunch of them. So we can amortize the cost of doing those reads across reading a lot of file headers. Another big benefit. Also reliability. If your drive has some corruption, and you stored everything in one location, that corruption has the risk of losing everything. Whereas here, if one portion of the drive is corrupted, I can still recover the data, because the file headers and the directories in the other portions of the disk wouldn't be damaged. All of this is part of the 4.2 BSD fast file system. Overall, it's a lot of optimizations to avoid seeks. Okay, last thing I wanna talk about is some in-memory data structures. So when I do an open system call, what happens? Well first, we resolve the file name by looking in the directory. So we have to go walk through the directory, load in the directory block, find the actual entry, right? And that gives us the inumber, which then gives us the inode. And then we make some entries in some per-process and system-wide tables for that file. So there's a system-wide open file table and then there's a per-process open file table. And what you get back as a user is a file handle, which is basically a pointer into that open file table. So when you wanna do a read, you provide this file handle as an index into the per-process open file table, which contains an entry into the system-wide open file table, which tells us where to find the inode and all the data blocks. So this is also a shared resource and this is a limited resource. So you have a limit on the number of open files that an individual process can have, and you have a limit on the total number of open files that you can have for a system. And usually these are settable. The per-process one is usually settable with ulimit, and the system-wide one is usually set when you build your kernel.
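From the application's point of view, all of that machinery sits behind the file descriptor that open() returns. Here is a small sketch; the path is just an example, and getrlimit() with RLIMIT_NOFILE is the programmatic view of the same per-process limit you adjust with ulimit.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
    /* open() resolves the name to an inode, fills in the system-wide and
     * per-process open-file tables, and hands back an index into the
     * per-process table: the file descriptor. */
    int fd = open("/etc/hostname", O_RDONLY);   /* any existing file */
    if (fd < 0) { perror("open"); return 1; }

    char buf[64];
    ssize_t n = read(fd, buf, sizeof buf);      /* fd -> open-file entry -> inode -> data */
    printf("read %zd bytes via descriptor %d\n", n, fd);
    close(fd);

    /* The per-process limit on open files (what `ulimit -n` reports). */
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("this process may have up to %llu open files\n",
               (unsigned long long)rl.rlim_cur);
    return 0;
}
```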
Okay, so last quiz. First question, a hard link is a pointer to another file. Second question, an inumber is the ID of a block. Third question, typically directories are stored as files. And our fourth question, storing file headers on the outermost cylinders minimizes the seek time. Let's see if everyone's awake. So first question, a hard link is a pointer to another file. How many people think that is true? How many people think that is false? Pretty divided; the answer is false. It's not a pointer, it is the actual inumber, so we share the same structure. A soft link can be thought of as a pointer; we have to do some resolution to find it. An inumber is the ID of a block. How many people think that is true? Okay, and how many people think that is false? That is false. The inumber points to the inode. It gives us the index into that table where, given the inode, we can then find all of the blocks of the file. So it's a different indirection. Okay, question number three. Typically directories are stored as files. How many people think that is true? And how many people think that is false? So the answer is true. Reuse data structures. Why create a new data structure when you don't have to? "Because you can" is probably not a good reason. Okay, last one: storing file headers on the outermost cylinders minimizes the seek times. How many people think that is true? Okay, and how many people think that's false? So that is false. If we put them on the outermost cylinder, we're gonna be constantly seeking. So we wanna put them as close to the data as possible. Okay, so summary. A file system takes our drive with its blocks and transforms it into files and directories. We all use file systems every single day. We wanna optimize for the common access and usage patterns, so small files and sequential access modes. But we wanna support other ones, like efficient random access and large files, because those exist. Files and directories are defined by an inode. That's our file header. And we use schemes like the multi-level index scheme in 4.2 BSD, where we have an inode that contains information about the file and then direct block pointers and indirect pointers, singly and doubly indirect and triply indirect. In 4.2 BSD, we have optimizations for sequential access, so for example, starting new files within an open range of free blocks so that they can grow and remain contiguous. And we also have rotational optimizations like skip sector positioning to match the performance of the disk with the performance of the processor. And then finally, naming is that act of transforming from a user-visible name or icon into some system resource. And we're gonna see naming appear again and again and again in operating systems. Any questions? Okay, see you on Monday.