Okay. Welcome back, everybody, to 162. Today we're going to pick up where we were last week on Tuesday and talk about file systems. And just remember, if you have any questions, type them in the chat and we'll try to get them to everybody. So, just to remind you of where we were: we were talking about building file systems, and one of the things we talked about was the Fast File System, which originally started in 4.1 BSD back before it was the Fast File System. They didn't actually call it the slow file system. But an inode was basically defined in the following fashion: a 128-byte structure that points at data, and that inode structure has things like all of the metadata for the mode, owner, timestamps, and so on, and then a set of direct blocks, singly indirect blocks, doubly indirect blocks, etc. And if you remember, a direct pointer points directly at a data block, a singly indirect pointer points at a block which points at data, a doubly indirect pointer points at a block which points at a block which points at data, etc. And this particular structure, as you might recall, was put together as a way of getting very efficient support for short files while also being able to support long files fairly well. The short files are the ones that use only direct pointers, and the big files are the ones that probably use the doubly and maybe even triply indirect blocks. And these data blocks are numbered, in some sense, in file order in the following way: you start at block zero, work your way through all the direct blocks, and then you work your way through the singly, doubly, triply, etc. indirect blocks. So, with 10 direct pointers, getting to any of the first 10 blocks is a single hop. To access block 23, assuming this inode is already open, you basically have to grab the singly indirect block and then block 23, which would be entry 13 within it. And keep in mind that these indirect blocks have a tendency to get pulled into the buffer cache, and as a result, after you pull one in, it's pretty efficient. All right. How about block 5 and block 340? Block 5 you can grab directly; block 340, it turns out, you have to go through the doubly indirect blocks to get there. And this is a computation we're assuming would be very easy for you to do. So the pros of this original structure: it's more or less pretty simple, it nicely optimizes for short files and long files, and you get easy expansion up to a point. One of the cons, though, in the original BSD, was that you had lots of seeks, because there was no particularly careful placement of the data blocks. And that led to the Fast File System in 4.2 BSD, where there was a lot of attention to laying out the data blocks in a way that gave you great locality. And that's what we talked about last time. And just so you know, BSD obviously was the Berkeley Software Distribution, so that's yay Berkeley, right? ext2 and ext3 from Linux are essentially in the direct line of descent of this file system design, so they're pretty close: a couple more direct pointers, going all the way out to triply indirect blocks, and 4K blocks by default. But it's very similar. So are there any questions on this before I move on? Everybody good on these ideas? All right.
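To make that index arithmetic concrete, here's a minimal sketch of the level computation. It is not the actual kernel code; the constants (10 direct pointers, 1 KB blocks, 4-byte block pointers, so 256 pointers per indirect block) are assumptions chosen to match the lecture's examples, where block 5 is direct, block 23 is entry 13 of the singly indirect block, and block 340 falls in the doubly indirect range.

    /* Sketch: which pointers do we follow for logical block n of a file? */
    #include <stdio.h>

    #define NDIRECT   10                       /* direct pointers in the inode      */
    #define BLOCKSIZE 1024                     /* assumed block size                */
    #define PTRSIZE   4                        /* assumed block-pointer size        */
    #define NINDIRECT (BLOCKSIZE / PTRSIZE)    /* 256 pointers per indirect block   */

    static void locate(unsigned n)
    {
        if (n < NDIRECT) {
            printf("block %u: direct pointer %u\n", n, n);
        } else if ((n -= NDIRECT) < NINDIRECT) {
            printf("block %u: singly indirect, entry %u\n", n + NDIRECT, n);
        } else if ((n -= NINDIRECT) < (unsigned)NINDIRECT * NINDIRECT) {
            printf("block %u: doubly indirect, entries %u then %u\n",
                   n + NDIRECT + NINDIRECT, n / NINDIRECT, n % NINDIRECT);
        } else {
            n -= (unsigned)NINDIRECT * NINDIRECT;
            printf("triply indirect, entries %u, %u, %u\n",
                   n / (NINDIRECT * NINDIRECT), (n / NINDIRECT) % NINDIRECT,
                   n % NINDIRECT);
        }
    }

    int main(void) { locate(5); locate(23); locate(340); return 0; }

With Linux ext2/ext3 parameters (12 direct pointers, 4 KB blocks) the same computation applies, just with different constants.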
So, moving on... oops, there's a question: how many indirect blocks? So the direct blocks are here, and then there's one singly indirect block attached directly to the inode, one doubly indirect block that points at singly indirect blocks, and so on. So there's exactly one of each of those varieties hanging off the inode. And the number of direct pointers is, you know, 10 in the original BSD, 12 in Linux, and so on. The number of pointers in an indirect block depends on the size of a block versus the size of a pointer. So if you have, say, a 4K indirect block and 4-byte (32-bit) pointers, which is certainly enough to address the data, then you can fit 1,024 data pointers within one indirect block. Yep, the question continues: there's only one singly indirect, one doubly indirect, one triply indirect, and so on. Okay. Now, the thing we were actually talking about at the very end last time was caching. And with caching, as I've said before, operating systems are all about the cache. So in the middle of the operating system we have something called the buffer cache. The buffer cache is a generic cache used for all sorts of stuff: file data is one of its primary uses, but it also holds things like inodes, directory contents, name translations, et cetera. And the key idea here, as with any cache, is that we're going to exploit locality in the disk data to get better performance. And that locality, including for things like name translation, is often there because typically you open a path and then operate on a bunch of files in a directory. So the name translations are cached, and the disk blocks have a lot of locality to them, especially if you're doing sequential accesses. So the buffer cache is this thing, as I mentioned, used to cache all of these items, and it's typically a single entity managed in the middle of the kernel that handles all sorts of stuff. The question asked here is how frequently do we flush dirty cache blocks back to disk? That varies a lot by operating system, as I'll mention; for typical Unix variants in their default mode, 30 seconds is often the number used. We'll talk more about this as we go on. So I wanted to give you a little bit of a graphical view of this. We have the process control blocks with their file descriptors, and we have the disk. And somewhere between the user reading and writing stuff and the disk, there ought to be something in the middle to try to make it a little faster, and that's going to be the buffer cache. You can think of the buffer cache as a region of memory set aside to hold a bunch of blocks, shared across all the different processes and users of the disk. Every entry is going to be a 4K block, for instance, if our blocks on the disk are 4K, and it's going to have state, like whether it's free or not. These blocks, yes, are handled in an LRU fashion within the buffer cache, as we'll mention in a moment, but you can think of them logically like this: there are some data blocks, which are the gray ones here, there are some inodes, there are some directory data blocks, and there's the free bitmap itself as well. And these things are all cached versions of what's on disk.
And so if a user goes to access some data or open a file or whatever, and it's already in the cache, then we get very fast access; otherwise we've got to start involving the disk somehow. Just to give you an example: if you already have a file open and you've been accessing it, then the file descriptor part of your PCB is probably pointing at its inode, which is already in memory somewhere, and you can just access it quickly. On the other hand, what if you're trying to do an open for a brand new file? Well, then, if things are not already in the buffer cache, you've got to start moving forward with a process. For instance, you might have to start reading the directory off of the disk, assuming the inode for that directory has already been loaded. So one of the things you'll do is allocate a new block and mark it in a transient "reading" state, which says this block is kind of halfway between the disk and memory, and then you start a read going. And that may take a while. If you remember, the one number I keep telling you guys to remember, other than pi of course, is that about a million instructions' worth of time is typically lost on the way to reading from disk. So clearly this process may be trying to do an open, but we're going to put it to sleep while we're busy reading the directory's data off disk. Eventually that data gets put in its place, and the block gets marked as a directory block in the buffer cache, and we can logically think of it as available. At that point we can look up the file we're trying to open, find out what inode number we're looking for, and then look among the blocks we have cached holding inodes. And we find out: oh, that one's not in memory yet, we've got to go to disk. Again, we mark a block in the buffer cache as busy, transient, not freeable yet; we go ahead and do the read; and eventually that comes back, gets read into memory, and gets marked as an inode block. At that point we can set up the file descriptor for our open inode and start dealing with our data. In that case, we might have to read data blocks off of disk, and reading those data blocks is going to be similar to what I just showed you, so we're not going to go through it. But those gray data blocks can also be cached. Okay. So for reads, we're going to pull things off of the disk into the buffer cache and then into the process. And if you remember, one of the things we talked about a couple of lectures ago is that the user's view of a file is a stream of bytes, while the system's view underneath the covers is a bunch of blocks. So, especially if the user reads a few bytes, we're going to pull the whole block into memory so that we can then give them those few bytes. And if they ask for the next few bytes, we pull them directly out of the cache. So the buffer cache is part of not only how we reduce the performance penalty of going to the disk all the time, but also how we match the user's view, which is bytes, with the system's view, which is blocks. We do that by keeping the blocks around so that we can handle multiple reads to them.
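Here's a minimal sketch of that read path through the buffer cache — the classic getblk/bread pattern — assuming 4 KB blocks. This isn't any particular kernel's API; hash_lookup, lru_evict, disk_start_read, and sleep_on are hypothetical stand-ins for the hash table, replacement policy, disk driver, and scheduler pieces.

    enum buf_state { B_FREE, B_READING, B_VALID, B_DIRTY, B_WRITING };

    struct buf {
        unsigned       blockno;     /* which disk block this entry caches  */
        enum buf_state state;       /* including the transient states      */
        char           data[4096];  /* assuming 4 KB blocks                */
    };

    extern struct buf *hash_lookup(unsigned blockno);   /* hypothetical helpers */
    extern struct buf *lru_evict(void);
    extern void disk_start_read(struct buf *b);
    extern void sleep_on(struct buf *b);

    /* Return a buffer holding disk block `blockno`, reading it in if necessary. */
    struct buf *bread(unsigned blockno)
    {
        struct buf *b = hash_lookup(blockno);
        if (b) {
            while (b->state == B_READING)  /* someone else already started the read */
                sleep_on(b);               /* sleep until the disk interrupt fires  */
            return b;                      /* hit: no disk access at all            */
        }
        b = lru_evict();                   /* recycle the least-recently-used clean buffer */
        b->blockno = blockno;
        b->state   = B_READING;            /* transient: not usable, not freeable yet */
        disk_start_read(b);                /* roughly a million instructions' worth of time */
        sleep_on(b);                       /* interrupt handler sets B_VALID and wakes us   */
        return b;
    }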
Writes, of course, are tricky, because if you write a few bytes, you're not going to write through to the disk. You absolutely aren't, because, again, that's a million instructions' worth of time just to write a few bytes. Not a good idea. So writes are, by and large, going to go into memory. In fact, you're going to have to do a read-modify-write: if you write a few bytes and there's already a block out on disk, you've got to read that block into memory first, then modify the bytes, and then mark the block as dirty for later flushing. And of course, to let you write a stream of bytes over a couple of successive system calls, you're not going to flush to disk right away. That gets us to the question asked earlier: how long does dirty data sit in the cache? And that could easily be seconds — 30 seconds — which is going to start causing some issues for us as we think about it a little more. Blocks being written to disk also go through transient states. If you remember, when we were talking about paging and replacement policy, one of the advantages of the final version, where we coupled the clock algorithm with a free list (which was kind of a second-chance list), was that we could pull blocks that were about to be freed, put them on the free list, and start writing them back to disk if they were dirty; by the time they got to the head of the list, hopefully they'd been written back and were clean and ready to go. So buffer cache blocks go through a similar set of transient states, for writes and for eviction. Okay. All right. So what else about the buffer cache? Well, it's implemented entirely in software, unlike memory caches and the TLB, which are hardware supported. The blocks go through transitional states, as I mentioned, between free and in use: being read from disk, being written to disk, etc. Other processes can read or write through it: for instance, if process A writes some data and then process B reads from the same file, the buffer cache will actually catch the reads of the second process. So even though the data hasn't made it to disk yet, the second process gets the correct data back. The buffer cache is catching all reads and writes, which makes sure we have a consistent view on a single node. This gets a lot trickier when we start talking about distributed file systems, but we're not quite there yet. Now, blocks in the buffer cache, as I mentioned, are used for many purposes: inodes, data for directories and files, the free map. The operating system maintains pointers into them while they're in the buffer cache, and essentially what we mean by some of these transient states is that these buffers are locked during the periods when they can't be replaced — while they're being read in or written out, etc. Okay. What happens on termination of a process? Well, at that point we've got to flush all the data out of the user's buffers into the buffer cache at least, and then possibly out to disk if that's requested by the process.
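As a sketch of that read-modify-write path for a small write — again not a real kernel interface, just an illustration reusing the hypothetical bread() and struct buf from the sketch above:

    #include <string.h>

    /* Write `len` bytes at byte `offset` within disk block `blockno`, entirely in
     * the buffer cache.  No disk I/O happens here; the block is only marked dirty. */
    void buffered_write(unsigned blockno, unsigned offset, const char *src, unsigned len)
    {
        struct buf *b = bread(blockno);      /* read-modify-write: fetch the whole block */
        memcpy(b->data + offset, src, len);  /* modify just the bytes being written      */
        b->state = B_DIRTY;                  /* delayed write: flushed later (~30 s)     */
        /* A background flusher, or eviction under memory pressure, writes it back. */
    }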
So you might ask what happens when the buffer cache gets full. And really, that's pretty much all the time, because you're going to keep as much data as you possibly can in the buffer cache; a little while after rebooting the system, the buffer cache is essentially full. So now what? Well, LRU is easy enough to do here. Unlike when we were dealing with replacement in virtual memory, where computing true LRU meant essentially rearranging our list on every access, which would be way too expensive (so we went to a clock algorithm), here we have the potential to do real LRU, because we only rearrange entries when we replace whole buffers, either pulling them in off of disk or sending them out to disk. So we can keep a linked list if we want. Okay. So LRU might actually be a good policy here, and it's a nice approximation, as we've said in the past, to MIN, which is the oracle that replaces the block that won't be used for the longest time in the future. But there are some disadvantages, namely that there are some types of accesses you absolutely don't want cached. A great example is find with exec — and this is a pretty cool syntax if you haven't seen it before — where you say find, starting in some directory, and for every file under that directory, execute grep foo on it: find . -exec grep foo {} \; . The funny little braces {} mean "insert the file name here," and the backslash-semicolon says "that's the end of this command." So what this find will do is, for every file in every subdirectory underneath the current directory, execute grep foo to find something. You guys should keep that in mind. But if you do this at the top of a terabyte-sized file system, you're clearly going to blow out your cache by filling it up and replacing and replacing and replacing, and never getting any benefit from caching. So there's a good example where LRU isn't going to help you at all, because you're just kicking things out. Now, there's a good question in the chat: how do you prevent a process from accessing a block in the cache that's being evicted by another process? Well, the answer is you don't prevent it. You do what we said about the second-chance list: until the block is absolutely out of the system, a process that accesses it can bring it back in and return it to the active set. So you don't have to lock the entire cache. What you do is start the process of writing it out, and when that finally finishes — on the interrupt from the disk saying the write is done — at that point you're in, or right off, the interrupt context, so nobody can be accessing it, and you can free it. Or, if it gets reused in the meantime, you just bring it back in, and the only thing that happened is you wrote the dirty data out, so when it comes back it's clean. Another question: in the case where all blocks are in use and the cache is full, and a process P1 requests some data that's on disk, do we wait, or put P1 to sleep until some block in the cache is no longer used? So there are a lot of different questions one could ask about blocks in a transient state. If there's valid data in the cache — say it's dirty and being written out to disk — then you can certainly satisfy reads from it. On the other hand, if it's being read in from disk and there's no data there yet, then if a second process comes and asks for it, you're going to have to put that process to sleep.
So there are a lot of different scenarios you can come up with, and asking whether a second process is allowed to access the block or not is really a question of thinking through those scenarios; you've just got to think through which ones make sense from a consistency standpoint and which don't. There's another question about whether we encounter cache incoherency if the system has multiple physical disks: no problem, as long as it's one node, because the buffer cache covers everything. If you think about this buffer cache, there's no reason there can't be many physical disks behind it; the buffer cache is the gatekeeper between all processes trying to read and write. So as long as you're on one node, no problem. The moment you go to multiple nodes, potential problem, and that's when things get interesting: when we start talking about multiple physical nodes, where the buffer caches are now separate, then we've got a problem and have to start thinking carefully. But as long as it's one node, multiple disks don't matter, because everything goes through a single buffer cache. Okay. The other thing I wanted to mention is that LRU may be semi-optimal if we have no other options or don't know anything. But if a process does know that it's about to scan through the whole file system and get basically no benefit from caching, some systems allow you to say, "for my upcoming requests, use a use-once policy," and the file system is then allowed to discard blocks as soon as they're used rather than trying to cache them. In that case you might dedicate a very small portion of the buffer cache to those transient blocks going in and out of the system, and leave the rest — the bulk of the buffer cache — for other processes that are likely to get some benefit from it. Unfortunately those interfaces are not always available, but when they are, they're very helpful when you know how you're going to use the file system.
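One real example of such an advisory interface is POSIX's posix_fadvise(). Here's a sketch of how a scan-once program might use it; the scan_once function and its reading loop are placeholders, and the kernel is free to ignore these hints — they suggest a caching policy, they don't demand one.

    #include <fcntl.h>
    #include <unistd.h>

    int scan_once(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        /* "I'll read this sequentially, once, and won't need it again":
           encourages read-ahead, discourages holding it in the buffer cache. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

        /* ... read through the whole file here ... */

        /* When done, ask the kernel to drop any pages it cached for this file. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        close(fd);
        return 0;
    }

A length of 0 in posix_fadvise means "to the end of the file," so each call here covers the whole file.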
So now we might ask an obvious question: how much memory should the OS devote to this? DRAM is what we're using up, and we know from the last several weeks that we have many uses for DRAM, one of which is virtual memory; the other, now, is the buffer cache. Virtual memory is different from the buffer cache: virtual memory is memory mapped to virtual addresses in processes, and it represents memory; the buffer cache is memory that represents files. Those are slightly different things, so you can imagine there's a trade-off. The more memory you give the buffer cache, maybe the better your file system behaves, but you start getting a lot of page faults, because processes that need a lot of virtual memory start faulting since there isn't enough physical DRAM left. On the other hand, if you give too much to virtual memory, your file system might behave badly because you're not getting any caching. So it's a conundrum: what do you do? In the old days, back when I first started compiling versions of BSD-like file systems, there was actually a constant in the top header file where you had to say what fraction of memory went to the buffer cache and what fraction went to virtual memory, and this was unfortunate at best, because you never guessed quite right, so you ended up doing 50-50 or something. Today you just dynamically adjust it: most modern operating systems look at the miss rate in the buffer cache and the miss rate — namely page faults — on the virtual memory side, and dynamically decide how many pages go on either side. That's much easier. Another question you might ask is: do we only pull off the disk those blocks that are actually asked for? And there's a pretty good reason not to do just that. The reason is that, the way POSIX file system interfaces work, you don't really know up front how much data the user wants. They may read a couple of bytes and close the file, or they may read a few bytes at a time but then proceed to grab a terabyte of sequential data off the disk, a few bytes at a time. Those two options — a few bytes and then close, or a terabyte read a few bytes at a time — are not distinguishable to the file system in the short term. So it could just pull in one block at a time when it goes to read, or it could decide that sequential reading is likely and pull some blocks ahead. The key idea here is to exploit the fact that the most common file access pattern is sequential: by prefetching the subsequent disk blocks, we optimize for the pretty common case where the user is just reading through the file sequentially, a few bytes at a time. You can then ask how much to read ahead — I'll say more in a second — but typically a few blocks, the next few blocks of the file, is almost always the right thing to do, and you can think of many types of file access patterns that are essentially sequential, so that works out really well. The other thing is, it's not just data: when we map an executable into memory and demand-page that executable in, which we talked about last time with mmap, that access is also likely to be sequential, because if you're running a function that's close to the end of a block, you're going to want the next block. So read prefetching, which reads a couple of blocks ahead, is a pretty good optimization. And one more thing: if we have a bunch of different prefetches going on, the nice thing to remember is that the elevator algorithm — either running in the operating system or, as we mentioned, more commonly these days running on the disk controller itself — can take all these different prefetches, and write-backs for that matter, and interleave them in a way that avoids seeking too much. So having a number of queued accesses is actually an advantage from that standpoint, because it gives the disk scheduler a chance to optimize for the access pattern that happens to be in the queue. So that's pretty good. Okay, any questions so far? Okay, are we good so far? All righty.
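A minimal sketch of that sequential read-ahead idea, before we move on. The 4-block window and the cached()/start_async_read() helpers are assumptions for illustration, not a real kernel's interface:

    #define READAHEAD_WINDOW 4

    extern int  cached(unsigned blockno);            /* hypothetical: already in buffer cache? */
    extern void start_async_read(unsigned blockno);  /* hypothetical: queue a read, don't wait */

    void maybe_readahead(unsigned just_read, unsigned last_read)
    {
        if (just_read != last_read + 1)      /* doesn't look sequential: do nothing */
            return;
        for (unsigned i = 1; i <= READAHEAD_WINDOW; i++)
            if (!cached(just_read + i))
                start_async_read(just_read + i);  /* queued reads can then be reordered
                                                     by the elevator / disk scheduler  */
    }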
Now, what about delayed writes? Delayed writes is the name given to these writes that sit in the buffer cache and aren't immediately sent to disk. The buffer cache is a write-back cache, clearly, because you don't write through the buffer cache to the disk — you'd blow a million instructions' worth of time on every write. The fact that it's a write-back cache gives us some positives and negatives. On the positive side, a write copies the data from the user into a kernel buffer, which lets the user interface at the byte level rather than at the block level, and that's pretty good. The other thing we get is that if one user has written some data, another application can read it immediately; that data doesn't have to have gone to disk. So we get a very fast turnaround on data written by one process and read by another, without taking that million-instruction hit to go out to disk. So the cache — another positive here — is transparent to user programs, and that's great. Now, we're going to flush this to disk periodically, because if we leave our data in memory and not on disk indefinitely, we have a pretty serious vulnerability: the moment the machine crashes, we lose the data, and that would be very unfortunate. In Linux, for instance, the default is to flush every 30 seconds or so. The other advantage of delayed writes is just what I told you when I was talking about prefetching: the fact that there are delayed writes sitting in the buffer cache means that, in principle, we can choose how to send those writes to disk in a way that minimizes seeking, and the more blocks there are to choose from in the buffer cache, the better. So delayed writes have some pretty interesting positives. Here's another one you might not have thought about: if we're going to try to optimize for locality, as in the FFS-style allocation we talked about last time, what we'd like is some insight into how big a file is going to be. If a user opens a new file and starts writing a few blocks, and we have to allocate on disk immediately, we might choose a section of the disk that isn't big enough and not get enough locality. But because we're writing into the buffer cache, we can actually have data in the buffer cache that's not yet mapped to official physical locations on the disk; the data is just sitting there. As a result, if we start writing and write a bunch of blocks, then by the time it's time to flush them to disk, the allocator — the part that maps buffer cache space to physical space — can say, "oh gee, this is a bigger file, let me find a longer run of free blocks to write to." So you can get better allocation as a result of delaying. And here's one you definitely probably didn't think about: as a result of delayed writes, some files that are created, written, and destroyed — like temporary files — may never have to go to disk at all. If you watch what happens when you run a compile, you'll see a lot of intermediate files generated and then quickly destroyed, and as a result of delayed writes, those files, which are temporarily made and then deleted, may never actually go to disk. That's a pretty serious advantage of delayed writes. Okay, so what's the downside? Well, there's clearly one, right? What if the system crashes before the buffer cache is flushed to disk? You lose a bunch of data. What if it was a directory's data? Well, now you might lose a pointer to an inode, so not only do you lose data, but maybe you lose the very existence of a file. So there are some pretty serious consequences here from a reliability standpoint, and this leads us to a pretty clear need for some sort of recovery mechanism to deal with failure. We'll talk about some of them here, but there are a number of options we could think about.
For instance, we could say that if the system crashes before we've closed the file and flushed it to disk, maybe we consider that okay, because the process that was writing hasn't reflected back to the user yet that the write even completed, and so maybe it's okay to lose that data. That's a failure mode that's a little different from the one you might first think about. Now, there's a question here: what happens if the system crashes in the middle of a write — what about a failure in the middle of writing to disk? That gets very interesting. Most file systems start with the premise that writes of a block to disk are somehow atomic, so you can't get a partial write of a block. That's clearly not a good assumption unless you have some way to make sure there's just enough power to finish the write you're currently working on; some systems give you that. Another, more reliable, thing you can do is start with what we'll call non-volatile RAM — we'll talk about that in a moment — where all of the writes go into this battery-backed RAM right away, and then once you've verified a write has made it to disk, you can delete it from the NVRAM. Serious server-level systems, among other things, start with NVRAM and go from there, and they also have some form of journaling or logging, which we're going to talk about as a topic today as well. Okay. All right. Now, what are some of the things we might care about? One that we've already talked about earlier in the term is availability — and by the way, you'll notice these all end in "-ility." Availability is the probability that a system can accept and process requests, and it's what websites and web services often quote: they say, "I have three nines of availability," meaning a 99.9% chance that when you go to access that website, something useful will happen. Now, this is not necessarily as useful as you think it is. It does not say that 99.9% of the time something correct will happen; all it says is there'll be something there to give you a response. So if you're out shopping for a service, you should look carefully at what they're actually promising, because three nines of availability may or may not be useful. It may be useful that you get some response 99.9% of the time, but it would be even better to know you get something correct 99.9% of the time, and that's not availability. Durability is a different thing: durability is the ability of a system to recover data despite faults. So a system with 99.9% durability is saying that 99.9% of the data written to it will not be lost. Typically, when you're talking durability, you're much more interested in a higher number of nines — five nines or seven nines — because data is king; that's the statement I always make. If you lose the data, there's no going back. Data, once lost — this is an entropy thing, basic physics — you can't get it back. And I want to point out, amusingly enough, that durability is not availability: something very durable doesn't have to be available, and the pyramids, I think, are a great example of this. Once upon a time, not too long ago, the pyramids had writing on them that nobody could read, and what happened was, eventually, the Rosetta Stone was discovered.
That allowed people to decode the hieroglyphs written on the pyramids, and as a result they were suddenly able to read the data. Now, that data was extremely durable — it lasted 2000-plus years — but it wasn't really available, because nobody could read it until the Rosetta Stone came along. So those two things are different. And finally there's reliability, which is what you really want most of the time: the ability of a system or component to perform its required function, under the stated conditions, for a specified period of time. It's typically much stronger than availability, because it means the system is not just up but working correctly. Working correctly matters: when you do a write to a merely available system, you might get back an OK but the data might be lost; if you do a write to a durable system, it might store the data, but you might not be able to read it again for a very long time; a reliable system will not only write the data but give you a proper response back. So oftentimes we're going to be looking to build reliable systems, not just available ones. All right, questions? By the way, as we start getting into distributed file systems, later today in this lecture I'm going to show you an example of where you can have durability without availability: you encode the data, spread the little chunks all over the place, and as long as you can recover enough of those chunks, you can get the data back. But then a bunch of the chunks go offline with parts of the Internet — the data is still safe, and you can eventually recover it, but during the period when parts of the net are offline, you can't provide availability. Why do I call them "ilities"? I don't know, no good reason, I guess. Okay, so how do you make the file system durable? Well, first, disk blocks actually contain error correction codes on them to deal with small defects in the disk drive. What you may not realize is that when you write your data to disk, it is actually encoded in a way that takes slightly more bits on the physical disk than the number of bits you wrote, often using something called a Reed-Solomon code. The good thing about this is that when you go to read the data back, even if there are errors in some of the bits, there's enough redundancy in the Reed-Solomon code that you can, with very high probability, read the data back off the disk, decode it, and recover it. This is essential to modern disks, because the bits are so dense, so close together — and with the shingled type of recording I mentioned a couple of lectures ago, they're even on top of each other — that you definitely need error correction to make modern disks work. And you don't even have to worry about it; it happens automatically. The second thing we might do is make sure that writes survive in the short term. This is going to be very useful when we're writing to a file system, say on a server, that has delayed writes: it would be great if the writes that aren't yet persisted on disk were stored somewhere a power outage wouldn't clobber them. A good example of that is non-volatile RAM (NVRAM), which is random access memory with a little battery on it, so that if power goes out, the data is still there. Now I will relate the following story, from when I was first at Berkeley, N years ago, where N is larger than I'll mention.
We actually had a whole bunch of RAID disk drives set up — RAID servers, actually — for our data, and we had a transient power outage. We thought this wasn't a problem, because all our data was protected; in fact, we lost a whole bunch of data, because all of the batteries in the little NVRAMs hadn't been checked in a long time — the system had essentially never failed and we'd never had a power outage. So even though, in principle, this was going to hold our data in RAM until it could be written to disk, that didn't actually happen, and we lost a whole bunch of data and had to go to tape to pull it back in. The moral of that story is: if you're relying on something like a battery, you need to test it regularly. The other possibility here, of course, which is much more practical today, is flash memory — SSDs — as short-term write storage before you go out to spinning storage, and that is something people do. Of course SSDs have a limited amount of writing that can be done on them, so it's six of one, half a dozen of the other. The other thing we might want to do: we've got short-term survival because of the NVRAM, but long-term survival is another question, and that's really about replication — more than one copy of the data. If you have it on your local server, and it's replicated there, and there's a fire in the machine room, you may have just lost your data. So what you really want is for your copies to be automatically put somewhere in the cloud and spread across multiple continents, to make sure the data survives. There are a lot of interesting ways to make data durable over the long term: you can put copies on one disk, but the disk may fail; you can put copies on multiple disks, but the server may fail; you can put copies on different servers, but the building may be struck by lightning; so you put copies on servers on different continents. And if we're hit by a very large meteor, then you don't care anyway, so it's probably okay that the data isn't durable — well, maybe the aliens care, so maybe you send a copy off to the moon as well. So, one thing I'm sure you've run into is this notion of RAID: redundant arrays of inexpensive disks. This is a Berkeley thing as well — yay Berkeley. Patterson and Katz came up with this acronym, and Dave Patterson, as I like to say, is the famous generator of four-letter acronyms; RAID is one of the ones he's very famous for. What they were interested in originally was this: instead of really expensive, huge disk drives that were really fast but very expensive, they wanted to put a bunch of cheap disks together and get the same amount of storage and the same speed, but much cheaper. The problem, of course, is that when you have a lot of cheap disks, they fail; so initially, putting a bunch of disks together was actually a way to get lower reliability. So they started investigating what you could do, and what they did was use redundancy across multiple disks. That redundancy gave them a way to make sure that even if disks failed, there were copies around, and as a result they could make a RAID system, built out of a bunch of cheap disks, that was faster and more reliable than those really big expensive systems. The basic idea of RAID, which I'm going to show you here and which I'm sure many of you have seen before, is something that can be done either in software or in hardware.
For instance, if you take 262, we often talk about various hardware RAIDs by HP that were designed with a hardware controller — with a software microcontroller on it that could switch between different RAID levels and so on. So you can do this in software, you can do it in hardware, you can do some combination of both. But what's the essential idea? Well, the essential idea was laid out with five different RAID levels: RAID 1, RAID 2, RAID 3, RAID 4, RAID 5. They weren't particularly inspirational in the way they named these things, and the levels 1 through 5 really have nothing to do with each other, but the terminology stuck. RAID 2 and 3 are ones you've never heard of, I'm sure, and probably never will — 4 as well — so really you hear about 0, 1, 5, and 6. So what's RAID 1? RAID 1 is very simple: it's two disks. For every place you would have put one disk, you put two. It's called mirroring, and every disk has a fully duplicated shadow disk. What's great about this is it's extremely simple: the two disks hold identical file systems, every time you do a write you write to both disks, and every time you do a read you can pick either disk, even at random. So, as you think about it, it has 100% overhead in capacity — you have two disks but only get one disk's worth of storage — but you get twice the read rate, because you can read from either disk, and you get durability, so that if a disk fails, you've got the other one as a good copy. And I will tell you, these days I never buy workstations or servers that don't have at least RAID 1, because it's very easy to do: you go to a company like Dell, you configure your primary disk storage, and you just say make it RAID 1 and duplicate every disk you buy. It turns out the expense is not high and it's absolutely worth it. The write bandwidth here is sacrificed a bit, because you've got to do two writes, one to each disk, and there's a little bit of synchronization involved in making sure both copies get written, so there's a small additional penalty compared with writing to just one disk. Reads can be optimized by choosing either disk, and so are faster. Recovery from a disk failure: you just replace the failed disk and copy the good disk onto the new one; in fact, you can even have systems with a third disk sitting there idle or powered off all the time, and the moment a disk fails, you just pull that spare in. Now, here's what I'd like to point out, for those of you who might be wondering: how do you know a disk failed? Does anybody have any idea? It seems like RAID fundamentally requires you to know that a disk has failed — how would you know that? Good idea: checksums. Everybody's thinking of checksums, and that turns out to be the right idea, but remember, I already told you we've got error correction codes. What's great about these Reed-Solomon codes is that they not only have enough redundancy to fix a small number of bad bits, but if there are too many bad bits, they'll tell you that. So you find out because you go to read an item off the disk and the ECC says "this is a failure." Another answer pointed out here, which is also good, is "not responding": yes, so either the code reports a read failure or the disk controller itself isn't responding, and either of those is a reason to assume the disk has failed. And what we do in that instance is we just mentally put a big X over the whole disk and assume it's completely broken.
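Here's a minimal sketch of RAID 1 at the block level — every write goes to both disks, every read can be served by either. The disk_write/disk_read primitives are hypothetical single-disk operations, not a real controller's interface:

    extern int disk_write(int disk, unsigned blockno, const void *buf);   /* hypothetical */
    extern int disk_read (int disk, unsigned blockno, void *buf);         /* hypothetical */

    int mirror_write(unsigned blockno, const void *buf)
    {
        int a = disk_write(0, blockno, buf);       /* must update BOTH copies         */
        int b = disk_write(1, blockno, buf);
        return (a == 0 && b == 0) ? 0 : -1;        /* a little extra synchronization  */
    }

    int mirror_read(unsigned blockno, void *buf)
    {
        int pick = blockno & 1;                    /* spread reads across both disks  */
        if (disk_read(pick, blockno, buf) == 0)
            return 0;
        return disk_read(1 - pick, blockno, buf);  /* ECC/controller reported failure:
                                                      fall back to the mirror copy    */
    }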
Then we copy from the good disk to a new disk. So this is what's called an erasure code: we've "erased," say, this pink disk, because we know it's bad, and then we copy the green one to another one. RAID is fundamentally an erasure-coding style of redundancy. The other thing you can imagine here is that if you lose just a few blocks on a disk, that will be indicated, again, by the error correction code, and at that point you could copy those blocks from the green disk to spare blocks you've allocated. But the moment blocks start failing, it's usually a pretty good predictor that the disk is starting to go, and you'd better get a new disk in there soon. Okay: RAID 5. The problem with RAID 1 is that it's 100% overhead — but it's really fast, because you can write as fast as one disk, which is pretty good, and you can read at twice the rate. With RAID 5 (or 5+), what you do is take a set of disks — here, five disks — that are all put together, and we imagine a stripe: the same set of blocks across all five disks, where four of them are considered data and the fifth is parity. We take that stripe, XOR the four data blocks together, and put the result on the parity disk. And for a reason I'll tell you in a moment, we rotate that parity across the different disks, so if you look at the layout, which disk holds the parity changes from stripe to stripe. This parity computation — for instance P0 = D0 xor D1 xor D2 xor D3 — has the nice property that it's idempotent, so you can do it over and over again and always get the same result, and it doesn't matter which of the five disks fails. It could be the one holding P0, it could be D2, it could be D0 — whichever disk fails, I can XOR across the remaining ones. If disk zero fails, I just XOR the remaining four disks and I get back what disk zero was supposed to hold. That's the nice property of RAID 5. So again, supposing disk three fails: block D2, for instance, is recomputed by XORing D0, D1, D3, and P0 together; I could get back block D6 by XORing D4, D5, P1, and D7; etc. This is a very simple code, it's easy to implement in a disk controller, and any server or workstation company you can name will sell you a controller that can do RAID 5. And notice this trick doesn't require exactly five disks — it could be four, it could be seven; you're taking a group of disks and adding one parity disk to it. Now, what happens if disk three fails, I put a new disk in, and while I'm doing the reconstruction process to rewrite all the data, another disk fails? So I'm in the middle of copying to disk three prime, and disk four fails — what happens? Well, you lose data, exactly. So you need something more here. Yep, cry — yep, it's happened. So that's the issue: this scheme, RAID 5, only protects against one disk failure, and after one disk has failed, you need to get a new disk in and rebuilt before the next disk fails, or you're toast. Now, one other thing I'll point out: these disks, one through five, could be spread all over the Internet and still give you this reliability, which is kind of fun.
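As a concrete sketch of that parity idea: parity is the byte-wise XOR of the data blocks in a stripe, and any single lost block can be rebuilt by XORing the survivors. The stripe shape (4 data disks + 1 parity) and the 4 KB block size are illustrative choices:

    #include <string.h>

    #define NDATA 4
    #define BSIZE 4096

    /* P = D0 ^ D1 ^ D2 ^ D3 */
    void compute_parity(unsigned char data[NDATA][BSIZE], unsigned char parity[BSIZE])
    {
        memset(parity, 0, BSIZE);
        for (int d = 0; d < NDATA; d++)
            for (int i = 0; i < BSIZE; i++)
                parity[i] ^= data[d][i];
    }

    /* Rebuild the block from the failed disk by XORing everything that survived. */
    void rebuild(unsigned char data[NDATA][BSIZE], unsigned char parity[BSIZE],
                 int failed, unsigned char out[BSIZE])
    {
        memcpy(out, parity, BSIZE);
        for (int d = 0; d < NDATA; d++)
            if (d != failed)
                for (int i = 0; i < BSIZE; i++)
                    out[i] ^= data[d][i];
    }

The same code works for any number of data disks in the group; the parity rotation across disks matters for spreading the write load, not for the math.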
The other thing I will point out is that if I'm reconstructing D2, D6, P2, etc., I can take the blocks I haven't reconstructed yet and still hand them off to somebody who's trying to read them — so I can have a RAID system that's in the middle of being reconstructed and is still servicing reads and writes from other processes. But it gets a little tricky, and especially if you're building a system with RAID 5 today, you probably need to stop and think, because you may be doing the wrong thing. The reason is that disks are now so big that this reconstruction process — disk three fails, I put in another four-terabyte disk, and it takes however long to reconstruct all four terabytes of data — easily opens up a window during which we could lose another disk, and then we've lost all of our data, and that would be bad. So what do we do? We need to allow more disks to fail, and it's not going to happen this way: if I add a disk six into this parity group, all that means is I can still only lose one of them, and as I add more disks, the probability of a failure goes up. So what I really need is something else, and the answer, in general, is that RAID is a form of erasure code. What we're really saying is that disk three was erased and I need to be able to repair an erasure, and to deal with more failures I need to tolerate more erasures. So today, what you should look for is at least RAID 6: RAID 6 controllers use two disks' worth of parity and allow two disks to fail, and the good thing about that is when the first one fails, assuming you have a spare ready to go, you can survive a second failure while you rebuild. Unfortunately, you need something more complicated than just XOR; there's an example up in the reading, the EVENODD code, which you should check out if you're curious. The other thing I want to mention is that the Reed-Solomon codes we brought up in the disk discussion earlier are useful in general. The simple thing I'll say about Reed-Solomon codes is the following. If you remember from when you first took algebra: if you have a polynomial P(x) = a0 + a1 x + a2 x^2 + ... + a(m-1) x^(m-1), the polynomial's values are fully determined by its coefficients, which means that if I get m points — P evaluated at x values 0, 1, 2, 3, and so on — that's enough to reconstruct the polynomial. Reed-Solomon codes can be thought of this way: if I have more points than just m — suppose I have n points, say n is four times m — then as long as I can get back any m of those points, I can reconstruct the polynomial, and if my data is the coefficients a0, a1, a2, a3, that means I can get my data back by fitting the polynomial. I'm not going to go into the details now; those of you who have taken security have probably run into Galois fields, which are finite fields that behave enough like the real numbers to let polynomials have these properties. What they give you is this m-of-n property, which is extremely cool, and I wanted to show you that m-of-n property.
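Spelling out that algebra a bit (this is standard polynomial interpolation, just filling in the step the lecture waves at): a polynomial of degree m-1,

\[
P(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{m-1} x^{m-1},
\]

is completely determined by its value at any m distinct points, and given m surviving fragments \((x_j, P(x_j))\) you can rebuild it explicitly with Lagrange interpolation:

\[
P(x) \;=\; \sum_{j=1}^{m} P(x_j) \prod_{\substack{k=1 \\ k \neq j}}^{m} \frac{x - x_k}{x_j - x_k}.
\]

Store the data in the coefficients \(a_0, \dots, a_{m-1}\), evaluate P at n > m points (over a Galois field in practice), ship one value per server, and any m of the n values are enough to get the data back — that's the m-of-n property.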
So what is this m-of-n property again? It says that if I encode my data with a polynomial into n fragments, where n is bigger than m, then as long as I get any m of them back, I can reconstruct my data. It's like a hologram: I could have n points, where n is four times m, and as long as any m of them come back, I'm good. So if you imagine a bunch of servers across the Internet, you send out n fragments, and as long as you can get m back from anywhere, you get the data back. And here's a fun graph, which says the following. Suppose I put four complete copies of my data out there, and every six months I find the copies that are still out there and re-replicate so I still have four copies. That's kind of boring and simple. If instead I encode with an erasure code and send out 64 little pieces, any 16 of which are enough to reconstruct, the total overhead of this 16-of-64 code is the same as keeping four copies. But in the first case I might lose 3% of my blocks per year just due to failure under these circumstances, while in the second case I'm going to lose something like a 10^-35 fraction of my blocks in a year. So erasure coding basically makes your data extraordinarily hard to damage, which gets us back to extreme durability. So if we wanted to really make sure our data is durable, we could use a file system that stores it out in the cloud somewhere, with fragments on, say, 64 servers, and as long as we can get responses back from some 16 of them, we can reconstruct our data. This is extremely durable. It's not necessarily as available, because it might mean I have to reach out to 16 of those servers and reconstruct using the code before I get my data back, but boy is it durable — it's like a digital pyramid, in the sense that it's really hard to destroy your data, and it's a nice way to make sure you basically never lose it. Okay. All right. Now: what if a disk loses power, or software crashes? Some operations in progress may complete, some may be lost, and you may overwrite a block only partially, so you can end up with a disk block on disk that's half written. And notice that RAID — even that extreme form of RAID I just showed you with Reed-Solomon codes — doesn't really help with a bad state of the file system. All it says is that if I have some data, I can make sure those bits are hard to destroy; if the bits are wrong, because I captured a transient, partial state of the file system, then all I've done is make sure that bad state of the file system is extremely durable and will last forever. So I have to think a little bit more about the actual semantics of my file system in order to preserve my data. File systems as a whole want some form of durability just as a minimum, and the question is not just how do I make bits durable, but how do I make the file system durable — which really means that data previously stored can be retrieved, maybe after some recovery step, regardless of the failure that might occur. And now this starts to get interesting for us. So the storage reliability problem is as follows: we have a single logical file operation, like a write, but it might update a bunch of different physical disk blocks. It might update the inode; it might update an indirect block — say you just extended your file past block 10 on the original BSD layout, and now all of a sudden you've got to allocate a new indirect block and then a new data block, and so on. So the inode gets updated, the indirect block gets updated, the data block gets updated, the free bitmap gets updated: one operation updates a whole bunch of things.
As a result, you might end up with garbage in one or more of those structures, so that single update you tried to do, consisting of a bunch of different pieces, never actually happened properly. And once you get sector remapping, as in flash, a single update to a physical disk block might itself require multiple updates to sectors, in circumstances where data has to be moved and reallocated and so on. So at the physical level, operations complete one at a time, but we want concurrent operations for performance, and we also want this reliability. How do we guarantee consistency regardless of when a crash occurs? This is an interesting problem in itself for how to design file systems. I showed you how to make data durable once you've encoded it in a RAID or spread it out and written all of those disks; but we have to get there from the writes the user does. So what are some threats to reliability here? One is interrupted operations: a crash or power failure in the middle of a series of related updates can leave the data in an inconsistent state. Think of transferring funds from one bank account to another and it only half happens — what if the transfer is interrupted after you withdraw and before you deposit? Now the money is just gone, right? Another is losing stored data: the non-volatile storage medium, like a disk, may cause previously stored data to be lost or corrupted, and this is the situation where RAID and techniques like Reed-Solomon can start to help us. Okay, so one approach is to really carefully order things: we sequence our operations in a specific order and carefully design our file system so that a sequence can be interrupted safely. And if things are interrupted because the system crashes, we do some sort of post-crash recovery when we reboot, which reads in the data structures, sees whether there were any operations in progress at the time, and then cleans up or finishes them as needed. This is actually the approach taken by a lot of systems — the FAT file system, the FFS file system, etc. For instance, on an FFS-style file system, or on Linux, there's something called fsck, which goes in and tries to make the file system consistent after a potential failure. There's also a lot of application-level recovery, where if you have a file in a regular format, like Word or Emacs with its autosaves, and a crash occurs, the application sometimes tries to recover for you as well by looking at previous versions of the data. So this is pretty ad hoc. For instance, in the case of the FFS-style file system, you create a file very carefully: normally you allocate a data block, you write the data block, you allocate and write an inode, you update the bitmap to say the blocks are no longer free, and then you update the directory with the file-name-to-inode-number mapping. Notice this careful order: if I fail anywhere along the road, I'm probably still mostly okay. I might allocate a data block; if I fail there, it doesn't matter, because I haven't updated the free map yet. I might write the data block; if I crash, that might be okay — I'll lose that data, but nothing will be inconsistent. I might allocate an inode for the new file; if I crash, the file just goes away. I might write the inode; if I crash, again the file goes away. And then I might update the bitmap.
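Here's that careful ordering written out as a sketch, with what a crash at each step costs. All of the helper functions are hypothetical stand-ins; the point is the order, and the fact that the directory update at the end acts as the commit point:

    extern unsigned alloc_data_block(void);
    extern void     write_data_block(unsigned blk, const void *data, unsigned len);
    extern unsigned alloc_inode(void);
    extern void     write_inode(unsigned ino, unsigned first_block);
    extern void     mark_used_in_bitmap(unsigned blk, unsigned ino);
    extern void     add_directory_entry(const char *name, unsigned ino);

    void create_file_carefully(const char *name, const void *data, unsigned len)
    {
        unsigned blk = alloc_data_block();  /* crash here: nothing marked used, nothing leaks     */
        write_data_block(blk, data, len);   /* crash here: data lost, but nothing inconsistent    */
        unsigned ino = alloc_inode();       /* crash here: the file simply never existed          */
        write_inode(ino, blk);              /* crash here: same                                   */
        mark_used_in_bitmap(blk, ino);      /* crash here: fsck finds a dangling, unlinked inode  */
        add_directory_entry(name, ino);     /* the commit point: file now reachable by name       */
    }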
Now, by the time I get to this point, I've got an update that's hopefully done atomically, where there's a new inode; but if I crash before it's in a directory, then I want some consistency check that goes and finds inodes that are dangling but have data in them, and puts them into a directory — it may not know which directory they were supposed to go in, but it can put them at the top level, in something like lost+found. So how do I recover? I scan the inode table; if there are any unlinked files, I might delete them or put them in a lost-and-found directory; I scan the inode trees and directories for missing updates; etc. I can go through a fairly extensive process to try to recover from a failed update to any one of these places. As you can imagine, there are a number of ways this can fail to work out the way we'd like, but it mostly works. This is fsck; it gets run at boot time, it figures out whether the file system is mostly in a good state, and it puts it back in a consistent state and hopefully recovers most of your data. Of course, the time for recovery is proportional to disk size, so if you've got a terabyte drive and you had a major crash, that fsck might take a long time; and if you have a 16-terabyte drive, like the flash drive I told you about a lecture or two ago, it's going to take a really long time to recover. So this is probably not great. The other idea — and by the way, this careful-ordering scheme of working my way through has a single commit point, kind of at the end, where I link the inode into the directory, and that's the point at which everything succeeds; but it's not really as clean as a transaction — another option is a copy-on-write file layout. When we update the file system, we write a new version of the file containing the update and never update anything in place, so a failed write never corrupts the old version, and we can reuse existing unchanged data blocks. Now, there was a question here: do we ever roll back changes made during recovery, or do we just continue through normal operation? Once recovery has started, it's usually very hard to roll back anything recovery does: it's making changes to your file system that often can't be undone. Back before journaling came along, which we'll talk about this lecture or maybe next time, it was possible to have an fsck run that started freeing up a bunch of your directories because it thought they were inconsistent, and you could actually lose data, and it was very hard to recover from that. So what you do in a really bad scenario is make a raw copy of all the blocks first and then run the recovery on that. But I'm going to show you something better when we get to journaling. So, with a copy-on-write file system, we never update in place; we reuse unchanged disk blocks wherever possible to represent the state of the file system, and only write new stuff to new blocks. That seems expensive, but you can think of it as every change to a file making a brand new version of the file, in a way where every version of the file is available simultaneously. It seems expensive, but while you're making the new version of the file, the old one is always around, so if anything bad happens you don't lose the old version. And almost all disk writes can occur in parallel in a system like this, because we're not destroying any data — we're just adding data.
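Here's a simplified sketch of a copy-on-write append, using a tiny binary "indirect block" tree like the one on the slide. It cheats a bit (it adds a brand-new rightmost leaf rather than copying a partially filled one), and every helper is hypothetical; the point is that old blocks are never overwritten, unchanged subtrees are shared, and the single swap of the root pointer is the atomic commit:

    struct node { unsigned left, right; };        /* a tiny binary "indirect" node */

    extern unsigned    alloc_block(void);                                  /* hypothetical */
    extern void        write_data_block(unsigned blk, const void *bytes, unsigned len);
    extern void        write_node(unsigned blk, struct node n);
    extern struct node read_node(unsigned blk);

    /* Append at the rightmost position: copy the path from root to that leaf,
     * share every unchanged left subtree, and return the NEW root block number.
     * Swapping the file's root pointer to this value is the one atomic commit. */
    unsigned cow_append(unsigned old_root, int depth, const void *bytes, unsigned len)
    {
        if (depth == 0) {                          /* leaf level: brand-new data block */
            unsigned leaf = alloc_block();
            write_data_block(leaf, bytes, len);    /* the old leaf is left untouched   */
            return leaf;
        }
        struct node n = read_node(old_root);       /* copy the old interior node...    */
        n.right = cow_append(n.right, depth - 1, bytes, len);  /* ...new right path    */
        unsigned copy = alloc_block();              /* n.left still points at the       */
        write_node(copy, n);                        /* shared, unchanged old subtree    */
        return copy;
    }

Until the root pointer is swapped, readers still see the complete old version; after the swap, they see the complete new one — there's no in-between state to corrupt.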
The other idea is a copy-on-write file layout. And by the way, this scheme of carefully working my way through the updates sort of has a single commit point at the very end, when it links the inode into the directory; that's the point at which everything succeeds. But it's not really as clean as a transaction. So another option is a copy-on-write file layout: when we update the file system, we write a new version of the file containing the update, and we never update anything in place, so a failed write never causes a problem, and we can reuse the existing unchanged data blocks.

Now, there was a question here: do we ever roll back changes made during recovery, or do we just continue on through normal operations? Once recovery has started, it's usually very hard to roll back anything recovery does; it's making changes to your file system that often can't be undone. Back before journaling came in, which we'll talk about later in this lecture or maybe next time, it was possible for an FSCK run to start freeing up a bunch of your directories because it thought they were inconsistent, and you could actually lose data and have a very hard time getting it back. So in a really bad scenario, what you do is make a raw copy of all the blocks first and then run the recovery on that. But I'm going to show you something better when we get to journaling.

So with a copy-on-write file system we never update in place; we reuse the unchanged disk blocks to keep track of the state of the file system. We reuse everything that hasn't changed as much as possible and only write new stuff to new blocks. This seems expensive, because you can think of it as every change to a file making a brand-new version of the file, with every version of the file available simultaneously. But while you're making the new version of the file, the old one is always around, so if anything bad happens you don't lose the old version. And almost all disk writes can occur in parallel in a system like this, because we're never destroying data, we're only adding it. This is the approach taken in a lot of network server appliances: NetApp's Write Anywhere File Layout (WAFL), and ZFS from Sun, now Oracle, or OpenZFS.

Here's an example of copy-on-write with a tree of blocks; I'm using a binary tree here to keep it simple. Here's a file with its inode tree, and the blue portion is all the existing data; the file has data up to the end of the blue, and I'm about to append to it. In a file system of the type we've been talking about, you would just overwrite that last block, adding the new data at the end. In a copy-on-write file system, you allocate a brand-new block, copy the old data into it, add the new data, and then build a tree representing the new file, where every block or subtree I haven't changed is linked into the new file. Only the indirect block at the very end changes: it used to point at the previous block, and now there's a new one whose left branch points at the unchanged data and whose right branch points at the new block. So now I have two versions of the file that exist simultaneously, and the only new data is the intermediate nodes and the block I've just written. As a result I get a very nice atomic change, where the old version gets replaced by the new version in some directory in a way that doesn't risk any data until we've actually done the swap. And in principle, if we do the same thing with the directories, we can keep around as many old versions as we would like, and there are file systems that work this way.

ZFS is a great example of that. It has variable-size blocks, from 512 bytes up to 128 kilobytes, and a symmetric tree of the form I just showed you, so it knows whether a file is large or small. When it makes the copy, it stores a version number with the pointers, so multiple versions can exist simultaneously in the file system. It can also buffer writes before creating a new version with them, so if I do a bunch of writes in a row, they can all go into one new version at once; that saves some of the versioning work and gives the operating system a little time to figure out how big the new file is going to be after the writes are done. Free space is kept as a tree of extents in each block group, so updates to the free space can be delayed and batched per block group. It's basically the copy-on-write I just showed you, and it's a file system with really strong reliability even under flaky scenarios where a lot of writes are happening and things crash, which you might see in a distributed file system, for instance. And ZFS, or OpenZFS, is available to you even as a Linux file system; you could use it if you want.
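Here's a toy version of that tree update in C, just to show the mechanics. It's a binary tree like the picture, with invented structures (this is not real WAFL or ZFS code): writing a block builds one new path from the root down to a new leaf and shares every unchanged subtree with the old version, so both versions of the file remain readable.

```c
/* Toy copy-on-write update of a binary tree of blocks: writing block
 * `index` allocates a new leaf plus new interior nodes along one path,
 * and the new root shares every unchanged subtree with the old root. */
#include <stdio.h>
#include <stdlib.h>

struct node {
    struct node *kid[2];   /* interior node: children; leaf: both NULL */
    int          data;     /* leaf: block contents                     */
};

static struct node *leaf(int data)
{
    struct node *n = calloc(1, sizeof *n);
    n->data = data;
    return n;
}

/* Return the root of a NEW version of the tree in which block `index`
 * (tree of the given depth) contains `data`.  The old tree is untouched,
 * so the old version of the file stays fully readable. */
static struct node *cow_write(struct node *old, int depth, int index, int data)
{
    if (depth == 0)
        return leaf(data);                          /* brand-new data block    */

    struct node *n   = calloc(1, sizeof *n);
    int          bit = (index >> (depth - 1)) & 1;  /* which child to descend  */
    n->kid[bit]  = cow_write(old ? old->kid[bit] : NULL, depth - 1, index, data);
    n->kid[!bit] = old ? old->kid[!bit] : NULL;     /* share unchanged subtree */
    return n;
}

int main(void)
{
    struct node *v1 = NULL;
    for (int i = 0; i < 4; i++)                 /* version 1: blocks 0..3      */
        v1 = cow_write(v1, 2, i, 100 + i);

    struct node *v2 = cow_write(v1, 2, 3, 999); /* new version, block 3 changed */

    /* Both versions exist at once; only one path of nodes was copied. */
    printf("v1 block 3 = %d, v2 block 3 = %d\n",
           v1->kid[1]->kid[1]->data, v2->kid[1]->kid[1]->data);
    printf("v1 and v2 share block 0: %s\n",
           v1->kid[0]->kid[0] == v2->kid[0]->kid[0] ? "yes" : "no");
    return 0;
}
```

Notice that publishing the new version is then just swapping in the new root pointer, which is the single atomic step the lecture describes.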
So what are some more general solutions, beyond copy-on-write? Things like transactions. We could take all of those different operations I mentioned, all the inode and directory operations that have to happen together, group them into some sort of transaction, and then do a commit. And what does a transaction mean? Either everything commits or nothing commits, so the file system is never inconsistent. What we want from transactions is to ensure that multiple related updates are performed atomically: if a crash occurs, we just revert by throwing out all the partially done transactions and make sure they have no impact on the state of the file system. As a result, the file system itself is never inconsistent.

Furthermore, this idea of throwing out all the partially done transactions is actually a pretty good semantic to give to a user, because you only tell them that a write has completed after its transaction has actually committed, so they are never under the false impression that something has been written to the file system before it actually has been. You can also provide redundancy against media failures by, for instance, not committing a transaction until you have written everything to all of the drives in a RAID group, or replicated it out. So you can trade off performance for redundancy in a system like this as well, by deciding when you declare a transaction committed.

So what's a transaction? Transactions are closely related to the critical sections, the atomic sections, we talked about earlier in the term: either everything happens or nothing happens, and there's effectively only one of them operating on the data at a time. We're extending the concept of atomic updates from memory to stable storage: we're going to atomically update a set of persistent data structures stored on disk in a way that keeps our file system consistent. There are many ad hoc approaches; what I showed you earlier with the fast file system, carefully ordering the sequence of updates, is a kind of handcrafted, ad hoc transaction where only the very last step really commits the file to the directory. But the key concept of a transaction is this: an atomic sequence of actions, reads or writes, on a storage system that takes you from one consistent state to another. We start in consistent state one, a transaction takes us to consistent state two, which maybe represents a new file being created or some data being updated, and there are no inconsistent states in between; we only ever go from one to the other. For file systems, that means a transaction has a set of things that must happen together as a unit to get us from state one to state two, and we always do all of those things or we stay in state one; there is no half state.

So here's the typical pattern: you begin the transaction, you do a bunch of updates, and if any of them fail you roll back and throw everything out; otherwise you commit. Here's the classic example: I want to transfer money from Alice's account to Bob's. I begin a transaction, I do the pieces together, pulling the money out of Alice's account and putting it into Bob's, and then I commit. The idea behind a transactional system is that all of these things happen or none of them happen, so we don't lose any money in the process, just as we talked about early in the term. The properties we want out of transactions are atomicity, consistency, isolation, and durability (ACID).
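Here is that transfer written against a tiny invented transaction interface (txn_begin, txn_set, txn_commit, and txn_abort are all made-up names, and this version just stages updates in memory). The shape is the thing to notice: updates are staged inside the transaction and only become visible at commit, so either both balances change or neither does.

```c
/* The account-transfer example as a tiny in-memory "transaction".
 * The API is invented for illustration; staged updates become
 * visible only when txn_commit applies them. */
#include <stdio.h>

#define MAX_UPDATES 8

struct txn {
    int *where[MAX_UPDATES];     /* location each staged update targets */
    int  value[MAX_UPDATES];     /* new value for that location         */
    int  n;
};

static void txn_begin(struct txn *t)              { t->n = 0; }
static void txn_set(struct txn *t, int *p, int v) { t->where[t->n] = p; t->value[t->n] = v; t->n++; }
static void txn_abort(struct txn *t)              { t->n = 0; }   /* nothing was visible yet   */
static void txn_commit(struct txn *t)                             /* apply everything "at once" */
{
    for (int i = 0; i < t->n; i++)
        *t->where[i] = t->value[i];
    t->n = 0;
}

int main(void)
{
    int alice = 100, bob = 100, amount = 30;
    struct txn t;

    txn_begin(&t);
    if (alice < amount) {
        txn_abort(&t);                         /* any failure: abort, nothing happened */
    } else {
        txn_set(&t, &alice, alice - amount);   /* withdraw */
        txn_set(&t, &bob,   bob   + amount);   /* deposit  */
        txn_commit(&t);                        /* both happen, or neither  */
    }
    printf("alice=%d bob=%d\n", alice, bob);   /* prints alice=70 bob=130  */
    return 0;
}
```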
I'll potentially talk more about those; I'm running low on time here, but I want to give you the idea of what our basic approach is going to be, and we'll pick it up on Thursday. One simple action becomes an atomic sequence of actions. Here we have a bunch of different things that we want to put together, and we write them into a log. Meanwhile, other stuff is happening, the blue things and the green things and whatever color those other ones are, and they're all interleaved in the log together. So how do I make sure that either all of my things happen or none of them do? Very simply: I write a "start transaction" record at the beginning, and the fact that we've written all of these things into the log has no impact on the abstract state of the system until I do one last write, the commit. The moment I write the commit, all of the things that are part of that transaction are abstractly and simultaneously committed to the system. And what's great about this is that a single write, which we make sure happens atomically, commits all of these things. If we never get around to that single write, it's as if they never happened; and if we do, it's as if they absolutely all happened. The real trick is going to be how we apply this to file systems, where the things in the middle are going to be, you know, allocate an inode, make a directory modification, do some writes; a single commit in our log will suddenly say "that happened," and the absence of a commit will mean it didn't happen.

So next time we're really going to show you how to do this: we're going to get better reliability through the use of a log. All changes are going to be treated as transactions; transactions are committed once they're written to the log; and the data is forced out to the log for reliability (it might be forced into NVRAM to make things fast). Although the file system may not be updated immediately, the data is preserved in the log. What do I mean by that? If we have a bunch of writes to different things, inodes and directory pieces and so on, and then we say commit, then even though we haven't updated the file system itself, the fact that those writes are in the log means that anybody trying to read the data after recovery sees the effect of all of them, as if they had been applied to the file system, and the user doesn't have to know the difference. That's how we make this atomic.

Now, the difference between log-structured and journaled file systems, which we'll come to soon, is that in a log-structured file system the data stays in the log, whereas in a journaled file system the log is only used for recovery: it's a transient place for holding information until it's committed. In a journaled system the records might say write this inode, do that directory update, write this data, and so on, and then we say commit. In a log-structured file system the data would just stay there, and we'd have to figure out how to follow the pointers in the log itself to get to our data. In a journaled file system, we transfer all of those updates to the real file system, which might be laid out like a fast file system, and once we've done that we can throw out the entries in the log. The journal in the journaled case is only there to help us if we actually crash between the time we hit commit and the time we've transferred all of those updates over to the file system. We'll get to that next time. There are many journaling file systems out there, and you're probably using one today. Updates to system metadata are done with transactions in all of them; updates to non-directory files, that is, user data, can be done in place without the log, or can be done with full logging.
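Here is a toy redo log in C that makes the single-commit-write idea concrete. The record format and function names are invented (a real journal lives on disk and carries checksums, sequence numbers, and so on): updates are appended as records, a final commit record seals the transaction, and recovery replays only the transactions whose commit record actually made it into the log.

```c
/* Toy redo log: a transaction's updates are appended as records, then one
 * final COMMIT record.  Recovery replays a transaction's updates only if
 * its COMMIT record is present.  Formats and names are invented. */
#include <stdio.h>

enum rec_type { REC_UPDATE, REC_COMMIT };

struct log_rec {
    enum rec_type type;
    int           txn_id;
    int           block;     /* which "disk block" to update */
    int           value;     /* new contents of that block   */
};

#define LOG_SIZE 64
#define NBLOCKS  8

static struct log_rec journal[LOG_SIZE];
static int            journal_len;
static int            disk[NBLOCKS];        /* the "real" file system blocks */

static void log_update(int txn, int blk, int val)
{
    journal[journal_len++] = (struct log_rec){ REC_UPDATE, txn, blk, val };
}
static void log_commit(int txn)
{
    /* This single append is the commit point for the whole transaction. */
    journal[journal_len++] = (struct log_rec){ REC_COMMIT, txn, 0, 0 };
}

/* After a crash: replay only updates belonging to committed transactions. */
static void recover(void)
{
    for (int i = 0; i < journal_len; i++) {
        if (journal[i].type != REC_UPDATE) continue;
        int committed = 0;
        for (int j = i + 1; j < journal_len; j++)
            if (journal[j].type == REC_COMMIT && journal[j].txn_id == journal[i].txn_id)
                committed = 1;
        if (committed)
            disk[journal[i].block] = journal[i].value;   /* redo the update */
        /* uncommitted updates are ignored, as if they never happened */
    }
}

int main(void)
{
    log_update(1, 2, 42);   /* txn 1: write an inode block      */
    log_update(1, 5, 77);   /* txn 1: write a directory block   */
    log_commit(1);          /* txn 1 committed                  */
    log_update(2, 3, 99);   /* txn 2: "crash" before its commit */

    recover();              /* pretend we just rebooted         */
    printf("block2=%d block5=%d block3=%d\n", disk[2], disk[5], disk[3]);
    /* prints block2=42 block5=77 block3=0: txn 2 never happened */
    return 0;
}
```

Transaction 2's update is skipped at recovery because its commit record never got written, which is exactly the "absence of a commit means it didn't happen" behavior described above.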
I'll show you how that works. And just to name a few: NTFS, Apple HFS+, Linux XFS, JFS, ext3, ext4; you name it, they all have journaling in there to give you better reliability. There was also a question about whether there's hardware support for parallel writes in the context of a transaction. The nice thing about transactions is that as long as your log data goes out in order, you can write out to the file system in parallel without any problem.

All right, so finishing up; we'll pick this up next time. We talked about file systems, which transform blocks into files and directories. We're optimizing for size, access, and usage patterns, and that may determine how we build our inode structures; remember that in NTFS we built a slightly different, database-like structure instead of inodes. We might be trying to maximize sequential access, to allow really fast access for things like video, while still having efficient random access. And then we have to figure out how to provide protection. A file, as we mentioned, is defined by its header, called the inode, or by an entry in the master file table in the case of NTFS. Naming is the process of walking through the user-visible names, which means scanning through the directories. Multi-level index schemes, like the inodes of the fast file system or the equivalent structures in NTFS, are basically part of the file system structure. We talked about layout being driven by how free space is managed and by how we want to get good performance. Next time we're going to pick up with logging, and then we'll end with flash file systems. We also talked about the buffer cache and what that's about.

Okay, so I'm going to let you all go, and we're going to pick this up on Thursday. I hope you all have a great end of your night and Wednesday, and we'll see you on Thursday. And in terms of questions about midterm 2, we'll be answering those on Piazza. All right, we'll see you all later. I'm going to end the recording now. Bye now.