All right, good afternoon. Hopefully quiz three went okay; looking at it so far, it went okay-ish. I don't know when the TAs are meeting to mark it, but I will update you as soon as I know any more information. So today we're gonna be a bit more modern. We're gonna talk about SSDs, which people actually have in their machines, and RAID, which is a fun topic. So SSDs, or solid state drives, are more modern than hard disk drives. There's no mechanical parts; they're kind of like RAM in that they use transistors to store data rather than magnetic disks, so you can't actually ruin one with a magnet. So pros: no moving parts or other physical limitations where the spinning actually matters, no waiting for something to physically move. You're gonna get higher throughput, and good random access, so everything doesn't have to be sequential. Like we saw with spinning disks, if we didn't access them sequentially, our performance went down by like two orders of magnitude. They're more energy efficient, again because we're not spinning a hunk of metal, and they have better space density, so we can fit more bits of information in a smaller space. But they're gonna be more expensive, and they're gonna have lower endurance: solid state devices actually have a limited number of times that you can write to them, and after that they wear out and you can no longer write to them anymore. So there is a point where they will just kind of wear out and you won't be able to use them anymore. And as we'll see, based off just how they're made and their architecture, they're also more complicated to write drivers for. So this is the architecture of an SSD and what it looks like for our purposes. We don't really have to care about the die in yellow or the plane in brown. We only have to care about blocks and pages. So blocks aren't exactly what we talked about before: a block for a solid state drive is a group of pages. 
So all the pages, luckily, are usually four kilobytes, which actually matches up to our virtual memory pages, which is nice. The not nice thing is that they are organized in blocks, and there are special rules for reading and writing them. So SSDs use NAND flash. They're gonna be much faster than hard drives. Page size will typically be four kilobytes, like I said, and then it's gonna be about an order of magnitude quicker than a hard drive: reading a page would be like 10 microseconds, writing a page would be 100 microseconds, and then, curiously, erasing a block is one millisecond. And you will actually note here that it is impossible to erase a single page; I can only erase whole blocks, which is why writing drivers for these is kind of complicated, and your kernel actually has to handle them specially in order for everything to work correctly. Yep, so in the next slide we'll see that you're only allowed to write to a page once after it's erased, and after that it's locked until you erase it again, so you still have to erase it. 
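Those rules can be captured in a little simulation. This is a minimal sketch, not how a real flash controller works; the `FlashBlock` class and its method names are made up for illustration:

```python
class FlashBlock:
    """Toy model of one SSD block: each page can be written once after an
    erase, and erasing happens only at whole-block granularity."""

    def __init__(self, num_pages=128):
        self.pages = [None] * num_pages        # None means freshly erased

    def read(self, i):
        if self.pages[i] is None:
            raise ValueError("page has not been written yet")
        return self.pages[i]

    def write(self, i, data):
        if self.pages[i] is not None:
            raise ValueError("page already written; erase the whole block first")
        self.pages[i] = data

    def erase(self):
        # There is no per-page erase: the entire block is wiped at once.
        self.pages = [None] * len(self.pages)


block = FlashBlock()
block.write(0, b"hello")
assert block.read(0) == b"hello"
try:
    block.write(0, b"world")                   # rewriting without an erase fails
except ValueError:
    pass
block.erase()                                  # whole block wiped
block.write(0, b"world")                       # now the write succeeds
```

The `erase()` wiping every page at once is exactly why modifying one page is expensive: any still-live pages in the block have to be copied somewhere else first.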
So yeah, this is where that comes in. If you actually program and write drivers for these things, then you have to use pages and blocks. So the rules are: you can only read complete pages, and they have to have been written first, so you can't just read uninitialized stuff; you can only write to freshly erased pages, so you can't rewrite over something without doing an erase in between; and erasing is only done at the block level, where a block will typically have like 128 or 256 pages. Yeah, the entire block needs to be erased before you can do even one write operation, and that is why writing would be slow. So if you want to modify one page that has already been written to, well, the naive thing to do would be to copy all the other, like, 127 pages on that block to a new freshly erased block, then write the modified page to that block, and then update all the information so it points to the new block instead of the old block. So you're going to have to do lots of writes if you just want to modify a single page. The OS, or the kernel, can help with this a little bit. So SSDs will need to garbage collect. Yeah, sorry, that's what you have to do, but if you want to modify just a single page you don't want to write out 127 others just to write out one, so we'll see we can do something slightly better. The naive thing, if you want to write something new, is you take a freshly erased block, transfer everything over to it, and then erase the old block, and then you have to update everything. So the SSD will need to garbage collect some blocks. That case I was talking about, where we have to move 127 pages, would of course be the worst case. One smarter thing the kernel could do, instead of moving all 127, is to just move the pages that it knows are actually still in use. That would still be some overhead, but it's at least more 
intelligent, and we're not just writing n minus one pages every time. So the disk controller also needs some help to indicate which blocks are alive. Even if you've used a block and written to it, the SSD is going to assume that that block is in use, even if the kernel no longer uses any of those pages. So the kernel needs some way to talk to the disk and tell it: hey, that block only has pages I'm no longer using, so you can go ahead and erase it at your will, so I can write to it again later. So that's what the trim command is. If you've ever seen that, it was mostly an issue when SSDs were new and not every kernel actually took SSDs into account, but now pretty much everything is going to have it. The trim command is basically the kernel informing the SSD: I'm not actually using any pages on this block, you're free to garbage collect it and erase it. And then the SSD hardware, whenever it's idle, can just go ahead and erase it in the background, whenever it's not doing anything else terribly important. So that's the long and short of SSDs. We won't go into too much detail about them; just really know that they have these special rules where you have to erase a block at a time, blocks contain multiple pages, and you can only write to a freshly erased page, and that's about it. And then, yeah, so far we've been talking about single devices. So, I guess mockingly, this is just sometimes called a single large expensive disk, or SLED, and you can tell by the wording that it might not be such a great idea. It's just one big large disk. The naive thing to do, again, is if I run out of space for all my files, well, I just go buy another hard drive, transfer everything to the new hard drive, and start using that. But the drawbacks of that: one, it's going to be expensive as you try and get more density, and also it is a single point of failure, so if that disk dies then all 
your data is gone. So what people typically use is something called a RAID, a redundant array of independent disks. It's basically just data distributed across multiple disks, and when you do that you can get a whole bunch of different trade-offs: whether you want to trade off performance or redundancy in case one of the drives fails, and how many failures you actually want to endure. So one of the first RAID configurations you might see is something called RAID 0. Has anyone ever used RAID before? Probably not. One, okay. So this is a term you'd know if you're like a data hoarder, or if you really want good performance you would use something like this. RAID 0 is also called a striped volume, and all that means is all the data is just put together in kind of larger chunks, and then those chunks are distributed over the disks. So in this example, say file A was made up of eight chunks: the odd numbered chunks would go on the first disk, which would be disk 0, and the even numbered chunks would go on the second disk, which would be disk 1. So now whenever I need to read the file, well, I can read half from disk 0 and half from disk 1, and it's going to be twice as fast. And then similarly if I need to modify it, assuming I'm modifying something that's big enough, well, instead of writing all that information to one disk, I write half the information to one disk and the other half to the other disk. So your reads improve by a factor of how many disks you have, and so do your writes. You can do the same idea with more and more drives: three drives, four drives, five drives. But the big drawback is that if any of those drives dies, then you lose data, because you only have one copy of everything and no way to get it back. So instead of having a single point of failure with one disk, now you have multiple points of failure, which is kind of bad if you actually care about the data. So it's really for performance only; all the data is just 
striped, or basically just split into chunks and put on multiple disks. So you're going to have faster parallel access; you're going to get roughly n times speedup for both reading and writing. But any disk that fails is going to result in data loss: you have way more points of failure, n points of failure for n disks, and it only takes one. So the kind of opposite approach, where you essentially make an exact copy of everything, is called RAID 1, and that makes each disk an exact mirror of the others. So if that file now is just four chunks, well, if all four chunks are on disk 0, then all four chunks are also on disk 1. So now if either of those drives dies, I still have a backup, and you can do this with multiple drives: I could have four drives that are all the same, or five drives. So this is where the textbook terminology gets weird. We'll discuss at the end something called RAID 10, or RAID 1 plus 0; when the textbook expands RAID 1 beyond two disks it doesn't just assume every disk is an exact mirror of the others, it goes into this weird alternate thing that we'll discuss later. So just watch yourself if you're reading the textbook. But RAID 1: simple but pretty wasteful. Every disk, again, is an exact copy of every other disk, and this gives us way more reliability. As long as one disk remains we have no data loss, because they're all the same. So if I have four disks here that are all the same, I could lose up to three of them; I can tolerate everything except the last disk dying. So it's going to have good reliability, since that's a lot of failures that need to occur before you have data loss, and it's going to have good read performance, because it doesn't just have to read from one disk, it can split up the reads across multiple disks. In this example, say there were four: one disk could send a1, one disk could send a2, one disk could send a3, one disk could send a4, so you could split 
up the reads across the disks. But you couldn't split up the writes, so it'll have good read performance, but the write performance will be the same as a single disk, because you have to write the data to all the disks, and it might actually be a bit slower if you kind of run out of bandwidth. Yeah, because if I'm writing, say I have four disks in this case, if I modify something in a1 I have to write it to every single disk. Yeah, so you're assuming the bottleneck is actually writing to the device, so if I write to four devices at the same time it's the same as writing to one; that's why I said assume we have enough bandwidth. Yeah, well, you'd issue all the same commands to all the disks, so ideally all the disks would pretty much be from the same manufacturer and everything, so they'd all behave the same and you just send the same commands to all of them. No, these would be independent disks: disk 0 and disk 1 would be two separate hard drives, they're separate, yeah. Yeah, so a single drive has multiple platters, and one head per platter is just the only way to access data on a platter, but no, in this case these are separate devices. Disk 0 and disk 1 could each be a hard drive, an SSD, whatever. You can buy a bunch of SSDs and configure them like this if you want; you can configure a bunch of hard drives like this if you want. Typically if you do this you really only care about storing as much data as possible, so if you care about your data you go to the store, you buy a bunch of spinning disks, and you set them up like this. SSDs will fail too, right? So if you care about your data you might want this, or if you want the performance you'd use SSDs: I could do RAID 0 and get twice the performance of my single SSD, I just buy another one. Yeah, the idea of RAID is in its name, the redundant array of independent disks, so 
everything's independent and then you configure them however you want. So we've seen two configurations: there's RAID 0, which is like go as fast as possible, I don't care if one fails, I'll take some data loss; and then RAID 1 is as safe as possible, everything's a copy. One way to think of it is that it's like automatically creating a backup for you: anything you modify, your lab say, is saved on two devices, so now if one dies the other one still has your data and you don't lose your lab or anything like that. Okay, so then comes the idea for RAID 4, which I'm showing just for illustration, it's not actually used, but the idea here is key. Yeah, the question is where two and three went. So two and three were large mistakes, they didn't work that well, so we'll skip over two and three, and we'd normally skip over four too because four basically doesn't exist either, because it's a bad idea, but it shows how things work and how people get ideas. Same thing, redundant array of independent disks; the numbers just correspond to how they're configured. So RAID 4 does kind of what RAID 0 was doing, where it stripes the data across disks, but here there's one disk dedicated to parity, and that disk stores some information such that if one of the other disks dies, we can actually reconstruct what each bit was. The easiest way to reason about it is with bits, so we can see something like this. Say we had, I don't know, three bits of information, and we had four drives. The last one I'll call disk 3, and this disk 3 is going to be our parity drive; it's going to essentially be an XOR of everything. So let's say across these disks we're storing one zero one, we're storing zero one zero, and then we're storing one one one. So the idea here is, sorry, we're getting there, this is our data across our three data disks, and we're 
going to use the last disk to store some information such that we could recreate the data if a disk dies. So disk 3 in this case will store parity information, which will be an XOR of all the bits. For the disks here I'm just saying there are three bits across all of them; I'm just illustrating, because I don't want to write out a full byte. So for the first bit, across disks 0, 1, and 2: well, the shortcut way to think about an XOR of three things is that it's going to be zero if the sum of all of them is even, and one if the sum of all of them is odd. So in this case, if we add one plus zero plus one, well, that's even, so the XOR of all that is going to be zero, and we would store that on the parity disk. Then for the next bit we have zero, one, and zero, so that all together is odd, so that would be a one, indicating it's odd. And then for the last one we have one plus one plus one, so that's three, that's odd, so if we XORed all of them that would be one. So yeah, sorry, what was your question? How do we know a disk failed? The hardware will say it failed, or the kernel will try and write to it and it won't actually be written to, or it won't respond anymore, or something like that. And if the parity just disagrees, then there's probably been a failure too. Yeah, so that's a good question: if the parity disk itself fails, well, you throw in another disk and recalculate the parities for everything. You have to trust the hardware to report whether it works; these devices will say whether or not they work. Yeah, so that's a good question too: if everything was working fine and I go ahead and modify information on disk 1, then I'm going to have to update the parity bit on disk 3, and then similarly if I 
change anything on disk 0, I also have to update the parity information on disk 3, and if I update anything on disk 2, I also have to update disk 3. So yeah, there's going to be some concurrency issues here, right. Yeah, that's what we're going to do. Okay, so this is the reconstruction. Let's assume disk 0 dies, so disk 0 is now dead. Just remember it was one zero one; now, without using that information directly, we can get it back. So disk 0 is dead, all its bits are gone. Ideally what you do here is you buy a new disk, stick it in, and then you can recreate the failed disk 0 and copy it onto the new disk, using the parity information to recalculate it. So we have an unknown here: it's unknown plus zero plus one, and reading the parity bit tells us the total is even, so I need either a one or a zero added to one to make it even, so I need a one, right? So that first bit is definitely a one, in order to jive with the parity bit there. Then for the second bit, well, I have a one plus zero plus an unknown, and that has to be odd, so knowing that parity bit, the only value it could be is a zero. And similarly for the last one, I have an unknown plus one plus one, so unknown plus two, and that has to be odd; if two plus some unknown is odd, well, that unknown has to be a one, given it can only be a one or a zero. So just knowing what was on disk 3, that parity information, and knowing it's an XOR, we're able to recreate the data on disk 0. Yeah, I'm just showing you one bit per disk at a time, but if you do this times eight it's a byte; just add five more columns. Yeah, well, if you're dealing with the hardware, this is at the bit level, but you'd be loading it in as a page or something like that and then writing it back out as a page; you just issue commands to the disk and let the disk figure it out. So in 
this too, you can mix and match disks, but typically the array will be as slow as the slowest disk. If they're all from the same manufacturer they'll probably all behave the same, so you won't have a bottleneck, but if you gave it, say, three SSDs and one hard drive, well, then it's going to be essentially as slow as the hard drive. All right, so that is the idea behind the parity calculation, but as we kind of discussed, this is not great, because if you modify anything on any of the disks, even if I were to only modify a1 on disk 0, b2 on disk 1, and c3 on disk 2, well, one block on each of those disks means I have to update the parity information on disk 3. So it's going to do three times the number of writes as the other disks; it's going to be very disproportionate, and you're going to get bottlenecked by that parity disk. That's why it's not used and I'm only using it to illustrate. But with parity, you can use essentially all the space minus the space of one disk, and you need at least three drives to do RAID 4. We can get n minus 1 times performance, because all the rest of the data is striped; that's essentially just arguing that if we ignore the parity disk, it's like RAID 0. But then the con is that that's not actually going to be true, because your write performance is really going to suffer, since any update has to write to the parity disk. The other pro is we can tolerate a single disk failure, with good performance and without wasting that much space; but if two disks fail, then we're screwed. So RAID 5 is how RAID 4 is actually used. RAID 5 is exactly like RAID 4, except there's no dedicated parity disk; it distributes all the parity information across all the disks. So where before disk 3 had the parity calculation for A, B, C, and D, now disk 3 only has the parity calculation for A, disk 2 has the calculation for B, disk 1 has the 
calculation for C, and disk 0 has the calculation for D. So you're hoping that since you distributed it across them, the parity updates will even out across all the disks. So RAID 5 is the first level, aside from 0 and 1, that's actually used; 4 is not, because it's going to be really uneven on that parity disk. RAID 5 is basically just an improved RAID 4: all the same pros, except it doesn't have that con where write performance suffers because we're bottlenecked on one drive. So RAID 6 is like RAID 5, except there are now two parity bits. Where before our parity bit was called P, now we have another parity bit called Q, and like going from RAID 4 to RAID 5, we're not going to have two disks that just hold P and Q; we're going to distribute them across all the disks. Your first parity calculation, for P, is a simple XOR. If you're interested in the calculation for Q, the second one, that is where your linear algebra courses come into play, and I won't go over the details; it somewhat involves a Galois field, so if you don't know what that is you can learn it later, but you have to do some more complicated math to get a second, independent parity bit. From a usage point of view, all you really care about is: with the extra parity I essentially lose another disk of space, I need at least four disks to do RAID 6, and the write performance is going to be slightly worse than RAID 5 because I have another parity calculation. And one thing that's listed here that is a big drawback too: you might actually care about reconstruction time, how long it takes to make the new disk a copy of what it is replacing. So, like, when I was in grad school I had a RAID 5 array. I was doing my work, one of my drives died, and I was like, yeah, that's okay, all my data is fine, I can tolerate one drive failure. But sure 
enough, when you buy hard drives together, they tend to fail together. So after about a month another one failed, and then I lost data. So you want to replace them when they start failing. They're expensive, they're still expensive, yeah, drive space isn't free. What about an SSD? Let's see, quick Google search... yeah, they don't really make 12 terabyte SSDs. I just bought a 12 terabyte hard drive, but you can't really buy a 12 terabyte SSD; well, you can, except they're going to be like several thousand dollars. No, in grad school the drives were smaller than that; it was more that I cared about my stuff not getting deleted, so I had a RAID, and then it died. So replace your drives, because they tend to fail together. So, sorry, yeah, it was RAID 5, so it was like this: one of the disks died, which is fine, you can still use the array because you can calculate all the missing information on the fly, so you can still access all your files and everything. So I was like, yeah, this is fine, I can still work, because it takes like days to rebuild, to recalculate the parity. I forget how big they were, probably like four terabytes, so it might take like four days to rebuild onto the new disk. So here, if you have four disks and one dies, you have three; you take out the old one, put in the new one, and then you have four again, and it has to reconstruct all the information that was on the one that died onto the new one. Well, two died; I never got to the reconstructing step. One died, I was like, yeah, this is fine, and then a second one died. So that's why it's good to know this. In actuality two died, but the second was only just dying: as soon as it started to go I ran to the store and bought a new one, and since it was just starting to fail I actually got most of the information off it; it gimped along. Yeah, this is assuming disk drives can fail in fairly spectacular ways, like the spinning magnetic 
disk: one spectacular way they can fail is called a head crash, where the magnetic head hits the platter and essentially makes it explode, and you're not going to retrieve any data off that because it is now in a billion pieces. Yeah, it becomes a slow RAID 0. So if one of these disks dies, say I don't have a3 anymore: in normal RAID 0 I would have a1, a2, and a3 and I could just read them all, but here, if the disk holding a3 dies and I want a3 back, I have to calculate it back. So there's going to be a calculation step, and the array is going to actually perform a little bit slower with one disk dead. It does depend, yeah: remember the parity is distributed across the disks, so most reads will probably still hit data directly, but you're still going to have some parity calculations, so it'll be slower while one's dead. Ideally you don't do what I did, and you go replace it. In this case, if just disk 0 died, yeah, you would lose its parity blocks too, but you can recreate those parity bits because all the other information is still there. Okay, so another thing you could do is something called RAID 10, or sometimes RAID 1 plus 0; this is the RAID 1 combination-of-two that's in the textbook. The idea behind it is that it stripes, but it has two copies of everything. So say we have disk 0, disk 1, disk 2; I'm going to use six disks to illustrate this, and disks 3 through 5 I'll put to the side. The idea is, for half the disks you do what RAID 0 does and you just completely stripe the information: for example, I would have a0, a1, a2, and then the next chunks of that file, a3, a4, and a5. So within those three disks everything is striped, essentially exactly like RAID 0. And then the one plus zero is, I 
make a copy of all that with the other disks. So in this dimension it's kind of like RAID 1: disk 3 is an exact copy of disk 0, disk 4 is an exact copy of disk 1, and disk 5 is an exact copy of disk 2. That way I get pretty good performance, and also if one of the disks dies, say disk 0 dies, well, the cost of recreating it is going to be much lower, because there are no calculations involved: I just take out disk 0, put in a new disk, and make it an exact copy of disk 3, and only disk 3. You just have to copy all the data from one disk to another; there are no calculations across disks involved or anything like that. So instead of my scenario, where it would have taken days to reconstruct, it only takes hours to just straight copy a drive, especially if they're spinning disks, where you can copy it all sequentially and it will be quite fast. So now the question for you is: in this configuration, how many drive failures can I withstand before I actually start losing information? Losing information here means I have one of these blocks with no copies left and, since this has no parity, no means to reconstruct it. So we got two, two minimum, yeah. So my worst case is that if I lose two disks I might be screwed, but if I'm lucky I could lose up to three in this scenario. So why is that? Let's assume disk 3 dies, so disk 3 is dead. If disk 3 is dead, well, my danger zone, if you want to think of it that way, is that I just can't lose disk 0, because disk 0 now holds my only copy of a0 and a3; but I could lose any other disk, because there are two copies of everything else. So if I lose this disk I'm 
screwed, I lost a third of my data and it's no longer there, but I could lose any other disk. So say I lose disk 1; now we're in a bit more of a danger zone. With disk 1 gone, assuming everything's uniformly distributed, I'm at a 50-50 shot if another disk fails. So disk 0, 4, 2, or 5 could die next: if disk 0 dies, I'm screwed, because now I don't have a0 or a3; if disk 4 dies, I don't have a1 or a4 anymore; but either disk 2 or disk 5 could die, and if one of those dies, that's fine, I still have all the data, I just don't have any backup copies of anything anymore. Because there are mirrors of everything: if I lose both disks of a mirrored pair I'm done, but one failure in every pair is fine, I can go with that. So this extends up: if I have eight disks I can still only guarantee tolerating two failures, because no matter how many you have, if both disks of one pair die then you're screwed; but I have more pairs, so if I'm really lucky I could tolerate four failures with eight disks, for example. Yeah, with eight disks it would essentially be like two more pairs here. Because they're mirrors, if you got really lucky the most failures you could tolerate is the number of disks divided by two, but your guaranteed minimum is always two, because the second failure could knock out the other half of a mirror. So it depends how lucky you think you are. The nice thing about this is that reconstruction, again, is really quick, because there's no computation or anything like that. But one of the big drawbacks is that I'm essentially wasting half the space just for the mirror. With RAID 5 and RAID 6, the more disks I have, the more space I get back, closer to the theoretical max if 
they were all in RAID 0, but with this your capacity is always going to be half. Yeah, so the question is: could I use really reliable drives for one half of the mirror and cheaper, less reliable ones for the other half? I imagine you could; there are more exotic file systems, if you really care, that can do a bit better and compute the parity a lot faster, but at that point you're typically wasting your money on really reliable drives; you'd be better served just buying more of them. So the question is, how long does it take a disk to fail? I mean, there's a distribution of failures. Off the top of my head it really depends on what drive you buy and who makes it, but the average failure rate over two years, ballpark, is probably like five percent, something like that. So if you are Google and you have thousands and thousands of drives, you're going to be somewhere in that distribution, and something like five percent are going to die within two years. Yeah, I've been lucky: some drives have lasted eight years and still work, some twelve years, who knows. The drives that died on me, it was like warranty time plus two months; I think they died after a year and two months, so I got unlucky. That was a notoriously bad batch, and now I just don't buy drives from that manufacturer anymore. All right, so, more options for persistence. We saw SSDs and RAID. SSDs are more like RAM, except accesses are in pages and erases are in blocks; because of this block thing, where you can only write to a freshly erased page and you can only erase whole groups of pages, the kernel and the SSD kind of have to work together using something like trim. And we saw RAID to tolerate failures and improve performance using multiple disks. So that's it for today, and just remember, I'm pulling for you, 
we're all in this together
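The parity trick from the RAID 4/5 discussion can be sketched in a few lines. This is just an illustration with made-up bit patterns and a helper name I chose, not a real RAID implementation:

```python
from functools import reduce

def xor_parity(blocks):
    """Parity is the bitwise XOR of all the blocks together."""
    return reduce(lambda a, b: a ^ b, blocks)

# Three data disks holding a few bits each (values chosen arbitrarily)
disk0, disk1, disk2 = 0b1011, 0b0100, 0b0010
p = xor_parity([disk0, disk1, disk2])      # stored on the parity disk

# Disk 0 dies: XOR the survivors with the parity to rebuild its contents
rebuilt = xor_parity([disk1, disk2, p])
assert rebuilt == disk0
```

This also shows why one parity disk covers any single failure: XOR-ing the parity back in cancels every surviving disk's contribution, leaving exactly the missing disk's bits, which is the same even/odd argument from the lecture done a whole word at a time.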