All right, good morning. So hopefully quiz three went all right. Today we are continuing our journey into persistence and talking about something more modern: SSDs and RAID. So solid-state devices. Most people have probably never even seen a hard drive; SSDs are the more modern way to do storage, but they have their own little quirks that make them difficult to program for, and the transition was kind of odd. They use transistors, like RAM, to store data rather than magnetic platters, so they don't have the problem where you can take a magnet and actually corrupt them. They have no moving physical parts or the physical limitations that come with them. They have higher throughput, and they have good random access, unlike what we saw before with hard disk drives, which really, really want sequential access; on a hard drive, random access makes performance orders of magnitude worse. They're more energy efficient and have better space density. The cons are that they're more expensive and have lower endurance: with SSDs and other flash devices, there's a limited number of times you can write to them before they wear out, can no longer hold a charge, and become unusable. So if you want to optimize for SSDs, one thing you need to do is minimize the number of writes where you can, because that is the limiting factor in how long the device lasts. They're also more complicated to write drivers for, and that comes from how they're constructed, which we'll see now. If you rip open an SSD and look at the package, this is what it looks like. We can ignore the die in the yellow and the plane in the brown; for programming purposes, we only have to care about blocks and pages. Thankfully the pages, in green, are typically four kilobytes, so they're the same page size we already use with virtual memory.
And then for the blocks, all you need to know is that they contain multiple pages, and we'll see why we actually have to care about that in the next few slides. But just to give you an estimate of typical speeds, since SSDs are much, much faster: pages, again, are typically four kilobytes. Reading from a hard drive was on the order of milliseconds; for SSDs, things are on the order of microseconds. Reading a page is about 10 microseconds, writing a page is about 100 microseconds, and erasing a block, which is why we actually care about the layout, is about one millisecond. And why do we care about erasing a block? Well, that's the only way to erase data on an SSD. You can't erase individual pages; you have to erase a block at a time, which is the complexity in writing drivers for these. You have to understand that and track things so you know, hey, this whole block is free, and if you have to modify one page, you're going to have to move everything to a new block, write the unmodified pages there as well, then erase the whole old block and update all the references. So you can only erase in blocks. The other rules are that you can only read complete pages and only write to freshly erased pages. In order to write to a page, it has to have been erased, so you can't overwrite a page without doing an erase in between, which is kind of annoying if you're writing a driver for this or trying to use it. Erasing is only done at the block level, and a block will typically have something like 128 or 256 pages. And again, remember that the entire block needs to be erased before you can write, and then you can only write once to each page. Don't ask me exactly why that is; that is just how the hardware is designed.
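Those rules are easy to state but easy to forget, so here is a minimal sketch of them in Python. This is a hypothetical model, not a real driver interface: the `Block` class and its methods are made up for illustration, and the only thing it captures is that a page can be programmed exactly once until its entire block is erased.

```python
# Hypothetical model of an SSD block: pages can be written exactly once
# after the whole block is erased; overwriting in place is not allowed.

PAGES_PER_BLOCK = 128

class Block:
    def __init__(self):
        # None = erased (writable); anything else = programmed (read-only)
        self.pages = [None] * PAGES_PER_BLOCK

    def erase(self):
        """Erasing works only at block granularity (~1 ms on real hardware)."""
        self.pages = [None] * PAGES_PER_BLOCK

    def write(self, page_index, data):
        """A page may be programmed exactly once after an erase (~100 us)."""
        if self.pages[page_index] is not None:
            raise ValueError("page already programmed; erase the whole block first")
        self.pages[page_index] = data

    def read(self, page_index):
        """Reads are page-granularity and fast (~10 us)."""
        return self.pages[page_index]

b = Block()
b.write(0, b"hello")
try:
    b.write(0, b"overwrite")   # in-place overwrite is not allowed
except ValueError:
    pass
b.erase()                      # wipes all 128 pages, not just page 0
b.write(0, b"overwrite")       # now the write succeeds
```

Notice that the only way to "fix" page 0 was to erase all 128 pages, which is exactly why modifying one page forces the driver to relocate everything else in the block first.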
And then writing in general is slow, because in the general case you might have to find a new block, or freshly erase a block, or copy everything you still want from an old block over to a new block. That's going to be much, much slower than a normal hard drive, where you can just read and write at will without caring about the actual structure of the disk or where the sectors are. So SSDs need the kernel's, or the operating system's, help to speed up access. SSDs actually have some type of garbage collection on them: they want to move any pages that are still live to a new block, and there's some overhead to that. So the drive defragments itself a little bit behind the scenes. It will try to move live pages to freshly erased blocks, removing the fragmentation and fitting everything on one block instead of being spread out over multiple blocks. But the disk controller might not even know which pages are still live. The SSD might think the disk is full when a file has been deleted, and deleted does not mean erased. To delete a file, you can just remove all the references to it; you don't actually have to erase the pages or the block. You can let the data sit there and let the hardware eventually figure out that nothing is using the pages on a block anymore, at which point the block can be erased. So there's a lot of bookkeeping associated with this, especially since the hardware might only care about blocks and not about how the pages on a block are used. This is a weird kind of fragmentation: a block might have pages that are actually free, but the disk doesn't know whether they're free or not and therefore can't do anything about it.
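The garbage collection idea, copying the surviving pages out of half-dead blocks so the old blocks can be erased wholesale, can be sketched in a few lines. This is a hypothetical toy, not any real flash translation layer; the function name, the tiny block size, and the `None`-means-dead convention are all made up for illustration.

```python
# Hypothetical sketch of SSD garbage collection: live pages from
# partially-dead blocks are packed into fresh blocks, and every old
# block can then be erased in one shot.

def garbage_collect(blocks):
    """blocks: list of blocks, each a list of pages; None marks a dead page.
    Returns (compacted_blocks, number_of_blocks_freed_for_erase)."""
    PAGES_PER_BLOCK = 4                      # tiny for illustration
    live = [p for blk in blocks for p in blk if p is not None]
    # Pack live pages into as few fresh blocks as possible.
    compacted = [live[i:i + PAGES_PER_BLOCK]
                 for i in range(0, len(live), PAGES_PER_BLOCK)]
    erased = len(blocks)                     # every old block can now be erased
    return compacted, erased

# Two blocks, each half dead: after GC all live data fits in one block.
old = [["a", None, "b", None], [None, "c", None, "d"]]
new, erased = garbage_collect(old)
print(new)      # [['a', 'b', 'c', 'd']]
print(erased)   # 2
```

The cost that makes writes slow is hiding in the list comprehension: every live page gets copied before anything can be erased.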
So with typical deletion, you just delete the reference to the file, but the device has no way of knowing whether a given page on a block holds a deleted file or is still in use. So there's something called the TRIM command that informs the SSD when a block is unused, because otherwise it would not be able to figure that out just from the pages. Yeah, so the question is: if I download a big ten-gigabyte game and delete it, does it actually get erased on the SSD? The answer is probably not right away, because that would be somewhat slow. The file system marks all those pages as deleted, and with the TRIM command it can tell the SSD that, hey, these blocks are no longer used; the SSD is then free to erase those blocks in the background, typically on its own time when it's idle and not doing anything else. And because the information is still there, this is why you can actually recover data: if you delete files, someone who is really, really good at data recovery can still get them back, because deletion just removes the reference and doesn't delete the data. This is how data recovery works in general. Even though the operating system is no longer pointing at the data, if you know the on-disk format you can probe around and see all the files that were deleted. So all the stuff you think you have deleted might not actually be deleted.
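The gap between "the file system deleted it" and "the device knows it's dead" is the whole reason TRIM exists, and it can be shown in a tiny sketch. Everything here is hypothetical: the `Device`, `FileSystem`, and `trim` names are invented for illustration and do not correspond to any real API.

```python
# Hypothetical sketch of why TRIM exists: the filesystem deletes a file by
# dropping its reference, but only TRIM tells the device the pages are dead.

class Device:
    def __init__(self):
        self.live_pages = set()

    def write(self, page):
        self.live_pages.add(page)

    def trim(self, pages):
        # The OS says: these pages no longer hold useful data.
        self.live_pages -= set(pages)

class FileSystem:
    def __init__(self, device):
        self.device = device
        self.files = {}                    # name -> list of page numbers

    def create(self, name, pages):
        self.files[name] = pages
        for p in pages:
            self.device.write(p)

    def delete(self, name, send_trim=False):
        pages = self.files.pop(name)       # just drop the reference (fast)
        if send_trim:
            self.device.trim(pages)        # let the SSD garbage-collect later

dev = Device()
fs = FileSystem(dev)
fs.create("game.bin", [0, 1, 2, 3])
fs.delete("game.bin")                      # no TRIM sent...
assert dev.live_pages == {0, 1, 2, 3}      # ...so the device thinks they're live
```

Without the TRIM step, the device's view and the file system's view permanently disagree, which is exactly the "disk looks full" problem from before.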
An explicit version of this is the Recycle Bin on Windows or the Trash on a Mac. That basically keeps one reference to the file alive so it doesn't actually get deleted on disk, and when you empty the trash, it just removes that one reference; at that point the data may or may not actually be erased on the hardware. That's also why recovering a file from the trash is really, really fast: it just recreates the reference pointing to the file in whatever original location you deleted it from. And we'll see more of that when we get to file systems. But yeah, TRIM. You might see this; it was a bigger deal when SSDs were new, but it's just the operating system's way of informing the SSD that, hey, this block is unused, you can actually erase it. So far we've been talking about single devices, so now we can get into more exotic things. A single device is sometimes jokingly called a SLED, a single large expensive disk. If you need more storage, one thing you can do is just buy a single bigger drive and fit everything on it, but it's kind of a mocking term, because that single drive is a single point of failure. So one thing you will typically see, and something you can set up yourself if you really, really care about your data, is called a RAID, a redundant array of independent disks. That just means a bunch of disks linked together in some way to improve reliability or performance over a single disk.
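To make "linked together in some way" concrete, here is a sketch of the simplest linking scheme, striping, where chunk i of a file lands on disk i mod n. This is hypothetical illustration code, not any real RAID implementation; the function name is made up.

```python
# Hypothetical sketch of striped (RAID 0 style) chunk placement:
# chunk i of a file lands on disk i % n, so large reads and writes
# spread evenly across all n disks.

def raid0_layout(num_chunks, num_disks):
    """Return {disk: [chunk indices]} for a striped volume."""
    layout = {d: [] for d in range(num_disks)}
    for chunk in range(num_chunks):
        layout[chunk % num_disks].append(chunk)
    return layout

# 8 chunks over 2 disks: even chunks on disk 0, odd chunks on disk 1.
print(raid0_layout(8, 2))
# {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
```

Because each disk holds a disjoint share of the chunks, every disk can transfer its share in parallel, which is where the n-times speedup comes from, and also why losing any one disk destroys every large file.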
So instead of a single disk with all your data, the data can now be distributed across multiple disks, and there are different ways to accomplish this. You can use redundancy to prevent data loss, so your data is in more than one place, or you can use redundancy to increase throughput: if you have two copies of a file, each on its own disk, I can read half of it from one disk and half from the other and roughly double my performance, at least for reading. So, the first thing you might encounter... has anyone ever heard of RAID before? Okay, some people have. When you actually start caring about your data, you might use one of these things, so they're good to know. RAID 0 is also called a striped volume, and it is purely for performance. Data stripes will typically be something like 128 or 256 kilobytes, distributed over the disks; those are the sizes of the chunks that go to each disk. So if you had a big file split into eight of those chunks, what RAID 0 would do with two disks is essentially put all the even-numbered chunks on one disk and all the odd-numbered chunks on the other. This way, your read and write performance are both doubled. If I'm writing out all eight chunks, I can send four of them to one drive and four to the other, so I've doubled my write performance; and if I want to read that file, I read four chunks from one drive and four chunks from the other, so I've doubled my read performance as well. And this can scale up to as many disks as you'd like: three disks, four disks, five disks, however many you have. So this gives faster, parallel access; both reads and writes are essentially n times faster, where n is the number of disks. But where before, with a single large disk, we had a single point of failure, well,
now we have multiple points of failure. If we had our eight chunks of the file here and we lost one of those drives, it doesn't matter which one, we'd lose half the file and it would no longer be usable. If it was a video, it would just be corrupted; anything would just be corrupted, missing half its data. And it gets worse if the file is distributed over four disks: any disk that goes down corrupts the file, because now you're missing data. All right, any questions about RAID 0? It's typically only used for performance; you're essentially trading more points of failure for performance. So, RAID 1. If you get into data hoarding, or you actually care about your data, this is kind of the cheapest and simplest way to go. You can think of it as creating a backup for you automatically: it takes two disks and makes them exact copies of each other. So if that file is four chunks, the exact same four chunks would be on the first drive and the second drive. Now we get double the read performance in this case, but writing is just as slow as a single disk, because we have to write the data to both disks. But any file you change, any file you create, any time you save, the data is going to be on both disks; if one disk dies, you don't lose anything, because the other disk is an exact copy, and it's maintained like that, so you don't have to worry about it as much. Of course, if both disks die, you lose your data. So it's pretty simple, but it can be wasteful compared to other techniques, since every disk has a mirrored copy of all the data. And like RAID 0, this can scale up: I could have RAID 1 across multiple disks, say four disks that are all copies of each other. Your read performance would be four times as good, but your write performance is going to be the same as a single
disk. The benefit is that you can now lose up to three of your four disks; as long as one survives, you still have your data. So it has good reliability, again, if even one disk remains you still have your data, and good read performance, but there's a really high cost for the redundancy, since everything is an exact copy. We can actually do better than this, which we'll see next; and again, the con is the write performance, because we have to write all the data to all the disks. So the first scheme that introduces some type of parity is RAID 4. Parity is just a little bit of extra information that can be used to reconstruct other bits of information. The data stripes are distributed over the disks, and there's one disk dedicated to parity. The simplest parity scheme is that the parity bit is the XOR of the corresponding bits on all the other drives, so that if one dies, it can be reconstructed. What that looks like, for every bit position: say we have disks one, two, three and then our parity disk. The parity for bit seven is a calculation across all of the data disks, the XOR of bit seven on every disk, and you can think of XOR this way: the parity is a one if the data bits add up to an odd number of ones, and a zero if they add up to an even number. So say bit seven on disks one, two, and three is one, one, zero: that's an even number of ones, so our parity is zero. For the next bit, the disks hold zero, one, zero, an odd number of ones, so our parity is one. And for the last bit, they hold one, one, one, also an odd number, so our parity is one. And the idea behind that is, if we lose one of these disks, so
let's lose the first row, disk one, and just remember it was one, zero, one. If we lose that and have no idea what it was, we can use the parity bits to reconstruct the values. For bit seven: on disk two it's a one, on disk three it's a zero, and the parity bit says the total across all the disks, including disk one, is supposed to be even. So if the two remaining disks sum to odd, the bit I lost has to be a one. That's what you do to recover your data: if I lose a disk, I look at my parity bit and the bits that remain, and I reconstruct the missing one. Yeah, so in this case, if we lose two disks, it's all over; I essentially have two unknowns and can't do anything about that. If you've taken linear algebra, you'll recognize the problem of solving for two unknowns; schemes that can reconstruct two missing disks need that kind of math, and we won't go into the details, but you can look it up if you want to know how it works. In this case, though, I can reconstruct everything. The first bit would be a one. For bit six, disk two is a one and disk three is a zero, so those together are odd, and the parity says the total is odd, so the bit I'm missing has to be a zero, because otherwise it would change the result. And same for the last bit: my two surviving disks together are even, and the parity says the total is odd, so the bit I'm missing is a one. So that is, at a basic level, how the parity calculation works for one level of parity. Of course, if two disks fail, it's all over, but we can reconstruct the information if a single disk fails. The other problem with this scheme is that the wear is going to be very uneven, because no matter what modification you make, even if it only touches a single drive, every single modification is
going to have to go through disk three, which is holding all the parity information. The load is going to be very, very unevenly distributed: disk three gets a disproportionate amount of it, because it has to store the parity across all the disks, and even if you change just one disk, you have to recompute the parity. So if I change chunk A on disk zero, B on disk one, and C on disk two, each of those disks writes a single chunk, while the parity disk has to write all three parity chunks covering them. It's going to have way more writes happening to it than the other disks. But that illustrates what a parity drive does and what parity information is. With parity, you essentially lose a single drive's worth of space to the parity calculations, and you need at least three drives for this to work. The pro is that you get n minus one times the read performance, since you basically take the parity disk out of the calculation and read across the n minus one data disks. Writes have a bit more overhead, because every write has to go through the parity disk as well. But you can replace a single failed disk and rebuild, and you get most of the space of all the disks combined, while having some performance benefit and being able to tolerate one drive failure. So write performance can suffer a bit, because every write, no matter what, has to go to that parity disk. RAID 4 is essentially never used because of that issue, and RAID 5 is the same idea, except the parity is distributed across all of the disks instead of living on one dedicated disk. In this example, disk zero has the parity calculation for stripe D, disk one has it for C, disk two has it for B, and disk three has it for A. So the parity is more evenly distributed across all the disks, but otherwise it performs exactly like RAID 4. Yeah, so the question is: is there a RAID 2 and 3,
because we skipped the numbers. There is a RAID 2 and a RAID 3, but they're really, really weird, so we won't talk about them. The ones that are actually used are 0, 1, and 5; then we'll see 6, and then a combination of 1 and 0. But yeah, RAID 2 and 3 were essentially mistakes. So RAID 5 is what gets used, and it has all the same pros as RAID 4 in terms of space; all of our calculations are the same. The only difference is that the write performance is improved, because we no longer have the bottleneck of the single parity drive and everything is spread more evenly between the disks. But with this, too, you can only tolerate one drive failure. I remember in grad school I had a RAID 5, and one of the disks failed, and I thought, oh yeah, no problem, I can rebuild it and save my data. But I didn't act fast enough. These drives typically fail in pairs, or generally around the same time, so I lost another disk while the rebuild was going, and I lost some stuff. If you buy the drives at the same time, the wear on all of them is about the same, so they'll kind of wear out at the same time. So if you really care, you shouldn't buy them all at once from the same batch; you don't want them to die together. For performance reasons they should all perform about the same, since you're going to be bottlenecked by the slowest one anyway, but if you really care, you want to buy them from different batches; go to another store, and hopefully they came from different batches. Usually the batch they came from is printed on them. But yeah, typically if one dies, the others are going to die close behind, so don't do what I did: if a drive dies, go out and replace it, because otherwise you're gonna have a
bad time. But if you want to live on the edge like I did and have a bit more redundancy, that's what RAID 6 is. RAID 6 just adds another parity bit per stripe: in addition to the original parity, P, which is the XOR, there's a second parity, something like Q. Understanding how to calculate the Q parity so that it's different from a plain XOR is where the scope of this course ends; you can look it up on your own if you're really interested, and it ties into your linear algebra courses. For all we care about, RAID 6 can simply recover from two simultaneous drive failures. But because of the extra parity, we have less available space: essentially two disks are used for parity, so the amount of space we have is two disks smaller than if all of them were in a RAID 0. It requires at least four drives, and the write performance is also going to be a little slower than RAID 5, because there's an extra parity calculation that's a bit more complicated than a straight XOR. There's also an option to do something called RAID 10, which is just a RAID 1 plus a RAID 0. A RAID 10 stripes your data across half the disks and then keeps a mirrored copy of everything on the other half. So let's draw out what that looks like. For a RAID 10, which is sometimes called RAID 1+0, you need at least four disks. Say we have disk 0 and disk 1: your data would be striped across those two disks, so you'd have chunks a0, a1, a2, and so on distributed across them, like with our RAID 0 before. Then we would use the other disks as a RAID 1, which is a complete mirror of each
other. So there are kind of two dimensions to this: across the mirror, the disks are exact copies of each other, and within each half, the data is distributed RAID 0 style across the disks. So you will lose half your capacity to redundancy, because we have a RAID 1 in the mix, but we don't get all of its drawbacks, because we also use striping between the disks. In this case, the number of drive failures you can tolerate depends: there's a minimum and a maximum. In this scenario, let's add a few more disks and rename them. Say we stripe across three disks as our RAID 0, and then use another three disks as an exact mirror of that RAID 0. So disk 3 stores a0 and a3, exactly like disk 0: disk 3 is a mirror of disk 0. Disk 4 is a mirror of disk 1, so it contains the a1 and a4 chunks, and disk 5 contains the a2 and a5 chunks, because it's a mirror of disk 2. So now, how many disk failures could we tolerate? What is the minimum and maximum number of failures we could actually recover from? Yeah, so the minimum is 2, and why is that?
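Before reasoning it out, one way to check the answer is to brute-force it. This is a hypothetical sketch assuming the six-disk layout just described, where disk i and disk i + 3 are a mirrored pair; the function name and layout encoding are made up for illustration.

```python
# Hypothetical check of RAID 10 failure tolerance with 6 disks:
# disks (0,3), (1,4), (2,5) are mirrored pairs, and data survives
# as long as no pair loses BOTH of its members.

from itertools import combinations

NUM_PAIRS = 3   # 6 disks total

def survives(failed_disks):
    """Data is intact unless some mirrored pair has both disks failed."""
    for pair in range(NUM_PAIRS):
        if pair in failed_disks and pair + NUM_PAIRS in failed_disks:
            return False
    return True

# Worst case: two failures kill us if they hit the same pair...
assert not survives({0, 3})
# ...but we can lose one disk from every pair and still be fine.
assert survives({0, 4, 2})

# Of all 15 two-disk failure combinations, how many lose data?
bad = sum(not survives(set(c)) for c in combinations(range(6), 2))
print(bad, "of 15 two-disk failures lose data")   # 3 of 15
```

So the minimum is two failures (an unlucky pair), while a lucky run can survive three, which is exactly the case walked through next.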
Well, I could get really unlucky. Each of the disks on one side has exactly one mirror. Say disk 0 fails; if I get really unlucky and its mirror, disk 3, also fails, then a0 is gone and a3 is gone, because the only copies were on that pair. So if the wrong two disks fail, we're screwed. That's our unlucky case. Our lucky case: say disk 0 fails and then disk 4 fails instead. Now we have two disk failures and we're still okay, because we have at least one copy of every single chunk; we just lost our mirrored copy of a1 and a4, but that's okay, because at least one copy of each survives. And a third disk failure could happen and we could still be okay, as long as we're lucky. But if you assume failures happen uniformly at random, we're not playing very good odds; at that point we essentially have a 50-50 chance of completely failing or not. With disks 0 and 4 dead, if disk 2 or disk 5 fails next, you're still fine, because you have at least one a2 and one a5 left. But if disk 3 dies, we no longer have any a0 or a3, and if disk 1 died instead, we'd lose a1 and a4 the same way. So this has a minimum and a maximum: depending on your luck, the minimum is always going to be two, because I could lose a disk and its mirror, but the more drives you add like this, the higher the probability that you survive two failures, because they might hit different pairs. If you lose both disks in a pair, you lose data, but you can lose one disk from multiple pairs and still be okay. Another reason some people prefer this setup: if one drive dies and you have to recreate it from parity information, you have to do a lot of XOR calculations, and that takes time. Typically rebuilding a disk, if it's several
terabytes, will take on the order of days. With this scheme, there's no recalculation or anything if a drive dies: to reconstruct the new disk, you just straight up copy the mirror that survives. No calculation, no nothing, just a copy as fast as the drives can go, so you can reconstruct it in a matter of hours instead of days, which is a lot more palatable if you're running something very important. Especially because, if the rebuild takes days and you bought the drives together and they fail together, you might lose another drive while reconstructing the data, and then you might be screwed. So sometimes you want the rebuild to be as quick as possible. Any questions about RAID stuff? Okay, cool. So we saw some more topics today: SSDs and RAID. SSDs are more like RAM, except that they're accessed in pages and organized in blocks, and they have some weird rules: you can only write to a freshly erased page, you can only write to each page exactly once, and you can only erase data a block at a time. Because of this, SSDs need to work with the kernel to get the best performance; you need something like TRIM, which basically informs the drive that, hey, this whole block isn't actually used, even though it looks to you like it might be. And we saw RAID, the redundant array of independent disks, which uses multiple disks to tolerate failures and improve performance. There are a lot of ways to arrange that, and a lot more exotic file systems and solutions out there, but the major ones are RAID 0, 1, 5, 6, and then RAID 10, which, as we saw, is just a combination of RAID 1 and RAID 0. So with that, just remember: pulling for you, we're all in this together.
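As a recap of the parity idea from RAID 4 and 5, the XOR calculation and the single-disk reconstruction can be sketched in a few lines. The helper names are hypothetical and the "disks" here are single bytes, but the math is the same one the lecture worked through bit by bit.

```python
# Sketch of RAID 4/5 parity: the parity byte is the XOR of the data bytes,
# and XOR-ing the parity with the survivors rebuilds a single lost disk.

from functools import reduce

def parity(stripes):
    """XOR all data bytes together to get the parity byte."""
    return reduce(lambda a, b: a ^ b, stripes)

def reconstruct(survivors, parity_byte):
    """Recover the one missing byte: XOR parity with everything left."""
    return reduce(lambda a, b: a ^ b, survivors, parity_byte)

d0, d1, d2 = 0b1100_1011, 0b0110_0001, 0b1010_1110
p = parity([d0, d1, d2])

# Lose disk 1 and rebuild it from the other disks plus parity.
assert reconstruct([d0, d2], p) == d1
# The same works for any single disk...
assert reconstruct([d1, d2], p) == d0
# ...but two failures leave two unknowns, which plain XOR cannot solve.
```

This also shows why RAID 6 needs something fancier than a second XOR for its Q parity: a second identical XOR would add no new information about two unknowns.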