I don't like coming to class in the rain. OK, today we're going to talk about RAID. Many people are familiar with RAID in some way. I suspect some of you had set up a RAID array, or at least have one as part of one of your machines, like a non-RAID-0 array, even before you read the material for this class. OK, cool. Yeah, so this is cool technology, and it's kind of a neat way to finish off our discussion of storage. So that's what we're going to do today. And then next week we're going to talk about virtualization. As announced, there will be no class on Friday; given the number of people here today, that feels like a good move. So we have some extra time to finish up assignment 3.2. And then next week we will do a series of lectures on virtualization, which will be fun. All right, I think I'm preaching to the choir at this point. I suspect that if you're here, you feel reasonably confident about your 3.2, but it is due on Friday. So we'll have, as announced, extra office hours on Friday to help people with last-minute bugs and problems. And then, no rest for the weary: there are two weeks left after that for assignment 3.3. You've probably done this math already, but assignments 3.2 and 3.3 together represent half of your assignment grade. That is intentional; they are the big parts of the course. And the final exam is another big piece of your grade. So if you were a little nervous about your midterm grade, something like 50% of the points for the class are still outstanding, still up for grabs. Before we start talking about RAID, I want to spend about 10 minutes introducing you to how to read a research paper, because as part of the virtualization week we'll look at the Xen paper, which is another classic. These are papers that spawned entire industries. So RAID, Xen, and then we'll probably add something fun at the end in the last week; I don't know what it will be yet. Maybe I should start with why. We talked a little bit about this at the end of the last class. There are still systems where the best way to learn about them is to read a paper about them. You can read things on Wikipedia, and Wikipedia is a good way to get a sense of what something is. But Wikipedia is an encyclopedia, and the writing in Wikipedia tends to be not just communal, but written in the particular tone an encyclopedia goes for: authoritative and encyclopedic. Research papers, by contrast, tend to be persuasive. A research paper will not only try to tell you about something, but also convince you that that design choice, or that way of building a system, or that system itself is a good idea. So it's just a different tone than reading a Wikipedia page. Again, I'm not saying avoid the Wikipedia page; certainly use Google and things like that to find out about technologies. A lot of the stuff we'll talk about is really well documented online. But reading research papers gives you a different sense, and it exposes you to people who are making the case for something to a skeptical audience. So you can read research papers. They're written in English, and the good ones tend to be pretty well written. Now, I don't know, does anyone read anymore? Do people still read? OK, good. Just making sure. I'm sure someday I'm going to ask that question and people will be like, it's all on YouTube, just watch the YouTube video.
So research papers are just another thing you can read. The difference is that you don't necessarily want to read everything; you don't want to read every paper. So there are approaches to choosing which papers to read, and approaches to reading a research paper. You don't want to read it like a novel, end to end. And just like when you choose literature, you probably don't want to read bad stuff. So the first thing to do is pick good papers to read. How many people had heard of Xen before this class? So if you've heard of a technology that's been commercialized, and there's a paper about that technology from 20 years ago, that's probably a good paper to read, because that paper, again, spawned an entire industry. Those are cool papers to read. Also, in the field of computer science there are venues that are very selective and prestigious, and there are venues that are less so. If you pick the top conferences and venues, you will find the top people in the field doing some of the most interesting work. Now, again, you don't want to take a research paper and just start at the beginning and read all the way through. That's a good way to waste some time, if that's what you want to do. What you want to do is jump around a little bit, and, depending on what you're trying to do, pick up the important bits. If you go to work at a company and someone hands you a paper and says, you need to implement this for me, that's one style of reading. It's different if you're just trying to get a sense of the big design goal, the big idea of a particular system, and how you can take that idea and apply it somewhere else. In a lot of cases, who cares about the details of how stuff is actually built? What we care about are the ideas. What's new about this system? What new design principle are they introducing? What new take on an old problem are they trying to establish? So: where papers get published, and the kinds of papers. These are the things that help you choose papers to read and figure out how to approach them. Where do papers get published? There are things called workshops, and workshop papers tend to be short. For example, if you decide you like operating systems, there's a workshop held every couple of years called Hot Topics in Operating Systems, or HotOS. If you Google HotOS, well, Google will try to autocorrect it to "photos," so don't click on that; you can imagine what you'd find. But if you Google HotOS, you will find websites and links to papers. These are typically fun because they're five to six pages, so they're short compared with conference papers. And workshop papers are written with a different style and tone. These are hot topics, so they're frequently intended to be provocative, intended to be new. You can read one and get a new, interesting take on an old problem. For example, about 10 years ago, Margo Seltzer and one of her students wrote a paper for HotOS called "Hierarchical File Systems Are Dead." This is a classic HotOS title, by the way: something-something is dead, we shouldn't be doing this anymore.
And if you read this paper, they make the argument that these hierarchical file systems with hierarchical namespaces that we've been talking about are no longer relevant, not something we need to keep designing today, because people don't use file systems that way. So that's an example of a workshop paper. Conference papers are longer and more complete. Where a workshop paper can be a little fuzzy around the edges and provocative, conference papers are supposed to present a complete piece of work, really buttoned up, with everything tied down. When you read a workshop paper, you might get to the end and think, I'm not sure this works. If you get to the end of a conference paper and think, I'm not sure this works, that's not a good paper. Their job is to convince you. Here's an example: something called the Scalable Commutativity Rule. This is actually a neat paper because it talks about how to design operating system interfaces, which you don't necessarily get from the title. One of the findings in this paper is that if you make some small changes to the POSIX operating system interface, you can get much better performance, because there are hidden dependencies between different system calls, and if you remove those, multiple system calls from the same process or from multiple processes can work together in parallel more effectively. So this is pretty cool. And then journal papers are these long archival things that, I don't know, I wouldn't bother with. If you find a really long paper in a journal, usually what happened is that the authors decided to go back and put back all the stuff they cut from the conference paper to get it under the page limit. That's usually not the stuff you want to read. It's the dregs of the paper: the boring parts, the details that didn't make the cut. So don't read that. All right, so, kinds of papers. We'll talk about the original RAID paper today, despite the fact that I know I didn't assign it; I'll pull in a few little quotes, but most of what we're doing today is stuff you can pick up from the Wikipedia page. This is an incomplete taxonomy, but there are a variety of types of research papers, and a variety of types of new systems, I would argue. Big idea papers present some novel insight about how to build systems. Something new has entered the world: people have made a new observation, and they think it's useful for other people to know about. Problem papers try to convince you that a particular problem is really important. There's something that doesn't work about current systems, and the community should work on that problem and fix it. Usually a problem paper is written because people feel a particular problem isn't getting enough attention, or there's some cool solution they want people to consider. Data papers are written when people collect a big data set and then use that data set to provide interesting observations about the system they built. For example, Google from time to time will collect logs from some of their systems, do some analysis to understand how those systems work, and then share the results with the community. New technology papers: as technology changes, there are new things we're trying to figure out how to fit into operating systems, and into computer system design in general.
So when a new technology emerges, people frequently publish a variety of papers saying, hey, here's a cool way to use this new thing. Does anyone know what an FPGA is? People use those? Yeah, programmable hardware. This is something the systems community has been thinking about increasingly for the past couple of decades. GPUs: anyone know what a GPU is? Yeah, again, same thing. There are still research papers coming out where people say, oh, I can do this cool new thing with GPUs. So a new piece of hardware comes along, gets integrated into a variety of different systems, and then we spend some time figuring out how to take advantage of it. Wrong-way papers are kind of fun. These are of the "X is dead" variety: the argument that something the community has been doing, some way we've been solving a problem, is broken, and we shouldn't do things that way anymore. And over the two weeks left in the semester, we'll read a couple of papers drawn from these different categories. We don't have enough time to read one of each. OK, so: parts of a research paper. Now you've got this thing in front of you, or on your display, and again, do not read it front to back. Just don't do it. No one does this. It's a bad idea. The most useful part of the paper to a broader audience, or to somebody who's just curious about the system, is frequently the beginning. The beginning is where the authors convince you that the paper is good and interesting. A lot of people, when they're reading research papers, will read the first page or two and then think, I don't know, I don't want to keep reading, this seems boring. So the first couple of pages are where the authors have to make the case that this is an interesting problem, an interesting system, that there's something cool here. And so this is frequently the part of the paper with the best writing, the best arguments, the most persuasion. Then you get to the middle, which is a variety of different sections usually designed to describe what was actually done. Not why we did it, or what problem we solved, but what did we actually do? Depending on the kind of paper, you can have some mixture of these sections. If there's an actual system, they might tell you a bit about how it was built, the implementation part, and about how the system was designed: here were the design goals for this system. If it's a data paper, there's a results section where they say, OK, we took this data set and here are the interesting things we found. And then the evaluation. The evaluation is really important for systems papers, because usually what I'm doing is trying to convince you to do things in a new way, and I need to show you that that way is better. So I evaluate the system, and how the evaluation is done really matters to people who work in the area, because a good evaluation will convince me that what you did is correct. A bad evaluation says nothing about the paper, and there are a variety of ways to produce bad evaluations. OK, that was the 10-minute version of how to read a research paper.
Any questions about this before we talk about RAID? Man, you guys are bored today, I can tell. OK, it'll get better, I promise. RAID's pretty cool. OK, RAID. So what kind of paper is this? When was this paper written? 1988, a long time ago. So, RAID. If you had to pick what kind of paper this is: is it a data analysis paper? Is it a paper that claims we're doing things the wrong way? What kind of paper is this? Yeah, OK, kind of a new technology paper. What's the new technology? OK, yeah, so there's a way to build something new based on the things we already have. I consider this to be a big idea paper. What is the big idea here? One way you can identify a big idea paper is when it spawns an entire industry. When you all know this acronym, clearly something good happened after the authors wrote this paper. David Patterson is one of the authors of this paper, and he's also responsible for RISC. So that's pretty good. If you come up with two acronyms in your life that everybody knows, you can die happy, right? Those will be on his tombstone. So what's the big idea here? The fascinating thing for me about this paper is not the disk parts. The disk parts are gory and kind of sad and a function of the time period; they just happened to be trying to build something out of disks. There's a much bigger, more interesting idea embedded in this paper, one that has spread all over the place into all kinds of different computer systems. Yeah. OK, so redundancy is part of it, yeah. OK, so I'm trying to use multiple things together. Yeah. We're getting there; replication is a feature that's required to get redundancy in these types of systems. But, and I know you didn't read the actual paper, in the actual paper there's this long description at the beginning of the types of disk drives that were available at the time. They had two choices. They had something the paper refers to as a SLED: a Single Large Expensive Disk. At the time, if you wanted to drop a lot of coin, you could get a really, really nice hard disk with really good performance. OK, RAID is not about the SLED. What is RAID about? Yeah. Yeah, several cheap things properly working together. This is the critical part. This is where we get to build some cool stuff. It is possible to build software that unites several cheap things into one really functional thing, one thing that is way more powerful and better than anything I could just spend money on, because I can actually use cheap disks to build an array that outperforms even the most expensive disk drive I could buy. Where else do you see this idea? So clearly disks, which, again, are kind of this old technology that no one cares about anymore, but where do you see this applied in places you might care about? Yeah, so, Google, OK? Google is the epitome of this. Now Google's data centers are full of these custom-built 1U rack servers. But there was a period of time, when Google was starting out, when they didn't buy servers like that. They built data centers with a similar architecture to what they use today for making use of lots and lots of machines. But what were the machines that they bought? Does anyone know? Yeah.
Yeah, they bought cheap Dell PCs. They did some analysis and realized that was the best price point for them, so they would buy thousands of Dell PCs, put them in a data center, and run software, variants of which they're still using today, that would make those machines work together and essentially act as one incredibly powerful supercomputer. Where else? Yeah. Yeah, lots of mediocre developers, right? I'm serious. Lots of mediocre developers can build great things if they work carefully together. Again, this is about building systems. You don't get this just by combining cheap crap together. A McDonald's hamburger is still a McDonald's hamburger; just because you took a couple of cheap things and put them together into an attractive package doesn't mean it magically becomes something else. There has to be a system for transforming it. But working together, with appropriate systems in place and appropriate tools, lots of developers can certainly build software that nobody could build alone. Linux is an example of this. No one person could have built Linux in the time period it was developed. It took a lot of people, and none of those people understand the entire system. Maybe. I shouldn't say that; I'm going to get hate mail from the Linux people. "I understand the entire system. Of course. Every byte of it." Yeah. OK, that's fair. And there the barrier wasn't even cost, necessarily, although it could be; it's heat and other things. Yeah, absolutely: multi-core. Maybe I take eight cores that are cheaper and simpler to make, combine them together, and get better results out of that. It's even on the slide, actually, so clearly I agree with you. Crowdsourcing is even more interesting, because I have to take inputs from multiple actors I don't fully trust and somehow find out the truth about the world. I have a lot of people who are confused about stuff, but if I combine their observations carefully, I might actually be able to pin something down. OK. So, and I already gave this away: the problem that the RAID paper identifies. The RAID paper has this great introduction. David Patterson was famous, and still is famous, for writing these fantastic introductions, and here he talks about something called the emerging I/O crisis. We talked about this a little bit. CPUs are getting faster; memory is getting faster, partly for the same reason, because transistors are getting smaller and things can be divided into smaller and smaller pieces. Disks don't follow that technology curve, because disks have to physically move stuff around. At the time, disks were not keeping up, and they never have kept up with the speed increases in other parts of the system. So, quoting the paper: "While we can imagine improvements in software file systems via buffering for near-term I/O demands, we need innovation to avoid an I/O crisis." This is: if we don't solve this problem, the world of computers will end, because of this I/O crisis. So RAID is this nice approach, but the problem created by RAID, and it's the same for Google and these other solutions, is that cheap stuff breaks. Well, really, anything breaks. That was one of their observations: even the SLED, the expensive disk, breaks. But when I combine a lot of things together, the probability that something breaks goes up.
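To make "more stuff breaks faster" concrete, here's a back-of-the-envelope sketch. The 3% annual failure rate is a made-up number purely for illustration, and it assumes failures are independent, which real disks only approximate:

```python
# If one disk has a 3% chance of dying in a year (hypothetical
# number), what happens as the array grows?

p = 0.03
for n in (1, 10, 100, 1000):
    p_any = 1 - (1 - p) ** n   # P(at least one disk fails)
    print(f"{n:5d} disks -> {p_any:6.1%} chance of at least one failure")
```

At 100 disks, a failure within the year is nearly certain; at 1000, you should expect failures constantly.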
In fact, since we've been talking, there's probably a website you can go to, right? How many computers have failed in Google's data centers today? And there's just a number that goes up. I suspect a bunch have failed since I started talking, because they've got something like 10 million computers around the world in their data centers. Stuff breaks. Those computers have parts in them; some of them have spinning disks; some of those disks have died since we started talking. And the more stuff you have, the faster things break. So on some level, maintaining these big cloud data centers means dealing with the fact that, as the data centers get bigger, the hardware is always failing. Something is always breaking, going down, dying forever, needing to be replaced. And building software that allows us to effectively combine cheap things into one high-performance thing requires accounting for this. I need a plan. Any questions at this point? OK, I'm going to go through the RAID levels. One of the interesting things about the Randy Katz paper that you read is that he points out, and we'll come back to this at the end, that one of the big contributions he felt they made was unifying work that other people were already doing. When they started writing the RAID paper, they discovered, and after the paper was published people kept telling him, that a lot of people had done this stuff already. So part of what they brought was terminology, and a framework for thinking about it. And if you go back and read that original RAID paper, it proceeds with this lovely flow to it: RAID 1, RAID 2, and at every level there's an improvement, some nice analysis of what happens, and a principled way of moving forward. So, RAID 1. All the RAID levels require at least two disks; some require more. For RAID 1, I have two disks, and what do I do? RAID 1 is really easy: every block on one disk is mirrored on the other. For every block on disk 0, there's an identical block on disk 1. So what is the capacity of this array? Say each drive is a terabyte. How much can the array store? One terabyte. Say I add a third drive. How much can it store? Still one terabyte, right? Which is why you basically never see RAID 1 with more than two drives. Well, I shouldn't say never, because I'm sure some weirdo has done it, but it doesn't make much sense: I could put ten one-terabyte drives in RAID 1 and still have one terabyte. The interesting thing about RAID 1, of course: how many drives do I have to touch to read a block? One. So I do see performance improvements for reads. A minimal sketch of the mirroring logic is below.
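This is a toy sketch of the RAID 1 controller behavior just described, with a made-up in-memory Disk class standing in for a real block device. A real controller also handles failures, rebuilds, and caching:

```python
# Toy RAID 1 (mirroring). Disk is a stand-in for a block device.

class Disk:
    def __init__(self, num_blocks, block_size=512):
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]

    def read(self, i):
        return self.blocks[i]

    def write(self, i, data):
        self.blocks[i] = data

class Raid1:
    def __init__(self, d0, d1):
        self.disks = [d0, d1]
        self.turn = 0  # alternate reads between the mirrors

    def write(self, i, data):
        # A write must reach BOTH disks before it counts,
        # which is why writes slow down slightly on RAID 1.
        for d in self.disks:
            d.write(i, data)

    def read(self, i):
        # A read can be served by EITHER disk; splitting reads
        # between them is what improves read bandwidth.
        d = self.disks[self.turn]
        self.turn ^= 1
        return d.read(i)
```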
Actually, let me back up for a second and explain how RAID works. There are drives that act like the drives we've been talking about. And then somewhere in the system, whether in software (you can do software RAID using Linux and other operating systems) or on a special dedicated piece of hardware called a RAID controller, there's software that is controlling the disks, making decisions about where to write things and where to read things, and making sure the replication happens correctly. That software is referred to as the RAID controller. It can live anywhere: you can spend a thousand bucks and get a fancy hardware RAID controller that you plug a bunch of disks into, and it makes them look like one RAID array to the computer mounting it, or you can fool around with RAID by having it run in software. There are certain things RAID controllers have to do that can be done a bit more efficiently in hardware, so that's sometimes a popular option. All right: in RAID 1, the RAID controller makes sure that writes go to both disks, but reads can come from either. Now, think again about how spinning disks work. When I want to do a write, I have to wait for both heads to get to the right spot on their disks and write the data before I can move on. So writes actually slow down a bit on a RAID 1 array. Reads, on the other hand, I can split between the two disks, so read bandwidth improves, because I don't need to touch both disks for a read. OK, now let's go to RAID 2. A lot of these RAID levels have been lost to the sands of time. In practice, people do RAID 1 and people do RAID 5; there might be some RAID 3 arrays out there. RAID 6 and RAID 10 we're not going to talk about. But of the original five RAID levels, it's essentially 1 and 5 that survived, and you'll be able to see why the ones in between dropped away. OK, so in RAID 2 we introduce a new concept: error-correcting codes. These are not necessarily that complicated. The idea behind an error-correcting code is that if I take a byte of data, or some amount of data, I can add additional information that allows me to recover from failures in a disk drive. The failure model here is important: RAID 2 assumes a disk can fail by returning bogus information. So if disk 2 in the array fails and starts returning garbage data, I can both detect that disk 2 has failed and recover from that failure. There's enough extra information in the system to identify the disk that failed and recover the data correctly. RAID 2, as originally proposed, uses something called Hamming codes to do this. And at this point, both reads and writes require touching all the disks. Why do I need to read from all of the disks? Yeah, I need to check that the data I got is right. Remember, I don't trust any of these disks; they can give me garbage data, they can lie. So in order to detect whether the block I read from disk 3 is correct, I need all the other data associated with that block, which means touching a bunch of other disks. Same thing with writes: whenever I update data in a RAID 2 array, I need to update the Hamming codes, and that requires touching a bunch of other disks. The nice thing is that the Hamming code is much smaller than the data it protects. A sketch of a tiny Hamming code follows.
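Here's a minimal sketch of the classic Hamming(7,4) code, which stores 4 data bits in 7 bits and can both detect and locate any single flipped bit, the detect-and-locate property RAID 2 needs. This illustrates the coding idea; it is not the exact layout the RAID paper uses:

```python
# Hamming(7,4): 4 data bits -> 7-bit codeword. A single flipped
# bit can be located (the syndrome is its position) and fixed.

def encode(d1, d2, d3, d4):
    p1 = d1 ^ d2 ^ d4              # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4              # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4              # covers positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]   # positions 1..7

def correct(code):
    # XOR the positions of all 1-bits; for a valid codeword this
    # is 0, and after one flip it equals the flipped position.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos - 1]:
            syndrome ^= pos
    if syndrome:
        code[syndrome - 1] ^= 1    # flip the bad bit back
    return code

word = encode(1, 0, 1, 1)
word[4] ^= 1                       # one "disk" returns a bad bit
assert correct(word) == encode(1, 0, 1, 1)
```

Three check bits protect four data bits here, and the ratio improves with longer codewords, but it's still a lot more overhead than the single parity bit we're about to see.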
So the point about capacity: the Hamming code doesn't require me to duplicate the data, so I get a capacity improvement over mirroring. As I add disks to the system, I'm not stuck at the capacity of a single disk; I get more space as I add new disks. Now, RAID 3. This is one of the more important observations the authors make, in my opinion, because all the other RAID levels basically fall out from here. In RAID 3, I say: look, I can detect when disks fail. I'm no longer concerned with detecting that a disk is returning inaccurate data. RAID 2, remember, has to assume the disk can give me garbage data. In RAID 3, I say the system knows when a disk fails. Some part of the system communicates with the disk, and the disk knows it's not working anymore and says: I'm broken, don't trust me. So I don't have to identify the failed disk inside the RAID controller, because some other part of the system knows that. All I have to do is correct the failures that result from losing the data on that disk. Any one of these disks I can kill off, and I will know which one it is, and from the remaining disks I need to be able to correctly reassemble the information. So how much extra information is required to do that? Good question. I have n bits of information, at some point one of those bits can fail, and I know which bit it is; I will tell you which bit failed. How many bits do you have to add in order to correct the failure? Double, so n more bits? That's too many, yeah. Log n? I like this; we're just guessing big-O notation things, right? Log n, OK. Anyone want to keep trying? Anyone want to go smaller? I mean, zero is probably wrong, right? One. How does it work? I've got n bits and bit 3 fails. So you added a bit at the end: what are you storing in that bit? And remember, you don't know which bit is going to fail beforehand. I give you n bits, and all I'm going to say is that after you add your information, I'm going to pull one of those bits out and you can't use it anymore. What do you store in that n-plus-first bit? Yeah, or, the way RAID does it, you just add up all of the bits and take that sum modulo two. I take all the ones and zeros, add them up, and when I'm done I take the result modulo two, and I have either a zero or a one. So this is a check bit. When I lose a bit, I take the remaining bits, and I can use the check bit to determine whether the bit that I lost was a one or a zero, and so I can recover the lost bit. Does this make sense? It's the simplest possible error-correcting code. Again, it requires that I know which bit failed, and in a lot of applications, if you think about wireless communication, for example, I don't get to do this, because I don't know which bit failed. The wireless driver says: here's a bunch of bits that I computed by measuring a waveform of electromagnetic radiation traveling through the air; I don't know anything about them other than that's what I got. In that scenario, where I can't identify the bit that's wrong, I need to add more information. But if I know which bit is wrong, I don't: one extra bit is enough. Here's a tiny sketch of the check-bit arithmetic.
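A minimal sketch of that check bit, with made-up data bits. The only point is the arithmetic: parity is the sum of the bits mod 2, and an erased bit at a known position is whatever value makes the parity come out right:

```python
# Recovering a single LOST bit when we know its position.

def parity(bits):
    return sum(bits) % 2        # equivalently, XOR of all the bits

data = [1, 0, 1, 1, 0, 1, 0, 0]   # n data bits (hypothetical)
check = parity(data)              # the one extra bit we store

lost = 3                          # "disk 3" dies; we KNOW which one
survivors = data[:lost] + data[lost + 1:]

# The lost bit is whatever makes total parity match the check bit.
recovered = (check - parity(survivors)) % 2
assert recovered == data[lost]
```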
So, one extra bit. The RAID 3 solution uses byte-level striping: for every eight bits, there's an extra bit of information stored on another disk. Now, the problem with this is that to read a block of information, I still have to touch every disk, because the block is broken into multiple bytes and those bytes are themselves striped over the disks. At this point, the capacity is improved even further compared to RAID 2, because the Hamming codes in RAID 2 required a lot more extra information; now I just have one extra bit. And the nice thing is, as I add disks to the system, I still only need one extra bit per stripe. So if I have 192 disks, like the array they built as part of this project, I can use 191 disks for data and the 192nd as a check disk, and I can repair the loss of any one of the 192 disks. So that's nice, yeah. Yep, if more than one fails, I'm in trouble. And this is a great point: what happens in RAID when a disk fails? What's the assumption here? Has anyone ever had this happen? Yeah, exactly. First of all, you can configure RAID arrays to handle more than one failure; I can build a RAID array that can handle n failures, essentially. Handling more failures increases the amount of extra information I have to store, so capacity goes down as I want to handle more failures. There's an inherent trade-off between the capacity of the combined drive and the number of failures I can handle. But the model for RAID is: when a disk fails, somebody gets an email, and that person goes to the array, pulls out the bad disk, and sticks in a new one. Then that new disk is reloaded with information from the other disks. During that process, the array is referred to as degraded, because it cannot handle additional failures; if an additional disk fails during that time, data will be lost. We actually have a RAID array on a machine at my house, and on Mac, once the array is degraded, you can't write to that mount point anymore. It stops you from making changes, which usually causes something to fail, which means you notice that the array is degraded. Now, rebuilding can take a long time. Ours at home is RAID 1, so it's mirrored, but think about it: rebuilding the array essentially means rewriting a lot of data onto that new disk. When we did this a few weeks ago, it took about 12 to 16 hours, and I've seen it take multiple days on larger RAID arrays. Again, during that time, the array is vulnerable to losing another drive. We'll come back to the failure models for RAID: what I'm assuming is that single things fail, but as long as I catch each failure and replace the disk, the likelihood of two failures overlapping is small. A sketch of the rebuild process is below.
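Here's a toy sketch of what a rebuild does: with a check disk holding the XOR of the data disks, the contents of any one failed disk can be recomputed from the survivors. The three-byte "disks" are made up; a real rebuild streams terabytes through this same XOR, which is why it takes hours or days:

```python
# Rebuild a failed disk from the survivors plus the parity disk.
# Each "disk" is a byte string; parity is the bytewise XOR.

def xor_all(disks):
    out = bytearray(len(disks[0]))
    for d in disks:
        for i, b in enumerate(d):
            out[i] ^= b
    return bytes(out)

data_disks = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
parity_disk = xor_all(data_disks)

# Disk 1 dies. The controller KNOWS which disk failed (that's
# the key RAID 3/4/5 assumption), so it XORs everything left.
failed = 1
survivors = [d for i, d in enumerate(data_disks) if i != failed]
rebuilt = xor_all(survivors + [parity_disk])
assert rebuilt == data_disks[failed]
```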
OK, so RAID 4. RAID 4 makes the observation that, hey, if I store blocks instead of bytes, then I don't have to touch all the disks during a read; I can just read a block from one disk. RAID 3 reads require touching all the disks because things are striped at the byte level. Do you understand what that means? I don't think I explained it. Byte-level striping means that, say with a 256-byte chunk, byte 0 goes on one disk, byte 1 on the next, byte 2 on the next, with the check byte on the check disk, and then bytes 3, 4, 5, 6, 7, and so on wrap around. The bytes are distributed between the disks, and that's why reading a block requires touching all of them. In RAID 4, A1 is instead an entire 256- or 512-byte block, so reading a block that a file system would normally read requires touching only one disk. Now, what this diagram shows is that in RAID 4, all the parity information is stored on a dedicated disk, disk 3. A1, A2, A3 are data blocks, and this is the parity block for that group of data blocks. So if I need to read block A2, how many disks do I need to touch? One, right? Remember, the parity is only used during repair; it's not something I need during a normal read, because if a disk fails, I know immediately that it failed, and I'm not going to read from a failed disk. So reading block A2 requires touching only disk 1. What if I want to modify A2? How many disks do I have to touch, and which ones? Right, disk 1 and disk 3. Say I want to modify block D3: which disks? Disks 2 and 3. What about C1? Disks 0 and 3. So what's the problem here? Disk 3 gets a lot of traffic. Every write has to touch disk 3, no matter which data disk it's aimed at. And as the array gets bigger, the write activity spreads over more data drives, but all of the parity activity lands on a single disk. So what's the logical solution to this? Distribute the parity. And this is RAID 5, probably the most useful contribution of the original RAID paper. So here, and it's a little hard to see in the Wikipedia diagram, this disk holds the parity for these three blocks, this one for those three blocks, this one for those. Along with the data blocks, I'm also striping the parity information, yeah. Absolutely, yeah: that's something the controller has to understand. The controller has to keep maps of where stuff is, and those maps are probably stored on disk, actually, to make sure they're portable. So now reads require one disk, and writes can require two disks, but those disks are distributed evenly throughout the array. I get a better distribution of writes between disks, because I don't have that bottleneck disk for the parity. So, compared to RAID 4, I've improved performance for writes by eliminating the bottleneck. All right, any questions at this point? Yeah. Great question: what happens if I lose parity information? I can regenerate it, yeah. Go back to my example: I gave you n bits, you added bit n+1. Even if I kill off bit n+1, you still have the n bits I gave you. So it turns out I can always rebuild the parity information from the data, just as I can use the parity to rebuild the data. A sketch of how a RAID 5 controller can lay all this out follows.
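Here's a sketch of one possible RAID 5 layout plus the small-write parity update. The rotation scheme (parity walking backwards across the disks, one stripe at a time) is just one common choice; real controllers use several different layouts, so treat this mapping as illustrative:

```python
# A toy RAID 5 mapping: parity rotates across N disks so no
# single disk becomes the parity bottleneck.

N = 4  # number of disks (hypothetical)

def parity_disk(stripe):
    # Stripe 0 -> parity on disk 3, stripe 1 -> disk 2, etc.
    return (N - 1 - stripe) % N

def locate(block):
    # Map a logical block to (disk, stripe), skipping the disk
    # that holds this stripe's parity.
    stripe, offset = divmod(block, N - 1)
    p = parity_disk(stripe)
    disk = offset if offset < p else offset + 1
    return disk, stripe, p

# Small write: read old data and old parity, write new data and
# new parity. Only two disks are touched, because
#   new_parity = old_parity XOR old_data XOR new_data
def updated_parity(old_parity, old_data, new_data):
    return old_parity ^ old_data ^ new_data

for b in range(9):
    disk, stripe, p = locate(b)
    print(f"block {b} -> disk {disk}, stripe {stripe} (parity on disk {p})")
```

The XOR identity in `updated_parity` is why a write only touches two disks: the controller doesn't need the other data blocks in the stripe to recompute parity.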
And it's really important to remember: RAID 3, 4, and 5 all assume that the disk controller knows when disks fail. I don't have to detect that a disk failed by looking at the data; that's a much harder problem. With RAID 2, where I use these longer codes, I have to both detect that a disk has failed and determine which disk has failed. With a single parity bit, if I didn't know which disk had failed, I could detect that something was wrong, but I'd have no idea which bit. Again: if I give you n bits, you compute a parity bit as bit n+1, and I say I'm going to flip one of those bits but I'm not going to tell you which one, you can't tell which. You can tell that something is wrong, because you can recompute the parity bit and it won't match, but unless you have the original data, which you don't, you can't tell which bit is wrong. And that's the real core assumption here: the disk has to know when it's failing, or some part of the system has to know, and that's out-of-band information for the RAID controller. That's a good question. Yeah, can you apply this approach to memory? Yeah, why not? You can apply this to anything that stores data. Memory chips, yeah, it's interesting. Memory is normally used for more transient information, but I was telling someone after class that the author of the LFS paper is actually now working on entire data center designs that only use memory. No stable storage at all: entire cloud services built only on memory. Why not? Memory's cheaper. Well, it's not cheaper; it's cheap-ish, and it's really fast. I mean, at some point we won't have disks anymore, right? That moment will come. I don't know, that'll be sort of exciting. Well, we'll always have disks with us; we still have tape drives, right? But we won't use them in the same way. All right, just to complete this picture, I want to make sure you understand a non-RAID solution. The RAID authors are probably angry that this is referred to as RAID 0. I have a RAID 0 array under my desk, and I'm totally OK with that, but I just want to make sure you understand what RAID 0 is and is not, because the "RAID" in RAID 0 is very misleading. In RAID 0, I essentially build what looks like one disk out of two single disks. The file system sees one big disk, but half the blocks come from one disk and half from the other. Now, the performance is great, because I can divide traffic between the two disks. How much safety does the system have? It's worse than one disk, right? Because if either disk fails, the entire file system is corrupt. So as long as you understand that this is how it works (all my stuff's backed up in other ways, I think; maybe I'll find out someday), that's fine. But with RAID 0, you have to take the RAID and multiply by zero. That's how much RAID there is in RAID 0: none. It is not a RAID solution. So: RAID arrays can be constructed to tolerate one or more failures. If the RAID array can tolerate n failures, then once I've had n failures, an additional failure will cause data loss, and at that point someone has to come in and address the problem by replacing disks and rebuilding the array. And again, rebuilding can take a while. Before we wrap up, here's a quick sketch of the RAID 0 striping idea.
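A minimal sketch of RAID 0 striping, assuming a hypothetical two-disk array: logical blocks alternate round-robin across the disks. That's where the bandwidth comes from, and it's also why one dead disk takes the whole file system with it, since every file's blocks are spread across both:

```python
# Toy RAID 0: blocks striped round-robin. Great bandwidth,
# zero redundancy: lose either disk, lose everything.

NUM_DISKS = 2

def locate(block):
    # Even logical blocks land on disk 0, odd ones on disk 1,
    # so sequential I/O keeps both disks busy at once.
    return block % NUM_DISKS, block // NUM_DISKS

for b in range(6):
    disk, offset = locate(b)
    print(f"logical block {b} -> disk {disk}, offset {offset}")
```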
OK. I don't know if you guys liked this paper; I thought it was pretty cool, and I really hadn't heard the backstory about RAID before. Obviously, Randy Katz is proud of RAID, as he is completely entitled to be. One of the things I thought was neat, which I didn't know, was simply the fact that, as he put it, there were a lot of existing efforts in this area. This wasn't a bolt of lightning from the sky, with no one having thought about this before; a lot of people were working on this problem. And one of the enduring contributions of RAID was simply creating a common vocabulary so that people could talk about what they were doing. Even though we don't use RAID 2 and RAID 4 as much anymore, those RAID levels described, at the time, approaches that people were actually taking to building performant, safe disk arrays out of cheaper parts. And so this allowed people to come together and work on these things. That was pretty cool. So, any final questions about RAID? Yeah, good question: can I have parallel writes in a RAID 5 array? Yeah, why not? It really depends on which disks are involved, right? Yeah? No, remember, each block generates parity information that lives on some other disk. So to modify C1 and C2, I have to modify the blocks C1 and C2, and then their parity information somewhere else. On a large enough RAID array (if you read the Randy Katz paper, they build a 192-disk RAID array), I can do a lot of parallel writes, because essentially each write picks two disks. I pick two disks for my first write, then two other disks for my second. What's the likelihood of a collision with the first two? Not very likely on 192 disks. Then two more, and two more. At some point, though, for example, if I wanted to modify C1 and B2... oh, no, I see what you mean. OK, yeah. No, this disk stores separate parity information for C1, C2, and C3. The parity information is computed separately per block; I don't have to use all of them to compute it. C1 generates a few bytes of parity information, C2 generates a few bytes of parity information, but all that parity information is stored together here. So if I wanted to modify C1 and C2 in parallel, I'd be able to modify the data blocks in parallel, but when it came to modifying the parity information, I'd have two writes hitting disk 1, and those would have to serialize in some way. Yeah, cool. All right: Friday, no lecture. Finish assignment 3.2. Next week, we'll talk about virtualization. Have a good weekend.