All right, let's get started. This is 6.824, Distributed Systems. I'd like to start with a brief explanation of what I think a distributed system is. The core of it is a set of cooperating computers that communicate with each other over a network to get some coherent task done. The kinds of examples we'll focus on in this class are things like storage for big websites, big data computations such as MapReduce, and also some more exotic things like peer-to-peer file sharing. Those are all examples of the kinds of case studies we'll look at. The reason all this is important is that a lot of critical infrastructure out there is built out of distributed systems: infrastructure that requires more than one computer to get its job done, or that inherently needs to be spread out physically. Before I even talk about distributed systems, though, I want to remind you that if you're designing a system to solve some problem, and you can possibly solve it on a single computer without building a distributed system, you should do it that way. There are many, many jobs you can get done on a single computer, and it's always easier. You should try everything else before you build a distributed system, because distributed systems are not simpler. So, the reasons people are driven to use lots of cooperating computers: one is that they need to get high performance. The way to think about that is that they want some sort of parallelism: lots of CPUs, lots of memories, lots of disk arms moving in parallel. Another reason people build this stuff is to be able to tolerate faults: have two computers do the exact same thing, and if one of them fails, you can cut over to the other one. Another is that some problems are just naturally spread out in space. Say you want to do interbank transfers of money or something.
Well, Bank A has its computer in New York City and Bank B has its computer in London. You just have to have some way for them to talk to each other and cooperate in order to carry that out. So there are natural physical reasons: systems that are inherently physically distributed. And the final reason people build this stuff is to achieve some sort of security goal. Often there's some code you don't trust, or you need to interact with somebody who may be malicious, or whose code may have bugs in it, so you don't want to have to trust it. You may want to split up the computation: your stuff runs over there on that computer, my stuff runs here on this computer, and they only talk to each other through some narrowly defined network protocol. So we may be worried about security, and that's achieved by splitting things up onto multiple computers so they can be isolated. Most of this course is going to be about performance and fault tolerance, although the other two often work themselves in by way of the constraints on the case studies we're going to look at. All distributed systems share certain problems. Because they have many parts, and the parts execute concurrently on multiple computers, you get all the problems that come up in concurrent programming: all the complex interactions and weird timing-dependent behavior. That's part of what makes distributed systems hard. Another thing that makes them hard is that, again because you have multiple pieces plus a network, you can have very unexpected failure patterns. If you have a single computer, it's usually the case that either the computer works or it crashes or suffers a power failure or something: it pretty much either works or it doesn't. In a distributed system made up of lots of computers, you can have partial failures.
That is, some pieces stop working while other pieces continue working. Or maybe the computers are working, but some part of the network is broken or unreliable. So partial failure is another reason distributed systems are hard. Those are the basic challenges. A final reason they're hard is that the original reason to build a distributed system is often to get higher performance, to get 1,000 computers' worth of performance or 1,000 disk arms' worth of performance, but it's actually very tricky to obtain that 1,000x speedup with 1,000 computers. There are often a lot of roadblocks thrown in your way, so it often takes a fair amount of careful design to make the system actually give you the performance you feel you deserve. Much of this course is about addressing these issues. One reason to take the course is that the problems and the solutions are often just technically interesting. They're hard problems. For some of them, pretty good solutions are known; for others, not such great solutions are known yet. And distributed systems are used by a lot of real-world systems out there: big websites often involve vast numbers of computers put together as distributed systems. When I first started teaching this course, distributed systems were something of an academic curiosity. People thought they were used sometimes at small scale, and felt that someday they might be important. But now, driven particularly by the rise of giant websites with vast amounts of data and entire warehouses full of computers, distributed systems in the last 20 years have become a seriously important part of computing infrastructure. That means a lot of attention has been paid to them, and a lot of the problems have been solved. But there are still quite a few unsolved problems.
So if you're a graduate student or you're interested in research, there are a lot of problems yet to be solved in distributed systems that you could look into. And finally, if you like building stuff, this is a good class, because it has a lab sequence in which you'll construct some fairly realistic distributed systems focused on performance and fault tolerance. So you'll get a lot of practice building distributed systems and making them work. Let me talk about course structure a bit before I get started on real technical content. You should be able to find the course website using Google. On the course website are the lab assignments and the course schedule, and also a link to a Piazza page where you can post questions and get answers. As for course staff: I'm Robert Morris, and I'll be giving the lectures. I also have four TAs; you guys might want to stand up and show your faces. The TAs are experts, in particular at solving the labs, and they'll be holding office hours. So if you have questions about the labs, you should go to office hours, or you can post questions to Piazza. The course has a couple of important components. One is the lectures. There's a paper for almost every lecture. There are two exams. There are the programming labs. And there's an optional final project that you can do instead of one of the labs. The lectures will be about big ideas in distributed systems, and there will also be a couple of lectures that are more about lab programming. A lot of lecture time will be taken up by case studies: a lot of the way I try to bring out the content of distributed systems is by looking at papers, some academic, some written by people in industry, describing real solutions to real problems. These lectures are being videotaped, and I'm hoping to post them online, so that if you're not here, or you want to review the lectures, you'll be able to look at the videos. As for the papers: again, there's one to read per week.
Most of them are research papers. Some are classic papers, like today's paper, which I hope some of you have read, on MapReduce. It's an old paper, but it spurred an enormous amount of interesting work, both academic and in the real world. So some are classics, and some are more recent papers talking about more up-to-date research, what people are currently worried about. From the papers, I'll be hoping to tease out what the basic problems are and what ideas people have had that might or might not be useful in solving distributed systems problems. We'll sometimes look at implementation details in these papers, because a lot of this has to do with the actual construction of software systems. And we're also going to spend a certain amount of time looking at evaluations: people evaluating how fault tolerant their systems are by measuring them, or measuring how much performance improvement they got, or whether they got any improvement at all. I'm hoping you'll all read the papers before coming to class. The lectures are not going to make as much sense if you haven't already read the paper, because there's not enough time in one class to both explain all the content of a paper and have an interesting reflection on what the paper means. So you really have to read the papers before coming to class. And hopefully one of the things you'll learn in this class is how to read a paper rapidly and efficiently: how to skip over the parts that maybe aren't that important, and focus on teasing out the important ideas. On the website, linked from the schedule, there's a question for every paper that you should submit an answer to; I believe the answers are due at midnight.
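Since today's paper is MapReduce, here's a tiny single-machine sketch of the programming model it describes: the application supplies a map function that emits key/value pairs and a reduce function that combines all the values emitted for a key. The function names and the driver loop here are illustrative, not from the paper; the real system runs the map and reduce calls in parallel across many machines.

```python
from collections import defaultdict

def map_fn(filename, contents):
    # Emit an intermediate (word, 1) pair for every word in the split.
    return [(word, 1) for word in contents.split()]

def reduce_fn(key, values):
    # Combine all the counts emitted for one word.
    return sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: run map_fn over every input split, grouping the
    # intermediate pairs by key (the "shuffle").
    intermediate = defaultdict(list)
    for filename, contents in inputs.items():
        for key, value in map_fn(filename, contents):
            intermediate[key].append(value)
    # Reduce phase: run reduce_fn once per distinct key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

counts = run_mapreduce({"a.txt": "the cat", "b.txt": "the dog"}, map_fn, reduce_fn)
print(counts)  # {'the': 2, 'cat': 1, 'dog': 1}
```

The point of the model is that only map_fn and reduce_fn are application-specific; everything else (splitting inputs, shuffling, scheduling, handling failures) belongs to the framework.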
We also ask that you submit a question you have about each paper through the website, both to give me something to think about as I'm preparing the lecture and, if I have time, so I can try to answer at least a few of the questions by email. The question and the answer for each paper are due at midnight the night before. There are two exams: a midterm exam in class, I think on the last class meeting before spring break, and a final exam during finals week at the end of the semester. The exams are going to focus mostly on the papers and the labs, and probably the best way to prepare for them, besides attending lecture and reading the papers, is to look at old exams. We have links to 20 years of old exams and solutions, so you can look at those and get a feel for what kinds of questions I like to ask. And indeed, because we read many of the same papers every year, I inevitably ask questions that can't help but resemble questions asked in previous years. As for the labs: there are four programming labs, and the first of them is due Friday next week. Lab one is a simple MapReduce lab, in which you'll implement your own version of the paper we read for today, which I'll be discussing in a few minutes. Lab two involves a technique called Raft, which, in theory, allows any system to be made fault tolerant by replicating it, with Raft managing the replication and the automatic cutover if one of the replicated servers fails. So that's Raft, for fault tolerance. In lab three, you'll use your Raft implementation to build a fault-tolerant key/value server.
And in lab four, you'll take your replicated key/value server and clone it into a number of independent groups, and you'll split the data in your key/value storage system across all of these replicated groups, to get parallel speedup by running multiple replicated groups in parallel. You'll also be responsible for moving the various chunks of data between different servers as they come and go, without dropping any balls. This is what's often called a sharded key/value service, where sharding refers to splitting up, or partitioning, the data among multiple servers in order to get parallel speedup. If you want, instead of doing lab four, you can do a project of your own choice. The idea here is that if you have your own idea for a distributed system, in the style of the systems we talk about in this class, and you'd like to build something and measure whether it worked in order to explore that idea, you can do a project. For a project, you'll pick some teammates, because we require that projects be done in teams of two or three people. So pick some teammates and send your project idea to us, and we'll think about it and say yes or no, and maybe give you some advice. If we say yes and you want to do a project, you'll do that instead of lab four. It's due at the end of the semester, and you should do some design work and build a real system. Then, on the last day of class, you'll demonstrate your system, as well as handing in a short written report about what you built. I've posted on the website some ideas that might or might not be useful to spur thoughts about what you might build. But really, the best projects are the ones where you have a good idea of your own.
The idea is that if you want to do a project, you should choose something in the same vein as the systems we're talking about in this class. Okay, back to labs. For lab grades, you hand in your lab code, we run some tests against it, and your grade is based on how many tests you pass. We give you all the tests that we use, so there are no hidden tests. If you implement the lab and it reliably passes all the tests when you run it, then, unless there's something funny going on (which there sometimes is), chances are good it'll pass all the tests when we run it and you'll get a full score. So hopefully there'll be no mystery about what score you're likely to get on the labs. Let me warn you that debugging these labs can be time consuming: because they're distributed systems, with a lot of concurrency and communication, strange and difficult-to-debug errors can crop up. So you really ought to start the labs early. You'll have a lot of trouble if you leave the labs for the last moment; you've got to start early. If you have problems, please come to the TAs' office hours, and please feel free to ask questions about the labs on Piazza. And indeed, if you know the answer to someone's question on Piazza, I hope you'll answer it as well. All right. Any questions about the mechanics of the course? Yes. The question is how these things factor into the grade. I forget, but it's all on the website; I think the labs are the single most important component. Okay. All right. So this is a course about infrastructure for applications. All through this course, there's going to be a split in the way I talk about things, between applications, which somebody else, the customer, writes, and the infrastructure that those applications use, which is what we're thinking about in this course.
The kinds of infrastructure that tend to come up a lot are storage, communication, and computation, and we'll talk about systems that provide all three. Storage is the one we'll focus on most, because it's a very well-defined, useful, and usually fairly straightforward abstraction, so people know a lot about how to use and build storage systems, and how to build replicated, fault-tolerant, high-performance distributed implementations of storage. We'll also talk about some computation systems; MapReduce, for today, is a computation system. And we will talk about communication some, but mostly as a tool we need in order to build distributed systems: computers have to talk to each other over a network, and maybe you need reliability or something, so we'll talk a bit about that, but we're mostly consumers of communication. If you want to learn how communication systems themselves work, that's more the topic of 6.829. For storage and computation, a lot of our goal is to discover abstractions: ways of simplifying the interface to distributed storage and computation infrastructure, so that it's easy to build applications on top of it. What that really means is that we'd like to build abstractions that hide the distributed nature of these systems. The dream, which is rarely fully achieved, would be an interface that looks to an application like a non-distributed storage system, just like a file system or something that everybody already knows how to program and that has pretty simple semantics. We'd love to build interfaces that look and act just like non-distributed storage and computation systems, but are actually vast, extremely high-performance, fault-tolerant distributed systems underneath. So we'd love to have abstractions.
As you'll see as the course goes on, we're only part of the way there. It's rare to find an abstraction for a distributed version of storage or computation that has simple behavior, that behaves just like the non-distributed version everybody understands. But people are getting better at this, and we're going to study what people have learned about building such abstractions. Okay, so what kinds of topics show up as we consider these abstractions? The first general topic, which we'll see in a lot of the systems we look at, has to do with implementation. For example, one tool you see a lot is remote procedure call, whose goal is to mask the fact that we're communicating over an unreliable network. Another implementation topic we'll see a lot is threads, a programming technique that allows us to harness multi-core computers but, maybe more important for this class, gives us a way of structuring concurrent operations that hopefully simplifies the programmer's view of them. And because we're going to use threads a lot, it turns out we'll also need to spend a certain amount of time, at the implementation level, thinking about concurrency control: things like locks. These implementation ideas are touched on in many of the papers, but the main place they'll come up is the labs, where you're going to come face to face with all of this in a big way. You need to do the programming for distributed systems, and these, beyond ordinary programming, are some of the critical tools you'll need. Another big topic that comes up in all the papers we're going to talk about is performance.
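To make the concurrency-control point concrete, here is a minimal Python sketch (the names are mine, not from any lab) in which several threads increment a shared counter under a lock. Without the lock, the read-modify-write on the counter could interleave between threads and lose increments.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        # The lock makes this read-modify-write atomic; without it,
        # two threads could read the same value and lose an update.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000: no increments were lost
```

The same pattern, a mutex guarding every access to shared state, is what you end up writing over and over in concurrent server code.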
Usually the high-level goal of building a distributed system is to get what people call scalable speedup. What I mean by scalability, or scalable speedup, is that if I have some problem I'm solving with one computer, and I buy a second computer, I can now solve the problem in half the time, or solve twice as many problem instances per minute, as I could with one. So two times the computers or resources gets me two times the performance or throughput. And this is a huge hammer. If you can build a system that actually has this behavior, namely that increasing the number of computers you throw at the problem by some factor gets you that factor more throughput, that's a huge win, because you can buy computers with just money. The alternative is that, in order to get more performance, you have to pay programmers to restructure your software: to make it more efficient, or to apply specialized techniques, better algorithms or something. Paying programmers to make your code faster is an expensive way to go. We'd love to be able to just buy 1,000 computers instead of 10 and get 100 times more throughput. That's fantastic. So this scalability idea is a huge idea in the backs of people's heads when they're building things like big websites that run on a building full of computers: the building full of computers is there to get a corresponding amount of performance, but you have to be careful about the design in order to actually get it. The way this often looks, when we draw diagrams in this course: suppose we're building a website. Ordinarily you might have a website with an HTTP server.
Let's say it has some users running web browsers, and they talk to a web server running Python or PHP or whatever, and the web server talks to some kind of database. When you have one or two users, you can just have one computer running both, or maybe a computer for the web server and a computer for the database. But maybe all of a sudden you get really popular, and 100 million people sign up for your service. How do you fix your site? You certainly can't support millions of people on a single computer, except by extremely careful, labor-intensive optimization, which you don't have time for. So typically the first thing you do to speed things up is buy more web servers and split the users, so that half your users, or some fraction of them, go to web server one and the other half go to web server two. And because maybe you're building, I don't know, Reddit or something, where all the users ultimately need to see the same data, you have all the web servers talk to the same back-end database, and maybe you can keep adding web servers for a long time. So this is a way of getting parallel speedup on the web server code; if you're running PHP or Python, maybe it's not too efficient, and as long as each individual web server doesn't put too much load on the database, you can add a lot of web servers before you run into problems. But this kind of scalability is rarely infinite, unfortunately, and certainly not without serious thought. What tends to happen with these systems is that at some point, after you have 10 or 20 or 100 web servers all talking to the same database, the database starts to be the bottleneck, and adding more web servers no longer helps. So it's rare that you get full scalability just by adding infinite numbers of computers.
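The user-splitting step just described is often done by hashing some stable user identifier to pick a front end. This is a toy sketch of that idea, with made-up server names:

```python
import hashlib

web_servers = ["ws1", "ws2", "ws3"]  # made-up front-end server names

def pick_server(user_id):
    # Hash the user id so the same user always lands on the same
    # front end, and users spread roughly evenly across the servers.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return web_servers[int(digest, 16) % len(web_servers)]

assignments = {user: pick_server(user) for user in ["alice", "bob", "carol"]}
print(assignments)
```

Note this only spreads the front-end work; in the picture above, every pick_server result still talks to the same shared database, which is why the database eventually becomes the bottleneck.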
At some point you run out of gas, because the place where you're adding computers is no longer the bottleneck. By having lots and lots of web servers, we've basically moved the bottleneck, the thing limiting performance, from the web servers to the database. And at this point you almost certainly have to do a bit of design work, because it's rare that there's any straightforward way to take data stored in a single database and refactor it so it's split over multiple databases. It's often a fair amount of work. And because it's awkward but many people actually need to do it, we're going to see a lot of examples in this course where the distributed system being described is a storage system, because the authors were running something like a big website that ran out of gas on a single database or storage server. Anyway, the scalability story is that we'd love to build systems that scale this way, but it takes careful design work, and you can't push the idea infinitely far. Okay, another big topic that comes up a lot is fault tolerance. If you're building a system out of a single computer, well, a single computer can often stay up for years. I have servers in my office that have been up for years without crashing: the computer's pretty reliable, the operating system's pretty reliable, and apparently the power in my building is pretty reliable. So it's not uncommon to have single computers that just stay up for amazing amounts of time. However, if you're building systems out of thousands of computers, then even if each computer can be expected to stay up for a year, with a thousand computers you're going to have about three computer failures per day.
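The back-of-the-envelope arithmetic behind that estimate, assuming each machine fails about once a year:

```python
machines = 1000
failures_per_machine_per_day = 1 / 365  # each machine fails about once a year
failures_per_day = machines * failures_per_machine_per_day
print(round(failures_per_day, 1))  # 2.7, i.e. roughly three failures a day
```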
So solving big problems with big distributed systems turns very rare failure problems into failure problems that happen all the time. In a system with a thousand computers, there's almost certainly always something broken. There's always some computer that's either crashed, or mysteriously running incorrectly or slowly, or doing the wrong thing. Or maybe some piece of the network is broken: with a thousand computers, we've got a lot of network cables and a lot of network switches, and there's always some network cable that somebody stepped on and that's now unreliable, or a cable that fell out, or some switch whose fan broke so the switch overheated and failed. There's always some little problem somewhere in your building-sized distributed system. So big scale turns failures from very rare events you don't really have to worry about into constant problems. That means the response to failures, the masking of failures, the ability to proceed despite failures, has to be built into the design, because there are always failures. And as part of building convenient abstractions for application programmers, we really need infrastructure that, as much as possible, hides or masks the failures from application programmers, so that not every application programmer has to have a complete, complicated story for all the different kinds of failures that can occur. There are a bunch of different notions of what it means to be fault tolerant, so let me say a little more about exactly what we mean by that. We'll see a lot of different flavors, but among the more common ideas, one is availability.
Some systems are designed so that under certain kinds of failures, not all failures but certain kinds, the system will keep operating despite the failure, while providing undamaged service: the same kind of service it would have provided if there hadn't been a failure. Such systems are available in that sense. So if you build a replicated service that has two copies, and one of the replica servers fails, maybe the other server can continue operating. If both fail, of course, you can't promise availability. So available systems usually say: under a certain set of failures, we'll continue providing service, we'll be available; if more failures than that occur, we won't be available anymore. Another kind of fault tolerance you might have, in addition to availability or by itself, is recoverability. What this means is that if something goes wrong, maybe the service will stop working, that is, it'll simply stop responding to requests, and it'll wait for someone to come along and repair whatever went wrong; but after the repair, the system will be able to continue as if nothing bad had happened. This is a weaker requirement than availability, because here we're not going to do anything until the failed component has been repaired; but the fact that we can get going again without any loss of correctness is still a significant requirement. It means recoverable systems typically need to do things like save their latest data on disk, or somewhere they can get it back after the power comes back up. And even among available systems, in order for a system to be useful in real life, what usually happens is that they're available until some number of failures have happened. If too many failures happen, an available system will stop working, or stop responding at all, but once things have been repaired, it'll continue operating.
So a good available system will be recoverable as well, in the sense that if too many failures occur it'll stop answering, but it will continue correctly after repairs. That's what we'd love to obtain. We'll see a number of approaches to these problems, but there are really two tools that are the most important in this department. One is non-volatile storage, so that if something crashes, or the power fails, maybe a building-wide power failure, we can use non-volatile storage like hard drives or flash or solid-state drives to store a checkpoint or a log of the state of the system; then, when the power comes back up or somebody repairs things, we'll be able to read our latest state off the drive and continue from there. So one tool is non-volatile storage, and the management of non-volatile storage comes up a lot, because non-volatile storage tends to be expensive to update. A huge amount of the nitty-gritty of building high-performance fault-tolerant systems is in clever ways to avoid having to write non-volatile storage too often. In the old days, and even today, writing non-volatile storage meant moving a disk arm and waiting for a disk platter to rotate, both of which are agonizingly slow on the scale of three-gigahertz microprocessors. With things like flash, life is quite a bit better, but it still requires a lot of thought to get good performance. The other big tool we have for fault tolerance is replication, and the management of replicated copies is tricky. The key problem lurking in any replicated system, where we have two servers each with a supposedly identical copy of the system state, is that the two replicas will accidentally drift out of sync and stop being replicas.
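As a picture of the recoverability idea: append each update to a log on non-volatile storage before applying it, and on restart replay the log to rebuild the in-memory state. This is a toy sketch with a made-up file name and format; real systems go to great lengths to batch and avoid these slow synchronous writes.

```python
import json
import os

LOG = "kv.log"  # made-up name for a log file on non-volatile storage

if os.path.exists(LOG):
    os.remove(LOG)  # start this demo from a clean slate

def put(state, key, value):
    # Append the update to the log first, so it survives a crash,
    # then apply it to the in-memory table.
    with open(LOG, "a") as f:
        f.write(json.dumps([key, value]) + "\n")
        f.flush()
        os.fsync(f.fileno())  # the expensive part: force the write to disk
    state[key] = value

def recover():
    # After a crash, replay the log to rebuild the in-memory state.
    state = {}
    if os.path.exists(LOG):
        with open(LOG) as f:
            for line in f:
                key, value = json.loads(line)
                state[key] = value
    return state

state = {}
put(state, "x", 1)
put(state, "x", 2)
print(recover())  # {'x': 2}: a restarted server sees the latest state
```

The fsync per update is exactly the cost the lecture is talking about: much of the cleverness in real systems is in amortizing or avoiding those synchronous writes.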
And that's in the back of every design we're going to see that uses replication to get fault tolerance; Lab 2 and Lab 3 are all about the management of replicated copies for fault tolerance. As you'll see, it's pretty complex. A final, cross-cutting topic is consistency. As an example of what I mean by consistency, suppose we're building a distributed storage system, and it's a key/value service, so it supports just two operations. There's a put operation: you give it a key and a value, and the storage system stashes the value away as the value for that key, so it maintains just a big table of keys and values. And there's a get operation: the client sends it a key, and the storage service is supposed to respond with the value stored for that key. When I can't think of anything else as an example of a distributed system, I'll whip out key/value services; they're very useful, and they're sort of the fundamental, simple version of a storage system. Of course, if you're an application programmer, it's helpful if these two operations have meanings attached to them, so you can go look in the manual and the manual says what you'll get back if you call get, and what it means for you to call put. It's great if there's some sort of spec for what they mean; otherwise, who knows? How can you possibly write an application without a description of what put and get are supposed to do? This is the topic of consistency, and the reason it's interesting in distributed systems is that, both for performance reasons and for fault tolerance reasons, we often have more than one copy of the data floating around.
So in a non-distributed system, where you just have a single server with a single table, there's often, although not always, relatively little ambiguity about what put and get could possibly mean. Intuitively, put means update the table, and get means fetch the value that's currently stored in the table. But in a distributed system, where there's more than one copy of the data due to replication or caching or who knows what, there may be lots of different versions of a key value pair floating around. Suppose there are two copies of the server, each with its own key value table, and key 1 has value 20 on both of them. And then some client issues a put. Say we have a client over here, and it wants to update the value of key 1 to be 21; maybe it's counting stuff in this key value server. So it sends a put with key 1 and value 21 to the first server, and it's about to send the same put to the second server, because it wants to update both copies to keep them in sync. But just before it sends the put to the second server, it crashes: a power failure, or a bug in the operating system, or something. So the state we're left in, sadly, is that we've updated one of the two replicas to have value 21, but the other one still has 20. Now if somebody comes along and reads with a get, wanting the value associated with key 1, they might get 21 or they might get 20, depending on which server they talk to. And even if the rule is that you always talk to the first server, if you're building a fault-tolerant system the actual rule has to be: you talk to the first server, unless it's failed, in which case you talk to the second server. So either way, someday you risk exposing this stale copy of the data to some future get.
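Here's that failure scenario as a few lines of toy Python (purely illustrative; no real client works this way):

```python
# Two replicas of the key/value table, initially in sync.
replica_a = {"1": 20}
replica_b = {"1": 20}

def client_put(key, value, crash_after_first=False):
    # The client updates both copies itself, one at a time.
    replica_a[key] = value
    if crash_after_first:
        return  # simulate the client crashing between the two puts
    replica_b[key] = value

# The client updates replica A, then crashes before reaching B.
client_put("1", 21, crash_after_first=True)
```

After this runs, replica A holds 21 while replica B still holds 20, so what a get returns depends entirely on which replica the reader happens to consult.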
And it could be that many gets see the updated 21, and then next week, all of a sudden, some get yields a week-old copy of the data. That's not very consistent, but it's the kind of thing that can happen if we're not careful. So we need to actually write down what the rules are gonna be about puts and gets, given this danger due to replication. And it turns out there are many different definitions of consistency you can have. Many of them are relatively straightforward; many of them sound like: a get yields the value written by the most recently completed put. That's usually called strong consistency. It turns out it's also very useful to build systems with much weaker consistency, that, for example, do not guarantee anything like a get seeing the value written by the most recent put. So there are strongly consistent systems, which usually have some version of gets seeing the most recent put, although there are a lot of details to work out. There are also many flavors of weakly consistent systems that make no such guarantee: if someone does a put, you may not see it; you may see old values that weren't updated by the put for an unbounded amount of time. And the reason people are very interested in weak consistency schemes is that strong consistency, that is, having reads guaranteed to see the most recent write, is a very expensive spec to implement. What it almost certainly means is that somebody has to do a lot of communication in order to actually implement some notion of strong consistency. If you have multiple copies, it means that either the writer or the reader, or maybe both, has to consult every copy.
Like in this case, where a client crash left one replica updated but not the other: if we wanted to implement strong consistency in maybe the simplest way in this system, we'd have readers read both of the copies, or all the copies if there are more than two, and use the most recently written value they find. But that's expensive; that's a lot of chit chat just to read one value. So in order to avoid communication as much as possible, particularly when replicas are far away, people build weak systems that might actually allow the stale read of an old value in this case, although there's often more semantics attached to try to make these weak schemes more useful. And where this communication problem, strong consistency requiring expensive communication, really runs you into trouble is that if we're using replication for fault tolerance, then we really want the replicas to have independent, uncorrelated failure probabilities. For example, putting both replicas of our data in the same rack in the same machine room is probably a really bad idea, because if someone trips over the power cable to that rack, both copies of our data are gonna die, since they're both attached to the same power cable in the same rack. So in the search for making replica failures as independent as possible, in order to get decent fault tolerance, people would love to put different replicas as far apart as possible, like in different cities, or maybe on opposite sides of the continent. Then an earthquake that destroys one data center will be extremely unlikely to also destroy the data center that holds the other copy. So we'd love to be able to do that. But if you do that, the other copy is thousands of miles away, and the speed of light means it may take on the order of milliseconds or tens of milliseconds to communicate with a data center across the continent in order to update the other copy of the data.
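One simple way to make "read all the copies and use the most recently written value" concrete is to tag each write with a version number. This hypothetical sketch (the version tagging is my addition, not a complete protocol; writes can still race) shows why such a strong read has to talk to every replica:

```python
# Each replica maps key -> (version, value). Replica 0 missed the
# latest update; replica 1 saw it.
replicas = [
    {"1": (1, 20)},
    {"1": (2, 21)},
]

def strong_get(key):
    # Consult every replica (one round of chit chat per copy, which
    # is exactly the expense the lecture is describing) and return
    # the value carrying the highest version number.
    best_version, best_value = -1, None
    for r in replicas:
        if key in r:
            version, value = r[key]
            if version > best_version:
                best_version, best_value = version, value
    return best_value
```

Skipping even one replica could mean returning the stale 20, which is why weak schemes that consult only the nearest copy give up this guarantee.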
And so that makes the communication required for strong consistency potentially extremely expensive. Every time you wanna do one of these put operations, or maybe a get, depending on how you implement it, you might have to sit there waiting for 10 or 20 or 30 milliseconds in order to talk to both copies of the data, to ensure that they're both updated, or both checked to find the latest copy. And that's a tremendous expense: 10 or 20 or 30 milliseconds on machines that, after all, execute something like a billion instructions per second. We're wasting a lot of potential instructions while we wait. So people often build much weaker systems, where you're allowed to update only the nearest copy, or consult only the nearest copy. There's a huge amount of academic and real world research on how to structure weak consistency guarantees so they're actually useful to applications, and how to take advantage of them in order to get high performance. All right, so that's a lightning preview of the technical ideas in the course. Any questions about this before I start talking about MapReduce? All right, I wanna switch to MapReduce. That's a detailed case study that's actually gonna illustrate most of the ideas we've been talking about here. MapReduce is a system that was originally designed, built, and used by Google; I think the paper dates back to 2004. The problem they were faced with was that they were running huge computations on terabytes and terabytes of data, like creating an index of all of the content of the web, or analyzing the link structure of the entire web in order to identify the most important or most authoritative pages. And the whole web was, even in those days, tens of terabytes of data. Building an index of the web is basically equivalent to running a sort over the entire data set.
That's reasonably expensive, and to run a sort on the entire content of the web on a single computer, I don't know how long it would have taken, but it's weeks or months or something. So Google at the time was desperate to be able to run giant computations on giant data on thousands of computers, so that the computations could finish rapidly. It was worth it to them to buy lots of computers so that their engineers wouldn't have to spend a lot of time reading the newspaper or something, waiting for their big compute jobs to finish. And so for a while, they had their clever engineers hand-write this stuff. If an engineer needed to write a web indexer or some sort of web link analysis tool, Google bought the computers and said: here, engineers, run whatever software you like on these machines. They would laboriously write the one-off, hand-built software to take whatever problem they were working on, somehow farm it out to a lot of computers, organize that computation, and get the data back. If you only hire engineers who are skilled distributed systems experts, maybe that's okay, although even then it's probably very wasteful of engineering effort. But Google wanted to hire people who were skilled at something else, not necessarily engineers who wanted to spend all their time writing distributed systems software. So they really needed some kind of framework that would make it easy for their engineers to write just the guts of whatever analysis they wanted to do, like a sort algorithm or a web indexer or a link analyzer, and be able to run it on thousands of computers without worrying about the details: how to spread the work over the thousands of computers, how to organize whatever data movement was required, how to cope with the inevitable failures.
So they were looking for a framework that would make it easy for non-specialists to write and run giant distributed computations. And that's what MapReduce is all about. The idea is that the programmer, the application designer, the consumer of this distributed computation, just writes a simple map function and a simple reduce function that don't know anything about distribution, and the MapReduce framework takes care of everything else. An abstract view of what MapReduce is up to: it starts by assuming that there's some input, and the input is split up into a whole bunch of different files or chunks in some way. So we're imagining that we have input file one, input file two, et cetera. These inputs are maybe web pages crawled from the web, or more likely big files, each of which contains many web pages crawled from the web. The way MapReduce starts is that you define a map function, and the MapReduce framework is gonna run your map function on each of the input files. You can see there's some obvious parallelism available here: it can run the maps in parallel. Each map invocation only looks at its own input and produces output. The map function takes a file as input, where the file is some fraction of the input data, and it's required to produce a list of key value pairs as output. So for example, let's suppose we're writing the simplest possible MapReduce example, a word count MapReduce job whose goal is to count the number of occurrences of each word. Your map function might emit key value pairs where the key is a word and the value is just one. So this map function will split the input up into words, and for every word it sees, it emits that word as the key and one as the value. And then later on, the framework will count up all those ones in order to get the final output.
So maybe input one has the word A in it and the word B in it, and so the output that map produces is key A value 1, key B value 1. Maybe the second map invocation sees a file that has a B in it and nothing else, so its output is B 1. Maybe the third input has an A in it and a C in it. All right, so we run all these maps on all the input files, and we get what the paper calls intermediate output: for every map, a set of key value pairs. Then the second stage of the computation is to run the reduces. The idea is that the MapReduce framework collects together, from all the maps, all the instances of each key. So the MapReduce framework is gonna collect together all the A's from every map, every key value pair whose key was A, and hand them all to one call of the programmer-defined reduce function. Then it's gonna take all the B's and collect them together. Of course, this requires real collection, because the different instances of key B were produced by different invocations of map on different computers; so now we're talking about data movement. We're gonna collect all the B keys and hand them to a different call to reduce that gets all the B values as its argument, and the same with C. So the MapReduce framework will arrange for one call to reduce for every key that occurred in any of the map output. And for our silly word count example, all any one of these reduces has to do is count the number of items passed to it. It doesn't even have to look at the items, because it knows that each of them is the word it's responsible for plus one as the value; we don't have to look at those ones, we just count them. So the A reduce is gonna produce A and then the count of its inputs. The B reduce is gonna produce its key and then the count of its values, which is also two.
So this is what a typical MapReduce job looks like at a high level. Just for completeness, a little bit of terminology: the whole computation is called a job, and any one invocation of map or reduce is called a task. So we have the entire job, and it's made up of a bunch of map tasks and then a bunch of reduce tasks. As an example, for word count, here's what the map and reduce functions would look like. The map function takes a key and a value as arguments, and now we're talking about functions written in an ordinary programming language, like C++ or Java or who knows what; this is just code ordinary people can write. What a map function for word count does: the key is the file name, which it typically ignores, since we don't really care what the file name was, and the value is the content of this map's input file, so it just contains all the text. We're gonna split the value into words, and for each word, we're just gonna call emit. Emit is provided by the MapReduce framework; only map can call this emit. We hand emit a key, which is the word, and a value, which is the string "1". That's it for the map function. A word count map function in MapReduce literally can be this simple. And this map function doesn't know anything about distribution, or multiple computers, or the fact that we may need to move data across the network, or who knows what. This is extremely straightforward. And the reduce function for word count: remember, each reduce is called with all the instances of a given key. The MapReduce framework calls reduce with the key that it's responsible for, and a vector of all the values that the maps produced associated with that key. The key is the word; the values are all ones. We don't really care about their contents, we only care about how many there were.
And reduce has its own emit function, which just takes a value to be emitted as the final output value for this key. So we're gonna emit the length of this array. This is also about as simple as reduce functions get in MapReduce, namely extremely simple, and requiring no knowledge about fault tolerance or anything else. All right, any questions about the basic framework? Yes. You mean, can you feed the output of the reduces back in? Oh yes. In real life, it was routine among MapReduce users to define a MapReduce job that took some inputs and produced some outputs, and then have a second MapReduce job, if you're doing some very complicated multi-stage analysis or iterative algorithm, like PageRank for example, which is the algorithm Google uses to estimate how important or influential different web pages are. That's an iterative algorithm that gradually converges on an answer. If you implement it in MapReduce, which I think they originally did, you have to run the MapReduce job multiple times, and the output of each run is a list of web pages with an updated weight or importance for each web page. So it was routine to take this output and then use it as the input to another MapReduce job. Oh yeah. Yeah, you need to set things up: you need to write the reduce function in the knowledge that, oh, I need to produce data in the format, with the information, required by the next MapReduce job. This actually brings up a bit of a shortcoming in the MapReduce framework, which is that it's great if the algorithm you need to run is easily expressible as a map, followed by this shuffling of the data by key, followed by a reduce, and that's it. MapReduce is fantastic for algorithms that can be cast in that form, where furthermore each of the maps is completely independent.
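Putting the lecture's word count together, here's a sequential sketch in Python (the paper's functions were C++; the little driver below plays the role of the entire framework, doing the grouping by key in one process, with names of my choosing):

```python
from collections import defaultdict

# Word count map and reduce, as the lecture describes them:
# map ignores the file name and emits (word, "1") for each word;
# reduce just counts how many values it was handed.
def wc_map(filename, contents):
    return [(word, "1") for word in contents.split()]

def wc_reduce(key, values):
    return len(values)

def run_job(inputs, map_fn, reduce_fn):
    # The "shuffle": group all intermediate key/value pairs by key,
    # then call reduce once per key. A real framework does this
    # across thousands of machines; here it's all one process.
    intermediate = defaultdict(list)
    for filename, contents in inputs.items():
        for k, v in map_fn(filename, contents):
            intermediate[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}
```

Running it on the three example inputs from before (A and B; just B; A and C) yields a count of 2 for A, 2 for B, and 1 for C.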
The maps are required to be pure functions that just look at their arguments and nothing else. So that's a restriction. And it turns out that many people want to run much longer pipelines that involve lots and lots of different kinds of processing; with MapReduce, you have to cobble that together from multiple distinct MapReduce jobs. More advanced systems, which we'll talk about later in the course, are much better at allowing you to specify the complete pipeline of computations: the framework sees all the stuff you have to do, and can organize and optimize much more complicated computations efficiently. Next question. From the programmer's point of view, it's just about map and reduce. From our point of view, it's gonna be about the worker processes and the worker servers that are part of the MapReduce framework, which among many other things call the map and reduce functions. So yeah, from our point of view, we care a lot about how this is organized by the surrounding framework; what I've drawn is the programmer's view, with all the distributed stuff stripped out. Yes. Sorry, I gotta ask you to say that again. Do you emit locally, or...? Oh, you mean where does the emitted data go? Okay, so there are two questions. One is, when you call emit, what happens to the data? And the other is, where do the functions run? So first, where does this stuff run? There are a number of servers, say 1,000. Actually, the right thing to look at here is figure one in the paper. Sitting underneath this, in the real world, there's some big collection of servers, which we'll call worker servers, or workers. And there's also a single master server that's organizing the whole computation.
What's going on here is that the master server knows there's some number of input files, say 5,000, and it farms out invocations of map to the different workers. So it'll send a message to, say, worker 7, saying: please run this map function on such-and-such an input file. And then the worker process, which is part of MapReduce and knows all about MapReduce, will read that input file and call the map function with the file name and the file content as its arguments. That worker process is what implements emit, and every time the map calls emit, the worker process writes the data to files on the local disk. So what happens to map's emits is that they produce files on the map worker's local disk, accumulating all the keys and values produced by the maps run on that worker. At the end of the map phase, what we're left with is all those worker machines, each of which holds the output of whatever maps were run on that machine. Then the MapReduce workers arrange to move the data to where it's going to be needed for the reduces. In a typical big computation, the reduce invocation for key A is going to need all map output that mentions key A. This is a simple example, but in general, every single map invocation will probably produce lots of keys, including some instances of key A. So typically, before we can even run this reduce function, the MapReduce worker running on one of our thousand servers is going to have to go talk to every single one of the other thousand servers and say: look, I'm going to run the reduce for key A; please look at the intermediate map output stored on your disk, fish out all the instances of key A, and send them over the network to me. So the reduce worker is going to do that.
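Concretely, the paper's framework partitions each map's output into R buckets, one per reduce task, typically by hashing the key mod R; that way, reduce task r knows it only needs bucket r from each map worker. A sketch of that partitioning rule (the hash-mod-R idea is from the paper; the function and variable names are mine):

```python
R = 4  # number of reduce tasks (illustrative)

def partition(pairs, n_reduce=R):
    # Split one map's emitted (key, value) pairs into n_reduce
    # buckets. All pairs with the same key land in the same bucket,
    # so a single reduce task is guaranteed to see every instance
    # of each key it's responsible for.
    buckets = [[] for _ in range(n_reduce)]
    for key, value in pairs:
        buckets[hash(key) % n_reduce].append((key, value))
    return buckets
```

Each map worker writes its R buckets as local files, and each reduce worker fetches its one bucket from every map worker during the shuffle.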
It's going to fetch, from every worker, all the instances of the key that the master has told it to be responsible for. Once it's collected all of that data, it can call reduce. The reduce function itself calls reduce's emit, which is different from the map's emit, and what reduce's emit does is write the output to a file in a cluster file service that Google uses. So here's something I haven't mentioned: where the input lives and where the output lives. They're both files. Because we want the flexibility to be able to read any piece of input on any worker server, we need some kind of network file system to store the input data. And indeed, the paper talks about this thing called GFS, the Google File System. GFS is a cluster file system, and it actually runs on exactly the same set of worker servers that run MapReduce. GFS is a file system you can read and write files in, and it automatically splits up any big file you store on it across lots of servers, in 64 megabyte chunks. So if you have 10 terabytes of crawled web page content and you just write it to GFS, even as a single big file, GFS will automatically split that vast amount of data up into 64 megabyte chunks, distributed evenly over all of the GFS servers, which is to say all the servers that Google has available. And that's fantastic; that's just what we need. If we then want to run a MapReduce job that takes the entire crawled web as input, the data is already stored in a way that's split up evenly across all the servers.
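To get a feel for the chunking, here's a toy placement calculation (the 64 MB chunk size is from the paper; the round-robin placement is purely illustrative, since real GFS placement also considers disk usage and failure domains):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in GFS

def place_chunks(file_size, n_servers):
    # How many 64 MB chunks does the file break into, and which
    # server holds each chunk? Round-robin spreads them evenly.
    n_chunks = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    return [i % n_servers for i in range(n_chunks)]
```

For 10 terabytes spread over 1,000 servers this comes out to roughly 164 chunks per server, which is why a thousand map workers can each read their share of the input locally, in parallel.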
And that means that if we have a thousand servers, we're going to launch a thousand map workers, each reading one one-thousandth of the input data, and they're going to be able to read the data in parallel from a thousand GFS file servers, thus getting tremendous total read throughput: the read throughput of a thousand servers. Yeah. So you're thinking maybe that Google has one set of physical machines that run GFS and a separate set of physical machines that run MapReduce jobs. Right. So the question is, what does this arrow here actually involve? And the answer actually changed over the years as Google evolved this system. In the most general case, if we have big files stored in some big network file system (GFS is a bit like AFS, which you might have used on Athena: your data is split over a big collection of servers, and you have to talk to those servers over the network to retrieve it), then what this arrow might represent is the MapReduce worker process going off and talking across the network to the correct GFS server, or servers, that store its part of the input, and fetching it over the network to the MapReduce worker machine in order to pass it to the map. That's certainly the most general case, and that was eventually how MapReduce actually worked. In the world of this paper, though, if you did that, that's a lot of network communication. You're talking about 10 terabytes of data, and we'd have to move 10 terabytes across the data center network. Data center networks run at gigabits per second, but that's still a lot of time to move tens of terabytes of data.
And indeed, in the world of this paper, in 2004, the most constraining bottleneck in their MapReduce system was network throughput. If you read as far as the evaluation section: they had thousands of machines, and they would plug each rack of machines into an ethernet switch for that rack. But all the machines need to talk to each other, and there was a root ethernet switch that all of the rack ethernet switches talked to. So if you pick some MapReduce worker and some GFS server, chances are, at least half the time, the communication between them has to pass through this one root switch. The root switch had only some amount of total throughput, some number of gigabits per second; I forget the number. But when I did the division, dividing the total throughput available in the root switch by the roughly 2,000 servers they used in the paper's experiments, what I got was that each machine's share of the root switch, of the total network capacity, was only about 50 megabits per second. And 50 megabits per second might seem like a lot, but it's actually quite small compared to how fast disks or CPUs run. So with their network, this 50 megabits per second was a tremendous limit, and they really stood on their heads in the design described in the paper to avoid using the network. They played a bunch of tricks to avoid sending stuff over the network whenever they possibly could. One of them was that they ran the GFS servers and the MapReduce workers on the same set of machines.
So if they had 1,000 machines, they'd run their GFS service on those 1,000 machines and run MapReduce on the same 1,000 machines. And then, when the master was splitting up the map work and farming it out to different workers, it would be clever: when it was about to run the map that was going to read input file one, it would figure out from GFS which server actually holds input file one on its local disk, and it would send the map for that input file to the MapReduce software on that same machine, so that by default this arrow was actually a read from the local disk and did not involve the network. Depending on failures or load or whatever, it couldn't always do that, but almost all of the maps would be run on the very same machine that stored their input data, thus saving a vast amount of time they would otherwise have spent waiting to move the input data across the network. The next trick they played is that a map, as I mentioned before, stores its output on the local disk of the machine that runs the map. So again, storing the output of the map does not require network communication, at least not immediately, because the output is stored on the local disk. However, we know for sure that one way or another, by the way MapReduce is defined, in order to group together all of the values associated with a given key and pass them to a single invocation of reduce on some machine, data is going to have to move across the network. We need to fetch all the A's and give them to a single machine, so that data has to cross the network. The keys start out stored by row, on the same machines that ran the maps, and we need them essentially stored by column, on the machines that are going to be responsible for the reduces.
This transformation from row storage to, essentially, column storage is what the paper calls the shuffle. And it really required moving every piece of data across the network, from the map that produced it to the reduce that would need it; that's the expensive part of MapReduce. Yeah. You're right, you can imagine a different definition in which you have a more streaming kind of reduce. I haven't thought this through; I don't know whether that would be feasible or not. As far as the programmer interface goes, their number one goal really was to make this easy to program by people who had no idea what was going on inside the system, so it may be that this spec really is the way reduce functions should look in C++ or something. A streaming version of this, I don't know how it would look; probably not this simple. But maybe it could be done that way. And indeed, people have gotten a lot more sophisticated with modern systems that are the successors of MapReduce, and they do indeed involve processing streams of data rather than this very batch approach. It is a batch approach in the sense that we wait until we have all the data and then we process it; for one thing, you have to have a notion of finite inputs. Modern systems often do use streams and are able to take advantage of some efficiencies due to that. So the shuffle is the point at which all the network traffic happens, and it can actually be a vast amount of data. Think about sort: if you're sorting, the output of the sort has the same size as the input. That means that if your input is 10 terabytes and you're running a sort, you're moving 10 terabytes of data across the network at this point, and your output will also be 10 terabytes. So this is quite a lot of data, and indeed it is for many MapReduce jobs.
Although not all: there are some jobs that significantly reduce the amount of data at these stages. Somebody mentioned: what if you want to feed the output of reduce into another MapReduce job? And indeed, that was often what people wanted to do. In any case, the output of the reduces might be enormous, as for sort or web indexing: on 10 terabytes of input, the output of the reduces is again going to be 10 terabytes. So the output of the reduces was also stored on GFS. Each reduce would just produce key value pairs, but the MapReduce framework would gather them up and write them into giant files on GFS. And so there was another round of network communication required to get the output of each reduce to the GFS servers that needed to store it. You might think they could have played the same trick with the output, storing it on the GFS server that happened to run the MapReduce worker that ran the reduce, and maybe they did do that. But because GFS, as well as splitting data for performance, also keeps two or three copies of it for fault tolerance, no matter what, you need to write at least one copy of the data across the network to a different server. So there's a lot of network communication here, and a bunch here also, and it was this network communication that really limited the throughput of MapReduce in 2004. In 2020, because this network arrangement was such a limiting factor for so many things people wanted to do in data centers, modern data center networks are a lot faster at the root than this was. One typical data center network you might see today actually has many roots: instead of a single root switch that everything has to go through, you might have many root switches, each rack switch has a connection to each of these replicated root switches, and the traffic is split up among them.
So modern data center networks have far more throughput. And because of that, I think Google actually stopped using MapReduce a few years ago, but before they stopped, the modern MapReduce no longer tried to run the maps on the same machines the data was stored on. They were happy to load the data from anywhere, because they just assumed the network was extremely fast. Okay, we're out of time for MapReduce. We have a lab due at the end of next week in which you'll write your own somewhat simplified MapReduce. So have fun with that, and see you on Thursday.