I'd like to get started. Today, we're going to talk about GFS, the Google File System paper we read for today. And this will be the first of a number of different case studies we'll talk about in this course about how to build big storage systems. So the larger topic is big storage. And the reason is that storage has turned out to be a key abstraction. If you didn't know already, you might imagine that there could be all kinds of different important abstractions you might want to use for distributed systems. But it's turned out that a simple storage interface is just incredibly useful and extremely general. And so a lot of the thought that's gone into building distributed systems has either gone into designing storage systems or designing other systems that assume underneath some reasonably well-behaved, big distributed storage system. So we're going to care a lot about how to design a good interface to a big storage system and how to design the innards of the storage system so it has good behavior. Of course, that's why we're reading this paper, just to get a start on that. This paper also touches on a lot of themes that will come up a lot in 6.824: parallel performance, fault tolerance, replication, and consistency. And this paper is, as such things go, reasonably straightforward and easy to understand. It's also a good systems paper. It talks about issues all the way from the hardware to the software that ultimately uses the system. And it's a successful real-world design. So it's an academic paper published in an academic conference, but it describes something that really was successful and used for a long time in the real world. So we know that we're talking about something that is a good, useful design. OK, so before I talk about GFS, I want to talk about the space of distributed storage systems a little bit to set the scene. So first, why is it hard?
There's actually a lot to get right, but for 6.824, there's a particular narrative that's going to come up quite a lot for many systems. Often the starting point for people designing these big distributed systems or big storage systems is that they want to get huge aggregate performance, to be able to harness the resources of hundreds of machines in order to get a huge amount of work done. So the starting point is often performance. And if you start there, a natural next thought is, well, we're going to split our data over a huge number of servers in order to be able to read many servers in parallel. And that's often called sharding. If you shard over many servers, hundreds or thousands of servers, you're just going to see constant faults. If you have thousands of servers, there's just always going to be one down. So faults are everyday, every-hour occurrences, and we can't have humans involved in fixing them. We need automatic fault-tolerant systems. So sharding leads to fault tolerance. Among the most powerful ways to get fault tolerance is replication: just keep two or three or however many copies of the data, and if one of them fails, you can use another one. So wanting fault tolerance leads us to replication. If you have replication, two copies of the data, then for sure, if you're not careful, they're going to get out of sync. And so what you thought was two replicas of the data, where you could use either one interchangeably to tolerate faults, if you're not careful, ends up being two almost-identical copies of the data. That's not exactly replication at all, because what you get back depends on which one you talk to. So that's starting to look a little bit tricky for applications to use. So if we have replication, we risk weird inconsistencies. Of course, with clever design, you can get rid of inconsistency and make the data look very well behaved.
But if you do that, it almost always requires extra work and extra chit-chat between all the different servers and clients in the network, and that reduces performance. So if you want consistency, you pay for it with low performance, which is, of course, not what we were originally hoping for. And of course, this isn't an absolute. You can build very high-performance systems. But nevertheless, there's this sort of inevitable way that the design of these systems plays out. And it results in a tension between the original goal of performance and the realization that if you want good consistency, you're going to pay for it. And if you don't want to pay for it, then you have to suffer with sort of anomalous behavior sometimes. And I'm putting this up because we're going to see this loop many times for many of the systems we look at. People are rarely willing to, or happy about, paying the full cost of very good consistency. OK, so we've brought up consistency. I'll talk more later in the course about what exactly I mean by good consistency. But you can think of strong consistency, or good consistency, as being: we want to build a system whose behavior to applications or clients looks just like you'd expect from talking to a single server. We're going to build systems out of hundreds of machines, but a kind of ideal strong consistency model would be: it's as if there was just one server with one copy of the data doing one thing at a time. So this is an intuitive way to think about strong consistency. So you might think you have one server, we'll assume it's a single-threaded server, and that it processes requests from clients one at a time. And that's important because there may be lots of clients sending concurrent requests in. So the server sees concurrent requests, it picks one or the other to go first, executes that request to completion, then executes the next.
So for storage servers, the server's got a disk on it, and what it means to process a request is this: if it's a write request, which might be writing an item or maybe incrementing an item, if it's a mutation, then we have some table of data, maybe indexed by keys with values, and we're going to update this table. And if the request that comes in is a read, we're just going to pull the right data out of the table. One of the rules here that makes this well behaved is that the server really does, in our simplified model, execute the requests one at a time, and that requests see data that reflects all the previous operations, in order. So if a sequence of writes comes in and the server processes them in some order, then when you read, you see the value you would expect if those writes had occurred one at a time. The behavior of this is still not completely straightforward. There are some things that you have to spend at least a second thinking about. So for example, suppose we have a bunch of clients, and client one issues a write of key x, wanting to set it to one, and at the same time, client two issues a write of the same key, but wants to set it to a different value. Something happens. Let's say client three, after these writes complete, reads x and gets some result, and client four reads x and also gets a result. So what results should the two clients see? Well, that's a good question. What I'm assuming here is that client one and client two launch these requests at the same time. So if we were monitoring the network, we'd see two requests heading to the server at the same time, and then some time later the server would respond to them. So there's actually not enough here to tell which order the server processes them in.
And of course, if it processes client one's request first, that means it processes the write with value two second, and that means the subsequent reads have to see two. Whereas if the server happened to process client two's request first and client one's second, the resulting value had better be one, and those two reads would see one. So I'm just putting this up to illustrate that even in a simple system, there's ambiguity: you can't necessarily tell from a trace of what went into the server what should come out. All you can tell is whether some set of results is consistent or not consistent with a possible execution. So certainly there are some completely wrong results we can rule out: if client three sees a two, then client four had better see a two also. Because in our model, if client three sees a two after the writes complete, that means the write of two must have been second, and it still has to have been the second write when client four goes to read the data. So hopefully all of this is just completely straightforward and just as expected, because it's supposed to be the intuitive model of strong consistency. The problem with this, of course, is that a single server has poor fault tolerance. If it crashes, or its disk dies or something, we're left with nothing. And so in the real world of distributed systems, we actually build replicated systems. And that's where all the problems start leaking in: when we have a second copy of the data. So here is what must be close to the worst replication design. And I'm doing this to warn you of the problems that we will then be looking for in GFS. So here's a bad replication design. We're going to have two servers now, each with a complete copy of the data. And so on disk, they're both going to have this table of keys and values. The intuition, of course, is that we hope to keep these tables identical, so that if one server fails, we can read or write from the other server.
And so that means that somehow every write must be processed by both servers. And reads have to be able to be processed by a single server, otherwise it's not fault tolerant: if reads have to consult both, then we can't survive the loss of one of the servers. OK, so here's the problem that's going to come up. Let's suppose we have client one and client two, and they both want to do writes. Say one of them is going to write one, and the other is going to write two. So client one is going to launch its write of x=1 to both servers, because we want to update both of them. And client two is going to launch its write of x=2 to both of them. So what's going to go wrong here? Yeah. Yeah, we haven't done anything here to ensure that the two servers process the two requests in the same order. So that's why it's a bad design. If server one processes client one's request first, it'll start with a value of one, and then it'll see client two's request and overwrite that with two. If server two just happens to receive the packets over the network in a different order, it's going to execute client two's request and set the value to two, and then it'll see client one's request and set the value to one. And now what a later reading client sees depends on which server it reads from: if client three happens to read from this server, and client four happens to read from the other server, then we get into this terrible situation where they're going to read different values, even though our intuitive model of a correct service says both subsequent reads have to yield the same value. And this is going to arise in other ways. Suppose we try to fix this by making the clients always read from server one if it's up, and otherwise server two.
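To make the failure concrete, here's a tiny simulation of the broken design just described. This is not GFS code, just an illustration: each server applies writes in whatever order its packets happen to arrive, so the replicas diverge.

```python
# A minimal model of the "bad replication design": each server applies
# writes in its own arrival order, with no coordination between servers.

class Server:
    def __init__(self):
        self.table = {}          # key -> value, the server's copy of the data

    def write(self, key, value):
        self.table[key] = value  # last write to arrive wins

    def read(self, key):
        return self.table.get(key)

s1, s2 = Server(), Server()

# Client 1 sends write(x, 1) to both servers; client 2 sends write(x, 2)
# to both.  The network delivers the requests in different orders:
s1.write("x", 1); s1.write("x", 2)   # server 1 sees C1's write, then C2's
s2.write("x", 2); s2.write("x", 1)   # server 2 sees C2's write, then C1's

print(s1.read("x"), s2.read("x"))    # 2 1 -- the "replicas" disagree
```

A later reader now gets a different answer depending on which server it happens to ask, which the single-server model never allows.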
If we do that, then if this situation happened, then for a while everybody's reads might see value two, but if server one suddenly fails, then even though there was no write, suddenly the value for x will switch from two to one, because if server one dies, all the clients will switch to server two. And there'll be just this mysterious change in the data that doesn't correspond to any write, which is also totally not something that could have happened in the simple single-server model. So of course this can be fixed. The fix requires more communication, usually between the servers or somewhere, and more complexity. And because of the inevitable cost of the complexity to get strong consistency, there's a whole range of different solutions to get better consistency, and a whole range of what people feel is an acceptable level of consistency and an acceptable set of anomalous behaviors that might be revealed. All right, any questions about this disastrous design here? OK, let's switch to talking about GFS. A lot of what GFS is doing is fixing this, to have better, though not perfect, behavior. OK, so where GFS came from: in 2003, quite a while ago, the web was certainly starting to be a very big deal, and people were building big websites. In addition, there had been decades of research into distributed systems, and people knew, at least at the academic level, how to build all kinds of highly parallel, fault-tolerant systems. But there had been very little use of the academic ideas in industry. But starting at around the time this paper was published, big websites like Google started to actually build serious distributed systems. And it was very exciting for people like me, who were on the academic side of this, to see real uses of these ideas. Where Google was coming from was they had some vast, vast data sets, far larger than could be stored on a single disk, like an entire crawled copy of the web.
Or, a little bit after this paper, they had giant YouTube videos. They had things like the intermediate files for building a search index. They also apparently kept enormous log files from all their web servers so they could later analyze them. So they had some big, big data sets. They needed many, many disks to store them, and they needed to be able to process them quickly with things like MapReduce. So they needed high-speed parallel access to these vast amounts of data. OK, so what were they looking for? One goal was just that the thing be big and fast. They also wanted a file system that was sort of global, in the sense that many different applications could get at it. One way to build a big storage system is: you have some particular application of yours, and you build storage dedicated and tailored to that application. And if somebody else in the next office needs big storage, well, they can build their own thing. But if you have a universal or kind of global reusable storage system, then that means that if I store a huge amount of data, say I'm crawling the web, and you want to look at my crawled web pages, because we're all playing in the same sandbox, we're all using the same storage system, you can just read my files, access controls permitting. So the idea was to build a file system where anybody inside Google could name and read any of the files, to allow sharing. In order to get bigness and fastness, every file would be automatically split by GFS over many servers, so that writes and reads would just automatically be fast. As long as you were reading a file from lots of clients, you'd get high aggregate throughput. And it would also be possible to have single files that were bigger than any single disk. Because we're building something out of hundreds of servers, we want automatic failure recovery.
We don't want to build a system where, every time one of our hundreds of servers has failed, some human being has to go to the machine room and do something with the server to get it up and running, or transfer its data or something. We want the system to just fix itself. There were some non-goals, too. One is that GFS was designed to run in a single data center. So we're not talking about placing replicas all over the world. A single GFS installation just lived in one data center, one big machine room. Getting this style of system to work with replicas far distant from each other is a valuable goal, but difficult. So: single data center. Also, this is not a service to customers. GFS was for internal use by applications written by Google engineers. So they weren't directly selling this. They might be selling services that used GFS internally, but they weren't selling it directly. So it's just for internal use. And it was tailored in a number of ways for big sequential file reads and writes. There's a whole other domain of storage systems that are optimized for small pieces of data. Like a bank that's holding bank balances probably wants a database that can read and write and update 100-byte records that hold people's bank balances. But GFS is not that system. It's really for big files, where big is gigabytes or terabytes. So big sequential access, not random access. It also has a certain batch flavor: there's not a huge amount of effort to make access be very low latency. The focus is really on throughput of big multi-megabyte operations. This paper was published at SOSP in 2003, the top systems academic conference. Usually the standard for such conferences is a lot of very novel research, and this paper was not necessarily in that class. None of the specific ideas in this paper were particularly new at the time. People understood how to deliver things like distribution, sharding, and fault tolerance.
But this paper described a system that was really operating in use at a far, far larger scale, hundreds or thousands of machines, much bigger than anything academics had ever built. The fact that it was used in industry, and reflected real-world experience of what actually did and didn't work for deployed systems that had to work and had to be cost effective, was also extremely valuable. The paper proposed the fairly heretical view that it was OK for the storage system to have pretty weak consistency. I think the academic mindset at that time was that a storage system really should have good behavior. Like, what's the point of building systems that return the wrong data, like my terrible replication design? Why do that? Why not build systems that return correct data instead of incorrect data? But this paper actually does not guarantee to return correct data. The hope is that they can take advantage of that in order to get better performance. A final thing that was interesting about this paper is its use of a single master. In a sort of academic paper, you'd probably have some fault-tolerant, replicated, automatically failure-recovering master, perhaps many masters with the work split up. But this paper said, look, they could get away with a single master, and it worked fine. As for the weak consistency: well, cynically, who's going to notice on the web that some vote count or something is wrong? Or if you do a search on a search engine, are you going to know that one of 20,000 items is missing from the search results, or that they're in the wrong order? Probably not. So there was just much more tolerance for incorrect data in these kinds of systems than there would be in a bank. That doesn't mean that all data in websites could be wrong. Like, if you're charging people for ad impressions, you'd better get the numbers right. But this is not really about that. In addition, some of the ways in which GFS could serve up odd data could be compensated for in the applications.
For example, the paper says applications should accompany their data with checksums and clearly mark record boundaries. That's so that the applications can recover from GFS serving them maybe not quite the right data. So, the general structure; this is just figure one in the paper. We have a bunch of clients, hundreds of clients. We have one master, although there might be replicas of the master. The master keeps the mapping from file names to where to find the data, basically, although there are really two tables. And then there's a bunch of chunk servers, maybe hundreds of chunk servers, each with perhaps one or two disks. The separation here is that the master is all about naming and knowing where the chunks are, and the chunk servers store the actual data. It's a nice aspect of the design that these two concerns are almost completely separated from each other, and can be designed separately, with separate properties. The master knows about all the files. For every file, the master keeps track of a list of chunk identifiers that name the successive pieces of that file. Each chunk is 64 megabytes. So if I have a gigabyte file, the master's going to know that maybe the first chunk is stored here, and the second chunk is stored here, and the third chunk is stored here. And if I want to read whatever part of the file, I ask the master, oh, which server holds that chunk, and then I go talk to that server and read the chunk, roughly speaking. All right, so more precisely: it turns out that if we're going to talk about the consistency of the system and how it deals with faults, we need to know what the master is actually storing in a little bit more detail. So, the master data. It's got two main tables that we care about. It's got one table that maps file name to an array of chunk IDs, or chunk handles. This tells you what the identifiers of the chunks are.
There's not much yet you can do with just a chunk identifier, but the master also has a second table that maps each chunk handle to a bunch of data about that chunk. One item is the list of chunk servers that hold replicas of that chunk's data. Each chunk is stored on more than one chunk server, so it's a list of chunk servers. Every chunk has a current version number, so the master keeps a version number for each chunk. All writes for a chunk have to be sequenced through the chunk's primary, which is one of the replicas. So the master remembers which chunk server is the primary. And the primary is only allowed to be primary for a certain lease time, so the master remembers the expiration time of the lease. This stuff, so far, is all in RAM in the master, so it would just be gone if the master crashed. In order to be able to reboot the master and not forget everything about the file system, the master actually stores all of this data on disk as well as in memory. Reads just come from memory, but writes, at least for the parts of this data that have to be reflected on disk, have to go to the disk. And the way it actually manages that is that the master has a log on disk, and every time it changes the data, it appends an entry to the log on disk. So, some of this stuff actually needs to be on disk, and some doesn't. I'm guessing a little bit here, but certainly the array of chunk handles has to be on disk. And so I'm going to write "nv" here, for non-volatile, meaning it's got to be reflected on disk. The list of chunk servers, it turns out, doesn't, because if the master reboots, it talks to all the chunk servers and asks them what chunks they have. So this is, I imagine, not written to disk. The version number: any guesses? Written to disk, or not written to disk? This requires knowing how the system works. I'm going to vote written to disk, non-volatile.
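As a rough sketch of the master's two tables, with the non-volatile (nv) versus volatile (v) guesses marked in comments. The field names and example data here are mine, not the paper's:

```python
# Sketch of the GFS master's in-memory state, as described in the lecture.
# nv = must also be logged to disk; v = can be reconstructed after a reboot.

CHUNK_SIZE = 64 * 1024 * 1024   # chunks are 64 megabytes

class ChunkInfo:
    def __init__(self, version):
        self.servers = []        # v:  chunk servers holding replicas
        self.version = version   # nv: current version number
        self.primary = None      # v:  which replica holds the lease
        self.lease_expires = 0   # v:  lease expiration time

# Table 1 (nv): file name -> array of chunk handles, one per 64 MB.
master_files = {"/logs/web": ["h1", "h2"]}

# Table 2: chunk handle -> per-chunk state.
master_chunks = {"h1": ChunkInfo(version=17), "h2": ChunkInfo(version=3)}

def handle_for(fname, offset):
    """Map a byte offset in a file to the chunk handle covering it."""
    return master_files[fname][offset // CHUNK_SIZE]

print(handle_for("/logs/web", 100 * 1024 * 1024))   # offset in 2nd chunk: h2
```

The offset-to-chunk mapping is just integer division by 64 MB, which is why the master only needs the array of handles per file.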
But we can argue about that later when we talk about how the system works. The identity of the primary, it turns out, is almost certainly not written to disk, so it's volatile. And the reason is that if the master reboots and therefore forgets who the primary is for a chunk, it can simply wait for the 60-second lease expiration time. Then it knows that absolutely no primary will be functioning for this chunk, and it can designate a different primary safely. And similarly, the lease expiration time is volatile. So that means that whenever a file is extended with a new chunk, because it grows past the next 64-megabyte boundary, or the version number changes because a new primary is designated, the master has to first append a little record to its log, basically saying, oh, I just added such-and-such a chunk to this file, or I just changed the version number. So every time it changes one of those, it needs to write to its disk. The paper doesn't really talk about this too much, but this limits the rate at which the master can change things, because you can only write to your disk so many times per second. And the reason for using a log, rather than a database, some sort of B-tree or hash table on disk, is that you can append to a log very efficiently: you can take a bunch of recent log records that need to be added and write them in a single disk write, after a single rotation, to whatever point on the disk contains the end of the log file. Whereas if it were a B-tree reflecting the real structure of this data, you would have to seek to a random place on the disk and do a little write. So the log makes it a bit faster to reflect operations onto the disk. However, if the master crashes and has to reconstruct its state, you wouldn't want to have to reread the log file starting from the beginning of time, from when the server was first installed a few years ago.
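The log scheme might be sketched like this. The record formats and operation names are invented for illustration, but the shape, appending a record per non-volatile mutation and replaying the log on reboot, is what the lecture describes:

```python
# Sketch of the master's append-only log: every change to non-volatile
# state is appended as a record, and replayed after a reboot to rebuild
# the in-memory tables.

import json

log = []   # stand-in for the on-disk log file

def record(op, **fields):
    # One cheap sequential append per mutation (batched in practice).
    log.append(json.dumps(dict(op=op, **fields)))

def replay(entries):
    # Rebuild the non-volatile parts of the master's state from the log.
    files, versions = {}, {}
    for line in entries:
        r = json.loads(line)
        if r["op"] == "add_chunk":        # a file grew by one chunk
            files.setdefault(r["file"], []).append(r["handle"])
        elif r["op"] == "set_version":    # a new primary was designated
            versions[r["handle"]] = r["version"]
    return files, versions

# The master logs each change before applying it...
record("add_chunk", file="/logs/web", handle="h1")
record("set_version", handle="h1", version=17)

# ...so after a crash it can rebuild the non-volatile state:
files, versions = replay(log)
print(files, versions)   # {'/logs/web': ['h1']} {'h1': 17}
```

Note that the chunk-server lists and primary identities are deliberately absent: those are the volatile fields, reconstructed by asking the chunk servers and waiting out leases.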
So in addition, the master sometimes checkpoints its complete state to disk, which takes some amount of time, seconds, maybe a minute or something. And then when it restarts, it goes back to the most recent checkpoint and replays just the portion of the log starting at the point in time when that checkpoint was created. Any questions about the master data? So with that in mind, I want to lay out the steps in a read and the steps in a write. Where all this is heading is that I then want to discuss, for each failure I can think of, why does the system, or does the system, act correctly after that failure? But in order to do that, we need to understand the data and the operations on the data. OK, so for a read, the first step is this. What a read means is that the application has a file name in mind and an offset in the file that it wants to read some data from. So the client sends the file name and the offset to the master. The master looks up the file name in its file table, and, since each chunk is 64 megabytes, it can use the offset divided by 64 megabytes to find which chunk. Then it looks up that chunk in its chunk table, finds the list of chunk servers that have replicas of that data, and returns that list to the client. So the first step is: the client sends the file name and offset to the master, and the master sends back the chunk handle, let's say h, and the list of servers. Now we have some choice; we can ask any one of these servers. The paper says that clients try to guess which server is closest to them in the network, maybe in the same rack, and send a read request to that replica. The client actually caches this result in case it reads that chunk again. And indeed, the client might read a given chunk in one-megabyte pieces or 64-kilobyte pieces or something, so the client may end up reading successive regions of the same chunk many times.
And so it caches which server to talk to for a given chunk, so it doesn't have to keep beating on the master, asking it for the same information over and over. Next, the client talks to one of the chunk servers, telling it the chunk handle and offset. The chunk servers store these chunks, each chunk in a separate Linux file on their hard drive, in a sort of ordinary Linux file system. And presumably the chunk files are just named by the handle. So all the chunk server has to do is go find the file with the right name, which holds the entire chunk, and then just read the desired range of bytes out of that file and return the data. Any questions about how reads operate? Can I repeat step one? Step one is: the application wants to read a particular file at a particular offset within the file, a particular range of bytes in the file, say from 1,000 to 2,000, and so it sends the name of the file and the beginning of the byte range to the master. And then the master looks up the file name in its file table to find the chunk that contains that byte range for that file. Is that good? Now, I don't know the exact details, but my impression is that if the application wants to read more than 64 megabytes, or even just two bytes spanning a chunk boundary, then the library, and the application is linked with a library that sends RPCs to the various servers, would notice that the read spanned a chunk boundary and break it into two separate reads. And maybe talk to the master twice; it may be that you could talk to the master once and get two results or something. But logically, it uses two requests to the master, and then requests to the two different chunk servers. Yes? Well, at least initially the client doesn't know which chunk handle it needs. It can compute that it needs, say, the 17th chunk of a given file, but then it needs to know what chunk server holds the 17th chunk of that file. And for that, it needs to talk to the master.
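Putting the read steps together, here's a hedged sketch of what the client library might do. The RPC helpers (`ask_master`, `read_chunkserver`) and the caching policy are hypothetical stand-ins, not the paper's actual interface:

```python
# Sketch of the GFS client read path: compute the chunk index from the
# byte offset, ask the master (caching the answer), then read the byte
# range from one of the replica chunk servers.

CHUNK_SIZE = 64 * 1024 * 1024

cache = {}   # (file name, chunk index) -> (chunk handle, [servers])

def gfs_read(fname, offset, length, ask_master, read_chunkserver):
    index = offset // CHUNK_SIZE              # which chunk holds the offset
    if (fname, index) not in cache:
        # Step 1: ask the master for the handle and replica list.
        cache[(fname, index)] = ask_master(fname, index)
    handle, servers = cache[(fname, index)]
    server = servers[0]                       # paper: pick the closest replica
    # Step 2: ask a chunk server for a byte range within the chunk.
    return read_chunkserver(server, handle, offset % CHUNK_SIZE, length)

# Fake master and chunk server for demonstration:
def fake_master(fname, index):
    return ("h1", ["cs1", "cs2"])

chunk_store = {"h1": b"hello world"}          # chunk file named by handle

def fake_chunkserver(server, handle, off, n):
    return chunk_store[handle][off:off + n]

print(gfs_read("/f", 6, 5, fake_master, fake_chunkserver))   # b'world'
```

A read that spanned a chunk boundary would, as discussed above, be split by the library into two of these lookups, one per chunk.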
OK, so I'm not going to make a strong claim about which of them decides that it was the 17th chunk in the file. But it's the master that finds the identifier, the handle, of the 17th chunk in the file, looks that up in its table, and figures out which chunk servers hold that chunk. Yes? How does the client know? Oh, you mean if the client asks for a range of bytes that spans a chunk boundary? Well, the client's linked with this GFS library that knows how to take read requests apart and put them back together. And so that library would talk to the master, and the master would tell it, well, chunk seven is on this server and chunk eight is on that server. And the library would just be able to say, oh, I need the last couple bytes of chunk seven and the first couple bytes of chunk eight, and then would fetch those, put them together in a buffer, and return them to the calling application. The master tells it about chunks, and the library figures out where it should look in a given chunk to find the data the application wanted. The application only thinks in terms of file names and offsets in the entire file; the library and the master conspire to turn that into chunks. Yeah, let me get closer here. Can you say that again? So the question is, does it matter which chunk server you read from? Yes and no. Notionally, they're all supposed to be replicas. In fact, as you may have noticed, or as we'll talk about, they're not necessarily identical. Applications are supposed to be able to tolerate this, but the fact is that you may get slightly different data depending on which replica you read. Yeah, so the paper says that clients try to read from the chunk server that's in the same rack or on the same switch or something. All right, so that's reads. Writes are more complex and interesting. The application interface for writes is pretty similar.
There's some library call you make to the GFS client library saying, look, here's a file name and a range of bytes I'd like to write, and a buffer of data that I'd like you to write to that range. Actually, let me backpedal. I only want to talk about record append. So I'm going to phrase the client interface as: the client makes a library call that says, here's a file name, and I'd like to append this buffer of bytes to the file. So this is the record append that the paper talks about. Again, the client asks the master: look, I would like to append to this named file; please tell me where to look for the last chunk in the file. Because the client may not know how long the file is: if lots of clients are appending to the same file, because we have some big file that's logging stuff from a lot of different clients, maybe no client will necessarily know how long the file is, and therefore which offset or which chunk it should be appending to. So you ask the master: please tell me about the servers that hold the very last current chunk in this file. Now, if you're reading, you can read from any up-to-date replica. For writing, though, there needs to be a primary. So at this point, the chunk may or may not have a primary already designated by the master. So we need to consider the case where there's no primary. In that case, the master needs to find out the set of chunk servers that have the most up-to-date copy of the chunk. Because if you've been running the system for a long time, due to failures or whatever, there may be chunk servers out there that have old copies of the chunk, from yesterday or last week, that haven't been kept up to date, because maybe that server was dead for a couple of days and wasn't receiving updates.
So you need to be able to tell the difference between up-to-date copies of the chunk and non-up-to-date copies. The first step is to find the up-to-date replicas. This is all happening in the master, because the client has told the master, look, I want to append to this file, please tell me what chunk servers to talk to; it's all part of the master trying to figure out what chunk servers the client should talk to. So the master finds the up-to-date replicas, where up-to-date means a replica whose version of the chunk is equal to the version number that the master knows is the most up-to-date version number. The master hands out these version numbers, and the master remembers, oh, for this particular chunk, a chunk server is only up to date if it has version number 17. And this is why the version number has to be non-volatile; it's stored on the master's disk. Because if it was lost in a crash and there were chunk servers holding stale copies of chunks, the master wouldn't be able to distinguish chunk servers holding stale copies of a chunk from last week from a chunk server that holds the copy of the chunk that was up to date as of the crash. That's why the master remembers the version number on disk. What if you knew you were talking to all the chunk servers? So the observation is that the master has to talk to the chunk servers anyway if it reboots, in order to find out which chunk server holds which chunk, because the master doesn't remember that. So you might think that you could just talk to the chunk servers, find out what chunks and versions they hold, and take the maximum version for a given chunk over all the responding chunk servers. And that would work if all the chunk servers holding a chunk responded. But the risk is that at the time the master reboots, maybe some of the chunk servers are offline or disconnected or themselves rebooting, and don't respond.
And so all the master gets back is responses from chunk servers that have last week's copy of the chunk, while the chunk servers that have the current copy haven't finished rebooting or are offline or something. Oh, yes: if the servers holding the most recent copy are permanently dead, if you've lost all copies of the most recent version of a chunk, then yes, that data is lost. OK, so the question is: the master knows that for this chunk it's looking for version 17, and it talks to the chunk servers periodically to ask them, what chunks do you have, what versions do you have. Supposing it finds no server with version 17 for this chunk, then the master will either not respond yet and wait, or it will tell the client, look, I can't answer that, try again later. And this would come up: say there was a power failure in the building and all the servers crashed and were slowly rebooting. The master might come up first, some fraction of the chunk servers might be up, and other ones would reboot five minutes from now. So the master has to be prepared to wait, and it will wait forever, because you don't want to use a stale version of a chunk. OK, so the master needs to assemble the list of chunk servers that have the most recent version, and the master knows the most recent version number, stored on its disk. Each chunk server, along with each chunk, as you pointed out, also remembers the version number of the chunk it stores, so that when chunk servers report in to the master saying, look, I have this chunk, the master can ignore the ones whose version does not match the version the master knows is the most recent. OK, so remember where we were: the client wants to append, the master doesn't have a primary, and it figures out, maybe after waiting, the set of chunk servers that have the most recent version of that chunk.
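This stale-replica filter is simple enough to sketch. Names here are invented; the point is just that the master compares each reported version against the one it has on disk, and if nothing matches, it must wait rather than hand out a stale replica.

```python
# Sketch of the master's stale-replica filter: only replicas reporting
# exactly the version number the master persisted count as up to date.

def up_to_date_replicas(current_version, reports):
    """reports: list of (server, version) pairs heard from chunk servers."""
    fresh = [srv for srv, ver in reports if ver == current_version]
    if not fresh:
        # No up-to-date replica has reported in yet: wait, don't serve stale.
        raise RuntimeError("no up-to-date replica yet; wait and retry")
    return fresh

# cs1 was down for a while and missed an update, so it reports version 16.
reports = [("cs1", 16), ("cs2", 17), ("cs3", 17)]
```

Taking the maximum reported version instead of comparing against the persisted one would be exactly the mistake described above.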
It picks a primary: it's going to pick one of the replicas that have the most recent version to be the primary and the others to be secondaries. The master then increments the version number and writes that to disk so it doesn't forget it if it crashes. Then it sends the primary and the secondaries each a message saying, look, for this chunk, here's the primary, here's the secondaries, the recipient may be one of them, and here's the new version number. So it tells the primary and the secondaries this information plus the version number. The primary and secondaries all write the version number to disk so they don't forget, because if there's a power failure or whatever, they have to report in to the master with the actual version number they hold. Yes? What happens if the master boots up and a chunk server reports a version number different from the one the master remembers? That's a great question, and I don't entirely know. There are hints in the paper that I'm slightly wrong about this. The paper says that if the master reboots and talks to the chunk servers, and one of the chunk servers reports a version number that's higher than the version number the master remembers, the master assumes that there was a failure while it was assigning a new primary and adopts the higher version number that it heard from a chunk server. So it must be the case that, in order to handle a master crash at this point, the master writes its own version number to disk after telling the primaries. There's a bit of a problem here, though. What's that? Is there an ack? All right, so maybe the master tells the primary and secondaries that they're primary and secondaries, tells them the new version number, waits for the ack, and then writes to disk. There's something unsatisfying about this.
I don't believe that works, because of the possibility of the chunk servers with the most recent version numbers being offline at the time the master reboots. The master doesn't know the current version number; it'll just accept whatever highest version number it hears, which could be an old version number. All right, so this is an area of my ignorance: I don't really understand whether the master updates its own version number on disk first and then tells the primary and secondaries, or the other way around, and I'm not sure it works either way. OK, but in any case, one way or another, the master updates its version number and tells the primary and secondaries, look, you're the primary and secondaries, and here's a new version number. And so now we have a primary which is able to accept writes. That's the primary's job: to take writes from clients and organize applying those writes to the various chunk servers. And the reason for the version number stuff is that the master hands out the ability to be primary to some chunk server, and we want to be able to recognize, even across crashes, which primary and secondaries were actually put in charge of updating that chunk, so that only those servers are treated as holding current replicas of the chunk in the future. The way the master does this is with this version number logic. OK, so the master tells the primary and secondaries that they're it; they're allowed to modify this chunk. It also gives the primary a lease, which basically tells the primary, look, you're allowed to be primary for the next 60 seconds; after 60 seconds, you have to stop. And this is part of the machinery for making sure that we don't end up with two primaries, which we'll talk about a bit later. OK, so now we have a primary. Now the master tells the client who the primary and the secondaries are.
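Pulling the whole no-primary path together, here's a hedged sketch of what the master does. All names are invented, and note that the ordering question just discussed, whether the master persists the new version before or after notifying the replicas, is exactly the part the lecture says is unclear; this sketch persists first.

```python
import time

# Invented sketch of the master's "no primary" path: bump the version,
# persist it, pick a primary among the up-to-date replicas, grant it a
# 60-second lease, and tell every replica the new version number.

LEASE_SECS = 60

def persist(state):
    pass  # stand-in for writing the master's state to its disk

def assign_primary(master_state, chunk, fresh_replicas, replica_versions, now=None):
    now = time.time() if now is None else now
    master_state["version"][chunk] += 1          # new version number...
    persist(master_state)                        # ...written to disk (first?)
    primary, secondaries = fresh_replicas[0], fresh_replicas[1:]
    master_state["primary"][chunk] = (primary, now + LEASE_SECS)  # the lease
    for r in fresh_replicas:                     # replicas record the version
        replica_versions[r] = master_state["version"][chunk]  # on THEIR disks
    return primary, secondaries

r_version = {}
state = {"version": {"c1": 17}, "primary": {}}
primary, secondaries = assign_primary(state, "c1", ["cs2", "cs3"], r_version, now=0.0)
```

After this runs, both surviving replicas hold version 18, and any replica still reporting 17 will be ignored as stale from now on.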
And at this point, we're executing Figure 2 in the paper. The client now knows who the primary and secondaries are. In some order or another, and the paper explains a sort of clever way to manage this, the client sends a copy of the data it wants appended to the primary and all the secondaries. And the primary and the secondaries write that data to a temporary location; it's not appended to the file yet. After they've all said, yes, we have the data, the client sends a message to the primary saying, look, you and all the secondaries have the data; I'd like to append it to this file. The primary may be receiving these requests from lots of different clients, so it picks some order and executes the client requests one at a time. For each client append request, the primary looks at the offset that's the current end of the current chunk, makes sure there's enough remaining space in the chunk, and then writes the client's record to the end of the current chunk and tells all the secondaries to also write the client's data at the same offset in their chunks. So the primary picks an offset, and all the replicas, including the primary, are told to write the new appended record at that offset. The secondaries may do it, or they may not: maybe they ran out of space, maybe they crashed, maybe the network message from the primary was lost. If a secondary actually wrote the data to its disk at that offset, it will reply yes to the primary. If the primary collects a yes answer from all of the secondaries, so if all of them managed to actually write and replied to the primary saying yes, I did it, then the primary replies success to the client. If the primary doesn't get an answer from one of the secondaries, or a secondary replies, sorry, something bad happened, I ran out of disk space, my disk died, I don't know what, then the primary replies no to the client.
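The primary's side of that step can be sketched as follows. This is an invented illustration, not GFS code: pick the next offset at the end of the chunk, check the record fits, write locally, tell every secondary to write at the same offset, and report success only if all of them say yes.

```python
# Sketch of the primary's append logic. Secondaries are modeled as callables
# that take (offset, record) and return True if they managed the write.

CHUNK_SIZE = 64 * 2**20

def primary_append(chunk, record, secondaries):
    off = chunk["end"]
    if off + len(record) > CHUNK_SIZE:
        return None                    # no room: a new chunk is needed
    chunk["data"][off] = record        # primary writes locally first
    chunk["end"] = off + len(record)
    acks = [s(off, record) for s in secondaries]
    return off if all(acks) else None  # any missing ack -> error to client

chunk = {"end": 0, "data": {}}
ok_secondary = lambda off, rec: True    # pretends the write always lands
dead_secondary = lambda off, rec: False  # pretends the message was lost
```

Notice what the sketch makes explicit: when a secondary fails, the primary has already appended locally, so the replicas now differ even though the client was told the append failed. That's the exact behavior discussed next.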
And the paper says, oh, if the client gets an error like that back from the primary, the client is supposed to reissue the entire append sequence, starting again by talking to the master to find out the chunk at the end of the file. On a no, the client is supposed to reissue the whole record append operation. Ah, you would think, but they don't. So the question is, geez, the primary tells all the replicas to do the append, and maybe some of them do and some of them don't. If some of them don't, then we reply with an error to the client, so the client thinks the append didn't happen. But those replicas where the append succeeded, they did append. So now we have replicas that don't have the same data: the one that returned an error didn't do the append, and the ones that returned yes did. And that is just the way GFS works. Yeah, so if a reader then reads this file, depending on what replica they read, they may either see the appended record or not, if the record append failed. But if the record append succeeded, if the client got a success message back, then that means all of the replicas appended that record at the same offset. If the client gets a no back, then zero or more of the replicas may have appended the record at that offset and the others not. So if the client got a no, then maybe some replicas have the record and some don't, and depending on which replica you read from, you may or may not see the record. But aren't all the replicas the same version; don't all the secondaries have the same version number? Yes: the version number only changes when the master assigns a new primary, which would ordinarily only happen if the primary failed. So what we're talking about is replicas that all have the fresh version number, and you can't tell from the version numbers that the replicas are different, but maybe they're different.
And the justification for this is that, yeah, maybe the replicas don't all have the appended record, but that's the case in which the primary answered no to the client, so the client knows that the write failed. And the reasoning is that the client library will then reissue the append. So the appended record will show up: eventually the append will succeed, you would think, because the client will keep reissuing it until it succeeds. And when it succeeds, that means there's going to be some offset farther on in the file where that record actually occurs in all the replicas, as well as offsets preceding that where it occurs in only a few of the replicas. Yes. So this is a great question. The exact path that the write data takes might be quite important with respect to the underlying network. When the paper first talks about it, it claims the client sends the data to each replica; in fact, later on it changes its tune and says the client sends the data to only the closest of the replicas, and then that replica forwards the data to another replica along a chain until all the replicas have the data. And the path of that chain is chosen to minimize crossing bottleneck inter-switch links in the data center. Yes? No, no. So the version number only gets incremented if the master thinks there's no primary. In the ordinary sequence, there would already be a primary for that chunk. The master will remember, oh gosh, there's already a primary and secondaries for that chunk, and it won't go through this primary selection or increment the version number; it'll just tell the client, look, here's the primary, with no version number change. My understanding is that, well, I think you're asking an interesting question. In this scenario, in which the primary has answered failure to the client, you might think something must be wrong with something and that it should be fixed before you proceed.
In fact, as far as I can tell from the paper, nothing immediate happens. The client just retries the append. Because maybe the problem was that a network message got lost. So there's nothing to repair, right? A network message got lost; we should retransmit it, and this is sort of a complicated way of retransmitting the network message. Maybe that's the most common kind of failure. In that case, we don't change anything: it's still the same primary, same secondaries, the client retries, and maybe this time the network won't discard the message and it'll work. It's an interesting question, though: if what went wrong here is that there was a serious error or fault in one of the secondaries, what we would like is for the master to reconfigure that set of replicas to drop the secondary that's not working. And because it's choosing a new primary and executing this code path, the master would then increment the version. Then we'd have a new primary and new working secondaries with a new version, and this not-so-great secondary with an old version and a stale copy of the data. But because that's an old version, the master will never mistake it for being fresh. There's no evidence in the paper that this happens immediately, though. As far as what's said in the paper, the client just retries and hopes it works later. If the secondary is dead, eventually the master, which does ping all the chunk servers, will realize that, and will probably then change the set of primary and secondaries and increment the version, but only later. The lease is the answer to the question: what if the master thinks the primary is dead because it can't reach it? That is, we're in a situation where at some point the master said, you're the primary, and the master is pinging all the servers periodically to see if they're alive, because if the primary is dead it wants to pick a new primary.
The master sends some pings to you, you're the primary, and you don't respond. You might think that at that point, when you're not responding to my pings, the master would designate a new primary. It turns out that by itself is a mistake. And the reason it's a mistake to use that simple design is that I may be pinging you, and the reason I'm not getting responses is that there's something wrong with the network between me and you. So there's a possibility that you, the primary, are alive; I'm pinging you, and the network is dropping my packets. But you can talk to other clients, and you're serving requests from other clients. And if I, the master, designated a new primary for that chunk, now we'd have two primaries processing writes to different copies of the data, and so now we have totally diverging copies of the data. That error, having two primaries or whatever processing requests without knowing about each other, is called split brain, and I'm writing this on the board because it's an important idea, and it'll come up again. And it's caused, or is usually said to be caused, by network partition: some network error in which the master can't talk to the primary, but the primary can talk to clients, a sort of partial network failure. These are some of the hardest problems to deal with in building these kinds of storage systems. OK, so that's the problem: we want to rule out the possibility of mistakenly designating two primaries for the same chunk. The way the master achieves that is that when it designates a primary, it gives the primary a lease, which is basically the right to be primary until a certain time. The master remembers how long the lease lasts, and the primary knows how long its lease lasts. If the lease expires, the primary knows that it expired and will simply stop executing client requests.
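The two sides of that lease rule can be sketched in a few lines. This is an invented illustration, and it quietly assumes the master's and primary's clocks advance at roughly the same rate, which real systems have to worry about.

```python
# Lease sketch: the primary refuses requests after its lease expires, and
# the master never designates a new primary until the old lease has had
# time to run out, so two primaries can never be active at once.

LEASE_SECS = 60

class Primary:
    def __init__(self, lease_expiry):
        self.lease_expiry = lease_expiry
    def serve(self, now):
        # After expiry the primary must reject clients, even if it's healthy
        # and merely partitioned from the master.
        return now < self.lease_expiry

def master_may_reassign(lease_expiry, now):
    # The master can't reach the primary, but it must still sit out the
    # lease period before picking a replacement.
    return now >= lease_expiry

p = Primary(lease_expiry=100.0)
```

The safety argument is just that the two conditions are complementary: at any instant, at most one of `serve` and `master_may_reassign` can be true for the same lease.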
It'll ignore or reject client requests after the lease has expired. And therefore, if the master can't talk to the primary and would like to designate a new primary, the master must wait for the previous primary's lease to expire. So that means the master is going to sit on its hands for up to one lease period, 60 seconds. After that, it's guaranteed the old primary will have stopped operating as primary, and now the master can safely designate a new primary without producing this terrible split-brain situation. Oh, so the question is, why is designating a new primary bad, since clients always ask the master first? If the master changes its mind, it will direct subsequent clients to the new primary, right? Well, one reason is that clients cache, for efficiency, the identity of the primary, at least for short periods of time. Even if they didn't, though, the bad sequence is this: I'm the master, you ask me who the primary is, and I send you a message saying the primary is server one. That message is in flight in the network. And then I, the master, think that primary has failed, whatever, so I designate a new primary, send it a message saying, you're the primary, and start answering other clients who ask who the primary is by pointing at the new one over there. Well, the message to you is still in flight. You receive the message naming the old primary, and you think, gosh, I just got this from the master; I'm going to go talk to that primary. And without some much more clever scheme, there's no way you can realize that, even though you just got this information from the master, it's already out of date. And if that old primary serves your modification requests and responds success to you, then we have two conflicting replicas. Say that again? What if you have a new file and no replicas?
OK, so if you have a new file and no replicas, or even an existing file and no replicas, you'll take the path I drew on the blackboard. The master will receive a request from a client saying, oh, I'd like to append to this file. The master will first see that there are no chunks associated with that file, and it will make up a new chunk identifier, perhaps by calling the random number generator. Then it'll look in its chunk information table and see, gosh, I don't have any information about that chunk. There must be special-case code where it says, well, I don't know any version number; this chunk doesn't exist; I'm just going to make up a new version number, one, pick a random primary and set of secondaries, and tell them, look, you are responsible for this new empty chunk, please get to work. The paper says three replicas per chunk by default, so typically a primary and two secondaries. OK. So maybe the most important thing here is just to repeat the discussion we had a few minutes ago. The intentional construction of GFS with these record appends is that, if we have three replicas, maybe a client sends in a record append for record A, and all three replicas, the primary and both of the secondaries, successfully append the data to the chunk. So the first record in the chunk might be A in that case, and they all agree, because they all did it. Suppose another client comes along and says, look, I want to append record B, but the message to one of the replicas is lost; the network throws away the message by mistake. The other two replicas, the primary and one of the secondaries, get the message, and they both append to the file. So now what we have is two of the replicas holding B, and the other one has nothing there. And then maybe a third client wants to append C, and remember, the primary picks the offset.
And so the primary is going to tell the secondaries, look, write record C at this point in the chunk, and they all write C there. Now, the rule for the client for B, the client that got an error back from its request, is that it will resend the request. So the client that asked to append record B will ask again to append record B, and this time maybe there are no network losses, and all three replicas append record B. And they're all live; they all have the most fresh version number. Now what a reading client sees depends on which replica it looks at. It's going to see, in total, all three of the records, but in different orders depending on which replica it reads. One replica has A, then B, then C, then a repeat of B. The replica that missed the first B has A, then a blank space in the file, padding, then C, then B. So if you read the first replica, you see B and then C; if you read the other one, you see C and then B. Different readers will see different results. And maybe the worst situation is that some client gets an error back from the primary because one of the secondaries failed to do the append, and then the client dies before resending the request. Then you might get a situation where some record D shows up in some of the replicas and completely fails to show up in the others. So under this scheme, we have good properties for appends that the primary sent back a successful answer for, and not-so-great properties for appends where the primary sent back a failure. And the replicas can just absolutely be different, holding different sets of records. My reading of the paper is that, on a retry, the client starts at the very beginning of the process and asks the master again, what's the last chunk in this file? Because it might have changed if other people are appending to the file. So, I can't read the designers' minds.
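The blackboard example can be replayed as a toy simulation, with everything invented for illustration: A succeeds everywhere, B's message to one replica is lost so the client gets an error, C lands at the next offset everywhere, and the retried B lands after C. The replicas end up holding the same records but in different layouts, which is exactly what GFS permits.

```python
# Toy replay of the divergence example. Each replica is a list of records;
# unwritten space at skipped offsets reads back as padding.

def append_at(replicas, off, rec, skip=()):
    for name, log in replicas.items():
        if name in skip:               # models a lost message to this replica
            continue
        while len(log) <= off:
            log.append("pad")          # gap left by a missed write
        log[off] = rec

replicas = {"r1": [], "r2": [], "r3": []}
append_at(replicas, 0, "A")               # A: succeeds on all three
append_at(replicas, 1, "B", skip={"r3"})  # B: lost at r3 -> client sees error
append_at(replicas, 2, "C")               # C: primary picks the NEXT offset
append_at(replicas, 3, "B")               # client retries B; now it's everywhere
```

Reading r1 you see B before C; reading r3 you see padding where B should have been, then C, then B. Both reads are "correct" by GFS's rules.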
So, the observation is that the system could have been designed to keep the replicas in precise sync. That's absolutely true, and you will do it in labs two and three. You're going to design a system that does replication but actually keeps the replicas in sync, and you'll learn the various techniques, the various things you have to do, in order to do that. One of them is that there has to be a rule that you can't have these partial operations that are applied to only some replicas and not others. And that means there has to be some mechanism whereby the system, even if the client dies, says, wait a minute, there was this operation and I haven't finished it yet. So you'll build systems in which the primary actually makes sure the backups get every message. If the first write of B failed, you'd think C should go where B is? Well, it doesn't. You may think it should, but the way the system actually operates is that the primary will add C to the end of the chunk, and that's after B. One reason for this is that at the time the write for C comes in, the primary may not actually know what the fate of B was, because we may have multiple clients submitting appends concurrently. And for high performance, you want the primary to start the append for B first, and then as soon as it can, at the next offset, tell everybody to do C, so that all this stuff happens in parallel. By slowing things down, the primary could decide that B totally failed and then send another round of messages saying, please undo the write of B. But that would be more complex and slower. Again, the justification for this is that the design is pretty simple. It reveals some odd things to applications, and the hope was that applications could be relatively easily written to tolerate records being in different orders, or who knows what.
Or, if they couldn't, applications could make their own arrangements for picking an order themselves, writing sequence numbers into the files or something. Or, if the application really was very sensitive to order, you could just not have concurrent appends from different clients to the same file. For files where order is very important, say a movie file, you don't want to scramble the bytes; you just have one client write the movie to the file in sequential order, and not use concurrent record appends. All right. Somebody asked, basically, what would it take to turn this design into one which actually provided strong consistency, consistency closer to our single-server model where there are no surprises? I don't actually know, because that would require an entirely new, complex design, and it's not clear how to mutate GFS into that design. But I can list for you some things you would want to think about if you wanted to upgrade GFS to a system that did have strong consistency. One is that you probably need the primary to detect duplicate requests, so that when this second B comes in, the primary is aware that, oh, actually, we already saw that request earlier and did or didn't do it, and can try to make sure that B doesn't show up twice in the file. So one thing you need is duplicate detection. Another issue is that you really need to design the system so that if the primary tells a secondary to do something, the secondary actually does it and doesn't just return an error. For a strictly consistent system, having the secondaries be able to just blow off primary requests with really no compensation is not OK. So either the secondaries have to accept requests and execute them,
or, if a secondary has some sort of permanent damage, like its disk got unplugged by mistake, you need to have a mechanism to take the secondary out of the system so that the primary can proceed with the remaining secondaries. But GFS does neither, at least not right away. It also means that when the primary asks the secondaries to append something, the secondaries have to be careful not to expose that data to readers until the primary is sure that all the secondaries really will be able to execute the append. So you might need multiple phases in the writes: a first phase in which the primary asks the secondaries, look, I'd really like you to do this operation, can you do it, but don't actually do it yet. And if all the secondaries answer with a promise to be able to do the operation, only then does the primary say, all right, everybody, go ahead and do that operation you promised. That's the way a lot of real-world strongly consistent systems work, and that trick is called two-phase commit. Another issue is that if the primary crashes, there will have been some last set of operations that the primary had launched, sent to the secondaries, but the primary crashed before it was sure whether all the secondaries got their copy of the operation or not. So if the primary crashes, a new primary, one of the secondaries, is going to take over as primary. But at that point, the new primary and the remaining secondaries may differ in the last few operations, because maybe some of them didn't get the message before the primary crashed. And so the new primary has to start by explicitly resynchronizing with the secondaries to make sure that the tails of their operation histories are the same.
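The prepare-then-commit pattern mentioned above can be sketched minimally. This is an invented illustration of the idea, not GFS code, and it leaves out what real two-phase commit also needs, such as the coordinator logging its decision so it can finish the protocol after a crash.

```python
# Minimal two-phase commit sketch: phase one asks every secondary to
# promise, and only if all promise does phase two tell them to apply, so
# no replica exposes data the others might never get.

def two_phase_append(record, secondaries):
    # Phase 1: prepare. Everyone must promise before anyone applies.
    if not all(s.prepare(record) for s in secondaries):
        return False
    # Phase 2: commit. Promises are binding, so these applies go through.
    for s in secondaries:
        s.commit()
    return True

class Sec:
    def __init__(self, healthy=True):
        self.healthy, self.pending, self.log = healthy, None, []
    def prepare(self, record):
        if not self.healthy:
            return False       # refuse to promise: nothing gets exposed
        self.pending = record  # staged, but not visible to readers yet
        return True
    def commit(self):
        self.log.append(self.pending)
        self.pending = None
```

Contrast with GFS: here, if any secondary refuses, no replica's log changes at all, which is precisely the partial-operation problem GFS chose not to solve.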
Finally, to deal with this problem that there may be times when the secondaries differ, or the client may have a slightly stale indication from the master of which server to talk to, the system either needs to send all client reads through the primary, because only the primary is likely to know which operations have really happened, or we need a lease system for the secondaries, just like we have for the primary, so that it's well understood when a secondary can and can't legally respond to a client. So these are the things I'm aware of that would have to be fixed in the system, added complexity and chit-chat, to make it have strong consistency. And actually, the way I got that list was by thinking about the labs: you're going to end up doing all the things I just talked about as part of labs two and three, to build a strictly consistent system. OK, so let me spend one minute on, well, I actually have a link in the notes to a sort of retrospective interview about how well GFS played out over the first five or ten years of its life at Google. The high-level summary is that it was tremendously successful: many Google applications used it, and a number of pieces of Google infrastructure, Bigtable for example, were built as layers on top of GFS. So what were its limitations? Maybe the most serious limitation is that there was a single master, and the master had to have a table entry for every file and every chunk. That meant that, as GFS use grew and there got to be more and more files, the master just ran out of memory, ran out of RAM, to store the file information. You can put more RAM in, but there are limits to how much RAM a single machine can have. So that was the most immediate problem people ran into. In addition, the load on the single master from thousands of clients started to be too much.
A master has only so much CPU; it can only process some hundreds of requests per second, especially if it has to write things to disk, and pretty soon there got to be too many clients. Another problem was that some applications found it hard to deal with these sort of odd semantics. And a final problem was that there was no automatic story for master failover in the GFS paper as we read it; it required human intervention to deal with a master that had permanently crashed and needed to be replaced, and that could take tens of minutes or more, which was just too long a failure recovery for some applications. OK, excellent. I'll see you on Thursday, and we'll hear more about all these themes over the semester.