Last time I started talking about linearizability, and I want to finish up this time. The reason we're talking about it again is that it's our standard definition for what strong consistency means in storage-style systems. So for example, your lab 3 needs to be linearizable. Sometimes this will come up because we're talking about a strongly consistent system and we're wondering whether a particular behavior is acceptable. And other times linearizability will come up because we'll be talking about a system that isn't linearizable, and we'll be wondering in what ways it might fall short of or deviate from linearizability. So one thing you need to be able to do is look at a particular sequence of operations, a particular execution of some system that executes reads and writes, like your lab 3, and be able to answer the question: was that sequence of operations I just saw linearizable or not? So we're going to continue practicing that a little bit now, plus I'll try to establish some interesting facts about what linearizability means, about its consequences for the systems we build and look at. Linearizability is defined on a particular operation history. The thing we're always asking is: we observed a sequence of requests by clients, they got some responses at different times, they asked to read different data and got various answers back; is that history we saw linearizable? Okay, so here's an example of a history that might or might not be linearizable. Time is going to move to the right, and this vertical bar marks the time at which a client sent a request. I'm going to use this notation to mean that the request is a write that asks to set variable or key or whatever x to value 0. So we've got a key and a value.
This would correspond to a put of key x and value 0 in lab 3. We're watching what the client sent: the client sent this request to our service, and at some point the service responded and said, yes, your write is completed. We're assuming the service is of a nature that actually tells you when the write completes; otherwise the definition isn't very useful. Okay, so we have this request by somebody to write. Then I'm imagining in this example there's another request, and because I'm putting this mark here, the second request started after the first request finished. The reason that's important is the rule that a linearizable history must match real time. What that really means is that if one request is known in real time to have started after some other request finished, the second request has to occur after the first request in whatever order we work out as the proof that the history was linearizable. Okay, so in this example, I'm imagining there's another request that asks to write x with value 1, and then a concurrent request, maybe started a little bit later, asks to set x to 2. So now we have maybe two different clients issuing requests at about the same time to set x to two different values, and of course we're wondering which one is going to be the real value. Then we also have some reads. If all you have is writes, it's hard to say much about linearizability, because you don't have any proof that the system actually did anything or revealed any values. So we really need reads. So let's imagine we have some read: say we see in the history that a client sent a read at this time, and at this second time it got an answer; it read key x and got value 2. So presumably it actually saw this value.
And then there was another request, by maybe the same client or a different client, but known to have started in time after this request finished, and this read of x got value 1. So the question in front of us is: is this history linearizable? There are sort of two strategies we can take. One is to cook up a sequence: if we can come up with a total order of these five operations that agrees with real time, and in which each read sees the value written by the most recently preceding write in the order, then that order is a proof the history is linearizable. The other strategy is to observe that each of these rules may imply certain this-comes-before-that edges in a graph, and if we can find a cycle in that this-operation-must-come-before-that-operation graph, that's proof that the history isn't linearizable. And for small histories, we may actually be able to enumerate every single order and use that to show the history isn't linearizable. Anyway, any thoughts about whether this might or might not be linearizable? Yes, okay. So the observation is that it's a little bit troubling that we saw a read with value 2 and then a read with value 1, and maybe that contradicts something. There were two writes, one with value 1 and one with value 2; certainly if we had a read with value 3, that would obviously be terribly wrong. But there were a write of 1 and a write of 2, and a read of 1 and a read of 2. So the question is whether this order of reads can possibly be reconciled with the way these two writes show up in the history. The question from the audience is: maybe we shouldn't be able to read their values until the write is complete, so why are we able to see R(x)=2 and R(x)=1? Okay, so the game we're playing is that we have maybe two or three clients, and they're talking to some service, maybe a Raft-based lab 3 or something. What we are seeing is requests and responses, right?
So what this means is that we saw a request from a client to write x, a put request for x with value 1, and we saw the response here. What we know is that somewhere during this interval of time, the service presumably actually changed the value of x to 1. And what this one means is that somewhere in this interval of time, the service presumably changed its internal idea of the value of x to 2. Somewhere in this time, but just somewhere in this time; it doesn't mean it happened here or here. Does that answer your question? Yes, okay. So the observation from the audience is that this is linearizable, and it's been accompanied by an actual proof of linearizability, namely a demonstration of an order that shows it's linearizable. The order is: first, the write of x with value 0. The server got both of these next writes at roughly the same time, and it's allowed to choose the order itself, right? So let's say it executed the write of x to value 2 first. Then it executed the first read of x, which at that point would yield 2. Then we're going to say the next operation it executed was the write of x to 1, and the last operation in the history is the read of x yielding 1. And so this is proof that the history is linearizable, because here's an order: it's a total order of the operations, and it matches real time. What that means is, well, I'll just go through it. The write of x to 0 comes first, and that's totally intuitive, since it actually finished before any other operation started. The write of x to 2 comes second. So I'm going to mark here the sort of real time at which we imagine these operations happened, to demonstrate that the order does match real time. I'll just write a big X here to mark the time when we imagine this operation happened, right?
So that's the second operation. Then we're imagining that the next operation is the read of x yielding 2. There's no real-time problem, because that read was actually issued concurrently with the write of x to 2. It's not like the read finished and only then did the write of x with 2 start; they really are concurrent. So we'll just imagine that the point in time at which this operation happened is right there. So there's the first operation, second, third. Now we have the write of x to 1. Let's just say it happens here in real time; it just has to happen after the operations that occur before it in the order. So there's the fourth operation. And now we have the read of x yielding 1, which can pretty much happen at any time, so let's say it happens here, okay? So we have the order, and this is the demonstration that the order is consistent with real time: we can pick a time for each of the operations, within its start and end time, that would cause this total order to match the real-time order. And so the final question is, did each read see the value written by the most closely preceding write of the same variable? There are two reads. This read is most closely preceded by a write of the matching value, that's good, and this read is also most closely preceded by a write of the matching value. Okay, so this is a demonstration that this history was linearizable. Depending on what you thought when you first saw the history, that may not have been immediately clear; it's not always obvious that a history this complicated is linearizable. It's easy to be tricked when looking at these histories, because you think, the write of x with 1 started first, so we just sort of assume that the first value written must be 1, but that's actually not required here. Any questions about this? You mean if these two were moved like this?
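To make the two strategies concrete, here's a sketch of the brute-force approach in Go (lab 3's language). The type names, field names, and integer timestamps are all my own invention, since the lecture only draws the history on the board; the idea is just to try every total order and check the two rules against each one.

```go
package main

import "fmt"

// Op is one operation from an observed history: its observed real-time
// start and end, whether it is a write or a read, and the value written
// or returned. (Single key x, so the key is implicit.)
type Op struct {
	start, end int
	isWrite    bool
	val        int
}

// valid reports whether a proposed total order satisfies both rules:
// it must not reorder operations that are ordered in real time, and each
// read must return the value of the most recent preceding write.
func valid(order []Op) bool {
	cur := -1 // -1 means "nothing written yet"
	for i, op := range order {
		// real-time rule: nothing later in the order may have
		// finished before an earlier operation even started
		for _, later := range order[i+1:] {
			if later.end < op.start {
				return false
			}
		}
		if op.isWrite {
			cur = op.val
		} else if op.val != cur {
			return false // read saw a value it shouldn't have
		}
	}
	return true
}

// linearizable tries every permutation of the history; the history is
// linearizable iff some permutation is valid. Fine for five operations,
// hopeless for large histories.
func linearizable(ops []Op) bool {
	var perm func(k int) bool
	perm = func(k int) bool {
		if k == len(ops) {
			return valid(ops)
		}
		for i := k; i < len(ops); i++ {
			ops[k], ops[i] = ops[i], ops[k]
			if perm(k + 1) {
				return true
			}
			ops[k], ops[i] = ops[i], ops[k]
		}
		return false
	}
	return perm(0)
}

func main() {
	// the history from the board: W(x,0) completes, then concurrent
	// W(x,1) and W(x,2), then R(x)=2 followed in real time by R(x)=1
	h := []Op{
		{0, 1, true, 0},
		{2, 8, true, 1},
		{3, 7, true, 2},
		{4, 5, false, 2},
		{6, 9, false, 1},
	}
	fmt.Println(linearizable(h)) // prints "true"
}
```

The checker finds exactly the order worked out above (write 0, write 2, read 2, write 1, read 1), which is why it answers true for this history.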
Okay, so if the write with value 2 was only issued by the client after the read of x with value 2 returned, that wouldn't be linearizable, because any order we come up with has to obey the real-time ordering. So any order we come up with would have to have the read of x with 2 precede the write of x with 2. And since there's no other write of x with 2 in sight here, that means a read at this point could only see 0 or 1, because those are the only other writes that could possibly come before this read. So shifting these that much would make the example not linearizable, yes. Yes, the first vertical line is the moment the client sends the request, and the second vertical line is the moment the client receives the response, yeah. So this is a very client-centric kind of definition that says clients should see the following behavior. And what happens after you've sent a request in, maybe there's a lot of replicas, maybe a complicated network, who knows what; it's almost none of our business, because the definition is only about what clients see. There are some gray areas which we'll come to in a moment, like if the client should need to retransmit a request; that's something we have to think about too. Other questions? Okay, so this one is linearizable. Here's another example. I'm actually going to start out with it being identical to the first example. So again, we have a write of x with 0, we have these two concurrent writes, and we have the same two reads. So far it's identical to the previous example, so we know this much alone must be linearizable. But I'm going to add something. Let's imagine that client 1 issued these two read requests.
The definition doesn't really care about clients, but for sanity's sake we'll assume client 1 read x and saw 2, and then later read x and saw 1. That's okay so far. But let's say there's another client, and the other client does a read of x and sees 1, and then does a second read of x and sees 2. So the question is, is this linearizable? We either have to come up with an order, or a comes-before graph that has a cycle in it. The puzzle this is getting at is: there are only two writes here, so in any order, one of the writes comes first and the other comes second. And intuitively, client 1 observed that the write with value 2 came first, and then the write with value 1, right? These two reads mean it has to be the case that, in any legal order, the write of 2 comes before the write of 1, in order for client 1 to have seen this; it's the same order we saw over here. But symmetrically, client 2's experience clearly shows the opposite: client 2 saw the write of 1 first, and then the write with value 2. And one of the rules here is that there's just one total order of operations. We're not allowed to have different clients see different histories, different progressions or evolutions of the values stored in the system. There can be only one total order, and all clients have to experience operations consistent with that one order. Client 1's experience clearly implies that the order is write 2 and then write 1, so we should not be able to have any other client who observes proof that the order was anything else, which is what we have here. That's a bit of an intuitive explanation for what's going wrong here. And by the way, the reason this could come up in the systems we build and look at is that we're building replicated systems.
Either Raft replicas, or maybe systems with caching in them; we're building systems out of many copies of the data. So there may be many servers with copies of x in them, possibly with different values at different times, right? If they haven't gotten the commit yet or something, some replicas may have one value and some may have the other. But in spite of that, if our system is linearizable, or strongly consistent, it must behave as if there were only one copy of the data and one linear sequence of operations applied to that data. And that's why this is an interesting example: this could come up in a buggy system that had two copies of the data, where one copy executed these writes in one order and the other replica executed the writes in the other order. Then we could see this. And linearizability says no, we're not allowed to see that in a correct system. So the cycle in the comes-before graph, which would be a slightly more proofy proof that this is not linearizable, goes like this. The write of 2 has to come before client 1's read of 2, so there's one arrow: this write has to come before that read. Client 1's read of 2 has to come before the write of x with value 1; otherwise client 1's second read wouldn't be able to see 1. You could imagine the write of 1 happening very early in the order, but in that case the write of 2 would have to follow it, since we know client 1's first read saw 2, and then client 1's second read of x wouldn't see 1, it would see 2. So the read of x with 2 must come before the write of x with 1. The write of x with 1 must come before any read of x with value 1, including client 2's read of x with value 1. And in order for client 2's first read to get value 1 and its second read to see 2, the write of x with value 2 must come between those two operations in the order. So we know that client 2's read of x with 1 must come before the write of x with 2. And that's a cycle, right?
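That comes-before argument can also be checked mechanically. Here's a small sketch, with my own labels for the four operations involved (W1 and W2 for the writes, R(c,v) for client c's read that returned v), that encodes the four edges just described and looks for a cycle with a depth-first search:

```go
package main

import "fmt"

// Each edge a -> b means "a must precede b in any legal order".
// These are exactly the four constraints argued above.
var edges = map[string][]string{
	"W2":     {"R(1,2)"}, // W2 must precede client 1's read of 2
	"R(1,2)": {"W1"},     // that read must precede the write of 1
	"W1":     {"R(2,1)"}, // which must precede client 2's read of 1
	"R(2,1)": {"W2"},     // which in turn must precede W2
}

// hasCycle does a DFS looking for a back edge. A cycle means no total
// order can satisfy all the constraints, so the history isn't linearizable.
func hasCycle() bool {
	const (
		white = 0 // unvisited
		gray  = 1 // on the current DFS path
		black = 2 // finished
	)
	color := map[string]int{}
	var visit func(n string) bool
	visit = func(n string) bool {
		color[n] = gray
		for _, m := range edges[n] {
			if color[m] == gray {
				return true // back edge: found a cycle
			}
			if color[m] == white && visit(m) {
				return true
			}
		}
		color[n] = black
		return false
	}
	for n := range edges {
		if color[n] == white && visit(n) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasCycle()) // prints "true"
}
```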
So there's no linear order that can obey all of these time and value rules, and we know there isn't because there's a cycle in the graph. That's a good question. This definition is a definition about histories, not necessarily about systems. So it's not saying a system design is linearizable because of something about the design; it's really only history by history. If we don't get to know how the system operates internally, and the only thing we can do is watch it while it executes, then before we've seen anything, we just don't know. We assume it's linearizable, and then we see more and more sequences of operations and say, gosh, they're all consistent with linearizability, they all follow these rules, so we believe the system is probably linearizable. And if we ever see one that isn't, then we realize it's not linearizable. So yeah, it's not a definition on the system design; it's a definition on what we observe the system to do. In that sense it's maybe a little bit unsatisfying if you're trying to design something, since it's not a recipe for a design, except in the trivial sense that if you had a single server in a very simple setup, one server, one copy of the data, not threaded or multi-core or anything, it's a little bit hard to build a system that violates this; but it's super easy to violate in any kind of distributed system. Okay, so the lesson from this is that there can only be one order in which the system is observed to execute the writes. All clients have to see values consistent with the system executing the writes in the same order. Here's another, very simple history. Suppose we write x with value 1, and then, definitely subsequently in time, another client launches a write of x with value 2 and sees the response back from the server saying, yes, I did the write. And then a third client does a read of x and gets value 1.
So this is a very easy example. It's clearly not linearizable, because the time rule means the only possible order is the write of x with 1, the write of x with 2, the read of x with 1. That has to be the order, and since it's the only order, it clearly violates the second rule, about values: the value written by the most recent write in the one possible order is not 1, it's 2. So this is clearly not linearizable. The reason I'm bringing it up is that this is the argument that a linearizable system, a strongly consistent system, cannot serve up stale data, right? And the reason this might come up is, again, maybe you have lots of replicas, and each maybe hasn't seen all the writes, or all the committed writes. So maybe all the replicas have seen the first write, but only some replicas have seen the second. If you ask a replica that's lagging behind a little bit, it's still going to have value 1 for x. But nevertheless, clients should never be able to see this old value in a linearizable system: no stale data allowed, no stale reads. So, a question about the rules: two operations must match real time, meaning you can't switch their order around if one begins after the other ends? Yes. But if there's some overlap of the intervals, that's when we have freedom to switch the order? Yeah, if there's overlap in the intervals, then the system could legally execute either of them at a real time within the overlap; that's the sense in which the system could execute them in either order. Now, back in the earlier example, if it weren't for those two reads, the system would have had total freedom to execute the writes in either order. But because we saw the two reads, we know that the only legal order is 2 and then 1. Yeah, and if the two reads were overlapping, then the reads could have seen either order.
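Since real time forces the single order W(x,1), W(x,2), R(x), checking the value rule here is nearly a one-liner. A tiny sketch (variable names are my own):

```go
package main

import "fmt"

func main() {
	// real time forces the order: W(x,1), then W(x,2), then the read
	x := 0
	x = 1 // W(x,1)
	x = 2 // W(x,2)

	// the read actually returned 1, but the value rule says it must
	// see the most recent write in the only possible order, which is 2
	observed := 1
	fmt.Println(observed == x) // prints "false": a stale read, so not linearizable
}
```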
In fact, until we saw the 2 and the 1 come back from the reads, the system still had freedom: until it committed to the values for the reads, it could have returned them in either order. Someone asked about linearizability versus strong consistency; I'm using them as synonyms. Yeah, for most people, although possibly not today's paper, linearizability is well defined, and people's definitions rarely deviate very much from this one. Strong consistency, though, is less precise; I think there's less consensus about exactly what the definition is when people say strong consistency. But it's usually meant in ways that are quite close to this. Like, for example, that the system behaves the same way a system with only one copy of the data would behave, which is quite close to what we're getting at with this definition. So yeah, it's reasonable to assume that strong consistency means the same thing as linearizable. Okay, so this history is not linearizable, and the lesson is that reads are not allowed to return stale data, only fresh data: a read can only return the result of the most recently completed write. Okay, and I have a final little example. We have two clients. One of them submits a write to x with value 3, and then a write to x with value 4. And we have another client, and at this point in time that client issues a read of x, but, and this gets at the question you asked, the client doesn't get a response. Who knows why: in the actual implementation, maybe the leader crashed at some point; maybe the leader never got the read because the request was dropped; or maybe the leader got the request and executed it, but the network dropped the response; or maybe the leader started to process it but crashed before finishing; or maybe it did process it and crashed before sending the response. Who knows. From the client's point of view, it sent a request and never got a response.
So in the interior machinery of the client, for most of the systems we're talking about, the client is going to resend the request, maybe to a different leader, maybe the same leader, who knows. So it sent the first request here, it times out with no response, maybe it sends a second request at this point in time, and then finally gets a response. It turns out, and you're going to implement this in lab 3, that a reasonable way for servers to deal with repeated requests is to keep tables indexed by some kind of unique request number from the clients, in which the servers remember: I already saw that request and executed it, and this was the response I sent back. Because you don't want to execute a request twice, certainly not if it's a write request. So the servers have to be able to filter out duplicate requests, and they have to be able to repeat the reply they originally sent to that request, which perhaps was dropped by the network. The servers remember the original reply and repeat it in response to the resend. And if you do that, which you will in lab 3, then, since the leader could have seen value 3 when it executed the original read request from client 2, it could return value 3 to the repeated request that was sent at this time and completed at this time. And so we have to make a call on whether that's legal, right? You could argue that, gosh, the client re-sent the request here, and this was after the write of x to 4 completed, so we really should return 4 at this point instead of 3. And this is a question that's a little bit up to the designer.
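Here's a minimal sketch of such a duplicate table in Go. The names (Server, IncX, lastReply) and the increment operation are made up for illustration, and lab 3's real RPC structure differs; I use an increment rather than a put so that wrongly executing a duplicate twice would be visibly wrong.

```go
package main

import (
	"fmt"
	"sync"
)

// lastReply remembers the most recent request a client sent and the
// reply the server originally returned for it.
type lastReply struct {
	seq   int
	value int
}

type Server struct {
	mu   sync.Mutex
	x    int                 // the single key "x" from the examples
	last map[int64]lastReply // clientID -> last executed request's reply
}

// IncX adds 1 to x and returns the new value, unless the (clientID, seq)
// pair matches the last request from this client, in which case the
// stored reply is repeated without re-executing anything.
func (s *Server) IncX(clientID int64, seq int) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	if r, ok := s.last[clientID]; ok && r.seq == seq {
		return r.value // duplicate: repeat the original reply
	}
	s.x++
	s.last[clientID] = lastReply{seq: seq, value: s.x}
	return s.x
}

func main() {
	s := &Server{last: map[int64]lastReply{}}
	fmt.Println(s.IncX(42, 1)) // first transmission executes: prints 1
	fmt.Println(s.IncX(42, 1)) // retransmission repeats the reply: prints 1, not 2
	fmt.Println(s.IncX(42, 2)) // a genuinely new request: prints 2
}
```

Note that the retransmission gets the original reply even though x could have moved on in a fuller system; whether that is legal is exactly the question discussed next.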
But if what you view as going on is that the retransmissions are a low-level concern, part of the RPC machinery or hidden in some library or something, and that from the client application's point of view all that happened was that it sent a request at this time and got a response at this time, then a value of 3 is totally legal here. Because this request took a long time, it was completely concurrent with the write, not ordered in real time with the write, and therefore either the 3 or the 4 is valid, as if the read request really executed here in real time, or here in real time. So the larger lesson is that if you have client retransmissions, and you're defining linearizability from the application's point of view, then even with retransmissions the real-time extent of a request like this runs from the very first transmission of the request to the final time at which the application actually got the response, maybe after many re-sends. Yes, you might rather get fresh data than stale data. Suppose the request is to a time server: what time is it? If I send the request now and I don't get the response until two minutes from now due to some network issue, it may be that the application would prefer to see a time that's close to the time at which it actually got the response, rather than a time deep in the past when it originally sent the request. But the fact is that if you're using a system like this, you have to write applications that are tolerant of these rules. If you're using a linearizable system, these are the rules.
And so correct applications must be tolerant of this: if they send a request and get a response a while later, they are not allowed to be written as if, gosh, if I get a response, that means the value at the time I got the response was equal to 3. That is not okay for applications to assume. How that plays out for a given application depends on what the application is doing. The reason I bring this up is that it's a common question in 6.824. You'll implement the machinery by which servers detect duplicates and resend the previous answer the server originally sent, and the question will come up: is it okay, if you originally saw the request here, to return at this later point in time the response you would have sent back here if the network hadn't dropped it? It's handy to have a way of reasoning about that; I mean, one reason to have definitions like linearizability is to be able to reason about questions like that, right? And using this scheme, we can say, well, it actually is okay by those rules. All right, that's all I wanted to say about linearizability. Any lingering questions? Well, maybe I'm taking liberties here, but what's going on is that in real time, we have a read of 2 and a read of 1, and the read of 1 really came after the read of 2 in real time, so it must come in that order in the final order. That means there must have been a write of 2, sorry, a write with value 1, somewhere in here: after the read of 2 in the final order, and before the read of 1. In that window there must be a write with value 1. There's only one write with value 1 available; if there were more than one, we could maybe play games. But there's only one available, so this write must slip in here in the final order, and therefore I felt able to draw this arrow.
And these arrows just capture, one by one, the implications of the rules for what the order must look like. Which R(x)? Any R(x)=1; sorry, that client's own R(x)=1. The write has to come before their own R(x)=1. Okay, so, yep, we're not really able to say which of these two reads came first, so we can't quite draw that arrow. If we mean an arrow to constrain the ultimate order, well, these two reads could come in either order, so we're not allowed to say this one came before that one. It could be that there's a simpler cycle than I've drawn. It may be, because certainly the damage is in these four items; these four items are the main evidence that something is wrong. Whether there's a cycle that involves just them, I'm not sure. It could be. Okay, this is worth thinking about, because if I can't think of anything better, I'll certainly ask you a question about linearizable histories on the midterm. Okay, so today's paper: ZooKeeper. Part of the reason we're reading the ZooKeeper paper is that it's a successful real-world system. It's an open-source service that a lot of people actually run, and it's been incorporated into a lot of real-world software. So there's a certain kind of reality and success to it, which is attractive from the point of view of supporting the idea that ZooKeeper's design might actually be a reasonable design. But the reason we're interested in it, the reason I'm interested in it, comes down to two somewhat more precise technical points. So why are we looking at this paper? One of them is that, in contrast to Raft, the Raft you've written is really defined as a library. You can use a Raft library as part of some larger replicated system, but Raft isn't a standalone service that you can talk to.
You really have to design your application to interact with the Raft library explicitly. So you might wonder, and it's an interesting question, whether some useful standalone general-purpose service could be defined that would be helpful for people building other distributed systems. Is there a service that can bite off a significant portion of what's painful about building distributed systems and package it up in a standalone service that anybody can use? This is really the question of what the API would look like for a general-purpose, I'm not sure what the right name for things like ZooKeeper is, but let's call it a general-purpose coordination service. The other interesting aspect of ZooKeeper is performance. ZooKeeper is a replicated system: among other things it's a fault-tolerant general-purpose coordination service, and it gets fault tolerance, like most systems, by replication. That is, there's a bunch of, maybe three or five or seven, who knows how many, ZooKeeper servers, and it takes money to buy those servers, right? A seven-server ZooKeeper setup is seven times as expensive as a simple single server. So it's very tempting to ask: if you buy seven servers to run your replicated service, can you get seven times the performance out of them? And how could we possibly do that? So the question is, if we have n times as many servers, can that yield us n times the performance? I'm going to talk about the second question first. From the point of view of this discussion about performance, I'm just going to view ZooKeeper as some service, we don't really care what it is, replicated with a Raft-like replication system. ZooKeeper actually runs on top of this thing called ZAB, which for our purposes we'll treat as being almost identical to Raft.
And I'm just worried about the performance of the replication; I'm not really worried about what ZooKeeper specifically is up to. So the general picture is that we have a bunch of clients, maybe hundreds of them, and, just as in the labs, we have a leader. The leader has a ZooKeeper layer that clients talk to, and under the ZooKeeper layer is this ZAB layer that manages replication. And just like Raft, a lot of what ZAB is doing is maintaining a log that contains the sequence of operations that clients have sent in. So it's really very similar to Raft. We may have a bunch of replicas, and each of them has a log that it's appending your requests to. That's a familiar setup. So the client sends in a request, the ZAB layer sends a copy of that request to each of the replicas, and the replicas append it to their in-memory log and probably persist it onto disk so they can get it back if they crash and restart. So the question is: as we add more servers, you know, four servers or five or seven or whatever, does the system get faster as we add more CPUs, more horsepower? Do you think your labs will get faster as you add more replicas? I mean, each replica is its own computer, so you really do get more CPU cycles as you add more replicas. Yeah, there's nothing about this that makes it faster as you add more servers; that's absolutely true. As we add more servers, the leader is almost certainly the bottleneck, because the leader has to process every request, and it sends a copy of every request to every other server. Adding servers just adds more work to this bottleneck node. You're not getting any performance benefit out of the added servers, because they're not really doing anything; they're just all happily doing whatever the leader tells them to do. They're not subtracting from the leader's work, and every single operation goes through the leader.
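A rough back-of-envelope count of the messages the leader handles per client operation makes this concrete. The model below is my own simplification (it ignores batching, heartbeats, and separate commit messages), but it captures why the leader's per-operation work grows with the number of servers:

```go
package main

import "fmt"

// leaderMsgs is a rough model of how many messages the leader handles
// for one client operation in a Raft/ZAB-style system with n servers:
// the client's request in, one replication message out to each of the
// n-1 followers, their acknowledgments back in, and the reply out.
func leaderMsgs(n int) int {
	return 1 + (n - 1) + (n - 1) + 1
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("%d servers: leader handles ~%d msgs/op\n", n, leaderMsgs(n))
	}
}
```

So by this model, going from three servers to seven more than doubles the leader's per-operation work, while the followers' extra capacity goes unused.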
So here, the performance is inversely proportional to the number of servers. You add more servers, the system almost certainly gets slower, because the leader just has more work. So in this system, we have the problem that more servers makes the system slower. That's too bad. These servers cost a couple thousand bucks each. You would hope that you could use them to get better performance. Okay, so the question is, what if the requests, maybe from different clients, or successive requests from the same client, apply to totally different parts of the state? So in a key value store, maybe one of them is a put on X and the other is a put on Y. Nothing to do with each other. Can we take advantage of that? The answer is absolutely, but not in this framework, or rather the sense in which we can take advantage of it in this framework is very limited. At a high level, the requests all still go through the leader, and the leader still has to send them out to all the replicas, and the more replicas there are, the more messages the leader has to send. So at a high level, commutative requests are not likely to help this situation. It's a fantastic thought to keep in mind though, because it will absolutely come up in other systems, and we will be able to take advantage of it in other systems. Okay, so this is a little bit disappointing. Extra server hardware wasn't helping performance. So a very obvious, maybe the simplest, way you might be able to harness these other servers is to build a system in which, yes, write requests all have to go through the leader, but observe that in the real world a huge number of workloads are read heavy. That is, there are way more reads than writes. Like when you look at web pages, it's all about reading data to produce the web page, and generally there are relatively few writes. And that's true of a lot of systems.
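To make the "more servers, less throughput" argument concrete, here's a back-of-the-envelope cost model. This is my own toy sketch, not anything from the paper: I assume the leader is the bottleneck and each client request costs it one unit of local work plus one message per follower.

```python
# Toy cost model for leader-based replication. Assumption (mine): the
# leader is the bottleneck, and each client request costs it one unit of
# local work plus one unit per follower it must send the entry to.

def leader_throughput(n_servers, leader_capacity=1000.0):
    """Requests/sec the leader can sustain with n_servers total servers."""
    followers = n_servers - 1
    cost_per_request = 1 + followers  # local work + one message per follower
    return leader_capacity / cost_per_request

# Every server we add makes each request more expensive for the leader,
# so total throughput goes DOWN as the cluster grows.
for n in (3, 5, 7):
    print(n, leader_throughput(n))
```

Under this model a 7-server cluster handles fewer requests per second than a 3-server cluster, which is exactly the inverse-proportionality claim above.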
So maybe we'll send writes to the leader, but send reads to just one of the replicas. Just pick one of the replicas. If you have a read-only request, like a get in lab 3, send it to one of the replicas and not to the leader. Now, if we do that, we haven't helped writes much, although we've gotten a lot of read workload off the leader, so maybe that helps. But we absolutely have made tremendous progress with reads, because now the more servers we add, the more clients we can support, because we're just splitting the client read workload across the different replicas. So the question is, if we have clients send reads directly to the replicas, are we going to be happy? Yeah, up-to-date is the right word. In a Raft-like system, which ZooKeeper is, if a client sends a request to a random replica, sure, the replica has a copy of the log, it's been executing along with the leader, and, for lab 3, it's got this key value table, and if you do a get for key X it's going to have some value for key X in its table and can reply to you. So functionally, the replica has all the pieces it needs to respond to read requests from clients. The difficulty is that there's no reason to believe that any replica other than the leader is up-to-date. There's a bunch of reasons why replicas may not be up-to-date. One of them is that a replica may not have been in the majority that the leader was waiting for. If you think about what Raft is doing, the leader is only obliged to wait for responses to its AppendEntries from a majority of the followers, and then it can commit the operation and go on to the next one. So if this replica wasn't in the majority, it may never have seen our write. Maybe the network dropped it and the replica never got it.
And so, yeah, the leader and a majority of the servers have seen the first three requests, but this server only saw the first two. It's missing B, so a read that should see B will just get a stale value from this one. Even if this replica actually saw the new log entry, it might be missing the commit message. ZooKeeper's ZAB is much the same as Raft here: it first sends out a log entry, and then when the leader gets a majority of positive replies, the leader sends out a notification saying, yes, I'm committing that log entry. Our replica may not have gotten the commit. And the worst case version of this, although it's equivalent to what I already said, is that for all the client knows, this replica may be partitioned from the leader. It may just absolutely not be in contact with the leader at all. And the follower doesn't really have a way of knowing that it was cut off a moment ago from the leader and just isn't receiving anything. So without some further cleverness, if we want to build a linearizable system, attractive as it is for performance, we can't play this game of sending read requests to the replicas. And you shouldn't do it for lab 3 either, because lab 3 is also supposed to be linearizable. Any questions about why linearizability forbids us from having replicas serve reads to clients? Okay, the proof, maybe I've erased it by now, but the proof was that simple write one, write two, read one example I put on the board earlier. You're not allowed to serve stale data in a linearizable system. Okay. So how does ZooKeeper deal with this? ZooKeeper actually does serve reads from replicas. You can tell from table 2. If you look in table 2, ZooKeeper's read performance goes up dramatically as you add more servers.
So clearly ZooKeeper is playing some game here which must be allowing it to serve read-only requests from the additional servers, the replicas. So how does ZooKeeper make this safe? That's right. In fact, a replica is almost not obliged to have the latest data at all. The way ZooKeeper skins this cat is that it's not linearizable. Right? They just defined away this problem and said, well, we're not going to provide linearizable reads, and therefore ZooKeeper is not obliged to provide fresh data to reads. It's allowed by its rules of consistency, which are not linearizability, to produce stale data for reads. So it's sort of solved this technical problem with a kind of definitional wave of the wand, by saying, well, we never owed you linearizability in the first place, so it's not a bug if we don't provide it. And that's actually a pretty classic way to approach this tension between performance and strong consistency: just don't provide strong consistency. Nevertheless, we have to keep in the back of our minds the question of whether, if the system doesn't provide linearizability, it's still going to be useful. Right? You do a read and you just don't get the current correct answer, the latest data. Why do we believe that's going to produce a useful system? So let me talk about that. First of all, any questions about the basic problem? ZooKeeper really does allow clients to send read-only requests to any replica, and the replica responds out of its current state, and that replica may be lagging. Its log may not have the very latest log entries, and so it may return stale data even though there's a more recent committed value. So what are we left with? ZooKeeper does actually have a set of consistency guarantees.
These guarantees help people who write ZooKeeper-based applications reason about what's actually going to happen when they run them. And these guarantees have to do with ordering, as indeed linearizability does. So ZooKeeper has two main guarantees that they state; this is section 2.3. One of them is that writes are linearizable. Now, their notion of linearizable isn't quite the same as mine, maybe because they're talking about writes, not reads. What they really mean here is that the system behaves as if, even though clients might submit writes concurrently, it executes the writes one at a time in some order, and indeed obeys the real time ordering of writes. So if one write is seen to have completed before another write is issued, then ZooKeeper will indeed act as if it executed the second write after the first. So writes, but not reads, are linearizable. And ZooKeeper isn't a strict read/write system; there are actually writes that imply reads also, and for those mixed operations, any operation that modifies the state is linearizable with respect to all other operations that modify the state. The other guarantee is that any given client's operations execute in the order specified by the client. They call that FIFO client order. What this means is that if a particular client issues a write and then a read and then a read and a write, or whatever, then first of all the writes from that sequence fit into the overall order of all clients' writes in the client-specified order. So if the client says do this write, then that write, then a third write, then in the final order of writes we'll see that client's writes occur in the order the client specified. So for writes, this is the client-specified order. This is particularly an issue in this system because clients are allowed to launch asynchronous write requests.
That is, a client can fire off a whole sequence of writes to the ZooKeeper leader without waiting for any of them to complete. And presumably (the paper doesn't exactly say this) in order for the leader to actually be able to execute the client's writes in the client-specified order, I'm imagining that the client actually stamps its write requests with numbers, saying do this one first, this one second, this one third, and the ZooKeeper leader obeys that. So this is particularly interesting due to these asynchronous write requests. For reads it's a little more complicated. The reads, as I said before, don't go through the leader; the writes all go through the leader, but the reads just go to some replica, and so all they see is the stuff that happens to have made it to that replica's log. The way we're supposed to think about FIFO client order for reads is that if the client issues a sequence of reads, the client reads one thing and then another thing and then a third thing, then relative to the log on the replica it's talking to, the client's reads each have to occur at some particular point in the log. They need to observe the state as it existed at that particular point in the log, and furthermore the successive reads have to observe points that don't go backwards. That is, if a client issues one read and then another read, and the first read executes at this point in the log, the second read is allowed to execute at the same or a later point in the log, but is not allowed to see a previous state. If I issue one read and then another read, the second read has to see a state that's at least as up to date as the first. And that's a significant fact that we're going to harness when we're reasoning about how to write correct ZooKeeper applications.
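One plausible way the leader could enforce client order over asynchronous writes, as guessed at above, is to have the client stamp each write with a per-client sequence number and have the leader buffer out-of-order arrivals until the gap is filled. This is entirely my own sketch; `FIFOLeader`, `submit`, and the numbering scheme are hypothetical, not the real ZAB protocol.

```python
# Hypothetical sketch: a leader that appends each client's writes to the
# log in the client's own numbering order, even if the network delivers
# the requests out of order.

class FIFOLeader:
    def __init__(self):
        self.log = []        # committed operations, in log order
        self.next_seq = {}   # client id -> next sequence number expected
        self.pending = {}    # client id -> {seq: buffered op}

    def submit(self, client, seq, op):
        self.pending.setdefault(client, {})[seq] = op
        expected = self.next_seq.get(client, 0)
        # Drain every buffered op that is now in order for this client.
        while expected in self.pending[client]:
            self.log.append((client, self.pending[client].pop(expected)))
            expected += 1
        self.next_seq[client] = expected
```

For example, if a client's write number 1 arrives before its write number 0, the leader holds number 1 back until number 0 shows up, then logs both in client order.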
And where this is especially exciting is that if the client has been talking to one replica for a while and issues some reads, suppose it issued a read here and then a read there, if this replica fails and the client needs to start sending its reads to another replica, this FIFO client order guarantee still holds even though the client switched replicas. And so that means that if, before the crash, the client did a read that saw the state as of this point in the log, then when the client switches to the new replica and issues another read, that read has to execute at this point or later, even though it switched replicas. The way this works is that each of these log entries is tagged by the leader with a ZXID, which is basically just an entry number. Whenever a replica responds to a client read request, it executed the request at a particular point in the log, and it responds with the ZXID of the immediately preceding log entry back to the client. The client remembers the ZXID of the most recent data it has seen, the highest ZXID it's ever seen. And when the client sends a request to the same or a different replica, it accompanies the request with that highest ZXID. And that tells the other replica, aha, I need to respond to this request with data that's at least as recent as this point in the log. And here's what's interesting if this second replica is even less up to date, if it hasn't even received any of these entries: it receives a request from a client, and the client says, gosh, my last read executed at this spot in the log on some other replica, and so this replica needs to wait until it has gotten the entire log up to that point before it's allowed to respond to the client.
And I'm not sure exactly how that works, but either the replica just delays responding to the read, or maybe it rejects the read and says, look, I just don't have the information, talk to somebody else or talk to me later. Of course, eventually this replica will catch up if it's connected to the leader, and then it will be able to respond. Okay, so reads are ordered. They only go forward in time, or rather only forward in log order. And a further thing, which I believe is true, is that FIFO client order applies to all of a single client's requests, reads and writes together. So if I'm a client and I send a write to the leader, it takes time before that write is sent out and committed. So I may send a write to the leader, the leader hasn't processed or committed it yet, and then I send a read to a replica. The read may have to stall in order to guarantee FIFO client order; it may have to stall until the replica has actually seen and executed that client's previous write operation. So that's a consequence of FIFO client order: reads and writes are in the same order. The most obvious way to see this is if a client writes a particular piece of data, sends a write to the leader, and then immediately does a read of the same piece of data and sends that read to a replica. Boy, it had better see its own written value, right? If I write something to have value 17, and then I do a read and it doesn't have value 17, then that's just bizarre, and it's evidence that the system was not executing my requests in order, because otherwise it would have executed the write before the read. So there must be some funny business with the replica stalling. The client, when it sends a read, must say, look, the last write request I sent to the leader was ZXID such-and-such, and this replica has to wait till it sees that entry in its log. Yes?
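The ZXID bookkeeping described in the last couple of paragraphs can be sketched in a few lines. This is my own reconstruction, not ZooKeeper's code: the class names are invented, and I've chosen to have a lagging replica return `None` (reject the read) where the real system might instead delay the response.

```python
# Sketch: client remembers the highest ZXID it has seen; a replica
# refuses to serve a read from state older than that.

class Replica:
    def __init__(self):
        self.log = []    # committed (zxid, key, value) entries, in order
        self.state = {}  # key -> value after applying the log

    def apply(self, zxid, key, value):
        self.log.append((zxid, key, value))
        self.state[key] = value

    def last_zxid(self):
        return self.log[-1][0] if self.log else 0

    def read(self, key, client_zxid):
        # FIFO client order: never serve state older than what this
        # client has already observed. None means "retry or go elsewhere".
        if self.last_zxid() < client_zxid:
            return None
        return self.state.get(key), self.last_zxid()

class Client:
    def __init__(self):
        self.max_zxid = 0  # highest ZXID observed in any reply

    def read(self, replica, key):
        result = replica.read(key, self.max_zxid)
        if result is None:
            return None
        value, zxid = result
        self.max_zxid = max(self.max_zxid, zxid)
        return value
```

Note the asymmetry this captures: a client that has seen ZXID 3 cannot read from a replica stuck at ZXID 2, but a fresh client happily reads the stale value, which is exactly the stale-reads-allowed behavior discussed above.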
Oh, absolutely. So I think what you're observing is that a read from a replica may not see the latest data. The leader may have sent out C to a majority of replicas and committed it, and the majority may have executed it, but if the replica we're talking to wasn't in that majority, maybe it doesn't have the latest data, and that just is the way ZooKeeper works. It does not guarantee that reads see the latest data. So there is a guarantee about read/write ordering, but it's only per client. If I send a write in and then I read that data, the system guarantees that my read observes my write. If you send a write in and then I read the data that you wrote, the system does not guarantee that I see your write. And that's the foundation of how they get speedup for reads proportional to the number of replicas. I would say the system isn't linearizable, but it's not that it has no properties. The writes are certainly linearizable: all writes from all clients form some one-at-a-time sequence. So that's a sense in which all writes are linearizable, and maybe each individual client's operations are linearizable also. This probably means that each individual client's operations are linearizable, though I'm not quite sure; I'm actually not sure how it works, but that's a reasonable supposition. Then when I send in an asynchronous write, the system doesn't execute it yet, but it does reply to me saying, yeah, I got your write, and here's the ZXID it will have if it's committed, just like Start() returns in your Raft. So that's a reasonable theory; I don't actually know how it does it. And then the client, if it does a read, needs to tell the replica, look, the last write I did was... What does the replica return to the client? Does it give it the length of the log? Does it give the latest ZXID or the oldest one? I mean, if I do a read... It's the ZXID of the data, of the operation.
Okay, so if you send a read to a replica, notionally what the client thinks it's doing is reading a row from a table. So the client says, oh, I'm going to read this row from this table, and the replica sends back its current value for that row, plus the ZXID of the last operation that updated it. Yeah, so actually I'm not prepared to say exactly. The two things that would make sense, and I think either would be okay, are that the server could track, for every table row, the ZXID of the last write operation that touched it, or it could just return to all read requests the ZXID of the last committed operation in its log, regardless of whether that was the last operation to touch that row, because all we need is for client requests to move forward in log order. So we just need to return something that's greater than or equal to the ZXID of the write that last touched the data the client read. Alright, so these are the guarantees. We're still left with the question of whether it's possible to do reasonable programming with this set of guarantees. And the answer is, well, at a high level this is not quite as good as linearizability. It's a little bit harder to reason about, and there are more gotchas, like reads returning stale data, which can't happen in a linearizable system. But it's nevertheless good enough to make it pretty straightforward to reason about a lot of things you might want to do with ZooKeeper. So I'm going to try to construct an argument, maybe by example, for why this is not such a bad programming model. One reason, by the way, is that there's an out. There's this operation called sync, which is essentially a write operation. Suppose I know that you, a different client, recently wrote something, and I want to read what you wrote. So I actually want fresh data.
I can send in one of these sync operations, which makes its way through the system as if it were a write, eventually showing up in the logs of the replicas, or at least the replica that I'm talking to. And then I can come back and do a read, and I can tell the replica, basically, don't serve this read until you've seen my last sync. And that actually falls out naturally from FIFO client order: if we count the sync as a write, then FIFO client order says reads are required to see state that's at least as up to date as the last write from that client. And so if I send in a sync and then I do a read, the system is obliged to give me data that's at least as up to date as where my sync fell in the log order. Anyway, if I need to read up-to-date data: send in a sync, then do a read, and the read is guaranteed to see data as of the time the sync was entered into the log. So reasonably fresh. So that's one out. But it's an expensive one, because you've now converted a cheap read into a sync operation which burns up time on the leader. So it's a no-no to use it if you don't have to. But here are a couple of examples of scenarios that the paper talks about, where the reasoning is simplified, or at least reasonably simple, given the rules that are here. So first I want to talk about the trick in section 2.3 with the ready file, where we assume there's some master, and the master is maintaining a configuration in ZooKeeper, that is, a bunch of files in ZooKeeper that describe something about our distributed system, like the IP addresses of all the workers, or who the master is, or something. So we have a master who's updating this configuration, and maybe a bunch of readers that need to read the current configuration and need to see it every time it changes.
And so the question is, can we construct something so that, even though the configuration is split across many files in ZooKeeper, we get the effect of an atomic update, so that workers never see a partially updated configuration, but only a completely updated one? That's a classic kind of thing people use ZooKeeper for, this configuration management. So, copying how section 2.3 describes it, we'll say the master does a bunch of writes to update the configuration, and here's the order in which the master does the writes. First, we're assuming there's a file named ready. If the ready file exists, then we're allowed to read the configuration; if the ready file is missing, that means the configuration is being updated and we shouldn't look at it. So if the master is going to update the configuration, the very first thing it does is delete the ready file. Then it writes the various files that hold the data for the configuration, might be a lot of files, who knows. And then, when it has completely updated all the files that make up the configuration, it creates the ready file again. So far the semantics are actually extremely straightforward. There are only writes here, no reads, and writes are guaranteed to execute in linear order. And I guess we also have to appeal to FIFO client order: if the master tags these writes, saying I want my writes to occur in this order, then the leader is obliged to enter them into the replicated log in that order, and so the replicas will all dutifully execute them one at a time. They'll all delete the ready file, then apply this write and that write, and then create the ready file again. So these are writes; the order is straightforward.
For the reads, though, a little more thinking is required. Suppose we have some worker that needs to read the current configuration. We're going to assume that this worker first checks whether the ready file exists. If it doesn't exist, the worker sleeps and tries again, so let's assume it does exist, and let's assume the worker checks whether the ready file exists after the master has recreated it. Note that the master's operations are all write requests sent to the leader, while this exists call is a read request that's just sent to whatever replica the client is talking to. And then, if ready exists, the worker is going to read F1 and read F2. The interesting thing that FIFO client order guarantees here is that if this exists returned true, if the replica the client was talking to said yes, that file exists, then, at least with this setup, that replica had actually seen the re-create of the ready file. And because successive read operations are required to march along only forwards in the log, never backwards, that means that if the replica's log actually contained, and it had executed, this create of the ready file, then subsequent client reads must move only forward in the sequence of writes that the leader put into the log. So if we saw this ready file, the subsequent reads execute down here somewhere, after the write that created ready, and that means the reads are guaranteed to observe the effects of these configuration writes. So we do actually get some reasoning benefit from the fact that, even though the system is not fully linearizable, the writes are linearizable and the reads have to move monotonically forward in the log.
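The write side and the naive, watch-free read side of the ready-file protocol can be sketched like this, using a plain dict as a stand-in for ZooKeeper's znodes. The function names and the dict "file system" are my own; replication, staleness, and the broken interleaving discussed next are deliberately left out.

```python
# Sketch of the section 2.3 ready-file convention over a dict-backed
# pretend file system. delete/create/exists/read are stand-ins for the
# corresponding ZooKeeper operations.

def master_update_config(fs, new_config):
    """Hide, rewrite, reveal: the master's write order."""
    fs.pop("ready", None)            # 1. delete ready: config now invalid
    for name, data in new_config.items():
        fs[name] = data              # 2. write each configuration file
    fs["ready"] = True               # 3. recreate ready: config valid again

def worker_read_config(fs, names):
    """Return the config files, or None if an update is in progress."""
    if "ready" not in fs:            # exists("ready")?
        return None                  # a real worker would sleep and retry
    return {name: fs[name] for name in names}
```

Because the master's three steps are writes in a fixed order, a worker that sees ready can only be looking at a fully written configuration, as long as its reads march forward in the log. The next paragraph shows the interleaving where that breaks down.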
Yes, so that's a great question. Your question is, well, all this client knows... if this is the real scenario, that the create is entered in the log and then the read arrives at the replica after that replica executed this create of ready, then everything is straightforward. But there are other possibilities for how this stuff could be interleaved, so let's look at a much more troubling scenario, the one you brought up, which I happen to be prepared to talk about. Way back in time, some previous master created the ready file after it finished updating the state, so the ready file existed for a while. Then some new master needs to change the configuration, so it deletes the ready file and starts writing the new configuration files. What's really troubling is that the client that needs to read this configuration might have called exists to see whether the ready file existed at this earlier time, right, and at that point in time, yeah, sure, the ready file exists. Then time passes, and the client issues the reads. Maybe the client reads the first file that makes up the configuration, but then it reads the second file, and maybe that second read comes totally after the master has been changing the configuration. So now this reader has read a damaged mix of F1 from the old configuration and F2 from the new configuration, and there's no reason to believe that contains anything other than broken information. So the first scenario was great; this scenario is a disaster. And now we're starting to get into serious challenges, which a really well designed API for coordination between machines in a distributed system might actually help us solve. Because, for lab 3, you're going to build a put/get system, and a simple lab 3 style put/get system would run into this problem too, and it just does not have any tools to deal with it. But the
ZooKeeper API actually is more clever than this, and it can cope with it. So what actually happens, the way you would actually use ZooKeeper, is that when the client sends in this exists request to ask whether the file exists, it would say not only tell me whether it exists, but also set a watch on that file, which means: if the file is ever deleted (or, if it doesn't exist, if it's ever created), please send me a notification. And furthermore, for the notifications that ZooKeeper sends, remember the reader here is only talking to some replica; it's that replica doing all these things for it, and the replica guarantees to deliver a notification for a change to this ready file at the correct point relative to the responses to the client's reads. What that means, the implication, is that in this scenario, in which these writes fit in here in real time, the guarantee is that if you ask for a watch on something and then you issue some reads, and the replica you're talking to executes something that should trigger the watch during your sequence of reads, then the replica guarantees to deliver the watch notification before it responds to any read that saw the log after the point at which the operation that triggered the notification executed. So this is the log on the replica, and the FIFO client ordering rules say each client request must fit somewhere into the log. Apparently these writes fit in here in the log. What we're worried about is that this read occurs here in the log. But we set up this watch, and the guarantee is that we'll receive the notice if somebody deletes this file, and that notification will appear at the client before the result of any read that observed anything later in the log. We'll get the notification before we get the results of any read that saw the log after the operation that produced the notification. So what this means is
that the delete of ready, since we have a watch on the ready file, is going to generate a notification, and that notification is guaranteed to be delivered before the read result of F2, if that read of F2 is going to see this second write. And that means that before the reading client has finished the sequence in which it looks at the configuration, it's guaranteed to see the watch notification before it sees the results of any write that happened after the delete that triggered the notification. Who generates the watch? Well, the replica. Let's say the client is talking to this replica, and it sends in the exists request, the read-only request, to this replica. The replica is maintaining on the side a table of watches, saying such-and-such a client asked for a watch on this file. And furthermore, the watch was established at a particular ZXID: the client did a read, the replica executed the read at this point in the log and returned results relative to this point in the log, and it remembers that the watch is relative to that point in the log. And then, when a delete comes in, for every operation the replica looks in its watch table, maybe indexed by a hash of the file name or something, and says, aha, there's a watch on that file. Okay, so the question is: this replica has to have a watch table, and if the replica crashes and the client has to switch to a different replica, what about the watch table? The client has already established this watch. And the answer is no: if your replica crashes, the new replica you switch to won't have the watch table, but the client gets a notification at the appropriate point in the stream of responses it gets back, saying, oops, the replica you were talking to crashed. And so the client then knows it has to completely re-set up everything, and tucked away in the examples are otherwise-missing event handlers that say, oh gosh, we need to go back
and reestablish everything if we get a notification that our replica crashed. Alright, I'll continue this
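The replica-side watch machinery just described might look roughly like this sketch. It is my own toy model, not ZooKeeper's actual data structures; the key property it demonstrates is that applying an operation queues the watch notification into the client's response stream before the response to any later read.

```python
# Toy replica with a watch table. The outbox is the ordered stream of
# replies and notifications the replica sends back to clients; a watch
# notification is enqueued at the moment the triggering operation is
# applied, so it precedes any later read reply.

class WatchingReplica:
    def __init__(self):
        self.state = {}
        self.watches = {}   # filename -> set of watching client ids
        self.outbox = []    # ordered responses/notifications to clients

    def exists(self, client, name, watch=False):
        if watch:
            self.watches.setdefault(name, set()).add(client)
        result = name in self.state
        self.outbox.append(("reply", client, "exists", result))
        return result

    def read(self, client, name):
        value = self.state.get(name)
        self.outbox.append(("reply", client, "read", value))
        return value

    def apply(self, op, name, value=None):
        # Fire watches BEFORE any later read response can be emitted.
        for client in self.watches.pop(name, set()):
            self.outbox.append(("notify", client, op, name))
        if op == "delete":
            self.state.pop(name, None)
        else:
            self.state[name] = value
```

In the ready-file scenario, the client's exists-with-watch reply, then the delete notification, then any post-delete read replies appear in the outbox in exactly that order, so the client learns the configuration is changing before it can consume a read of the new, mixed state.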