All right, let's get started. Today and tomorrow, I'm going to talk about Raft, both because I hope it will be helpful for you in implementing the labs, and also because it's a case study in the details of how to get state machine replication correct. So by way of introduction to the problem, you may have noticed a pattern in the fault-tolerant systems we've looked at so far. MapReduce replicates computation, but the replication is controlled: the whole computation is controlled by a single master. Another example I'd like to draw your attention to is that GFS replicates data, with its primary/backup scheme for replicating the actual contents of files, but it relies on a single master to choose who the primary is for every piece of data. A third example: VMware FT replicates computation on a primary virtual machine and a backup virtual machine, but in order to figure out what to do if one of them seems to fail, it relies on a single test-and-set server to ensure that exactly one of the primary or the backup takes over. So in all three of these cases, sure, there was a replication system, but tucked away in a corner of the replication system there was some scheme where a single entity was required to make the critical decision about who the primary was. A very nice thing about having a single entity decide who's going to be the primary is that it can't disagree with itself: there's only one of it, it makes some decision, and that's the decision. But the bad thing is that the single entity is itself a single point of failure. So you can view these systems as pushing the real heart of the fault tolerance machinery into a little corner, namely the single entity that decides who's going to be the primary if there's a failure. Now, this whole thing is about how to avoid split brain. The reason we have to be extremely careful about making the decision about who should be the primary after a failure is that otherwise we risk split brain. And just to make this point super clear, I'm going to remind you what the problem is and why it's a serious problem. Suppose, for example, we want to build ourselves a replicated test-and-set server. That is, we're worried about the fact that VMware FT relies on this test-and-set server to choose who the primary is, so let's build a replicated test-and-set server. Now, what I'm about to describe is going to be broken; it's just an illustration of why it's difficult to handle the split brain problem correctly. So we're going to imagine we have a network, maybe two servers connected to it which are supposed to be replicas of our test-and-set service, and maybe two clients that need to know who the primary is right now. Maybe these clients are actually the primary and the backup in VMware FT. Both of these servers start out with their state, the test-and-set flag, being zero. The one operation clients can send is the test-and-set operation, which is supposed to set the flag of the replicated service to one, on both copies, and then return the old value. So it essentially acts as a kind of simplified lock server.
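To pin down the semantics we're trying to replicate, here's a minimal sketch of a single, non-replicated test-and-set server in Go (the labs' language); the type and method names here are mine, not from any lab:

```go
package main

import (
	"fmt"
	"sync"
)

// TestAndSetServer holds the entire service state: one flag.
type TestAndSetServer struct {
	mu   sync.Mutex
	flag int
}

// TestAndSet sets the flag to 1 and returns the previous value.
// The first caller sees 0 (it "holds the lock" and may act as primary);
// every later caller sees 1 and must not take over.
func (s *TestAndSetServer) TestAndSet() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	old := s.flag
	s.flag = 1
	return old
}

func main() {
	s := &TestAndSetServer{}
	fmt.Println(s.TestAndSet()) // 0: first client wins
	fmt.Println(s.TestAndSet()) // 1: second client loses
}
```

The whole difficulty in what follows is making two copies of that flag behave like one, even when servers or the network fail.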
OK, so the problem situation we worry about, split brain, arises when a client can talk to one of the servers but can't talk to the other. Let's assume the protocol is that the client is ordinarily supposed to send any request to both servers; it almost doesn't matter. And somehow we need to think through what the client should do if one of the servers doesn't respond, or what the system should do if one of the servers seems to be unresponsive. So let's imagine now that client one can contact server one but not server two. How should the system react? One possibility is that we think: well, gosh, we certainly don't want the client to just talk to server one, because that would leave the second replica inconsistent — we'd set this value to one but not also set that value to one. So maybe the rule should be that the client is always required to talk to both servers for any operation, and shouldn't be allowed to just talk to one of them. So why is that the wrong answer? If the rule is that in our replicated system the client is always required to talk to both replicas in order to make progress, that's not fault tolerant at all. In fact, it's worse than talking to a single server, because now the system has a problem if either of the servers is crashed or unreachable. With a non-replicated server, you're only depending on one server; here, both servers have to be alive. So we can't possibly require the client to wait for both servers to respond: if we want fault tolerance, it needs to be able to proceed anyway. So another obvious answer is that if the client can't talk to both, it just talks to the one it can talk to and figures the other one's dead. Why is that also not the right answer? The troubling scenario is if the other server is actually alive. Suppose the actual problem we're encountering is not that the server crashed — which would be good for us — but the much worse issue that something went wrong with a network cable, so that client one can talk to server one but not server two, and there's some other client out there that can talk to server two but not server one. If we make the rule that when a client can't talk to both servers, it's OK, in order to be fault tolerant, to just talk to the one it can reach, then here's what's inevitably going to happen: the cable breaks, cutting the network in half. Client one sends a test-and-set request to server one; server one sets its state to one and returns the previous value of zero to client one. So client one will think it has the lock, and if it's a VMware FT machine, it'll think it can take over as primary. But the other replica still has zero in it. So now if client two also sends a test-and-set request — it tries to send to both, sees that server one appears to be down, and follows the rule that says you just send to the one server you can talk to — then client two will also think it acquired the lock. And now, if we imagine this test-and-set service being used with VMware FT, both of these virtual machines think they can be primary by themselves, without consulting the other.
So that's a complete failure. With this setup and two servers, it seemed like we just had to choose: either you wait for both, and you're not fault tolerant, or you wait for just one, and you're not correct. The not-correct version is what's called split brain. Everybody see this? Well, this was basically where things stood until the late 80s. But people did want to build replicated systems — like the computers that control telephone switches, or the computers that ran banks. Those were places where people would spend a huge amount of money in order to have reliable service, and so they would build replicated systems. And the way they would have replication but try to rule out split brain came down to a couple of techniques. One is that they would build a network that could not fail. And in fact, you use networks that essentially cannot fail all the time: the wires inside your laptop connecting the CPU to the DRAM are effectively a network that cannot fail. So with reasonable assumptions, lots of money, and a carefully controlled physical situation — you don't want a cable sneaking across the floor that somebody can step on; it's got to be a carefully, physically designed setup — with a network that cannot fail, you can rule out split brain. It's a bit of an assumption, but with enough money people get quite close to this. Because if the network cannot fail, then if the client can't talk to server two, server two must be down — it can't have been the network malfunctioning. So that was one way people built replication systems that didn't suffer from split brain. Another possibility would be to have some human being sort out the problem. That is, don't automatically do anything; by default, clients always have to wait for both replicas to respond, never allowed to proceed with just one of them. But you can cause somebody's beeper to go off; some human being goes to the machine room, looks at the two replicas, and either turns one off to make sure it's definitely dead, or verifies that one of them has indeed crashed and that the other is alive. So you're essentially using the human as the tiebreaker, and the human is essentially a single point of failure themselves. For a long time, people used one or the other of these schemes to build replicated systems, and they can be made to work: the humans don't respond very quickly, and the network that cannot fail is expensive, but it's doable. But it turned out that you can actually build automated failover systems that work correctly in the face of flaky networks — networks that can fail and that can partition. A split of the network in half, where the two sides operate but can't talk to each other, is usually called a partition. And the big insight people came up with in order to build automated replication systems that don't suffer from split brain is the idea of majority vote. This is a concept that shows up in practically every other sentence of the Raft paper; it's the fundamental way of proceeding. The first step is to have an odd number of servers instead of an even number. One flaw in the two-server picture is that it's a little bit too symmetric, right? The two sides of the split just look the same.
They run the same software, they're going to do the same thing, and that's not good. But if you have an odd number of servers, then it's not symmetric anymore: a single network split will presumably leave two servers on one side and one server on the other, and they won't be symmetric at all. That's part of what majority voting schemes are appealing to. So the basic idea is: you have an odd number of servers, and in order to make progress of any kind — in Raft, to elect a leader or to cause a log entry to be committed — at each step you have to assemble a majority of the servers, more than half of all the servers, to approve that step: vote for a leader, or accept a new log entry and commit it. The most straightforward arrangement is three servers, with two out of the three required to do anything. One reason this works, of course, is that if there's a partition, there can't be more than one partition with a majority of the servers in it. A partition can have one server in it, which is not a majority, or it can have two; but if one partition has two, then the other partition has only one server in it and therefore will never be able to assemble a majority and won't be able to make progress. And just to be totally clear, when we're talking about a majority, it's always a majority out of all of the servers, not just the live servers. This is a point that confused me for a long time: if you have a system with three servers and some of them have failed, a majority is still two out of three, even if you know that one has failed. The majority is always out of the total number of servers. There's a more general formulation of this. A majority voting system in which two out of three are required to make progress can survive the failure of one server, since any two servers are enough to make progress. If you're worried about how reliable your servers are, you can build systems with more servers, and the more general formulation is: if you have 2F+1 servers, you can withstand F failures. With three servers, F is one: you can tolerate one failure and still keep going. These are often called quorum systems, because the two out of three is sometimes called a quorum. Okay, so one property I've already mentioned about these majority voting systems is that at most one partition can have a majority, and therefore if the network is partitioned, we can't have both halves making progress. Another, more subtle thing going on here is that if you always need a majority of the servers to proceed, and you go through a succession of operations in which for each operation somebody assembled a majority — like votes for leaders in Raft — then at every step, the majority you assemble for that step must contain at least one server that was in the previous majority. That is, any two majorities overlap in at least one server, and it's really that property, more than anything else, that Raft is relying on to avoid split brain.
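To make that overlap property concrete, here's the one-line counting argument, written out in LaTeX (my notation, not the paper's): any two majorities $M_1$ and $M_2$ out of $2F+1$ servers must intersect, because

```latex
|M_1| + |M_2| \;\ge\; (F+1) + (F+1) \;=\; 2F+2 \;>\; 2F+1,
\qquad \text{so} \qquad M_1 \cap M_2 \neq \emptyset .
```

For example, with five servers (F = 2), any two majorities of three must share at least one server.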
So the fact that, for example, when you have a successful leader election, the new leader assembled votes from a majority means its majority is guaranteed to overlap with the previous leader's majority. For example, the new leader is guaranteed to know about the term number used by the previous leader, because its majority overlaps with the previous leader's majority, and everybody in the previous leader's majority knew the previous leader's term number. Similarly, anything the previous leader could have committed must be present on a majority of the servers in Raft, and therefore any new leader's majority must overlap, in at least one server, with the servers holding every committed entry from the previous leader. This is a big part of why Raft is correct. Any questions about the general concept of majority voting systems? Yeah — is it possible to add servers? It is; a section of the paper, section 6 I believe, explains how to add to or change the set of servers. And you need to be able to do it in a long-running system: if you're running a system for five or ten years, you're going to need to replace the servers after a while — one of them fails permanently, or you upgrade, or you move to a different machine room. So you really do need to be able to support changing the set of servers. It certainly doesn't happen every day, but it's a critical part of the long-term maintainability of these systems. And the Raft authors pat themselves on the back a bit that they have a scheme that deals with this — as well they might, because it's complex. All right, so using this idea, in about 1990 or so, there were two systems proposed at about the same time that realized you could use this majority voting idea to get around the apparent impossibility of avoiding split brain — basically by using three servers instead of two and taking majority votes. One of these very early systems was called Paxos; the Raft paper talks about it a lot. Another was called Viewstamped Replication, which I'll abbreviate as VSR. And even though Paxos is by far the more widely known system in this department, Raft is actually closer in design to Viewstamped Replication, which was invented by people at MIT. So there's a many-decade history of these systems, and they only really came to the forefront and started being used a lot in deployed, big distributed systems about 15 years ago — a good 15 years after they were originally invented. Okay, so let me talk about Raft now. Raft takes the form of a library intended to be included in some service application. If you have a replicated service, each of the replicas is going to be some application code, which receives RPCs or something, plus a Raft library, and the Raft libraries cooperate with each other to maintain replication. So a software overview of a single Raft replica: at the top we can think of the replica as having the application code — for lab three, it might be a key value server. And the application has state that Raft is helping it manage, replicated state; for a key value server, that's going to be a table of keys and values. The next layer down is the Raft layer. The key value server makes function calls into Raft, and they chitchat back and forth a little bit. And Raft keeps a little bit of state.
You can see it in figure two. For our purposes, really the most critical piece of state is that Raft has a log of operations. In a system with three replicas, we're going to have three servers with exactly the same structure and, hopefully, the very same data sitting at both layers. Outside of this, there are going to be clients — client one, client two, a whole bunch of clients. The clients are just external code that needs to be able to use the service, and the hope is that the clients won't really need to be aware that they're talking to a replicated service: to the clients, it'll look almost like one server, and they talk to that one server. The clients send requests to the application layer of the current leader, the replica that's the current leader in Raft. These are application-level requests; for a key value server, they might be put and get requests — put takes a key and a value and updates the table, and get asks the service for the current value corresponding to some key. So this has nothing much to do with Raft; it's just client/server interaction for whatever service we're building. But once one of these requests gets sent from the client to the server, here's what actually happens. On a non-replicated server, the application code would just execute the request — say, update the table in response to a put. Not in a Raft-replicated service. Instead, assuming the client sends the request to the leader, the application layer simply sends the client's request down into the Raft layer and says: look, here's a request, please get it committed into the replicated log and tell me when you're done. At this point, the Rafts chitchat with each other until all the replicas, or a majority of the replicas, get this new operation into their logs, so that it's replicated. And when the leader knows that a majority of the replicas have a copy of it, only then does Raft send a notification back up to the key value layer saying: aha, that operation you sent me a minute ago has been committed into the replicas' logs, so it's safely replicated, and at this point it's okay to execute it. So: the client sends a request to the key value layer; the key value layer does not execute it yet, because it hasn't been replicated; only when it's in the logs of enough replicas does Raft notify the leader, and only then does the leader actually execute the operation — which for a put means updating the key value table, and for a get means reading the correct value out of the table — and finally send the reply back to the client. That's the ordinary operation of the system. A question came up about whether an operation has to be in all of the logs before it's committed: it's committed if it's in a majority. And the reason it can't require all of them is that if we want a fault-tolerant system, it has to be able to make progress even if some of the servers have failed. So yes — it's committed when it's in a majority. Another question: do the replicas, not just the leader, also execute the operations? Yeah, I left that out, sorry. In addition, when the operation is finally committed, each of the replicas sends the operation up.
That is, each replica's Raft library layer sends the operation up to the local application layer, and the local application layer applies the operation to its state. So hopefully all the replicas see the same stream of operations: they show up in these up-calls in the same order, they get applied to the state in the same order, and assuming the operations are deterministic — which they'd better be — the replicated state will evolve identically on all the replicas. Typically this table is what the paper means when it talks about state. A different way of viewing this interaction, and notation that will come up a lot in this course, is a time diagram of how the messages flow. Let's imagine we have a client, server one is the leader, and we also have server two and server three; time flows downward in the diagram. The client sends the original request — just an ordinary, let's say, put request — to server one. After that, server one's Raft layer sends an AppendEntries RPC to each of the two other replicas. The leader now waits for replies from the other replicas, and as soon as positive replies from a majority arrive — a majority including the leader itself, so in a system with only three replicas the leader only has to wait for one other replica to respond positively to an AppendEntries — the leader executes the command, figures out what the answer is (for a get, say), and sends the reply back to the client. Meanwhile, of course, if the other replica is actually alive, it'll send back its response too, but we're not waiting for it, although it's useful to know about. All right, everybody see this? This is the ordinary operation of the system, no failures here. Okay, gosh — I left out an important step. At this point the leader knows: I got it into a majority of logs, I can go ahead and execute it and reply to the client, because it's committed. But server two doesn't know anything yet. It just knows: well, I got this request from a leader, but I don't know if it's committed. That depends, for example, on whether my reply got back to the leader; for all server two knows, its reply was dropped by the network, the leader never heard it, and never decided to commit this request. So there's actually another stage: once the leader realizes that a request is committed, it needs to tell the other replicas that fact, so there's an extra message here. Exactly what that message is depends a little bit on what else is going on. In Raft, at least, there's not an explicit commit message; instead, the information is piggybacked inside the next AppendEntries RPC the leader sends out, for whatever reason — there's a leaderCommit field in that RPC. The next time the leader needs to send a heartbeat, or needs to send out an AppendEntries because some different client request arrived, it'll send out the new, higher leaderCommit value, and at that point the replicas will execute the operation and apply it to their state. Yes — so this is a protocol with quite a bit of chitchat in it, and it's not super fast.
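Just to make the RPC concrete: here's a sketch, in Go, of the AppendEntries arguments and reply, with fields taken from the paper's figure 2 (the Go spelling and the LogEntry type are mine):

```go
package raft

// LogEntry is one client operation, tagged with the term in which
// the leader first received it.
type LogEntry struct {
	Term    int
	Command interface{}
}

// AppendEntriesArgs carries new log entries from the leader; with an
// empty Entries slice it doubles as the heartbeat.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderId     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry just before Entries
	PrevLogTerm  int        // term of the PrevLogIndex entry
	Entries      []LogEntry // entries to append (empty for heartbeat)
	LeaderCommit int        // the piggybacked commit point described above
}

type AppendEntriesReply struct {
	Term    int  // follower's term, so a stale leader can notice and step down
	Success bool // true if the follower matched PrevLogIndex/PrevLogTerm
}
```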
Indeed it's not fast: the client sends a new request, the request has to get to the server, the server sends messages to at least one other replica and has to wait for the responses, then sends something back. So there are a bunch of message round trip times embedded here. Yeah — so exactly when the leader sends out the updated commit index is up to you as the implementer, actually. If client requests come in only very occasionally, the leader may want to send out a heartbeat or a special AppendEntries message. If client requests come quite frequently, it doesn't matter, because if a thousand arrive per second, geez, there'll be another one along very soon, and you can piggyback: without generating an extra message, which is somewhat expensive, you can get the information out on the next message you were going to send anyway. In fact, I don't think the time at which the replicas execute the request is critical, because nobody's waiting for it. At least if there are no failures, the replicas executing the request isn't really on the critical path: the client isn't waiting for them, it's only waiting for the leader to execute. So exactly how this gets staged may not affect client-perceived latency. One question you should ask is: why is the system so focused on logs? What are the logs doing? It's worth trying to come up with explicit answers to that. One answer is that the log is the mechanism by which the leader orders operations. It's vital for these replicated state machines that all the replicas apply not just the same client operations to their state, but the same operations in the same order. And the log, among many other things, is part of the machinery by which the leader assigns an order to the incoming client operations. If ten clients send operations to the leader at the same time, the leader has to pick an order and make sure all the replicas obey that order, and the fact that the log has numbered slots is part of how the leader expresses the order it's chosen. Another use of the log: between the time a follower receives an operation and the time it learns the operation is committed, it cannot execute it, and it has to put the operation aside somewhere until the incremented leaderCommit value comes in. So on the followers, the log is the place where a follower sets aside operations that are still tentative — that have arrived but are not yet known to be committed, and may have to be thrown away, as we'll see. The dual of that use, on the leader's side, is that the leader needs to remember operations in its log because it may need to retransmit them to followers. If some follower is offline — maybe something briefly happened to its network connection — it misses some messages, and the leader needs to be able to resend the log entries that any follower has missed. So the leader needs a place to set aside copies of client requests, even ones it has already executed, in order to be able to resend them to replicas that missed them.
And a final reason for all of them to keep the log is that, at least in the world of figure two, if a server crashes and restarts, it needs to rejoin. You really want a server that crashes to in fact restart and rejoin the Raft cluster; otherwise you're now operating with only two out of three servers and you can't survive any more failures. We need to reincorporate failed and rebooted servers, and the log is what a rebooted server uses. One of the rules is that each Raft server needs to write its log to its disk, where it will still be after the server crashes and restarts, and that log is what the server uses: it replays the operations in the log from the beginning to recreate its state as of when it crashed, and then it carries on from there. So the log is also used as part of the persistence plan, as a sequence of commands for rebuilding the state. Yes — a question: what happens in scenarios where the followers execute way slower than the rate at which the leader executes? Okay, so suppose the leader is capable of executing a thousand client commands a second and the followers are only capable of executing a hundred per second, as a sustained rate at full speed. One thing to note is that the followers acknowledge commands before they execute them, so the rate at which they acknowledge and accumulate stuff in their logs is not limited by their execution speed — maybe they can acknowledge a thousand requests per second. If they do that forever, they will build up unbounded-sized logs, because their execution rate falls an unbounded amount behind the rate at which the leader is feeding them messages, under the rules of our game. And what that means is they will eventually run out of memory: after they fall a billion log entries behind, they'll call the memory allocator for space for a new log entry and it will fail. Raft doesn't have the flow control that's required to cope with this. So I think in a real system you would probably need some kind of additional communication — piggybacked, it doesn't need to be real time — that says: here's how far I've gotten in execution, so that the leader can notice when it's too many thousands of requests ahead of the point the followers have executed to. In a production system that you're pushing to the absolute max, you might well need an extra message to throttle the leader if it gets too far ahead. Another question: what does a server know after a crash? If one of these servers crashes, it has the log that it persisted to disk, because that's one of the rules of figure two. So the server will be able to read the log back from disk, but of course it doesn't know how far it got in executing the log. And at least when it first reboots, by the rules of figure two, it doesn't even know how much of the log is committed. So the first answer is that immediately after a restart, after a server crashes, restarts, and reads its log, it is not allowed to do anything with the log, because it does not know how far the system has committed. Maybe its log has 1,000 uncommitted entries and zero committed entries, for all it knows. If the leader dies too, of course, that doesn't help either — but let's suppose they've all crashed.
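Since the persisted state just came up, here's a minimal sketch of saving and restoring figure 2's non-volatile state (currentTerm, votedFor, and the log). This version uses the standard library's encoding/gob and a plain file; the labs use their own labgob encoder and a Persister object instead, so treat the names and layout here as illustrative assumptions:

```go
package main

import (
	"bytes"
	"encoding/gob"
	"os"
)

type LogEntry struct {
	Term    int
	Command string
}

// persistentState mirrors the fields figure 2 marks as non-volatile.
type persistentState struct {
	CurrentTerm int
	VotedFor    int
	Log         []LogEntry
}

// save writes the state to disk so a crashed server can recover it.
func save(path string, st persistentState) error {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(st); err != nil {
		return err
	}
	return os.WriteFile(path, buf.Bytes(), 0o600)
}

// load reads the state back after a restart; the server then replays
// st.Log from the beginning to rebuild its application state.
func load(path string) (persistentState, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return persistentState{}, err
	}
	var st persistentState
	err = gob.NewDecoder(bytes.NewReader(data)).Decode(&st)
	return st, err
}

func main() {
	_ = save("raft-state.gob", persistentState{
		CurrentTerm: 3,
		VotedFor:    1,
		Log:         []LogEntry{{Term: 3, Command: "put x 1"}},
	})
	st, _ := load("raft-state.gob")
	_ = st
}
```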
This is getting a bit ahead of me, but suppose they've all crashed. All they have is the state that was marked as non-volatile in figure two, which includes the log and the latest term. So if they all crash and they all restart, none of them initially knows how far it had executed before the crash. What happens is they do a leader election, and one of them gets picked as leader. And that leader — if you track through what figure two says about how AppendEntries is supposed to work — will actually figure out, as a byproduct of sending out AppendEntries (sending out the first heartbeat, really), what the latest point is that a majority of the replicas agree on in their logs, because that's the commit point. Another way of looking at it: once you choose a leader, the AppendEntries mechanism has the leader force all of the other replicas to have logs identical to the leader's. And at that point — plus a little bit extra that the paper explains — since the leader knows it has forced all the replicas to have logs identical to its own, it knows that all those log entries must be committed, because they're held on a majority of replicas. At that point, the AppendEntries code described in figure two has the leader advance its commit point, and everybody can then execute the entire log from the beginning and recreate their state from scratch — possibly extremely laboriously. That's what figure two says. Obviously this re-executing from scratch is not very attractive; we'll see tomorrow that a more efficient version of this uses checkpoints, and we'll talk about it then. Okay, so that was the sequence of events in ordinary, non-failure operation. Another thing I want to briefly mention is what this interface looks like. You've probably all seen a bit of it from working on the labs, but roughly speaking, we have the key value layer with its state and the Raft layer underneath it, and on each replica there are really two main pieces to the interface between them. One is the method by which, when a client sends in a request, the key value layer hands it to Raft and says: please fit this request into the log somewhere. That's the Start function that you'll see in raft.go, and it really just takes one argument, the client command — the key value layer saying: I got this command, please stick it into the log and tell me when it's committed. The other piece of the interface is that, by and by, the Raft layer will notify the key value layer: aha, that operation you sent me in a Start call a while ago — which may well not be the most recent Start; a hundred client commands could come in and cause calls to Start before any of them are committed — has been committed. This upward communication takes the form of a message on a Go channel that the Raft library sends on and the key value layer reads from: it's called the apply channel, and on it Raft sends apply messages. And of course the key value layer needs to be able to match up the messages it receives from the apply channel with the calls to Start that it made.
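Here's roughly what that interface looks like — the shapes below follow the 6.824 lab skeleton, but details vary by lab version, so take this as a sketch rather than the exact signatures:

```go
package raft

// Raft is the per-replica library state (log, current term, and so on);
// details elided here.
type Raft struct {
	applyCh chan ApplyMsg
	// ...
}

// ApplyMsg is what Raft sends on the apply channel once a log entry
// is committed, on every replica.
type ApplyMsg struct {
	CommandValid bool        // true if this message carries a committed command
	Command      interface{} // the client operation itself
	CommandIndex int         // the log slot it was committed at
}

// Start hands Raft a new client command. It returns immediately,
// before the command is committed -- or even guaranteed ever to commit.
func (rf *Raft) Start(command interface{}) (index int, term int, isLeader bool) {
	// Sketch: if this server isn't the leader, return isLeader == false so
	// the caller can go find the real leader. Otherwise append the command
	// to the local log at `index` in the current `term`, start replicating,
	// and later announce commitment via an ApplyMsg on the apply channel.
	return
}
```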
And so the Start function actually returns enough information for that matching up to happen. It returns the index in the log where, if this command is ever committed — which it might not be — it will be committed. And I think it also returns the current term, and some other stuff we don't care about very much. Then the apply message contains both the command and the index. All the replicas get these apply messages, so they'll all know: oh, I should figure out what this command means and apply it to my local state. They also get the index; the index is really only useful on the leader, so it can figure out which client request we're talking about. A question came up about when the Raft leader replies, and at what point a client making a get request to that same leader is guaranteed to see a value. Okay, I think I'll answer a slightly different question. Suppose the client sends in a request — a put or a get, it doesn't really matter, say a get — and waits for a response. The point at which the leader will send a response at all is after the leader knows that command is committed; only then does it send the get reply. So the client doesn't see anything back before that. In terms of the actual software stack, that means the RPC arrives, the key value layer calls the Start function, the Start function returns to the key value layer, but the key value layer does not yet reply to the client, because it hasn't executed the client's request yet. It doesn't even know if it ever will, because it's not sure the request is going to be committed. One situation in which it may not be committed is if the leader gets a request, calls Start, and crashes immediately after Start returns: it certainly hasn't sent out its AppendEntries messages, so nothing's been committed yet. So the game is: Start returns, time passes, the apply message corresponding to that client request appears to the key value server on the apply channel, and only then does the key value server execute the request and send a reply. All of this doesn't matter much when everything goes well, but we're now at the point where we start worrying about failures, and we're extremely interested in: if there was a failure, what did the client see? All right, one thing that has come up, and that all of you should be familiar with, is that the logs may not be identical — there are a whole bunch of situations in which, at least for brief periods of time, the ends of the different replicas' logs can diverge. For example, if a leader starts to send out a round of AppendEntries but crashes before it's able to send all of them, the replicas that got the AppendEntries will have appended the new log entry, and the ones that didn't get the RPC won't have. So it's easy to see that the logs are going to diverge sometimes. The good news is that the way Raft works ends up forcing the logs to be identical after a while.
There may be transient differences, but in the long run all the logs will be modified by the leader until the leader ensures they're all identical, and only then are entries executed. Okay, so there are really two big topics to talk about here for Raft: one is how leader election works, which is in lab two, and the other is how the leader deals with the different replicas' logs, particularly after a failure. First I want to talk about leader election. A question to ask is: how come the system even has a leader? Why do we need one? Part of the answer is that you do not need a leader to build a system like this. It's possible to build an agreement system, in which a cluster of servers agrees on the sequence of entries in a log, without any kind of designated leader — and indeed the original Paxos system the paper refers to did not have a leader. So it's possible. The reason Raft has a leader — there are probably a lot of reasons, but one of the foremost — is that you can build a more efficient system in the common case where the servers don't fail. With a designated leader, where everybody knows who the leader is, you can basically get agreement on a request with one round of messages per request, whereas leaderless systems have more the flavor of: a first round to agree on a sort of temporary leader, then a second round to actually send out the request. So the use of a leader probably speeds up the system by a factor of two, and it also makes it easier to think about what's going on. Raft goes through a sequence of leaders, and it uses term numbers to disambiguate which leader we're talking about. It turns out the followers don't really need to know the identity of the leader; they really just need to know what the current term number is. Each term has at most one leader; that's a critical property. For every term, there might be no leader, or there might be one, but there cannot be two leaders in the same term. How do leaders get created in the first place? Every Raft server keeps an election timer, which is basically just a time it has recorded, with the rule: if that time arrives without the server having heard any message from the current leader, the server assumes the current leader is probably dead and starts an election. So: election timer; if it expires, start an election. And what it means to start an election is that the server increments its term and becomes a candidate, forcing a new election. It first increments its term because it wants there to be a new leader — namely itself — and a term can't have more than one leader, so we've got to start a new term in order to have a new leader. And then it sends out a full round of RequestVote RPCs — well, you may only have to send out n minus one of them, because one of the rules is that a candidate always votes for itself in its election. One thing to note about this is that it's not quite the case that if the leader didn't fail, we won't have an election.
If the leader does fail, then we will have an election — assuming any other server is up — because eventually the other servers' election timers will go off. But if the leader didn't fail, we might still, unfortunately, get an election: if the network is slow or drops a few heartbeats, election timers may go off, and even though there was a perfectly good leader, we may nevertheless have a new election. We have to keep that in mind when we're thinking about correctness. And what that in turn means is that if there's a new election, it could easily be the case that the old leader is still hanging around and still thinks it's the leader. If there's a network partition, for example, and the old leader is alive and well in a minority partition, the majority partition may run an election — indeed a successful election — and choose a new leader, all totally unknown to the previous leader. So we also have to worry about what that previous leader is going to do, since it doesn't know there was a new election. Yes — okay, so the question is: can there be pathological cases in which, for example, one-way network communication prevents the system from making progress? I believe the answer is yes, certainly. For example, suppose the current leader's network somehow half-fails, so the leader can send out heartbeats but can't receive anything. The heartbeats it sends out are delivered, because its outgoing connection works, and they will suppress any other server from starting an election — but its apparently broken incoming wire will prevent it from hearing and executing any client commands. So it's absolutely the case that Raft is not proof against every crazy network problem that can come up. The ones I've thought about, I believe, are fixable. We could solve this one by requiring a sort of two-way heartbeat: followers are already required to reply to heartbeats, and if the leader stops seeing replies to its heartbeats, then after some amount of time in which it sees no replies, the leader decides to step down. I feel like that specific issue can be fixed, and many others can too, but you're absolutely right that very strange things can happen to networks, including some the protocol is not prepared for. Okay, so we've got these leader elections, and we need to ensure there's at most one leader per term. How does Raft do that? Raft requires that, in order to be elected for a term, a candidate get yes votes from a majority of the servers, and each server will only cast one yes vote per term. In any given term, each server votes only once, for only one candidate, so you can't have two candidates both get a majority: the majority rule causes there to be at most one winning candidate, and thus at most one leader elected per term. In addition — critically — the majority rule means you can get elected even if some servers have crashed: if a minority of servers have crashed, or are unavailable due to network problems, we can still elect a leader.
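For reference, here's a Go sketch of the vote-request RPC, with fields from the paper's figure 2 (the Go spelling is mine; the last-log fields matter for the log-comparison rule we'll get to later):

```go
package raft

// RequestVoteArgs is what a candidate sends when its election timer fires.
type RequestVoteArgs struct {
	Term         int // the candidate's (newly incremented) term
	CandidateId  int // the candidate asking for the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

// RequestVoteReply is the voter's answer; a server grants at most one
// yes vote per term.
type RequestVoteReply struct {
	Term        int  // the voter's current term, so a stale candidate can update itself
	VoteGranted bool // true means this server cast its one vote for this term
}
```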
If more than half have crashed, or are unavailable, or are in another partition, then the system will just sit there trying again and again to elect a leader, and never elect one, if there isn't in fact a majority of live servers. If an election succeeds, it would be great if everybody learned about it, so we need to ask ourselves: how do all the parties learn what happened? The server that wins an election — assuming it doesn't crash — will actually see positive responses to its RequestVote from a majority of the servers. So the candidate that wins the election knows directly: ah, I got a majority of votes. But nobody else directly knows who the winner was, or whether anybody won. The way the candidate informs the other servers is the heartbeat. The rules in figure two say: if you win an election, you're immediately required to send out an AppendEntries to all the other servers. Now, that heartbeat AppendEntries doesn't explicitly say "I won the election" or "I'm the leader for term 23"; it's a little more subtle than that. The way the information is communicated is that no one is allowed to send out an AppendEntries unless they're the leader for that term. So if I'm a server and I saw that there was an election for term 19, and then by and by I receive an AppendEntries whose term is 19, that tells me that somebody — I don't know who, but somebody — won the election for term 19. So that's how the other servers know: they receive an AppendEntries for that term. And that AppendEntries also has the effect of resetting everybody's election timer. As long as the leader is up and sends out heartbeat AppendEntries at the rate it's supposed to, every time a server receives one it resets its election timer, which suppresses anybody from becoming a new candidate. So as long as everything's functioning, the repeated heartbeats prevent any further elections. Of course, if the network fails or packets get dropped, there may nevertheless be an election; but if all goes well, we're unlikely to get one. This scheme can't fail in the sense of electing two leaders for a term, but it can fail in the sense of electing zero leaders for a term. The boring way it can fail is if too many servers are dead or unavailable or have bad network connections: if you can't assemble a majority, you can't be elected, and nothing happens. The more interesting way an election can fail is if everybody's up and no packets are dropped, but two — or say three — servers become candidates so close together in time that they split the vote between them. Suppose we have a three-replica system and all their election timers go off at the same time. Every server votes for itself, and when each of them receives a RequestVote from another server, it's already cast its vote for itself, so it says no. That means each of the three servers gets one vote, nobody gets a majority, and nobody's elected. Then their election timers will go off again — election timers are only reset by an AppendEntries, and there's no leader, so there are no AppendEntries.
They'll all have their election timers go off again, and if we're unlucky, they'll all go off at the same time, they'll all vote for themselves, and nobody will get a majority. So clearly — I'm sure you're all aware at this point — there's more to this story. The way Raft makes the possibility of split votes unlikely, but not impossible, is by randomizing the election timers. The way to think of the randomization: suppose we have a timeline I'm going to draw events on. There's some point at which everybody received the last AppendEntries, and then the leader died — let's just assume the leader sent out a last heartbeat and then died. All of the followers reset their election timers when they received that AppendEntries, probably all at about the same time, each to go off at some point in the future — but they chose different random times at which to go off. So let's suppose the dead leader is server one; server two and server three at this point each set their election timer for a random point in the future — say server two's timer will go off here, and server three's there. The crucial point about this picture is that, assuming they picked different random numbers, one of them is first and the other is second. And the one that's first — assuming the gap is big enough — has its election timer go off before the other one's. As long as we're not unlucky, it'll have time to send out a full round of vote requests and get answers from everybody that's alive before any other server's election timer goes off. So, does everybody see how the randomization desynchronizes the candidates? Unfortunately, there's a bit of art in setting the constants for these election timers, and there are competing requirements to fulfill. One obvious requirement is that the election timeout has to be at least as long as the expected interval between heartbeats: if the leader sends out heartbeats every 100 milliseconds, there's no point in having anybody's election timer go off in under 100 milliseconds, because then it would go off before we could even have expected a new AppendEntries. So the lower limit is certainly one heartbeat interval. In fact, because the network may drop packets, you probably want the minimum election timeout to be a couple of heartbeat intervals. For 100-millisecond heartbeats, you probably want the very shortest possible election timeout to be, say, 300 milliseconds — three times the heartbeat interval. So that's the minimum: if the heartbeats are this frequent, you want the minimum to be a couple of times that. What about the maximum? You're presumably going to randomize uniformly over some range of times; where should we set the maximum of that range? There are a couple of considerations. In a real system, this maximum affects how quickly the system can recover from failure, because remember, from the time the leader fails until the first election timer goes off, the whole system is frozen.
There's no leader; clients' requests are being thrown away, because there's no leader and we haven't chosen a new one, even though presumably the other servers are up. So the bigger we choose this maximum, the longer the delay we impose on clients before recovery occurs. Whether that's important depends on how high-performance we need the system to be and how often we think there will be failures. If failures happen once a year, who cares; if we expect failures frequently, we may care very much how long recovery takes. Okay, so that's one consideration. The other consideration is this gap — the expected gap in time between the first timer going off and the second timer going off. To be useful, the gap has to be longer than the time it takes for the candidate to assemble votes from everybody: longer than the expected round trip time, the time it takes to send an RPC and get the response. Maybe it takes 10 milliseconds to send an RPC and get responses from all the other servers; if so, we need to make the randomization range wide enough that there's pretty likely to be at least 10 milliseconds' difference between the smallest random timeout and the next smallest. And for you, the test code will get upset if you don't recover from a leader failure within a couple of seconds, so pragmatically you need to tune this maximum down so that it's highly likely you'll complete a leader election within a few seconds. But that's not a very tight constraint. Any questions about the election timeouts? One tiny point: you want to choose a new random timeout every time a node resets its election timer. That is, don't choose a random number when the server is first created and then reuse that same number over and over again, because you might make an unlucky choice — one server happens, by ill chance, to pick the same random number as another server — and then you're going to have split votes over and over again, forever. So you almost certainly want to choose a fresh random number for the election timeout every time you reset the timer; there's a small sketch of this after the next paragraph. All right, so a final issue about leader election. Suppose the old leader is partitioned: the network cable is broken, and the old leader is out there with a couple of clients and a minority of the servers, while there's a majority in the other half of the network, and that majority elects a new leader. What about the old leader? Why won't it cause incorrect execution? Yeah — there are two potential problems. One, which turns out to be a non-problem, is this: if there's a leader off in a partition with only a minority, then the next time a client sends it a request, it'll send out AppendEntries, but because it's in the minority partition it won't be able to get responses back from a majority of the servers, even counting itself. So it will never commit the operation, never execute it, and never respond to the client saying it executed it. Clients may send requests to an old leader off in a different partition, but they'll never get responses — so no client will be fooled into thinking the old leader executed anything for it.
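Here's that election timer sketch in Go — the constants and names are hypothetical, chosen to match the numbers above (100 ms heartbeats, a minimum timeout of a few heartbeat intervals):

```go
package main

import (
	"math/rand"
	"time"
)

const (
	heartbeatInterval  = 100 * time.Millisecond
	electionTimeoutMin = 300 * time.Millisecond // a few heartbeat intervals
	electionTimeoutMax = 600 * time.Millisecond // bounds recovery delay
)

// randomElectionTimeout picks a fresh random timeout, uniform over
// [electionTimeoutMin, electionTimeoutMax). Call it on every reset --
// e.g. whenever an AppendEntries arrives -- never just once at server
// creation, so two unlucky servers don't stay synchronized forever.
func randomElectionTimeout() time.Duration {
	span := int64(electionTimeoutMax - electionTimeoutMin)
	return electionTimeoutMin + time.Duration(rand.Int63n(span))
}

func main() {
	// If no AppendEntries arrives before the deadline, become a candidate.
	deadline := time.Now().Add(randomElectionTimeout())
	_ = deadline
}
```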
Back to the partitioned old leader: the other, more tricky issue, which I'll talk about in a few minutes, is the possibility that before a leader fails, it sends out AppendEntries to a subset of the servers and then crashes before making a commit decision. That's a very interesting question, which we'll probably spend a good 45 minutes on. Before I turn to that topic: any more questions about leader election? Okay — so, on to the contents of the logs, and in particular how a newly elected leader, possibly picking up the pieces after an awkward crash of the previous leader, sorts out the possibly divergent logs on the different replicas in order to restore a consistent state in the system. The first thing to say is that this whole topic is really only interesting after a server crashes. If the leader stays up, relatively few things can go wrong: while a leader is up and has a majority, it just tells the followers what the log should look like, and the followers are not allowed to disagree. By the rules of figure two, if they've been more or less keeping up, they just take whatever the leader sends them in AppendEntries, append it to their logs, obey the commit information, and execute. There's hardly anything to go wrong. The things that go wrong in Raft go wrong when the old leader crashes midway through sending out messages, or when a new leader crashes just after it's been elected but before it's done anything very useful. So one thing we're very interested in is: what can the logs look like after some sequence of crashes? Here's an example with three servers. We're going to be looking at a lot of situations where the logs look like such-and-such, and wondering: is that possible, and what happens if so? My notation: I write out the log entries for each server, aligned to indicate corresponding slots in the log, and the values I write are term numbers rather than the client operations. This is slot one, this is slot two. Everybody saw a command from term three in slot one; server two and server three also saw a command from term three in the second slot, but server one has nothing there at all. The very first question is: can this arise, and how? Yes — just repeating what you said: maybe server three was the leader for term three. It got a command, sent it out to everybody, everybody received it and appended it to their log. Then server three got a second request from a client. The leader always appends new commands to its own log before it sends out AppendEntries, and maybe the AppendEntries RPC only got to server two — the message got lost on the way to server one, or server one was down at the time. So this is about the simplest situation in which the logs end up different, and we know how it could arise.
And so what if server three, which is the leader, should crash now? If server three crashes and we get an election and some new leader is chosen, two things have to happen. The new leader has got to recognize that this second command could have committed; it's not allowed to throw the command away. And it needs to make sure server one fills in this blank with the very same command that everybody else has in that slot. Another way this situation can come up, by the way, is that server three might have sent out the append entries to server two but then crashed before sending the append entries to server one. So if we're electing a new leader, it could be because we got a crash before a message was sent.

Here's another scenario to think about. Three servers again, and now I'm going to number the slots in the log so we can refer to them: we've got slots 10, 11, 12, 13. It's the same setup, except that now, in slot 12, server two has a command from term four and server three has a command from term five. Before we analyze these to figure out what would happen, and what a server would do if it saw this, we need to ask: could this even occur? Because sometimes the answer to the question "gosh, what would happen if this configuration arose?" is that it cannot arise, so we do not have to worry about it. So the question is: could this arise, and how?

[A student works through a sequence of elections and crashes that produces it.] So let me just repeat that in brief. We know the earlier configuration can arise, and the way we can then get a four and a five here is this. Suppose in the next leader election, server two is elected leader for term four. It gets a request from a client, appends it to its own log, and crashes. So now we have the four in slot 12, on server two only. We need a new election, because the leader just crashed. Now, in this election, we have to ask, we have to keep in the backs of our heads: who could be elected? We're going to claim that server three could be elected. The reason it could be elected is that it only needs request vote responses from a majority, and that majority can be server one and server three; there's no problem, no conflict between those two logs. So server three can be elected for term five, get a request from a client, append it to its own log, and crash. And that's how you get this configuration.

So you need to be able to work through these things in order to get to the stage of saying "yes, this could happen, and therefore raft must do something sensible," as opposed to "it cannot happen," because some things can't happen. All right, so what happens now? We know this configuration can occur, so hopefully we can convince ourselves that raft actually does something sensible. But before we talk about what raft would actually do, we need to have some sense of what would be an acceptable outcome, right?
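As an aside on why there's "no conflict" between server one's log and server three's: in the raft paper's election restriction (section 5.4.1), a voter grants its vote only if the candidate's log is at least as up-to-date as its own, comparing the last entries of the two logs. Here's a sketch of that check, with invented parameter names:

```go
package raft

// candidateUpToDate reports whether a candidate's log is at least as
// up-to-date as the voter's, comparing the last entry of each log.
// In the scenario above, when server three campaigns for term five,
// its last entry is (term 3, slot 11) and server one's is
// (term 3, slot 10), so server one grants its vote. Server two, whose
// last entry is from term 4, would refuse, but server three doesn't
// need server two's vote to reach a majority.
func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm // a higher last term is more up-to-date
	}
	return candLastIndex >= myLastIndex // terms equal: the longer log wins
}
```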
And just eyeballing this, we know that the command in slot 10, since it's present on all the replicas, could have been committed, so we cannot throw it away. Similarly, the command in slot 11, since it's on a majority of the replicas, could for all we know have been committed, so we can't throw it away either. Neither of the commands in slot 12, however, could possibly have been committed. So raft is entitled, we don't yet know what raft will actually do, but raft is entitled to drop both of the slot 12 entries, even though it is not entitled to drop either of the commands in slots 10 or 11. It's entitled to drop both; it's not required to drop both; but we know it certainly must drop at least one of them, because we have to end up with identical log contents on all the servers.

What do I mean by "this could have been committed"? We can't tell by looking at the logs exactly how far the leader got before crashing. One possibility, for the slot 11 command or even the slot 10 command, is that the leader sent out the append messages with the new command and then immediately crashed, so it never got any responses back, and the old leader never knew whether the command was committed. And if it didn't get responses back, then it didn't execute the command, and it didn't send out the incremented commit index, so maybe the replicas didn't execute it either. So it's actually possible that this entry wasn't committed; if raft knew more than it does know, it might be legal to drop this log entry, because it might not have been committed. But on this evidence, there's no way to disprove that it was committed: the leader might have crashed just after receiving the append entries replies and replying to the client. So just looking at this, we can't rule out either possibility: that the leader responded to the client, in which case we cannot throw away this entry because the client knows about it, or that the leader never did. It could have been committed, and raft can't prove it wasn't, so it must treat it as committed.

Question: can an entry have been committed even though the client never got a response? Yes; in this situation, maybe the server crashed before getting the responses back, or after committing but before replying. All right, let's continue this on Thursday.
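As a closing reference, here's a hedged sketch of the leader-side commit rule that this "could it have been committed?" reasoning is shadowing: a leader counts an entry as committed once a majority of the servers hold it, which is exactly why a new leader must preserve anything it can see on a majority, like slots 10 and 11 above. Field names are modeled on figure two of the paper, the current-term check is the paper's section 5.4.2 restriction, and none of this is prescribed lab code:

```go
package raft

// Sketch of leader state; matchIndex has one slot per server
// (including the leader itself) giving the highest log index known to
// be replicated on that server.
type leaderState struct {
	me          int
	currentTerm int
	commitIndex int   // highest log index known to be committed
	logTerms    []int // term of the entry in each slot, as in the diagrams
	matchIndex  []int
}

// advanceCommitIndex marks an index committed once a majority of the
// servers hold that entry. The final term check restricts direct
// commitment to entries from the leader's own current term.
func (rf *leaderState) advanceCommitIndex() {
	for n := rf.commitIndex + 1; n < len(rf.logTerms); n++ {
		count := 1 // the leader always has its own entries
		for peer, match := range rf.matchIndex {
			if peer != rf.me && match >= n {
				count++
			}
		}
		if count > len(rf.matchIndex)/2 && rf.logTerms[n] == rf.currentTerm {
			rf.commitIndex = n
		}
	}
}
```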