I was imagining three servers with logs that looked like this, where the numbers I'm writing are the term numbers of the commands in the log entries; we don't really care what the actual commands are, and I've numbered the log slots. Presumably the next term is term six. You can't actually tell that from the evidence on the board, but it must be at least six. Let's imagine that server S3 is chosen as the leader for term six. At some point S3, the new leader, is going to want to send out a new log entry, so let's suppose it wants to send out its first log entry for term six. So we're thinking about the AppendEntries RPCs the leader sends out to carry the first log entry for term six, which should land in slot 13. The rules in Figure 2 say that an AppendEntries RPC carries, as well as the command that the client sent to the leader and that we want to replicate in the followers' logs, a prevLogIndex field and a prevLogTerm field. When the leader sends out an AppendEntries, it's supposed to include information about the slot just before the new entries it's sending. In this case the index of the previous entry is 12, and the term of the command in the leader's log at that previous entry is 5. So it sends that information out to the followers. And the followers, before they accept an AppendEntries, are supposed to check: they've received an AppendEntries for log entries that start here, and the first thing the receiving follower does is check that its own entry at the previous slot matches the previous-entry information the leader sent. For server S2, of course, it doesn't match. S2 has an entry at slot 12, all right, but it's an entry from term 4, not from term 5. So S2 is going to reject this AppendEntries and send a false reply back to the leader. And S1 doesn't even have anything at slot 12, so S1 is also going to reject the AppendEntries. So far, so good. The terrible thing that has been averted at this point, the bad thing we absolutely don't want to see, is S2 actually sticking the new log entry into slot 13, which would break the inductive argument, essentially, that the Figure 2 scheme relies on, and hide the fact that S2 actually had a different log. So instead of accepting the log entry, S2 rejects this RPC. The leader sees these two rejections, and the leader maintains a nextIndex field for each follower: it has a nextIndex for S2, and a nextIndex for S1. I should have said this before: if the leader is sending out information about slot 13, that must mean the leader's nextIndex for both of the other servers started out as 13. And that would be the case if this leader had just restarted, because the Figure 2 rules say that nextIndex starts out at the end of the new leader's log. In response to the errors, the leader is supposed to decrement its nextIndex field. So it does that for both, since it got errors from both, decrements them to 12, and resends. This time the leader is going to send out AppendEntries with prevLogIndex = 11 and prevLogTerm = 3.
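As a concrete sketch of that follower-side consistency check, here's roughly what it might look like in Go. The types and field names below are hypothetical (they follow Figure 2's vocabulary, not any particular lab solution), and the currentTerm comparisons from Figure 2 are omitted to keep it short:

```go
// Hypothetical types in the style of Figure 2 (a sketch, not the lab's
// required shapes). log[0] is a dummy entry so real slots start at 1.
type LogEntry struct {
	Term    int
	Command interface{}
}

type Raft struct {
	log []LogEntry
}

type AppendEntriesArgs struct {
	Term         int
	LeaderId     int
	PrevLogIndex int // index of the slot just before Entries
	PrevLogTerm  int // the leader's term for that slot
	Entries      []LogEntry
	LeaderCommit int
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}

// The follower-side consistency check just described: reject unless
// our log has an entry at PrevLogIndex whose term matches PrevLogTerm.
func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	// (The currentTerm comparisons from Figure 2 are omitted here.)
	if args.PrevLogIndex >= len(rf.log) ||
		rf.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		reply.Success = false // the leader will decrement nextIndex and retry
		return
	}
	// The check passed: drop any conflicting suffix and splice in the
	// leader's entries. (Real code must also tolerate stale, reordered
	// RPCs before truncating like this.)
	rf.log = append(rf.log[:args.PrevLogIndex+1], args.Entries...)
	reply.Success = true
}
```

The key point is the early return: a follower whose log doesn't match at prevLogIndex refuses the entries rather than splicing them in at the wrong place.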
And this new AppendEntries has a different prevLogIndex, and the content of the log entries the leader sends out changes too: this time it includes all the entries after the new prevLogIndex. So now server S2 looks at slot 11, the new prevLogIndex, and sees: aha, the term there is 3, the same as what the leader is sending me. So S2 is actually going to accept this AppendEntries. And the Figure 2 rules say that if you accept an AppendEntries, you're supposed to delete everything in your log after the point where the AppendEntries starts and replace it with whatever is in the AppendEntries. So S2 does that; now its log ends with 5, 6. Server S1 still has a problem, because it has nothing at slot 11, so it returns another error. The leader will now back up its nextIndex for S1 to 11 and send out its log starting one slot earlier, with the previous index and term now referring to slot 10. This one is actually acceptable to S1: it accepts the new log entries and sends a positive response back to the leader. And presumably the leader, when it sees that a follower has accepted an AppendEntries carrying a certain number of log entries, increments that follower's nextIndex, to 14 for both here. So the effect of all this backing up is that the leader has used the backup mechanism to detect the point at which each follower's log stopped being identical to the leader's, and has then sent each follower, starting from that point, the complete remainder of the leader's log after the last point at which they were equal. Any questions? Just to repeat a discussion we've had before, and will probably have again: you'll notice that we erased some log entries here, which are now so thoroughly erased that I forget what they were. Mostly, remember, we erased this log entry here; it used to say 4 on server S2. The question is, why was it OK for the system to forget about this client command? The entry we erased corresponds to some client command that we're now throwing away. I talked about this yesterday; what's the rationale? Yeah: it's not on a majority of the servers. Therefore, whatever previous leader sent it out couldn't have gotten acknowledgments from a majority of servers; therefore that previous leader couldn't have decided it was committed, couldn't have executed it and applied it to the application state, and could never have sent a positive reply back to the client. So because this entry isn't on a majority of the servers, we know the client who sent it in has no reason to believe it was executed, and it couldn't have gotten a reply, because one of the rules is that the leader only sends a reply to a client after it commits and executes the command. So the client had no reason to believe the command was even received by any server. And the rules around Figure 2 basically say that if the client gets no response after a while, it's supposed to resend the request. So whatever request this was that we threw away: it was never executed, never included in anyone's state, and the client is going to resend it by and by. Yes? Well, what the leader sends is always a suffix of its own log. In the end, the backstop answer is that the leader has a complete log, so if all else fails, it can just send its complete log to the follower. (A sketch of the leader's side of this retry loop follows.)
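Here's that leader-side back-up-and-retry loop in Go. All the names are hypothetical shapes: real lab code is event-driven, holds a lock, and checks for term changes, and sendAppendEntries below is a stub standing in for the actual RPC call:

```go
// A sketch of the leader's back-up-and-retry loop for one follower.
type LogEntry struct {
	Term    int
	Command interface{}
}

type AppendEntriesArgs struct {
	Term, PrevLogIndex, PrevLogTerm, LeaderCommit int
	Entries                                       []LogEntry
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}

type Raft struct {
	currentTerm, commitIndex int
	log                      []LogEntry // index 0 is a dummy entry
	nextIndex, matchIndex    []int
}

// Stub: returns false if the network dropped the RPC or the reply.
func (rf *Raft) sendAppendEntries(server int, args *AppendEntriesArgs, reply *AppendEntriesReply) bool {
	return false
}

func (rf *Raft) syncFollower(server int) {
	for {
		prev := rf.nextIndex[server] - 1 // the slot just before what we send
		args := &AppendEntriesArgs{
			Term:         rf.currentTerm,
			PrevLogIndex: prev,
			PrevLogTerm:  rf.log[prev].Term,
			Entries:      rf.log[prev+1:], // everything after the matching point
			LeaderCommit: rf.commitIndex,
		}
		var reply AppendEntriesReply
		if !rf.sendAppendEntries(server, args, &reply) {
			return // RPC lost; try again on the next heartbeat
		}
		if reply.Success {
			// The follower's log now matches ours through these entries.
			rf.matchIndex[server] = prev + len(args.Entries)
			rf.nextIndex[server] = rf.matchIndex[server] + 1
			return
		}
		// Rejected: back up one slot and retry. nextIndex never needs to
		// fall below 1, because PrevLogIndex 0 (the dummy slot) always matches.
		rf.nextIndex[server]--
	}
}
```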
And indeed, if you've just started up the system and something very strange happened early on, then, maybe in some of the tests for Lab 2, you may end up backing up to the very first entry and having the leader essentially send its whole log. But because the leader has this whole log, we know it has all the information required to fill in everybody's logs if it needs to. Okay. All right, so in this example, which I guess is now erased, we elected S3 as the leader. So the question is: who are we allowed to elect as leader? If you read the paper, the answer is not just anyone. It turns out to matter a lot for the correctness of the system that we don't allow just anyone to be the leader. For example, the first node whose election timer goes off may in fact not be an acceptable leader. So Raft has some rules it applies about whether you can or can't be leader. And to see why, let's set up a straw-man proposal: maybe Raft should use the server with the longest log as the leader. In some alternate universe that could be the rule, and it actually is the rule in systems with different designs, just not in Raft. So the question we're investigating is: why not use the server with the longest log as leader? This would involve changing the voting rules in Raft so that voters only vote for candidates with longer logs. All right, here's an example that's convenient for showing why this is a bad idea. Let's imagine we have three servers again, and now the logs are: server S1 has entries from terms 5, 6, and 7; server S2 has entries from terms 5 and 8; and server S3 also has entries from terms 5 and 8. The first question, of course, to avoid spending our time scratching our heads about utter nonsense, is to convince ourselves that this configuration could actually arise, because if it couldn't possibly arise, it might be a waste of time to figure out what would happen if it did. Anybody want to propose a sequence of events by which this set of logs could have arisen? How about an argument that it couldn't have? Okay, so let's back up in time. Server S1 wins the election at this point and is the leader for term six. It receives a client request and, actually, everything's fine so far, nothing's wrong. Well, a good bet in all these scenarios is: then it crashes. So it receives the client request in term six, appends it to its own log, which it does first, and it's about to send out AppendEntries, but it crashes. So it never sent out any AppendEntries. Then it restarts very quickly, there's a new election, and, gosh, S1 is elected again as the new leader. Now in term seven, it receives a client request, appends it to its log, and then it crashes again. After that crash we have a new election, and maybe S2 gets elected this time; S1 is down now, so it's off the table. If S2 is elected at this point, with S1 still dead, what term is S2 going to use? Yeah, eight is the right answer. So why eight? Remember, this evidence is now gone from the board. Why eight and not six?
That's absolutely right. It's not written on the board, but in order for S1 to have been elected here, it must have gotten votes from a majority of nodes, which includes at least one of S2 and S3. If you look at the RequestVote code in Figure 2, when you vote for somebody you're supposed to record the term in persistent storage. That means at least one of S2 and S3 knew about term six, and in fact about term seven. Therefore, when S1 dies and the others go to elect a new leader, at least one of them knows that the next term must be eight. If only one of them knows about term seven, then only that one can win the election, because it has the higher term number; if they both know about term seven, then either of them can try to become leader in term eight. So the fact that the next term must be term eight is ensured by the property that majorities must overlap, together with the fact that currentTerm is updated by RequestVote handling and is persistent, guaranteed not to be lost even across these crashes. So the next term is going to be eight, S2 or S3 will win the leadership election, and let's just imagine that whichever one it is sends out AppendEntries for a new client request, the other one gets it, and now we have this configuration. So that was a bit of a detour. We're back to our original question: in this configuration, suppose S1 revives and we have an election. Would it be OK to use S1? Would it be OK for the rule to be that the longest log wins, the longest log gets to be the leader? Yeah, obviously not, right? Because if S1 were leader, it would force its log onto the two followers via the AppendEntries machinery we just talked about a few minutes ago. If we allow S1 to be the leader, it's going to send out AppendEntries, back the followers up, and overwrite those term-8 entries: it will tell the followers to erase their log entries from term 8, overwrite them with S1's term-6 and term-7 entries, and then proceed with logs identical to S1's. So why are we upset about this? Yeah, exactly: the term-8 entry was already committed. It's on a majority of the servers, so it's already committed, probably executed, and quite possibly a reply was sent to a client, so we are not entitled to delete it. Therefore S1 cannot be allowed to become leader and force its log onto S2 and S3. Everybody see why that would be a bad idea for Raft? And because of that, longest-log-wins can't possibly be the election rule. (Of course, shortest-log-wins doesn't work either.) In fact, if you read forward to section 5.4.1, Raft has a slightly more sophisticated election restriction that the RequestVote RPC handler is supposed to check before it votes yes for a peer. The rule is: we vote yes for a candidate that sends us a RequestVote only if the candidate's last log entry has a higher term than ours, or its last log entry has the same term as ours and its log is at least as long as that of the server receiving the vote request. If we apply this here: S2 gets a RequestVote from S1. S1 sends out its RequestVote with a last-entry term of seven; S2's last-entry term is eight, so the first clause isn't satisfied.
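That "at least as up-to-date" check is small enough to sketch directly. Here's one hedged version in Go, with hypothetical function and parameter names:

```go
// A sketch of the election restriction from section 5.4.1, as it might
// appear inside a RequestVote handler. Hypothetical names; the real
// handler also checks terms and votedFor per Figure 2.
func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		// Clause one: the candidate's last entry is from a higher term.
		return candLastTerm > myLastTerm
	}
	// Clause two: same last term, and the candidate's log is at least
	// as long as ours.
	return candLastIndex >= myLastIndex
}
```

Applied to the example: S1's last-entry term is 7 while S2's is 8, so the first branch returns false and S2 refuses the vote regardless of the log lengths.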
Server S3 likewise didn't get a request from somebody with a higher term in its last entry, and the last-entry terms aren't the same either, so the second clause doesn't apply either. So neither S2 nor S3 is going to vote for S1. And even if S1 sends out its vote requests first, because it has a shorter election timeout, nobody is going to vote for it except itself, and one vote is not a majority. If either S2 or S3 becomes a candidate, each will accept the other, because they have the same last term number and their logs are each at least as long as the other's. So either of them will vote for the other one. Will S1 vote for either of them? Yes, because both S2 and S3 have a higher term number in their last entry. So what this rule is doing is preferring candidates that have log entries from higher terms, that is, candidates more likely to have been receiving log entries from the previous leader. And the second clause says: if we were all listening to the previous leader, then prefer the server that saw more requests from that very last leader. Any questions about the election restriction? Okay. The final thing about sending out log entries is that this rollback scheme, at least as I described it, and as it's described in Figure 2, rolls back one log entry at a time. Probably a lot of the time that's OK, but there are situations, maybe in the real world and definitely in the lab tests, where backing up one entry at a time is going to take a long, long time. The real-world situation where that might be true is when a follower has been down for a long time and missed a lot of AppendEntries, and then the leader restarts: if you follow the pseudocode in Figure 2, a leader that restarts is supposed to set its nextIndex to the end of its own log. So if the follower has been down and missed the last thousand log entries, and the leader reboots, the leader is going to have to walk back through all thousand of the log entries the follower missed, one at a time, one RPC at a time. And there's no particular reason this would never happen in real life; it could easily happen. A somewhat more contrived situation, which the tester definitely explores, is this: say we have five servers and a leader, and the leader gets trapped with one follower in a network partition. The leader doesn't know it's not the leader anymore, and it keeps sending AppendEntries to its one follower, none of which commit, while in the other, majority partition the system continues as usual. The ex-leader and its follower in that minority partition can end up putting unlimited numbers of log entries into their logs for a stale term, entries that will never be committed and will eventually need to be deleted and overwritten when they rejoin the main group. That's maybe a little less likely in the real world, but you'll see it happen in the test setup. So in order to be able to back up faster, the paper has a somewhat vague description of a faster scheme toward the end of section 5.3. It's a little bit hard to interpret, so I'm going to try to explain their idea about how to back up faster a little bit better.
The general idea is to have the follower send enough information back to the leader that the leader can jump back over an entire term's worth of entries per AppendEntries, instead of one entry at a time. So the leader may only have to send one AppendEntries per term in which the leader and follower disagree, instead of one per entry. There are three cases I think are important. The fact is that you can probably think of many different log-backup acceleration strategies; here's one. So I'm going to divide the kinds of situations you might see into three cases. This is fast backup. I'm just going to talk about one follower and the leader, and not worry about the other nodes. Say we have server S1, which is the follower, and server S2, which is the new leader. Case one: we need to back up over a term that's entirely missing from the leader's log. Case two: we need to back up over some entries, but they are entries from a term the leader actually knows about. Apparently this follower saw a couple of entries, the very last few AppendEntries sent out by a leader that was about to crash, but the new leader didn't see them; we still need to back up over them. And case three: the follower and the leader agree where they overlap, but the follower is entirely missing the end of the leader's log. I believe you can take care of all three of these with three extra pieces of information in the reply that a follower sends back to the leader. In the AppendEntries reply, if the follower rejects the AppendEntries because the logs don't agree, there are three pieces of information that are useful in handling these three cases. I'll call the first one XTerm, the term of the conflicting entry: remember, the leader sent prevLogTerm, and if the follower rejects because the term there is wrong, it puts its own term for the conflicting entry in XTerm, or -1 or something if it actually doesn't have anything in the log there. The follower also sends back XIndex, the index of the first entry with that term in its log. And finally, if there wasn't any log entry there at all, the follower sends back XLen, the length of the follower's log. For case one, here's how this helps. If the leader sees that it doesn't have any entry with term XTerm in its log at all, which is the case where the leader didn't have term five, then the leader can simply back up to the beginning of the follower's run of entries with XTerm. That is, the leader sets its nextIndex for that follower to XIndex, the first entry of the follower's run of term-five entries. So if the leader doesn't have XTerm at all, it should back the follower up to XIndex.
The second case the leader can detect is when XTerm is valid and the leader actually has log entries with term XTerm. That's the case where the disagreement is here, but the leader has some entries of that term, and in that case the leader should back up to just past the last entry it has with the follower's conflicting term, that is, the last entry the leader has for term four in this case. And if neither of those two cases holds, that is, if the follower indicates, say by setting XTerm to -1, that it actually didn't have anything whatsoever at the conflicting log index because its log is too short, then the leader should back up its nextIndex to the end of the follower's log, XLen, and start sending from there. I'm telling you this because it will be useful for doing the lab, and if you missed some of my description, it's in the notes. Any questions about this backing-up business? Do you still want to find the first entry where the term and the index are the same, and have the follower go from there? I think that's true, yeah. Go from there, yeah. Okay. Yeah, maybe binary search; I'm not ruling out other solutions. After reading the paper's non-description of how to do it, I cooked this up, and there are probably other ways to do it, probably better, faster ways. I'm sure that if you were willing to send back more information, or use a more sophisticated strategy like binary search, you could do a better job. But do you absolutely need to do something like this to pass the tests? Well, that's not entirely clear: one of the solutions I've written over the years actually does the slow thing and still passes. But one of the unfortunate but inevitable things about the tests we give you is that they have a bit of a real-time requirement. The tests are not willing to wait forever for your solution to produce an answer. So it is possible to have a solution that's technically correct but takes so long that the tester gives up, and unfortunately the tester will fail you if your solution doesn't finish within the time limit, ten minutes or whatever it is. Therefore you do actually have to pay some attention to performance: your solution has to be both correct and fast enough to finish before the tester gets bored and times out on you. And unfortunately this stuff is complex enough that it's not that hard to write a correct solution that isn't fast enough. Yes? So the way the leader can tell the difference between the cases is that the follower is supposed to send back the term number it sees in the conflicting entry. We have case one if the leader does not have that term in its log. Here, the follower sets XTerm to five, because this is the conflicting entry. The leader observes: I do not have term five in my log; therefore this is case one, and I should back up to the beginning. The leader has none of those term-five entries, so it should just get rid of all of them in the follower, by backing up to the beginning of the run, which is XIndex.
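To make the three cases concrete, here's a sketch of the extra reply fields and the leader's decision logic in Go. To be clear, XTerm, XIndex, and XLen are this lecture's names, not fields from the paper's Figure 2, and this is one workable design among many:

```go
// The extra information in a rejecting AppendEntries reply.
type LogEntry struct {
	Term    int
	Command interface{}
}

type AppendEntriesReply struct {
	Term    int
	Success bool
	XTerm   int // follower's term in the conflicting entry, or -1 if that slot is empty
	XIndex  int // index of the follower's first entry with term XTerm
	XLen    int // length of the follower's log (one past its last entry, counting a dummy slot 0)
}

type Raft struct {
	log       []LogEntry // index 0 is a dummy entry
	nextIndex []int
}

// How the leader might use a rejection: one jump per conflicting term
// rather than one per entry.
func (rf *Raft) backUpNextIndex(server int, reply *AppendEntriesReply) {
	if reply.XTerm == -1 {
		// Case three: the follower's log was too short; jump straight
		// to the end of its log.
		rf.nextIndex[server] = reply.XLen
		return
	}
	// Look for the leader's last entry with term XTerm, if any.
	last := -1
	for i := len(rf.log) - 1; i >= 1; i-- {
		if rf.log[i].Term == reply.XTerm {
			last = i
			break
		}
	}
	if last == -1 {
		// Case one: the leader has no XTerm entries at all; back the
		// follower up to the start of its run of XTerm entries.
		rf.nextIndex[server] = reply.XIndex
	} else {
		// Case two: the leader does have XTerm entries; back up to
		// just past the last of them.
		rf.nextIndex[server] = last + 1
	}
}
```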
Do you have a question? Yeah: the follower will get rid of the fives, because the leader is going to back up its nextIndex to here and then send an AppendEntries that starts here, and the rules of Figure 2 say the follower has to replace its log from that point on. All right, the next thing I want to talk about is persistence. You'll notice in Figure 2 that the state in the upper left-hand corner is divided into some marked persistent and some marked volatile. What's going on here is that the distinction between persistent and volatile only matters if a server reboots: crashes and restarts. What persistent means is that if you change one of the items marked persistent, the server is supposed to write it to disk, or to some other non-volatile storage, like an SSD or battery-backed something-or-other, that will ensure that if the server restarts, it will be able to find that information and reload it into memory. And that's what allows servers to pick up where they left off if they crash and restart. Now, you might think it would be sufficient, and simpler, to say: if a server crashes, we just throw it away; we need to be able to throw a server away and replace it with a brand-new empty server and bring it up to speed. And it is in fact vital to be able to do that, because if a server suffers some catastrophic failure, its disk melts or something, you absolutely need to be able to replace it, and you cannot count on getting anything useful off its disk if something bad happened to that disk. So we absolutely need to be able to completely replace servers with ones that have no state whatsoever. You might think that's sufficient to handle any difficulty, but it's actually not. It turns out that another common failure mode is a power failure of the entire cluster, where all the servers stop executing at the same time. We can't handle that failure by simply throwing away the servers and replacing them with new hardware we buy from Dell. We actually have to be able to get off the ground again: we need to be able to get a copy of the state back in order to keep executing, if we want our service to be fault tolerant. So at least in order to handle the situation of simultaneous power failure, we have to have a way for the servers to save their state somewhere it will be available when the power returns. That's one way of viewing what's going on with persistence: it's the state required to get a server going again after a crash, including a power failure of the entire cluster. All right, so Figure 2 lists only three items as persistent: the log, meaning all the log entries; currentTerm; and votedFor. And by the way, when a server reboots it actually has to make an explicit check that this data is valid on its disk before it rejoins the Raft cluster. It has to have some way of saying: yes, I actually do have saved persistent state, as opposed to a bunch of zeros that are not valid. The reason the log has to be persisted is that, at least according to Figure 2, it is the only record of the application state. Figure 2 does not say that we have to persist the application state.
So if we're running a database, or a test-and-set service like the one for VMware FT, the actual database, or the actual value of the test-and-set flag, isn't persisted according to Figure 2. Only the log is. So when a server restarts, the only information available to reconstruct the application state is the sequence of commands in the log, and that's why the log has to be persisted. What about currentTerm: why does currentTerm have to be persisted? Yeah, both currentTerm and votedFor are about ensuring that each term has at most one leader. For votedFor, the specific potentially damaging case is this: a server receives a vote request, votes for server S1, and then crashes. If it didn't persist the identity of who it voted for, it might crash, restart, get another vote request for the same term from server S2, and say: gosh, I haven't voted for anybody, my votedFor is blank; I'm going to vote for S2. And now our server has voted for both S1 and S2 in the same term. That might allow two leaders: since S1 and S2 each also voted for themselves, both may think they have a majority out of three, and both will become leader. Now we have two simultaneous leaders for the same term. So that's why votedFor has to be persistent. currentTerm is a little more subtle, but we talked before about how, again, we don't want more than one leader for a term, and if we don't know what the current term number is, it may be hard to ensure that there's only one leader per term. I think the earlier example shows it: if S1 was down and S2 and S3 were going to elect a new leader, they needed evidence that the correct next term number was eight, not six. If they forgot about currentTerm, and it was just S2 and S3 voting for each other with only their logs to look at, they might think the next term should be term six. If they did that, they'd start producing entries for term six, but now there's going to be a lot of confusion, because we'd have two different term sixes. So the reason currentTerm has to be persistent is to preserve evidence about which term numbers have already been used. These items have to be persisted pretty much every time you change them. Certainly the safe thing to do is: every time you add an entry to the log, or change currentTerm, or set votedFor, you persist it. In a real Raft server, that would mean writing it to the disk; you'd have some set of files that recorded this stuff. You can maybe cut some corners if you observe that you don't actually need to persist these things until you communicate with the outside world. So there may be some opportunity for a little bit of batching: we don't have to persist anything until we're about to reply to an RPC or about to send out an RPC, and that may allow you to avoid a few persists. The reason that's important is that writing stuff to disk can be very expensive. If we're persisting by writing files on a mechanical hard drive, then writing anything costs you about 10 milliseconds, because you have to wait for the point on the disk you want to write to rotate under the head, and the disk only rotates about once every 10 milliseconds.
Or worse yet, you may actually have to seek, to move the arm to the right track. So these persist operations can be terribly, terribly expensive, and for any kind of straightforward design they're likely to be the limiting factor in performance, because they mean that doing anything whatsoever on these Raft servers takes 10 milliseconds a pop. And 10 milliseconds is far longer than it takes to, say, send an RPC, or almost anything else you might do. At 10 milliseconds each, if you persist data to a mechanical drive, you can simply never build a Raft service that serves more than about 100 requests per second, because that's what you get at 10 milliseconds per operation. This is really all about the cost of synchronous disk updates, and it comes up in many systems, like file systems: the designers of the file systems running on your laptops spend a huge amount of time navigating around the performance problems of synchronous disk writes, because in order to update the file system on your laptop's disk safely, it turns out the file system has to be careful about how it writes and sometimes has to wait for the disk to finish writing. So this is a cross-cutting issue in all kinds of systems, and it certainly comes up in Raft. If you wanted to build a system that could serve more than 100 requests per second, there are a bunch of options. One is to use a solid-state drive, or some kind of flash: solid-state drives can do a write to flash memory in maybe a tenth of a millisecond, so that's a factor of 100 for you. Or, if you're even more sophisticated, maybe you build yourself battery-backed DRAM and do the persistence into that, and then if the server reboots, you hope the reboot took less time than the battery lasts, so that the stuff you persisted is still in the RAM. If you have money and sophistication, the reason to favor that is that you can write DRAM millions of times per second, so it's probably not going to be a performance bottleneck. Anyway, this problem is why the marking of persistent versus volatile in Figure 2 has a lot of significance for performance, as well as for crash recovery and correctness. Any questions about persisting? Yes? So if you persist something and then immediately crash, then maybe the... All right, so your question is basically: you're writing code, say Go code for your Raft implementation, or you're trying to write a real Raft implementation, and you actually want to make sure that when you persist your update to the log or the currentTerm or whatever, it in fact will still be there after a crash and reboot. What's the recipe for making sure it's there? And your observation is right: on Unix or Linux or a Mac, if you simply call write, which is how you write to a disk file, it is not the case that when the write returns, the data is safe on disk and will survive a reboot. It almost certainly is not on the disk yet. So the particular piece of magic you need, on Unix at any rate: you do need to call write, to write to the file you've opened that's going to contain the stuff you want to persist.
And then you have to call fsync, whose guarantee on most systems is that it doesn't return until all the data you've previously written to the file is safely on the media, in a place where it will still be there after a crash and reboot. fsync is an expensive call, and that's exactly why it's separate, why plain write doesn't force the data to the disk and only fsync does: it's so expensive that you would never want to do it unless you really wanted to persist some data. Okay, so one option is more expensive disk hardware. The other trick people play a lot is to batch. If you have a lot of client requests coming in, maybe you should accept a lot of them and not reply to any of them for a little while, wait until a lot of them accumulate, persist 100 log entries at a time from your 100 clients, and only then send out the AppendEntries. Because you do actually have to persist this stuff to disk: if the leader receives a client request, it has to persist the new entry to disk before it sends the AppendEntries RPCs to the followers, because the leader is essentially promising to commit that request and can't be allowed to forget about it. And indeed the followers have to persist the new log entry to their disks before they reply to the AppendEntries, because a positive reply to an AppendEntries is also a promise to preserve, and eventually commit, that log entry, so they can't be allowed to forget it if they crash. Other questions about persistence? All right, a final little detail about persistence: some of the state in Figure 2 is not persistent, and it's worth scratching your head a little about why it's fair game for commitIndex, lastApplied, nextIndex, and matchIndex to simply be thrown away if the server crashes and restarts. Like, why wasn't commitIndex or lastApplied persisted? Geez, lastApplied is the record of how much we've executed; if we throw that away, aren't we going to execute log entries twice? Why is it safe to throw away lastApplied? Yes, we're all about simplicity and safety here with Raft, and that's exactly correct. The reason those other fields can be volatile and thrown away is that the leader can reconstruct what's been committed by inspecting its own log and the results of the AppendEntries it sends out to the followers. Initially, if everybody restarts because of a power failure, the leader does not know what's committed or what's been executed, but as it sends out AppendEntries it gathers back information from the followers about how much of their logs match the leader's, and therefore how much must have been committed before the crash. Another thing about the Figure 2 world, which is not the real world, is that Figure 2 assumes the application state is destroyed and thrown away if there's a crash and a restart. The Figure 2 world assumes that while the log is persistent, the application state is absolutely not; Figure 2 doesn't need it to be, because in Figure 2 the log is persisted from the very beginning of the system's life.
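To make the write-then-fsync recipe concrete, here's a minimal Go sketch of persisting Raft's state to a file. The encoding, file name, and helper shape are all hypothetical (the lab actually hands you a Persister object rather than raw files), and real code would write a temporary file and rename it so a crash mid-write can't corrupt the old state:

```go
import (
	"bytes"
	"encoding/gob"
	"os"
)

type LogEntry struct {
	Term    int
	Command interface{} // concrete command types must be gob.Register'ed
}

// persistToDisk writes currentTerm, votedFor, and the log to a file,
// then calls Sync, Go's interface to fsync, so the bytes are on the
// media before we make any promises to other servers.
func persistToDisk(currentTerm, votedFor int, log []LogEntry) error {
	var buf bytes.Buffer
	enc := gob.NewEncoder(&buf)
	if err := enc.Encode(currentTerm); err != nil {
		return err
	}
	if err := enc.Encode(votedFor); err != nil {
		return err
	}
	if err := enc.Encode(log); err != nil {
		return err
	}
	f, err := os.Create("raft-state")
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write(buf.Bytes()); err != nil {
		return err
	}
	// write() alone does not guarantee durability; Sync blocks until
	// the data is actually on the disk.
	return f.Sync()
}
```

The point of the sketch is just the ordering: encode, write, Sync, and only then reply to or send the RPC.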
And so if you play out the various rules in Figure 2 after a leader restarts, what happens is that every single log entry is eventually re-executed: after a reboot, Raft hands the application every log entry, starting from entry one. So after a restart, the application completely reconstructs its state from scratch by replaying the entire log from the beginning of time, after each restart. And again, that's a straightforward, elegant plan, but obviously potentially very slow. Which brings us to the next topic: log compaction and snapshots. This has a lot to do with Lab 3B; you'll actually see log compaction and snapshots in Lab 3B. The problem that log compaction and snapshotting solve in Raft is that, for a long-running system, one that's been going for weeks or months or years, if we just follow the Figure 2 rules, the log just keeps on growing. It may end up millions and millions of entries long, so it requires a lot of memory to store; if you persist the log to disk every time, it uses up a huge amount of disk space; and if a server ever restarts, it has to reconstruct its state by replaying those millions and millions of log entries from the very beginning, which could take hours. All of which is a kind of waste, because before it crashed the server already had the application state. So in order to cope with this, Raft has this idea of snapshots, and the idea behind snapshots is to ask the application to save a copy of its state as of a particular log entry. We've mostly been ignoring the application, but suppose we were building a key/value store on top of Raft. Then the log is going to contain a bunch of put and get, that is, read and write, requests. So maybe the log contains a put where some client wants to set x to 1, then another where it sets x to 2, then y = 7, or whatever. If there are no crashes, then as Raft executes along, there's this application sitting in the layer above Raft, and the application, if it's a key/value store or database, is maintaining a table; as Raft hands it one command after the next, the application updates its table. After the first command it sets x to 1 in its table; after the second command it updates x to 2. One interesting fact is that for most applications, the application state is likely to be much smaller than the corresponding log. At some level we know that the log, and the state as of some point in the log, are kind of interchangeable: they both imply the same thing about the state of the application. But the log may contain a lot of repeated assignments to x that use up a lot of space in the log yet are all effectively compacted down to a single entry in the table, and that's pretty typical of these replicated applications. So the point is that instead of storing the log, which may grow to be huge, we have the option of storing the table instead, which may be a lot smaller. And that's what snapshots do.
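A tiny sketch of that contrast: replaying a log of puts produces the same table you'd get by maintaining the table directly, and repeated writes to the same key collapse into one table entry (the types here are invented for illustration):

```go
// A log of puts replayed into a key/value table. Many writes to the
// same key collapse to a single table entry, which is why a snapshot
// of the table can be much smaller than the log that produced it.
type Put struct{ Key, Value string }

func replay(log []Put) map[string]string {
	table := make(map[string]string)
	for _, op := range log {
		table[op.Key] = op.Value
	}
	return table
}

// replay([]Put{{"x", "1"}, {"x", "2"}, {"y", "7"}}) yields
// map[x:2 y:7]: three log entries but only two table entries, and the
// ratio keeps improving as clients overwrite the same keys.
```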
So when Raft feels that its log has gotten too large, more than a megabyte, or ten megabytes, or whatever arbitrary limit, Raft asks the application to make a snapshot of the application state as of a certain point in the log. Raft picks the point in the log that the snapshot refers to and requires the application to produce a snapshot as of exactly that point. And this is extremely critical, because what we're about to do is throw away everything before that point: if there weren't a well-defined point in the log that the snapshot corresponds to, we couldn't safely throw away the log before that point. So Raft asks for a snapshot, and the snapshot is basically just the table, if it's a database server, and we also annotate the snapshot with the entry number it corresponds to. So if the entries are one, two, three, this snapshot corresponds to the state just after applying log index three. With the snapshot in hand, persisted to disk by Raft, Raft never again needs the part of the log before that point and can simply throw it away. As long as it persists the snapshot as of a certain log index, plus the log after that index, we're never going to need the log before it. And that's what Raft does: it asks the application for a snapshot, gets the snapshot, saves it to disk along with the log after that point, and just throws away the log before it. So the persistence story really now operates on pairs: a snapshot, plus the log after the point in the log associated with that snapshot. Everyone see this? Yes? No, it's still one log: you should think of entries one, two, three as phantom entries. This suffix of the log is still viewed as the same log, except the early entries are phantoms that we can regard as being there in principle; since we never need to look at them, because we have the snapshot, the fact that they happen not to be stored anywhere is neither here nor there. So think of it as still the same log, just with the early entries thrown away. That's maybe a little too glib an answer, because the fact is that Figure 2 talks about the log in ways such that, if you just follow Figure 2, you sometimes still need those earlier entries. So you'll have to reinterpret Figure 2 a little, in light of the fact that it sometimes says something about a log entry where the log entry no longer exists. Okay, and what happens on a restart? The restart story is a little more complicated than it used to be with just a log. On a restart, there needs to be a way for Raft to find the latest snapshot/log pair on its disk and hand the snapshot to the application, because we're no longer able to replay all log entries, so there must be some other way to initialize the application. Basically, not only does the application have to be able to produce a snapshot of the application state; it also has to be able to absorb a previously made snapshot and reconstruct its in-memory table from it.
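Here's a sketch of what the Raft side of taking a snapshot might look like in Go. The names, the log-truncation convention, and the calling convention are all hypothetical; Lab 3B specifies its own interface between the service and Raft:

```go
import "sync"

type LogEntry struct {
	Term    int
	Command interface{}
}

type Raft struct {
	mu                sync.Mutex
	log               []LogEntry // entries after the snapshot; log[0] is a dummy
	lastIncludedIndex int        // last log index covered by the snapshot
	lastIncludedTerm  int
	snapshot          []byte
}

// Snapshot is called (in this sketch) by the application once it has
// serialized its state as of log index `index`; Raft records the
// snapshot and discards its log through that index.
func (rf *Raft) Snapshot(index int, snapshot []byte) {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	if index <= rf.lastIncludedIndex {
		return // an older snapshot already covers this prefix
	}
	// Translate the absolute log index into a position in the stored log.
	pos := index - rf.lastIncludedIndex
	rf.lastIncludedTerm = rf.log[pos].Term
	// Keep only the entries after `index`, behind a fresh dummy slot 0.
	rf.log = append([]LogEntry{{Term: rf.lastIncludedTerm}}, rf.log[pos+1:]...)
	rf.lastIncludedIndex = index
	rf.snapshot = snapshot
	// ... here real code would persist the Raft state and the snapshot
	// together, so a crash can't separate them ...
}
```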
And even though Raft is managing this whole snapshotting business, the snapshot contents are really the property of the application: Raft doesn't understand what's in the snapshot, only the application does, because it's full of application-specific information. So after a restart, the application has to be able to absorb the latest snapshot that Raft found. If that were all, it would be simple. Unfortunately, this snapshotting, and in particular the idea that the leader might throw away part of its log, introduces a major piece of complexity: if there's some follower out there whose log ends before the point at which the leader's log starts, then unless we invent something new, namely InstallSnapshot, that follower can never be brought up to date. If there's some follower whose log contains only the first two log entries, we no longer have log entry three, which is what we'd need to send it in an AppendEntries RPC to let its log catch up to the leader's. Now, we could avoid this problem by having the leader never drop any part of its log that some follower hasn't caught up to yet. The leader knows through nextIndex, well, actually the leader doesn't really know, but it could know in principle, how far each follower has gotten, and the leader could say: I'm just never going to drop the part of my log before the end of the follower with the shortest log. That would be OK; it might actually just be a good idea, period. The reason it's maybe not such a great idea is that, of course, if a follower is shut down for a week, it's not going to be acknowledging log entries, and that means the leader can't reduce its memory use by snapshotting. So the way the Raft design goes is that the leader is allowed to throw away parts of its log that some follower might still need, and therefore we need some scheme other than AppendEntries to deal with the gap between the end of some follower's log and the beginning of the leader's log. That solution is the InstallSnapshot RPC. The deal is this: we have some follower that just powered on and whose log is short; the leader sends it AppendEntries and is forced to back up, and at some point a failed AppendEntries causes the leader to realize that it has reached the beginning of the log it actually stores. At that point, instead of sending an AppendEntries, the leader sends its current snapshot to the follower, and then presumably immediately follows it with an AppendEntries carrying the leader's current log. And the sad truth is that this adds significant complexity to your Lab 3, partially because of the kind of cooperation that's required between Raft and the application. It's a little bit of a violation of modularity: for example, when an InstallSnapshot comes in, it's delivered to Raft, but Raft really requires the application to absorb the snapshot, so they have to talk to each other more than they otherwise might. Oh yes, the question is whether the way the snapshot is created depends on the application. Absolutely: the snapshot-creation function is part of the application, it's part of the key/value server.
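For reference, the arguments of the InstallSnapshot RPC from the paper's Figure 13 look roughly like this in Go; this sketch omits the paper's offset and done chunking fields, on the assumption (fine for the lab) that the whole snapshot is sent in a single RPC:

```go
// Arguments and reply for the InstallSnapshot RPC, following the
// paper's Figure 13 but without the offset/done chunking fields.
type InstallSnapshotArgs struct {
	Term              int    // leader's term
	LeaderId          int
	LastIncludedIndex int    // the snapshot replaces the log through this index
	LastIncludedTerm  int    // term of the entry at LastIncludedIndex
	Data              []byte // the application's serialized state
}

type InstallSnapshotReply struct {
	Term int // lets a stale leader discover it should step down
}
```

A follower that accepts one of these discards any log prefix the snapshot covers, saves the snapshot, and hands it up to the application to absorb; as discussed below, a snapshot older than what the follower already has can reasonably just be ignored.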
So Raft will somehow call up to the application and say: geez, I'd really like a snapshot right now. And the application produces it, because only the application understands what its state is. The inverse function, by which an application reconstructs its state from a snapshot, is also totally application-dependent. The intertwining comes from the fact that every snapshot has to be labeled with the point in the log it corresponds to. You're talking about rule six in Figure 13? Okay, so the issue here, and you will be faced with this in Lab 3, is that the RPC system isn't perfectly reliable or perfectly sequenced. RPCs can arrive out of order or not at all; you may send an RPC and get no response and think it was lost, when actually it was delivered and it was the reply that was lost. All these things happen, including to InstallSnapshot RPCs, and the leader is almost certainly sending out many RPCs concurrently, both AppendEntries and InstallSnapshots. That means you can get things like InstallSnapshot RPCs from deep in the past, or almost anything else. Therefore the follower has to think carefully about an InstallSnapshot that arrives, and I think the specific thing you're asking is: if a follower receives an InstallSnapshot that appears to be completely redundant, that is, the InstallSnapshot contains information older than the information the follower already has, what should the follower do? Rule six in Figure 13 says something, but I think an equally valid response is that the follower can simply ignore a snapshot that is clearly from the past. I don't really understand that rule six. Okay, I want to move on to a somewhat more conceptual topic for a bit. So far we haven't really tried to nail down anything about what it means to be correct: what it means for a replicated service, or any other kind of service, to be behaving correctly. For most of my life I've managed to get by without worrying too much about precise definitions of correctness, but the fact is that if you're trying to optimize something, or trying to think through some weird corner case, it's often handy to have a more or less formal way of deciding whether a behavior is correct or not. What we're talking about here is clients sending requests to our replicated service over RPC, maybe resending, who knows what; maybe the service is crashing and restarting and loading snapshots, whatever. If a client sends in a request and gets a response, is that response correct? How are we supposed to tell whether response A would be correct, or response B? So we need a pretty formal notion that distinguishes "that's okay" from "no, that would be a wrong answer." For this lab, our notion of correctness is linearizability. Some of the papers mention strong consistency, and that's basically equivalent to linearizability. Linearizability is, more or less, a formalization of the behavior you would expect if there were just one server, and it didn't crash, and it executed the client requests one at a time, and nothing funny ever happened. It has a definition; I'll write out the definition and then talk about it. So: an execution history is linearizable (and this is in the notes)
if there exists a total order of the operations in the history (an execution history being a sequence of client operations, maybe many requests from many clients) that matches the real-time order of the requests. That is, if one client sends out a request and gets a response, and then later in time another client sends out a request and gets a response, those two requests are ordered, because one of them started after the other finished. So a history is linearizable if there exists a total order of the operations in the history that matches real time for non-concurrent requests, that is, for requests that didn't overlap in time, and in which each read sees the value written by the most recent preceding write to the same piece of data in the order. That's the definition. Let me illustrate what it means by running through an example. First of all, a history is a record of client operations, so this is a definition you can apply from the outside. The definition doesn't appeal in any way to what happens inside the implementation, or to how the implementation works. If we watch a system operating, and we can see the messages that come in and out, we can answer the question: was the execution we observed linearizable? So let me write out a history and talk about why it is or isn't linearizable. All right, here's example one. Linearizability talks about operations that start at one point in time and end at another, and that corresponds to the time at which a client sends a request and the later time at which it receives the reply. So let's suppose our history says that at some particular time, this time, some client sent a write request for the data item named x, asking for it to be set to 1; time passed, and the second vertical bar is when that client got the reply. Then later in time, that client, or some other client, it doesn't really matter, sends a write request for x with value 2 and gets a response to that write. Meanwhile, some client sends a read of x and gets the value 2: it sent the request here and got the response, with value 2, there. And there's one more operation we observed as part of the history: a request was sent to read x, and it got the value 1 back. When we have a history like this, the question to ask is: is this a linearizable history? That is, did the service, the system that produced this history, produce a linearizable history in this case? If this history is not linearizable, then, at least when we're talking about Lab 3, we know we have a problem; there must be some bug. Okay, so we need to analyze this to figure out whether it's linearizable. Linearizability requires us to produce an order, a one-by-one total order of the four operations in the history. So we know we're looking for an order, and there are two constraints on the order. One: if one operation finished before another started, then the one that finished first has to come first in the order. Two: if some read sees a particular written value, then the read must come after that write in the order. So we want an order, and we're going to produce an order with four entries: the two writes and the two reads.
I'm going to draw as arrows the constraints implied by those two rules, and our order is going to have to obey these constraints. One constraint is that the write of x=1 finished before the write of x=2 started, and therefore the write of x=1 must appear in the total order before the write of x=2. The read that saw the value 2: in the total order, the most recent preceding write must be the write of x=2, so that read must come after that write. That means in the total order we must see the write of x to 2 and then, after it, the read of x yielding 2. And the read of x yielding 1, assuming x didn't already have the value 1 to begin with, must come after the write of x=1; and this read must also come before the write of x=2, since we can't have a read yielding 1 whose most recent preceding write set x to 2. Maybe there are some other constraints too, but anyway, we can take this set of arrows and flatten it out into an order, and that actually works. The total order that demonstrates that this history is linearizable is: first the write of x=1, then the read of x yielding 1, then the write of x=2, then the read of x yielding 2. The fact that there is an order obeying the ordering constraints shows that this history is linearizable, and if we're worried about whether the system that produced this history is linearizable, then this particular example doesn't contradict the presumption that the system is linearizable. Any questions about what I just did? Each read, say a read of x, must see the value written by the most recent preceding write of x in the order. In this case we're totally okay with this order, because each read saw exactly the value written by the most recent preceding write in the order. Informally, this says reads should not yield stale data: if I write something and read it back, gosh, I should see the value I wrote; the definition is just a formalization of that notion. All right, let me write up an example that is not linearizable. Here's example two. Suppose our history has a write of x to 1 and a write of x to 2, a read of x that yields 2, and a read of x that yields 1, arranged like this. Again, we write out the arrows so we know the constraints on any total order we might find. The write of x=1, because it finished in real time before the write of x=2 started, must come before it in any satisfying order. The write of x=2 has to come before the read of x that yields 2, so we have that arrow. The read of x yielding 2 finished before the read of x yielding 1 started, so we have that arrow. And the read of x yielding 1, because it saw value 1, has to come after the write of x=1 and, more crucially, before the write of x=2; we certainly can't have this read yielding 1 if it's immediately preceded by a write of x to 2, so we have that arrow as well. And because there's a cycle in these constraints, there is no order that can obey all of them, and therefore this history is not linearizable, and the system that produced it is not a linearizable system. If the history were missing any one of those constraints, it would break the cycle, and the history would be linearizable.
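Since the whole argument is "find a total order obeying the arrows, or find a cycle," here's a small Go sketch that encodes the arrows from these two examples and checks for a cycle. It's a toy for exactly these examples; the node numbering is invented here, and checking arbitrary histories is a substantially harder problem:

```go
package main

import "fmt"

// Operations are nodes and the lecture's arrows are directed edges; a
// history from these examples is linearizable exactly when the edges
// have no cycle.
func hasCycle(n int, edges [][2]int) bool {
	adj := make([][]int, n)
	for _, e := range edges {
		adj[e[0]] = append(adj[e[0]], e[1])
	}
	state := make([]int, n) // 0 = unvisited, 1 = in progress, 2 = done
	var visit func(v int) bool
	visit = func(v int) bool {
		state[v] = 1
		for _, w := range adj[v] {
			if state[w] == 1 || (state[w] == 0 && visit(w)) {
				return true
			}
		}
		state[v] = 2
		return false
	}
	for v := 0; v < n; v++ {
		if state[v] == 0 && visit(v) {
			return true
		}
	}
	return false
}

func main() {
	// Nodes: 0 = write x=1, 1 = write x=2, 2 = read x->2, 3 = read x->1.
	// Example one's arrows: w(x=1) before w(x=2) in real time; the read
	// of 2 after w(x=2); the read of 1 after w(x=1) and before w(x=2).
	ex1 := [][2]int{{0, 1}, {1, 2}, {0, 3}, {3, 1}}
	fmt.Println("example 1 linearizable:", !hasCycle(4, ex1)) // true

	// Example two adds the real-time arrow "read of 2 before read of 1",
	// closing the cycle w(x=2) -> r(2) -> r(1) -> w(x=2).
	ex2 := append(ex1, [2]int{2, 3})
	fmt.Println("example 2 linearizable:", !hasCycle(4, ex2)) // false
}
```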
Yes, maybe. I'm not sure, because, well, suppose somebody read 27? I don't know how to incorporate very strange things like that. If there's no write of 27, a read of 27 doesn't generate any constraint, at least the way I've written out the rules, although there may be some sort of anti-dependency you could construct. Okay, I will continue this discussion next week.