All right, so our next speaker is Laura Hampton, who comes to us from New York, where she helps organize the New York City Python Meetup and has also worked on the organizing team for PyGotham. She's currently active in the Warehouse project, but comes to us today to talk about the notoriously tricky problem of distributed consensus, or how you can make computers agree on things in the face of unreliable networks, garbage collection pauses, and other nasty things that make requests get lost. Yes, it's a talk about Raft, so please float Laura up onto the stage with a round of applause.

Thank you very much. My name is Laura Hampton, and I'd like to thank the organizers of North Bay Python and the sponsors for making the conference possible and for having me here to speak to you all today.

So, you want to build a fault tolerant and scalable system that can serve a lot of requests from a lot of people in a lot of places. But you can't use a single machine, because it's not reliable: it's susceptible to crashes and downtime, it needs maintenance, and if it's far away from your users, the ping times are going to be insane. So the solution is to use a cluster of machines. But how do we get them to work together? The network connections between them are not reliable; the messages between them may be duplicated, may not arrive at all, or may arrive at some arbitrary time in the future. The machines themselves may crash or fail or leave the cluster and then come back all of a sudden. The other thing is, we don't want to involve people to fix things. Chances are, if there's an algorithmic way to get your computers working together, you want to rely on that instead of having somebody come in at 3 a.m. on a Sunday, when their pager has gone off, to press buttons and plug things in.

So, in order to do this, you need a consensus algorithm. This allows you to build reliable systems from unreliable components, and it also allows the machines to recover from failures and network partitions automatically. Using a consensus algorithm increases overall system reliability, because you have several machines doing the same job. In this talk, I'm going to cover naïve solutions to consensus and why they fail. Then I'll give a brief tour of Raft, which is a more sophisticated and fault tolerant system that is resilient to several different types of failure. I'm going to briefly touch on Paxos, which is a consensus algorithm that's kind of the granddaddy of Raft and of all provably correct consensus systems. And then I'm going to speak about things to consider when deploying a distributed system like Raft.

So, in order to be correct, a consensus algorithm must display the following properties. Agreement: every correct process must agree on the same value; this is also called the safety property. Validity: the value agreed on must have been proposed by some process, so you can't have individual participants making things up or everyone trivially deciding on nothing; this is called the non-triviality property. Termination: every correct process eventually decides on a value; this is also called the liveness property. To illustrate this last property, let's say you have Alice and Bob, and Alice wants to meet Bob at a restaurant at a certain time. But they're using a faulty messaging system that might drop messages or delay them for an arbitrary length of time.
So Alice sends Bob a message and says, let's meet at Central Market at eight o'clock for dinner. And Bob responds to Alice and says, I have received your invitation, I would like to accept, and I will see you at Central Market for dinner at eight o'clock. So then Alice realizes that Bob doesn't know that she's gotten that message. So she has to respond to Bob and say, yes, I am confirming your confirmation and I will see you there. This process can continue indefinitely and results in a state called livelock, where no consensus is ever reached. Alice and Bob will never have dinner. They just keep confirming and reconfirming, being like, yes, I'll see you there, and nobody gets to eat anything.

In distributed systems there's also something called the CAP theorem, which states that it's impossible for a system to provide more than two of these three properties. Consistency, where every read receives the most recent write or an error. Availability, where every request receives a response, but there's no guarantee it's the most recent value. And partition tolerance, where the system continues to operate despite delayed, lost, or dropped messages on the network.

So now let's take a look at what could possibly go wrong. There is the fail-stop or crash, where a participant does not recover and does not rejoin the cluster. A fail-recover, where, after an arbitrary time caused by processing delays, networking delays, or a network partition, one of the participants drops out and then comes back. A network partition, where a cluster participant is separated from the others due to failure of part of the network or through a change in network topology. And a Byzantine failure, where one of the participants in the cluster displays arbitrary or malicious behavior and sends contradictory or conflicting data to the other participants.

Now, the Byzantine failure was named for a paper in which nine Byzantine generals are surrounding a city. They can only take the city if they all decide to attack at once. The messengers between them may get lost or captured. And one of these generals is a spy who is going to try to foil the other generals' plans. He may send an attack vote to the four generals planning to attack and a retreat vote to the four generals planning to retreat. In this analogy, the traitorous general is the malfunctioning machine in the cluster, sending contradictory or conflicting information to the other participants. It's a difficult problem to mitigate, and Raft is not designed to mitigate Byzantine failures, though it can handle some instances of them.

So in this talk, I'm going to use terms from the Raft paper, because terms in distributed systems can be confusing and can get overloaded. In all of the following examples, including Raft, the client sends data and is not a participant. The leader accepts data from the client. The leader sends data to its followers. All data basically flows from client to leader to follower. The log in each participant stores data, and the aim is for the logs to match. These logs are not syslog; they are not about what's going on internally in each participant, but a record of the messages that each participant has seen from the leader, to be committed to the individual logs, and they function as a state machine. In the examples that I'm going to give, logs store numbers, but they could also store system commands or database writes; they could be anything that changes the state of each participant.
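To make that terminology concrete, here is a minimal sketch, in Python, of a participant's log and state machine. The names (LogEntry, StateMachine, apply) are illustrative, not taken from any particular Raft implementation:

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    term: int      # the election term in which the leader created the entry
    index: int     # the entry's position in the log
    command: str   # a state-changing command, e.g. "set x 5"

class StateMachine:
    """Applies committed entries in order; the same log always yields the same state."""
    def __init__(self):
        self.state = {}

    def apply(self, entry):
        # Deterministic: replaying the log reproduces the state exactly.
        _, key, value = entry.command.split()
        self.state[key] = value

log = [LogEntry(term=1, index=1, command="set x 5"),
       LogEntry(term=1, index=2, command="set y 7")]
sm = StateMachine()
for entry in log:
    sm.apply(entry)
print(sm.state)  # {'x': '5', 'y': '7'}
```

If every participant applies the same entries in the same order, every participant ends up in the same state, which is the whole point of getting the logs to match.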
So let's take a look at some simple consensus algorithms and see what happens. This is two-phase commit. It is a synchronous system; this one and the following one are both synchronous. These algorithms send all messages in rounds, and the messages come in pairs, in a call-and-response pattern. The leader sends every participant a message and waits for a response, and once it has received the responses from all of its followers, it can send another message that says commit the value. So the leader sends a commit message, the followers respond, and the participants basically wait to hear from the leader in between message rounds. However, if the leader crashes while sending the commit message to its followers, some of the followers receive it and some don't, the followers' logs no longer match, and it's 3 a.m. on a Sunday, and guess what?

So let's take a look at a slightly more complex algorithm, essentially three-phase commit: we're going to add another round of messages in between the proposal and commit messages. Now we have a proposed value. The followers say, yes, we agree to this proposal. The leader then says, okay, everyone's agreed to this proposal, but we're gonna wait to commit, and the followers say, we are ready and waiting. Finally the leader sends a message that says, okay, we're all gonna commit now, the followers respond to that, and everyone commits. This means that if the leader fails, like in the last example, one of the followers can take over and the cluster can more or less keep functioning. However, there is a case in which it doesn't always work. In this example, the leader has received affirmation from the first message: all followers have indicated that they are prepared to commit. The leader sends a message that says, we're all ready to commit, get ready. Shortly after this happens, the leader crashes, before it has received all the confirmations from its followers. A participant sees that the leader has sent the commit acknowledgement, so the cluster is ready to commit, but it also sees that the leader has crashed. So it takes over and sends a commit message just as the leader wakes up and says, I didn't receive all the messages from my followers, we're gonna abort this transaction and try it again. One of the followers gets a message from the original leader, one of the followers gets a message from the follower that's stepped up as leader, and guess what?
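Here is a sketch of the naive two-phase commit flow described above, and of why it's fragile. The Follower class and send method are toy stand-ins for real participants and RPCs, not part of any real protocol library:

```python
class Follower:
    """Toy follower that votes yes to every proposal and records commits."""
    def __init__(self):
        self.log = []

    def send(self, message):
        kind, value = message
        if kind == "commit":
            self.log.append(value)
        return True  # acknowledge proposals, aborts, and commits

def two_phase_commit(followers, value):
    # Phase 1: propose the value and block until every follower votes.
    votes = [f.send(("propose", value)) for f in followers]
    if not all(votes):
        for f in followers:
            f.send(("abort", value))
        return False
    # Phase 2: tell everyone to commit. If the leader crashes partway
    # through this loop, some logs receive the value and some don't --
    # the divergence described above, with no mechanism to repair it.
    for f in followers:
        f.send(("commit", value))
    return True

cluster = [Follower() for _ in range(3)]
two_phase_commit(cluster, 42)
print([f.log for f in cluster])  # [[42], [42], [42]]
```

The whole weakness lives in that second loop: nothing in the protocol tells a follower what to do if the leader goes silent between phases.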
So now I'm going to talk about Raft, which is a more sophisticated algorithm that can recover from both of these circumstances. Raft is used in several different technologies; it's a fairly flexible algorithm. It's used in fault tolerant databases, as in MongoDB replica sets, where each replica stores its own copy of the database. Consensus is also used in lock managers that coordinate who can read and write to a resource, as in Chubby and ZooKeeper (which use the related Paxos and ZAB protocols, and coordinate systems like MapReduce). It's also used in container synchronization and scaling, as in etcd, Kubernetes, and Docker Swarm. And it's also used in systems that elect a leader to do work while the other followers are all on standby. The interesting thing about Raft is that any deterministic program can be turned into a highly available replicated service by being implemented as a replicated state machine on top of a consensus algorithm. So basically, if you have a deterministic program, you can apply it like frosting to Raft, and Raft will take care of making it replicated and fault tolerant.

So Raft is a consensus algorithm that was created by Diego Ongaro at Stanford in 2014. It is asynchronous: there are no bounds on when a message might be sent or a response received, and because it's asynchronous, it's hard to tell whether a participant has crashed or is just slow. And now I have to warn you that, according to the FLP result from 1985 (Fischer, Lynch, and Paterson), an asynchronous consensus algorithm with even one process that may fail-stop may never terminate. However, like Wikipedia, these things work in practice and not in theory.

Raft has an overall rhythm: periods of log propagation punctuated by leader elections. A term in Raft is important, and it is one election plus the following period of log propagation. Terms increase monotonically and act as a logical clock; messages in Raft are not timestamped, because computers never know what time it is anyway. Leaders in Raft serve until they fail, which triggers a new election. The cluster does not accept data from the client during the election, because there is no leader and only the leader can talk to the client. And no log propagation happens during the election, also because there is no leader.

So let's take a look at a Raft cluster. All Raft participants are identical, and any of them can serve as leader if their logs meet certain conditions. All of the participants start as followers. Raft can recover from network partitions and participant failures and is resistant to fail-recover and fail-stop of any member of the cluster.

So let's run an election. Each participant in Raft has a random election timeout, and a follower will start an election if it does not receive a heartbeat or AppendEntries message within its election timeout. If a leader discovers a participant with a larger term, it updates its term and reverts to the follower state. And elections happen within the span of the shortest election timeout, because no heartbeats are being sent and there's nothing to suppress the followers' tendency to convert into candidates. In order to start an election, a candidate votes for itself and then sends RequestVote messages to the followers. If a follower has not voted in the current term, it votes yes. If it has already voted, it votes no. If the follower's log is more complete than the candidate's, ending in a later term or at a higher index, it votes no. If the follower has heard from the leader via a heartbeat or AppendEntries message within the minimum election timeout, it votes no and does not update its term. And in terms of the timing for your Raft cluster, the time it takes for a leader to send messages to its followers must be an order of magnitude less than the election timeout, which in turn should be a few orders of magnitude less than the mean time between failures of the participants, to ensure progress.

So now we have a leader, and we can enter the log replication phase. The leader is the only participant that can accept data from the client. The leader proposes log entries to all of the followers. When a majority of the followers have responded, the leader sends a commit message to the followers, and then they all commit. If the leader does not get a response from a follower, it retries; all messages in Raft are idempotent, so they can be sent multiple times and will only take effect once. The messages contain information about the leader's log: the new entry to be committed and the entry in the leader's log preceding it. The leader never deletes entries from its own log, it only appends, and followers can't change the leader's log.
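Going back to the election for a moment, here is a minimal sketch, in Python, of how a follower might decide whether to grant its vote under the rules just described. It compresses the log-comparison rule into a tuple comparison and omits the rule about ignoring candidates while a leader is actively heartbeating; all names are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Server:
    current_term: int = 0
    voted_for: Optional[str] = None  # candidate id we voted for this term
    last_log_term: int = 0
    last_log_index: int = 0

def handle_request_vote(server, term, candidate_id, last_log_term, last_log_index):
    """Decide whether this server grants its vote. Illustrative sketch."""
    # Reject candidates from an older term.
    if term < server.current_term:
        return False
    # A newer term resets any vote cast in an older term.
    if term > server.current_term:
        server.current_term = term
        server.voted_for = None
    # At most one vote per term.
    if server.voted_for not in (None, candidate_id):
        return False
    # Refuse candidates whose log is less complete than ours:
    # compare (last log term, last log index) lexicographically.
    if (last_log_term, last_log_index) < (server.last_log_term, server.last_log_index):
        return False
    server.voted_for = candidate_id
    return True

follower = Server(current_term=2, last_log_term=2, last_log_index=5)
print(handle_request_vote(follower, 3, "s1", 2, 5))  # True: newer term, log is complete
print(handle_request_vote(follower, 3, "s2", 2, 9))  # False: already voted this term
```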
During quiet times, when the client isn't sending data, the leader sends heartbeats, to which the followers respond. The followers can take several actions when they receive an AppendEntries message from the leader. They can accept the entry, if the previous entry in their log and the previous entry in the leader's log match. They can redirect contact from the client to the last leader they heard from. They also ignore any AppendEntries message that contains a log entry they already have, and they reject any entries from an old term, so they can just discard stale messages. If the message has a higher term, the follower updates its term to match the new leader's. AppendEntries messages contain the new entry and the committed entry in the leader's log before it, which keeps the logs matched, and I'll explain why later.

However, if the log entries don't match, the follower can reject the new entry request. When that happens, the leader responds by stepping back through its own log, sending AppendEntries messages to the follower that's rejecting them, until they find the last point at which their logs match. The follower then deletes everything following that point, and the leader sends new AppendEntries messages to get the follower caught up so that it matches the leader's log. This prevents some kinds of failures: if you have a follower that's slow or not getting all the messages, this allows the leader to get it caught up with everyone else in the cluster. And if the leader fails, we trigger a new election, and if a follower fails, well, we've got others; the cluster can function normally as long as a majority of the participants have not failed.

So let's take a look at some of the messages in Raft. This is a vote message. It contains who it's from, who it's sent to, what their status is (whether they're a leader, a candidate, or a follower), the length of their log, what term they think it is, and the value for the vote. This is an AppendEntries message, which contains who it's from, who it's sent to, what their status is (we hope that they're a leader), what term they think it is, whether to commit or add to a provisional log, a success flag that the follower can set in order to accept the entry when it responds to the leader, the index of the latest log entry, the last log entry, and the new log entry.

The reason why I've been banging on about having the new and preceding log entries match is because logs in Raft are governed by a property called log safety. If two entries in two different logs have the same index and term, they store the same data. The leader creates at most one entry with a given log index during a term. Log entries never change position in the log. And if two logs have an entry with the same index and term, the logs are identical in all entries up through that index.

Raft is also governed by a property called leader completeness, and this is enforced by voting no for candidates that have short logs, which prevents them from becoming leader. Since commit messages are confirmed by a majority of the cluster, an entry must be stored by a majority of the participants, and this selects for leaders with the longest and most complete logs.
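Here is a minimal sketch of the follower-side AppendEntries consistency check just described: the follower only accepts new entries if it already holds the entry that precedes them, with a matching term. Field and function names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Follower:
    current_term: int = 0
    log: list = field(default_factory=list)  # (term, command) pairs; slot i holds index i+1

def handle_append_entries(f, leader_term, prev_index, prev_term, entries):
    """Follower-side consistency check and append. Illustrative sketch."""
    # Reject messages from stale leaders outright.
    if leader_term < f.current_term:
        return False
    # A higher term means a new leader; adopt its term.
    f.current_term = leader_term
    # Consistency check: we must already hold the entry preceding the new
    # ones, with a matching term; otherwise the leader backs up and retries.
    if prev_index > 0:
        if len(f.log) < prev_index or f.log[prev_index - 1][0] != prev_term:
            return False
    # Drop any conflicting entries after the match point, then append.
    del f.log[prev_index:]
    f.log.extend(entries)
    return True

f = Follower()
print(handle_append_entries(f, 1, 0, 0, [(1, "set x 1")]))  # True
print(handle_append_entries(f, 1, 1, 1, [(1, "set y 2")]))  # True
print(handle_append_entries(f, 1, 5, 1, [(1, "set z 3")]))  # False: gap in the log
```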
Raft can get into a situation called a split vote: candidates need a majority of the cluster to vote for them, and a split vote can't elect a leader. In this case, since heartbeats are not being sent, one of the followers will reach its random election timeout, convert to a candidate, update its term, and send vote requests to the rest of the cluster, which will hopefully vote for it, and we'll have a new leader again.

Raft can also recover from network partitions. In this case, two of the participants have gotten separated from the rest of the cluster, and messages are arriving slowly or possibly not at all. One of the separated followers has timed out and become a candidate; however, it cannot get elected leader, because it can't reach a majority. Once the partition is resolved, the leader will contact the candidate, which will revert to the follower state, and the leader can get the candidate's and the other separated node's logs back up to date.

Raft also allows for an interesting process called log compaction, because logs can become too long and unwieldy to store and replay. The solution is something called a snapshot, where the entire log is written to stable storage as a single blob. Snapshots contain metadata about the last term and log index. Snapshots can be sent to new or slow followers to catch them up as a single block, and followers can snapshot at any time without leader permission. This keeps the Raft cluster moving ahead at a fairly steady clip, because, depending on the level of virtualization, it can take between one and ten milliseconds to write a new entry to stable storage, and snapshots allow a long stretch of log to be written as a single chunk in one shot.

Raft clusters also automatically handle membership changes, because it may be necessary to add or remove large numbers of participants due to config changes, updates, or maintenance. In this case it's important to avoid having two majorities that elect two different leaders, so during this process, log entries are sent to and committed by both clusters, the old and the new, and a participant from either cluster can serve as leader. Agreement on elections and entry commitment requires majorities from both clusters (there's a small sketch of this rule at the end of this section), and because it can take time to swap out and add the new participants, the cluster can continue to serve client requests while this maintenance is going on. In order to effect this change, a configuration entry called C-old,new is sent and committed by the leader to both conjoined clusters, so even if the leader fails, a follower with only the new configuration cannot be elected. Finally, the leader sends out a configuration entry called C-new, and then the servers are updated and the old ones can be shut down.

So we have reached the end of Raft. In order to get data out of a Raft cluster, the leader sends a round of heartbeats; once a majority of the followers have responded, it will serve the client's request. So now we've seen how Raft works and how it can recover from certain kinds of failure. It can recover from some Byzantine failures, but it's recommended that you run Raft in secure network environments, and a lot of the time, with Docker Swarm, you will connect the nodes cryptographically, so they exchange keys, so that you can't have people adding malicious nodes to your cluster, which would be bad.
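Going back to membership changes for a moment, here is the small sketch promised above of the joint consensus rule: while C-old,new is in effect, winning an election or committing an entry requires separate majorities of both the old and the new configurations. The function names are illustrative:

```python
def has_majority(voters, config):
    """True if a strict majority of the servers in config are among voters."""
    return len(set(voters) & set(config)) > len(config) // 2

def joint_agreement(voters, old_config, new_config):
    # While C-old,new is in effect, winning an election or committing an
    # entry requires majorities in *both* configurations -- never a single
    # combined majority, which is what prevents two disjoint leaders.
    return has_majority(voters, old_config) and has_majority(voters, new_config)

old, new = {"a", "b", "c"}, {"c", "d", "e"}
print(joint_agreement({"a", "b", "c", "d"}, old, new))  # True: majorities in both
print(joint_agreement({"a", "b", "c"}, old, new))       # False: only one new-config server
```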
So we have earned ourselves a trip to a Greek island: the island of Paxos. Paxos is sort of the granddaddy of Raft. It is a consensus algorithm that provides an early proof of safety in a fault tolerant distributed consensus protocol, and it was described in a paper published by Leslie Lamport in 1998 called The Part-Time Parliament. He described the algorithm by imagining that in ancient Greece, on the island of Paxos, there was a parliament, except the parliamentarians didn't prioritize their job as lawmakers very highly; they would wander in and out of the legislative chamber and communicate by kind of flaky messengers. The idea behind Paxos is that as long as a majority of the members of the cluster have accepted a message, it's considered committed, and it has the same kind of propose-and-commit flow as Raft, where the leader sends an append message, the followers respond, and then the value gets committed.

Raft was created in reaction to Paxos, as a more understandable consensus algorithm. Paxos is more academic and kind of narrower in scope than Raft, and Raft is more practical. Raft is as safe as Paxos, but Paxos is focused more on the process of getting all the participants to agree to store a single value, as opposed to a series of values. And one of the things about Paxos is that there isn't really a mechanism for followers who might have missed or not committed messages to get their logs caught up with the rest of the cluster. Leaders in Paxos are also not distinguished, and any member of the cluster can send a message saying, hey guys, commit this.

So let's see how Raft works when it's time to actually use it. When setting up a cluster, it's important to keep the following things in mind: the need for reliability, the frequency of planned maintenance and how it might affect the system, risk, performance, and cost. A Raft system with 2F+1 nodes can tolerate F failures; however, increasing the number of nodes does not necessarily increase fault tolerance. Three of the five members of a five-member cluster must be available for a quorum, but four of six must be available in a six-member cluster (there's a quick sketch of this arithmetic below).

There are also performance considerations when deploying Raft clusters. Increasing capacity by adding more followers puts pressure on the leader's computational and bandwidth resources. There are constraints on performance; Raft is not designed as a high-performance algorithm, and higher-performance distributed algorithms tend to rotate leaders rather than use a single stable leader. The leader in Raft is kind of a bottleneck on performance, because its outgoing bandwidth may restrict the number and size of messages that can be sent to the followers, and leader performance problems can also limit the throughput of the algorithm.

The other thing to think about is network latency. It affects performance, and it can be quite significant: round trip times within a single data center may be on the order of one millisecond, but across the United States they're on the order of 45 milliseconds, and between New York and London it's around 70 milliseconds. So that's an important consideration when you think about where you're gonna locate your replica nodes.
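Here is the promised sketch of that quorum arithmetic: a cluster of n nodes needs a majority of n // 2 + 1 to make progress, so it tolerates the loss of n - (n // 2 + 1) nodes, and a sixth node buys no extra fault tolerance over five:

```python
def quorum(n):
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

def failures_tolerated(n):
    """The cluster stays available as long as a quorum survives."""
    return n - quorum(n)

for n in range(3, 8):
    print(f"{n} nodes: quorum of {quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
# 3 nodes: quorum of 2, tolerates 1 failure(s)
# 4 nodes: quorum of 3, tolerates 1 failure(s)
# 5 nodes: quorum of 3, tolerates 2 failure(s)
# 6 nodes: quorum of 4, tolerates 2 failure(s)  <- no gain over five nodes
# 7 nodes: quorum of 4, tolerates 3 failure(s)
```

This is why Raft clusters are usually deployed with an odd number of nodes.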
And one thing to keep in mind is the idea of failure domains: that is, the set of components affected by a single failure. If you're running your whole Raft cluster on one machine and the machine fails, you're out of business. If you're running your cluster on one rack in a data center using a single power supply, that power supply bounds your failure domain, and if it fails, you're done. If several racks in a data center share a single piece of networking equipment and that networking equipment fails, your cluster's down. If a data center is connected via a single piece of fiber-optic cable and the cable gets cut, then you're offline. And a set of data centers in a geographical area could be affected by a natural disaster, and if you have that natural disaster, then you're also offline. As you can see, as the distance between the replicas increases, the latency increases, but the size of the failure you can tolerate also increases. However, if all the users of a cluster are in a single geographical area, it might make sense to locate the cluster near them so that they have low ping times. And if there's a major disaster that affects them, then your users are gonna be offline too, so everyone can eventually recover from the disaster, but you don't have to provide service for them because they don't have internet.

So sometimes people ask me where Raft got its name, and I found this from Diego Ongaro on a mailing list. It's a sweet little story, so I thought I would share it with you. There's a few reasons we came up with the name Raft. It's not quite an acronym, but we were thinking about the words reliable, replicated, redundant, and fault tolerant. We were thinking about logs and what can be built with them. We were thinking about the island of Paxos and how to escape it. As a plus, we were using the randomly generated name Chizomi in the paper before we came up with the name Raft in September 2012. The name appeared just over 100 times in our paper submission back then, so switching to the shorter name actually helped shrink the paper down quite a bit. If you want even more detail, we had trouble coming up with a good name, so we made it an explicit agenda item during a RAMCloud meeting. I found two photos of the whiteboard during and after that meeting and attached them here. Looks like the top contenders were Raft, Knox, as in Fort Knox I guess, Redondo, and cloud sense. No clue. I don't remember the details of how we ended up with Raft, since it didn't obviously win the vote, but I do remember the name caught on really quickly. People seemed to like it almost right away. I'm so glad it's not called Redondo. So thank you very much.

Hey, thank you Laura, thank you, thank you.