Hello, everyone. I'm Tianye from DatenLord. Today, I'm going to talk about how we can mathematically verify distributed algorithms using formal methods.

First, let me introduce our company. We are a company focusing on storage solutions for sky computing. We are trying to integrate software and hardware to build a high-performance storage solution that works across clouds. Our open-source project Xline is a geo-distributed storage system.

Okay, now let's start our talk. First, the background: what is a geo-distributed system? As you can see in the graph, compared with a distributed system in a single data center, a geo-distributed system has its nodes in multiple data centers. This gives the system high availability and fault tolerance, since even if an entire data center is cut off, the system can still function normally. But because the distances between nodes are quite long, geo-distributed systems have relatively high latency between replicas.

We often abstract a distributed system as a replicated state machine, since it is replicated: all the nodes have the same content on them. A consensus algorithm, or consensus protocol, is how we reach agreement on a single value for an instance in a cluster. Once a value for an instance has reached consensus, it must be durable, so it can survive crashes. Establishing consensus usually requires at least one round trip between replicas.

How can we reach consensus between nodes? For example, one node, which we can call the leader, sends "persist x = 1" to all replicas. Each replica processes this message and replies with an acknowledgment. Now the leader knows that every replica in the cluster has persisted x = 1, and consensus is reached. So we consumed one round trip between replicas to reach consensus. This is a simplified version of a consensus protocol; real consensus protocols are far more complex than this simple picture.

Okay, let's see how a consensus algorithm for a distributed system acts in the real world. We have a cluster consisting of three replicas, plus a client. The client sends a request to the cluster. Usually it sends the request to a single replica, which we can call the leader. The leader then replicates the command in the request to all replicas; this consumes one RTT. When consensus is reached, the leader replies with the result to the client. In this picture we can see that, counting the client, the whole process costs two RTTs to finish. Two round-trip times is quite expensive, since the latency between nodes is high. So the disadvantage of traditional consensus algorithms in geo-distributed systems is that they add huge latency compared to unreplicated systems or single-data-center clusters.

So how can we solve this problem? What is CURP? CURP (Consistent Unordered Replication Protocol) is a new replication protocol introduced by a paper at NSDI 2019. The original CURP is a replication protocol, not a consensus protocol: it allows clients to replicate requests that have not been ordered yet, as long as they are commutative. We extended the original CURP protocol into a consensus protocol and used it in our project Xline. In Xline, CURP acts as a front end, pairing with Raft to maintain data consistency.
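Before walking through CURP's procedure, here is a minimal sketch of the simplified one-round-trip replication described above, just to make the message flow concrete. It is written in Rust, and all the names and types are our own illustration, not Xline's actual code.

```rust
use std::collections::HashMap;

// One replica's persisted key-value state.
struct Replica {
    store: HashMap<String, i64>,
}

impl Replica {
    // Persist the command and acknowledge, like "persist x = 1" -> ack.
    fn persist(&mut self, key: &str, value: i64) -> bool {
        self.store.insert(key.to_string(), value);
        true
    }
}

fn main() {
    let mut replicas: Vec<Replica> = (0..3)
        .map(|_| Replica { store: HashMap::new() })
        .collect();

    // The leader sends "persist x = 1" to every replica and counts the
    // acknowledgments. In reality these are network messages that can be
    // lost or delayed, which is what real protocols must handle.
    let acks = replicas
        .iter_mut()
        .map(|r| r.persist("x", 1))
        .filter(|&ack| ack)
        .count();

    // Every replica acknowledged, so the value has reached consensus
    // after one round trip between the leader and the replicas.
    assert_eq!(acks, replicas.len());
    println!("consensus reached on x = 1 in one round trip");
}
```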
Let's talk about CURP's procedure. We still have a cluster consisting of three replicas and a client. Now, instead of the client sending the request to a single replica, it sends the request to all the replicas in the cluster. Each replica receives the client request, processes it, and replies with the result to the client. From these replies, the client knows whether the command has reached consensus, and at this point the client is done with the process. The only remaining step is for the cluster to sync the command in the background.

This looks quite simple, but unfortunately, in the real world it is oversimplified. So again we have a client and some replicas, one of which is the leader of the cluster. We have N nodes in total, where N = 2F + 1 and F is the maximum number of faulty replicas the cluster can tolerate.

Now the client wants to replicate a command, x = 1. It sends the request, including the command, to all replicas. Each replica saves x = 1 into its spec pool (speculative pool), and then replies with the result to the client. If the client gets at least F + (F + 1) / 2 + 1 positive replies (with integer division), the command x = 1 is committed.

If the client then wants to replicate a new command, x = 2, it sends the request to all the replicas. But the replicas check their spec pools and find that x = 1 is already there. Since x = 1 conflicts with x = 2 (they have the same key), these replicas will reject the request. But if the client wants to replicate a command y = 3, then since y is not in any spec pool, the request can be accepted, and the replicas reply with the result.

So why can't we accept multiple commands that conflict with each other? Suppose some replicas receive x = 1 first and x = 2 later, but other replicas receive x = 2 first. In which order should we serialize those two commands? There is no way to decide, so we cannot have two conflicting commands in the spec pool.

Replicas are syncing commands in the background. At some point, x = 1 has been synced across the cluster, and x = 1 can be removed from the spec pool. Now, if a client wants to replicate a new command on x, such as x = 3 or x = 4, it can be successfully replicated.

We have a leader in our cluster, so why do we need it? Each replica may have a different spec pool, since replicas may receive client requests in different orders. So in which order should we serialize the commands? We have the leader decide. The leader always has the reliable, latest information, because the client can only consider a command committed after it receives the leader's reply. Other replicas can have different spec pools; that doesn't matter. The leader also speculatively executes the command and attaches the result to its reply, so the client can only get the result from the leader's reply. This is the normal procedure of the CURP consensus protocol.

Now let's talk about the failure recovery procedure. The failure recovery, or leader change, procedure consists of the five steps below. First, we elect a new leader; this is done by the backend protocol, such as Raft, and we simply reuse the Raft leader. Then the new leader needs to gather at least F + 1 spec pools. After the new leader has enough spec pools, it picks the commands that appear in at least (F + 1) / 2 + 1 of those spec pools. The new leader puts them into its new spec pool and syncs them using the backend protocol. This is a brief introduction to the CURP consensus protocol.
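To make the spec pool and the thresholds concrete, here is a minimal sketch in Rust. The types and names are our own, and the conflict rule (same key means conflict) and formulas (with integer division) follow the description above; none of this is Xline's actual implementation.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Clone, PartialEq, Eq, Hash)]
struct Command {
    key: String, // two commands conflict iff they touch the same key
    value: i64,
}

#[derive(Default)]
struct SpecPool {
    cmds: Vec<Command>,
}

impl SpecPool {
    // Accept the command only if no conflicting (same-key) command is
    // already waiting to be synced; otherwise reject it.
    fn try_accept(&mut self, cmd: Command) -> bool {
        if self.cmds.iter().any(|c| c.key == cmd.key) {
            return false;
        }
        self.cmds.push(cmd);
        true
    }

    // Called once the backend protocol has finished syncing the command.
    fn remove_synced(&mut self, cmd: &Command) {
        self.cmds.retain(|c| c != cmd);
    }
}

// Superquorum size used by the client, following the talk's formula.
fn superquorum(f: usize) -> usize {
    f + (f + 1) / 2 + 1
}

// Recovery: given at least f + 1 gathered spec pools, keep the commands
// that appear in at least (f + 1) / 2 + 1 of them.
fn recover(pools: &[SpecPool], f: usize) -> Vec<Command> {
    let threshold = (f + 1) / 2 + 1;
    let mut counts: HashMap<&Command, usize> = HashMap::new();
    for pool in pools {
        // Count each command at most once per pool.
        let unique: HashSet<&Command> = pool.cmds.iter().collect();
        for cmd in unique {
            *counts.entry(cmd).or_insert(0) += 1;
        }
    }
    counts
        .into_iter()
        .filter(|&(_, n)| n >= threshold)
        .map(|(c, _)| c.clone())
        .collect()
}
```

For example, with F = 1 (a three-node cluster), superquorum(1) = 3, so the client needs positive replies from all three replicas to commit in one RTT, and recovery keeps the commands that appear in at least 2 of the F + 1 = 2 gathered spec pools.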
So how can we know whether it is correct? We use TLA+ to check the correctness of our new CURP consensus protocol. So what is TLA+? TLA+ is a language created by Leslie Lamport, who is also the creator of the Paxos consensus algorithm. Using TLA+, we abstract systems into mathematical models with finite states, and then we use the TLC model checker to check all the possible states against given invariants and properties.

How can we abstract a program in TLA+? A normal program is a sequence of instructions that tells the computer how to perform a specific task. A TLA+ model is a mathematical description that tells a human, or a tool, what the possible behaviors and properties of a system are. In TLA+, we are not writing a series of procedures; we are writing how the program changes from one state to another. A state in TLA+ is a snapshot of all the variables in the system at a given point in time, and each state transition is atomic.

The key idea of abstracting a model in TLA+ is to simplify the abstraction by hiding or ignoring details that are not relevant to the properties we want to verify. Take a traffic light system, for example: we have three possible states, red, yellow, and green. What we need to do is define how the light is allowed to change. We can change it from red to green, from green to yellow, and from yellow to red, but we do not allow it to change from red to yellow. That is what we do when abstracting a TLA+ model. We are not showing the TLA+ code here, because this is not a TLA+ tutorial; we are not focusing on the syntax of TLA+, only on the core idea of mathematically abstracting a system in TLA+.

Let's see how we can abstract the CURP consensus protocol into a TLA+ model. We define several actions to describe the state transitions. First, we have an action for the client sending a request to all replicas; this is an atomic step, so the process of the client choosing a command and sending the message to all replicas is a single state change. The next action is the replica receiving a client request, checking for conflicts in its spec pool, and replying with the result. Then the client gathers enough replies from the replicas to decide whether the command is committed. These three actions form the normal procedure of the CURP consensus protocol. Then we have the cluster syncing commands action, which is done by the backend protocol. The last action is the leader change action, which is also the failure recovery action.

Let's talk about these actions one by one. First is the client sending request action. This action is quite simple: we just choose a command and send it, in a request, to all replicas.

Next is the replica receiving client request action. Since every replica has its own network and internal storage, we can't make multiple replicas act at exactly the same moment, so we only focus on processing one message on a single replica in one state transition. In this action, we first check the spec pool to see whether a conflicting command is already there. If there is no conflict, we put the command into the spec pool. Then, if this replica is the leader, it speculatively executes the command and starts the process of replicating it. The replica also replies with the result to the client.

Then there is the client gathering replies action. If the client receives at least F + (F + 1) / 2 + 1 positive replies, the command can be considered committed. If the client receives at least (F + 1) / 2 + 1 negative replies, where a negative reply means there is a conflict in that replica's spec pool, the command can never be committed.

Next is the syncing command action. Actually, it is better described as a "sync finished" action. Since syncing a command in the real world takes time, this action jumps directly to the state where the command has finished syncing; it is not a procedure but an atomic action. The syncing itself is done by the backend protocol, so we omit most of the procedure and only model the step where a command is removed from the spec pool.

The final one is the leader change action. For the same reason, we don't care about how the leader changes; we only care about what we should do after the leader changes. When a new leader is elected, it should gather at least F + 1 spec pools, which can include its own. Then it should pick the commands that exist in at least (F + 1) / 2 + 1 of those spec pools. Finally, the new leader puts them into its spec pool. These are the complete action details of the CURP consensus protocol in TLA+.
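We are not showing the actual TLA+ here, but to give a feeling for what atomic state transitions mean, below is a rough sketch of a few of these actions written as pure functions in Rust. Everything here is our own illustration; the real specification is the TLA+ document in Xline's repository, and details such as conflict checking and reply gathering are omitted.

```rust
use std::collections::HashSet;

type Cmd = &'static str; // e.g. "x=1"; conflict rules are omitted here

// A "state" is a snapshot of all variables, like a state in TLA+.
#[derive(Clone)]
struct State {
    sent: HashSet<Cmd>,            // commands the client has broadcast
    spec_pools: Vec<HashSet<Cmd>>, // one spec pool per replica
    synced: HashSet<Cmd>,          // commands the backend protocol has synced
}

// Action: the client chooses a command and sends it to all replicas.
// Choosing and sending together form one atomic state change.
fn client_send(s: &State, cmd: Cmd) -> State {
    let mut next = s.clone();
    next.sent.insert(cmd);
    next
}

// Action: a single replica processes a single message in one transition.
fn replica_receive(s: &State, replica: usize, cmd: Cmd) -> State {
    let mut next = s.clone();
    if next.sent.contains(cmd) {
        next.spec_pools[replica].insert(cmd);
    }
    next
}

// Action: "sync finished". We skip how the backend protocol syncs and jump
// straight to the state where the command is synced and leaves the pools.
fn sync_finished(s: &State, cmd: Cmd) -> State {
    let mut next = s.clone();
    next.synced.insert(cmd);
    for pool in &mut next.spec_pools {
        pool.remove(cmd);
    }
    next
}

fn main() {
    // One possible behavior: a chain of atomic transitions. A model checker
    // like TLC explores every reachable chain and checks the properties.
    let s0 = State {
        sent: HashSet::new(),
        spec_pools: vec![HashSet::new(); 3],
        synced: HashSet::new(),
    };
    let s1 = client_send(&s0, "x=1");
    let s2 = replica_receive(&s1, 0, "x=1");
    let s3 = sync_finished(&s2, "x=1");
    assert!(s3.synced.contains("x=1") && s3.spec_pools[0].is_empty());
}
```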
Now we have to prove that this model is correct. How can we do this? We have to ensure a key property of the model: if the client considers a command committed, the command will eventually be synced. If the command is not synced yet, it will be synced in the next second, or tomorrow; we don't know the exact time, but it will eventually be synced.

With that, the TLA+ specification of the CURP consensus protocol is complete. You can find the whole document in Xline's GitHub repository. We can then use the TLC model checker to check that every reachable state meets the requirements of this property. CURP is a quite simple algorithm, but it can hugely improve the latency of geo-distributed systems.

As the last thing, let me introduce our open-source project Xline. Xline is our new geo-distributed KV store for managing metadata. It is compatible with the etcd API and is geo-distribution friendly. It is also compatible with Kubernetes. You can check it out in our GitHub repository. Thank you for watching.