First, a little bit about myself. This is my GitHub handle, and I'm a maintainer of the open-source etcd project.

What is this presentation for? I'm hoping this talk will be helpful if you know Raft and you want to start reading, or have just started reading, the raft package's source code — it should give you a nice starting point. For those who want to understand the relationship between etcd and Raft — how etcd uses raft, and why Raft is important to etcd — we'll cover that too. And if you're writing your own application and want to embed the raft package, how etcd uses raft can serve as an example of how your application could use it.

Here's today's agenda. First, a very quick recap of Raft, so everybody is on the same page. Second, how etcd uses raft. Third, we'll diverge a little from the second point and look at the implementation details of the raft package. After that we'll revisit the second point, because by then we'll have more details to talk about. Last, we'll go over some of the ongoing efforts I'm aware of in raft.

All right, the Raft recap. Quickly, some concepts first. What is a quorum? A quorum is basically a majority, so if Q1 and Q2 are both quorums, their intersection is not empty. For example, if you have a cluster of size one — one server — the quorum is one. If you have two servers, the quorum is two. But if you have three servers, the quorum is not three, it's two, because you can pick any two out of the three, and any two such quorums share at least one node. In this example with five servers in the cluster, the quorum is three: these three form a quorum, and the next time these other three form a quorum, and they always have an intersecting node.

The nice thing about the Raft consensus protocol is that the entire cluster can make progress as long as there is agreement among a quorum. What does that mean? In this example you don't need all five servers to be available to serve the client. You can have at most two servers unavailable at a given moment, but as long as you have a quorum — which is three — and they agree on something, the cluster as a whole can still serve the client.
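To make the majority rule concrete, here's a tiny sketch (my own illustration, not code from the raft package) that computes quorum sizes:

```go
package main

import "fmt"

// quorum returns the majority size for a cluster of n voting members.
func quorum(n int) int { return n/2 + 1 }

func main() {
	for _, n := range []int{1, 2, 3, 5} {
		fmt.Printf("cluster size %d -> quorum %d\n", n, quorum(n))
	}
	// Any two majorities of the same cluster must overlap: with n=5,
	// two sets of 3 servers always share at least one node.
}
```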
So what is the problem Raft tries to solve? Imagine you have a collection of servers, and each server is a state machine. etcd, for example, is a key-value store — you can write keys and read keys — so it is a state machine. You have a cluster of servers, each one a state machine. When a client interacts with this cluster, how do you make sure that, from the client's point of view, this cluster of servers behaves like a single, reliable state machine? For example, at one moment the client issues a write command — it writes a key-value pair to the state machine — and one of the servers in the cluster, say server A, serves this request. The next moment, the client wants to read that key, but server A is unavailable at that time, so a different server, say server B, serves the client. How do you make sure server B's local state machine has the same state, so that it returns the correct result? That's the problem: how do you achieve a replicated state machine?

Now, the nice thing about a state machine is that, given its current state and its current input, its next state and its output are determined. So if these state machines all start from the same initial state — which you have to make sure of: whatever bootstrapping process you use, the servers must start with the same initial state — and if you can somehow make sure that each server also sees the same input sequence, then these state machines are guaranteed to always have the same state and the same output. So the problem becomes: how do you achieve this?
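To illustrate that determinism argument, here's a minimal sketch (my own illustration, not etcd code) of a deterministic key-value state machine — same initial state plus same input sequence yields the same final state on every server:

```go
package main

import "fmt"

// Command is one input to the state machine, e.g. a client write.
type Command struct{ Key, Value string }

// KV is a deterministic key-value state machine.
type KV struct{ state map[string]string }

func NewKV() *KV { return &KV{state: map[string]string{}} }

// Apply is deterministic: the next state depends only on the current
// state and the input command.
func (kv *KV) Apply(c Command) { kv.state[c.Key] = c.Value }

func main() {
	inputs := []Command{{"a", "1"}, {"b", "2"}, {"a", "3"}} // the replicated log
	serverA, serverB := NewKV(), NewKV()                    // same initial state
	for _, c := range inputs {                              // same input sequence
		serverA.Apply(c)
		serverB.Apply(c)
	}
	fmt.Println(serverA.state, serverB.state) // identical final states
}
```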
How does Raft do it? The Raft consensus protocol achieves this by managing a replicated write-ahead log. Whatever request the client wants to execute on the state machine first needs to be written to a log, and that log is managed by the consensus protocol — in our case, Raft. It's the consensus protocol's job to make sure these logs are coherent, that is, the same, across the different servers in your cluster. Because the logs are the input sequences to the state machines, if they are the same, and they are fed into state machines with the same initial state, the problem is solved.

Now, how does Raft actually do it? Roughly speaking, it has two parts: first, leader election; second, how the leader replicates its log to the rest of the servers in the cluster.

Let's talk about leader election first. A Raft node in the cluster can be a candidate, a follower, or a leader. And what is a term? A term is basically a logical clock in Raft. At the start of each term there's an election. Each node can vote for one node in a given term, and when a node gets the majority of the votes — remember the quorum we talked about, so votes from a quorum — it becomes the leader. So you can have at most one leader per term. Once a leader is elected, it manages the rest of the term: it periodically sends out heartbeat messages so everybody else knows the leader is still present.

That's the first part. Now we have a leader, and the leader is responsible for managing the log we talked about earlier. Whenever a client request comes to the leader, that request becomes a new Raft log entry, and the leader appends that entry to its own log. The leader also keeps trying to replicate its own log to the other Raft nodes in the cluster. The log entries are indexed — entry 0, 1, 2, 3, 4, and so on — and the leader monitors everybody's progress. Say node A has entries up to log index 3, while node B has more, entries 0 up to 6. When the leader finds out that a majority of the nodes in the cluster have a particular log entry, that entry is marked as committed. Once an entry is marked as committed, it is safe for everybody to apply it.

In our example, say the leader has 10 log entries, 0 through 9, and it's replicating them to two other nodes, node A and node B. Node A has entries 0 through 3, and node B has 0 through 4. Since the leader and node B both have log entry 4, entries 0 through 4 are all committed, because a majority of the nodes have them. Once committed, it's safe to apply those entries: each node learns from the leader which entries are committed, and applies them locally.

All right, now let's talk about how etcd uses raft. This graph is basically the same as the one we just saw — remember, we already explained how Raft helps solve the replicated state machine problem. Here is a client. One example application is etcdctl, the command-line tool, which embeds the etcd client library. Another example is the Kubernetes API server, which also embeds clientv3 to talk to the etcd cluster.

When a client sends a request to a server, the request is routed to the Raft consensus module. For this example, let's say this server is a follower in the cluster, not the leader. Once the request is routed to the raft module, it is forwarded to the Raft leader. The leader appends the request to its own Raft log and then tries to replicate that log entry to the other nodes, including the follower shown here. Once the leader knows that a majority of the nodes have that particular Raft entry, the entry is marked as committed. Now this server knows that log entry is committed, so it can apply it to its state machine, which in etcd is the multi-version concurrency control (MVCC) key-value store. Remember, that log entry is a request from the client — for example, read a key, or write a key-value pair. Whatever the applied result is — "write successful" for a write, the value of the key for a read — it gets routed back to the client.

One quick note: because a server can crash and restart, it's very important that both the input to the state machine — the log — and the state machine itself are persisted on disk. That way, after a crash and restart, the server can replay what came before and get back to the correct state. We'll revisit this graph later, after we talk about the raft implementation.

The design philosophy of the raft package in etcd is to keep it minimalistic. That means the raft package does not implement the network transport between the Raft nodes — remember, there are messages flowing between the nodes, like the leader replicating logs to a follower — and it does not implement the persistent storage layer either. Whoever uses the raft package needs to implement the network transport and the persistent storage layer themselves, since those pieces are deliberately left out of the package. That makes the package more flexible, because you can choose your own implementations. It also makes the behavior of the raft package deterministic, which is a very nice property: deterministic behavior usually means it's easier to implement, easier to test, and easier to argue about correctness.
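As a rough idea of what "bring your own storage and transport" looks like, here's a bootstrap sketch along the lines of the raft package's README (the import path differs across etcd versions, and the config values here are only illustrative):

```go
package main

import "go.etcd.io/etcd/raft/v3"

func main() {
	// The application supplies the storage; MemoryStorage is the
	// in-memory implementation shipped with the package, typically
	// mirrored to the application's own write-ahead log on disk.
	storage := raft.NewMemoryStorage()
	c := &raft.Config{
		ID:              0x01,
		ElectionTick:    10, // election timeout, in Tick() units
		HeartbeatTick:   1,  // heartbeat interval, in Tick() units
		Storage:         storage,
		MaxSizePerMsg:   4096,
		MaxInflightMsgs: 256,
	}
	// Start a three-node cluster. Delivering messages between these
	// peers is the application's job; raft only produces them.
	n := raft.StartNode(c, []raft.Peer{{ID: 0x01}, {ID: 0x02}, {ID: 0x03}})
	defer n.Stop()
}
```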
So the raft package itself is actually modeled as a state machine. Let me go back to this graph. Now we're focusing just on the raft package implementation — we're not talking about this key-value state machine here, we're talking about raft itself, and we're saying raft is modeled as a state machine. Keep that in mind.

All right. If raft is a state machine, then what is its state? What are its input and output? How does it transition between states? Now we're going to dive a little into the code. A quick disclaimer: the code I'm showing here is mostly pseudocode, so it fits on the screen and you can still read it, but it looks very similar to the actual source code. When you see this and later read the source, it will feel familiar and be easier to understand.

The state is wrapped in a struct called raft. You can see there's a node ID. There's a term — remember, the term is like a clock. There's a vote: each member can vote for one node per term, and this field records which node it voted for. There's the log we talked about — remember, the Raft consensus protocol manages the log to make sure the log on each node is the same. And the state field says whether this raft node is a leader, a candidate, or a follower. There's also an interesting function field called step. It has different implementations depending on the role: if you are the leader, it's the stepLeader function; if you are a follower, it's stepFollower; and so on.

Now, what's the input to the raft state machine? A message. What is a message? First, it has a type — messages come in different types. For example, there's a vote message, for requesting a vote, or an append message, which the leader uses when it replicates its logs to other nodes. To is the destination node; From is where the message is from; we'll skip the rest. And a message can carry a whole batch of entries: for example, when the leader replicates its log to the followers, it includes the Raft log entries here.

What's the output of this raft state machine? The output is wrapped in a struct called Ready, meaning these things are ready to be consumed — to be processed — by whoever is using the raft package. The first field is the state of the raft node; this state needs to be saved to storage because, remember, the raft package itself does not implement persistent storage. Then there are the entries that need to be saved to storage. Then the committed entries, which, remember, are safe to apply to the key-value store. And then the messages: the raft package does not implement network transport, so whoever is using the package needs to consume these messages and actually deliver them to their destinations. That's why they're included in the output.
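In code, the state, input, and output look roughly like this — pseudocode mirroring the slides, heavily abridged from the real raft struct and the raftpb.Message / raft.Ready definitions:

```go
package sketch

type StateType int

const (
	StateFollower StateType = iota
	StateCandidate
	StateLeader
)

type Entry struct {
	Term, Index uint64
	Data        []byte // e.g. a serialized client command
}

type MessageType int

const (
	MsgProp MessageType = iota // propose client data
	MsgApp                     // leader replicating log entries
	MsgVote                    // candidate requesting a vote
)

// Message is the input to the raft state machine.
type Message struct {
	Type    MessageType
	To      uint64  // destination node
	From    uint64  // sender
	Entries []Entry // log entries carried by the message
	Commit  uint64  // leader's view of the committed index
}

// HardState is the part of the state that must be persisted.
type HardState struct{ Term, Vote, Commit uint64 }

// Ready is the output of the raft state machine.
type Ready struct {
	HardState                  // must be saved to stable storage
	Entries          []Entry   // entries to append to stable storage
	CommittedEntries []Entry   // safe to apply to the key-value store
	Messages         []Message // must be delivered by the application
}

// raft is the state machine itself.
type raft struct {
	id        uint64    // node ID
	Term      uint64    // the logical clock
	Vote      uint64    // who this node voted for in Term
	log       []Entry   // the replicated log
	state     StateType // follower, candidate, or leader
	lead      uint64    // current leader's node ID
	peers     []uint64  // the other nodes in the cluster
	committed uint64    // highest index known to be committed
	msgs      []Message // outbox, drained into Ready.Messages
	step      func(r *raft, m Message) // stepLeader / stepFollower / ...
}
```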
Now, how do the states transition? The core function is Step, with a capital S. The Step function advances the raft state machine using the given input — remember, the input of the state machine is a message. You feed the message to this function, it advances your raft state machine to the next state, and it produces the output we just saw. If you look into this function, it eventually calls another function: at the beginning there's some common logic that runs no matter whether you're a leader, a candidate, or a follower, and then it calls the lowercase step function, which differs depending on your state — if you're the leader, this function is stepLeader, and so on.

Now let's go through an example. Say we want to propose a change to the KV store. Let's quickly go back to these slides: when a client requests to execute a command against the key-value store — for example a read or a write — the server proposes a change to raft. That's the example we're going to trace.

The first function is the Propose function. The data argument here is serialized data: if you want to write something, the key and value are serialized into a byte slice. So this data corresponds to the client's command. What this function does is generate a message and feed that message as input to the server's local raft state machine. In this example, let's assume it's a follower serving the client request. What's in this message? You can see the type is the propose type, MsgProp, and it includes entries — this is the data, which corresponds to the client request. Let's call it message A. It goes into the follower's raft state machine.

If you go to the source code, the message is handed to the Step function and, because this node is a follower, it eventually reaches the stepFollower function. This function does different things depending on its input message — it executes different code paths. In this example, because our input message is of the propose type, it takes the code path that marks the destination of the message as the leader of the Raft cluster, and then sends the message.

What does "send" mean? Remember, the raft package does not actually implement network transport between the Raft nodes. If you look at the send function, it basically marks this message as coming from myself, and then simply appends the message — this is the message m — to the node's local message queue. That message queue feeds into the raft state machine's output, the Ready struct. Let's quickly go back to make sure everybody remembers: this is the output of the raft state machine. We just saw its input; now its output is going to include this message A — see, message A was appended to the message queue, so it gets included in the output. So what's the output of this follower's raft state machine? It's a Ready struct — the Ready has other fields, but for simplicity we're just tracking the messages — and it contains basically the same message, but with From marked as the follower itself and To set to the leader. We'll call it message B.
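Continuing the pseudocode sketch from above, the follower's side of the propose path looks roughly like this (condensed; the real functions do much more):

```go
// Propose wraps the client's serialized command in a MsgProp
// ("message A") and feeds it into the local raft state machine.
func (r *raft) Propose(data []byte) error {
	return r.Step(Message{Type: MsgProp, Entries: []Entry{{Data: data}}})
}

// Step runs the logic common to all roles, then dispatches on the role.
func (r *raft) Step(m Message) error {
	// ...common term/vote handling elided...
	r.step(r, m) // stepLeader, stepCandidate, or stepFollower
	return nil
}

// A follower cannot append to the cluster's log itself, so it
// redirects proposals to the leader.
func stepFollower(r *raft, m Message) {
	switch m.Type {
	case MsgProp:
		m.To = r.lead // address the message to the current leader
		r.send(m)
	}
}

// send never touches the network: it stamps the sender and queues the
// message, which surfaces in the next Ready ("message B").
func (r *raft) send(m Message) {
	m.From = r.id
	r.msgs = append(r.msgs, m)
}
```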
Now, whoever is using this raft package needs to process these messages from the output of the raft state machine. In this case, the messages must be sent to their destinations. So let's look at the leader. This is the same message B: remember, you need to implement the network transport, and that transport implementation needs to carry message B from the follower's output to the leader's input.

Now it's at the leader. Let's look at the stepLeader function. Again, this is message B, and it's of the propose type. So what does the leader do? It first appends all the entries from this message to its own Raft log. Then it tries to replicate its Raft log to the other Raft nodes in the cluster — there's a code sketch of this after this example. Say we have two other nodes in the cluster: for each of them, the leader builds a new message, of a different type now — an append message, MsgApp — because now we're replicating the leader's own log to the other nodes. To is that peer, and the entries include the leader's own entries. There's also the commit field — actually, we already talked about this: because the leader monitors each node's progress, when a majority of nodes have a certain entry, it is marked as committed. So this field is the leader telling everybody the progress of the entire cluster — which Raft log entries are actually committed. After building this message, the leader sends it with the same send function we just saw, so it's appended to the leader's local message queue and eventually shows up in the leader's Ready output.

I'm going to skip some of the steps here. What I'm trying to show is how to read the source code: given an input, and depending on the current state — whether you're a leader or a follower — there are different state transitions, which in turn generate more messages and maybe more entries, and those show up in the raft state machine's output, the struct we called Ready.

So let's fast-forward to the last step of this example. We've seen the client's request — message A — forwarded to the leader, appended to the leader's local Raft log, and the leader trying to replicate that log entry to the other nodes in the cluster. Eventually that succeeds: most of the nodes have this new Raft entry, the leader learns that and marks the entry as committed, and eventually it sends out a message with the updated committed index. When that message is received by the original follower, the follower knows the new Raft entry is committed, so it's safe to apply. And because the entry is now committed, it shows up in the output of the follower's raft state machine, in the committed entries field. Now it's ready to be applied.
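Continuing the pseudocode sketch, the leader's side of the propose path, condensed:

```go
func stepLeader(r *raft, m Message) {
	switch m.Type {
	case MsgProp:
		r.appendEntry(m.Entries...) // append to the leader's own log
		r.bcastAppend()             // try to replicate to every peer
	}
}

// appendEntry stamps new entries with the current term and the next
// indexes, then appends them to the leader's log.
func (r *raft) appendEntry(es ...Entry) {
	for i := range es {
		es[i].Term = r.Term
		es[i].Index = uint64(len(r.log)) + uint64(i) + 1
	}
	r.log = append(r.log, es...)
}

// bcastAppend sends each peer a MsgApp with log entries plus the
// committed index, so followers learn the cluster's progress.
// (Simplified: the real leader tracks per-peer progress and sends
// only the entries each peer is missing.)
func (r *raft) bcastAppend() {
	for _, peer := range r.peers {
		r.send(Message{
			Type:    MsgApp,
			To:      peer,
			Entries: r.log,
			Commit:  r.committed,
		})
	}
}
```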
All right, so this is the graph we saw earlier. Now we can talk about it in more detail, because we've seen the details. First, we don't care about the client side for now. The first thing we know is that the raft package does not implement the persistent storage layer, and it does not implement the network transport between the Raft nodes. So if you are using the raft package, you need to implement these yourself.

And if you're using this raft package, you're probably going to end up with something like this: a handling loop (there's a sketch of it at the end of this section). You need to consume and process all the output from your local raft state machine. The loop gives you the Ready struct we just saw. The first thing you should do, with your own implementation of the storage layer, is save the state and save the entries. The next step, with your own implementation of the network transport, is to send those messages — remember, you are responsible for actually delivering them. And then all the committed entries are safe to be applied.

So what's the life cycle of a client request? A client request is first routed to raft, calling the Propose function, which generates a Raft message. Then, within the Raft consensus protocol, the local raft state machine generates, based on that message, a whole bunch of new Raft messages that are sent to the other nodes; the other nodes in return might generate other messages, so it goes back and forth. But eventually, once that Raft log entry is committed — which means most nodes have that entry — it shows up in the committed entries field of your output. Now the server applies the committed log entries to the key-value store, and the applied result is sent back to the client.

All right, quickly, some of the ongoing efforts I'm aware of. The first one is actually completed: the raft package itself has supported learners — non-voting members — for a while now, and recently the etcd server also added support for learners. If you're interested, here's the link to get started. Another ongoing effort is joint consensus for Raft membership reconfiguration, which is different from the current membership reconfiguration implementation in raft. If you're interested, this is the starting point for you.
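Putting it together, the handling loop looks roughly like the consumer loop in the raft package's README; saveToStorage, send, and process stand for the application's own storage, transport, and apply implementations:

```go
// n is a raft.Node; ticker drives raft's logical clock.
for {
	select {
	case <-ticker.C:
		n.Tick() // drives election timeouts and heartbeats
	case rd := <-n.Ready():
		// 1. Persist state and new log entries before anything else.
		saveToStorage(rd.HardState, rd.Entries, rd.Snapshot)
		// 2. Deliver outgoing messages to their destinations.
		send(rd.Messages)
		// 3. Apply entries the cluster has committed.
		for _, entry := range rd.CommittedEntries {
			process(entry)
		}
		// 4. Tell raft this batch of output has been handled.
		n.Advance()
	case <-done:
		return
	}
}
```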
All right, do we have time for Q&A? Oh, we have plenty of time. Okay, good. Any questions? Oh, that's a lot.

Q: [Question inaudible; about how reads are handled.]
A: When a server receives a read request, it needs to learn from the leader which entry the whole cluster agrees should be applied up to, and wait until everything up to that point has been applied before serving the read. But that process can be bypassed now — a read doesn't need a disk write. Before 3.1, reads had to go through this full flow, and you could see an entry for the read in the Raft log; that should be gone in 3.1. Okay, thank you. Any other questions?

Q: I have two questions. First, if all my log updates have to pass through the leader, will the leader become a bottleneck? Second, if my client is connected to a follower, is the response returned by the follower or by the leader?
A: For the first one: before 3.1, all requests, including reads, had to be recorded in the log, so all requests were serialized through the leader — that's where a bottleneck could be. After 3.1, for reads you don't need to go through the leader's log; you can read on a follower, but the follower needs to ask the leader first what the cluster's current progress is, so you may need to wait a little. For the second question: a client can connect to any server, whether it's a leader or a follower, and each server has its own key-value store. The follower forwards the proposal to the leader, the leader appends the entry to its own log, and then the log is replicated to the other followers, so eventually the followers have the same log. When most nodes have the same log, the leader says it's safe now and marks it as committed, and every node that has that log entry, including the follower, applies it. After it's applied, the result is returned to the client by the server the client was talking to.

Q: Hello, I'd like to ask a question about the log. When persisting Raft log entries, one option is to fsync each entry as it's written. But if the entries are small, having to call fsync every time could be a problem. How does etcd handle this?
A: I'm not necessarily right, but this is actually outside of raft. Inside etcd there's a package — the name is a little confusing — called wal, and under that folder the write-ahead log is implemented. My personal understanding is that not every single log entry results in its own disk fsync; it should be batched. I don't know the details of this part well — it's a good discussion, and you can ask on the etcd repo. What I can speak to is how to use the raft package: how to consume the output. The output gives you entries, and you need to store the Raft log on disk; how you store it is up to you. If you look at how etcd consumes the output, it's done in steps: first you store the state and the Raft log entries, then you send the messages to the other Raft nodes, then you apply the committed entries. I think that's the right order — you can't do the later steps before you've done the first one, persisting. Beyond that I can't really answer; let's discuss it on the repo. Okay, thank you.

Q: Hello, I'd like to ask about two kinds of data — objects and Events.
A: You mean the Kubernetes data stored in etcd?
Q: Yes. As time goes by, there's a lot of Event data. Will the Event data be cleaned up to release space?
A: I'm not that familiar with this. My understanding is that Events have a TTL, right? The TTL can be configured — I don't remember what the default is — and after the TTL they are cleaned up, so they won't keep accumulating.

Q: Hello — the sound is a little low. A question about etcd operations: we have a three-member etcd cluster and we want to reduce it to a single member.
Do you have any tools or methods to do that?
A: We can talk about this in private — it's not that big of a deal, okay? Let's discuss it offline. Is our time almost up? Okay, thank you all. Thank you very much.