Hi there, thanks for being here today. My name is Li Qigen, and I'm very glad to share with you how we reduce write latency in TiKV. Let me introduce myself a little before we dig into the topic. I'm an infrastructure engineer at PingCAP and a committer of the TiKV and raft-rs projects. I focus on making TiKV more efficient, scalable, and reliable, and I'm passionate about distributed systems and storage systems.

First, let me introduce the TiKV project we're working on. TiKV is an open-source, distributed, transactional key-value database, and a CNCF graduated project. You might not be familiar with it, but so far it has more than 9,800 stars and more than 320 contributors on GitHub. Furthermore, it is used by more than 1,500 adopters in production across multiple industries worldwide. So you can see we have a good open-source community and a healthy ecosystem.

Back to today's topic. To figure out how to reduce write latency in TiKV, I'd like to begin with a question: why do we need low average latency? According to Little's Law, concurrency is equal to throughput times latency; equivalently, throughput is equal to concurrency divided by latency. Please note that the latency here is the average latency. So if the average latency is lower, the throughput will be higher at the same concurrency.

The next question I'd like you to think about is why we need low tail latency. Tail latency is the tail end of a system's response-time spectrum, often expressed as the 99th percentile response time. TiKV usually serves customer-facing applications. Customers only see that the application is slow; they don't care whether they happened to hit a low-probability event. Furthermore, slow events may not even be low-probability, because tail latency can be amplified for requests with high fan-out. For instance, in the transaction model of TiKV, the prewrite phase needs to wait for all parallel prewrites to finish, so the latency of the prewrite phase is the maximum latency of all prewrites. If the prewrite phase involves n prewrites, the probability that its latency is longer than the x-th percentile latency of prewrite requests is 1 - (x%)^n. Here x% is the probability that a single request's latency is shorter than the x-th percentile latency, and each request is an independent event, so (x%)^n is the probability that all n request latencies are shorter than the x-th percentile latency; the probability of the complementary event is what we want. For example, with x = 90 and n = 10, the probability is about 65%. Pretty high, right? In summary, the larger n is, the closer the latency of the prewrite phase gets to the tail latency of individual prewrite requests.
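To make that arithmetic concrete, here is a tiny self-contained Rust sketch of the amplification formula. It only illustrates the probability math from the slide; it is not TiKV code, and the function name and sample numbers are just for the example.

```rust
/// Probability that at least one of `n` independent requests is slower
/// than the x-th percentile latency, i.e. 1 - (x%)^n.
fn tail_amplification(x_percentile: f64, n: u32) -> f64 {
    1.0 - (x_percentile / 100.0).powi(n as i32)
}

fn main() {
    // The example from the talk: x = 90, n = 10 parallel prewrites.
    println!("{:.0}%", tail_amplification(90.0, 10) * 100.0); // ~65%
    // With a larger fan-out, the prewrite phase almost always exceeds
    // the single-request 90th percentile latency.
    println!("{:.0}%", tail_amplification(90.0, 50) * 100.0); // ~99%
}
```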
Next, before we discuss the optimization, let's walk through the write flow of TiKV. First, a request is sent to the scheduler workers, which are responsible for transaction constraint checking and for transforming requests into key-value pairs for Raftstore. Raftstore is the consensus layer of TiKV; it uses the Raft consensus algorithm to make TiKV fault-tolerant. Raftstore has two thread pools: the store threads and the apply threads. The store threads are responsible for handling Raft messages and new proposals. When receiving a new proposal, the store threads write it to RaftDB and send messages to the other peers. After the proposal is committed, it is sent to the apply threads, which write it to KVDB. Then the apply threads invoke a callback to notify the outside that the request has been written successfully. Each write in Raftstore includes the time for the RaftDB write, the KVDB write, the network round trip, and so on. So the key point is to reduce the write latency in the Raftstore layer.

To this end, we focus on the store threads first. The store threads handle the work of multiple Raft groups, and they use raft-rs as the consensus algorithm module. raft-rs is written in Rust and was originally ported from etcd's Raft implementation. It is essentially a state machine: side effects such as sending messages and saving entries to disk are done by external modules, such as the store threads in TiKV.

Now let's look at the flowchart. At the start of each round, the store threads call step to handle incoming messages. For example, on the leader, the store threads can propose commands and handle the append responses from the followers; on a follower, the store threads handle the append requests from the leader. Then the store threads get a Ready to handle. Ready is a data structure that includes the entries to be saved to disk, the committed entries to be applied, and the messages to be sent. The leader sends its messages before saving entries to disk, while a follower does the opposite. Finally, the store threads call advance_append to end the round. In each round, the store threads handle every Raft group's messages and Ready one by one. The entries from these Readys are batched and written to RaftDB together at the end of the round in order to optimize write performance.

In order to find the problem, let's calculate the store duration, which is the time between the propose and the commit of a proposal. Let's make some assumptions for simplicity. First, we only consider a single Raft group. Second, the disk write duration is constant. Third, the request arrival times are uniformly distributed. Finally, the system has reached a steady state. In practice, a real system is far more complicated; for example, there are always bursts of requests for many reasons, and the disk write latency is not always the same. So this is a rough model based on these assumptions, but it is enough to analyze the problem we'll discuss next.

This is a sequence diagram of the store duration under normal conditions; it is equal to the sum of all the segments on the red line. The store loop, the processing time of one round of the store threads, is equal to the message duration plus the IO duration. The propose wait, the MsgAppend wait, and the MsgAppendResponse wait are all time that a message spends waiting for the store threads to process it. According to the aforementioned assumptions, since messages arrive uniformly and are processed in the next round, the average waiting time is about half of the store loop, that is, 0.5 message duration plus 0.5 IO duration. The transmission time of a MsgAppend plus a MsgAppendResponse is one network round trip, RTT for short. After the network RTT and the message waits are substituted into the equation, the store duration is equal to the sum of 4.5 message durations, 2.5 IO durations, and one network RTT.

Let's take a good look at this equation. Is there something unexpected? Since the IO requests of the leader and the followers are parallel, the 2.5 IO durations are unexpected. Where does the extra time come from? As mentioned before, each message wait is equal to 0.5 message duration plus 0.5 IO duration. When a proposal is proposed, the MsgAppend has to wait an extra 0.5 IO duration before being sent to the followers. Similarly, when a MsgAppendResponse is received, the proposal has to wait an extra 0.5 IO duration before being committed. So the root cause is that both the propose wait and the MsgAppendResponse wait contain an unnecessary 0.5 IO duration. But why isn't the MsgAppend wait included here? It seems very similar to the other two. That's because followers must wait for the IO to be completed before sending a MsgAppendResponse to the leader, so there's no difference whether the MsgAppend is handled earlier or not.
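Here is the arithmetic above written out compactly, with m for the message duration and d for the IO duration. The exact split into three waits, three processing steps, and one follower IO is my own reading of the sequence diagram, but the totals match the numbers from the talk.

```latex
% Synchronous store loop: each round costs m + d, so on average a
% message waits half a loop before being processed.
\begin{align*}
\text{store loop} &= m + d \\
\text{each wait} &= \tfrac{1}{2}(m + d) \\
\text{store duration}
  &= \underbrace{3 \cdot \tfrac{1}{2}(m + d)}_{\text{three waits}}
   + \underbrace{3m}_{\text{three processing steps}}
   + \underbrace{d}_{\text{follower IO}}
   + \mathrm{RTT} \\
  &= 4.5\,m + 2.5\,d + \mathrm{RTT}
\end{align*}
```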
Now that we've found the problem, how do we optimize it? Obviously, we need to move the IO out of the store loop. Given that raft-rs has to wait for all the entries in a Ready to be written before proceeding to the next round, we need to support an asynchronous Ready to break this limitation. In addition, the store threads need to use asynchronous IO to cooperate with the asynchronous Ready. Note that although we call it asynchronous IO, for now we do not use Linux asynchronous IO technology such as AIO or io_uring; in the current implementation we simply move the IO to dedicated IO threads. Here are the PR for raft-rs and the tracking issue for TiKV.

Now let's look at the flowchart. Compared with the old flow, the asynchronous Ready allows the save-to-disk operation to be asynchronous. After the write is completed, the store threads call on_persist_ready to notify the Raft state machine. For Raftstore, besides using asynchronous IO to cooperate with the asynchronous Ready, some other IO-related paths need to be adapted, such as the snapshot process and the peer-destroy process. Given the limited time, I won't go into more detail here.

Next, let's calculate the new store duration of the asynchronous version. In this sequence diagram, the leader IO and follower IO lanes are added to indicate the IO threads. The IO wait is the waiting time of an IO request; based on the assumptions mentioned before, it is equal to half of the IO duration. Since the store threads only process messages and no longer wait for IO, the store loop duration becomes just the message duration, and therefore each message wait is equal to half of the message duration. After the message wait and the IO wait are substituted into the previous equations, the store duration is equal to the sum of 4.5 message durations, 1.5 IO durations, and one network RTT.

Comparing the two store durations, we can see that the asynchronous version's store duration is one IO duration less than the synchronous version's. Seems perfect, right? However, in reality the message duration is not the same in the synchronous and asynchronous versions. There is an optimization called command batch in the store threads, which batches as many proposals as possible into one. For the same total number of requests, the smaller the number of proposals, the smaller the per-proposal overhead in the store threads and the gRPC threads; the overhead comes from some internal implementation issues triggered by having many proposals and Raft messages. We are trying to optimize them to mitigate the impact, but in my opinion it cannot be totally eliminated. Back to the equations: command batch works very well in the synchronous version, because the long IO duration in each loop lets many proposals accumulate into one batch, while this is not the case in the asynchronous version. So the message duration of the asynchronous version is longer than that of the synchronous version, which reduces the benefit of this optimization.
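Before looking at the numbers, here is a minimal, self-contained sketch of the thread structure described above. It is a toy model using standard-library channels, not the real raft-rs or TiKV code: the Ready type here is a simplified stand-in, and only the shape of the on_persist_ready notification follows the asynchronous Ready API mentioned earlier.

```rust
use std::sync::mpsc;
use std::thread;

// Toy stand-in for a raft-rs Ready: a number plus the entries that must
// be persisted before they count toward commitment.
struct Ready {
    number: u64,
    entries: Vec<String>,
}

fn main() {
    let (io_tx, io_rx) = mpsc::channel::<Ready>();
    let (done_tx, done_rx) = mpsc::channel::<u64>();

    // Dedicated IO thread: persists entries and reports the ready number
    // back, so the store thread never blocks on disk writes.
    let io = thread::spawn(move || {
        for ready in io_rx {
            // Simulate fsync-ing `ready.entries` to the Raft log here.
            println!("persisted {} entries", ready.entries.len());
            done_tx.send(ready.number).unwrap();
        }
    });

    // Store loop: hand entries to the IO thread and immediately move on
    // to the next round instead of waiting for the write to finish.
    for number in 1..=3 {
        let ready = Ready { number, entries: vec![format!("entry-{number}")] };
        io_tx.send(ready).unwrap(); // save-to-disk is now asynchronous
        // ... keep processing messages for other Raft groups here ...
    }
    drop(io_tx);

    for number in done_rx {
        // Notify the raft state machine that this Ready is durable
        // (raft-rs exposes this as on_persist_ready).
        println!("on_persist_ready({number})");
    }
    io.join().unwrap();
}
```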
Here is a benchmark result of a sysbench insert workload. We use three TiKV nodes, three TiDB nodes, and one PD node, and the sysbench client runs 800 threads. Look at this figure: asynchronous IO's QPS is more than 30% higher than master's. The next three figures are all about the store duration. In the first figure, asynchronous IO's average store duration is about half of master's, and it also has less jitter than master's. In the second figure, asynchronous IO's 99th percentile store duration is about one third of master's. In the third figure, asynchronous IO's 99.9th percentile store duration is about half of master's.

Next, let's talk about future work. The first item is to apply unpersisted entries. As mentioned before, the store duration is the time between the propose and the commit of a proposal. Actually, that's not precise: the exact definition is the time from when a proposal is proposed to when it is both committed and persisted. As shown in the figure, the leader's IO hits a tail latency while the two followers respond to the leader with MsgAppendResponse earlier. At this point the proposal has been committed, but it is not yet persisted on the leader, so it cannot be applied; it has to wait for the IO to complete before being applied. By the way, this situation cannot happen in the synchronous version, because on the leader the MsgAppendResponse for this proposal can only be handled after the IO has completed. We found that this situation accounts for about one sixth of requests in our stress tests, which is much higher than we expected, so it is also one of the most important causes of tail latency in the store duration. In fact, as long as a majority of peers survive, committed entries will not be lost, so applying entries that are committed but not yet persisted locally should not break correctness, and it can significantly reduce the tail latency of the store duration; I'll include a small sketch of the quorum rule behind this at the end. As shown in the figure, the store duration becomes much shorter than before. However, in the extreme case where a majority of peers are lost, some special processing is required for recovery, because the local apply progress may be ahead of the local maximum persisted log entry. Here is the sysbench insert benchmark of the demo. You can see that the 99th percentile tail latency is about 40% lower than the asynchronous IO version, and it also has less jitter. This shows that the optimization works quite well.

The next future work is parallel apply. At present, the apply process is single-threaded for each Raft group. If the apply process could use multiple threads in parallel, the average latency and the tail latency of the apply duration could be reduced, especially when a Region is very hot.

That's basically everything I planned to cover today. If you would like to explore more or dig deeper into the TiKV project, please check out the resources listed here. You're welcome to contribute to the TiKV project, follow us on Twitter and YouTube, and contact us via our Slack channel. Hopefully you got something useful today. Thank you.
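As promised, here is the small sketch of the quorum rule behind the apply-unpersisted-entries idea. This is my own illustration, not TiKV code: it only shows why the leader's own persisted index does not have to gate applying, as long as a majority of peers has durably acknowledged the entry.

```rust
/// The committed index is the largest log index that a majority of
/// peers have durably acknowledged.
fn committed_index(mut acked: Vec<u64>) -> u64 {
    acked.sort_unstable_by(|a, b| b.cmp(a)); // sort descending
    acked[acked.len() / 2] // the (majority)-th largest acked index
}

fn main() {
    // Three peers: the leader's disk is slow (persisted only up to 5),
    // but both followers have already persisted up to 8.
    let leader_persisted = 5;
    let committed = committed_index(vec![leader_persisted, 8, 8]);
    // Index 8 is acknowledged by a quorum of followers, so even if the
    // leader crashes before its own IO finishes, the entry survives;
    // the leader may therefore apply it before persisting it locally.
    assert_eq!(committed, 8);
    println!("safe to apply up to {committed}, leader persisted {leader_persisted}");
}
```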