Hi everyone. My name is Zhao Jiawei, and I'm a Rust distributed storage engineer from DatenLord. The topic of today's presentation is Rust in Distributed Key-Value Stores. We will use the distributed KV engine Xline as an example to explore how to develop a distributed KV storage engine in Rust. The content of today's presentation is divided into four sections: the Xline overview, the CURP consensus protocol, persistent storage, and deterministic simulation. All right, without further ado, let's dive into today's topic.

First, let's get to know what Xline is. Xline is an open-source distributed KV store developed by DatenLord, designed to achieve high-performance, strong consistency across data centers. Xline is fully compatible with the etcd interface and provides metadata management across data centers. Written in Rust, Xline adopts the CURP algorithm as its backend consensus protocol, and it is currently a CNCF sandbox project. The following two links are the GitHub repository and the official website of Xline, respectively.

Xline can be roughly divided into three layers from top to bottom: the access layer, the intermediate layer, and the persistent layer. The access layer primarily uses the tonic gRPC framework to receive requests from clients and to initiate proposal requests to the consensus module. The intermediate layer can be further divided into the consensus module on the left and the business module on the right; the consensus module adopts the CURP protocol as its consensus algorithm. The bottom-most persistent layer is responsible for persisting data and metadata in Xline. It can be further divided into two sub-layers, the storage API layer and the storage engine layer, and Xline currently uses RocksDB as its underlying storage.

So what is the CURP protocol? Before answering this question, let's first look at some traditional consensus protocols, such as Multi-Paxos and Raft. Here is an overview of the simplified Multi-Paxos process. A client initiates a proposal request to the leader. Upon receiving the request, the leader executes an accept operation and broadcasts it to the other followers in the cluster. When the leader has received acceptance responses from a majority of the nodes, it considers the proposal accepted and responds to the client with the chosen proposal. If the leader doesn't receive enough accept responses, it responds to the client with a proposal failure.

And here is an outline of the simplified Raft process. Initially, a client sends a proposal request to the leader of the cluster. Upon receiving this request, the leader appends the proposal to its own log, and then broadcasts an AppendEntries request to all followers in the cluster. When successful AppendEntries responses have been received from more than half of the nodes, the leader responds to the client with a successful proposal response. If the required number of successful responses is not received, the leader responds to the client with a proposal failure.

So why do Raft and Multi-Paxos need two RTTs to reach consensus? Whether it's Multi-Paxos or Raft, reaching consensus inevitably requires two RTTs, because both are based on a core assumption: a command may only be approved, or a log entry committed, after it has been durably stored and ordered.
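To make the two-RTT pattern concrete, here is a minimal, self-contained Rust sketch of the majority-replication round that Multi-Paxos and Raft share. All of the types and names are illustrative placeholders, not Xline's API or any real consensus library:

```rust
#[derive(Clone)]
struct Proposal {
    command: String,
}

struct Follower;

impl Follower {
    /// Persist the entry and acknowledge. In this toy model it always
    /// succeeds; in reality it may fail or time out.
    fn append_entries(&mut self, _entry: &Proposal) -> bool {
        true
    }
}

struct Leader {
    log: Vec<Proposal>,
    followers: Vec<Follower>,
}

impl Leader {
    /// One client-visible consensus round. The client -> leader request plus
    /// the final reply form one RTT; the broadcast round to the followers is
    /// the second.
    fn propose(&mut self, p: Proposal) -> Result<(), &'static str> {
        // The leader orders the command by appending it to its own log.
        self.log.push(p.clone());
        // Broadcast to all followers and count acknowledgements;
        // the leader counts as one vote.
        let mut acks = 1;
        for f in &mut self.followers {
            if f.append_entries(&p) {
                acks += 1;
            }
        }
        // Committed once a majority of the whole cluster has stored it.
        if acks > (self.followers.len() + 1) / 2 {
            Ok(()) // reply success to the client
        } else {
            Err("proposal failed: no majority")
        }
    }
}

fn main() {
    let mut leader = Leader {
        log: Vec::new(),
        followers: vec![Follower, Follower],
    };
    assert!(leader.propose(Proposal { command: "PUT z 7".into() }).is_ok());
    println!("committed after a majority acknowledged");
}
```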
As a result, the state machine can directly execute approved commands and apply committed log entries. Due to the inherent asynchrony of the network, ensuring order is challenging; therefore, a leader is required to enforce the execution order of different commands, while durability is achieved by replicating to a majority through broadcasting. This process cannot be completed within a single RTT.

So we need to consider a question: is the requirement of two RTTs necessary to achieve consensus? Or, in other words, is fulfilling both conditions, being durably stored and being ordered, a necessary criterion for consensus? This requires case-by-case analysis: in certain scenarios, relaxing the ordering requirement does not affect the final result of the state machine, while in other scenarios it does.

Let's examine two specific examples. Suppose a user issues PUT y=5 followed by PUT z=7 to the cluster. After the state machine applies these commands, retrieving the values of y and z from the state machine gives y = 5 and z = 7. Now let's reverse the execution order of these two commands. We observe that after the state machine applies both commands, the retrieved values for y and z are still y = 5 and z = 7. This illustrates our earlier point: in certain scenarios, relaxing the requirement for ordering does not impact the cluster's ability to reach a consistent final state.

Now let's consider an example where ordering does matter. Suppose a user issues PUT z=7 followed by PUT z=5 to the cluster. After applying these commands, retrieving the value of z from the state machine gives z = 5. However, if we swap the execution order of these commands, upon retrieval we will find that z = 7 in the state machine. This outcome contradicts our initial expectations. It's not difficult to imagine that in such cases, relaxing the ordering constraint could lead to different nodes in the cluster holding different values, thereby compromising final consistency.

The CURP protocol originates from a paper titled "Exploiting Commutativity For Practical Fast Replication", presented at NSDI 2019. The authors of this paper are Seo Jin Park, a Ph.D. from Stanford, and Professor John Ousterhout, who is also a co-author of the Raft consensus algorithm. In comparison to traditional consensus protocols, CURP's major innovation lies in dividing consensus into two paths: the fast path and the slow path. On the fast path, consensus can be achieved within a single RTT, which suits scenarios where ordering doesn't matter. The slow path, on the other hand, requires two RTTs, corresponding to situations where ordering does matter.

The CURP protocol introduces the concept of a witness. When a client initiates a proposal, it sends the request not only to the leader node but also to all witness nodes; the records within a witness are unordered. On the fast path, upon receiving the request, the leader immediately writes the data to its local storage and responds with an OK, without waiting for the data to be synchronized to the backups. Once the client receives an OK response from the leader and from more than three quarters of the witnesses, it confirms that the operation is persistent. On the slow path, due to a witness rejection, the client needs to wait for the leader to synchronize the data to the other follower backups before consensus is achieved.
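Both paths hinge on a single decision: do two commands conflict, that is, does their relative order matter? Here is a minimal Rust sketch of such a conflict check, assuming, as in the examples above, that two PUTs conflict exactly when they write the same key. The names are illustrative, not Xline's actual API:

```rust
/// A toy write command: `PUT key value`.
#[derive(Debug)]
struct PutCmd {
    key: String,
    value: i64,
}

/// Two PUTs commute (can be reordered safely) iff they touch different keys.
fn conflicts(a: &PutCmd, b: &PutCmd) -> bool {
    a.key == b.key
}

fn main() {
    let put_y5 = PutCmd { key: "y".into(), value: 5 };
    let put_z7 = PutCmd { key: "z".into(), value: 7 };
    let put_z5 = PutCmd { key: "z".into(), value: 5 };

    // Different keys: order doesn't matter, so the fast path (1 RTT) applies.
    assert!(!conflicts(&put_y5, &put_z7));
    // Same key: order matters, the witness rejects the record, and CURP
    // falls back to the slow path (2 RTTs).
    assert!(conflicts(&put_z7, &put_z5));
    println!("{:?} vs {:?}: conflict detected", put_z7, put_z5);
}
```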
We can illustrate how CURP achieves consensus with a simple example. Consider two commands, PUT z=5 and PUT z=7. When a user issues the first command, since PUT z=5 doesn't conflict with any record in the witnesses, it takes the fast path, requiring just one RTT. For the second command, since PUT z=7 conflicts with PUT z=5, the witnesses reject the client's request. Consequently, the leader executes after-sync to synchronize the data to the follower backups within the cluster. Once a majority of the followers complete the persistence, the leader responds successfully to the client, indicating that consensus has been reached.

For a consensus algorithm used in production environments, relying solely on testing to ensure correctness is far from sufficient. We also need to establish the algorithm's correctness theoretically, through a process known as formal verification. So what exactly is formal verification? Formal verification is the act of proving or disproving the correctness of the intended algorithm underlying a system, with respect to a certain formal specification or property, using the formal methods of mathematics. Xline not only implements the CURP algorithm but also provides a TLA+ proof for it; for the specific details, please refer to the following link.

As mentioned earlier, the intermediate layer can be divided into the consensus module and the business module. So how do these two modules interact? On a single node, the business server communicates with the CURP server by initiating RPC requests through its own CURP client. Once the cluster achieves consensus, the CURP server invokes the callback of the business module. This callback is defined in the CommandExecutor trait and is implemented by the business module.

Here are the RPC interface definitions relevant to the CURP module. Among these, Propose is primarily intended for external business servers to invoke: business servers use it to initiate proposals to the CURP server. The remaining RPCs are mainly designed for communication between the CURP nodes within a cluster.

Here are CURP's external interface definitions. The CURP module provides three traits for inversion of control. The Command trait represents a specific command; a valid command should specify the following two aspects: the criteria for determining conflicts, and which phases are involved together with the result produced by each phase. The ConflictCheck trait describes how conflicts are determined between different commands. The CommandExecutor trait is the execution entity of a command, describing the specific execution process of the command's different phases. The CURP server doesn't concern itself with how commands are executed; it is only interested in the relationships between them. The command executor, conversely, describes how commands are executed without concerning itself with the relationships between commands. Each trait is thus separated, focusing on its respective aspect.
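As a rough picture of how these responsibilities are split, here is a simplified sketch of the three traits. It is only a paraphrase: the real definitions in Xline are asynchronous and carry more associated types and methods, so treat the names and signatures below as illustrative:

```rust
/// Decides whether two commands conflict. CURP only needs this relation;
/// it never looks inside a command.
trait ConflictCheck {
    fn is_conflict(&self, other: &Self) -> bool;
}

/// A specific command: its conflict criteria (via the supertrait) and the
/// results produced by each of its phases.
trait Command: ConflictCheck {
    /// Result of the speculative execute phase (fast path).
    type ExecuteResult;
    /// Result of the after-sync phase (slow path).
    type AfterSyncResult;
}

/// The execution entity of a command: knows *how* to run each phase,
/// but nothing about ordering or conflicts between commands.
trait CommandExecutor<C: Command> {
    type Error;
    /// The execute phase, run as soon as the command is accepted.
    fn execute(&self, cmd: &C) -> Result<C::ExecuteResult, Self::Error>;
    /// The after-sync phase, run once the command's log index is stable.
    fn after_sync(&self, cmd: &C, log_index: u64)
        -> Result<C::AfterSyncResult, Self::Error>;
}
```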
The entire CURP server can be structurally divided into two parts: the front end, CurpNode, and the back end, RawCurp. The CurpNode responds to RPC calls from CURP clients and forwards the corresponding requests to the back-end RawCurp for execution, and RawCurp in turn performs different operations based on the type of the request. The conflict-checked MPMC channel and the command workers are two core components of the CURP server. So what are their responsibilities?

The conflict-checked MPMC channel is an MPMC channel which guarantees that no two conflicting messages will be received by multiple receivers at the same time. The command workers receive commands from the channel and execute them. Let's consider an example. Suppose there are three commands A, B, and C, where commands A and B conflict with each other while command C conflicts with neither A nor B. The working process of the conflict-checked MPMC channel can then be simplified as shown on the slide. Of course, the working principle of the conflict-checked MPMC channel is not as straightforward as described; we encounter the following two challenges. The first question is how to model the conflict relationships. We can use a DAG to model them: commands act as vertices, and conflict relationships act as edges. The second question is how to find all non-conflicting commands. We can transform this question into calculating the topological order of the connected components of the DAG. Once these two challenges are addressed, we can replace the buffer in the conflict-checked MPMC channel with a DAG, and the working process ultimately looks as shown on the slide. (A toy code sketch of this DAG-based channel follows after the consensus walkthrough below.)

Now let's describe how the CURP module achieves consensus. When a client initiates a propose request, the request first reaches the CurpNode, and the CurpNode invokes handle_propose to process the command. Since PUT z=5 doesn't conflict with any other command, CURP inserts this command into the conflict-checked MPMC channel. Subsequently, a command worker retrieves the command from the channel, executes it, stores the execution result in the command board, and notifies the CurpNode to send an OK response to the client.

After the client sends the next request to the CurpNode, the CurpNode again forwards the request to RawCurp. However, since PUT z=7 conflicts with the preceding PUT z=5, RawCurp submits the command to the sync task and returns a key-conflict error to the CurpNode. The sync task wraps this command into a log entry and sends AppendEntries requests to the other followers within the cluster. Once a majority of the nodes have stored this log entry, the CURP server applies it. During the log application process, the command is sent to the channel, executed by the command workers, and run through the after-sync stage. Only after these stages are complete does the CurpNode receive the notification to send an OK response to the client.
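Here is a toy, single-threaded Rust sketch of the DAG idea behind the conflict-checked channel: commands become vertices, conflicts with still-unfinished commands become edges, and only vertices with no unfinished predecessors are handed to workers. The conflict relation is simplified to "writes the same key", and the real channel in Xline is of course concurrent; this sketch only shows the bookkeeping:

```rust
use std::collections::{HashMap, VecDeque};

struct ConflictDag {
    next_id: u64,
    pending: HashMap<u64, String>,   // unfinished command id -> key it writes
    blocked_by: HashMap<u64, usize>, // id -> count of unfinished predecessors
    blocks: HashMap<u64, Vec<u64>>,  // id -> successors waiting on it
    ready: VecDeque<u64>,            // topologically "free" commands
}

impl ConflictDag {
    fn new() -> Self {
        ConflictDag {
            next_id: 0,
            pending: HashMap::new(),
            blocked_by: HashMap::new(),
            blocks: HashMap::new(),
            ready: VecDeque::new(),
        }
    }

    /// Sender side: add a command as a vertex, with an edge from every
    /// unfinished command that conflicts with it (here: same key).
    fn send(&mut self, key: &str) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        let preds: Vec<u64> = self
            .pending
            .iter()
            .filter(|(_, k)| k.as_str() == key)
            .map(|(&p, _)| p)
            .collect();
        for &p in &preds {
            self.blocks.entry(p).or_default().push(id);
        }
        self.pending.insert(id, key.to_string());
        self.blocked_by.insert(id, preds.len());
        if preds.is_empty() {
            self.ready.push_back(id); // no conflicts: executable right away
        }
        id
    }

    /// Receiver side: a worker takes any command with no unfinished predecessors.
    fn recv(&mut self) -> Option<u64> {
        self.ready.pop_front()
    }

    /// Marking a command finished may unblock its successors.
    fn finish(&mut self, id: u64) {
        self.pending.remove(&id);
        self.blocked_by.remove(&id);
        for succ in self.blocks.remove(&id).unwrap_or_default() {
            let n = self.blocked_by.get_mut(&succ).expect("successor exists");
            *n -= 1;
            if *n == 0 {
                self.ready.push_back(succ); // last conflicting predecessor done
            }
        }
    }
}

fn main() {
    let mut dag = ConflictDag::new();
    let a = dag.send("z"); // A and B conflict on key "z"
    let b = dag.send("z");
    let c = dag.send("y"); // C conflicts with neither
    assert_eq!(dag.recv(), Some(a)); // A and C may run concurrently
    assert_eq!(dag.recv(), Some(c));
    assert_eq!(dag.recv(), None);    // B must wait for A
    dag.finish(a);
    assert_eq!(dag.recv(), Some(b)); // A done: B becomes ready
}
```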
Xline's storage layer can be divided into two sub-layers: the storage API layer and the storage engine layer. The storage API layer is responsible for encapsulating the low-level storage interfaces into business-relevant interfaces. The underlying storage engine layer corresponds to the specific storage engines and essentially provides a thin layer of encapsulation over those engines in the form of a key-value interface.

When selecting a storage engine for an industrial-grade metadata storage system, what factors should be considered? First and foremost, from a functional perspective, the chosen storage engine should provide transaction semantics, support basic key-value operations, and enable batch processing. Next, from a maintenance standpoint, consideration should be given to the backers of the engine, prioritizing large commercial companies or active open-source communities. Additionally, the engine should be widely adopted within the industry, so as to provide a wealth of experience for debugging and tuning at a later stage. Lastly, the engine should possess a high level of recognition and popularity, as measured by GitHub stars, in order to attract excellent contributors.

Moving on to performance: since storage engines often become bottlenecks in system performance, it's essential to select a high-performance storage engine. A high-performance storage engine naturally requires implementation in a high-performance language and should favor a synchronous implementation, prioritizing the Rust language, followed by C or C++. Finally, from a development perspective, prioritizing an implementation in Rust can reduce some additional development effort at this stage. The priority order of these requirements is as follows: functionality > maintenance ≥ performance > development.

Currently, the mainstream storage engines in the industry can be broadly categorized into B+-tree-based storage engines and LSM-tree-based storage engines, whose read and write amplification is compared on the slide. From that comparison we can draw a conclusion: B+-tree-based storage engines have lower read amplification, so they are suitable for scenarios with more reads and fewer writes, while LSM-tree-based storage engines have lower write amplification, so they are suitable for scenarios with more writes and fewer reads. However, due to the lack of a widely used, industry-proven B+-tree-based storage engine in the Rust ecosystem, Xline ultimately chose RocksDB, an LSM-tree-based storage engine. For further information on the design and trade-offs of the persistent storage layer in Xline, please refer to the blog post "The Design and Implementation of the Xline Persistent Storage Layer".

Okay, let's discuss some topics related to distributed systems testing. Here are some common pain points in testing distributed systems. As we all know, distributed systems in reality often operate in unreliable environments, and the unreliability can stem from sources such as the network (packet loss, out-of-order delivery, network partitions, timeouts, malicious attacks), software (bugs that may crash a node instance), and hardware (failures, such as power issues, that cause node instances to crash). The mission of a distributed system is to achieve fault tolerance in such complex, unreliable environments. To verify that a distributed system can tolerate errors within a reasonable range, the key practices are testing and chaos engineering. However, testing a distributed system often encounters the following pain points: a large number of test cases, long testing times, and Heisenbugs, those errors that occur at one moment and cannot be reproduced the next. All you can do is run long, repetitive tests to try to reproduce the issue.

The author of sled has published a blog post titled "sled simulation guide (jepsen-proof engineering)" on their website, discussing how sled conducts testing. The article points out that the success of Jepsen serves as a continuous reminder that we have been building distributed systems in a misguided manner. So what approach can truly be considered correct? The right approach has two steps. Step one is to write your code in a way that it can be deterministically tested on top of a simulator. Step two is to build a simulator that will exercise realistic message-passing behavior. Anyone who doesn't do this is building a very buggy distributed system, as Jepsen rapidly shows. A notable exception is FoundationDB; let's learn from their success and their simulator.

That simulator, for us, is madsim, a deterministic simulator. It's essentially a runtime whose key feature is deterministic simulation. Its fundamental unit is the node, representing a simulated entity that can be associated with a network IP address. Using the APIs provided by madsim, it becomes possible to control the various states of a specific node. As the code snippet on the slide shows, the init method receives an initial task, and when the node restarts it will run that task again, to simulate recovery after a node crash.

So how does madsim simulate deterministically? Currently, madsim is primarily used to simulate randomness, the network, and timers in Xline. For network communication, madsim provides an implementation called Endpoint. An Endpoint doesn't use an actual IP address and port; instead, it uses channels to simulate data communication in memory. Each Endpoint can be bound to any valid IP address and port and can send data to other Endpoints, and the routing in madsim allows connecting to any IP address. Madsim also provides a simulated timer and clock. The simulated timer records events, and the time in the madsim runtime is not real time; instead, it is advanced by the runtime to simulate the system clock. After running each task, the madsim runtime invokes advance-to-next-event to progress the system clock and trigger expired timer events.

And how does madsim reproduce Heisenbugs? Madsim uses a random number seed to reproduce the system's environment at the moment a Heisenbug occurred. There are two steps: step one is to obtain the corresponding seed when the Heisenbug occurs, and step two is to use the environment variable MADSIM_TEST_SEED to set that seed for the corresponding test case, like this:
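Since the slide with the command isn't reproduced in this transcript, here is a minimal sketch of what such a deterministic test can look like. It is based on madsim's documented API, but the exact macro and module names may differ across versions, and the test name here is a placeholder:

```rust
// Re-run a failing schedule deterministically with, for example:
//   MADSIM_TEST_SEED=2 cargo test flaky_timeout
// ("flaky_timeout" is a placeholder test name.)

use madsim::rand::{self, Rng};
use madsim::time::{sleep, Instant};
use std::time::Duration;

#[madsim::test]
async fn flaky_timeout() {
    let start = Instant::now();
    // Drawn from madsim's seeded global RNG, not from the OS: the same
    // seed always yields the same "random" delay.
    let delay_ms: u64 = rand::thread_rng().gen_range(1..100);
    // A simulated sleep: the runtime advances its virtual clock instead of
    // actually waiting.
    sleep(Duration::from_millis(delay_ms)).await;
    // The virtual clock has advanced by at least the requested delay.
    assert!(start.elapsed() >= Duration::from_millis(delay_ms));
}
```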
When creating the runtime, a specific seed needs to be given to madsim, and the runtime will use this seed to create a global random number generator and to override libc's random functions, ensuring that all the random numbers in the runtime are deterministically generated from this seed. Additionally, all I/O operations in the runtime, including time, timers, and network status, ultimately derive from this seed. Therefore, once the seed is determined, the system's running state is fixed, ensuring that the same seed will produce the same result on every run.

So how do you integrate madsim into your project? There are three steps. Step one is to use madsim's components to eliminate the uncertainties in your integration tests, such as the random number generator, time, Tokio, and tonic. Step two is to substitute your integration tests' test macros with madsim's test macros and set madsim as the corresponding runtime. And step three, the one thing you should be aware of: you need to convert certain interfaces with external dependencies into side-effect-free interfaces. For example, madsim-tonic does not provide the serve_with_incoming method, which accepts an external stream: since the stream may originate from outside madsim, it could introduce side effects beyond madsim's control.

That wraps up my presentation. If you'd like to learn more details about Xline, please feel free to visit the repository and the website. Thank you all for your time and attention.