Hello everyone, my name is Yin Zhenghao, and I'm a distributed storage engineer at DatenLord. Thank you for joining me today for this presentation on Chaos Engineering Testing and Analysis of Xline. This presentation aims to provide an overview of the application and analysis of Jepsen tests on the distributed KV store Xline. To begin, let's start with a brief introduction to Jepsen; I'm sure some of you may already be familiar with it. Then we will explore how to apply Jepsen tests to Xline. We will also dive into the test results and their analysis. And finally, we will discuss the lessons learned and future work for chaos engineering on Xline.

Now, let's familiarize ourselves with the Jepsen framework. As many of you may know, Jepsen is a powerful library used in chaos engineering. A Jepsen test is a Clojure program which uses the Jepsen library to set up a distributed system, run a bunch of operations against that system, and verify that the history of those operations makes sense. Jepsen consists of several components, including a database for testing, a generator that produces operations, a model and checkers for correctness, and nemesis components for fault injection.

I'm sure some of you are wondering how Jepsen works exactly. Well, the most important part of Jepsen is the checkers. Knossos is one of the checkers in Jepsen; it verifies whether operation histories are linearizable. Elle, on the other hand, checks for transactional consistency, or serializability. With these two combined, we can verify various types of databases that provide different consistency guarantees. In our testing, we specifically target strict serializability, which requires both linearizability and serializability. If you take a look at the image, you can see that transactions not only appear to take effect in real-time order, but each transaction also reads the modifications made by previous transactions.

Now, let's talk about the nemesis component. A nemesis provides fault injection capabilities, allowing us to simulate real-world failure scenarios. Jepsen provides some built-in nemeses, such as pause, which pauses the current process; kill, which kills the process; partition, which partitions the network among the nodes; and the clock nemesis, which skews the system clock. Besides these, you can also write your own nemesis tailored to your system. To give you a better understanding of how a nemesis works, let's take a look at this example of the partition nemesis. It partitions the cluster into a majority and a minority, and it can also create a ring network that may result in multiple majorities if the system design is flawed.

Now, let's move on to the setup of Jepsen for Xline. Before we do, let me provide you with some background information about Xline. Currently, Xline is a sandbox project of the Cloud Native Computing Foundation, providing KV storage for metadata management. Xline uses the CURP consensus protocol, which is friendly to geo-distributed deployments, and it offers an etcd-compatible API. Similar to etcd, Xline provides strictly serializable reads, writes, and transactions across the entire system, along with features like watch and distributed locks. Because Xline exposes an etcd-compatible API, we can reuse the existing Jepsen tests for etcd, and to improve performance, a native client is bundled with Xline. So what we need to implement ourselves are the DB part and the client part for the Jepsen Xline tests.
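To make those components a bit more concrete before we get to the specific tests, here is a minimal, self-contained sketch of how a Jepsen-style test loop fits together. This is my own illustration with made-up types, not Jepsen's or Xline's actual code: a generator produces operations, a client (the part we implement) applies them to the system under test, every operation and its result are recorded in a history, and a checker validates that history afterwards.

```rust
// Sketch of a Jepsen-style test loop with made-up types (not real Jepsen/Xline code):
// generator -> client -> history -> checker.

#[derive(Debug, Clone)]
enum Op {
    Read { key: String },
    Write { key: String, value: i64 },
}

#[derive(Debug)]
struct HistoryEntry {
    process: usize,      // which logical client issued the op
    op: Op,              // the operation itself
    result: Option<i64>, // value returned by a read, if any
}

/// Generator: yields the next operation to run (here, a fixed script).
fn generate(step: usize) -> Op {
    if step % 2 == 0 {
        Op::Write { key: "k".into(), value: step as i64 }
    } else {
        Op::Read { key: "k".into() }
    }
}

/// Client: applies an operation to the system under test.
/// In the real tests this would call Xline's etcd-compatible API; here it is a stub.
fn apply(state: &mut i64, op: &Op) -> Option<i64> {
    match op {
        Op::Write { value, .. } => { *state = *value; None }
        Op::Read { .. } => Some(*state),
    }
}

/// Checker: a toy validity rule -- every read must return the last written value.
fn check(history: &[HistoryEntry]) -> bool {
    let mut last_write = 0;
    history.iter().all(|e| match &e.op {
        Op::Write { value, .. } => { last_write = *value; true }
        Op::Read { .. } => e.result == Some(last_write),
    })
}

fn main() {
    let mut state = 0;
    let mut history = Vec::new();
    for step in 0..6 {
        let op = generate(step);
        let result = apply(&mut state, &op);
        history.push(HistoryEntry { process: step % 3, op, result });
    }
    println!("history valid: {}", check(&history));
}
```

The real checkers, Knossos and Elle, of course analyze concurrent histories for linearizability and serializability; the toy last-write rule here only shows where a checker sits in the loop.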
Speaking of tests, let's discuss the ones we selected. We chose a few tests directly from the original etcd test suite. The first one is the register test, which checks for linearizability. The second one is the set test, which checks for stale reads, along with the transaction tests. Lastly, we have the append test, which checks for strict serializability. Most of these tests are built around a common operation: compare-and-set. Just like etcd, Xline relies on predicate-based semantics in transactions and a global revision for each mutative operation. Let me demonstrate an example of a predicate-based compare-and-set operation with the guard function in the append test. This function ensures that a key has not been modified by the time of the transaction. As you can see, it first fetches the value of the key through a get operation and then checks that the mod revision of the value is still the same as the revision it just read. If the key is missing, the predicate instead requires that the key's mod revision be less than the latest global revision observed by the previous get operation. (I'll come back to this with a small sketch in a moment.)

Okay, let's move on to the analysis of the test results. Through the Jepsen tests, we observed two main categories of issues in Xline. The first category is asynchronous persistence issues, caused by Xline's original asynchronous I/O design. The second category is revision generation issues, caused by the etcd-compatible revision generation.

To elaborate on the asynchronous issues: unlike etcd's synchronous persistence approach, Xline used asynchronous methods to persist log entries and the KV storage, which introduced extra complexity. One issue we identified is that the read state implementation was incorrect, because the commit of an operation and the triggering of the index barrier happen asynchronously, and we had initially ignored the potential gap between them. Additionally, the KV storage might become inconsistent with the logs due to the asynchronous persistence. Another problem is that one transaction might read different values for the same key, since other commands may be executed during the transaction, thereby violating atomic execution. The asynchronous I/O introduced interleaving system states, making it difficult to reason about them. We found that synchronous I/O, despite some performance overhead, is simpler and ensures correctness. As a result, we decided to refactor Xline to use synchronous I/O.

Moving on to the second major category of issues: revision generation. Xline uses the CURP consensus protocol, which leverages command commutativity to achieve consensus in one round trip. etcd, on the other hand, employs the Raft protocol and executes commands sequentially. Our goal was to implement an etcd-compatible API for Xline while maintaining the one-round-trip performance. However, our analysis of the test results revealed that this approach was not feasible. The CURP protocol allows commands to execute concurrently if they commute, similar to the Fast Paxos approach; this is different from the state machine replication used by etcd, where commands are executed in a global order. In the Jepsen append test, we expect strictly serializable execution histories. However, CURP itself does not guarantee a global order over all commands. This means the commands do not execute in a serial order, resulting in transactions that do not satisfy linearizability, thus violating the strict serializability constraint.
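To make that commutativity point a bit more concrete, here is a tiny sketch. It is my own illustration under a simple key-conflict model, not CURP's actual implementation: two commands commute when their key sets are disjoint, so they can be executed in either order with the same result, which is also why there is no single global order over all commands.

```rust
// Toy illustration of command commutativity (not CURP's real conflict detection):
// commands conflict iff their key sets intersect.

use std::collections::HashSet;

#[derive(Debug)]
struct Command {
    keys: HashSet<String>, // keys the command reads or writes
}

/// Commands commute iff they touch disjoint key sets.
fn commute(a: &Command, b: &Command) -> bool {
    a.keys.is_disjoint(&b.keys)
}

fn main() {
    let put_a = Command { keys: HashSet::from(["a".to_string()]) };
    let put_b = Command { keys: HashSet::from(["b".to_string()]) };
    let txn_ab = Command { keys: HashSet::from(["a".to_string(), "b".to_string()]) };

    // put(a) and put(b) commute: either execution order yields the same state.
    assert!(commute(&put_a, &put_b));
    // A transaction over {a, b} conflicts with both puts, so it must be ordered against them.
    assert!(!commute(&put_a, &txn_ab));
    println!("commutativity checks passed");
}
```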
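And before we move on, here is the small sketch of the append test's guard predicate that I promised earlier. The types are made up for illustration rather than taken from the real client API, but the logic follows what I described: if the key existed when we read it, its mod revision must still equal the revision we read; if it was missing, its mod revision must stay below the global revision observed by the get.

```rust
// Sketch of the guard predicate described above, with made-up types (not the actual
// test code or client API). After a `get`, we remember either the key's mod revision
// or, if the key was missing, the latest global revision from the response. The
// transaction's guard then checks that nothing has touched the key in the meantime.

#[derive(Debug, Clone, Copy)]
enum Guard {
    /// Key existed: its mod revision must still equal what we read.
    ModRevisionEquals(i64),
    /// Key was missing: its mod revision must be below the global revision we observed,
    /// i.e. it has not been created since our read.
    ModRevisionLessThan(i64),
}

/// Build the guard from the result of a prior get.
fn build_guard(read_mod_revision: Option<i64>, observed_global_revision: i64) -> Guard {
    match read_mod_revision {
        Some(rev) => Guard::ModRevisionEquals(rev),
        None => Guard::ModRevisionLessThan(observed_global_revision),
    }
}

/// Evaluate the guard against the key's current mod revision (0 if it does not exist),
/// mimicking the predicate evaluated inside the transaction.
fn guard_holds(guard: Guard, current_mod_revision: i64) -> bool {
    match guard {
        Guard::ModRevisionEquals(rev) => current_mod_revision == rev,
        Guard::ModRevisionLessThan(rev) => current_mod_revision < rev,
    }
}

fn main() {
    // Key read at mod revision 7: unchanged -> guard holds, changed -> it fails.
    let g = build_guard(Some(7), 10);
    assert!(guard_holds(g, 7));
    assert!(!guard_holds(g, 9));

    // Key missing when the global revision was 10: a creation at revision 11 fails the guard.
    let g = build_guard(None, 10);
    assert!(guard_holds(g, 0));
    assert!(!guard_holds(g, 11));
}
```

Wrapping the actual mutations in a transaction gated by such a predicate is what gives these tests their compare-and-set semantics.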
Now, let's discuss the lessons we learned from debugging. When debugging a distributed system, understanding the topology of events across multiple nodes can be challenging. However, based on my experience debugging Xline, I would like to share some tips with you. Firstly, logs are crucial for debugging a distributed, multi-node system. Make sure they provide sufficient information for tracing purposes. Avoid unnecessary messages, and especially avoid logging large objects, which generate a huge amount of noise and make debugging even harder. Starting to debug from a small sample is also important: begin with a small, easy-to-understand sample. As you can see from the Jepsen test result examples, the upper graph is challenging to understand, but the bottom one is much clearer for a human. Sometimes it may take running the test multiple times to find a suitable sample. Lastly, if you are developing a distributed system, I strongly recommend integrating chaos engineering methods early on. Humans are not good at analyzing complex systems, and traditional tests often lack sufficient coverage. By incorporating chaos testing, you can significantly increase test coverage and uncover bugs that may not be revealed by traditional tests.

For the final part, let's discuss the future work we have planned for chaos engineering. Analyzing logs can already be a tedious experience, especially when the need for additional tracing logs arises; in such cases, you often have to add the necessary code and essentially restart the entire debugging process. The non-deterministic nature of Jepsen tests further adds to the debugging time, and it becomes challenging to reason about the system state based solely on a single log file. To address this, we plan to migrate some of the Jepsen features, such as generators, checkers, and nemeses, to madsim, a deterministic simulation framework. This will allow us to conduct Jepsen-like tests while maintaining test determinism, which is expected to greatly improve debugging efficiency since we can reproduce the same results deterministically. Additionally, integrating madsim into a CI environment will be more convenient, as it runs locally on a single machine, unlike Jepsen, which requires running tests separately on multiple machines.

That's all for today's presentation. I hope you found it helpful for understanding chaos testing and debugging distributed systems. For more information, please visit our website or feel free to reach out to us directly. Thank you.
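One last sketch, for anyone curious what that determinism buys in practice. It is a toy illustration using the rand crate with made-up fault types, not madsim's actual API: all randomness, in this case a fault schedule, is driven from a single seed, so the same seed replays exactly the same run whenever a failure needs to be reproduced.

```rust
// Toy illustration of deterministic replay (not madsim's API): drive all randomness,
// here a made-up fault schedule, from one seed so a failing run can be reproduced.

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

#[derive(Debug, PartialEq)]
enum Fault {
    PartitionNode(usize),
    PauseNode(usize),
    None,
}

/// Generate a fault schedule of `steps` steps from a fixed seed.
fn fault_schedule(seed: u64, steps: usize, nodes: usize) -> Vec<Fault> {
    let mut rng = StdRng::seed_from_u64(seed);
    (0..steps)
        .map(|_| match rng.gen_range(0..3) {
            0 => Fault::PartitionNode(rng.gen_range(0..nodes)),
            1 => Fault::PauseNode(rng.gen_range(0..nodes)),
            _ => Fault::None,
        })
        .collect()
}

fn main() {
    // The same seed always yields the same schedule, so a failing run can be replayed.
    let a = fault_schedule(42, 10, 3);
    let b = fault_schedule(42, 10, 3);
    assert_eq!(a, b);
    println!("seed 42 schedule: {:?}", a);
}
```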