Hello everyone, thanks for joining us today. I am Charles, a software engineer at PingCAP. Today I'm going to talk about the evolution of TiKV. The talk will be divided into four sections. First, we will briefly introduce the history of TiKV and summarize the major requirements for a cloud native key value store. Then I will give a high-level introduction to the overall architecture of TiKV, which can help you better understand the rest of the talk. After that, I will elaborate on how TiKV meets the requirements we mentioned in the first part. And finally, I will introduce how we adapt TiKV to modern cloud infrastructure and why it can be a solid building block, not only for a cloud native database, but also for other higher-level cloud services.

Okay, let's get started. First, why did we want to build TiKV in the first place? There are so many key value stores out there; why don't they fit our case? And what are the requirements for a cloud native key value store? To answer that, we have to talk a little bit about the history of TiDB, a cloud native database that supports both OLTP and OLAP workloads. Back in 2015, when we built the first version of TiDB, we built it on top of HBase and HDFS. If you have ever dealt with the Hadoop software stack, you may share the feeling that it is not easy to work with. In addition, since we wanted a database that supports distributed transactions, we had to add distributed transaction support on top of HBase, which resulted in poor performance and high operational cost. After suffering enough, we finally decided to build our own storage engine with the following requirements. First, it must be able to leverage modern cloud infrastructure to easily scale out like other NoSQL systems. Second, it should support high-performance IO. Third, it should support distributed transactions inherently. Fourth, it should use a modern data replication protocol, for which we chose Raft. Fifth, we want it to be easy to maintain, so it must use a clean architecture. And last but not least, it should be easy to use: anyone who has used another key value store should be able to pick it up very quickly. So from day one, we decided to develop a storage engine that can be a solid building block for other distributed systems and really help people out, and this principle applies to all software from the PingCAP community.

In general, we can divide the aforementioned requirements into four categories. The cloud native storage engine we wanted to build should, first, be highly scalable: a large data processing platform built on top of it may need to store tens of millions of entries and hundreds of terabytes of data, which requires the backend key value store to be highly scalable. Second, it should ensure data consistency. As we mentioned before, we want a storage engine that supports distributed transactions inherently and uses a modern consensus protocol. Third, it should support high-performance IO. If hundreds of clients try to read or write the storage at the same time, it is critical that they get a response within a short time. And fourth, it should be extremely reliable. A key value store usually serves as the brain of the whole system, and the failure of the data storage can result in losing control of the whole system, which is intolerable. Next, I will elaborate on how TiKV meets all these requirements.
But before that, let me briefly introduce the overall system architecture of TiKV from a high level, and hopefully it can help you better understand the rest of this talk. A TiKV cluster looks like this. The TiKV nodes are at the bottom. You can have as many as you like, but we usually suggest having at least three. We split the data into regions. I know the name is confusing, but you can just treat a region as a subset of the data. As shown in the graph, we split the complete data set into five subsets, which are five regions. Each region contains multiple replicas, which are spread across different TiKV nodes. In most cases, three replicas should be enough. On the top, we have the TiKV client communicating with the TiKV cluster through the gRPC protocol. And on the top right, we have the placement driver, which is the scheduler of the TiKV cluster. The placement driver has three major jobs. First, generate globally monotonic and unique timestamps for transactions, also referred to as the TSO. Sometimes people like to use the term timestamp oracle, but I think it is a confusing term, so let's stick with TSO. Second, manage all data regions across TiKV nodes, like rebalancing regions and migrating regions from offline nodes to active nodes. Third, since the placement driver stores the metadata of all regions, it is also responsible for routing read and write requests from clients to the correct node.

Okay, we have seen what TiKV looks like at the cluster level. Next, let's have a look at how TiKV is implemented internally at the node level. Each TiKV node can be divided into four layers. From the bottom up, we have the storage engine layer, which is a low-level local key value store; we chose RocksDB. We leverage this layer to persist the data into the local file system. We did not implement a local key value store from scratch, as implementing a high-performance local key value store requires a lot of effort, and RocksDB has been proven reliable, with predictable performance, in production by many data management systems. One level up, we have the consensus layer. This layer provides an abstraction over the multiple replicas, which allows the upper layers to feel like they are dealing with one piece of data instead of multiple replicas. Like etcd, we use the Raft consensus protocol to ensure data consistency between replicas across multiple TiKV nodes. But unlike etcd, we use multiple Raft groups instead of a single Raft group to improve scalability and IO throughput, which we will cover in the following slides. On top of the consensus layer, we have the transaction layer. We use multi-version concurrency control, MVCC, to implement the Google Percolator protocol, which commits transactions using a two-phase commit algorithm. A lot of database terms, right? If you feel like it is a little bit too much, that's normal. It is not easy for database experts either, and in my opinion distributed transactions are one of the hardest parts of a database management system to understand. But fortunately, TiKV has already implemented them, and we can simply rely on it. On the top, we have the key value API layer, which is responsible for handling requests sent from clients. The API is quite generic: get, put, delete, scan, et cetera. If you have ever used any other key value store before, then I believe you can pick it up very quickly.
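To make that a bit more concrete, here is a minimal sketch of what talking to the raw key value API could look like from a client. This is only an illustration: it assumes the community tikv-client Rust crate, a placement driver listening on 127.0.0.1:2379, and the tokio runtime, and exact method names may differ between client versions.

```rust
// A minimal sketch of TiKV's raw key value API, assuming the `tikv-client`
// and `tokio` crates as dependencies. Endpoint and keys are placeholders.
use tikv_client::RawClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The client only needs the placement driver endpoints; the placement
    // driver routes each request to the TiKV node holding the right region.
    let client = RawClient::new(vec!["127.0.0.1:2379"]).await?;

    client.put("hello".to_owned(), "world".to_owned()).await?;
    let value = client.get("hello".to_owned()).await?;
    println!("got: {:?}", value); // prints the raw bytes of "world"

    client.delete("hello".to_owned()).await?;
    Ok(())
}
```

The transactional API, which we will come back to in a moment, looks very similar from the outside.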
So now we have enough background, and I can start explaining how TiKV addresses the requirements we mentioned before. First, let's find out whether TiKV is highly scalable. In TiKV, we divide large data sets into small regions, with the replicas of each region making up their own Raft group. The number of replicas may change depending on how the TiKV nodes are spread geographically, but in most cases a single TiKV node does not need to store a full copy of the data. That is to say, TiKV is horizontally scalable: when the volume of data increases, we can simply scale out the TiKV cluster by adding more nodes. You may be curious: if each node does not have a complete copy of the data, how can we tell which node to access when we want to read or write certain entries? If you remember, we have a central component called the placement driver, which acts as the scheduler of the TiKV cluster. Since the replicas of each region make up their own Raft group, the leader of each group holds the complete information about the region, like the number of active replicas and the topology of the replicas, and the leader of each group sends a heartbeat to the placement driver periodically, including all this information. Then, when a client wants to send a request to the TiKV cluster, it first queries the placement driver, gets a target node, and then routes the request to that node. The placement driver also lets users create highly customized configurations, like the maximum number of replicas in each region, how leaders are balanced across nodes, and the scheduling throughput, because too much data balancing and migration may affect the performance of online services. To sum up, by breaking the data into partitions and spreading the replicas of each partition across nodes, a TiKV cluster can easily be scaled out to hundreds of nodes.

Here are two large real-world TiKV clusters. On the left is one of our users' tweets: a deployment backed by a large TiKV cluster consisting of 168 nodes, with 1,820 billion rows and 318 terabytes of data, which needs to support 100 million reads and 87,000 writes per second. The user thought it was the largest one, but soon we had a larger one, which consists of 212 nodes and can hold up to 827 terabytes of data. We have not pushed beyond that yet, but we did not experience any pressure with this cluster, so I believe we can actually grow larger.

Okay, now we have met the scalability requirement. Next, we need to ensure data consistency. When we talk about the data consistency of a distributed database or data store, we usually need to achieve two goals: one is isolating transactions from each other, and the other is keeping the data consistent between replicas. These are two complicated topics, each worth a whole session to discuss, and we do not have time for that today, so let's try to keep it simple. In short, a transaction is an operation unit containing multiple operations. For example, you may want to read an entry and update it depending on the existing value; this involves one read operation and one write operation, and isolation between transactions means two transactions will not interfere with each other. Meanwhile, we have multiple replicas for each data partition, so we need to make sure the contents of the replicas are consistent with each other. As we can see in the graph, the left subtree contains the different transaction isolation levels and the right subtree contains the consistency models.
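To make the read-modify-write example from a moment ago concrete, here is what it could look like as a transaction through TiKV's transactional API. Again, this is only a sketch: it assumes the community tikv-client Rust crate, a placement driver at 127.0.0.1:2379, and a made-up key called counter, and the method names may vary between client versions.

```rust
// A read-modify-write transaction, sketched with the `tikv-client` crate's
// transactional API. Endpoint and key names are placeholders.
use tikv_client::TransactionClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = TransactionClient::new(vec!["127.0.0.1:2379"]).await?;

    // Begin an optimistic transaction: it reads from its own consistent
    // snapshot, and write conflicts are detected at commit time.
    let mut txn = client.begin_optimistic().await?;

    // One read operation...
    let current = txn.get("counter".to_owned()).await?;
    let n: u64 = current
        .and_then(|v| String::from_utf8(v).ok())
        .and_then(|s| s.parse().ok())
        .unwrap_or(0);

    // ...and one write operation that depends on the value we just read.
    txn.put("counter".to_owned(), (n + 1).to_string()).await?;

    // The two-phase commit happens behind this single call.
    txn.commit().await?;
    Ok(())
}
```

If two such transactions race on the same key, only one of them commits and the other has to retry, which is exactly the kind of isolation we are talking about here.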
If you still remember, TiKV internally contains four layers, and the middle two are the transaction layer and the consensus layer. The transaction layer ensures two things. One is snapshot isolation, which means we create an independent, consistent snapshot of the database for each transaction; its changes are visible only to that transaction until commit time. This allows us to handle transactions concurrently as long as they do not overlap with each other. The other is repeatable read, which means data that has been read cannot change underneath the transaction: if the transaction reads the same data again, it will find the previously read data in place, unchanged. As for the consensus layer, we use the Raft consensus protocol, so TiKV ensures strong consistency, also known as linearizability, among the replicas of each object.

The next requirement we need to meet is high-performance IO. For reads, we use multi-version concurrency control to implement the two-phase commit, which means we version each key value pair and create an isolated snapshot for each read and write; this prevents a read request from being blocked when there is an ongoing write on the same entry. We also apply many other optimization techniques. For example, we support follower read, which allows us to spread the read workload across all replicas, and we prioritize small reads to prevent the overall throughput from being affected by a few large reads. For writes, since we divide the data into partitions, where the replicas of each partition make up their own Raft group, write requests on key value pairs belonging to different partitions can be handled concurrently. We also try to optimize write performance at the local storage engine level by developing a RocksDB plugin called Titan. Even though RocksDB provides good and predictable performance, it may still suffer from write amplification, as it uses a log-structured merge tree internally. Basically, each time we update or delete a key value pair, RocksDB appends the operation to the end of a log file instead of updating it in place, and it periodically compacts these files to remove the redundant key value pairs, which causes the write amplification. Titan mitigates this by storing large values outside of RocksDB. And for scans, since we do range-based sharding that divides the data into continuous ranges determined by key order, scanning keys that share the same prefix can be very efficient, since they are usually located in a handful of regions; I will show a small scan sketch at the end of this part. Besides all this, we also provide some configurable optimization features, like load-based splitting on the placement driver, which splits hot regions that receive a lot of read and write requests.

Last but not least, we need to make sure TiKV is reliable; otherwise, all the aforementioned advantages would be meaningless. We usually recommend users run at least three TiKV nodes and keep three replicas for each partition. By default, we spread replicas across nodes, so we can tolerate the failure of some of the nodes. If you plan to deploy TiKV on a public cloud, you can also spread TiKV nodes across multiple availability zones, and then the placement driver will place a complete copy of the data in each zone. If we lose some of the nodes, the placement driver automatically rebalances the replicas and the leaders to make sure the total number of replicas is unchanged. Besides that, we also provide tools that can help users recover the cluster from disaster.
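Here is the scan sketch I promised a moment ago. Like the earlier snippets, it assumes the community tikv-client Rust crate and placeholder key names; the point is simply that a prefix scan maps onto a single contiguous key range.

```rust
// Scanning all keys that share the "user:" prefix. Because TiKV shards data
// by contiguous key ranges, these keys usually live in only a few regions.
use tikv_client::RawClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = RawClient::new(vec!["127.0.0.1:2379"]).await?;

    // "user;" is the smallest key greater than every "user:"-prefixed key,
    // so ["user:", "user;") covers exactly the prefix. Return at most 100.
    let pairs = client
        .scan("user:".to_owned().."user;".to_owned(), 100)
        .await?;
    for pair in pairs {
        println!("{:?}", pair);
    }
    Ok(())
}
```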
In addition, we also make sure that TiKV is well tested. We run regression tests frequently to ensure new features are backward compatible and do not break behavior from historical versions. We also run large multi-day performance tests that use real-world data to ensure the performance is predictable. Since many users adopt TiKV as the backend storage engine to build their own software-as-a-service products, predictable performance is critical for making service level agreements for higher-level services. A few of you may be familiar with Jepsen; they run tests on various distributed systems to validate their consistency claims. We also run a range of Jepsen tests on TiKV to ensure we deliver the transactional promises we have made. And we do chaos engineering with Chaos Mesh. If you are not familiar with it, Chaos Mesh is another project initiated by PingCAP that has been donated to the CNCF as an incubating project. It allows us to simulate unexpected situations like nodes going down, network breaks, and so forth.

As we can see, TiKV is a highly scalable distributed key value store with high-performance IO and high reliability, and it works pretty well in an on-premise environment. However, we are seeing ever-growing demand for deploying TiKV on public clouds. Public cloud infrastructure is very different from on-premise infrastructure, so how can we ensure that TiKV running on the public cloud still provides reliable performance? Nowadays, most public cloud vendors provide virtualized disks. Those disks can be mounted to a local file system and they appear like local disks as well, but internally they are forwarding IO to multiple remote disks that are potentially shared by multiple users. AWS EBS, for example, replicates every write IO to three different locations. This internal complexity can be a real problem for our system, because the latency of such storage is obviously higher than a local disk. Also, since you are sharing hardware with other people, anything that you use will be charged, and that includes disk bandwidth and IOPS. Finally, we should all know that cloud infrastructure is not always as reliable as it claims to be: service degradation is relatively frequent, and it should be considered in the system design.

Ideally, we want a large TiKV cluster to behave like a traditional relational database system, but unfortunately it is actually hard to accomplish that on cloud storage. First, we want to build a scalable service, but scale means more failures. To be more specific, we have to worry that as the system scales, its storage hardware performance is very likely to degrade. Cost is a problem too, because every storage operation is charged by the cloud vendors. Users now have more reasons to care about exactly how and why our system is using those resources, and by that I mean read and write amplification, which is the amount of IO the system needs to perform to finish one user request. Here is a simple graph to demonstrate our system's runtime usage of IO resources. Over time, the user writes are very stable, as you can see from the yellow bars here, but they are amplified multiple times by background writes, which include compaction, since TiKV uses RocksDB internally, and garbage collection. In addition to that, large events incur extra IO that is usually not predictable. From this graph, we can see that the most straightforward way to reduce cost and improve scalability is to keep IO usage under the watermark at all times.
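To get a feel for what that amplification means in numbers, here is a tiny back-of-envelope sketch. The figures are made up purely for illustration; real ratios depend on the workload, the compaction settings, and so on.

```rust
/// Back-of-envelope write amplification: how many bytes actually hit the
/// disk for every byte the user writes. Background traffic such as log
/// writes, compaction rewrites, and garbage collection all count against us.
fn write_amplification(user_gib: f64, log_gib: f64, compaction_gib: f64, gc_gib: f64) -> f64 {
    (user_gib + log_gib + compaction_gib + gc_gib) / user_gib
}

fn main() {
    // Hypothetical numbers: 1 GiB of user writes turning into roughly 1 GiB
    // of log writes, 4 GiB of compaction rewrites, and 0.5 GiB of GC traffic.
    let amp = write_amplification(1.0, 1.0, 4.0, 0.5);
    println!("write amplification is about {:.1}x", amp); // about 6.5x
}
```

Every extra multiple here is bandwidth and IOPS the cloud vendor bills for.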
For that, we introduced two new features. The first is Raft Engine, a new log store for TiKV written in Rust. For those who don't know, we previously used RocksDB to store all the transaction logs. It was not the optimal choice, but it was a decent solution at the early stage, and right now we want to replace and improve it. The primary goal here is to write less than RocksDB does; consequently, we can reduce IO cost and reduce the possibility of hitting storage performance limits. This new Raft Engine can also help us improve performance; that work is still under development, and we are really hoping more contributors can join us to improve it in the future.

Now, let's talk about how we accomplish the primary goal. Raft Engine maintains an in-memory index of all log entries. The reason we do that is not to improve read performance; it's actually about reducing background work. In RocksDB, compaction is needed to keep the data sorted and to clean up deleted data. But in Raft Engine, we don't need to sort anything, and garbage collection doesn't need to read out obsolete data, because we have a map of all active data in memory. Raft Engine further reduces foreground IO with compression: all log entries are compressed with LZ4 before they are written. Mainly with those two techniques, we are able to cut about 30% of all server write IO.

The second feature I would like to talk about is called priority IO scheduling. It's not a new thing; many systems already have it. What we managed to do is add this functionality without introducing a major change to the architecture. That means we did not change the internal tasking system: no additional IO queuing is required and there is no extra overhead. We trace and categorize all system IO into three different priorities; let's call them priority A, priority B, and priority C here. During execution, we periodically assign individual IO limits to those priorities. At the beginning, the IO limits are high and all IO runs with no restrictions. Eventually, here at epoch two, the IO usage exceeds a predetermined global limit; that is what we call an overflow. After the overflow, we adjust the IO limits of the lower priorities to make sure that in the next epoch the system will not use so much IO. The algorithm is heuristic and not perfect, but it works pretty well in practice. Here, we conducted a test to simulate large events during an online workload: a large table is imported while a TPC-C workload is running. After applying IO scheduling, as you can see on the left-hand side, the system performance is much more stable than before.

Meanwhile, several new cloud-oriented features are under intensive development, like CPU limiting, which is designed for environments with limited resources, and Raft witness, which reduces replication cost by using write-only nodes that replicate the transaction log without holding readable data. In the future, we want TiKV to further adapt to cloud hardware, and we truly hope community users can benefit from our work. That's all for today's talk. If you are interested in TiKV, feel free to try it out or join our community Slack channel. If you have any questions or comments, please shoot me an email. Thanks.