Hello, everyone. I'm an engineer from PingCAP, the company that made TiKV, and I'm here to share with you our experience implementing distributed event streaming from TiKV. Before we get down to business, here's some background about myself. I'm Zixun Liu, a software engineer at PingCAP, and I'm part of the data migration team. I'm a contributor to TiKV and one of the maintainers of TiCDC. Within the field of computer science, my interests are databases, operating systems, and distributed systems. As for education, I'm an alumnus of the University of Chicago, class of 2018.

Now let's go over today's agenda. First, I'll give a brief introduction to TiCDC. Then I'll talk about the high-level design of our distributed event streaming project. Then I'll cover mainly two aspects of the detailed design. And then I'll also talk about our use of etcd and the high availability we implemented with it.

Okay. Now let me talk a little bit about TiKV to give you some background for the main topic of this talk. TiKV is a distributed transactional key-value database, and it is a graduated project of the CNCF. This diagram shows the overall architecture of TiKV: a TiKV cluster with four machines. We use a shared-nothing architecture, so TiKV can run on commodity hardware with no special equipment needed. Each machine can host multiple regions, and the number of regions on each machine can be quite high, like more than 10,000. These regions are replicated using the Raft protocol and are regulated by the Placement Driver, a component that runs outside of the TiKV process and communicates with TiKV over gRPC.

Here I summarize four points that are helpful to understand before we proceed. First, TiKV is a transactional key-value store where you can manipulate multiple keys within a transaction. Second, the keys are sharded into multiple regions; you can regard a region as a replicated key-value store. Third, the regions are replicated using the Raft protocol, which provides strong consistency among the replicas. And fourth, regions can merge and split while being regulated by the Placement Driver.

In order to explain the rationale behind the design choices we made when implementing data streaming, it's important to go over the abstraction layers of TiKV. First, on top of everything is the transaction layer, where our concurrency control protocol is implemented. We use a Percolator-like protocol, but I will not go into the details of that. Our protocol requires a multi-version key-value store, and that's our second layer. The third layer is called RaftKV, which is where replication comes into play; we use Raft to maintain consistent replicas of regions, if you recall. At the bottom is RocksDB, which we use to persist data to disk.

Now with these layers in mind, let me introduce the main goal that we want to achieve here. At a high level, our goal is to create connectivity from TiKV. This is what the development of the data processing ecosystem is demanding, and real-time data capture is meaningful in that it creates connectivity to other parts of the ecosystem. For example, we can capture all data changes in TiKV to Kafka, and from there the data can be processed by Flink, and so on. So this project is actually a step towards using TiKV as a hub for data exchange.
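Before moving on, here is a minimal, hypothetical Rust sketch of the four abstraction layers I just described, since they matter for where we capture changes. The trait and method names are mine, purely for illustration; they are not TiKV's actual internal APIs.

```rust
// Illustrative only: TiKV's four abstraction layers, bottom to top.
// None of these names are real TiKV APIs.

/// Bottom layer: a local persistent key-value engine (RocksDB in TiKV).
trait LocalEngine {
    fn put(&mut self, key: Vec<u8>, value: Vec<u8>);
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
}

/// RaftKV layer: each region is a Raft-replicated state machine whose
/// committed entries are applied to the local engine.
trait RaftKv {
    fn propose_write(&mut self, region_id: u64, key: Vec<u8>, value: Vec<u8>);
}

/// MVCC layer: each user key is stored under multiple versions keyed by timestamp.
trait Mvcc {
    fn write(&mut self, key: Vec<u8>, ts: u64, value: Vec<u8>);
    fn read(&self, key: &[u8], ts: u64) -> Option<Vec<u8>>;
}

/// Transaction layer: a Percolator-style two-phase protocol (prewrite, then commit).
trait Txn {
    fn prewrite(&mut self, start_ts: u64, mutations: Vec<(Vec<u8>, Vec<u8>)>);
    fn commit(&mut self, start_ts: u64, commit_ts: u64, keys: Vec<Vec<u8>>);
}
```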
Using TiKV as a data exchange hub is something very significant, and we also have other projects that handle data imports from other sources. The most basic goal is to be able to capture new writes to TiKV; this is basically what we mean when we talk about the events we need. But the new data alone is not enough, because from time to time we'll need to read historical data from RocksDB: as a user of a change data capture tool, you may want to go a little bit back into history and retrieve, say, a data stream that starts from five minutes before the present. The most important and also the most challenging sub-goal is to preserve the atomicity and consistency of transactions. A consumer of the change log should be able to retrieve the transactions in exactly the order in which they were applied to the database. Such an order must exist because we support snapshot isolation. We'll talk more about transactions later.

When we got started, we first needed to decide where we should capture the data. We wanted to choose an appropriate abstraction level at which to intercept the data writes. First, we ruled out the transaction layer. Implementing data capture at the transaction layer basically means that we would have to record each API call, which means we would need to do a lot of extra bookkeeping to achieve the level of consistency that we wanted. So we went down one level and looked at the MVCC layer. This is not a good level to work with either, because it's a distributed multi-version key-value store and there's no centralized data structure that we can monitor. So we needed to go further down; let's take a look at RaftKV. RaftKV is a system of replicated state machines, meaning that we can monitor the state transitions on each of these machines. We also need to make use of RocksDB, even though it's not good for listening for changes: from time to time we need to read from it to get data that was written in the past. It's actually possible to capture RocksDB's write-ahead log, but since regions are constantly moving here and there across the nodes, and nodes can go down and come up at any time, it's hard to circumvent the concept of the region and capture RocksDB changes directly. Now, since there's a one-to-one correspondence between Raft state machines and TiKV regions, we decided on an architecture where we capture data on each region and then combine the data streams from all regions.

I'm going to walk you through the thought process through which we arrived at the current design of the algorithm we use to combine the data. The high-level design is this. First, we receive data from all regions; we can think of this as receiving from multiple streams simultaneously. When we determine that all regions are done sending transactions whose commit ts is less than or equal to some value, we produce a watermark, a watermark in the conventional sense for a streaming system. Then we sort the data and emit the transactions that are earlier than the latest watermark. With this high-level idea in mind, let's continue.

This is an overview of the types of events that we receive from each region. We have Prewrite, which carries the start ts of the transaction, the key, and the value. We have Commit, which is paired with a Prewrite and contains the start ts, the commit ts, and the key. We have Lock, which contains the start ts and the key, and we have Unlock. Now we know the four types of events.
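Written down as a data type, the four event types could look like the small Rust sketch below. The exact field layout is my own illustration, not TiKV's actual wire format; for Unlock in particular I'm assuming the same fields as Lock.

```rust
/// Illustrative per-region change event, mirroring the four event types above.
/// Field layout is assumed for illustration, not TiKV's real protobuf definitions.
enum RegionEvent {
    /// A Percolator prewrite: the transaction's start ts plus the key and value.
    Prewrite { start_ts: u64, key: Vec<u8>, value: Vec<u8> },
    /// A commit, paired with an earlier prewrite on the same key.
    Commit { start_ts: u64, commit_ts: u64, key: Vec<u8> },
    /// A lock placed by a transaction that has started but not yet committed.
    Lock { start_ts: u64, key: Vec<u8> },
    /// The matching lock release (fields assumed to mirror Lock).
    Unlock { start_ts: u64, key: Vec<u8> },
}
```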
Let's take a look at our first attempt at an algorithm to produce the final data stream. Here I use pseudocode with a syntax similar to that of Rust, since TiKV is written in Rust. We assume a data type called TxnEvent that represents the items of a stream that has watermarks in it. We have a special name for this watermark: we call it the resolved ts. But quickly we ran into a problem: what should the resolved ts be? We didn't really have enough information here. The locks are being tracked, and it's easy to know the start ts of the transactions that have placed at least one lock. The problem is that we don't know whether there is a transaction with an earlier start ts that is still being processed and has yet to write a lock. Here we summarize the problem: the events themselves say nothing about whether a region has a pending transaction whose commit ts is smaller than the minimum of all current locks' start ts. It's possible for such a transaction to commit if there's no conflict. To resolve this problem we need the Placement Driver's assistance.

Now let's take a second try. We no longer try to generate a watermark from the same thread as the one that processes the region's data. Notice that we still need to track the locks, because the start ts of the locks provides a lower bound on the commit ts of the running transactions. In addition to the function shown in the previous slide, we need to launch another thread. The reason is that we need a barrier here: we need at least two concurrent threads because the barrier needs to wait for all pending writes to be observed. In this piece of code, we first read the latest logical timestamp from PD, and then we enforce what we call a barrier, so that when we read the locks, all previous writes will have already been processed. This way we can generate a watermark for a single region. Next I'll show how to further process the data.

But before getting into the details, I'll introduce you to another distributed component, separate from TiKV: TiCDC. TiCDC is designed with three goals in mind. First, from the management point of view, we wanted the data streaming component to iterate separately from TiKV. Second, we aimed to design a fully distributed component to eliminate every single point of failure in the stream processing. And third, we wanted a component that is separate from TiKV, because we'd like it to work with TiDB, the relational database project that we built on top of TiKV; TiDB is managed independently from TiKV, so we wanted TiCDC to be independent too.

This is what the architecture looks like with TiCDC. This diagram does not show the TiKV nodes, because the nodes are no longer important here. TiCDC communicates directly with the regions, which can migrate from node to node, so we don't need to think about TiKV nodes when we design the high-level logic of TiCDC. There are two concepts that we need to clarify a little bit: spans and regions. Spans are just intervals in the key space. The keys in TiKV are totally ordered, so it's possible to define ranges, and the Placement Driver can be queried to find out which regions correspond to which span. TiCDC then schedules gRPC connections to TiKV regions with the knowledge of this span-to-region mapping. When regions merge or split, TiCDC will reschedule the affected spans based on the new span-to-region mapping.
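To make that second try concrete, here is a minimal Rust sketch of the per-region watermark computation. All of the type and function names here (Region, PdClient, get_tso, wait_for_applied_writes) are placeholders of my own for illustration; they are not TiKV's real internals.

```rust
// Simplified sketch of the "second try" for computing a per-region resolved ts.
// All names are illustrative placeholders, not TiKV's actual APIs.

struct Lock {
    start_ts: u64,
}

struct Region {
    // Locks currently held by transactions that have prewritten but not committed.
    locks: Vec<Lock>,
}

impl Region {
    /// Barrier: block until every write proposed before `ts` has been observed
    /// by the capture code, so that `self.locks` is complete up to `ts`.
    fn wait_for_applied_writes(&self, _ts: u64) { /* omitted in this sketch */ }
}

struct PdClient;

impl PdClient {
    /// Fetch the latest logical timestamp from the Placement Driver.
    fn get_tso(&self) -> u64 { 0 /* placeholder */ }
}

/// Runs on a thread separate from the one processing the region's event stream.
fn advance_resolved_ts(region: &Region, pd: &PdClient) -> u64 {
    // 1. Read the latest logical timestamp from PD.
    let pd_ts = pd.get_tso();

    // 2. Enforce the barrier so all previous writes have been processed before
    //    we inspect the locks.
    region.wait_for_applied_writes(pd_ts);

    // 3. A lock's start ts is a lower bound on the commit ts of its transaction,
    //    so the resolved ts may not move past the smallest one.
    let min_lock_ts = region
        .locks
        .iter()
        .map(|l| l.start_ts)
        .min()
        .unwrap_or(u64::MAX);

    // Every transaction with commit ts <= this value has already been observed.
    pd_ts.min(min_lock_ts)
}
```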
Another reason why spans are important is that they are used as keys into a segment tree, which we use to track the watermarks of the regions. Here is an example. It shows how TiCDC connects to the regions and receives the pushed stream from each of them, and highlighted are the resolved ts values being produced by each of the regions. This example is a little complicated, so you can take a closer look at it when you download the slides.

There are a few more points we need to mention for you to get the complete picture. First, now that we have a stream of events separated by watermarks, it's very easy to sort the stream and combine the events to produce a stream of transactions; there is a small sketch of this merge step at the end of this transcript. Second, we implemented high availability, using etcd to distribute TiCDC. Third, to make TiCDC a useful product, we made it support decoding TiDB records, so it can be used to capture changes from TiDB. What we have implemented is similar to MySQL binlog, but TiCDC is capable of writing the events directly to various kinds of downstreams such as message queues, S3 files, and MySQL-compatible relational databases such as MySQL itself and TiDB.

One important point in our implementation is that we make extensive use of etcd. We use it mainly in two ways. First, we use it as persistent state storage for fault tolerance and recovery. This makes TiCDC a stateless service and makes it very easy to deploy and maintain. Second, we use etcd as a consensus service to maintain a current view of the TiCDC cluster's state that all nodes can agree on. This is crucial because if you want to implement high availability, you need to know which nodes are alive and what they are doing.

Now I'll show you an example of how high availability works. First, we add a node. We add another node. The first node is now elected the owner of the cluster. The owner is the node in charge of actively manipulating the state in etcd to coordinate the nodes; it decides what tasks need to be run and assigns the tables to be replicated to each available node. We can add new nodes at any time. When the owner discovers a new node, it will initiate table migration if there is significant imbalance in the workload. Now the table migration is completed. In the event of a node failure, the owner will reassign the tables previously being replicated by that node; as you can see, a surviving node takes over those tables. The owner itself can go down too. In that case, a new owner will be elected. The new owner will compare the current state of the cluster with the desired state, and it will assign the tables that are not being replicated to a surviving node, including itself.

That is all I plan to share today. If you have any questions or want to join us, please visit our website or contact us in the Slack channel. Here are some useful links.
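As promised above, here is a minimal sketch of the merge step. It assumes a simple per-region map of resolved ts values rather than TiCDC's actual span-keyed segment tree, and all names are illustrative rather than TiCDC's real code.

```rust
// Illustrative sketch: once every region has reported a resolved ts, events up
// to the smallest of those watermarks can be sorted and emitted in transaction order.

use std::collections::BTreeMap;

struct CommittedEvent {
    commit_ts: u64,
    key: Vec<u8>,
    value: Vec<u8>,
}

struct Combiner {
    // Latest resolved ts reported by each region (TiCDC tracks this per span;
    // a plain map is enough for illustration).
    resolved: BTreeMap<u64 /* region id */, u64 /* resolved ts */>,
    // Committed events received but not yet emitted downstream.
    buffer: Vec<CommittedEvent>,
}

impl Combiner {
    /// Emit every buffered event whose commit ts is covered by all regions'
    /// watermarks, in commit-ts order, so transaction order is preserved.
    fn flush(&mut self) -> Vec<CommittedEvent> {
        let global_watermark = self.resolved.values().copied().min().unwrap_or(0);

        let mut ready: Vec<CommittedEvent> = Vec::new();
        let mut rest: Vec<CommittedEvent> = Vec::new();
        for ev in self.buffer.drain(..) {
            if ev.commit_ts <= global_watermark {
                ready.push(ev);
            } else {
                rest.push(ev);
            }
        }

        // Sorting by commit ts groups the events of each transaction together
        // and orders the transactions as they were applied to the database.
        ready.sort_by_key(|ev| ev.commit_ts);
        self.buffer = rest;
        ready
    }
}
```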