Hello, thanks for joining me today. I'm Charles, a software engineer at Netflix, and I'm also a TiKV maintainer. Today I will share our experience of building a scalable and reliable change data capture service for TiKV.

Today's talk will cover several topics. First, we will explore why we need to build a new CDC tool for TiKV. Then we will talk about how TiCDC works internally. Though TiCDC currently only supports data synchronization from TiDB, which is built on TiKV, the same principles and experience can be applied by anyone wishing to sync data from any system built on TiKV to other systems. This is particularly relevant because syncing data from a distributed database like TiDB is often more challenging than from other systems. Once we have a better understanding of TiCDC internals, we will cover the performance gains achieved since TiCDC version 6.5. We received a lot of feedback from the community last year, including concerns about performance and system reliability, so we put a lot of effort and time into enhancing reliability and optimizing performance.

First, let's talk about why we need TiCDC and what the major use cases of TiCDC are. There are two common scenarios where TiCDC can be particularly useful. The first is incremental data synchronization for heterogeneous systems. This means that if you have multiple databases that need to be synchronized with one another, TiCDC can help ensure that all the data is up to date in real time. The second scenario is cross-region disaster recovery, which is based on primary-secondary replication. This can be critical for businesses that rely on their data to operate. Compared to traditional database systems, TiKV can hold much larger volumes of data, which makes capturing change data for TiKV very challenging, as we want to ensure not only high data-syncing throughput for large volumes of changes but also different levels of data consistency.

So compared to some other systems, what kinds of features does TiCDC provide? First, TiCDC supports low-latency incremental data replication to various downstreams. That is to say, you can use TiCDC to replicate data from an upstream database to Kafka using Canal-JSON, Avro, or Open Protocol. Second, TiCDC supports database and table filtering, which enables you to filter specific data based on your requirements. This feature helps reduce the amount of data transferred and makes the replication process more efficient. Third, TiCDC supports most operations through OpenAPI. This means that you can easily integrate TiCDC into your existing applications without having to worry about compatibility issues. Last but not least, TiCDC supports bi-directional replication between TiKV clusters. This means that you can easily replicate data between two TiKV clusters, making it easier to manage your data across multiple clusters.

Now we understand why we need TiCDC. Next, let's talk about the design goals and the challenges of building a CDC service for TiKV. Our first objective is to ensure high availability. We understand the critical nature of syncing data to downstream systems for users, but accidents and disasters are inevitable. Therefore, we must guarantee that even in the event of system faults, the data-syncing process will continue uninterrupted and the CDC cluster will remain fully functional. The next goal is to achieve high throughput and low latency.
As previously mentioned, users typically store vast amounts of data in the upstream cluster, spread across thousands of nodes. Consequently, a capable CDC service should be able to concurrently sync data from multiple TiKV storage regions with optimal throughput. Our third goal is to ensure consistency and ordering. Unlike other systems, maintaining the consistency and order of the events being operated on is crucial for TiKV, as many users use TiKV as the backend storage for distributed systems. Therefore, we intend to provide snapshot isolation and eventual consistency to address this requirement effectively.

However, all these goals are not easy to achieve, as we face many challenges at the same time. First, we need to capture the change data instantaneously, as any delay can lead to a large synchronization lag. Additionally, we must be aware of and able to keep up with schema evolution. As I mentioned before, many users use TiDB or TiKV to build their database systems. The structure of the database can change over time, requiring TiCDC to adapt accordingly. Another challenge is striking the right balance between ordering and high throughput. While maintaining the order of events is crucial for data integrity, we also need to handle a large volume of changes without compromising performance. Similarly, we face a trade-off between consistency and low latency. Guaranteeing eventual consistency while minimizing latency is a delicate task that requires careful consideration. Furthermore, TiCDC needs to efficiently fetch data that is spread across multiple TiKV nodes, which requires effective coordination and communication to ensure seamless retrieval of data from the various sources. Lastly, we aim to minimize operational complexity, as a complex system can introduce challenges in maintenance and troubleshooting.

Now, let's take a closer look at how TiCDC addresses these challenges. Here is the overall system architecture of TiCDC. TiCDC talks directly to TiKV, which allows for more streamlined communication and faster data transfer. TiCDC captures change data by watching the change log of TiKV, which provides a reliable and up-to-date source of information. The system is horizontally scalable, using the table as the basic scheduling unit, which allows for greater flexibility and ease of use. Additionally, TiCDC provides an extendable downstream syncing interface, which can be customized to support different third-party downstream platforms. Currently, we officially support MySQL, Kafka, and cloud storage services, but our flexible architecture allows for easy integration with other platforms in the future.

Okay, now let's take a closer look at the TiCDC cluster internals. The TiCDC cluster is composed of multiple capture servers, each of which can take either an owner role or a processor role. The owner acts as the leader and coordinator of the TiCDC cluster, responsible for scheduling change feeds to different processor managers. Each processor manager manages many small sub-processors, which serve as the basic working units of the TiCDC cluster. After a change feed is scheduled to a processor manager, the manager will initialize and assign one processor to each table of the change feed. The processor will then start watching the change log of the table on TiKV and begin pushing change data to the downstream.
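To make that scheduling model concrete, here is a minimal Go sketch. The type names (Changefeed, ProcessorManager, Processor) and their methods are illustrative assumptions rather than TiCDC's actual APIs; the only point it shows is that a processor manager starts one processor per table of a change feed.

```go
// Minimal sketch of "one processor per table" scheduling (hypothetical types).
package main

import "fmt"

type TableID int64

// Changefeed is an illustrative stand-in for a replication task.
type Changefeed struct {
	ID     string
	Tables []TableID
}

// Processor is the basic working unit: it watches one table's change log.
type Processor struct {
	Table TableID
}

func (p *Processor) Run() {
	// In the real system this would watch TiKV change logs and push
	// change data downstream; here we only print a placeholder.
	fmt.Printf("processor started for table %d\n", p.Table)
}

// ProcessorManager runs on every capture server and owns its processors.
type ProcessorManager struct {
	processors map[TableID]*Processor
}

func NewProcessorManager() *ProcessorManager {
	return &ProcessorManager{processors: make(map[TableID]*Processor)}
}

// Schedule assigns one processor to each table of the changefeed.
func (m *ProcessorManager) Schedule(cf Changefeed) {
	for _, t := range cf.Tables {
		p := &Processor{Table: t}
		m.processors[t] = p
		p.Run()
	}
}

func main() {
	mgr := NewProcessorManager()
	mgr.Schedule(Changefeed{ID: "cf-1", Tables: []TableID{1, 2, 3}})
}
```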
When pushing a change feed, the owner will calculate the watermark of the overall change feed's progress based on the watermark of each individual table managed by each processor. The metadata of the change feeds, and of the TiCDC cluster itself, is stored in the PD cluster of TiKV, which you can treat as an etcd cluster. This process allows for efficient and reliable change data syncing even in complex and large-scale environments. Notice that each TiCDC capture server will always launch a processor manager to capture change data and push it downstream; however, only one server can act as the owner and coordinator of the cluster at any given time.

After the owner has been elected, we can start scheduling and dispatching tables to processors. The owner initializes a scheduler for each newly created change feed. The scheduler's primary function is to assign syncing tasks to different processor managers, with each syncing task mapping to a specific table. Once a processor manager receives a task, it initializes a new processor and begins syncing the corresponding table. During the syncing process, each processor initializes an agent that communicates with the coordinator to track and report the syncing progress of the table.

Now we have introduced how TiCDC works as a cluster at a high level. Next, we can dig into the details and see how each processor works after being assigned the task of syncing tables. Here is the topology of the different components. The component topology of TiCDC involves several key elements that work together to ensure accurate and efficient data syncing. Firstly, one change feed can be spread across multiple capture servers. Each capture server will initialize a new processor that is dedicated to handling the syncing tasks for that change feed. Inside each processor, a pipeline is created for each table, allowing for efficient syncing of individual data sets. The owner of the change feed is responsible for syncing the DDL (data definition language) events, such as table schema updates, to the downstream, while the processor pulls the DDL events but is not responsible for pushing them downstream. Instead, the processor applies the DDL schema to the mounter, which will deserialize the DML (data manipulation language) events later. This component topology ensures that TiCDC can handle large data-syncing tasks while maintaining accuracy and efficient throughput throughout the whole process.

Here is what happens inside a table pipeline. From left to right, we have first the TiKV CDC component, a sub-component embedded inside TiKV that allows us to read the change log of TiKV for incoming updates. TiKV provides an interface allowing clients to watch the change log, which can be useful for building change data capture services for any system built on TiKV. Next, the puller component connects to the TiKV CDC components associated with the corresponding regions and watches all change events. When a change event occurs, it is sent to the puller. The puller streams the change events to the sorter, which sorts the events based on their timestamps. After sorting, the sorter pushes the events to the mounter. The mounter decodes the key-value entries into table row format and sends them to the sink. Finally, the sink syncs the events to the user-specified downstream location. We have talked about how TiCDC syncs the DDL events; next, let's talk about how TiCDC synchronizes the DML events, going through the pipeline stage by stage, starting with the puller. But first, the sketch below shows roughly how these stages fit together.
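As a rough illustration only, here is a minimal Go sketch of one table pipeline wired together with channels; the event fields and stage functions are simplified assumptions, not TiCDC's real types. The puller feeds the sorter, the sorter orders events by commit timestamp, the mounter decodes them, and the sink writes them out.

```go
// Minimal sketch of a table pipeline: puller -> sorter -> mounter -> sink.
package main

import (
	"fmt"
	"sort"
)

// RawKVEntry is a simplified change event as the puller would receive it.
type RawKVEntry struct {
	Key      string
	Value    string
	CommitTs uint64
}

// puller forwards change events watched from TiKV regions.
func puller(in []RawKVEntry, out chan<- RawKVEntry) {
	for _, e := range in {
		out <- e
	}
	close(out)
}

// sorter buffers events and emits them in commit-timestamp order.
func sorter(in <-chan RawKVEntry, out chan<- RawKVEntry) {
	var buf []RawKVEntry
	for e := range in {
		buf = append(buf, e)
	}
	sort.Slice(buf, func(i, j int) bool { return buf[i].CommitTs < buf[j].CommitTs })
	for _, e := range buf {
		out <- e
	}
	close(out)
}

// mounter decodes key-value entries into a row-like representation.
func mounter(in <-chan RawKVEntry, out chan<- string) {
	for e := range in {
		out <- fmt.Sprintf("row(ts=%d, key=%s, value=%s)", e.CommitTs, e.Key, e.Value)
	}
	close(out)
}

// sink writes decoded rows to the downstream (stdout here).
func sink(in <-chan string) {
	for row := range in {
		fmt.Println("sync to downstream:", row)
	}
}

func main() {
	events := []RawKVEntry{
		{Key: "t1_r2", Value: "v2", CommitTs: 102},
		{Key: "t1_r1", Value: "v1", CommitTs: 101},
	}
	c1 := make(chan RawKVEntry, 8)
	c2 := make(chan RawKVEntry, 8)
	c3 := make(chan string, 8)
	go puller(events, c1)
	go sorter(c1, c2)
	go mounter(c2, c3)
	sink(c3)
}
```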
The puller is a critical component in TiCDC's data-syncing process; it is responsible for watching the change events of a table and pulling them from the source database. A table in TiKV can be split into multiple data regions stored on multiple TiKV nodes, so a puller will connect to multiple TiKV nodes to watch all the related regions. TiKV nodes push change events to a channel through a gRPC stream, and the puller creates a worker for each region, which periodically reads the change events from the input channel; all region workers share an event channel that connects to the output channel. The output channel dispatches the change events to the DDL job puller or the puller node based on the event type.

Next, let's talk about the sorter. Why is a sorter required? The sorter provides two major functionalities. First, buffering the incoming events allows the sorter to smooth out the peaks and valleys of the upstream data flow. This ensures that the data-syncing process is efficient and can handle large volumes of data without becoming overwhelmed. Second, incoming events may not be in timestamp order, which can make it difficult to provide snapshot isolation or ensure eventual consistency if the system crashes. The sorter ensures that the events are sorted based on their timestamps, allowing the processor to reconstruct the cluster snapshot correctly. Overall, the sorter is an essential component of TiCDC, ensuring that the data-syncing process is efficient, accurate, and consistent.

So how does the sorter work internally? We use a KV store built on the log-structured merge tree. At the top, the in-memory buffer area is called the memtable. Writes from the foreground are initially sorted in the memtable. When a buffer reaches its size limit, the data is flushed to disk, forming an SSTable (sorted string table). The memtables and SSTables are divided into multiple levels; the layer containing the SSTables flushed directly from the memtables to disk is called level zero. The sorter needs to sort incoming change events based on their timestamps; however, different disk files may have overlapping timestamp intervals, so an iterator needs to wrap these files and merge-sort them when accessing them. Managing memory and disk files for hundreds of thousands of tables can be challenging, so multiple tables may share a fixed amount of memory buffer and then be flushed out to disk together. This can result in files on disk having overlaps not only in timestamps but also in tables. To address this, the files are periodically read, merge-sorted, and rewritten back to disk.

Once the events have been sorted, the mounter converts events with a timestamp smaller than the DDL barrier into table row format. To do this accurately, the mounter requires knowledge of the corresponding table schema. Although the owner is responsible for pushing the DDL events to the downstream, the processor must also store the table schema. To achieve this, the processor's DDL puller retrieves the table schema from the TiKV CDC component, filters out unnecessary information, and stores the table schema in the processor's schema storage for later use.
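To illustrate the DDL-barrier rule just described, here is a minimal sketch with made-up type names and a toy schema storage (not TiCDC's real schema-storage API): the mounter only decodes events whose commit timestamp is below the current DDL barrier, using the schema the DDL puller has stored, and leaves the rest waiting for the next schema update.

```go
// Minimal sketch of mounting DML events up to a DDL barrier (hypothetical types).
package main

import "fmt"

type RawKVEntry struct {
	TableID  int64
	CommitTs uint64
	Value    string
}

// SchemaStorage keeps the latest schema the DDL puller has stored per table.
type SchemaStorage struct {
	columns map[int64][]string // table ID -> column names
}

func (s *SchemaStorage) Columns(tableID int64) []string { return s.columns[tableID] }

// mountUpTo decodes sorted events whose commitTs is below ddlBarrierTs and
// returns the remaining events that must wait for the next schema update.
func mountUpTo(events []RawKVEntry, ddlBarrierTs uint64, schema *SchemaStorage) (rows []string, rest []RawKVEntry) {
	for i, e := range events {
		if e.CommitTs >= ddlBarrierTs {
			return rows, events[i:]
		}
		cols := schema.Columns(e.TableID)
		rows = append(rows, fmt.Sprintf("table=%d ts=%d cols=%v value=%s", e.TableID, e.CommitTs, cols, e.Value))
	}
	return rows, nil
}

func main() {
	schema := &SchemaStorage{columns: map[int64][]string{42: {"id", "name"}}}
	events := []RawKVEntry{
		{TableID: 42, CommitTs: 100, Value: "a"},
		{TableID: 42, CommitTs: 105, Value: "b"},
		{TableID: 42, CommitTs: 110, Value: "c"},
	}
	rows, rest := mountUpTo(events, 105, schema)
	fmt.Println("decoded:", rows)
	fmt.Println("waiting for next DDL barrier:", len(rest))
}
```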
Next, the sink component in TiCDC is responsible for transferring change events to the downstream. Each change feed has a sink manager and a source manager; the source manager connects to the sorters of all tables associated with the feed. The sink manager periodically pulls from the source manager, which retrieves events from the sorters, and then distributes the events to the corresponding table sinks, which write them to the designated downstream. TiCDC employs a pull-based sink to ensure efficient and uninterrupted event transfer. A push-based sink can result in events blocking on the sink, leading to performance degradation. TiCDC's pull-based approach can manage a large volume of data with precision and consistency, guaranteeing accuracy throughout the data synchronization process.

While TiCDC is theoretically horizontally scalable, certain performance issues, such as high CPU utilization and unexpected lag between the upstream and downstream, may still occur. We received feedback from users last year about concerns over TiCDC's performance. In response, we have put significant effort and time into optimizing the performance of TiCDC. Starting with 6.5, the throughput of TiCDC has been improved by seven times. In the following section, we will discuss the approaches taken to enhance the performance of TiCDC.

First, let's check out some performance baselines of TiCDC before 6.5. I put some old numbers here because I want you to have a more intuitive feeling of how much performance gain we have achieved. The first case we consider is using TiCDC to sync data between two TiDB clusters. The hardware specification of our TiCDC node is 16 CPU cores and 64 GB of RAM, which proved to be more than sufficient for our case. Single-table throughput with a big table, 1,200 bytes per row, is 80,000 write QPS, with a throughput of 120 megabytes per second. Even before 6.5, we found that TiCDC is able to handle very large tables, up to 30 to 40 terabytes in size, and TiCDC was able to keep up with the throughput demands without any problems. There are no limits on the amount of upstream cluster data that TiCDC can handle, which means that as our data needs continue to grow, we can rely on TiCDC to keep up with the increased demand.

The second case we consider is using TiCDC to sync data from TiDB to Kafka. This is still the old data from before 6.5. The hardware specifications are the same as in the previous case, 16 CPU cores and 64 GB of RAM. When testing single-table throughput on a large table, still 1,200 bytes per row, we achieved 35,000 write QPS with a throughput of 52.8 megabytes per second. While this is slightly lower than our previous test with only two TiDB clusters, we also found that TiCDC is able to handle very large tables; we tested a table of up to 30 to 40 terabytes in size, and TiCDC was able to keep up with the throughput demands without any issues.

The throughput of a TiCDC table pipeline is an essential performance metric that measures the efficiency of each of its four consecutive subcomponents: the puller, sorter, mounter, and sink. This pipeline can be visualized as a water pipe, where the narrowest part determines the overall throughput. By measuring the throughput of each component separately, we identified that the sink is the bottleneck in most cases, with an average throughput of around 76,000 rows per second, while the sorter has the largest throughput. Therefore, to improve the overall performance of TiCDC, optimization efforts have been focused on the sink, puller, and mounter.
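Before going through those optimizations, here is a minimal sketch of the pull-based sink loop described above. The SourceManager and TableSink types are illustrative assumptions, not TiCDC's real interfaces; the point is that the sink asks the source for a bounded batch at its own pace rather than having events pushed at it.

```go
// Minimal sketch of a pull-based sink loop (hypothetical types).
package main

import "fmt"

type RowEvent struct {
	Table    string
	CommitTs uint64
	Row      string
}

// SourceManager hands out events from the per-table sorters on demand.
type SourceManager struct {
	pending []RowEvent
}

// Pull returns at most max events; the sink controls the pace.
func (s *SourceManager) Pull(max int) []RowEvent {
	n := max
	if n > len(s.pending) {
		n = len(s.pending)
	}
	batch := s.pending[:n]
	s.pending = s.pending[n:]
	return batch
}

// TableSink writes events of one table to the downstream (stdout here).
type TableSink struct{ Table string }

func (t *TableSink) Write(events []RowEvent) {
	for _, e := range events {
		fmt.Printf("sink[%s] <- ts=%d %s\n", t.Table, e.CommitTs, e.Row)
	}
}

func main() {
	src := &SourceManager{pending: []RowEvent{
		{Table: "orders", CommitTs: 100, Row: "insert id=1"},
		{Table: "orders", CommitTs: 101, Row: "update id=1"},
		{Table: "users", CommitTs: 102, Row: "insert id=7"},
	}}
	sinks := map[string]*TableSink{
		"orders": {Table: "orders"},
		"users":  {Table: "users"},
	}
	// The sink manager repeatedly pulls a bounded batch and dispatches it
	// to the corresponding table sinks, so the sink sets its own pace.
	for batch := src.Pull(2); len(batch) > 0; batch = src.Pull(2) {
		byTable := map[string][]RowEvent{}
		for _, e := range batch {
			byTable[e.Table] = append(byTable[e.Table], e)
		}
		for table, events := range byTable {
			sinks[table].Write(events)
		}
	}
}
```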
To optimize the performance of the puller in TiCDC, several measures have been taken. Firstly, the processing of resolved timestamps, which you can treat as the watermark, has been optimized by handling them in batches; this helps reduce the time taken for the puller to calculate and advance the checkpoint. Secondly, read-write locks have been adopted instead of purely exclusive locks, enabling concurrent access to the resources and increasing efficiency. Lastly, the frontier's handling of region splits and merges has been optimized to reduce the time taken for the puller to access the updated data, and unnecessary memory allocations have been removed, which further improves the puller's efficiency and reduces overhead.

Here is how we approached the mounter improvement. As you may remember, the mounter acts as the deserializer, or decoder, in our pipeline: once it receives a DML event, a key-value entry, it will try to decode it into table format and send it to the next component. To improve the throughput of the mounter, we use a decoder pool instead of just a single-threaded decoder to decode the key-value events.

The performance of the sink component in TiCDC has been significantly improved through the implementation of the pull-based sink, as I mentioned in the previous slides, which helps improve the management of the event flow and reduces the risk of data backlog on the sink. We achieved significant gains after applying these optimizations. A comparison between TiCDC 6.3 and TiCDC 6.5 (actually, versions later than 6.5) reveals a substantial increase in performance when the downstream is Kafka, using different protocols. When using the Canal-JSON protocol, the performance has improved from 5,000 to 41,000 per second. When using Open Protocol, the performance has improved from 8,000 to 58,000 per second. And finally, when using the Avro protocol, the performance has improved from 9,000 to 63,000 per second.

Specifically, to address the checkpoint lag when syncing data to MySQL with TiCDC, we applied a batch-syncing approach. Rather than executing events one by one, events are now synced in batches, and multiple connections are established between the MySQL sink and the downstream MySQL-compatible database. To measure the effectiveness of this change, a sysbench workload was run with 500,000 rows per transaction. The results show that this change has led to a significant reduction in the update and delete lag, by about 30%.
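To illustrate the batch-syncing idea for the MySQL sink, here is a minimal sketch that only builds the SQL text; the table and column names are made up, and a real sink would execute each batch over its own connection (for example with database/sql). It groups row changes into a single multi-row INSERT per batch instead of one statement per row.

```go
// Minimal sketch of grouping row changes into multi-row INSERT batches.
package main

import (
	"fmt"
	"strings"
)

type Row struct {
	ID   int
	Name string
}

// buildBatchInsert renders one multi-row INSERT statement for a batch of rows.
func buildBatchInsert(table string, rows []Row) (string, []interface{}) {
	placeholders := make([]string, 0, len(rows))
	args := make([]interface{}, 0, len(rows)*2)
	for _, r := range rows {
		placeholders = append(placeholders, "(?, ?)")
		args = append(args, r.ID, r.Name)
	}
	stmt := fmt.Sprintf("INSERT INTO %s (id, name) VALUES %s", table, strings.Join(placeholders, ", "))
	return stmt, args
}

func main() {
	rows := []Row{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}, {5, "e"}}
	const batchSize = 2
	for start := 0; start < len(rows); start += batchSize {
		end := start + batchSize
		if end > len(rows) {
			end = len(rows)
		}
		stmt, args := buildBatchInsert("orders", rows[start:end])
		// Each batch could be executed on its own MySQL connection,
		// e.g. db.Exec(stmt, args...) with database/sql.
		fmt.Println(stmt, args)
	}
}
```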
And lastly, I would like to share some of the valuable lessons we have learned throughout the journey of building TiCDC. The first lesson is to design the system architecture in alignment with the upstream database. This means that when creating the data pipeline abstraction, we should closely align it with how the database organizes its data; instead of focusing on adapting the system to Golang channels, we should prioritize the table pipeline approach. The second lesson is the importance of establishing clear boundaries between subcomponents. For a significant period, we lacked precise responsibilities and scope for each subcomponent, making it challenging to optimize or enhance the system; any change often required modifications in multiple places throughout the entire pipeline. Hence, defining clear boundaries is crucial for scalability and maintainability. Careful consideration should also be given to the choice between a push model and a pull model. Previously, we used the push model for sending data across the pipeline, which made it difficult to isolate each table pipeline. However, by transitioning to the pull-based sink, we gained more flexibility in supporting different downstream systems and efficiently syncing data from various tables.

The third lesson involves implementing the old-value feature from the early stages of development. In the initial version of TiCDC, only put and delete events were synced, due to performance concerns. However, we discovered the usefulness of the old-value feature in many scenarios. After months of struggle, we finally decided to implement the old-value cache and fetch the old value from TiKV using historical snapshots. The final lesson revolves around implementing an efficient sorter while considering scalability. Initially, TiCDC used an in-memory sorter, which increased the risk of OOM, especially when dealing with large volumes of changes in the upstream cluster. To eliminate this risk, we engaged in discussion and eventually adopted an LSM-tree-based DB sorter. This feature took approximately half a year to mature and was designed with scalability in mind.

Alright, that's all for today's talk. We will keep adding more features and optimizing the system. You can find details on advanced features, like large-table scale-out, which splits the syncing process of a large table into multiple table pipelines, in our official documentation. Feel free to check out the GitHub repository of TiCDC. If you have any questions or comments, just shoot me an email. Oh, okay. And also, you can scan the QR code and provide feedback for this session. Thanks. Any questions or comments? Yeah. Could you use the microphone?

I think you mentioned filtering capabilities. Initially, when you were explaining TiCDC, where do you put the filtering part in that architecture?

The filtering, right?

Yeah, like if you want to selectively get records.

Correct. That's a very good question. The filtering part is very important because users usually have a huge number of tables. So to ensure that we do not transfer unused data from TiKV to the downstream, we put the filter part on the TiKV side. From the very beginning of the pipeline, we filter out all the unnecessary events and tables.

And just one small follow-up. On the sorter, is there some kind of configuration for how long that sorter waits for all the events to arrive from the change stream, or how does it know when to trigger a sort?

So your question is whether there is any configuration we can use to control when the event pulling is triggered. Is that correct?

No, no. You had that sorter component in the middle, right? You were saying that sometimes the events may not arrive in the right order, and the sorter in the middle is going to reorder those events before it sends them to the sink. So my question is, for that sorter component in the middle, how does it know the trigger at which, for example, it has 100 records and needs to reorder them and send them through? Is it a time-based threshold or a record-based threshold?

Yeah, so I don't remember the exact frequency at which we trigger it, but I believe that is configurable. We can configure how frequently, or how often, we want to push the sorted events to the downstream.
Yeah, I think that is configurable. We can double-check the configuration file. But that is a very good question. Thanks.

I have a question about the consistency guarantee part. It looks like TiCDC pulls single-row-level changes from TiKV, and you said it ensures eventual consistency. Does that mean that if there is a transaction that involves multiple row changes, the ACID properties are still guaranteed by TiCDC?

I don't think so. Currently we do not support transaction-level consistency. It's difficult to achieve because the rows, or the data, can be spread across multiple TiKV nodes. So we do not recommend that users submit a transaction on the upstream database and expect to receive exactly the same transaction on the downstream database.

Yeah. One more quick question. I just want to clarify the puller's working mechanism.

But one follow-up on the previous point: when we're syncing the data to the downstream database, we are actually submitting the MySQL statements to the downstream database. So if we really wanted to implement transaction consistency, it would be very simple, but there is a price in the trade-off: if you wanted to do that, the throughput can be low and the latency can be large.

That makes sense. One quick question I had was about the puller's working mechanism. Does the puller pull the changes from TiKV periodically, or does TiKV push changes from its internal component?

It's like a stream. Inside TiKV, we have a sub-component that watches the change log. When it receives the events, it pushes them through a stream to the puller. Even though this component is called the puller, that is from TiCDC's own point of view.

It looks like TiKV pushes those changes into the TiCDC system, and then the puller pulls from there. That makes sense. Thank you.

More questions? Great. So I guess that's all. Thank you.