Hi, everyone. Hi. Thank you all for being here. We are glad to have this opportunity to share our experience on elastic scheduling with TiKV during the next 30 minutes. First of all, let us introduce ourselves. I'm Yu Tong, an infrastructure engineer at PingCAP. I'm also the tech lead of the scheduling special interest group of TiKV. I'm Song Gao, an infrastructure engineer at PingCAP. I'm a maintainer of Chaos Mesh and a committer of TiKV SIG Scheduling. Yes, we are working on the scheduling system for TiKV. Okay, let's take a look at the agenda of this talk. There are five parts covered in this topic: an introduction to TiKV, the background of elastic scheduling, the implementation in TiKV on Kubernetes, some future work we are working on, and there will be time for questions at the end of the presentation. I will talk about parts one and two, and Song Gao will walk you through parts three and four. Before we dive deep into the details of elastic scheduling, you might need some background information about TiKV. So what is TiKV? TiKV is a key-value database which is open source, distributed, and supports transactions. TiKV is currently a CNCF graduated project. With great support from the community and users, TiKV now has more than 8,000 GitHub stars and more than 200 contributors. Now we are going to take a look at the TiKV architecture. Each TiKV node stores different data shards named regions, which are divided by key range. A single region has multiple replicas which form a Raft group. The Raft algorithm is used to guarantee high availability, and each group has its own Raft leader and followers. The TiKV client uses gRPC to communicate with each TiKV node, as well as with the Placement Driver, the cluster manager of TiKV. TiKV sends heartbeats periodically to the Placement Driver, which is short for PD, to update information about itself. PD periodically checks replication constraints to balance load and data automatically across nodes and regions. And how does PD balance load and data? PD collects TiKV statistics carried in the TiKV heartbeats to help make scheduling decisions. Before we come to the next part, it is better to know how TiKV migrates a region through the Raft algorithm. Let's take a look at an example. We have three nodes in the cluster here. Each region has three peers, which are located on nodes A, B, and C. We use a bold border to mark the leader. Consider that we add a new node to the cluster. PD makes a scheduling decision to migrate a peer from the high-load node A to the newly added node D. We will show the detailed process in the next few slides. Assume it selects region 1 as the source region and node D as the target node. PD will first gather the whole scheduling steps, combine them into a command, and send it to TiKV. The first step of this command is to add a new replica on node D. Since region 1's leader is on node A, we need to transfer the leader before going on, so the next step is to transfer the region's leader. After the leader is transferred, PD executes the next step to let TiKV remove the replica on node A. When the above three steps are done, the region has three replicas located on nodes B, C, and D. That's how PD migrates a region. Now let's move on to elastic scheduling in TiKV. Another question: what is elastic scheduling? Briefly speaking, elastic scheduling is auto-scaling depending on the workload.
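To make the migration flow above easier to follow, here is a minimal Go sketch of the three-step command described in the slides: add a new replica on the target node, transfer the leader away, then remove the old replica. The type names and the choice of node B as the new leader are illustrative assumptions rather than PD's actual code.

```go
package main

import "fmt"

// Step is one stage of a region-migration command.
// These names are hypothetical; PD's real operator steps are more involved.
type Step interface {
	Describe() string
}

type AddReplica struct{ Region, TargetNode string }
type TransferLeader struct{ Region, FromNode, ToNode string }
type RemoveReplica struct{ Region, SourceNode string }

func (s AddReplica) Describe() string {
	return fmt.Sprintf("add a new replica of %s on node %s", s.Region, s.TargetNode)
}
func (s TransferLeader) Describe() string {
	return fmt.Sprintf("transfer leader of %s from node %s to node %s", s.Region, s.FromNode, s.ToNode)
}
func (s RemoveReplica) Describe() string {
	return fmt.Sprintf("remove the replica of %s on node %s", s.Region, s.SourceNode)
}

func main() {
	// The command PD would assemble for the example in the talk:
	// move a peer of region 1 from the high-load node A to the new node D.
	command := []Step{
		AddReplica{Region: "region-1", TargetNode: "D"},
		TransferLeader{Region: "region-1", FromNode: "A", ToNode: "B"}, // new leader is assumed
		RemoveReplica{Region: "region-1", SourceNode: "A"},
	}
	for i, step := range command {
		fmt.Printf("step %d: %s\n", i+1, step.Describe())
	}
}
```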
When a TiKV cluster is deployed, normally the size of the cluster is planned to handle the average load. But sometimes the workload becomes larger, which requires the operator to manually scale out the TiKV cluster to handle it. When the heavy workload gets lower, the operator also needs to manually scale in the TiKV cluster to save resources and cost. Elastic scheduling makes the whole process automatic. So why do we need or choose elastic scheduling? Here are three reasons. First, elastic scheduling can handle unexpected traffic automatically. Nowadays, we're living in a world with a huge amount of information, and breaking news always appears without prediction. Here is a statistic about the storage-layer QPS of one operator's business. As you can see, most of the time the QPS of the storage layer stays low, and it suddenly becomes larger at some point. After a short period of time, the QPS falls back to a low level again. The changing curve shows that it is difficult for the operator to manually manage the storage layer because the traffic is unexpected. Second, elastic scheduling can help us save resources when we don't need them. As the traffic is unexpected, and the operator needs the ability to handle the heavy workload, they have to pay the cost for extra resources. But most of the time, these resources are wasted. If the storage has the ability of elastic scheduling, the resources can be reduced when the workload becomes low. Finally, cloud infrastructure has already become prevalent nowadays, and elastic scheduling can benefit from it. As the cloud infrastructure provides on-demand availability of computing resources, elastic scheduling can apply for resources from the cloud infrastructure when they are needed and release them at its own pace. Moreover, as Kubernetes has provided powerful APIs to manage containers and resources on a cluster, it is more convenient to manage a stateful application like a distributed database. For now, we have introduced the background of TiKV and elastic scheduling. We also discussed why we need or choose elastic scheduling for the storage layer. Now, we will have Song Gao introduce the implementation in TiKV. Okay, thanks, Yu Tong. I'd like to move on to the introduction of how TiKV builds its elastic scheduling mechanism based on these solutions. First, an introduction to the elastic scheduling architecture of TiKV. The architecture is composed of four components. First, the TiKV cluster exposes metrics about its status according to the workload. Second, a monitoring system, like Prometheus, collects these metrics. Third, the scheduling system, which is called PD here, fetches the metrics from the monitoring system and calculates the auto-scaling plan, which is exposed through an API. Finally, the operator takes the auto-scaling plan provided by PD and scales the TiKV cluster. Now, we will explain how this architecture works in two parts, the operator side and the scheduling side. We use the operator pattern, a well-known method of packaging, deploying, and managing Kubernetes applications. It is basically composed of two components: the controller and the custom resource definition, which is also called CRD. We use the CRD here as the elastic scheduling configuration for users.
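Before going into the two sides in detail, here is a rough Go sketch of how the four components above fit together from the operator's point of view: poll PD for the auto-scaling plan it computed from the Prometheus metrics, then apply the plan to the TiKV cluster. The Plan fields, the /autoscaling/plan endpoint, and the 30-second interval are assumptions made for this illustration; they are not the real PD or tidb-operator API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Plan mirrors the idea of the auto-scaling plan PD exposes: how many
// replicas of a component the operator should run. The field names and
// the endpoint below are assumptions for illustration only.
type Plan struct {
	Component string `json:"component"` // e.g. "tikv"
	Replicas  int    `json:"replicas"`
}

// fetchPlan asks PD (which has already consulted Prometheus) for the
// current auto-scaling plan.
func fetchPlan(pdAddr string) ([]Plan, error) {
	resp, err := http.Get(pdAddr + "/autoscaling/plan")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var plans []Plan
	if err := json.NewDecoder(resp.Body).Decode(&plans); err != nil {
		return nil, err
	}
	return plans, nil
}

func main() {
	// The operator's reconcile loop: poll PD periodically and apply the plan.
	for range time.Tick(30 * time.Second) {
		plans, err := fetchPlan("http://pd:2379")
		if err != nil {
			fmt.Println("fetch plan failed:", err)
			continue
		}
		for _, p := range plans {
			// In the real operator this would update the TiKV workload spec;
			// here we just print the decision.
			fmt.Printf("scale %s to %d replicas\n", p.Component, p.Replicas)
		}
	}
}
```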
After the configuration file is deployed into the cluster, the controller starts to track the configuration periodically and query the auto-scaling plan from PD. As the operator manages the TiKV cluster on Kubernetes, once it gets the scaling plan from PD, the operator starts scaling the TiKV cluster according to the plan. On the scheduling side, PD mainly does two things. First, it fetches metrics from Prometheus; as TiKV exposes many metrics in lots of dimensions, PD can use this information to calculate a proper auto-scaling plan for the operator. The second thing is to decide the scheduling strategy in this elastic scheduling situation. Before we introduce the specific strategy for this situation, imagine what would happen if PD treated the newly created elastic TiKV node as a normal node, like in a normal scale-out. Since PD balances the data distribution between TiKV nodes, it would transfer lots of regions to the newly created TiKV node. As data transfer consumes system resources and can also cause latency, the default balancing strategy doesn't work very well for elastic scheduling. To solve this problem, PD only transfers the regions which are read or written at high frequency and whose TiKV nodes are under high load. We call these regions hot regions. So how does PD migrate only the hot regions to the elastically scaled TiKV node? Before we answer this question, we need to know how we recognize a hot region. PD records the read and write flow of each region and store. Each store maintains a top-N cache to save the hottest regions in this store. If a region's flow is beyond a predefined threshold and it continuously hits the cache, we consider it a hot region. With this information, the hot region scheduler in PD decides whether migrating the region can make the TiKV cluster more balanced. Since we know how PD recognizes hot regions, once a hot region is selected, the last thing is to find a way to migrate it only to the newly scaled node. TiKV supports using predefined labels to manage data placement, so we can utilize this mechanism. Actually, PD has many kinds of schedulers working at the same time besides the hot region scheduler. When these schedulers start to schedule, we usually use filters to filter out the stores which we don't want to be a source or target according to each scheduler's rules. Thus, we can create a new label for the newly scaled nodes. The label consists of a key-value pair: the key of the label is specialUse and the value is hotRegion. When the node is scaled out, the label is added by the operator automatically, and PD gets this information once the newly scaled node sends its heartbeat. For other schedulers, such as the balance-region scheduler, we add a label filter which uses this new label when creating them. This filter prevents the labeled store from being selected as a target. For example, we only allow the hot region scheduler to select store 1 as the target; the rest of the schedulers cannot. Here is an example of what the elastic scheduling configuration looks like. As you can see, the configuration points out the necessary factors and parameters that PD needs while calculating the auto-scaling plan. We can also define some constraints here in order to limit the total count of scaled TiKV nodes.
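As a rough illustration of the hot-region recognition just described, the following Go sketch keeps a per-store top-N cache ordered by flow and only reports regions whose flow exceeds a threshold and which have stayed in the cache for several consecutive rounds. The threshold, cache size, and struct fields are assumptions for the example, not PD's real values.

```go
package main

import (
	"fmt"
	"sort"
)

// RegionStat is a simplified view of the per-region flow statistics that
// PD receives in store heartbeats. The names are illustrative, not PD's API.
type RegionStat struct {
	RegionID  uint64
	FlowBytes uint64 // read or write bytes per second
	HitCount  int    // consecutive rounds the region stayed in the top-N cache
}

const (
	topN          = 3       // size of the per-store hot cache (assumed)
	flowThreshold = 1 << 20 // 1 MiB/s, an assumed hot-flow threshold
	minHitCount   = 3       // must stay hot for several rounds in a row
)

// hotRegions keeps the top-N regions of one store by flow, then reports
// those that exceed the threshold and have hit the cache continuously.
func hotRegions(stats []RegionStat) []uint64 {
	sort.Slice(stats, func(i, j int) bool { return stats[i].FlowBytes > stats[j].FlowBytes })
	if len(stats) > topN {
		stats = stats[:topN]
	}
	var hot []uint64
	for _, s := range stats {
		if s.FlowBytes >= flowThreshold && s.HitCount >= minHitCount {
			hot = append(hot, s.RegionID)
		}
	}
	return hot
}

func main() {
	stats := []RegionStat{
		{RegionID: 1, FlowBytes: 4 << 20, HitCount: 5},
		{RegionID: 2, FlowBytes: 512 << 10, HitCount: 1},
		{RegionID: 3, FlowBytes: 2 << 20, HitCount: 4},
		{RegionID: 4, FlowBytes: 8 << 20, HitCount: 2}, // high flow but not sustained yet
	}
	fmt.Println("hot regions on this store:", hotRegions(stats))
}
```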
As the configuration is managed by a custom resource in Kubernetes, after the configuration is deployed, elastic scheduling starts working. Now we will show the experiment results of elastic scheduling on TiKV. In this experiment, we use TiDB, an open source distributed database which uses TiKV as its storage layer, and run sysbench, a well-known benchmark tool, to show how elastic scheduling works under heavy workloads. This graph shows the relationship between the regions and the TiKV nodes. Each column indicates the status of a dedicated TiKV node, and the rows show each region and its replicas on each TiKV node. A blue rectangle represents a follower peer of the region, and a red rectangle represents the leader of the region. As the TiKV cluster is consuming lots of CPU resources during the benchmark, elastic scheduling starts to deploy two new TiKV nodes automatically. At the beginning, when the elastic TiKV nodes are just created, no regions have been scheduled to them yet, so we can see the sysbench result hasn't changed. After PD transfers some hot regions to the elastic TiKV nodes, as we can see, some red rectangles appear under the elastic TiKV columns. Now we can find that both the TPS and the QPS of sysbench have increased. The whole process shows that elastic scheduling automatically detects the workload and scales the cluster to improve performance. In addition, we are going to support more features with elastic scheduling in the future. Here are some ideas. The first is to dynamically increase the read-only replicas according to the workload. As we know, TiKV could only read data from the region's leader in earlier versions, but now we support follower read, so we can read data from followers as well. When the read pressure is high, it will increase the read-only replicas automatically, which can greatly help reduce the read pressure. When the read pressure becomes low, it will switch back to the original state. The second is that in some scenarios, there is data which is barely accessed. It's better to move this kind of data to cheaper storage in order to save cost. Combined with elastic scheduling, we can make the cluster recognize this kind of data automatically and decide to scale out the cheaper storage and scale in the more expensive storage. This is basically everything we want to cover in this talk. Thank you very much for attending. If you are interested in what we are doing, you are welcome to join us. If you have any questions regarding what we have talked about, please ask. Thank you very much.