So good afternoon, and welcome to KubeCon 2019. This session is about TiKV best practices, and I will introduce TiKV. From the session description the speaker should be Zhang Jinpeng, but I have come as well; this morning we discussed the topic, TiKV best practices, and in order to keep our audience from getting sleepy, I will first introduce the open-source community status and the background of TiKV best practices. So I will quickly go through what TiKV is, why we are doing this project, as well as its recent progress, and the second half, the hardcore technical details of TiKV and how we designed it, will be given by Zhang Jinpeng.

We come from PingCAP; I myself am a co-founder of PingCAP. You may ask what PingCAP is. Has anyone heard of it? OK, maybe most of you have heard of PingCAP. As you can see from this list of contributions to CNCF projects, among the top ten global contributors there is one startup, and that is PingCAP, with IBM falling behind. So this is our company; it is an open-source enterprise, and TiKV is one of the projects we are doing.

Talking about today's topic: TiKV is a key-value database. If you are familiar with Apache HBase, you will know it is a distributed key-value database project, but within CNCF there was no database designed as a large-scale, scalable key-value store. That is why we came up with this project at that time. But if we just built another HBase, it would be of no significance, so at that time, when we were doing
this project, we made some important decisions. The first is that we need to support transactions. If you are familiar with HBase or Cassandra, you will know that their consistency and transaction support do not perform very well. In particular, in HBase, if we want a batch of operations to either fully succeed or fully fail, that is quite hard to achieve. In TiKV, compared with HBase, we have a big difference: key-value operations can support ACID transactions. This is the first aspect, and the transaction design is inspired by Spanner, the Google distributed database, as you know.

The second point is another difference from HBase. If you are familiar with HBase, you will know that its data is stored in HDFS, but we decided to leave that behind, because we did not want the whole stack to rely on a distributed file system. So at the bottom we use Raft and RocksDB, to improve the overall performance and to reduce latency.

This project was initiated in April 2016, so by now three years have passed. Our company, PingCAP, had as its original objective to build a relational distributed database, that is, to recreate Spanner. At the very beginning, maybe you do not know this, TiDB was built on top of HBase, but that had various kinds of problems, so since 2016 we have had a new key-value storage, so that we do not rely on HBase anymore. Later we discovered that this works well not only as the storage for a relational database like TiDB, but also in other ways: if you have a bottom-layer storage that supports key-value access and transactions, then based on this system you can build different distributed systems. For example, we can use TiKV to store metadata, so that we can change the metadata without worrying about inconsistency of the data or about availability. So this is one scenario; another scenario is
that on top of the KV we can have a Redis-like interface, so that clients can access it and it can be horizontally extended. Some people in the community may be quite familiar with this approach, because they are doing it right now, and it enjoys high popularity. So our vision is to use TiKV as a building block for distributed platforms, and that is why we contributed it to CNCF: we hope the community can build new things based on TiKV.

This is a very brief, simple picture of the architecture of TiKV; later Zhang Jinpeng will give further details, so I will not elaborate on it. Actually this is a very cloud-native project, and its language is Rust; the whole TiKV is one of the largest open-source projects in the Rust community. We use gRPC all the time, and we are also the maintainers of the Rust gRPC library, so if you want to use gRPC in Rust, most probably you will use our work. For storage we use RocksDB, and we are also the maintainers of the Rust RocksDB bindings.

As for our relation with CNCF: after one year in the sandbox, several weeks ago we finally became a CNCF incubation-level project. This is the first incubation-level database project led by a Chinese team and initiated in China, so we feel quite proud of it. This is TiKV's official website; you can visit it. And this slide shows the relation between TiKV and TiDB, and TiKV has many users.
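The all-or-nothing, ACID-style semantics described above can be illustrated with a toy sketch. This is not TiKV's actual implementation (TiKV uses a Percolator-style distributed protocol); it is a minimal in-memory model, with hypothetical names, showing optimistic transactions where a commit either applies every write or none:

```python
class ToyTxnKV:
    """Toy transactional KV store: snapshot reads, buffered writes,
    and an optimistic all-or-nothing commit. Illustration only."""

    def __init__(self):
        self.data = {}      # key -> (value, version it was committed at)
        self.version = 0    # global commit counter

    def begin(self):
        # A transaction is just a snapshot version plus a write buffer.
        return {"snapshot": self.version, "writes": {}}

    def get(self, txn, key):
        if key in txn["writes"]:            # read your own writes first
            return txn["writes"][key]
        entry = self.data.get(key)
        return entry[0] if entry else None

    def put(self, txn, key, value):
        txn["writes"][key] = value          # buffered until commit

    def commit(self, txn):
        # Conflict check: abort if any written key changed since our snapshot.
        for key in txn["writes"]:
            if key in self.data and self.data[key][1] > txn["snapshot"]:
                return False                # the whole transaction fails
        self.version += 1
        for key, value in txn["writes"].items():
            self.data[key] = (value, self.version)
        return True                         # the whole transaction succeeds
```

The point HBase cannot give you is exactly this commit step: either every buffered write becomes visible at one version, or none does.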
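One of the scenarios above was a Redis-style interface on top of a KV store. The general trick is key encoding: map each Redis structure onto flat, ordered keys. The sketch below uses a hypothetical `h:<name>:<field>` layout and a plain dict standing in for TiKV; it is an illustration of the idea, not any real TiKV client API:

```python
class RedisOnKV:
    """Sketch: Redis-style hashes encoded onto a flat ordered KV store."""

    def __init__(self):
        self.kv = {}  # stands in for a sorted, distributed KV store

    def hset(self, name, field, value):
        # Encode hash fields as 'h:<name>:<field>' so that all fields of
        # one hash are adjacent in key order (and hence scannable).
        self.kv[f"h:{name}:{field}"] = value

    def hget(self, name, field):
        return self.kv.get(f"h:{name}:{field}")

    def hgetall(self, name):
        # A prefix scan over the ordered keys recovers the whole hash.
        prefix = f"h:{name}:"
        return {k[len(prefix):]: v for k, v in sorted(self.kv.items())
                if k.startswith(prefix)}
```

Because the underlying store is horizontally scalable, the Redis-like layer scales with it.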
In addition, beyond the core of TiKV, you can participate in contributions around this project. For example, some friends ask me: I want to contribute code to TiKV, but I'm not a database expert, so how can I participate in this project? Well, building a database is not only about the database core code, because there are surrounding contributions. For example, at present, if you operate TiKV, the official client is a Go library. But in the TiKV community you can also build a pure Rust client. Actually this is quite curious: why didn't we have a Rust client from the start? Because, you know, for TiDB we use Go, and we found that making the cross-language call was quite troublesome. We have made some improvements since then, and finally we chose to build the Rust client, so now we have some resources and community help to work on it. So if anyone of you is interested in Rust, you can participate not only in core contributions to TiKV, but also through the Rust client, or a C++ client, and join the whole community that way. After entering the CNCF, many community friends provided code; previously about 90% of the contributions came from our company, and now a lot comes from outside, for example from Alibaba. Thank you very much.

Now I will give the floor to my colleague, Zhang Jinpeng; he is a co-developer of the TiKV project, and he will present the related technology about the core of this project.

Thank you for the introduction. He introduced why we have the TiKV project and what TiKV can bring to us. Next I will introduce the project mainly from two aspects. The first is the architecture, and the main part is about practice: that is to say, when we use TiKV, what problems we may encounter and how to address them. The first part is about the theory.
When we explore the TiKV architecture, what is the core idea? The core part is called Multi-Raft. TiKV is a distributed project, and we use Multi-Raft to ensure data consistency. Since it is a distributed system, it will definitely have data migration, and this must be completed automatically; if it had to be performed manually, that would not be acceptable. For the third part, I would like to look at the elasticity of this project and how we achieve it, and the fourth part is about the ecosystem.

This slide shows the TiKV layer, but TiKV alone is not the whole project. A cluster also involves PD, which is the brain of the whole cluster: it manages the cluster metadata, monitors the state of the cluster, and with this status it schedules the capacity of the whole cluster. On this layer we find the transactional API and the raw TiKV API, and with these APIs we can do whatever we want. For example, TiDB itself is actually a NewSQL database based on TiKV: it is the SQL layer, and it translates SQL down to key-value operations and sends them to TiKV. Of course, you can build your own distributed system based on TiKV; for example, you can serve various other protocols.

The second part is Multi-Raft. How do we ensure the data is consistent? We go through Raft. If we only had one Raft group, we would have some problems: for example, all the data, or each copy of the data, could only be stored on a single device, which is not acceptable because it cannot be extended. So we have Multi-Raft: that is to say, we cut the data into ranges, and each range is actually a Raft group, so the data can be kept consistent with Raft. What does the black box on the slide mean? The black box is the leader of the Raft group.
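The range-sharding idea behind Multi-Raft can be sketched like this: keys live in contiguous half-open ranges (regions), and a request is routed to the region whose range contains its key. The ranges and the routing helper below are purely illustrative, not TiKV's real routing code:

```python
import bisect

# Each region owns a half-open key range [start, end); three toy regions
# covering the whole key space (each would be one Raft group in TiKV).
regions = [("", "c"), ("c", "f"), ("f", "\xff")]
starts = [r[0] for r in regions]  # sorted region start keys

def locate(key):
    """Return the index of the region whose range contains `key`."""
    i = bisect.bisect_right(starts, key) - 1  # rightmost start <= key
    start, end = regions[i]
    assert start <= key < end, "ranges must tile the key space"
    return i
```

Because routing is a binary search over region start keys, splitting a region only inserts one new boundary; the rest of the map is untouched.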
We have a leader in each Raft group; I guess that is not very hard to comprehend. This is not what the data looks like originally. At the beginning you do not have data; you only have one region, with, for example, three copies, like the region 1 replicas on the slide. As data is written in, if you never split or migrate it, all the data sits on one single device and you hit its constraints. So we do auto-split, like the cells of our body: when a cell grows, it splits into two cells. This is how we shard the data.

How does split happen? For example, we have one region with a range from a to e, and this range is replicated through Raft. When the volume increases to a certain amount, it has to be split: for example, I want to split it into range a to c and range c to e. This action itself is replicated through Raft, because the action must be consistent; the leader proposes the split, and the other replicas simply apply the same action from the log. This is splitting.

Similarly, we also support merging. Merging normally happens when we delete data and a region becomes empty. If such regions were never collected, the metadata would keep increasing, so we merge neighboring regions with a small amount of data, and in that way we control the amount of metadata. This is the foundation of the stability of the whole project.

Since I have mentioned split and merge, now I will talk about scale: how do we migrate the workload? First we have the transfer-leader mechanism. In TiKV, both reads and writes must go through the leader. For example, in Node 1 we have elected a leader; if all the leaders were elected on the same node, it would not work, so we need a mechanism to quickly migrate leaders, and in Raft we have such a concept, transfer leader.
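The size-driven split and merge just described can be sketched as follows. The thresholds and the one-character midpoint are deliberately crude stand-ins: real TiKV picks an actual split key inside the data and coordinates the change through Raft and PD.

```python
SPLIT_SIZE = 96   # MB: split a region that grows past this (illustrative)
MERGE_SIZE = 20   # MB: merge adjacent regions whose total stays under this

def maybe_split(region):
    """Split an oversized region's range in half; region = (start, end, mb)."""
    start, end, size = region
    if size <= SPLIT_SIZE:
        return [region]
    mid = chr((ord(start[0]) + ord(end[0])) // 2)  # crude midpoint for the sketch
    return [(start, mid, size // 2), (mid, end, size - size // 2)]

def maybe_merge(left, right):
    """Merge two adjacent small regions to keep region metadata small."""
    if left[1] == right[0] and left[2] + right[2] <= MERGE_SIZE:
        return [(left[0], right[1], left[2] + right[2])]
    return [left, right]
```

Split keeps any single region (and hence any single Raft group) bounded; merge keeps the total number of regions, and so the metadata PD must track, bounded.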
For example, in region 1 we have a command to transfer the leader. How do we decide when to perform this action? Actually, it is according to the workload statistics uploaded by TiKV: the read flow, the write flow, and the disk and memory capacity. This information is sent to PD, and PD is in charge: it decides which workload should be migrated to other devices.

Now I will talk about scaling. For example, we have three devices, with the data shown on the slide, and now I want to add another device. Once we have added the new device, your part of the action is complete, and the rest happens automatically: the new device will receive a copy, and then the corresponding copy on one of the original devices is removed. In this way we perform data migration. Similarly, if you want to delete a device, the workflow is actually the same: the data on this device just needs to be moved, replica by replica, to other nodes before the process is killed. And TiKV is not only one project; it also includes clients and other related projects developed by ourselves.

Now comes the second part, about practice. The first topic is deployment, with two aspects: single-DC deployment and cross-DC deployment. The second topic is elastic scaling. The third topic is an issue that we focus on for the performance of a distributed system, that is, fighting with hotspots: how to identify hotspots and how to address them. The fourth topic is what we can do about TiKV cluster performance.

For example, this is a single DC. Within a single DC we have different cabinets, different racks, for example rack one, rack two and rack three. When we deploy multiple copies of the data, we want to scatter the copies onto different racks. Why do we want to do that? If one rack goes offline, the other racks are still available; they can still be online.
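A single step of the leader-balancing idea (move a leader from the busiest node to the least busy one) might look like this sketch. Real PD scheduling weighs read/write flow, capacity, and labels, not just leader counts, so treat this as the simplest possible model:

```python
def balance_step(leader_counts):
    """leader_counts: node -> number of Raft leaders it currently hosts.
    Mutates the counts and returns a (src, dst) transfer-leader suggestion,
    or None when the cluster is already balanced."""
    src = max(leader_counts, key=leader_counts.get)  # busiest node
    dst = min(leader_counts, key=leader_counts.get)  # least busy node
    if leader_counts[src] - leader_counts[dst] <= 1:
        return None                                  # nothing worth moving
    leader_counts[src] -= 1
    leader_counts[dst] += 1
    return (src, dst)
```

Run repeatedly until it returns None and the leaders end up spread evenly, which is exactly what PD's transfer-leader scheduling aims for at a coarse level.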
So we want to tell PD, for each TiKV, in which rack and in which DC its data is stored. Once this information is given to PD, PD can allocate the copies of the data accordingly and balance the distribution.

Now it comes to the cross-DC case. The setup is quite similar, but we may have other requirements. For example, the leaders for a business should be in only one city. Say you have two cities, city one and city two, but three DCs. Most of your application is in city one, for example in Beijing, and the DC in city two, for example Xi'an, is for another business. In this case you want to add labels to the racks to ensure your data is well distributed, and in addition we can control the leaders of the copies and keep them on the city-one nodes, for example in Beijing. How to do that is quite simple; you will understand with the commands on the slides. For example, we configure the number of replicas, as I said, and we configure the location labels: for example zone, rack, and host. What does TiKV need to do? You only need to tell it which TiKV is in which zone and in which rack; you only need to specify these parameters. And we also have precise control over the leaders, to make sure they will not appear in certain racks: we have a scheduler, so that we can migrate the leaders out of those racks.

OK, now talking about scale; previously I showed the dynamic demonstration. What do we need to do if we want to add a new node? Actually, adding a node is very simple: just start a new TiKV node with the correct PD address. In this way we automatically add a new node. Quite simple, right? Why is it so simple?
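The replica and label setup just described might look like the lines below. This is a hedged sketch from memory, so verify the exact pd-ctl and tikv-server syntax against the docs for your version; `z1`, `z2`, `r1`, `h1` are placeholder label values.

```
# In pd-ctl: three copies, placed according to zone/rack/host labels
config set max-replicas 3
config set location-labels zone,rack,host
# Keep leaders out of a given zone (e.g. the remote city):
config set label-property reject-leader zone z2

# Each TiKV node is started with its own topology labels:
tikv-server --pd <pd-address> --labels zone=z1,rack=r1,host=h1
```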
Compared with migrating data or scaling in a traditional database, there is a big difference. Why is that? That is exactly what we are working on: we want to liberate the productivity of our workforce, so that we do not need to waste a lot of time on these operations.

And how about removing old nodes? Maybe you would just turn one off. Well, actually that can work, but it has some risks. For example, you turn off the machine, the host, then you delete the data, and then another host carrying a replica breaks down, or its disk breaks down. It is risky, because in Raft we need the majority of the replicas to exist so that your data is secure, but if hosts break down, you cannot ensure that the remaining hosts still hold the data. For example, previously one piece of data had three replicas, and after the deletion it has only two; if the third replica is not recreated on another host in time and one of the remaining hosts breaks down, you only have one replica left.

So what you need is, first, to decide which store to delete, for example this TiKV, and then to run a store delete command against it. What it does after this order is the migration of the data, because the data on that store now counts only two replicas elsewhere, so one more action is needed: to bring those two replicas back up to three. But how can we know when it is done? You can check the state of the store: after issuing this command it will become Offline, as you can see from the slide.
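The removal flow just described maps onto a short pd-ctl session; the store id `1` is a placeholder, and the exact state field names may differ by version, so check your release's pd-ctl reference:

```
store             # list stores and find the id of the node to remove
store delete 1    # mark store 1 for removal; PD starts re-replicating its regions
store 1           # state shows "Offline" while replication is in progress
# Keep the process running; once the state becomes "Tombstone",
# every region has enough replicas elsewhere and the process can be stopped.
```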
So at this moment, do not shut the deleted store down yet, because the replication work for the store's regions is still going on. After the deleted store's data has been transferred and it becomes Tombstone, when it is labeled Tombstone, that means all the replication work has finished, and then you can stop this TiKV.

Next is hotspots. In a distributed system, hotspots are a very important issue. Why? Look at this image: if all the writes go through one single node, you run into some problems. First, that single node becomes the bottleneck of the whole cluster. Second, after the data has piled up on that one node, you have to move a large amount of it out again. So this is not recommended.

If we discover this problem, how should we address it, and what kinds of causes produce it? The first cause is an update-time column: you will find that time is monotonically increasing, so those writes all land in the same region. The second cause is auto-increment IDs; similarly, all the writes land within one region. Maybe you do not even know whether you have an auto-increment ID or an update-time column, so is there a simple way to identify whether we have a hotspot problem? For writing hotspots we have real-time monitoring of the raft store CPU and the apply CPU, and on the dashboard there are panels for region distribution and region write flow, as well as the raft apply state. These are mainly for writing hotspots: if the load is not balanced between different TiKVs, for example one is too high and another is too low, that means we have a hotspot problem. Reading is the same; there are two metrics. The first is KV read: the storage read pool handles the KV reads, and if its CPU is not balanced, then you also have a hotspot problem. The next one is the coprocessor.
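One common mitigation for the monotonic-key hotspots just described (update times, auto-increment IDs) is to scatter writes by prefixing a hash-derived shard number, so consecutive keys land in different regions. The key layout below is hypothetical, a sketch of the pattern rather than any TiKV API:

```python
import hashlib

def salted_key(user_key, shards=16):
    """Prefix a stable hash shard so monotonically increasing user keys
    (timestamps, auto-increment ids) spread across `shards` key ranges."""
    digest = hashlib.md5(user_key.encode()).hexdigest()
    shard = int(digest, 16) % shards
    return f"{shard:02d}:{user_key}"
```

The trade-off is that range scans over the original key order now need one scan per shard prefix, so this suits write-heavy tables more than scan-heavy ones.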
Well, I did not introduce the coprocessor just now. TiKV actually supports not only transactions but also an interface and framework for computation. One example: TiKV originally is the storage layer for TiDB, so there is some complex computation over the data. During this process, if TiKV cannot understand the data, it is hard to push the computation down, and that is worse: first, the bandwidth cannot bear shipping all the rows, and second, your TiDB becomes the bottleneck for computing. But if TiKV supports a coprocessor framework and you tell it how to process the data, it can accomplish the task itself: for example, it can calculate an average and return it, and TiDB, after getting these results, can do a second round of computation. So we also have coprocessor CPU monitoring; if you only use the transactional or raw TiKV interface, maybe you only need to watch the read pool CPU.

After talking so much about hotspots in TiKV, is there any automated solution to the hotspot issue? Actually there is, because each region regularly reports its read and write traffic over the past period, as well as CPU. After this is sent to PD, PD can calculate and see which parts of the cluster are busy and which are not, and automatically schedule based on the read and write traffic to ensure it is not concentrated in one small region. But, for example, when you only have one small region of, say, 100 MB and we constantly read those 100 MB, you will still hit a hotspot. How to address this? Besides the automated process we have a manual process, through pd-ctl. This is also a tool of ours to find the hottest regions; for example, if a hot region is very small, with a small amount of data, we can split it. Actually it cannot be split automatically yet, but we will seek to
realize this function in a future version. At present, in the current version, you can split it by hand, and after the segmentation you do not need to do anything more, because PD can then balance the hot regions: PD will learn from the collected data and distinguish hot read or write regions. To split a region by hand we have the split-region operator, `operator add split-region`, so we can split the region, and for the policy you can choose which strategy to split by; there are just two strategies.

Another way: just now we talked about transfer leader, which moves the leader, and we also have transfer peer: for a certain peer, for certain data, we also provide manual balancing with the transfer-peer operator. According to the order you can just operate it directly.

The next part is about performance tuning: how we identify bottlenecks and how to fix them. For example, you have the raftstore and apply thread pools, and if they are highly loaded, the raftstore or the apply thread pool can be the bottleneck; likewise your disk IO can be the bottleneck, or the CPU can become the bottleneck. Actually they are all monitored. This slide shows the thread pools inside a TiKV process; there are many of them, and according to your monitoring, whichever part becomes the bottleneck, you can adjust it through the parameters on the right side. If your disk becomes the bottleneck, you can adjust the compression settings and use a higher-level compression, so that you save disk space; and if the CPU usage is high, you can use a compression type with low CPU cost via the per-level compression settings. For reading, in TiKV we have two pools, the KV read pool and the coprocessor read pool; the parameters are on the right side, so according to the
monitoring you can observe and adjust these parameters. The last one is RocksDB: it has a block cache, and by adjusting the size of the block cache you can tune it. In this slide you can see the block cache hit rate. For the block cache, you need to leave some space for your system's page cache, because on the read path you first get from the memtable, then from the block cache, and then fall back to the page cache, so reserve enough memory for the page cache; note that the data in the page cache is compressed and has to be decompressed when read.

Thank you. Due to the time limit, this is the end of my introduction; I went through it very quickly without too many details, but if you have any questions, feel free to ask me after the conference. Thank you.