Okay, thank you for the introduction, and thank you everyone — I am so glad to be here. My topic is how Tencent uses PGXZ in our WeChat payment system. PGXZ is an internal name; the project is based on the community project PostgreSQL-XC.

First, let me introduce Tencent and the WeChat payment system. Tencent is one of the biggest internet companies in China. It has more than 20,000 employees, and its Hong Kong stock code is 0700. WeChat is the most popular social network application in China, and it is also the latest product of Tencent in the social network area. Last year's financial report shows WeChat has 600 million monthly active users. WeChat payment is the wireless, fast, secure, and efficient payment solution provided by WeChat. It handles about 50 million payments per day at present, and that number just keeps growing.

Okay, so what is PGXZ used for? When people use WeChat payment, they pay like on the left of the slide: they pay for everything in the supermarket, they pay for taxis, they buy things on shopping websites like Jingdong, and they pay for rides on the Chinese-style Uber. After that, the payment bills, the fund flows, and the billing flows all come into our background system, and the background system is PGXZ. Then the merchants — the supermarkets, the website companies, the taxi companies — want to know how much money they have earned, or the details of the payments, so they query PGXZ to get that information. That is what PGXZ is used for.

Okay, come to another question: why do we use PGXZ for this? Let's look at two numbers. Number one, at present about 100 million rows come into the system every day, and our product guys told me this number will be 200 million at the end of 2016. Number two, the data set of the cluster is 8,500 gigabytes at present, and at the end of the year it will be 16,000 gigabytes. No single-node database can handle so much data, so we need a cluster. Another reason we chose PGXZ is performance: we can finish a sort over 19 million rows in a very short time — less than a hundred milliseconds — and we can provide fast aggregation over large data sets, and of course fast real-time queries as well. And the most important thing: we require high transactions per second. At present it is 5,000; at the end of the year it will be 10,000, and next year it will be even higher. The data in our system roughly doubles every month — this month is double last month, and next month it will double again. It keeps increasing very fast.

Talking about PGXZ, many people ask why we didn't choose Postgres-XL, which is another PostgreSQL cluster solution. Postgres-XL is more stable, it can do more efficient internal joins, and it is also more up to date — it has been merged up to the latest version, 9.5. The reason is very simple: when we started the project two years ago, Postgres-XL was not available — it was not open source at that time. So we chose PostgreSQL-XC.
Next page. This slide just shows the hardware we use and the node count of the cluster running online. For coordinators and data nodes, we use machines with 24 cores, 64 gigabytes of memory, and a 1 TB PCIe Fusion-io card, on a Gigabit network. For the GTM we don't need such a fancy machine — just an ordinary box with a 1 TB SATA disk and an ordinary network. The online cluster contains 16 primary data nodes, 16 standby data nodes, and two GTMs; three coordinators are deployed together with the data nodes. In all, the cluster has 64 machines.

After this description, let's come to today's agenda. We will talk about seven topics today. But before the first topic, let me introduce my team. They are so glad to see you all, so they are all smiling — this picture was taken two days ago in Shenzhen, China. You can see it is warm there; New York City is much, much colder, so I caught a little cold. Has anyone been to Shenzhen? I see one of you has — you are all welcome in Shenzhen. When you come, just contact me and I will invite you to dinner. Chinese food is very, very delicious; you will never want to miss it.

Okay, let's come to the first topic: the transaction management modification we made to PGXC. In PGXC, when a statement comes to the coordinator, the coordinator requests the transaction info — the GXID and the snapshot — from the GTM, and then sends the statement and the transaction info to the data nodes to execute. By doing this, PostgreSQL-XC can maintain a consistent view of transactions, but it has its shortcomings. First, standby nodes cannot be used for read-only queries, so the machines are not fully utilized. Second, in some corner cases corruption happens, and it is hard to trace and fix. That second issue bothered us a lot — sometimes the catalog would just get corrupted.

So we made a simple modification: we removed the global transaction information. When a query comes to the coordinator, we just send the statement itself to the data nodes, without anything else, and when executing the SQL, each data node uses only its local transaction ID and snapshot. This way we completely avoid the parts of PGXC that bothered us.

And this modification has another good effect: we can use standby data nodes to provide read-only service. We now have two separate kinds of coordinators. The catalogs of the normal coordinators have entries for the master data nodes; the catalogs of the read-only coordinators have entries for the hot-standby data nodes. The master data nodes ship logs to the standbys, which run in hot-standby mode. This way the standby data nodes can serve read-only queries.
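To show how an application might take advantage of the two kinds of coordinators, here is a minimal client-side sketch in Python. Everything in it is an assumption for illustration — the host names, the naive `is_read_only` check, and the idea of routing in the client are not part of PGXZ; the talk only describes the server-side split.

```python
import random
import psycopg2  # any PostgreSQL client driver would work the same way

# Hypothetical endpoints: normal coordinators see the master data nodes,
# read-only coordinators see the hot-standby data nodes.
NORMAL_COORDINATORS = ["coord1:5432", "coord2:5432"]
READONLY_COORDINATORS = ["ro-coord1:5432", "ro-coord2:5432"]

def is_read_only(sql: str) -> bool:
    # Naive check for the example; a real router would inspect the parse tree.
    return sql.lstrip().lower().startswith("select")

def run(sql, params=None):
    # Reads can go to a read-only coordinator, which plans them against
    # the standby data nodes; writes must go to a normal coordinator.
    pool = READONLY_COORDINATORS if is_read_only(sql) else NORMAL_COORDINATORS
    host, port = random.choice(pool).split(":")
    with psycopg2.connect(host=host, port=int(port), dbname="pay") as conn:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall() if cur.description else None
```

A production router would also keep persistent connection pools instead of connecting per statement, but the routing decision itself is this simple once the two coordinator pools exist.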
Okay, let's go to the next topic. As I just said, our business keeps growing very fast — almost every month it doubles — so we need to scale out. And our merchants' businesses keep running; they want to know their business status in real time, and they don't want to wait. So we cannot stop the service to scale out. We need to scale out while the business is still running, and we have to minimize the service interruption.

PGXC has three kinds of distribution strategy: hash, round robin, and replication. Here we just talk about hash. With the hash strategy, which node a row goes to depends on the number of data nodes and the hash value of the distribution key. That has its own shortcoming: when we add new nodes, if we want to use the newly added nodes, we have to export and import all the rows. The service cannot keep going — we have to stop it, and the merchants will just complain. So we had to find a solution that minimizes the service interruption.

So we introduced a new distribution strategy into PGXC: we call it the logical shard map. Every entry of the map maps a shard ID to a single data node, so the map is a middle layer between the rows and the data nodes. The new distribution strategy uses the hash value of the distribution key and the length of the shard map to decide which data node a row goes to. When we scale out, we just move the data of some shards to the new nodes and then repoint those shard map entries at the new data nodes. This way we don't have to rebuild the whole cluster — I will show a small sketch of this after the scale-out steps.

This slide describes the steps of a scale-out. First, we dump the data with the specified shard IDs from the data node we want to scale out, and meanwhile we start logical decoding for those shard IDs — this ensures that newly arriving rows will not be lost. After this, we import the dump file and the logical decoding output into the newly added node. When there are very few remaining modifications on the source node, we lock the source shards by refusing write access to those shard IDs. Then we tell the coordinators to update the shard map so that the moved shard IDs point to the new node. After this, the scale-out is over. Finally, we erase the moved data from the source node, just to release the disk space.
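To make the logical shard map concrete, here is a minimal sketch in Python. The map size, the hash function, and the `move_shard` helper are illustrative assumptions for this writeup, not PGXZ's actual code; the real system persists the map in the coordinators' catalogs.

```python
import zlib

SHARD_MAP_SIZE = 4096  # assumed: one entry per logical shard

# shard_id -> data node; initially all shards are spread over nodes 0..15
shard_map = [i % 16 for i in range(SHARD_MAP_SIZE)]

def shard_of(distribution_key: str) -> int:
    # Route by hash of the distribution key modulo the MAP length,
    # not modulo the node count -- that is the whole point.
    return zlib.crc32(distribution_key.encode()) % SHARD_MAP_SIZE

def node_of(distribution_key: str) -> int:
    return shard_map[shard_of(distribution_key)]

def move_shard(shard_id: int, new_node: int) -> None:
    # Scale-out: after the shard's rows are copied to the new node
    # (dump + logical decoding catch-up), only this entry changes.
    shard_map[shard_id] = new_node

# Adding node 16 moves a few shards instead of rehashing every row.
move_shard(0, 16)
move_shard(16, 16)
```

Because the map length stays fixed, adding a node changes a few map entries rather than the placement of every row, which is why the cluster does not need to be rebuilt.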
Let's come to the next topic: data skew across nodes. First we have to know what causes the skew. We create a table with merchant ID as the distribution key, so all the rows of a given merchant are stored on a single node. But merchants have different amounts of business — some are huge and some are tiny — so huge merchants use much more disk than the others. Sometimes, like the shopping website I showed on my third slide, one merchant can fill the capacity of one machine in only one month. If we don't find a solution, then when new data comes in and the disk is full, the business will fail and we will get complaints — the websites will not be happy.

Okay, then let's talk about the solution to the data skew. First, we divide the PGXZ nodes into groups. A group contains one or more data nodes, and each group has its own single shard map. Small merchants and huge merchants can be stored in different node groups, and the coordinators can access all groups, of course.

After this, we introduce a special distribution strategy for huge merchants — the distribution strategies for normal merchants and huge merchants are different. Normal merchants just use the hash value of the merchant ID to calculate the shard ID a row should go to. But for huge merchants we add another factor: the day offset of the record from January 1st, 1970. That means when we decide where a huge merchant's row should go, we use two factors: one is the merchant ID, and the other is the day offset of the time the business happened. This way the data of a huge merchant is well balanced across the huge-merchant group. So first we pick out the especially huge merchants and apply the customized strategy to them, and the normal merchants keep the default sharding strategy.

That is what I just explained; now let me give you an example. For a small merchant, the data of different days all goes into one shard ID, and one shard ID means one data node. But for a huge merchant, different days go into different shard IDs, and different shard IDs mean different data nodes. So the data is well balanced across all the nodes inside the group. By doing this we handle the data skew very well: inside our cluster now, the disk usage across the huge-merchant group is almost the same on every node. This is the data skew solution.
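Here is a minimal sketch of the two-factor routing for huge merchants, continuing the shard-map example above. The group map size, the hypothetical merchant list, and the helper names are assumptions for illustration only.

```python
import zlib
from datetime import date

GROUP_MAP_SIZE = 4096          # assumed per-group shard map length
huge_merchants = {"jd_mall"}   # hypothetical set of detected huge merchants

def day_offset(d: date) -> int:
    # Days elapsed since 1970-01-01: the second routing factor.
    return (d - date(1970, 1, 1)).days

def shard_of(merchant_id: str, business_day: date) -> int:
    if merchant_id in huge_merchants:
        # Huge merchants: hash the merchant ID together with the day offset,
        # so each day's rows can land on a different node of the group.
        key = f"{merchant_id}:{day_offset(business_day)}"
    else:
        # Normal merchants: merchant ID alone keeps one merchant on one node.
        key = merchant_id
    return zlib.crc32(key.encode()) % GROUP_MAP_SIZE

# Example: one huge merchant, three consecutive days, three shard IDs.
for d in (date(2016, 3, 1), date(2016, 3, 2), date(2016, 3, 3)):
    print(shard_of("jd_mall", d))
```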
Let's come to another topic: optimizing query performance for ordered results over 19 million rows. Sometimes a merchant wants to see the details of its business — the billing list — very carefully, one by one, clicking through the website page by page. So we need to sort the data by the time the business happened. And the time range may be one day, one week, or one month; especially for the huge merchants the data will be huge — several million rows a day, so a month can be 90 million rows. Quite a lot of rows, so it is challenging for us. The merchants don't have enough patience to wait long after a click — a few seconds is enough, and if it takes longer they will not be happy with us.

Okay, let's have a look at the solution. First, when we create the table, we use a partitioned table: we partition the table by the time of the row, one partition per month. And of course, for huge merchants we use the customized distribution strategy, so the data is distributed to different nodes by day offset, and almost every node has the same number of rows. We also create a multi-column index on the merchant ID and the sort column, so the scan on each partition can be an index scan. Note that these are two different layers: across the cluster — across different nodes — we shard by merchant ID, but inside every single node we partition by time.

Second, we had to do something about the plan. We optimized the planner of the coordinator: when the coordinator sees this kind of query, and it sees that the indexes on each partition match, it pushes the ORDER BY down to the data nodes. On each data node we use an index scan for each partition. Sometimes the time range is longer than one partition, so we scan more than one partition, and the planner uses Merge Append to combine the results from the multiple partitions. Merge Append's output is sorted, so each data node produces a sorted result — I will show a small sketch of this merge after the performance numbers. This matters because the merchants want to look very carefully at the time the business happened. Sometimes the person looking at the billing list is not the CEO or anything; it is the guy at the front desk, checking what happened at which time because someone complained that the system is not good enough. These things happen.

So each data node outputs tuples in sorted order, and on the coordinator we also use Merge Append to combine the outputs from all the data nodes, so the coordinator produces one sorted output.

This slide shows the running performance. In our test case, we can get a result over 19 million rows, sorted, in less than 30 milliseconds. And with 800 sessions running at the same time, we can process more than 5,000 queries per second, with latency under 200 milliseconds. Let's come to the next slide. I'm sorry, my pronunciation is not so good, so I just try to speak slowly and loudly so I can be well understood. Thank you.
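The Merge Append behavior is easy to picture with Python's `heapq.merge`, which performs the same k-way merge of already-sorted streams. This is only an analogy sketch, not PGXZ code; the inner lists stand in for sorted per-partition index scans (on a data node) or sorted per-node streams (on the coordinator).

```python
import heapq

# Each inner list stands for one already-sorted input stream:
# an index scan over one monthly partition, or one data node's output.
partition_scans = [
    [(1, "txn-a"), (4, "txn-d"), (7, "txn-g")],
    [(2, "txn-b"), (5, "txn-e")],
    [(3, "txn-c"), (6, "txn-f")],
]

# Merge Append = k-way merge: repeatedly take the smallest head among
# all streams, so the combined output is globally sorted without a re-sort.
merged = heapq.merge(*partition_scans, key=lambda row: row[0])
for ts, txn in merged:
    print(ts, txn)
```

Because no stream is ever fully buffered, the first rows come back immediately — which is exactly why a paging merchant sees results in milliseconds instead of waiting for a full sort of the month's data.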
Okay, this slide talks about the parallel execution of SQL and its optimization. I have learned that the community is doing the same thing now, but we did this two years ago, and ours is running online.

Let's talk about the architecture. We introduced a shared memory manager, and we use a bitmap method to manage the shared memory. When a statement comes into the system, the PostgreSQL session judges whether the plan can be run in a parallel way. If it can, the session divides the plan into small pieces — we call each one a plan slice — and puts them into the plan queue, which is stored in the shared memory manager. Several executor workers wait on the other side of the plan queue; when they see anything arrive, they just pick it up and run it. When an executor worker finishes its job, it pushes the result — the pointer to the result table — into the result queue. The result queues are also stored in the shared memory manager. This is how we run SQL in a parallel way; I will show a small sketch of this shape at the end of this topic. Any questions about this? Okay, next.

This page shows how we defined the plan slice. It contains the following. First, the XID — which transaction is running this piece of the plan. Then the user ID — which user is running it. Next, the transaction snapshot. The fourth one is the plan statement of the current slice. Then the plan itself — the actual slice that will be run in the executor worker. Then the slice info, which is about the specific execution node: it describes the range of the plan slice, for example how many blocks to scan, which block to start at, which block to stop at, that kind of thing. And the last one is a memory context. This memory context is created in the shared memory manager, and it is used to hold the executor's result table — it is where the result table pointers in the result queue point to.

After all this, let's talk about which operators we have parallelized. We have done sequential scan, hash join, and nested loop join. Merge join we have planned and estimated — we found it is okay to do — but those three are enough for us, so we have not finished it. We have also done remote query — remote query is a special operator in PGXC — plus hash aggregate, sort aggregate, and append.

In the next few charts I show the acceleration from parallel running. For sequential scan, the acceleration rate is almost the same as the number of CPUs we use — almost 12 times. Hash join is almost eight times. Nested loop join is maybe not that good — at most six times faster. Hash aggregate is good, I think: the acceleration rate can reach 13 times, which is great. Sort aggregate was a kind of experiment for us: when we developed it, we thought it would not be as good as hash aggregate, but we did it anyway, and the result tells us we were right at the beginning — it is only three times faster.

Of course, parallel execution has its own side effect. In our online system, parallel execution sometimes uses too much resource — almost all the CPUs and all the disk I/O — so the QPS, the queries per second, cannot be too high; the CPUs and the I/O just stay busy all the time.
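Here is a minimal sketch of the plan-queue / worker / result-queue shape in Python. It only mirrors the architecture described above — the real system does this in C inside a bitmap-managed shared memory segment — and the `PlanSlice` fields are just the ones listed on the slide, with stand-in types.

```python
import multiprocessing as mp
from dataclasses import dataclass

@dataclass
class PlanSlice:
    xid: int          # owning transaction
    user_id: int      # user running the statement
    snapshot: object  # transaction snapshot (stand-in)
    statement: str    # plan statement of the current slice
    start_block: int  # slice info: block range this worker should scan
    end_block: int

def executor_worker(plan_queue: mp.Queue, result_queue: mp.Queue) -> None:
    # Workers wait on the plan queue, run whatever arrives, and push
    # (a stand-in for) the result-table pointer into the result queue.
    while True:
        s = plan_queue.get()
        if s is None:           # shutdown marker
            break
        rows = list(range(s.start_block, s.end_block))  # pretend scan
        result_queue.put((s.start_block, rows))

if __name__ == "__main__":
    plan_q, result_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=executor_worker, args=(plan_q, result_q))
               for _ in range(4)]
    for w in workers:
        w.start()
    # Divide one scan of 1000 blocks into 4 slices and enqueue them.
    for lo in range(0, 1000, 250):
        plan_q.put(PlanSlice(xid=1, user_id=10, snapshot=None,
                             statement="SELECT ...", start_block=lo,
                             end_block=lo + 250))
    results = [result_q.get() for _ in range(4)]
    for _ in workers:
        plan_q.put(None)
    for w in workers:
        w.join()
    print(sum(len(r) for _, r in results), "rows from", len(results), "slices")
```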
So let's come to another topic: high availability across Internet data centers (IDCs) and cities. Our system runs financial business — money data — so it is critical. We need several copies to ensure that no disaster can take the system down, and we had to design a system robust enough for any disaster. The system contains several types of components. The first one we all know: ZooKeeper. We use ZooKeeper to store the data node and coordinator status. Then the judgment centers. We deploy a judgment center in every city — each city must have at least one — but among all these judgment centers only one is the master; the others are standbys.

And what is the difference between the master and a slave? The master collects status from the collect agents, and besides this it also makes judgments — for example, that some node is down and should be replaced by a backup node. If the master itself is down, the judgment node group promotes another slave as the new master. After this come the collect agents: a collect agent just collects the status of the database nodes and executes the commands from the judgment center. And after all this there is the database cluster itself. We deploy coordinators in every city, at least one per IDC, and we have a copy of the data in every IDC. So even if two of the three IDCs are down, our system can still keep running.

Okay, this slide is only the conclusion — I have explained all of it. Each data node has two standbys: the synchronous one is in the same city as the master, and the other one, in a faraway city, is asynchronous. Coordinators are deployed into each IDC. Monitor agents are used to report node status, and each IDC has at least one monitor center — the judgment center I described. Failover happens automatically, driven by the active monitor center with the help of ZooKeeper.

Okay, that is all of my topics. Now, our next steps. At the beginning of my presentation I talked about the growth of our business. Our product guys told me that at the end of 2019 the cluster will be more than 400 machines in all. That is too big, and so many machines cost so much money, so we need compression to reduce the cost. We need a compressing storage engine — maybe a file system with compression, or compression inside the database itself. Then we will migrate more Postgres-based businesses to PGXZ: Tencent also has a product like Google Maps, and it has plenty of data, so we need this. And in the very near future, before the end of May, we will build another three PGXZ clusters, because more and more business is coming and we need to serve it.

That is my email address — you can contact me about anything. Okay, any questions? Okay, I'm ready. One thing, one thing is, you know.