So my name is Øystein Grøvlen, and I work for Alibaba Cloud. I used to work for Oracle on the MySQL optimizer team. This talk is about a project we are doing at Alibaba: adding parallel query execution to MySQL. If you stay for the next presentation you will get to know more about PolarDB in general; that is our cloud-based database architecture. In this talk I will cover what parallel query is, how to use it, some of the implementation, and the performance we get. This is work in progress, so I will also say something about what is not working yet and what we plan to do in the future. So, what is parallel query? Traditionally, MySQL has only been able to use a single thread to execute one query: you get a connection, the connection gets a thread, and everything you do within that connection is executed by that thread, except for I/O, where multiple threads may be used. What we want is to use multiple threads to execute one single query, in order to use more of the processing power that modern computers have. The goal is that if you use eight parallel threads, the query runs eight times as fast as with a single thread. Why is this needed? For the last 15 years or so, individual processor cores have stopped getting much faster; what makes computers faster is that more and more cores are added, and we want to take advantage of all these cores. In a high-throughput transaction processing system you already take advantage of them, because many concurrent users are hitting the system. But long-running analytical queries often come from few concurrent users, and a single thread per query cannot keep all the cores busy. The goal here is to run these queries directly on InnoDB, the storage engine most used with MySQL.
There are other systems where you extract the data from your transaction processing database and load it into some other system, like Redshift or a special-purpose analytics database. This is not a project for that. This is a project where you run your existing queries on your existing database and get better performance through parallelism. Take an example: if you want to count how many rows there are in a table, with one thread you have to scan all the rows of the table and do the count. If this is the only query currently running on your machine, you might have one active thread and 63 idle cores that are not doing anything. Instead, you want to spread the work across all the cores using 64 threads; each thread counts just one part of the table, doing roughly 1/64th of the work, and then you sum the partial counts together and report the result back to the client. That is the basic principle of parallel execution. It works like this: a coordinator splits the table or index into roughly equal-sized pieces and hands those pieces to the workers to work on. At the end, the coordinator gathers the results, puts them together, and reports them back to the user. Each worker writes its results to its own buffer, there are message queues between each worker and the coordinator, and the coordinator reads from these message queues and assembles the final result. Internally, as I said, for a parallel sequential scan or a parallel index scan, we partition the data and distribute the partitions across the workers.
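The counting example above can be sketched in a few lines. This is an illustrative toy, not PolarDB code: worker threads each scan one slice of an in-memory "table", send their partial count back over a per-worker message queue, and a coordinator sums the partial results. All names are invented for the sketch.

```python
import threading
import queue

def worker(rows, start, end, out_queue):
    # Each worker scans only its own slice of the table and
    # reports a partial count over its message queue.
    out_queue.put(sum(1 for _ in rows[start:end]))

def parallel_count(rows, num_workers):
    # Coordinator: split the table into equal-sized pieces.
    chunk = (len(rows) + num_workers - 1) // num_workers
    queues = [queue.Queue() for _ in range(num_workers)]
    threads = []
    for i, q in enumerate(queues):
        t = threading.Thread(target=worker,
                             args=(rows, i * chunk, (i + 1) * chunk, q))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    # Coordinator: gather the partial counts and merge them.
    return sum(q.get() for q in queues)

table = list(range(1_000_000))  # stand-in for table rows
print(parallel_count(table, 8))  # 1000000
```

In the real implementation the workers of course scan InnoDB pages rather than Python lists, but the coordinator/worker/message-queue shape is the same.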
So if you have a B-tree like this and you want two partitions, you divide the B-tree into two parts: worker one will only ever see this part, and worker two will only ever see that part. We are logically partitioning the B-tree. An alternative approach, used by some other databases, is to just hand out one page at a time to each worker. One advantage of partitioning like this, where worker one only sees its own part, is that we can support more than just sequential scans. If, for example, worker one is asked to access the row with key value 19, it will not see it, because that key is not in its partition. That means index lookups are also partitioned between the workers. It also means we can reuse existing code. For example, yesterday someone talked about 8.0 supporting skip scan, which is a combination of scanning and index navigation. If you partition like this, the worker will scan, skip down, scan again, and so on, and when it reaches the boundary of its partition it will not see any more rows, so its part of the skip scan is finished. As for how the partitioning is done: if you ask for a few partitions, we only look at the root page. For example, if the root page has three keys there are four subtree ranges, and with two partitions each partition gets two of the ranges. If you want more partitions, we go down one level in the B-tree and split at that level instead. Yes, I will come to that on the next slide. We could assign one partition to each worker, but then some partitions may be slower to process than others, and one worker will fall behind.
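The root-page splitting just described can be sketched as follows. This is a simplified illustration with invented names, not InnoDB code: the keys on the root page divide the key space into subtree ranges, and each partition gets a contiguous run of those ranges, merged into one half-open interval (`None` stands for an open end of the key space).

```python
def partition_from_separators(separator_keys, num_partitions):
    # n separator keys on the root page delimit n + 1 subtree ranges.
    bounds = [None] + list(separator_keys) + [None]
    ranges = [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
    # Deal contiguous runs of ranges to the partitions, as evenly
    # as possible, and collapse each run into one [low, high) interval.
    parts = []
    base, extra = divmod(len(ranges), num_partitions)
    idx = 0
    for p in range(num_partitions):
        take = base + (1 if p < extra else 0)
        low = ranges[idx][0]
        high = ranges[idx + take - 1][1]
        parts.append((low, high))
        idx += take
    return parts

# Three keys on the root page, two partitions: each gets two ranges.
print(partition_from_separators([10, 20, 30], 2))
# [(None, 20), (20, None)]
```

A worker given the interval `(None, 20)` would never see key 19's sibling partition, which is exactly why index lookups and skip scans partition cleanly. When more partitions are requested than the root page can provide, the real implementation descends one B-tree level and splits there instead.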
What we have found is that if we create about a hundred times as many partitions as we have workers, things even out, and there is only a small percentage of difference in how much time each worker spends. Some worker might still process twice as many partitions as another; that depends on the data, because there may be correlation between the scan order and the WHERE condition, so that some workers filter out many more rows than others. It is also easier to get partitions of roughly equal size if you ask for more of them: since the splitting is done at a high level of the B-tree, asking for only a few partitions might give you three small ones and one that is twice as big, for example. It is implemented so that when a worker finishes its partition, it is simply assigned a new one. The motivation for doing it this way is to reuse existing code without many changes: the existing storage engine API between the server and the storage engine can be used unmodified. We support several operations in parallel. For sorting, each worker sorts its own rows and the leader merges the sorted streams together. Group by works the same way: each worker computes the groups and aggregates for its own rows and reports them back to the leader, which merges the groups. In some cases the groups do not overlap between workers, and then the merge is trivial. In other cases the same groups appear at many workers, there is more work for the leader, and it does not scale as well. We can also do nested-loop joins; currently MySQL does not support hash joins.
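The scheduling idea above, many more partitions than workers, with each worker pulling the next unprocessed partition when it finishes its current one, can be sketched like this (a toy model, not PolarDB code; the simulated per-partition cost stands in for uneven WHERE-clause selectivity):

```python
import threading
import queue
import random

def run(num_workers, num_partitions):
    # A shared queue of partition ids; workers pull from it until empty.
    todo = queue.Queue()
    for p in range(num_partitions):
        todo.put(p)
    processed = [0] * num_workers  # partitions handled per worker

    def worker(i):
        while True:
            try:
                part = todo.get_nowait()
            except queue.Empty:
                return  # no partitions left: this worker is done
            # Simulate uneven per-partition cost.
            _ = sum(range(random.randint(1_000, 50_000)))
            processed[i] += 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed

counts = run(num_workers=4, num_partitions=400)
print("partitions per worker:", counts, "total:", sum(counts))
```

With 400 partitions and 4 workers, the per-worker totals come out roughly even regardless of how cost is distributed, which is the point of over-partitioning: no worker sits idle while another grinds through one oversized slice.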
Nested-loop joins are the only kind available, and we parallelize them by partitioning the first table, the driving table; each worker then joins its rows with all the other tables and reports the results back. This works well if the driving table is a good table for partitioning, but in some cases the first table is just a small table with no useful partitioning, and supporting that kind of plan is something we are working on for the coming months. So how do you use it? There is a session variable, max_parallel_degree, that sets the maximum number of workers a query may use. If you set it to, say, 64, a query can use at most 64 worker threads. It may use fewer if the optimizer decides that using that many is not efficient, and if it decides that, for example, the table looks too small for parallelization, you can still force parallel execution by setting a force-parallel mode. We will also support hints, so that for a specific query you can say that it should be executed in parallel, or in parallel with a given number of workers. Such a hint overrides any decision made by MySQL about how many workers to use. One thing I forgot to say about joins: we have not changed the optimizer yet, so we just use whatever query plan the optimizer originally decided on. That may be the best plan for serial execution, but not necessarily the best for parallel execution, for example if it puts a small table first. So you can also use hints to change the join order to something better suited for parallel execution. But the goal is that the optimizer should eventually produce plans that are well suited for parallel execution.
If you want to know whether a parallel scan will be used for a query, you can use EXPLAIN, and it will say how many workers it plans to use. Now a little bit about performance. We have run the DBT-3 queries; DBT-3 is an open-source implementation of the TPC-H benchmark, which consists of 22 queries of the decision-support type. This is query 6, a single-table query where you sum up the revenue for a certain period with a certain discount, and so on. We have tested this on different sizes of the TPC-H database, at scale factors 5, 10, 20 and 40. You can see here that it scales pretty well compared to the other line, which shows ideal scaling: the time is divided by two every time you double the number of threads. It goes down, for example, from 89 seconds in the scale factor 5 case to 3.4 seconds with 32 degrees of parallelism; this is on a 32-core machine. It is similar for DBT-3 query 1, which is also a single-table query: the execution time is divided by 29 when you use 32 threads, so it is close to linear scalability. Why is linear scalability so important? Because as your business grows, you want your performance to keep up. If it scales linearly, then when you get twice as much data and add twice as many cores, you still use the same time to execute the query. In this case, going from scale factor 5 to 10 to 20, we keep the same execution time if we proportionally add cores. So you either get faster by adding more cores on the same data, or you keep the same performance when both the data and the cores grow. We can show that this scales linearly also for many of the join queries. Here is an example, query 12, which is a two-table join, I think.
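For concreteness, here is the arithmetic behind the query 6 numbers quoted above: the speedup from the serial and parallel times, and the parallel efficiency (speedup divided by thread count, where 100% would be ideal linear scaling).

```python
# Query 6, scale factor 5, as quoted in the talk.
serial_time = 89.0    # seconds with 1 thread
parallel_time = 3.4   # seconds with 32 threads
threads = 32

speedup = serial_time / parallel_time
efficiency = speedup / threads
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
# speedup 26.2x, efficiency 82%
```

An efficiency in the 80 to 90 percent range is what "close to linear scalability" means in practice; query 1's quoted factor of 29 on 32 threads corresponds to roughly 91%.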
These are the results for the DBT-3 queries that we are currently able to execute in parallel. Not all of them show good speedup; some do, like queries 1, 6, 9, 12, 14 and 19. One issue, which is not particular to parallel execution in MySQL, is that with a non-covering index scan there is a bottleneck on the root page of the primary key index. You scan the secondary index, and because secondary indexes in MySQL use the primary key to refer to the row, every row lookup navigates the primary key B-tree from its root page, so many accesses hit the same page, and that is a bottleneck. You see the same problem even without parallel query, if many different users do such index scans at the same time. That is something we will look into, to get better scaling on index scans. For query 5, there is a small table at the start of the join order, so it does not scale well because of the query plan. It actually scales a little: you now have two threads, the leader and the single worker, and that overlap gives you a speedup of about 1.7, but you cannot go beyond that by adding more workers. For query 10, the problem is that there are very many groups, so all workers produce largely the same groups and the leader becomes the bottleneck, because it has to merge all of them. We are working on a better group-by execution where the workers may do some sorting, so that the merging becomes easier for the leader. Now a short demo. This is not a very powerful computer, but at least it has four cores, so you can see the principle. First I set the parallel degree to one, and then I have this query, query 6 from TPC-H. If I run it like this, it takes 2.9 seconds, and I run it again just to show that the timing is stable.
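The group-by improvement mentioned for query 10 can be sketched as follows. This is an illustration of the general technique, not the actual PolarDB implementation: if each worker sorts its (group key, partial aggregate) pairs before sending them to the leader, the leader can do a cheap streaming k-way merge instead of re-grouping everything in one big hash table.

```python
import heapq
from itertools import groupby

# Each worker's output: groups with partial SUMs, already sorted by key.
worker_outputs = [
    [("a", 3), ("c", 1)],
    [("a", 2), ("b", 5)],
    [("b", 1), ("c", 4)],
]

# Leader: streaming k-way merge of the sorted worker streams, then
# combine adjacent rows with the same group key.
merged = heapq.merge(*worker_outputs)
final = [(key, sum(v for _, v in rows))
         for key, rows in groupby(merged, key=lambda kv: kv[0])]
print(final)  # [('a', 5), ('b', 6), ('c', 5)]
```

The leader's work per row is then a constant-size heap operation rather than a hash-table probe over all groups, which matters exactly when, as in query 10, the number of groups is very large.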
I have a large enough buffer pool here to keep all the data in memory. If I set the parallel degree to 4 and run the query again, you can see it takes less than a third of the time. That is not perfect scaling, but this computer is doing other things at the same time, so you cannot expect perfect scaling here. We also have some statistics, so if I look at the workers you can see that each of them spends about the same amount of time. So that is the demo. As I said, this is work in progress, so we currently have some limitations. Only SELECT queries are supported. The parallel scan is only on the driving table of a join, as I said, so it does not scale well for some plans; we can change the join order with hints, but the plan is not always optimal. It only works for InnoDB, and I am not sure we will do anything about that. Subqueries are not currently executed in parallel, and that is why I only showed 7 of the queries: the rest of them use subqueries in some way. We are starting to look into supporting that now. Some subqueries can be materialized, so you can execute them in parallel first and then use that temporary result while executing the rest of the query in parallel. Others will just be pushed down, and each worker will execute the same subquery, because it depends on the rows of the outer query. We want to support window functions, which are pretty important for analytics, and we also do not yet support spatial data types or JSON, so there is a lot more to do here. We also do not support locking selects, and I am not sure we will be able to do that easily: it is inherent in the way InnoDB implements the serializable isolation level that it is difficult to parallelize, so I think the MySQL team at Oracle would need to change something in InnoDB before we try that.
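The two subquery strategies just described can be illustrated with a toy example (invented data, not PolarDB code): a non-correlated subquery is computed once, possibly in parallel, and the materialized result is shared by all workers, whereas a correlated subquery would have to be re-evaluated by each worker for every outer row it scans.

```python
# Toy "orders" table: (order_id, amount).
orders = [(1, 100), (2, 50), (3, 300)]

# Non-correlated subquery, as in:
#   SELECT * FROM orders WHERE amount > (SELECT AVG(amount) FROM orders)
# The inner SELECT does not depend on the outer row, so it can be
# materialized once up front...
avg_amount = sum(a for _, a in orders) / len(orders)

# ...and every worker then just reuses the constant while scanning
# its own partition of the outer table.
big = [o for o in orders if o[1] > avg_amount]
print(big)  # [(3, 300)]
```

A correlated subquery (one whose predicate mentions columns of the outer row) offers no such one-shot materialization; pushing it down to each worker keeps it correct, but the subquery cost is then paid per outer row, per worker.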
So, what would we like to do next? We have done the part I have talked about. We want to support more parallelism, for example in subqueries, and optimize the performance of what exists. We would also like more diagnostics support; what we have in the performance schema is still pretty simple. The next major stage is to support more than a single gather operation, so that you can redistribute rows between workers using what we call an exchange operator. We also want parallel hash join, because hash join is what is most effective for many of these parallel executions. As far as I know, the MySQL team at Oracle is working on hash join, so when that is ready we will start looking into how to parallelize it. The end goal is an executor that really takes advantage of all of these techniques. That's all. Are there any questions?
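(As a closing footnote, the "exchange operator" mentioned in the future-work list can be sketched in a few lines. This is the generic technique, not PolarDB's implementation: rows are redistributed between workers by hashing a key, so that all rows with the same join or group key end up at the same worker and can be processed locally.)

```python
def exchange(rows, key, num_workers):
    # Hash-partition rows into one bucket per downstream worker.
    buckets = [[] for _ in range(num_workers)]
    for row in rows:
        buckets[hash(key(row)) % num_workers].append(row)
    return buckets

# Toy rows: (join_key, payload).
rows = [(1, "x"), (2, "y"), (1, "z"), (3, "w")]
buckets = exchange(rows, key=lambda r: r[0], num_workers=2)

# Rows sharing a join key always land in the same bucket, so each
# downstream worker can join (or group) its bucket independently.
same = next(b for b in buckets if (1, "x") in b)
print((1, "z") in same)  # True
```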