Hi, how's the mic? Good. OK, I'll probably move, and then it'll probably be bad. So just say something if I get quiet all of a sudden. That doesn't usually happen, though. So I'm Robert Haas. I work at EnterpriseDB, which is a lot more yellow and a lot less blue than that slide. It's actually really supposed to be orange and black, but it isn't. And at EnterpriseDB, I am the chief database architect. And in the community, I am a PostgreSQL major contributor and committer. And I'm here today to talk to you about next generation parallel query. As many people here probably know, 9.6 was the first release of PostgreSQL that had any kind of user-exposed parallel query feature. And in 10, we have quite a few enhancements. And I'm going to talk about both some successes and some failures today. I think in some ways, I'm very happy about what's coming in PostgreSQL 10, because I think we've really made a lot of very significant enhancements over what was present in 9.6. In other ways, I'm sad, because six months ago, I thought that we had solved a large fraction of the problems that there were. And now I know that that was totally not true. And there are many more problems that are still left to be solved, which is a little disappointing, because I like to be done with a project eventually. But on the plus side, if there's still more work to do, maybe they'll keep paying me. So I'm going to talk about parallel scans a bit, then parallel joins, then parallel aggregate. And in each of those categories, I'm going to talk a little bit about what has been done and what remains to be done. I'm going to talk about a few miscellaneous improvements that have been made in this release. I'm going to discuss some TPC-H results, and I'm going to talk a bit about future work. And feel free to pipe up with a question, as long as we don't get tight on time. I have 25 slides, and we have 50 minutes. So if I'm going faster than two minutes a slide, then there's probably a little bit of buffer in the schedule. And if we run out of buffer, I'll let you know. So when you think about accessing data via SQL, the most basic thing that you can do is just access data from a single table. You scan the table in some way. And PostgreSQL has a bunch of different ways to do that. Obviously, the simplest method is a sequential scan, which we abbreviate to seq scan. And then you could also maybe want to not do that. So you can also access data in a table via an index scan or via an index-only scan or via a bitmap heap scan. So an index scan means that we go to the index and we say, give us the first tuple that matches the index conditions. And then when we get it, we say, give us the next tuple that matches the index conditions. Next one, next one, next one, next one, until there are no more. An index-only scan is the same thing except we attempt to avoid accessing the heap, because we think that all of the data that we need is going to be present in the index tuple. And so there won't be any need to touch the table data itself. A bitmap heap scan is one of my favorite kinds of scans, based on my previous experience as a PostgreSQL user rather than a PostgreSQL developer. Basically, what we do is we do a bitmap index scan, which means we zip through the index and find all of the heap blocks that we need to touch. And then we zip through the heap in order, so that we hopefully get a mostly sequential IO pattern. And we look at each one of the tuples that the index pointed us to.
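To make those concrete, here's roughly how each of those scan types shows up in EXPLAIN. This is a sketch against a made-up table, and whether the planner actually picks each of these plans depends on your data and statistics:

    -- Hypothetical table; the plan shapes in the comments are illustrative, not real output.
    CREATE TABLE orders (id int PRIMARY KEY, customer_id int, total numeric);
    CREATE INDEX orders_customer_idx ON orders (customer_id);

    EXPLAIN SELECT * FROM orders;
    --  Seq Scan on orders

    EXPLAIN SELECT * FROM orders WHERE id = 42;
    --  Index Scan using orders_pkey on orders
    --    Index Cond: (id = 42)

    EXPLAIN SELECT id FROM orders WHERE id < 1000;
    --  Index Only Scan using orders_pkey on orders
    --    Index Cond: (id < 1000)

    EXPLAIN SELECT * FROM orders WHERE customer_id = 7;
    --  Bitmap Heap Scan on orders
    --    Recheck Cond: (customer_id = 7)
    --    ->  Bitmap Index Scan on orders_customer_idx
    --          Index Cond: (customer_id = 7)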
So it essentially sorts the IO against the table, which is important when your tables are big and the index is accessing a lot of rows. And then we can also combine data from multiple indexes using BitmapAnd and BitmapOr operators. The importance of table scans from the point of view of parallel query is that one table scan somewhere in the query, the driving table, has to be scanned using a special kind of scan that is parallel aware. And that special scan essentially partitions the data. So wherever the parallel scan is, there are gonna be several workers running, and each one of the workers is gonna get a subset of the rows from that table. And together they're gonna get all the rows from that table that match your search criteria. And so if a particular kind of scan is not parallel aware, then it cannot be the driving scan for a parallel query, which is sad. So how are we doing? Well, parallel seq scan is the first thing that we did, because it was the easiest, and that's fully supported from 9.6. And in 10, we have several enhancements coming which I think are very good. We now support index scans and index-only scans on B-tree indexes as the driving scan in a query. And we also support bitmap heap scan in parallel, but not bitmap index scan. So what happens is, if a bitmap heap scan is chosen as the driving scan for your query, then the index scan will be performed by one process, which will build the bitmap, and then all of the processes can cooperate and jointly iterate through the bitmap, visiting the heap blocks. So this is great if the index scan is short and the heap scan is really long, but it's not so great if the index scan is a large percentage of the time that that access is going to take. The nice thing about this bitmap heap scan approach is that because we're not relying on any parallel support in the index access method, we're able to do this for any index access method. It doesn't care whether it's B-tree or BRIN or anything like that. And BRIN would be a very good candidate for this, because the index scan portion for a BRIN index should be very, very small, because BRIN indexes are tiny. Eventually, obviously, it'd be nice to combine these things, so we could have the index scan also be parallel. That unfortunately requires specific work to be done for each index access method. We did B-tree for this release. That's obviously by far the most important one, because there are more B-tree indexes out there than anything else. But it would be nice if someone wanted to do the work to extend this to GiST and GIN and SP-GiST and all the other access methods that we have out there. Questions on table scans? Okay, so next up is joins. So there are three join algorithms available. We've got nested loop, hash join and merge join. And nested loops kind of come in two flavors. So the normal thing when you get a nested loop is what's called a parameterized nested loop, which means that one side of the nested loop isn't really scanning all of the data. It's something like an index scan driven by a value from the other side. So you're joining two tables, and you're scanning along one table sequentially. For each row you find, you pull one column value out of there, and you feed that to the index scan on the other side of the nested loop, and you go look up that value and see if you find any matches. That's a parameterized nested loop, because the value that goes from the one side to the other side is a parameter.
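In plan terms, a parameterized nested loop looks roughly like this. This is a sketch with hypothetical tables; the Index Cond referencing the outer column is the parameter being passed across:

    -- Assumes a customers table with primary key id; plan shape is illustrative.
    EXPLAIN SELECT *
    FROM orders o JOIN customers c ON c.id = o.customer_id;
    --  Nested Loop
    --    ->  Seq Scan on orders o
    --    ->  Index Scan using customers_pkey on customers c
    --          Index Cond: (id = o.customer_id)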
You could also have an unparameterized nested loop, and one of my colleagues, when they saw a draft of these slides, said, what's an unparameterized nested loop? Right, well, they don't happen that often, but the way that they happen is if you say something like: you select from two tables, but your join condition is something really strange, like this column mod that column from the other table equals zero, right? Then for every row on one side, you've actually got to scan all of the data on the other side and check which rows match. People tend not to write these conditions very often, because they're (a) slow and (b) not necessarily that useful, but they do exist. So how are we doing on joins? Well, first a little bit of general philosophy here. It's easy to perform a join "in parallel", and I put that in quotes here for a reason, by giving each one of the workers some of the rows from one side of the join and all of the rows from the other side of the join, right? So if I need to find all of the rows where a.x equals b.y, I've got all of b over here and part of a over here, and for my part of a I do the join. I check for matches on the other side. But I can't partition both sides, right? Unless I've got some trick, because if I only see a subset of the rows from this side and a subset of the rows from this side, then the matching rows might never meet up. I'll miss answers to the join. So we can do a kind of, you know, version one parallel join by doing this. There are some problems with this, which will become apparent in the following slides. So, going through the join types one by one: parallel nested loops are supported in 9.6 and also in 10. Parameterized nested loops actually work very well with parallelism, right? If you have an index scan that's being driven by a value from another table and you do that in parallel, it's fine. I get my values and I look them up, and you get your values from some other rows and you look those up, and everybody's happy and things are great. Unparameterized nested loops are also supported, but of course they're pretty inefficient: whatever data is on the non-partitioned side, every worker's gotta scan it repeatedly. If it's not just a plain table but some kind of intermediate result set that's expensive to build, every worker has to build its own copy of that intermediate result set, which is not so hot. That's probably not a huge problem in practice, because when I say unparameterized nested loops, everybody says, huh? Which means you probably don't have very many of these, and if they don't work great, it's maybe not the end of the world. Obviously, it'd be nice to do better. So, hash joins. We've got these in 9.6, and they'll still be there in 10. Each worker builds a separate copy of the hash table, which is not great, right? If the table that we're hashing is small, it's fine, right? You've got 10 rows or 100 rows or even 1,000 rows, and you load them into a hash table, and then you scan a subset of the other side of the join, and you do a probe for each row, and everything's great. If you've got a million rows that you need to put into a hash table, suddenly things are not so great, right? You're doing a lot of work.
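The plan shape in question looks roughly like this. Note that the Hash node sits below the Gather, so every worker builds its own copy of it; only the outer scan is actually partitioned (an illustrative sketch, not real output):

    --  Gather
    --    Workers Planned: 4
    --    ->  Hash Join
    --          Hash Cond: (a.x = b.y)
    --          ->  Parallel Seq Scan on a    -- rows of a are split across workers
    --          ->  Hash
    --                ->  Seq Scan on b       -- every worker scans and hashes all of b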
Every worker is doing that work repeatedly to construct its own copy of the hash table, and it can easily get worse, because it may be that you're building a hash table not on rows read out of a base table, but that you're hashing the result of some aggregate which scans a terabyte of data and then aggregates it, and the result of that is what's going into the hash table. The planner is probably not gonna pick that plan, because it would be very awful, but you'd clearly like to do better. You really don't want everybody to be building their own copy of the hash table, because what it effectively means is that only one side of the join is parallel. On one side we partition the data, but on the other side everybody is doing all of the work over again. So we have a proposed patch series to fix this. It does not look like it's going to make the cut for version 10, unfortunately, because there are still some issues remaining that need to get fixed. Obviously, that's a little disappointing, because this would have been really nice to have, but I think that one's probably gonna have to wait for 11. And then the last kind is merge join. In version 9.6 we just never did merge joins in the parallel portion of a query. In version 10 we do allow them, but they've got the same problem as the hash joins and the unparameterized nested loops, which is that every worker is building one side of the join in its entirety and joining that against a subset of the other side. And that's not great, right? So we don't have patches for this yet. We need some kind of a solution to this problem of how to do merge joins in parallel that is better than what we've got. And if you think about it a little bit, it's actually kind of a tricky problem, because while a shared hash table is a pretty obvious solution to the problem of how to do a hash join in parallel, how to do a merge join in parallel is not as obvious, right? It feels like it's sort of an inherently serial algorithm. You've got two sorted sets of values, and you're trying to go through and line them up as you move through the data. So what are you gonna do there? We have some ideas; we do not have code, though for one idea we actually do have code: partition-wise join and parallel append. Again, this is probably not something that's gonna end up in 10; the patches are written, but they do have some issues left. So here the idea would be that you manually partition your data. This first idea is not something the system would do for you, but you would partition your data, and then you would join between two tables that had a compatible partitioning scheme. And with those patches, the planner and the executor would be smart enough to break down your one big merge join into a bunch of smaller joins, one per partition. And of course, even if you can't parallelize each one of those, you can do multiple ones at the same time, right? If I've got my table divided into 40 partitions, then instead of doing one big merge join of all the data in one table against all the data in the other table, if I'm joining on the partitioning key, I can instead do 40 little joins: this partition to this partition, this partition to this partition, and so on. So that's our first approach to this problem, and we do have patches which will support that, but they are not quite ready to go into 10, unfortunately.
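Sketched out with the declarative partitioning syntax that's coming in 10, the setup would look something like this. The tables are hypothetical, and the partition-wise join behavior itself is the part that's still a proposed patch:

    -- Two tables partitioned identically on the join key.
    CREATE TABLE t1 (k int, v text) PARTITION BY RANGE (k);
    CREATE TABLE t1_p1 PARTITION OF t1 FOR VALUES FROM (0) TO (1000);
    CREATE TABLE t1_p2 PARTITION OF t1 FOR VALUES FROM (1000) TO (2000);

    CREATE TABLE t2 (k int, w text) PARTITION BY RANGE (k);
    CREATE TABLE t2_p1 PARTITION OF t2 FOR VALUES FROM (0) TO (1000);
    CREATE TABLE t2_p2 PARTITION OF t2 FOR VALUES FROM (1000) TO (2000);

    -- With the proposed patches, a join on the partitioning key...
    SELECT * FROM t1 JOIN t2 ON t1.k = t2.k;
    -- ...could be broken into one smaller join per partition pair
    -- (t1_p1 with t2_p1, t1_p2 with t2_p2), run concurrently via a parallel append.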
So that's okay, but it requires some explicit user action, right? You have to partition your data, and depending on what the situation is, it might not be easy to find a partitioning scheme that's good for this and also good for other things that you wanna do. Yeah, right. So the currently proposed patch would require the partitioning schemes to match exactly. What you're talking about is, I think, one of the things that we need to think about, either as a revision of what that patch does or as a future step or something, right? Because that clearly is an issue. I mean, part of the problem is you don't wanna spend a lot of time pursuing planner optimizations that are not likely to work out well at execution time. You don't wanna spend so much time planning a query that you've wasted more time than the optimization was worth at execution time. But on the other hand, I think that's a real thing that you're talking about, and you don't wanna miss those optimization opportunities either, so that's a research area. Yeah, so beginning in PostgreSQL 10, we're going to have declarative partitioning, where you can actually say create table ... partition by range or partition by list, and then you can say create table blah partition of blah for values from blah to blah. In previous releases, we didn't have that, obviously, and you had to sort of do partitioning in an ad hoc way using constraint exclusion and table inheritance and all of that fun stuff, but that process is getting considerably streamlined in 10, which is what makes this kind of optimization possible. Okay, so if you can't or don't want to partition your data in a way that makes your merge joins fast, then you'd like something that lets the system do it for you. So a second idea is we could break up the merge join into a bunch of smaller merge joins using hashing. So we hash every row on both sides, right? And we take the low, let's say, four bits of the hash code, and that puts every row that we're interested in into one of 16 bins. And then we do the same thing with the relation on the other side, and now we've once again broken up one big merge join into a bunch of smaller merge joins. It's an idea. Another idea that we're thinking about is a sort of parallel materialize node, where on one side of the merge join the data is already partitioned, and on the other side of the merge join you insert this parallel materialize node, and under it there's some partial plan where each worker will generate some of the rows that need to be seen on that side, and they go into some kind of a tuple store that everybody can read from. So then everybody still has to read through all of those rows, but the parallel materialize node means that all the workers can at least cooperate in generating those rows. So that would help when one side of the merge join is something like another join. You do that join in parallel, and the output of that join feeds into this parallel materialize node, which then allows the merge join to work correctly against some other parallel thing. That could be further optimized with something called a skip scan. So right now, imagine you're doing a merge join and you have an index scan on both sides, right? Index scan on one side, index scan on the other side. What we basically do right now is we see these two index scans, and whichever side's current value is less, we pull another tuple from that side, right? Which makes sense, right? This one's less, this one's less, this one's less.
Now this one's less, this one's less. We keep pulling a tuple from the side where the value is lower and matching up equal values as we go. However, that doesn't work well if one side skips way ahead, right? So imagine on one side we have the numbers one to a million, and on the other side we have the numbers 1,000, 2,000, 3,000, 4,000, 5,000, 6,000 and so on. What happens today in Postgres's merge join, and this is true with or without parallelism, this isn't specifically a problem for parallel query, is we read one on one side and 1,000 on the other side. And then we look at the guy who gave us one, and we say, your value is smaller, give us the next one. And he says, two. Give us the next one. Three. Give us the next one. Four. Give us the next one. Five. And we just keep doing that, right? And eventually we get up to 1,000, and then we're like, oh yay, a match, and we emit the match. And then we do this again, right? It would be much nicer to tell this guy who just gave us one, hey, I don't really care about anything less than 1,000. Could you just fast-forward to 1,000, right? So that would be a good optimization even for non-parallel merge joins. And I think for parallel merge joins, it might help even more, because we have this problem of wanting to make sure that we're not repeating work, right? And having every worker pull every row from one side of the join is not as good as having it somehow jump forward to the next interesting part of the dataset. This is pretty vague, and that's because I don't really know how we're gonna solve this problem yet, but I've become convinced that it's an important problem to solve. Yeah, right. So if you knew how the values were distributed, you could try to chunk up the two sides into ranges and then perform smaller merge joins on the ranges. The question is, there are some fuzzy details about how exactly you do that, and doing it based on statistics is one of those things that might work out for you, but it might also go badly wrong, because we only know what the statistics are for the base table. By the time that table has been filtered and maybe joined and aggregated and so on, you may have statistics that aren't very representative. So you have to be careful not to overcommit to a design that assumes the stats are accurate. Well, there is some overhead that's intrinsic to parallel query. So if you only have one worker that's really doing any work, then you're behind in some sense. Also, like, I think people will sort of be unhappy with that on a metaphysical level, right? Like, you don't wanna see two or three or four workers start up and then have all of them except one just kind of sit there and not do anything. People are gonna report that as a bug. And I don't blame them, yeah. Exactly, right, yeah. So for skip scans, going up a level from the bottom, if you're doing a skip scan and the thing you're trying to skip through is a B-tree index, as Peter is pointing out, there's a very natural way of doing that. Okay, so aggregates in PostgreSQL. We've got three kinds of aggregates, and these are the names that are used in the query plan, so they must be perfect. If there's only a single group that's being produced, we just call that an aggregate. That's like if you do select count(*) from a table, right? There's one group that's coming out.
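In a plan, that single-group case shows up as just a bare Aggregate node; roughly, against a hypothetical table:

    EXPLAIN SELECT count(*) FROM orders;
    --  Aggregate
    --    ->  Seq Scan on orders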
Many times you have multiple groups, and then you have to pick between a hash aggregate and a group aggregate. With a hash aggregate, we just throw all the data into a hash table, and we accumulate the transition state for each group within one hash table entry, and then at the end, we have a complete hash table with all the data in it that we need, and we dump out the results. The other option is to do a group aggregate, where we sort by the grouping key. And then we have all the values in a particular group next to each other, and we just aggregate one group, aggregate the next group, aggregate the next group and so on. So that's generally what the strategies for aggregation are. Now, how does that play with parallel query? Well, the way we're able to do this today is like this. We do what we call a partial aggregate, which can be done using any of those three strategies: it can be a partial aggregate, a partial hash aggregate or a partial group aggregate. And it reads from a partial plan. So each worker is running its own separate partial aggregate, and it builds its own separate set of groups. And then those groups are fed up to the leader through the gather node, which then performs a finalize aggregate step, because if more than one worker created a group for the same value, you have to merge those groups before you get your final answer. How well does this work? No hands? Well, it works pretty great if the number of groups is small, because you don't have to do very much IPC. If the number of groups is large, it's not great, because if you're getting lots and lots of duplicates that then have to be merged, it's not fantastic. It often still wins. It's often still significantly faster than not doing parallel aggregation, but you don't get the kind of speed-up you'd like. One of the things people used to complain about in PostgreSQL was that select count(*) from a table is really slow. It got noticeably faster when we got index-only scans, because it turns out that in some cases, scanning the narrowest index to count all the tuples is a fairly fast way of counting the tuples in the table. And it got a lot faster still when we got this, because you just do a sequential scan of the table, and you've got four workers or whatever, and each one chunks through as fast as it can counting them up, and then you've got four subtotals and you total them up. So it's great for that kind of a case. When you've got a million groups, less great. Still usually faster, but not as much faster as we'd like. But the thing that really stinks about this is when the aggregate is not at the top of the plan. You see what happens here: I do my partial aggregate, then I do my gather, and then I do my finalize aggregate. Now I need to do a hash join, but I cannot parallelize the hash join, because I'm already above the gather, right? The portion of the plan that runs in the workers is the portion that's below the gather node. And here I have to use up my gather node to get the aggregate done. And now I'm out of parallel mode, because the leader has to perform the finalize aggregate step, and that means this hash join cannot be done in parallel. Now, it could be that under the hash there's a second gather node, and that hash table is actually being built in parallel, but that's parallelizing the build of the data that's going to go into the hash table. That's not parallelizing the hash join itself.
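The shape I'm describing looks roughly like this (an illustrative sketch with made-up table names, not real EXPLAIN output):

    --  Hash Join                       <- above the Gather, so it cannot run in parallel
    --    ->  Finalize HashAggregate    <- leader merges per-worker groups
    --          ->  Gather
    --                Workers Planned: 4
    --                ->  Partial HashAggregate
    --                      ->  Parallel Seq Scan on fact_table
    --    ->  Hash
    --          ->  Seq Scan on dim_table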
This is one of these problems where, when I initially saw the design of parallel aggregate that David Rowley put together, I was like, yeah, sounds good, great design. And then it took a while to understand that this was actually not good if you have this kind of a plan. I mean, the good news is, many times your aggregates are actually at the top level of the query, and you're not really doing anything after that, or it isn't that expensive. But if you have this kind of case, then things are not so great. What we've been thinking about is it would be nice if, instead of separating the partial aggregate phase from the finalize aggregate phase, you had just a parallel aggregate that did the whole aggregate, but somehow everybody talked to each other so that you didn't get wrong answers. It's pretty easy to see how to make this work with a hash aggregate. You have all of the workers read a subset of the rows, each worker puts the rows that it sees into a shared hash table, and then everything just works, right? More or less; if you run out of memory, then you have some problems. But it's pretty easy to see how to do a hash aggregate entirely within the parallel portion of the plan. You just need a shared hash table to do that. It is less clear how to make this work with a group aggregate. That's tricky. But this would be cool, because then, if we had some kind of parallel aggregate node that could run as an intermediate step underneath the gather, it would combine with the parallel shared hash thing that I talked about earlier and let you parallelize the heck out of everything. Yeah. I think it is very similar to the merge join problem. Maybe not identical, but it's in the same ballpark. Questions on this? You mean ad hoc partitioning on the fly? I don't have a plan for that. I might have a plan for that sometime, but I don't have a plan for that today. Okay. Fair enough. Any other questions on this? I know this is getting a little bit deep into the nitty-gritty of how queries actually run, which may not be familiar to everybody, so I hope I'm not leaving too many people behind. That is not true even on 9.6. You can have parallelism on child tables. Yeah, you can get a plan where you have an append node with a parallel sequential scan per child underneath it. That does work. Okay. So we've got a few miscellaneous improvements that we did here in version 10. Gather Merge: if each worker has a sorted stream of tuples and they're all sorted in the same way, Gather Merge will bring those back to the leader while preserving the sort order. And that combines very nicely with some of the other ordering-related things that we now have, like parallel B-tree index scans and parallel merge joins. Uncorrelated subplans can now be executed in workers, though each worker repeats the work, just as in the join case; but it's better than nothing. Access to parallel query from PLs has now been improved. So if you're using procedural languages to issue queries and you're wondering why those queries are not actually using parallel query, that's been improved significantly in version 10. And we now pass the query text down to the workers. So if you look at pg_stat_activity, the workers, beginning in 10, will actually tell you what query they are involved in running, rather than saying, yeah, I'm a parallel worker for something. Okay, so I want to talk a little bit about some TPC-H results here. These were run by my coworker Rafia.
They were done at scale factor 20, which means 20 gigabytes of data. She did add some extra indexes. We ran these tests on a POWER server which IBM provided to EnterpriseDB; thanks, IBM. Some of the other details of the configuration are up there. This was done on a commit from before Andres's faster expression evaluation stuff went into the tree; this is from about March 9th. So the results are not distorted by the things that he sped up. This is mostly just parallel query related changes. So these are the queries, out of the 21 queries on TPC-H, where the plan changed, and there's a note about what we used. And I'm just gonna pause here for a sec. I have a few more slides to get through, so I can't pause here too long, but I'll give you a sec to drink that in, and then I'll talk about what happened here. So from that chart, you can see that the biggest wins here by far came from parallel index scan. There are a number of queries in the TPC-H query set that seem like they can't be sensibly executed in any way other than an index scan on the driving table. So when that can be done in parallel, then you get big speed-ups. How much of a speed-up you get depends on what else is in the query. So query six is basically nothing but an index scan. That's why it gets like four times faster: you have four workers, and it scales well enough up to that point, so it gets about four times faster. Query 14 has an index scan on the inner side of a hash join. So the outer side of that... oh, I have inner and outer backwards there. So the index scan gets faster, because now it can be parallelized, but for the hash join, everybody still has to build their own copy of the hash table. So that does not improve, but you still see a 2x performance improvement on that query. Query two only manages to use a parallel index scan for a small part of the complex query, but that small part gets a lot faster, so you see a significant speed-up there. The relaxation of subplan-related restrictions also helped a lot. That was query 16: allowing the uncorrelated subplan to be part of the parallel portion of the query, even with each worker re-executing it, bought us like 25% on that query, which is not bad. Gather Merge got used a lot in those results. It didn't always help: on queries three, four, five, seven and eight, it really made no difference at all. I think a lot of the variation that you saw in those queries is actually just random fluctuation from run to run, rather than any actual performance change. Some got faster, some got slower. There were three queries where it helped significantly. On query two and query 18, if you hadn't had the Gather Merge, you probably wouldn't have been able to use the parallel index scan either, because the parallel index scan needed to provide the data in a sorted order for a subsequent merge or aggregation step that was depending on the data being ordered. On query 17, Gather Merge avoided an expensive sort. Query 10, it was pretty much a wash. Actually, the query 10 number... yeah, that probably shouldn't say query 10, but I don't remember which query it is. Well, 12 is the one where it behaved really badly. 10 behaved badly, but for a different reason. The regression on 10 is actually not Gather Merge's fault, and I want to get to why that actually did happen, because it's sort of interesting here. Parallel bitmap heap scan did not work out very well on this benchmark.
It was not treated kindly by this particular set of queries. On query 10, that gigantic regression that you saw happened because the parallel bitmap heap scan was slow, but the thing is that if we'd done a non-parallel bitmap heap scan there, that would have been slow too. The winning choice is actually to use a parallel sequential scan. So 9.6 gets it right, because the only kind of parallel thing it knows how to do is a parallel sequential scan, so it cannot mess it up. There are two underlying issues here in terms of why it picks this plan. The planner thinks that the cost of a bitmap heap scan, whether parallel or not, does not depend on how much memory you have available, and that is false, because we are building a bitmap which is limited in size by work_mem. And if you don't have enough memory, then it does what's called lossifying the bitmap, which means that the data stored in the bitmap becomes less precise: it no longer remembers which tuples it needs to look at, but just which pages it needs to look at. It turns out that hurts a lot, and the planner does not care. So if you give that query a lot more memory, so that the bitmap does not lossify, most of the regression goes away; you still lose like three seconds if you fix that problem. The remainder of the problem happens because the parallel bitmap heap scan is essentially a gamble: that if we spend a bunch of time not in parallel scanning the index, we can save enough time later when we're scanning the heap that we'll still be better off than if we just parallel sequential scan the crap out of the entire table. And so it matters what the cost of the index scan is estimated to be versus the cost of the heap scan, and the planner is just wrong about what the relative cost of those two operations is. So despite this, it might be possible to fix this one by tweaking the costing settings a little. I suspect it is, but obviously there's work to be done there. I think the biggest thing is we really need to fix the planner so that it knows that lossifying the bitmap hurts a lot. Query four was the other one where parallel bitmap heap scan got picked. It actually does succeed in making that particular operation a lot faster. But for some reason the surrounding join doesn't get faster, even though the scan that is driving it gets faster. I have not yet been able to figure out why that happens, but you can see what's happening here. For example, in the 10devel results, the parallel bitmap heap scan is 2.2 seconds, but the nested loop join takes 11 seconds to finish. Now, in the 9.6 plan, it took 5.6 seconds to do the same thing that we now did in 2.2 seconds, but not all of that time translated through. The nested loop semi-join only gets faster by 0.7 seconds, even though the thing inside of it sped up by more than three seconds. I don't know why that is yet either. So, one thing that I've learned, and I'm running short on time now. One thing that I've learned from all of these results is that Amdahl's Law is a big deal, which will maybe not surprise anybody who understands Amdahl's Law better than I did when I started this project. Basically, what we're seeing is that parallel query makes the things to which it applies reliably about four times faster. The problem is the things that it does not apply to. So the work here is making sure that we're able to drive the parallelism down into every single part of the query and leave no parts unparallelized.
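Just to put rough numbers on that, Amdahl's Law says the overall speed-up is

    speedup = 1 / ((1 - P) + P / N)

where P is the fraction of the query that runs in parallel and N is the number of workers. So if only half the query is parallelized, four workers get you 1 / (0.5 + 0.5/4) = 1.6x overall, even though the parallel part itself got four times faster.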
So I think that's gonna drive a lot of future development in this area. As I mentioned, obviously we need to try to fix the way that we do parallel joins so that we get parallelism on both sides of the join and not just on one side. We need to allow parallel aggregation to happen without terminating parallelism at that point in the plan tree, so that the higher parts of the plan tree are not excluded from the goodness of parallel query. We need to do more work on initplan- and subplan-type restrictions. There are several problems that remain there that limit the ability to use parallel query in queries that involve those constructs. And of course, the limitation I mentioned in regard to bitmap scans is very relevant here. We can parallelize the heap scan, but we can't parallelize the index scan at the same time. We've got parallel index scan, or parallel bitmap scan where the heap scan is done in parallel but the index scan is not. We obviously need to put those two things together, so that we can have everything parallel at the same time. There's a bunch of other things to improve too. I think a lot of those things are actually planner problems. One of the things that we have discovered this release cycle is that it really doesn't help you to implement a large number of new parallel operators if they don't get picked in the cases where they win. In addition to the cases that I already mentioned, we've noticed that the costing seems to be off for hash aggregate. So there are queries where a gather merge and a group aggregate would actually execute faster than a hash aggregate, but the planner thinks the hash aggregate is faster and picks it anyway. We haven't really figured out why that happens, but we know it does happen sometimes. Some of the estimates on the TPC-H queries in particular are incredibly bad, like off by many orders of magnitude, and obviously that makes it a little hit or miss whether you get a good plan after that. And the way that we currently select the number of workers is kind of dumb, which doesn't really matter when you're talking about trying to do parallel query with three or four workers. If you're off by a worker here or there, it's not generally a catastrophe. If we wanna move up and do a parallel query plan that involves 40 workers, I think this matters a lot more. If you launch one worker or two workers and it doesn't help that much, it's kinda okay, maybe, depending on what your situation is and how tight you are on resources. If you launch 40 workers when the correct number would have been three, that's more significant. So more work is needed there. So, parallel query in 10 is substantially improved from 9.6, but as you can tell, I've discovered a lot more things that need to be done than what we were able to do. So stay tuned. And that's it. Questions? Thank you, thank you. Right, so the question is about using TPC-DS. I definitely think we need to move in that direction. One of the obstacles that we're gonna face is that there are quite a number of TPC-DS queries that don't even run on PostgreSQL, or that run for longer than I will live. And so probably we need to fix some non-parallel problems around TPC-DS before we start thinking about the parallel-related problems on TPC-DS. But I do agree that's an area that we gotta get to. I think we've really gotta get some of this basic join and aggregate stuff settled down first before we dive more deeply into the complex things like CTEs and window functions and stuff like that. But I totally agree with the point.
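For reference, the main knobs that control this in 10 are a handful of settings, shown here with their default values; tuning these is how you'd experiment with the costing and worker-count behavior I just described:

    SET max_parallel_workers_per_gather = 2;   -- workers the planner may use per Gather
    SET max_parallel_workers = 8;              -- new in 10: cluster-wide cap on parallel workers
    SET parallel_setup_cost = 1000;            -- planner's cost for starting workers
    SET parallel_tuple_cost = 0.1;             -- planner's cost per tuple sent to the leader
    SET min_parallel_table_scan_size = '8MB';  -- smaller tables won't get a parallel scan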
Yeah, so parallelism, like all parts of the PostgreSQL query planner, is cost-based. Okay, go ahead, Peter. A little bit. I can show you some of the plans afterwards if you're curious. Keith. If it was for you to run, each of them needs three, but sometimes running super fast and sometimes it uses a lot of... Right, absolutely. So I think in many ways this question comes down to something that's gonna become a bigger and bigger issue as parallel query becomes better and better. If you're talking about a 2x difference in performance when you get parallel query, maybe it's okay if you don't always get that. I mean, it depends on the situation, obviously. The bigger the difference gets between when you get a parallel plan and when you don't... Even 2x is a big difference. Yeah. Yeah, I think better worker management is a whole different area to explore. I mean, here I was mostly talking about what can we do to make things faster, but being able to make the kind of usability improvements that you're talking about is another area for exploration. Eventually, I would like to get to a point where you can actually have workers move around depending on the workload. So one guy starts up, he gets all the workers, then a second query starts up and half the workers go over there on the fly. But we're a long way from being able to do clever tricks like that right now. I gotta stop because I'm out of time, but if people wanna come talk to me individually, that's cool.