The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

Welcome, everybody, to another Vaccination Database Talk. We're super excited today to have Robert Haas. He's the VP and Chief Database Architect at EnterpriseDB, and he's been a veteran of hacking on the PostgreSQL internals for over a decade now. He's going to talk to us about the stuff he's been working on in the query optimizer. As always, if you have any questions for Robert as he's giving the talk, please unmute yourself, say who you are and where you're coming from, and ask your question. And feel free to do this at any time; we want you to interrupt Robert as he's going along, so that he's not talking to himself for an hour. Okay? Robert, we thank you so much for being here. We really appreciate it. This is a big get for us. We're very excited. So thank you so much.

I'm really excited too. I thought it was a big get for me that you asked me to present. So I guess we're both doing well here. Yeah. All right. Good work. Okay.

So I was excited when you asked me to talk about the optimizer, but obviously you can't talk about the whole query optimizer in an hour. So I decided to talk about two parts of the optimizer that I think are pretty interesting. It's not so much stuff that I've worked on, although one of the two things is something I've done some work on; it's stuff that I've spent some time studying and that I think is pretty interesting. And one of the reasons I think these are interesting topics is that this is stuff that doesn't always work, or doesn't always work as well as we'd like it to. So maybe somebody will have some good ideas that we can use to make some of it better. But if not, at least thinking about problems that haven't been solved yet is more interesting than thinking about the ones that are already solved. And this is going to be kind of high level and oversimplified. It's not possible, even for just these two topics, to really get down into all the nitty gritty in the time that we have. So if there's anyone on the call, or watching this later on YouTube, who is very familiar with PostgreSQL, you're going to be thinking, "but you're leaving things out." And that is true.

The two topics that I picked out to talk about today are join planning and statistics. We're going to start with join planning, which I think will be a little more than half of the talk, and then statistics will be the second part, which will be a little less than half.

The first question about join planning is: why is that interesting? Why are you talking about that? I have three reasons. The first one is that join planning has a big impact on query performance, to the extent that if you get a good plan for the joins that are involved in your query, you're probably going to be reasonably happy with the performance of that query. If you get a bad plan for the joins in your query, the query is probably going to be so slow that you're going to be complaining. That's not always the case. Sometimes you get a bad join plan and you still survive. Sometimes you get a good join plan and it's still bad. But it's a pretty good indicator. It makes a big difference.
The second reason for talking about this topic is that join planning is complicated and expensive. The number of possibilities that need to be considered grows astonishingly quickly as the number of tables being joined increases. We'll talk about that a little more in just a minute, but this is a big consumer of CPU during query planning. And the third reason is that joins are really common. A lot of queries use joins, and a lot of queries that use joins use many joins. It's not particularly uncommon to see a query with 20 or more joins in it, whereas it is pretty uncommon to see a query with 20 GROUP BY clauses, or 20 LIMITs. It can certainly be done, and people do it, but joins happen a lot more.

So why does the number of possibilities that need to be considered grow so quickly? Well, I've got a query on the slide here that I think illustrates the point pretty well. I've got n tables with very creative names, and they all have an id column which has the same meaning in every table. So it's sensible to just say that all of the id columns are equal to each other and join all the tables. How many ways can we join all of these tables? I think it's pretty easy to see that there have got to be at least n factorial possibilities, because we can pick any one of our n tables and use it as the driving table, then join it to any one of the n minus one remaining tables, then join that to any one of the tables we still have left, and so on until we get down to the last one. So that's n factorial possible join orders. But that's actually an underestimate, because it doesn't consider what a comment in the source code calls bushy plans. We could, for example, join a1 to a2, separately join a3 to a4, and then join the results of those joins to each other. That isn't counted in any of the n factorial possibilities, because in those, we're joining the tables in one by one, whereas here they're coming in a clump, two at a time or potentially more. I looked around on the internet for the actual correct formula for the number of join orders over n tables, and I couldn't find one that I was totally convinced was correct. Somebody probably knows what the real equation is. The textbook says it's the Catalan number; it's like four to the n. Okay, could be. All I know for sure is that it grows faster than n factorial, and that's already a very fast growing thing.

And that's just the number of join orders. For each possible join order, we have n minus one joins that are going to happen, and each one could be a nested loop or a merge join or a hash join. So that's three to the n minus one possibilities, which is exponential. Three is actually kind of an underestimate: there are three basic algorithms, sure, but each one of them has multiple variants, so the base isn't really a constant; but if you imagine that it were, it would be larger than three. When you put all of that together, you've got something that grows faster than a factorial multiplied by an exponential. It's a really large number of possible ways that you can join n tables together.
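For reference, the counting question raised here has a standard answer in the unconstrained case where every pair of relations can be joined: treating each join as a binary operator and counting bushy trees as well as left-deep ones, the number of distinct join trees over n relations is

    \#\text{trees}(n) \;=\; n!\,\cdot C_{n-1} \;=\; \frac{(2n-2)!}{(n-1)!},
    \qquad C_k = \frac{1}{k+1}\binom{2k}{k}\ \text{(the Catalan numbers)},

which gives 2, 12, 120, 1680, ... for n = 2, 3, 4, 5. The Catalan factor is the "four to the n" mentioned above, and multiplying it by the n! ways of arranging the tables among the leaves is what makes the total grow strictly faster than n factorial. None of this yet counts the choice of join algorithm at each of the n minus one joins, which multiplies in the further exponential factor just described. Real queries usually come out smaller because not every pair of relations has a join clause, but the shape of the growth is the same.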
So faced with that complexity, you sort of have two options. One is that you could try to give up on some of the possibilities without even thinking about them: figure out which things are less promising and just not consider those. And the other thing you could do is try to explore the entire search space as efficiently as you can, given how big it is. PostgreSQL mostly opts for the latter approach. There are some corner cases, which I don't think are interesting enough to spend time on in this presentation, where we don't consider certain possibilities, but by and large what the planner is doing is an exhaustive search. So we need to make sure we do that as efficiently as we can for a problem of this size.

The basic strategy is to avoid recomputation. If you think about two possible join orders, say a1, a2, a3, a4 versus a1, a2, a4, a3, they both start by joining a1 to a2. So we hope that when we solve that specific problem, joining a1 and a2, the first time, we're not using an algorithm that would have to figure out everything we learned all over again when we study the other join order, because they're closely related. So PostgreSQL uses a bottom-up, dynamic programming sort of strategy here to avoid that kind of recomputation. Basically, the idea is that we're going to consider every subset of the tables named in the query. If we've got n tables in the query, there are two to the n subsets of those tables; the empty set isn't interesting, but we're going to consider all of the other subsets, and we're going to do so starting with the subsets of lowest cardinality. That basically means we're going to start with the individual tables. If we have six tables, for example, there are 63 non-empty subsets of those six items; six of them have cardinality one, and those are the six subsets that contain exactly one table. So we're going to first consider each one of those tables. Then we're going to look at the subsets of size two, which correspond to the 15 possible two-way joins, and then there are 20 possible three-way joins and 15 possible four-way joins and so on, until finally, when we get to considering the one possible six-way join, we're actually considering the original problem we started out to solve, which is how to join all six of those tables together.

As we consider each of these tables, and then each of these joins, we're going to make a list of potential strategies for solving that part of the problem. And it's important to recognize that we're not actually going to use all of these joins. We have 20 possible three-way joins that we could do, but whatever final plan we come up with can involve at most two of those, and usually only one. It could involve two if we join three tables, then join the other three tables, and then join the results of those two together, but typically that's not what happens. So we're only going to use one or two of those three-way joins, but we don't know which one, so we're going to consider all 20 to get there. In this example, we have six tables, or what the planner calls base relations, and then 57 possible joins, what the planner calls join relations, for a total of 63 relations, and we're going to make a list of potentially interesting approaches for each one.
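Schematically, the bottom-up pass fills in a table of candidate paths indexed by subsets of the relations. This is a simplification of what the code actually does, but the recurrence looks roughly like

    \text{paths}(\{R\}) = \text{scan paths for base relation } R

    \text{paths}(S) = \text{prune}\Bigl(\bigcup_{\substack{S_1 \,\dot\cup\, S_2 = S\\ S_1,\,S_2 \neq \emptyset}} \text{join\_paths}\bigl(\text{paths}(S_1),\, \text{paths}(S_2)\bigr)\Bigr),
    \qquad \text{computed in order of } |S| = 2, 3, \ldots, n,

and for the six-table example the number of entries is \binom{6}{1} + \binom{6}{2} + \cdots + \binom{6}{6} = 6 + 15 + 20 + 15 + 6 + 1 = 63.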
Now, this is not an efficient algorithm, because we have an exponential number of relations and we're doing a very much non-constant amount of work for each one. We're going to consider lots and lots of possibilities for each of those relations, especially the joins higher up in the tree, throw away the possibilities that look clearly inferior, and keep the rest. So this still does not scale to large join problems, because we are searching this vast search space. At some point the algorithm breaks down, and there's a parameter called geqo_threshold that controls when we switch to an alternative algorithm. The default is 12, which I think is kind of low; in almost all cases you can go considerably higher than 12, but at some point it's just going to get too slow to do planning the regular way. So then we switch to something called the genetic query optimizer, which has a genetic component, but the genetic component doesn't really work. Basically, what it does is try a bunch of join orders at random and pick the best one. It will consider all of the possible join strategies for each of the join orders that it selects, but it's only going to consider a very tiny number of possible join orders. So hopefully one of those will be okay. For some queries that's very likely to happen, because in certain query shapes there are a lot of join orders that are all about equally good, and you're likely to hit one of them if you just choose some at random. But in other query shapes there may be only a few join orders that are any good, and you may happen not to get one, and then you may get a very bad plan.

So you said that for the genetic algorithm, it sounds like the cross-breeding portion doesn't work, and therefore it's really just a random walk? Yeah. I mean, there's code for it. My understanding from people who have played around with it is that the plans don't really improve when it goes through the genetic phase, or they don't improve much. So the quality of the final plan seems to be mostly a function of how good a job the random number generator does picking the initial population. Interesting. Okay. Thanks.

So we're making a list of strategies for each relation; which strategies should we keep? Let's look at the query I've got here on the slide: select star from foo join bar on foo.x equals bar.x where foo.y equals one. And consider this query from the point of view of relation foo. How should we get the rows that we need out of foo? Well, for most human beings looking at this query, your eye goes to the where clause. You see that foo.y is equal to one, and you say we should use an index on foo.y, which does indeed have a very good chance of being the most promising approach, assuming such an index exists. But not necessarily, because it could be the case that nearly all of the rows in the relation have y equals one. And if that is the case, then going through the index is going to be really expensive, not just a little bit worse but a lot worse: doing a full index scan of a table when you could have just done a sequential scan is really bad. So if rows with y equals one are very prevalent, then we want to forget about the index and just sequentially scan the table. The nice thing about those two strategies, which are listed here as A and B, is that they are easy to compare.
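To make the example concrete, the setup on the slide corresponds roughly to the following; the index names are made up for illustration, and the EXPLAIN is just to show where the planner faces this choice:

    CREATE TABLE foo (x int, y int);
    CREATE TABLE bar (x int);
    CREATE INDEX foo_y_idx ON foo (y);   -- supports strategy A
    CREATE INDEX foo_x_idx ON foo (x);   -- supports strategies C and D, discussed next

    -- Strategy A: index scan on foo_y_idx, fetching only the rows with y = 1.
    -- Strategy B: sequential scan of foo, filtering on y = 1 as we go.
    EXPLAIN SELECT * FROM foo JOIN bar ON foo.x = bar.x WHERE foo.y = 1;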
As long as we know how common rows with y equals one are — and we'll talk a little more about that in the second part of the talk — we should be fine. We can compare the costs of those two strategies, and we'll probably get the right answer. But there are some other strategies that are very difficult to compare, basically impossible to compare at this stage, and I've listed those here as C and D. In strategy C, we forget about the index on y, we forget about the sequential scan, and we say, hey, maybe there's an index on foo.x that we could scan, and we could just filter for rows where y equals one. That's going to be slower than whichever of A and B is faster, but it might avoid a sort down the road, because we could decide on a merge join between foo and bar. And if we do, and we don't use this strategy, then we're going to have to sort, and sorting could be expensive because there might be a lot of data. So we're going to need to keep strategy C even though it's going to be more expensive. And we're also going to need to keep strategy D, where we're basically totally unable to make a cost estimate at this time. Strategy D is again using the index on foo.x, but now we're not scanning the whole index once; we're repeatedly scanning it, each time looking for a particular value. The way to imagine this is to suppose we end up doing a nested loop join where bar is the driving table and we're probing into foo for matches. That could be really, really good, because it could be that bar is a very small table and foo is a very big table, and if that's the case, this strategy might be a real winner. But we can't estimate how expensive it is compared to A or B or C, because we don't know at this point how big bar is. So we don't know how many times we would have to do that repeated index probing, and we can't come up with any sort of meaningful estimate. So we'll have to hold on to this strategy as well, because we can't tell.

So that's the basic algorithm, and now I've got two slides here about what works well and, more to the point, what doesn't work so well. The good point is that we can effectively postpone our decision making — we can really effectively postpone our decision making. If we've got a potential strategy, or what the code calls a path, and we don't know whether it's better or worse than some other path, we just keep them both. No problem. Unless you're concerned about performance; then you might have a problem, because the more paths you keep for the base relations, the more expensive your join planning is going to be at level two and level three and level four and level five, and however far up it goes. And similarly, if you have more ways in which a path can be worth keeping, then you'll probably not only end up with more paths for your base relations, you'll probably also end up with more paths for each join relation. So everything that gives you more ways to compare things and say "we don't know which one is better" makes the whole thing more expensive. So when somebody comes along and says, I want to make the planner do something new, if they just want to generate paths that will, in any particular case, either be better or worse than the paths we've already got, that's probably okay. Assuming it's well implemented and they're not going too crazy with how many new paths there are, it adds some cost, but it's probably not going to be a big deal.
But when people want to add new cost metrics — to say that something is better because it may avoid a sort, or because it may facilitate a nested loop, or because it may let us do parallel query later, or because it's better in some way other than just being straight-up cheaper — that kind of thing multiplies the planning cost through the whole process. So sometimes we can't do all the things that we want to do, because it would just make planning too expensive.

The other sort of problem that I see here is that we're doing decision making with very limited information. One of the big things that we don't know, as we're working our way through the individual relations, is how many rows are going to get produced at the end. If we knew that your query was going to generate a billion rows, it would probably be sensible to just give up on the idea of doing parallel query straight off, because in virtually all cases the bottleneck is going to be how long it takes to send those billion rows to the client. Expending additional machine resources to generate those rows faster most likely makes no sense at all. But when we're thinking about the individual relations in the query, we have no knowledge of what the final row count is going to be. We don't even know yet what the row counts of the other relations involved in the query, besides our own, are. We only have this very local information about what's going on with a particular relation. We have very slight pieces of visibility into other parts of the query, like what join clauses are available — so, which tables is this table being joined to — and we can skip over generating things that aren't interesting because there are no join clauses. But that's practically the only information we have. So I don't know exactly how we would make use of more information if we had it, but it does feel like the postpone-decision-making approach often forces us into making somewhat uneducated guesses, at the early stages, about which strategies are worth pursuing. And instead of skipping the work of generating them and saving everybody the trouble, we have to generate them and then see what happens.

So to be clear, the process you described is a multi-stage, multi-level thing, and at this first stage you're just generating the paths? You don't do any introspection to say, I have some histograms or sketches that tell me what I think the selectivity will be, to try to guess what the output of a single operator or single join will be? No, we do know that — we do know that, but only for our own relation. So if I'm trying to estimate the join between foo and bar, I know how many rows are in foo, I know how many rows are in bar, I know how many of those rows are coming out given the filters in the where clause, and I have an estimate of the join selectivity between foo and bar. But I only know about those two tables. If there's a third table involved in the query, I don't have any idea what it's doing, except that I know whether it's joined to either of my two tables. I don't know how many rows are in it. I don't know whether there are four other tables involved in the query or 15. When I'm planning the join between foo and bar, I basically only have facts about those two tables, and not about anything else that may be happening.
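For context, the per-pair estimate mentioned here is, at its core, the classic textbook formula for an equi-join; PostgreSQL's actual selectivity code refines this using MCV lists and null fractions, but the shape is the same:

    |R \bowtie_{R.a = S.b} S| \;\approx\; \frac{|R| \cdot |S|}{\max\bigl(\text{ndistinct}(R.a),\ \text{ndistinct}(S.b)\bigr)}

where |R| and |S| are the row counts after applying each relation's own filters, so every quantity in the formula is local to the two relations being joined.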
This is almost an engineering thing, right — you're doing local optimization rather than taking a global view of what's going on. Right. Yeah. Okay.

So I wonder — sorry for the timing — but I wonder whether that's an artifact of the fact that Postgres is doing bottom-up query planning, so that you look for a local optimum for a small number of relations first and then walk up. Whereas if you use Cascades, a top-down framework, then maybe you can know the stats of the entire query plan, even though you're only considering a subtree for now, because you will know what might happen logically on the other side of the tree. That makes sense. I wonder.

Yeah, I think that's right. It's been proposed on the mailing list that maybe we should have more of a top-down approach, which would obviously be a big change in terms of how much code would have to be revised. But I think it could potentially help, because then, if nobody ever asks, you don't have to generate the data, and if somebody does ask, you generate it and cache it in case you get asked again.

True. So Goetz Graefe, the guy who invented the top-down Cascades framework, made a sort of offhand comment at SIGMOD one year that you should use top-down for all of query planning except join ordering — for that you want to go bottom-up. And the Germans, the HyPer guys, a number of those guys are very adamant that bottom-up is the way to go. You're the first person I've heard, who already has a bottom-up implementation, say that top-down might be the right thing to do.

Well, the interesting thing about it is this: if you've got a path like "sequentially scan the table," bottom-up is fine, right? Yeah. Because every time you consider that table, you're going to consider that path. If table Z has a sequential scan path, I can use it when I think about how to join A to Z, when I think about how to join B to Z, when I think about how to join the join product of A and B to Z, when I think about how to join C to Z. I'm going to get a lot of use out of that path because it's general. But what about strategy D here, repeatedly scanning an index on foo.x, each time looking for a different value of x? That path is specific to the case where I'm joining something with bar in it to foo. Now, it doesn't have to be just bar — in this query it does, because there are only two tables — but this kind of path on table foo could be used when joining bar directly to foo, or I could join bar to some other tables first and then join the result of that join to foo afterward and still use that path. But this path is less reusable than the other paths. The other paths are a hundred percent reusable; this path has diminished reusability because it's only reusable in certain contexts. Now, I don't quite see how going top-down wins, because I feel like I'm still going to have to generate that path sometime, and if I've got to do it, I might as well do it in the bottom-up flow. What am I really gaining out of postponing it? It's not clear that I'm gaining anything at all, but it makes me uncomfortable. When I'm trying to write code to generate new paths, I'm like, gosh, I don't really know whether this idea is any good. I can't tell.
And so I don't have a specific proposal here; I just think there might be something to it. Another idea that I've thought about is doing one bottom-up pass where we just estimate row counts and don't do anything else, and then, once we've got row counts for everything, going back and doing a second pass where we generate paths. I feel like maybe there's something to that idea, because then we'd have some information, earlier, about how things were going to turn out in the end, before we actually generated all of the paths. And if knowing what the end result was going to be let you skip some of the intermediate work, that might be cool — but I don't exactly see how it would let you skip anything either. So I have a feeling there's something we could be doing that is better than what we are doing, but it's probably hard, and even worse, I don't know what it is. Yeah, that's awesome. Thank you.

I have a quick question. What does the planner do with foreign tables currently? Foreign data wrappers. Yeah. So if you're using a foreign data wrapper and you have a foreign table involved in the query, then basically the foreign data wrapper has to provide a callback that can do the same job we would do for a native table. So it's up to the foreign data wrapper to provide a substitute implementation of the functionality we're talking about here.

I think there's another question, from Andre. Right. Yeah. So Andre is asking about something called partial paths. A partial path is something that I invented during the development of parallel query, and basically a partial path represents something that, if you executed it within a single process, would produce a subset of the results. You execute a partial path in every one of your processes, and then you have to Gather, or Gather Merge, all of those partial results to produce the whole output of the parallel query. I don't really want to get into that in this talk; it's one of the things where, as I mentioned at the beginning, if you're a Postgres expert you're going to see that I'm leaving things out, just so we can cover the general principles and keep the time to something reasonable. But it is a very interesting question. What I would say in general is that I actually would like to generate a lot more partial paths, but I can't, because of the problem we just talked about: when you generate more paths, it blows up the planning cost in a way that we really can't afford too much of. So yeah, it's a problem for sure.

Okay, so moving on, I'm going to go to statistics now, if that's okay. Absolutely. Yes. Okay, cool. So, statistics. We have this command called analyze, and it gathers various pieces of statistical information; basically, what we're hoping to do with that information is get accurate row count estimates. We have a background process called autovacuum — it actually does both automatic vacuum and automatic analyze — so it will notice when tables have had a significant number of modifications since the last time they were analyzed, and when it notices that, it decides it's time to run analyze on that table and does it for you in the background. So most of the time you don't actually need to worry about it. Sometimes you do, but that again takes us outside the scope of this talk.
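For reference, the pieces involved look like this; these are real PostgreSQL commands, settings, and system views, with only the specific values and the table name being illustrative:

    ANALYZE foo;                             -- sample foo and refresh its per-column statistics now
    ALTER TABLE foo
      SET (autovacuum_analyze_scale_factor = 0.02);   -- have autovacuum re-analyze after roughly 2% of rows change
    SELECT relname, last_analyze, last_autoanalyze
      FROM pg_stat_user_tables
     WHERE relname = 'foo';                  -- check when the statistics were last refreshed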
And the row count estimates are important to everything we just talked about. As we said before, if you've got something like select star from foo where y equals one, you need to know whether rows with y equals one are really, really common, because if they are, then you should probably forget about the index, and if they're less common, then you probably want to make use of that index in some way.

Here's a list of the statistics that analyze gathers. These are all on a per-column basis. For every column, we're going to estimate the fraction of rows where there's a null in that column. We're going to estimate the average width of a non-null value stored in that column. We're going to estimate how many distinct values appear in that column as you go through the table. This one is a problem, because it's often not very accurate; it tends to cap out in the tens of thousands, no matter how large the table is. And that's because we don't want analyze to run for an unreasonably long time, so we want it to sample a fixed-size portion of the table — you can control exactly how much, but it doesn't scale as the table gets bigger. So we just don't have enough information to really know for sure how many distinct values there are in the column. Then we look at the values that appear in the column most frequently — the most common values, or MCVs — and make an estimate of the frequency of each one. And then we make a histogram to give some idea of the distribution of the values that are not MCVs: we take the MCVs out and divide the remaining sampled values into, usually, a hundred buckets. So the lowest one percent of the sampled values were between zero and 17, and the next one percent were between 17 and 51, or whatever it is. Then we make an estimate of physical-to-logical correlation. If your table has a lot of updates and deletes, this is likely to be near zero. But if you have, say, an insert-only table with timestamp data, so that the timestamp column is always going up, then the correlation might be very close to one; and if a column is only ever inserted in descending order and you don't have a lot of updates and deletes, the correlation might be very close to minus one. And then, if the column has an array type, we'll actually peek inside the arrays and see if we can make an estimate of which elements most commonly occur in your arrays, and their estimated frequencies.
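The per-column statistics just listed are all visible through the pg_stats view; the view and column names below are real, with foo and y taken from the running example:

    SELECT null_frac,              -- fraction of nulls
           avg_width,              -- average width of non-null values, in bytes
           n_distinct,             -- estimated distinct values (negative means a fraction of the row count)
           most_common_vals,       -- the MCV list...
           most_common_freqs,      -- ...and the estimated frequency of each entry
           histogram_bounds,       -- bucket boundaries covering the non-MCV values
           correlation,            -- physical-to-logical ordering correlation
           most_common_elems,      -- for array columns: the most common elements...
           most_common_elem_freqs  -- ...and their estimated frequencies
      FROM pg_stats
     WHERE tablename = 'foo' AND attname = 'y';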
Does it peek into other data types as well, like JSON or similar? No, I don't think so. But maybe you're about to tell me that the real answer is yes. I don't think so, though.

I wonder if there's been any discussion within Postgres about adding sketches to the stats, like a Count-Min sketch, to give you a rough estimate of the frequency of a particular value, things like that. I couldn't understand the word you were using. I wonder whether Postgres has had internal discussion about adding some types of sketches as a kind of statistic. Sketches — sorry, I'm not familiar with the term. Things like HyperLogLog or a Count-Min sketch: approximate, probabilistic data structures that keep approximate counts for you. Yeah, no, as far as I know that idea has not been proposed. I might have missed it, but I don't remember a discussion on that topic. We have used HyperLogLog for a couple of other things, but I don't remember anyone proposing it for this purpose. We do keep the histograms, which are basically one type of sketch, but we don't have anything beyond that yet. The CEO of Splice Machine came and gave a talk at CMU a few years ago, and he was raving about how the accuracy of their cost model improved quite significantly when they switched from traditional histograms to sketches. Okay — let's revisit this topic in a slide or two, because I'm about to say some more things that are relevant to it.

So the question is: if we're gathering all of this statistical information, and it looks like a decent amount of statistical information, does it work? Can we accurately estimate row counts? Back about ten years ago, not long after I started at EnterpriseDB, I did a sort of informal survey of email threads on the PostgreSQL performance mailing list. I actually gave a talk based on that study at an old PGCon, if you want to look it up. I went through about a hundred and sixty-eight email threads and made my own tentative diagnosis of what I thought had happened in each case, in my opinion. And it looked to me like the big winner, in terms of why those queries were slow, was some kind of problem with the planner. I attributed ninety-four of them to planner-related causes. I attributed twenty-six to unreasonable user expectations and/or confusion, twenty-three to poor settings choices, fourteen to bugs either in the operating system or in PostgreSQL, and eleven to deficiencies in Postgres outside of the query planner. You can see that ninety-four is a lot larger than all of those other numbers. And of those ninety-four, I attributed forty-eight to row count estimation errors. So the number one cause, in this informal survey, of queries being slow was the planner doing something bad, and the number one cause of the planner doing something bad was estimating the row count wrong. This is not a solved problem. This is very much not a solved problem.

Do you remember, when you did that survey, whether the estimates were mostly overestimates or underestimates? Oh, both. Both. Yeah. And just as a general piece of context: if you get the row count wrong by a factor of two or three, it's not that big of a deal. The real problems come in — and we'll see some examples of how this can happen in a minute — when you're off by orders of magnitude. It's often a question not of what multiple you were off by, but of how many orders of magnitude. It can be six, right? When your row count estimate is off by six orders of magnitude, something bad is probably going to happen to your query plan.

So, there are basically only three cases where we can effectively estimate the row count, and everything else is a mess. I'll talk about some of the messiness in just a minute, but let's talk about the three cases that are pretty good. If you've got a simple equality condition, like x equals 10, that's typically going to be well handled. If 10 is an MCV, then we're going to have a specific estimate for 10, which is likely to be fairly accurate. If not, then we don't really know how common 10 is, but we know it can't be that common, and that's usually good enough.
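Concretely, the arithmetic walked through next for the equality case comes out roughly as follows; this is a simplification of what PostgreSQL's selectivity functions actually compute:

    \text{sel}(x = v) \;\approx\;
    \begin{cases}
      f_v, & v \text{ is in the MCV list with recorded frequency } f_v \\[4pt]
      \dfrac{1 - \sum_i f_i - \text{null\_frac}}{\,n_{\text{distinct}} - n_{\text{MCV}}\,}, & \text{otherwise}
    \end{cases}

So with the numbers used in the talk — MCVs covering 70 percent of the rows, no nulls, and an estimated 25,000 distinct values — a non-MCV lookup is estimated at roughly 0.30 / 25{,}000 \approx 1.2 \times 10^{-5} of the table.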
We do tend to be high more often than we're low, because basically we're going to say: well, 70 percent of the rows seem to have MCVs in them, which means the other 30 percent of the rows account for all of the non-MCVs, and we think there are 25,000 distinct values in the table, so take 30 percent divided by 25,000 and there's your estimate. That number may not be particularly correct for any individual non-MCV, but unless the table is really, really big, it's probably going to be a relatively small number, and that's kind of what we need, because we want to know things like: yes, you should use an index if you've got one. As long as we get a relatively small number there, we don't tend to get ourselves into terribly bad trouble.

Then there are things like x is greater than 10. We know which MCVs are greater than 10, and we can use the histogram to refine the estimate for the non-MCVs. In my experience, even though this is not perfect, the number one problem is not algorithmic; it's just that if the table changes really fast and a fresh analyze hasn't run yet, the estimate might be off.

Since we've got a couple of PostgreSQL hackers on the call, I'll just mention that there is some code designed to correct for certain kinds of errors in this area on the fly at runtime, but it does not work perfectly or cover all of the cases, so it's not a get-out-of-jail-free card, and you can still have issues because analyze hasn't run recently enough. It's not super common, but it happens. And then the third case that we can estimate pretty well is stuff like x is null or x is not null; that's directly one of the things analyze measures, so we're fine.

Pretty much anything beyond that, we've got issues. My favorite example is the first query on this slide: select star from foo where the quantity a plus zero is equal to a. The planner does not know what plus does. It certainly does not know that zero is the additive identity. As far as it's concerned, all you said is "select star from foo where something"; it has no clue what's going on there.
So it says: okay, default estimate, half a percent. Which — if there are no nulls, this condition is always true; if there are nulls, it's sometimes false, an easy point to overlook — but if there are no nulls, your estimate is off by a factor of 200. That's the kind of order-of-magnitude problem that can set back query planning pretty significantly. And if you change it to a plus one equals a, you still get the half-a-percent estimate, because the planner still doesn't know anything about plus, and now you're off by however many orders of magnitude in the other direction, because the real answer is zero. So those kinds of cases suck.

What's a lot more common in the real world is something like the second example on the slide: select star from foo where a equals one and b equals two. In general, all we can do is hope that those two conditions are independent of each other, but they might not be independent at all. It can very easily happen that rows where a equals one are super likely, or super unlikely, to also have b equals two, and that can result in you being way off. It's particularly toxic if both a equals one and b equals two are relatively common individually. Say they both happen 10 percent of the time; together, we're going to say it probably happens one percent of the time, but the real answer might still be 10 percent, in which case we're off by an order of magnitude in one direction, or it could be zero, in which case we're off by however many orders of magnitude in the other direction. For this particular case, we now have a nice tool that often helps a lot: you can run a create statistics command and say, please gather statistics on the joint distribution of those columns, and then when you rerun analyze, it gathers additional statistics on the joint distribution and things usually get better. There are probably data sets where the additional statistics it gathers are not fine-grained enough to resolve all the issues, but I think the results have been pretty good, from what I've heard. Unfortunately, as soon as you make things a little bit more complicated, we're back in trouble. The third example on the slide involves a join, and now we've got two possibly correlated columns, but they're in different tables, and create statistics isn't yet smart enough to deal with that case. So you're out of luck; you may be way off.
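For the two-column case just described, the tool in question looks like this. The syntax is real — CREATE STATISTICS has existed since PostgreSQL 10, with the mcv statistics kind added in 12 — and only the statistics object name is made up:

    CREATE STATISTICS foo_a_b_stats (ndistinct, dependencies, mcv) ON a, b FROM foo;
    ANALYZE foo;   -- the joint-distribution statistics are collected on the next analyze

    -- Afterward, the estimate for "WHERE a = 1 AND b = 2" can use the joint
    -- distribution instead of multiplying the two selectivities as if independent.
    -- Note that the FROM clause names exactly one table, which is why the
    -- cross-table case in the third example is out of reach.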
So to be clear, when you say not smart enough, you mean you simply don't support correlated statistics across tables, correct? Correct. Some commercial systems can do that; Postgres just cannot. Okay — I guess that was a statement, sorry, not a question. Yes. I mean, like I said, I don't actually know what other systems can do, but we can't, and that's bad. Yeah.

And I should also ask: what is the overhead on analyze if you add correlated statistics? I don't think it's very much, although I haven't checked it. I think most of the expense is having to read the table pages, and what you compute after that, I don't think, is the big problem. Okay. Right. Go for it.

Yeah, so for your example number three, which is a very important one because it happens a lot — chains of joins with predicates on both ends — one solution would be to actually run a sample. It doesn't take that long to do, and you will see the correlation. Well, I mean, it takes a long time to run the query, unless you plan that query using accurate statistics. No, because what happens here is that when you sample the query, you can also have other predicates in it, such as UDFs, where you have no idea what the hell they are doing, but you will see the correlation — you will see the effect of the UDFs — and it doesn't take that long. SQL Server does something similar to this as well, where they maintain, I think, a small sample of the tables, and to derive that correlation you run against the sample, I should say, rather than against the real live table. Sure. Yes. And it's efficient, because the sampling can run two ways — Bernoulli, but also page-level sampling, so we skip over pages. So if I sample one in 10,000, it runs about 10,000 times faster. Yeah, I mean, we have a TABLESAMPLE facility, but it's just a user-level facility; we don't use it for anything involving gathering statistics. That wouldn't cut it, because it will also visit every row; you've got to skip pages. We do have the ability to skip pages and pick a random subset of pages, but it's not wired into the statistics framework. Yeah, but you know the SQL standard supports that, right? You put FROM table x TABLESAMPLE and then you specify your rate, Bernoulli or system. Yes — we have that; that's what I'm telling you, we have it. Okay, that's good, then use it. Well, using it is the part that we don't do. We don't use it ourselves; it's just there for users, which, as you're pointing out, is not great, and you're right.
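For reference, this is the user-level facility being discussed. The syntax is standard SQL and both sampling methods ship with PostgreSQL; the percentages below are just examples:

    -- Row-level Bernoulli sampling: still visits every page, keeps roughly 0.1% of rows.
    SELECT count(*) FROM foo TABLESAMPLE BERNOULLI (0.1);

    -- Page-level (SYSTEM) sampling: skips most of the table, so it's much cheaper on large tables.
    SELECT count(*) FROM foo TABLESAMPLE SYSTEM (0.1) REPEATABLE (42);

As noted above, none of this is wired into ANALYZE; it's only available to user queries.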
I want to talk about one more bad case. This is easily — it's not even close — the most annoying optimizer failure in PostgreSQL. The general shape of the query is select star from foo where a equals one order by b limit one. The problem, as I think many of you will immediately grasp, is that we need to decide whether to scan an index on a or an index on b, and one of those is likely to be really good and the other is likely to be really bad. If we scan an index on a, then we need to visit as many rows as there are with a equals one and just keep whichever one has the lowest value of b. So if there aren't very many such rows, this plan is really good, and if there are a ton of them, this plan is really bad. The other possibility is that we could scan an index on b and just stop as soon as we find a row where a equals one, in which case things are going to be great if we find one quickly and terrible if we go a long time without finding one.

To see the problem, imagine that 10 percent of the rows in the table have a equals one — and let's say 10 percent of the rows in this table is a big number. The planner is going to say: ah, well, if I just scan the index on b, I will on average only have to scan about 10 rows before I find one where a equals one. That looks pretty good compared to going through 10 percent of the table. But the worst case is actually far worse. You can end up scanning 90 percent of the table, because it can happen that all the rows where a equals one have very large values in column b. So you scan through the other 90 percent of the table, all the rows you actually want are clustered at the end of the index, and you don't find one for a long time. Or maybe you think there are a lot of rows where a equals one and there actually aren't any, for some reason — that shouldn't really happen if your statistics are up to date and the example is this simple, but sometimes we get things wrong there too. So yeah, this is a hugely annoying problem. Generally the planner goes wrong by picking the index on b when it should pick the index on a. And the best advice that I've been able to come up with, in probably pushing 20 years of playing around with Postgres, is to suggest to people that maybe the index on b should be dropped. Which is a fairly blunt-force instrument, and it often works, because people often create more indexes than they really need; but clearly, if they needed that index for something else, then that strategy is not going to help. This comes up a lot.

So — I was going to save this question for the end, but this is clearly the right time for it. Why no hints in Postgres? What is the design philosophy? This problem would seemingly be solved by hints. It's nice to have it all be automatic, but in this case it would save people a lot of headache. Yeah. I mean, the reason for not having them — well, Advanced Server, EDB's fork of Postgres, has hints, and I don't think I've ever told a customer to use them, because they don't solve the problem, right? They might in this particular case, actually, but in general the issue with hints is that what a hint typically does is say: use this exact query plan. And whether you think that makes sense or not, you're kind of forced into specifying the whole plan for the entire query, because once you start, right — you say, I know you think it's a bad idea to use a nested loop to join these two tables, but do it anyway. Well, you've forced that to be a nested loop, but all of the other decisions the planner is still making in other parts of the query tree are still wrong, because the root of the problem is that the row count estimates are wrong. And because the row count estimates are wrong, everything's wrong. So you can't really fix it by hinting a little bit of it; you've got to hint the entire thing — you have to basically fully specify the query plan. If I had my way, the hint would say, hey, you're wrong about the row count estimate, and it would give me a way to fix that. If I could tell a customer to put that into their query to fix the problem — or, better yet, if I could do it declaratively with something like create statistics — that would be awesome. But just having a way to force one particular part of the query to do a certain thing, even though it doesn't seem to make sense, is generally not great. Now, this case might be an exception, because there are only sort of two possibilities here, and because this is being planned as a subquery, it's not going to have the same kind of cascading ramifications on the rest of the query plan. So in this particular case, yeah, a hint might be golden. That's fair. Yeah, okay.
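To pin the problem down, the shape of the query and the two competing indexes look like this; the index names are made up, and this slide's foo has integer columns a and b:

    CREATE INDEX foo_a_idx ON foo (a);
    CREATE INDEX foo_b_idx ON foo (b);

    -- Plan 1: scan foo_a_idx for a = 1, then keep the row with the smallest b;
    --         cost grows with the number of matching rows.
    -- Plan 2: walk foo_b_idx in order and stop at the first row with a = 1;
    --         great if a match shows up early, terrible if the matches all sort late.
    EXPLAIN SELECT * FROM foo WHERE a = 1 ORDER BY b LIMIT 1;

    -- One standard workaround (not mentioned in the talk), when the schema allows it:
    -- a multicolumn index serves both the filter and the ordering at once.
    CREATE INDEX foo_a_b_idx ON foo (a, b);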
So, last real slide. I'd like you to take away three points from this discussion of statistics. The first one is that the system actually works surprisingly well for how dumb it is. I was astonished the first time I discovered that a plus zero equals a could not be estimated remotely correctly, and then I realized that I'd been running queries against Postgres for years and most of them had been working just fine, despite it being that easy to fool the darn thing. And I think that's a pretty general experience: most queries run okay for most users most of the time. But despite that — and this is the second takeaway — people are constantly running into problems, and some of them just don't have reasonable workarounds. You end up saying to people, have you thought about redesigning your entire schema? And they're like, no, but I've thought about using a different database product that will work with the schema that I've got — or at least some of them say that, right? Which is not very satisfying. And the third takeaway I'd like you to have is that create statistics is pretty good stuff, and it does help a lot. I think it serves some of the function that you might hope to get out of a hint system, but without requiring you to decorate every individual query — because that's another problem with hints: once you start using in-query hinting as a way of solving problems, every query that has the problem has to be hinted, whereas if you can somehow make create statistics able to deal with the problem, then you do it at the DBA level, you fix it once, and all the queries that everybody runs just work properly after that, which is nice. And those queries will probably also adjust to changes in the data, which won't happen if you try to nail down the plan to be one particular thing. I'm not trying to be too hard on hints — I'm actually more receptive to the idea of hints than at least one member of the PostgreSQL community — but I think it's not a problem-free approach, even though it could be good in some cases.

There are also other engineering aspects, like hints being sticky: you upgrade to a new version, and maybe the problem you were trying to overcome with hints is solved, but no one is going to go back and fix that application. Yeah, that's another pitfall of that approach. So, I see we've got Tomas on the call, who is the author of create statistics, and I gather he's still working on it for the new release, which is great to hear. I think the improvements we've had in that area are some of the highlights of how this has gotten better over the last few releases, so hopefully that will continue. And I see—

Okay, is this the end of the talk? Because I'll applaud and then we'll do questions. Sorry. Yes, that's the end of the talk. I will applaud. Thank you. This is awesome. This is gold. Okay, so now if you have questions — Becca, if you want to unmute yourself, go for it. Yeah, so Rebecca is asking a question in the chat here. She says: did you consider using index prefixes to determine which multi-column stats to collect, rather than using create statistics? I don't know if Tomas wants to unmute himself and answer that question, but he would be the best person in the world to answer it, because he wrote the code. So, Becca is actually the lead optimizer expert at Cockroach Labs, and somebody I worked with, along with Mike Stonebraker, at MIT. Becca's awesome. Thanks, Andy. Yeah, I'm just curious, because we decided to go with the index prefixes approach. You can call create statistics in CockroachDB as well, but we also automatically collect stats on index prefixes.
So, just wondering if you had considered that, or why you rejected it if you did. I didn't personally reject it. I think generally the PostgreSQL philosophy tends to want to decouple things like that, so that users get more control over the behavior — because if we collect something you didn't want us to collect with that kind of approach, there's no way to turn it off. But on the other hand, if it gives enough people a free benefit, maybe it's a good idea. I don't know; I haven't really thought about it. Oh, thanks.

Okay, any other questions? Okay, you're all fools. Robert's here and you're not taking this opportunity. I'll do it. Screw you all. So, I guess you've already answered one of my questions, where you said you don't even look to see what other systems do. I was curious about that: in your development, if you come across a query where Postgres does something weird or unexpected, do you ever go look at what SQL Server does, what the commercial systems do, or even the open source optimizer tools like Calcite or Orca? I personally don't, and maybe I should — I probably should. In some cases I think it would be useful, but I tend to find that navigating the PostgreSQL community, especially in the early years when I was first getting started, but even to some extent now, is a little bit about the art of the possible. It doesn't do me any good to know that somebody else is using an approach that I know the community is never going to accept. So I found it, especially when I was starting out, more useful to approach every question from the point of view of: how could I do something that is an incremental change to what we have now, that I might be able to get accepted? I think now, having been around for a long time, I could probably get more things accepted if I had the time to spend working on them, so some broader knowledge of what's going on elsewhere would probably be more useful to me now than it was ten years ago. But this is also not the thing I spend all of my time on. I spend a lot of my time basically just dealing with different things all over the system. One of the projects I've been working on, when I get time, is some refactoring of the base backup code so that we can get some new features for hot backup that we can't get with the current code organization. And I just can't learn about everything, right? So I do sort of regret that I don't have a broader perspective on what's going on with everybody else, but I also don't have time to go learn a whole bunch of other systems that are not Postgres, because I'm not even close to having learned everything about this system. And there's already more work that I could do, by a vast margin, just knowing something about this system. If I learned about more systems, I would only get more ideas for things that I don't actually have time to do. So I do kind of regret it, but I also feel like I have to be careful not to spend time learning a whole bunch of things that I'm not then going to be able to put into practice. Okay.
In the case of the a plus zero equals a example — there are some constant expression evaluation optimizations that I know Postgres can do. Like, if you say where true equals false, that will skip the entire scan. So I'm curious: is it very piecemeal, what those optimizations look like? Because for a plus zero equals a, I think you'd have to go look at the catalog and say, okay, what is the type of a, as well — maybe that's the limitation there. So what's happening there is this: if you say where true equals false — or let's take a slightly more complicated example, say where two plus two equals four — there's a function called eval_const_expressions. Just from the name, you can probably tell what's going to happen, right? When it looks at two plus two equals four, it's going to say: oh look, two plus two can be simplified to four, and four equals four can now be simplified to true, and then I can drop the clause altogether. And if you said where two plus two equals five, it's going to simplify that to false, and then it's going to decide that the relation is a dummy relation that can never produce any rows, because it's got a constant-false filter applied to it. The problem with a plus zero equals a is that eval_const_expressions can't actually do anything — it doesn't view that as a constant expression. Now, you could imagine a smarter system for simplifying expressions that says: okay, that's not a constant expression, but I can simplify it to where a is not null. And if it could do that, it would be awesome, because then it would get a perfectly correct estimate straight out of the chute. But that would require us to insert some kind of logic, and I don't know exactly how it would work. What facts would you think about putting into the system catalog about plus that would enable it to do something useful with that expression? And the real examples where things like this come up are, as someone was saying before, user-defined functions. Somebody writes a function and puts a pile of SQL into it, and then they write "where that function returns some answer," and it's like: how are we supposed to know what that function is going to do when you execute it?

So, you're not going to like this answer, but SQL Server can handle that. It's a project called Froid. They can basically convert the UDF into relational algebra, and then they inline that into the rest of the query and optimize the whole thing together. So PostgreSQL has a pluggable system for procedural languages, and two built-in ones are SQL and PL/pgSQL. With PL/pgSQL, you have procedural statements in there, and we can't do anything with those functions, because they could do anything. But SQL— So SQL Server can do it even on procedural code; they can convert that into relational algebra. It's impressive. So, if you make it a SQL function and it's just a single query, then, subject to certain conditions, we will try to inline it. But PL/pgSQL, we can't do anything with. Yeah. And we're also working on compiling PL/pgSQL — but that's another topic for another time.
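A sketch of the SQL-function inlining just mentioned. The behavior is real — simple LANGUAGE sql functions consisting of a single query can, subject to conditions, be expanded into the calling query — but the function name and parameter here are made up for illustration, reusing the foo table from earlier:

    CREATE FUNCTION matching_foo(p int) RETURNS SETOF foo
        LANGUAGE sql STABLE
        AS $$ SELECT * FROM foo WHERE y = p $$;

    -- The planner can inline the function body, so this is planned much like
    -- "SELECT * FROM foo WHERE y = 1", with the usual statistics applied:
    SELECT * FROM matching_foo(1);

    -- The same body written in PL/pgSQL stays opaque to the planner, which then
    -- falls back to default guesses about what the function returns.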
Okay — another question? All right. So mine is more of an engineering question. Sorry, just a small follow-up on the constant evaluation thing first. I'm curious about one small thing: when you say you use this evaluate-constant-expressions function to evaluate predicates, is that a function specifically written for the optimizer, or is the constant evaluation something from the executor that you reuse in the optimizer? Just at the implementation level. I mean, I don't think there's any theoretical reason why you couldn't call that function from anywhere in the system where you had the right kind of input to it, but in practice you would only call it from the planner, because at execution time you're only going to evaluate the things that you couldn't evaluate at plan time, right? The evaluation performed by eval_const_expressions is the same kind of execution that's performed by the actual executor at runtime — they reuse a whole bunch of the same machinery — it just has different framing around it, because you're trying to do something different. Okay, sure, sure. Yeah, that's a little bit of an issue, yeah.

We're a little over time, so I guess let me ask my last question. Can you comment, at a high level, on your impression of the overall code quality of the optimizer versus other parts of Postgres? The Postgres storage engine, the execution engine — it's beautiful C code, it's well documented, it's pretty straightforward and coherent. How would you describe or characterize what the optimizer looks like? I think the hardest part about the optimizer code is understanding in which order all the things are going to happen. I've spent a lot of time studying the optimizer code, and when I looked at little bits of it, I was like: oh, I see what this is doing — this is how we plan hash joins — and I got it; I understood what it was doing. I find it more difficult — I think it is more difficult — to understand all of the different phases. There are data structures that are reused across phases, but you have different invariants at different phases of the process about what has to be true or may not be true, and that's harder to get your head around. So I think the optimizer — again, largely thanks to Tom and his amazing effort on the project over the years, and him being a very, very good coder — all of the things he's worked on tend to have the qualities you talked about, where the code is well organized and the comments are pretty good and all of those kinds of things. I think the optimizer could benefit from more work to make it more understandable — what the phases of processing are, and which things have to be true at which phases — but it's definitely not one of the dark corners of the system where things get squirrely. Okay, awesome, thank you so much. Okay, I will applaud on behalf of everyone else.