Welcome to a deeper understanding of PostgreSQL execution plans at plan time and runtime. I'm joined by Jobin Augustine, Senior Support Engineer at Percona, who will discuss and demo improvements in PostgreSQL 12's planner, including how the plan changes and behaves. My name is Lindsay Hooper. I'm one of the PostgreSQL conference organizers and I'll be your moderator for this webinar. A little bit about your speaker now. Jobin is a PostgreSQL expert and open source advocate. He has more than 18 years of working experience as a consultant, architect, administrator, writer, and trainer in PostgreSQL, Oracle, and other database technologies. He has always been an active participant in the open source communities, and his main focus area is database performance and optimization. He is a contributor to various open source projects and an active blogger. He loves to code in C++ and Python. Jobin holds a master's in computer applications and joined Percona in 2018 as a senior support engineer. With that, I'm going to hand it off to Jobin. Take it away.

Thank you for the great introduction. Let us jump directly into the topic. Today we are going to discuss the PostgreSQL execution planner in a bit more detail. I'm Jobin, working for Percona as a Senior Support Engineer, as Lindsay mentioned. Let's get started.

So, all of us send SQL statements to the Postgres database as plain English-like text, and a lot of processing happens on the server side. The first step is lexical analysis, then parsing. For these, Postgres uses the most common tools, flex and bison, available by default on any Linux system. The next stage is analyze, where the semantics of the SQL statement are analyzed. Then it goes for the rewrite: the statement will be rewritten based on certain rules. The next stage is plan and optimize; this is the stage where the entire brain of Postgres comes into the picture. And the final stage is execute. As we can see on the screen, even a SELECT statement has a lot of sub-clauses, so we can understand the complexity involved: there is the target list (the values to be returned), the INTO clause, the FROM clause, the WHERE clause, the GROUP BY clause, and so on. We'll see these things in detail in the coming slides.

So, let us look at the first stage, the parse stage. What we are seeing on the right side of the screen is the grammar specification of all the statements in Postgres. We have around 124 types of statements, and that is one reason why we call Postgres one of the most advanced open source databases in the world: it can handle pretty much every type of SQL statement, and we stick to the standards. The entire grammar specification runs to about 16,000 lines, so it's pretty complex. In the previous screen it would be very difficult to spot the SELECT statement, because it's buried somewhere in the middle. And even the SELECT statement alone is very complex: it has the WITH clause, the SELECT clauses, then the WHERE clause, the GROUP BY clause, and the whole syntax is so involved that we can appreciate the complexity of the backend processing. As we can see on the screen, we can watch how the parse happens by enabling a parameter called debug_print_parse. If you turn it on, you can see how the parsing happens, but it is only for experts; it's not for end users.
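As a quick aside, a minimal sketch of how these debug parameters can be switched on in psql. Only debug_print_parse was mentioned in the talk; the related debug_print_rewritten and debug_print_plan parameters also exist, and all of them write their output to the server log, not to the client:

    -- Dump the internal trees to the server log (debugging aid, not for end users):
    SET debug_print_parse = on;        -- the raw parse tree
    SET debug_print_rewritten = on;    -- the query tree after the rewriter
    SET debug_print_plan = on;         -- the final plan tree
    SELECT 1;                          -- the trees for this statement now appear in the PostgreSQL log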
Since we are going to discuss execution plans and explain plans: we saw around 124 types of statements, but not all of them are explainable. Only a small set of statements are explainable, so only those have an execution plan associated with them. For example, if you want to drop a table or drop a database, you don't need an execution plan; it's very simple, there is no planning required. But when we try to do something meaningful, like getting data out of some table, joining things, or doing some transaction, it becomes complex, and that needs to be planned. So who is going to do this planning? First, we need to decide whether a plan is needed at all. There is a component called the traffic cop, which decides whether a statement needs to be planned or not. It basically splits statements into simple and complex: simple ones need no planning, complex ones do. If you look into the Postgres source code, this will be very clear, but yes, there is a lot of complexity there.

The next stage is analyze. In this stage, the meaning of each statement is identified: basically the columns, data types, collations, etc. are identified. This stage transforms a parse tree into a query tree. For simplicity, I mention the internal functions here; if anyone wants to look into the Postgres source code, please look into these functions and it will become clear. At the top level, we have pg_analyze_and_rewrite. On the right side I have a screenshot of some of the functions; no need to go into details. Basically it takes a parse tree and converts it into a query tree, and most of the functions are named transform-something, because they transform a parse tree into a query tree.

The query rewrite happens after analyze. In this stage, a number of rules are applied to the statement. We can create custom rules; the syntax is also shown on the screen: CREATE RULE, a name, and what to do. This rewriting happens between the parser and the planner, and there are built-in rules as well as custom rules, as we are going to see in some examples. For example, when we create a view, it is basically a set of rules. Postgres has a built-in view called pg_settings, through which we can get the settings data. This is nothing but a function call: a SELECT * FROM a function call. And even if you create a view on a user table, Postgres basically rewrites that view into a select statement internally. That is what happens.

So before getting into too much theory, I would like to show a quick demo. I'm just connecting to a psql prompt. We all know how to see work_mem: we can say SHOW work_mem, and it will show how much work memory we have. We can set it using the SET command for the session, and it is already set to 6 MB. We can do the same thing using a function call: there is a built-in function called set_config, and we can invoke that instead of the SET command, like this, set_config with work_mem and 8 MB. So we increase it to 8 MB. Yes, it has become 8 MB. And instead of the SHOW command, we can query the same thing using pg_settings. Yes, that 8 MB is visible there; it's 8192 kB. We mentioned that this view, pg_settings, is nothing but a call to a built-in function, so we can even do the function call directly instead of querying pg_settings. Let's do that. Here we are doing the function call, and it shows the same value. Now the question arises: since this is a view on top of a function, can we do an UPDATE statement, something like UPDATE pg_settings SET setting to some value?
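Roughly, the commands from this demo correspond to the following sketch. The values are the ones from the walkthrough, and pg_show_all_settings() is the built-in function that the pg_settings view selects from:

    SHOW work_mem;                                   -- currently 6MB
    SET work_mem = '6MB';                            -- the usual way, for this session
    SELECT set_config('work_mem', '8MB', false);     -- the same thing via the built-in function
    SELECT name, setting, unit
      FROM pg_settings WHERE name = 'work_mem';      -- shows 8192 (kB)
    SELECT name, setting
      FROM pg_show_all_settings() WHERE name = 'work_mem';   -- calling the function directly
    UPDATE pg_settings SET setting = '10240' WHERE name = 'work_mem';   -- can a view over a function be updated?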
In most of the places where I ask this question, I get the reply that it won't work because it's a function call. Normally, yes, it would not work. But in our case it works, because internally there is a rule. So let us check whether it was really applied. Yes, it's increased to 10 MB. And when we look at the rule, we can see there is a rule saying that on UPDATE to the pg_settings view, do a function call to set_config. That is why the UPDATE statement works.

So let us proceed. What about creating a custom view? When we create a custom view, there is an internal table where the rewrite rule will be recorded. We just saw pg_rules; similarly, we have the pg_rewrite table. I'm taking the count of records there: it is 131. Now I'm going to create a table and a view on top of that table. If we look at the number of rewrite rules again, it has incremented by one; it becomes 132. So basically this view created a rule. If somebody wants to see what that rewrite rule really is, we can see it using a select statement like this, but the output will be really complex for an end user to understand. Somebody who knows the internal parse and query trees of statements can understand it, though. What we need to understand as end users is that views are nothing but rules, and we can create custom rules to manipulate anything. For example, if a statement comes in saying SELECT * FROM table A, we can create a rule saying: instead of querying table A, query table B, and the end user won't be aware that the query is actually executing on table B. I hope this is clear; let's proceed.

The next stage is the plan. We saw how the rule system works. Now we have a very complex statement, the query tree, which needs to be planned. The query tree can actually be executed in a wide variety of ways. We know that to travel from one place to another, there could be any number of routes; similarly, to execute one statement there are multiple ways of doing it. The planner's job is to identify the shortest, cheapest, most efficient way of executing the statement. So the planner is that intelligence system. How it does this is with an internal data structure called paths. These paths are cut-down versions of plans. We talked about the parse tree and then the query tree; now we are getting into plan trees. So the paths are cut-down versions of plans, and from these the cheapest path is selected. After the whole planning phase, the final plan tree is prepared, and how it is prepared is based on cost-based analysis. We are going to see the details of that.

The last phase of SQL execution is execute. Once we have the plan tree ready, the plan nodes are executed recursively from top to bottom, and the top node returns the result.

Now coming back to the planner, since this talk is about the planner: a few important key decisions have to be made. What scan method should we use, what join method should we use, and what join order? For the scan method, we have sequential scan, index scan, or bitmap index scan. For join methods, we have nested loop, hash join, or merge join. And the join order is about which tables need to be joined in which order.
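As a hedged illustration of those three decisions, here is what an EXPLAIN of a two-table join might look like. The table names and the exact plan shape are hypothetical, not from the demo; the real choice depends entirely on the data and the statistics:

    EXPLAIN SELECT *
      FROM orders o
      JOIN customers c ON c.customer_id = o.customer_id
     WHERE c.country = 'NL';
    --  Nested Loop                                                  <- join method
    --    ->  Seq Scan on customers c                                <- scan method; outer relation first (join order)
    --          Filter: (country = 'NL'::text)
    --    ->  Index Scan using orders_customer_id_idx on orders o    <- inner relation probed by index
    --          Index Cond: (customer_id = c.customer_id)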
So the planner, as we mentioned, prepares the plan tree with the cheapest path so that it can be given to the executor at the end. That is what the planner does; the output is a plan tree. For that, it considers the statistics on the table to estimate the cost of the access paths and so on. What Postgres has is a cost-based optimizer, and it also has an internal genetic query optimizer. The selection works like this: if there is a small number of tables, in the typical case fewer than 12, it goes for the cost-based optimizer; we can change that threshold. Otherwise it goes for the genetic query optimizer, because with a lot of tables involved in the query there could be a huge number of possible paths, and analyzing all the combinations of executions would be too complex.

So let's do a quick example. Let's create a table with a company ID, company name, and company type, and create an index on it. We start a transaction, insert around 5,000 companies with company type 1 and just five companies with company type 2, and commit the transaction. Now, if we query this table for company type 1, we get a bitmap heap scan. And when we query for type 2, there also we get a bitmap index scan and heap scan. We know these values are highly skewed: we have around 5,000 rows of company type 1 and only a handful of company type 2, yet we got exactly the same execution plan. The plan looks the same, but if you watch closely, a lot of things are different. In the first case, the number of heap blocks fetched is 28; in the second case, it has to fetch only one heap block. And look at the number of rows returned by the index: in the first case it is 5,000, because we have 5,000 rows of company type 1, and in the second case only a few. So the plan is the same, but the execution and the complexity of execution were totally different. Why did Postgres select the same execution plan?

Let's try executing the same statements once again. Again the same statement for company type 1, and this time it is a sequential scan. And once again for company type 2: now it is an index-only scan. Earlier we had bitmap scans, and now one has turned into a sequential scan and the other into an index-only scan. So what could have changed by this time? What really happened is that, by the time we ran the explain again, the autovacuum analyzer in the background had analyzed the table and we got more statistics about it. Now our planner has become more intelligent and can make more accurate, better plans. That's what really happened.

So why the bitmap index scan in the first place? Was it a bad plan? No; the bitmap scan is actually a kind of compromise. It stands in between sequential scans and index scans. Generally, if we see bitmap index scans happening, most probably the statistics need to be collected, or we really are in an in-between situation. So it's a smart selection when we don't have much information, or when the situation is in between and favorable neither for a sequential scan nor for an index scan.
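A sketch that reproduces the skewed-data demo described above; the column names, the generate_series loading trick, and the EXPLAIN options are my approximations of what was shown on screen:

    CREATE TABLE company (company_id int, company_name text, company_type int);
    CREATE INDEX company_type_idx ON company (company_type);
    BEGIN;
    INSERT INTO company
      SELECT i, 'company_' || i, 1 FROM generate_series(1, 5000) AS i;     -- 5,000 rows of type 1
    INSERT INTO company
      SELECT i, 'company_' || i, 2 FROM generate_series(5001, 5005) AS i;  -- just 5 rows of type 2
    COMMIT;
    -- Right after the load, both of these tend to show a bitmap index + heap scan;
    -- once autovacuum has analyzed the table (or after a manual ANALYZE company;),
    -- the first typically becomes a sequential scan and the second an index-only scan.
    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM company WHERE company_type = 1;
    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM company WHERE company_type = 2;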
So what is an index scan? To understand this whole thing, we need to understand the index scan: it fetches one tuple pointer at a time from the index. As we can see on the screen, the blue box is the index, a B-tree index. It needs to be scanned, and then the scan visits the table and pulls out the tuples. This is what happens in a normal index scan. There is one more thing called an index-only scan: in that case, we don't have to visit the table again; the information is available in the index itself, so the data can be pulled out directly. That's what we saw in our example. The sequential scan reads the table from one end to the other, discarding all the unmatched rows. And then there is the bitmap index scan plus bitmap heap scan. This is slightly more complex, because we need to do the index scan first, collect all the matching entries, and sort them. Once that is sorted in memory, we know which heap pages we need to access to get the data, and the heap pages are then visited. In many cases there will be a recheck, because we are fetching whole pages and the rows need to be rechecked to produce the final result, which is then given to the client. This is what happens in a bitmap index and heap scan.

The advantage of the bitmap scan is that the index is scanned, the data is collected and sorted, and then the heap pages are visited in order, block by block. So it improves the locality of reference to the table: it collects all the information from one block, then goes to the next block. That's the really good thing about bitmap scans. But there is a downside: we need to keep all the information about this bitmap structure in memory, so there is a little more overhead than with other scans. Another side effect is that the data is not retrieved in index order.

Now, we talked about cost and how the planner takes care of these things. So what really happens in the analyze phase? We saw that autovacuum did the magic in the background: the autovacuum analyze collected statistics. It stores the information in the pg_statistic table, and there is a view called pg_stats where users can query and find out the current information about the table and details like that. When we do the stats collection, we generally do sampling: from a bigger population we take a sample, and the sample size can be controlled by the parameter default_statistics_target. Generally we talk about all these stats, but users may not be aware of them, so let's see a demo.

Let's create a small table. This is a Pakistan table I created, with a state ID and a city ID. I'm going to insert 10,000 records and another 8,000 records; I'll show you the details. Then I analyze. Now we have the stats collected. If we query this table, we can see that it is not equally populated: the distribution across states is not even, and states 1 and 2 have a lot more cities than the rest. Actually state 2 has the most. So now we know that state 2 has around 5,000-plus cities and state 1 has 4,983, but the other states have around 2,000 each. Now let's look at the stats about this table. As we can see, the most common values are taken and their frequencies are recorded. This is what happens when we collect the statistics: the most common values and their frequencies are updated for the city ID column. So what are the common city IDs?
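To peek at these statistics yourself, a small sketch; the table name places here simply stands in for whichever table was used in the demo:

    -- Most common values and their frequencies, as collected by ANALYZE / autovacuum:
    SELECT attname, n_distinct, most_common_vals, most_common_freqs
      FROM pg_stats
     WHERE tablename = 'places';
    -- Sample size used when collecting statistics:
    SHOW default_statistics_target;                          -- 100 by default
    -- It can be raised per column if finer-grained statistics are needed:
    ALTER TABLE places ALTER COLUMN city_id SET STATISTICS 500;
    ANALYZE places;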
For the state ID, we already saw that states 2 and 1 had the maximum number of cities. State 2 has around 27% and state 1 also has around 27%, and states 5, 6, and 3 are correspondingly smaller. So currently we have statistics in the system saying that state 2 has the maximum number of cities, so the chance of getting a record for state 2 is higher. The selectivity of a filter and the number of records returned from this table can be estimated based on these stats. I hope this is clear.

Coming back to the presentation: now we saw what the stats are all about. We have the most common values recorded, and there are other things like histograms; we'll talk about those later. The stats are collected by the autovacuum worker, and it happens asynchronously in the background while the system is up and running. It results in better execution plans, and we can have a very effective cost-based planning system because of these stats. On the negative side, there is overhead in collecting the stats and maintaining them, and when our sample size is bigger, it tends to gather a lot of data, which requires more space, and analyzing that information can be complex. That's the negative side.

Next, the join part: nested loop joins. This is the simplest thing anyone can do, a loop joining the two tables. The right-side relation is scanned once for every row found in the left-side relation, and the matching rows are joined. This works great if an index scan is possible on the right relation, because every row from the left side needs to be matched against the right side, and the index makes that efficient. So the planner will select a nested loop join in such conditions. Next are hash joins. These are very good for equi-joins: basically, the join values on both sides are hashed and joined, and if the number of records to be joined is small, it can all happen in memory. This is one of the fastest ways to join tables. But what happens if the records to be joined don't fit in memory and have to spill out? There is a different strategy there: the merge join comes into the picture. In that case, the inputs are sorted first and then joined. Here also it is mostly for equi-joins, and it can create temporary files. So if we see merge joins, we need to think: are we dealing with a large amount of data, and can we reduce the amount of data in the joins?

Coming back to the cost. As we mentioned, the cost decides the execution plan. The key point is that the numbers associated with the cost don't have a unit; they are just numbers for comparison. The value 1 means the cost of sequentially accessing one page, and everything else is compared against that, so we don't need a unit. And we have certain parameters by which we can control the cost: basically we can say what the sequential page cost is (seq_page_cost), the random page cost (random_page_cost), the CPU tuple cost (cpu_tuple_cost), or the CPU index tuple cost (cpu_index_tuple_cost). The sequential page cost is the cost associated with scanning a page sequentially; if the page is accessed randomly, it is the random page cost; and the cost associated with processing a tuple is the CPU-related cost. The total cost is also divided into two parts: a startup cost and a run cost, and both CPU cost and I/O cost go into these.
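A quick sketch of the cost parameters just mentioned and where the numbers show up. The defaults below are standard PostgreSQL values; the EXPLAIN line is illustrative output against the company table from the earlier demo, not a recorded result:

    SHOW seq_page_cost;          -- 1     : the baseline "unit" of cost
    SHOW random_page_cost;       -- 4     : a random page read is assumed four times dearer
    SHOW cpu_tuple_cost;         -- 0.01  : cost of processing one row
    SHOW cpu_index_tuple_cost;   -- 0.005 : cost of processing one index entry
    -- In EXPLAIN output, cost=STARTUP..TOTAL is expressed in these unitless numbers:
    EXPLAIN SELECT * FROM company WHERE company_type = 2;
    --   Index Scan using company_type_idx on company  (cost=0.28..8.30 rows=5 width=40)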
And the beautiful thing in Postgres is that we can tune all these parameters at the user level or at the session level, and even at the tablespace level. I have given a sample CREATE TABLESPACE statement where we specify the cost associated with the tablespace, for when the tablespace is sitting on a different type of file system or a different type of storage. Generally we do this for archived data: data kept for archival purposes is put on slower storage, so we have a different cost structure for it.

We saw the most-common-values stats earlier, and we have a similar kind of model for histograms as well. Let me show you something; let me keep the screen clean. I dropped and recreated a table, inserted one set of records, and analyzed the table, and now we can see the histogram associated with it. We have around 2,000 records inserted, equally distributed, so in the histogram bounds we can see the boundary values evenly spaced up to 2,000. This distribution histogram gives an idea of how the data is distributed. But what happens if we insert a lot more values, say between 100 and 1,000, multiple times? If we analyze again, the histogram changes. Now we have more values between 100 and 1,000, and as we can see in the histogram, that area has become denser: the range from 100 to 1,000 occupies most of the histogram. This is how the histogram is used internally.

So we saw most common values and histograms. In the latest versions, Postgres has more extended statistics beyond the default ones; we can create additional statistics. For example, between a city and a state there is a correlation. If there is a dependency between columns, a functional dependency can be defined using extended statistics, so that we give the planner a better idea about the data distribution. We can also add data about the distinct values, and we have multivariate stats, so it is not just single-column statistics; data is collected across columns as well. But all of these are more advanced; use them only in cases where they are essential, because they create extra statistics data in the system which needs to be analyzed and maintained.

When it comes to the execution plan, another factor we need to consider is parallel execution. There are parameters associated with that; I am not going to get into the details. It is great for OLAP workloads, for complex queries. Another thing is just-in-time (JIT) compilation. This is new in Postgres 11, and by default it is enabled in Postgres 12, but use it with caution; it may not be useful in all cases. If there is a very complex query which does a lot of computation and the expressions can be compiled down, then in such cases JIT shows very good improvement, but not for all. So enable all these things only after testing. These are some of the cases where it can go wrong: as we can see on the right side, parallel execution combined with just-in-time compilation sometimes creates a lot more functions to compile. Here we can see it is 34 functions, but without parallelism it is just 9 functions, so there is no saving. So, as I mentioned, this combination can have a negative impact, and in most general cases turning it off may be the right thing to do unless proven otherwise.
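Sketches of the three features touched on above; the tablespace path, the statistics name, and the column names are hypothetical, not from the slides:

    -- Per-tablespace cost overrides, e.g. for archived data on slower storage:
    CREATE TABLESPACE archive_ts LOCATION '/mnt/slow_disk/pgdata';
    ALTER TABLESPACE archive_ts SET (seq_page_cost = 2.0, random_page_cost = 8.0);
    -- Extended statistics describing a dependency between correlated columns:
    CREATE STATISTICS state_city_stats (dependencies, ndistinct)
        ON state_id, city_id FROM places;
    ANALYZE places;
    -- JIT compilation can be toggled per session while testing its effect:
    SET jit = off;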
So now we understand the entire complexity associated with parsing, analysis, and coming up with a good execution plan. The whole pipeline from parsing up to planning can be saved by preparing the statement in advance; these are called prepared statements. As shown in the diagram on the right side, planning took more time and effort than actually executing the query, and we tend to see this in the real world; these numbers are in fact taken from real-world cases. So use prepared statements wherever possible. Even for a very simple query, say a SELECT count(*), the planning may sometimes take more time than the execution, so in such cases the plan needs to be cached; doing the entire planning again and again won't be efficient. Just like explainable statements, the statements that can be prepared are also very few: out of all the statement types, generally SELECT, INSERT, UPDATE, and DELETE statements can be prepared. This is an example of how easily we can prepare a statement: PREPARE a plan with a name and the SELECT statement, then we can just EXECUTE the prepared statement with a parameter; the parameter is passed in and the statement is executed.

So, again a demo. We are selecting company type 1; as we know, it's a sequential scan. We already know that the planner keeps switching the plan: it was a sequential scan, now it is an index scan for company type 2, and it goes back to a sequential scan again for company type 1. This is the advantage of planning it every time. But what happens with a prepared statement? I quit the session, connected to a new session, and prepared a plan. Now if I execute the prepared statement for company type 1, it goes for a sequential scan. The question is whether it will change to the more efficient index scan if the parameter is 2. The answer is yes, it switches. So even for a prepared statement, the last stage of planning is still happening: it switches back to the sequential scan and again to the index scan. The parse, analyze, and rewrite stages are removed, but the final planning and execution still remain.

Now, what happens if we execute that statement repeatedly? Again, prepare the statement and execute the same statement multiple times, five times, and all five times it was a sequential scan. Now if we execute the plan with parameter 2, it is not actually switching the plan; it is still showing a sequential scan. Why is that happening? The rule is: after five or more executions, if the custom plans being produced are not estimated to be cheaper than the generic plan, the generic plan is selected. That's the reason why it was not switching the plan in that case. But the reverse won't happen. Say, for example, I execute in the other order: if I execute the plan for parameter 2 multiple times it is an index-only scan, but if I then execute for parameter 1 it switches back to a sequential scan. Why? Because the custom plan for each of these cases is so much cheaper than the generic plan that it goes for the custom plan. These concepts, the custom plan and the generic plan, need to be understood when we deal with prepared statements. The five-executions rule was the only behavior before Postgres 12; in Postgres 12 we can control this. There is no need to wait for five executions for a generic plan to kick in: we can force a generic plan or force a custom plan. This is the Postgres 12 feature by which we can control the behavior; I don't have time to go into a demonstration.
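A rough sketch of what such a demonstration might look like, reusing the company table from earlier; the statement name is made up:

    -- Plan once, execute with different parameter values:
    PREPARE company_by_type (int) AS
        SELECT * FROM company WHERE company_type = $1;
    EXPLAIN EXECUTE company_by_type(1);   -- common value: a sequential scan is chosen
    EXPLAIN EXECUTE company_by_type(2);   -- rare value: the plan can still switch to an index scan
    -- After repeated executions, the cached generic plan may take over;
    -- from Postgres 12 onward, the custom-vs-generic decision can be forced explicitly:
    SET plan_cache_mode = force_custom_plan;   -- or force_generic_plan, or auto (the default)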
Another question that generally comes up: we saw prepared statements, but what about functions, PL/pgSQL functions or SQL functions? In this case we can see the same thing implemented as a PL/pgSQL function, and the next one is an SQL function. Is this better, or is the prepared statement better? The thing is, PL/pgSQL can handle much more complex business logic, it has a lot of internal optimizations, and it can hold prepared statements internally. So even though the PL/pgSQL execution engine is separate, in many cases the PL/pgSQL engine effectively gives a better result than doing it manually. But prepared statements are definitely very efficient, because the plan is cached at the session level. So if prepared statements are not possible, you may have to look into whether PL/pgSQL is better. Plain SQL functions don't have any plan-caching mechanism, and generally we see poorer performance there. All these things need to be tested in your specific cases, but just be aware that these are the options available to be tested. As I just mentioned, the PL/pgSQL interpreter parses the function source text and produces an internal binary instruction tree the first time the function is executed; that's why it speeds up.

And what is there in Postgres 11 and 12 when it comes to the dynamic nature of execution plans? Even after an execution plan has been prepared, at run time some of the steps won't be executed: as I highlighted, a few of the scans on some of the partitions are never executed, based on the values and the join condition. And in Postgres 12, in the bottom slide, we can see that in many places the Append stage itself is removed, so it's a smarter way of preparing the execution plan. So at run time as well as at plan time, a lot more optimizations happen in the newer versions, Postgres 11 and 12, and in Postgres 13 we have even more options to analyze things. We all know that we can look at the buffers in an explain plan with ANALYZE and BUFFERS; in Postgres 13 there is a provision to see how much effort goes into the planning itself, how many shared buffer hits happen in the planning phase. The details are visible if the format is JSON, while the normal text format has a single line with less information. Postgres 13 also has a lot more optimizations in terms of planning and execution: there are disk-based hash aggregation related patches, and WAL generation information is also there in the explain plan. All these things are new, but until it is released there is no commitment; these are what is expected.

Here are some recommendations for further reading and some of the references I used in preparing these slides. That's all from my side, and thank you for joining. Thank you all, have a nice day. Wonderful, thank you for such a great presentation, and thank you to all of you for joining us today. I hope to see you at the next Postgres conference webinar.