So, if you remember what your instructor Andrew Crotty looks like — unfortunately, I'm not Andrew Crotty. My name is Lin Ma, and I'm a postdoc at the CMU Database Group. I'm pretty sure that at this point you'll all start missing Andrew already, but I will be the instructor covering the lectures for the rest of this semester. So why me? Well, that's because I've been learning about databases for many years. I was a PhD student at this very university, CMU, working with Andy to learn about databases. That's when I started my PhD journey in Andy's office. Then I learned a lot more about databases — still in Andy's office. And finally, I just learned too much about databases. So here I am today, talking about databases to you guys.

First, a little bit of administrative stuff. The midterms are all graded, and I believe we have already posted all of your midterm grades on Canvas. I'm not sure whether it's synced correctly with S3 yet, because we have had some syncing issues, but we've posted all of them on Canvas, so you should be able to check them. If you have any questions about your exam — any concerns about a grade, et cetera — you can come to either my or Andrew's office hours. We'll have all the exam papers stacked there, and you can take a look if you have questions or concerns. Also, I was traveling before, so my office hour was on Zoom; now that I'm back, my office hour will be tomorrow, 2:30 to 4, in the same place as Andrew's office hours — essentially Andy Pavlo's office. And again, as we announced on Piazza, we have postponed the due date of Project 2 to the 21st, because of the various policy changes I mentioned on Piazza. Because of that, we have also postponed the release date of Project 3 to October 20th, which is a Wednesday. Sounds good? Okay.

Now let's get to today's lecture. Today we're going to talk about query planning and query optimization. If you remember from earlier in the semester, SQL is really a declarative language. What does that mean? It means that when you want to ask the database a question, you just tell the database at a high level what you want to retrieve — the results of some columns based on some predicates, for a given specified month or year, et cetera — but you don't tell the database how to execute the query, or how to execute it efficiently. We have covered lots of different algorithms and implementations in the database — we have talked about different join algorithms, et cetera — and you can probably tell that different options, different implementations, different ways to execute a query can end up with very different performance. So it's very, very important for the database to pick the best way to execute a query and execute it efficiently, so it can give you the correct answer in time. This is a very important problem for the database system, and as far as I'm concerned it's still an unsolved problem: there is still a lot of effort going into this area, and both enterprise and open-source databases still put lots of work into it. So let me talk a little bit about history.
Like many other concepts in databases, query optimization dates back to the 1970s, the very early days of database systems. When the inventor of relational algebra, Ted Codd, wrote his paper on the relational model, he didn't actually specify how to implement any of it. Then, in the 1970s, a research group at IBM — a bunch of PhDs in mathematics and computer science — saw this paper and decided: hey, this is actually a good idea, let's build a system based on it. I don't know whether Andrew covered this story earlier in the class, but essentially each of these IBM researchers picked a specific part of the database system. One person picked the language, which became SQL. Another picked the checkpointing and logging algorithms. Others picked two-phase locking. And then there was a very renowned researcher called Pat Selinger — she picked query optimization, and many of the concepts and methods about query optimization that we are going to discuss today were already developed, or at least explored, in the original System R in the 1970s.

One interesting thing is that back in those days, people were actually very skeptical about the idea of a query optimizer. They argued that a machine — a computer program — could never come up with a better way to execute a query than a human who specifies exactly how the system should execute it. What does this sound like? It's similar to the debate about compilers for high-level languages. When the language C was first invented, many people argued: why do you want a high-level language like C? No compiler can generate assembly that is more efficient than what a human expert can hand-write. But over the years, people found out it's actually a good idea to have a high-level language, even if it doesn't produce the most efficient possible assembly, because with all the features and functionality the language provides, people can use it to do more interesting things. Now we have even higher-level languages like Python or Scala. So at the time it was considered a radical idea, and there was no textbook on query optimizers — many of the concepts and methods were just developed by the researchers there, and then improved over time by people like us. By the way, that system is called System R. And lastly, I want to mention that many of the assumptions made in System R at that time are still carried over today. Like I mentioned, query optimization is a pretty difficult and pretty important problem, and people came up with simplifications and assumptions to make the problem tractable. Many of those assumptions are still in use today, and essentially that means we may not be able to find the truly optimal way to execute a query, because of all the simplifications and assumptions we make — but it's a good enough approximation in most cases.

So next: how do we actually solve query optimization? In general, there are two types of methods that we are going to discuss in this class, and these are also the two types of methods used in systems today.
The first is called heuristics, or rules. Essentially, you write simple policies — either as if/else clauses, or with a more complex rule engine that helps you manage them — and you use these specific clauses and heuristics to rewrite the query plan. You can do this without actually looking at the data: you work at the level of the SQL query, perhaps going a little deeper to parse it and look at the syntax tree of the SQL query, but without looking at the data at all you can use heuristics and simple rules to approximate what a preferable way to execute the query would be. These could be simple things like rearranging the predicates a little bit, or doing some preprocessing for the obvious optimizations in the predicates.

Of course, there are more complex decisions that you just cannot make with simple heuristics. That's what we solve with the second type of approach, the cost-based approach. There are complex questions like: if there is a three-way join, which two tables do you join first, and which do you join next? And when there are choices like whether you should use a sort-merge join, a nested-loop join, or a hash join — those questions are very difficult to answer with simple heuristics or rules. In that case we use a cost-based search: a cost model plus an enumeration mechanism to explore the different, more complicated alternatives of the query plan, and figure out the better choices among them. Usually a mature system has both of these mechanisms in place, but some newer databases that are still in the early stages of development may only have the first type, because that's obviously easier to build.

Now let me give you an overview of how this query optimization process takes place. I'll warn you that at this point it may feel a little abstract, because it's an overview. I'll try my best to give you examples and walk you through the process, but it's a fairly complex problem with quite a few steps, so it might be a little abstract at first; as we go further into the lecture it will become clearer. So, at a high level: say you wrote a SQL query, just like you did in the first assignment. When the SQL query first arrives at the database system, you technically have the option to look at just the raw string of the SQL query and tweak things — rearrange the positions of tokens and clauses, et cetera. But it's very difficult at this stage to rewrite the SQL query into a more efficient form, because the raw SQL string just doesn't carry that much information. There are papers saying you could do it, but in practice people rarely start optimizing at this early stage. What would people typically do instead? They parse the SQL query into an abstract syntax tree. That's just a straightforward compilation technique you learned in your compilers class: you look at the tokens — which are keywords, and which are the names of tables and columns, et cetera.
You generate the syntax tree for this query, and then what do you do? You bind this syntax tree to the information in the database system. Typically there is a component in the database system called the system catalog — essentially the metadata about the system. It stores the name of each table, the name of each column, the column IDs and table IDs, et cetera. The binder step binds the tokens in the abstract syntax tree — especially the names of tables and columns — to the information in the system catalog, so that at this point you know the semantics of the query: at a high level, you know what this query is trying to do. By the way, neither the parser nor the binder does any complex optimization at this point; this is just straightforward compiler technique to look at the query and understand its syntax.

The result of this binder step is what we call the logical plan of the query, and that's where we start the optimization. One important thing I need to emphasize here is that for one specific SQL query, there can be multiple possible valid logical plans. The logical plan just captures, at a high level, what the SQL query is trying to do, and at this step the system just generates whatever logical plan is most convenient to generate, even though there are multiple alternatives. That means if there is a multi-way join — for example, A join B join C — and you wrote the join order as A join B join C, then at this point you will likely get a logical plan that just follows that order. It will not reorder things, it will not try to optimize the predicates, et cetera. Just remember that there can be multiple equivalent logical plans, but at this point, before optimization, we generate a single logical plan that is valid — that's it.

The next step is the tree rewriter — you can also call it the logical rewriter, same thing. This is actually pretty common in databases, and I think it's fair to say that most databases have this step to rewrite the logical query plan.
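Since both the binder and the rewriter lean on the system catalog, it helps to remember that the catalog is itself just a set of tables you can query. As a minimal sketch — assuming a Postgres-style system and a hypothetical student table — the kind of metadata the binder resolves names against looks like this:

```sql
-- List the user tables the binder can resolve table names against.
SELECT table_name
  FROM information_schema.tables
 WHERE table_schema = 'public';

-- Look up the columns (and their types) of one hypothetical table.
SELECT column_name, data_type
  FROM information_schema.columns
 WHERE table_name = 'student';
```

Note that this exposes only names, types, and constraints — exactly the kind of information available to the logical rewrite phase — not row counts or value distributions.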
What the rewriter does is, again, look at the information in the system catalog — what the columns and tables are — and start applying the simple heuristics I mentioned earlier to prune out obviously bad execution choices. We'll see more examples later, but the thing to remember is that at this point we are purely looking at the structure of the logical query plan plus the metadata in the catalog. We are not looking at the data at all — no information about how many rows there are, no properties of the data distribution, none of that yet. These are simple, logical heuristics. Then, after the logical rewrite, the optimized logical plan is sent to the query optimizer — and here we use the name "query optimizer" specifically for the cost-based plan enumeration, which is the most complicated step, if you will. It performs the advanced optimizations, using the information in the system catalog plus a cost model and statistics that look at the actual data in the system: the distribution of different columns, how many rows are in each table, the number of distinct values, et cetera. As I mentioned earlier, this is where we make the complex optimization choices. Lastly, the resulting physical plan is sent back to the system to execute. Any questions on this overview so far? Okay.

A little more on distinguishing the logical plan from the physical plan. Essentially, the job of the query optimizer is to generate a physical relational algebra representation that the system can execute in the most efficient manner. We have this distinction between logical and physical because at the physical level there are so many choices the system can explore that the search space is just huge. For example, different join orderings are a complex decision with many alternatives, and there are different join methods — sort-merge join, hash join, nested-loop join, et cetera. If you tried to enumerate all those alternative physical plans directly, the search space would be too big. So before we even get to those complex choices, we look at what the query is trying to do at a high level — just the predicates and the join structure — and if there are simple rules we can already apply to eliminate the obviously bad choices and shrink the search space as much as possible, that makes the later physical enumeration much more targeted and efficient.

Another note: usually there is a one-to-one mapping from logical operators to physical operators in the plan. For example, the logical plan may just specify, at a high level, that table A joins table B — that's all that shows up in the logical plan. The physical plan will say that table A joins table B using a hash join, under particular properties — maybe whether you compress the data or not, et cetera. It's usually a one-to-one mapping; the physical plan is just much more specific, naming the particular algorithm and implementation choices.
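As a hypothetical illustration of this logical-to-physical step: in Postgres you can use EXPLAIN to see which physical operators the optimizer picked for a logical join (the student and enrolled tables here are assumptions, not a real schema):

```sql
-- The logical plan only says "join student with enrolled, then sort by sid".
-- EXPLAIN reveals the physical choices: e.g., a Hash Join or a Merge Join,
-- and whether a separate Sort node is needed at all.
EXPLAIN
SELECT s.name, e.cid
  FROM student AS s
  JOIN enrolled AS e ON s.sid = e.sid
 ORDER BY s.sid;
```

If the optimizer happens to pick a merge join on sid, its output already comes out sorted — which previews the exception discussed next.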
But there are some exceptional cases, if you will, where multiple logical operators map to a single physical operator in the plan. For example, suppose you have a logical join operator joining table A with table B, and then a logical order-by operator sorting on some column. If you choose a sort-merge join as your physical operator, you can perform the join and the ordering — two logical steps — in a single physical sort-merge join. So in that case, even though there are two logical operators, you may end up with only a single physical operator. Make sense?

All right. More comments on the overall query optimization problem: it's an NP-hard problem. I think it's arguably the most difficult problem you have to solve in a database system, and again, to my understanding it's still an unsolved problem today. Many people, including commercial database vendors and open-source communities, are still putting lots of effort into this space. To give an example, based on what I know, the query optimizer in SQL Server — the flagship database product from Microsoft — by itself contains more than a million lines of code. So it's pretty complex; lots of effort there, but lots of opportunity as well, I should say.

Another question many people ask these days: since this problem is so difficult, have people considered using the new machine learning — neural networks, artificial intelligence, whatever — to help? The answer is obviously yes; there is a lot of recent research on this. There are many interesting directions, but based on what I know, the application of machine learning techniques in this space is still very limited. People are still exploring the option, doing lots of experiments — it's at an experimental stage at this point. In the foreseeable future, machine-learning-based approaches are probably not going to replace the high-level structure of the query optimizer, even though there are specific sub-problems in query optimization where machine learning can help; the overall optimizer structure we're going to talk about in this course will probably remain as it is. As an example, IBM DB2 had an early project called LEO in the early 2000s that used learning to help with a specific sub-problem in query optimization called cardinality estimation. But after a few years they actually abandoned that project — because when the machine learning model works, that's great, but when it doesn't work, why doesn't it work? It causes lots of trouble for people to figure out what's going on when the model misbehaves, and how to fix it. That said, from what I know, a year or two ago IBM actually revived that project and started looking at it again. So this is still an ongoing effort, and at some point machine learning could be a part that
makes the optimizer more efficient. Okay, so now today's agenda. Most of today's lecture will focus on the first type of optimization I talked about earlier: the heuristics, or rules, that you can directly apply at the logical operator level. We'll talk a little bit about equivalences in relational algebra, which let us explore different alternatives; then about logical query rewriting — how we unnest nested queries, and how we do some simple optimization on predicates ("expression" here mostly means the expressions in the predicates). I think we'll still have a little time after that, so I'll give a heads-up on the cost-model-based approach — the second type of approach to query optimization — but we will focus on that more in the next lecture.

An important concept before we talk about any optimization is relational algebra equivalence, because if you want to optimize a query, you have to guarantee that no matter how you optimize it, the query is still correct — otherwise, what's the point of optimization if you generate wrong answers? The definition of "correct" is based on this equivalence property of relational algebra. Specifically, two relational algebra expressions are considered equivalent if they generate the same set of tuples at the end of the day — and (it's not written on the slide) also in the same order. The database system then tries to find the better alternatives among all those equivalent relational algebra expressions for executing the query. That is essentially the job of query optimization; we can also call it query rewriting.

The first example — and probably the most common one we'll use — is called predicate pushdown. I think this is implemented in essentially every mature database system out there. To illustrate it, say we have a simple query where we join two tables, Student and Enrolled, then filter on the grade of the students, and project out the name and the ID. If we just execute it as written, we can draw the relational algebra expression, which looks essentially the same as the logical query plan of this query — this is the original algebra expression. As you can see, we have a join of Student and Enrolled, then we filter on the grade, and then we project the name and ID. Again, as I mentioned before, this logical query plan is what shows up immediately after the binder step: even though there are multiple valid alternative logical plans, we just pick the one that is convenient and start our optimization from there.

So if you look at this logical query plan — equivalently, the relational algebra expression — what can we optimize? One obvious thing: we can push the predicate that filters on grade level 'A' below the join, so that right after we access the Enrolled table and read its data, we can apply the filter before we ever perform the join.
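For concreteness, here is a hedged sketch of the kind of query being described — the exact column names (sid, cid, grade) are my assumption, not something the lecture specifies:

```sql
-- Original form: in the naive logical plan, the filter sits "above" the join.
SELECT s.name, e.cid
  FROM student AS s
  JOIN enrolled AS e ON s.sid = e.sid
 WHERE e.grade = 'A';
```

Predicate pushdown rewrites the plan so that the grade = 'A' filter is applied while scanning enrolled, before any join work happens; the SQL text stays the same — only the plan changes.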
Now we apply this filter early, so that we keep only the tuples whose grade equals 'A' before we perform the join. When we perform the join, we are looking at a much smaller number of tuples. And the join — well, in the worst case a join can be a Cartesian product — but after we filter on grade 'A', the number of tuples we feed into the join operation is much smaller, so the join can be much more efficient. And to reiterate the concept of relational algebra equivalence: the two expressions we just considered — the first does the join, then the filter, then the projection; the second does the filter on Enrolled first, then the join, then the projection — generate the same set of tuples, in the same order. They are equivalent relational algebra expressions.

More on predicate pushdown. The example we just showed was pretty simple — there's only one filter in the predicate. But what if there are multiple filters? The predicate could very well be: the grade is 'A', and the enrollment date is such-and-such, or maybe you are targeting students from a specific department. When the system receives a query, the predicate can contain multiple filters. What we do in this case is: for a set of predicates combined with a conjunction, we break them down into individual predicates, and push each individual predicate down to the position closest to where we access its table. Back to the example: if you have a predicate on the Enrolled table — grade equals 'A' — you push that predicate down to that table's scan. If you have another predicate on Student — say, filtering students by a certain department — you break the two predicates apart, push the Student predicate down to the Student scan on the left, and push the grade predicate down to the Enrolled scan on the right. So when you access those tables, you filter out as many tuples as possible, as early as possible.

Lastly, there are simplifications you can do just by looking at the predicates themselves. Take this example: if the WHERE clause contains the predicates x = y AND y = 3, you can use relational algebra rules to rewrite this to x = 3 AND y = 3. This may seem a little naive — there doesn't look like much of a difference — but it can actually be very useful, because the two columns x and y may come from different tables. If you keep the predicate in the first form, x = y AND y = 3, you can only push down one of the two filters: the predicate x = y spans both tables, so there's no way to do early filtering with x = y on either table alone. But if you rewrite it to x = 3 AND y = 3, and x and y come from different tables, you can push the filter on x down to the scan of one table and the filter on y down to the scan of the other, and filter out many tuples early.
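A small sketch of that rewrite, with hypothetical tables r and s owning columns x and y respectively (my naming, for illustration only):

```sql
-- Before: x = y cannot be evaluated on either table alone,
-- so only y = 3 is pushable.
SELECT * FROM r, s WHERE r.x = s.y AND s.y = 3;

-- After constant propagation: each conjunct now touches a single table,
-- so both scans can filter early (and r.x = s.y is implied).
SELECT * FROM r, s WHERE r.x = 3 AND s.y = 3;
```

The transitivity step (x = y and y = 3 implies x = 3) is what unlocks pushdown on both sides.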
With these relational algebra equivalences you can also do more complex things, like rewriting join orders using rules such as commutativity and associativity. There are some examples here, but at a high level, with those rules you can generate lots and lots of alternatives. With these join enumeration rules — commutativity and associativity specifically — the number of possible join orderings for an n-way join is approximately a Catalan number, which grows roughly like four to the power of the number of relations. At that scale you can't just use simple heuristics or rules to find the better alternatives. So for more complex relational algebra equivalences like this, we don't handle them in the first type of optimization with rules and heuristics; that's what we handle in the second type of optimization, with a cost model and a more principled search mechanism.

Lastly, you can also do early projection with relational algebra equivalences. This, again, is something that can be done in the logical rewriting phase — you don't need the cost model yet. I'd say this is mostly useful for row stores, as compared to column stores. Essentially, if you have a final projection — like the earlier example where we only look at the ID and the name of some records in some table — then we don't have to select everything: when we access the data, we can read only the columns needed for the eventual projection, instead of reading everything. This matters more for row stores than for column stores, because in a column store the columns are stored separately anyway — you just record the ID of each row you're going to fetch and do late materialization.

To give a specific example: here we have the same query with the join of Student and Enrolled, and in the end we only care about the name of the student and the ID. So instead of selecting everything — every column from Student and Enrolled — we do the projection earlier: we read only the ID and the name from the Student table, and only the two ID columns we care about from the Enrolled table. Again, this may seem very simple, but consider the case where your Student table is very wide — it might contain a thousand columns of various information about each student. In that case, instead of reading a thousand columns for every tuple, you read just these two columns, and that can significantly save execution time.
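A hedged sketch of what the early-projection rewrite buys you, reusing the assumed student/enrolled schema from before:

```sql
-- Conceptually, the rewriter narrows each scan to the needed columns before the join:
SELECT s.name, e.cid
  FROM (SELECT sid, name FROM student)  AS s   -- not the full, possibly very wide row
  JOIN (SELECT sid, cid  FROM enrolled) AS e
    ON s.sid = e.sid;
```

In a real system this shows up as the scan operators being annotated with an output column list, rather than as literal subqueries — but the effect is the same: less data flows up the plan.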
So, just to recap what we've talked about: we discussed different ways to rewrite queries based on the logical query plan, and the important thing we did is that, with these simple rules, we reduced the search space that the later physical plan enumeration needs to search over. These are simple heuristics, very efficient to apply, and they already reduce the search space significantly — and at this point we have not consulted a cost model at all; we have not looked at the data or column distributions at all. Of course, there's another thing I want to mention: because we use heuristics and simple rules to change the logical query plan, we are not actually guaranteed to always generate a better plan, let alone an optimal one. Based on our intuition and our understanding of how the database system works, most of the time these rules produce a better query plan, but there can be cases where pushing down a predicate or rewriting a particular expression makes things a little worse. That's just the price we pay.

Okay, before I talk about some more complicated optimizations, are there any questions so far on the query rewrites we've covered? [Student asks about the wording on the slide.] Yes — I should mention the wording on the slide is probably not precise; it should say "ordered list". It's not very precise and should be fixed. Anything else? All right, then let's continue.

Okay, some more involved optimizations we can do, still in this logical query rewrite phase. Like I mentioned earlier, if a query has a predicate that contains multiple filters, one thing we can do is decompose the predicate into those filters so that we have more opportunities for optimization. Here's an example — actually reusing one of the earlier examples. This is a three-way join of Artist, Appears, and Album, and we have a complicated predicate that contains many clauses, including a specific filter requiring the album name to match a specific string. As you can tell, if we just take the naive logical query plan generated after the binder, this predicate is one big blob, and it's difficult to optimize further. So what do we do? We split the predicate into separate filters — the three filters this predicate contains — and then we look at each filter individually and try to move it to the lowest possible point in the query plan. Why do we want that? Because the lower the predicate sits, the earlier we filter out unnecessary tuples during query processing, so the overall query processing overhead is lower. In this case we push each filter to its corresponding place: for example, the rightmost filter on the name of the album can be applied directly as soon as we access the Album data.
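As a sketch, here is the shape of that query and its rewrite — the schema and the album string are placeholders, since the lecture doesn't give the exact text:

```sql
-- Naive form after the binder: one big conjunctive predicate over a 3-way product.
SELECT artist.name
  FROM artist, appears, album
 WHERE artist.id = appears.artist_id
   AND appears.album_id = album.id
   AND album.name = '<some album>';

-- After splitting and pushing down, the same query is effectively:
SELECT artist.name
  FROM artist
  JOIN appears ON artist.id = appears.artist_id   -- equality filters become join predicates
  JOIN album   ON appears.album_id = album.id
 WHERE album.name = '<some album>';               -- applied right at the album scan
```

The second form is where the next step of the discussion arrives: equality filters between tables turn Cartesian products into inner joins.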
And on the join between Artist and Appears: instead of doing a Cartesian product — blindly matching tuples between the Artist table and the Appears table — we can push the filter on the artist ID and the appears ID right next to that join. What does this enable? Not only does the filter move closer to the data access, but we can also replace a naive join operation — a Cartesian product, if you have no filters — with an inner join, once we have a join predicate, which is much more efficient. Essentially, like I just said, instead of producing the Cartesian product and filtering out tuples afterwards, we directly combine the join with the predicates — the filters we split out earlier — so that it can be executed as an inner join, which is much more efficient than a naive Cartesian product. And again, all of this is logical: just by looking at the syntax of the query and the catalog information — which columns belong to which tables — you can figure all of this out, even though you don't know the actual contents of the tables. [Student question.] Yeah — essentially, that's the same thing.

Okay, and like I mentioned earlier, there's another optimization you can do here: early projection, where you look at the columns you will eventually read or return in the results, push those projections down to the lowest possible point in the query plan, and avoid reading every column of the table from beginning to end — you select as few columns as possible, as early as possible. Any questions on this example? Cool.

Next, another optimization you can do in this logical query rewrite phase deals with nested queries. I'm pretty sure that from your earlier homework — homework one, I think — you are already familiar with nested queries; many of you have used them. By default, what a database system does with a nested query is treat the inner query as a black box: a function that takes parameters and returns either one value or a set of values. Of course, treating it as a black box can be very inefficient. There are two possible ways to accelerate this. One: you rewrite the inner query so that it combines with the outer query and there is no nesting anymore — this is often called flattening. Two: you take the inner query out of the outer query — execute the inner query separately, ahead of time, and store the values — so that later you can directly substitute the values into the original query instead of executing it as a nested query. Let me expand. For the first approach, flattening, here is a different example — I think it's from the Sailors dataset: you want to find the sailors that have any reservation on a specific day.
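Here is a hedged reconstruction of that query and its flattened form — the Sailors/Reserves schema (sid and day columns) is my guess at the classic textbook setup, and the date is a placeholder:

```sql
-- Nested form: the inner query is correlated (it references S.sid from the outer query).
SELECT S.name
  FROM sailors AS S
 WHERE EXISTS (SELECT * FROM reserves AS R
                WHERE R.sid = S.sid AND R.day = '2021-10-25');

-- Flattened form: the same question expressed as a single join.
SELECT DISTINCT S.name
  FROM sailors AS S
  JOIN reserves AS R ON R.sid = S.sid
 WHERE R.day = '2021-10-25';
```

(The DISTINCT guards against a sailor with several reservations that day appearing twice — exactly the subtlety a student raises about the slide a bit later.)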
If you look at the query, the inner query in the WHERE clause is trying to figure out whether this sailor has a reservation on a specific day. And you can see that the inner query references a column from the outer query — it's really just trying to pair up the two tables, to check whether a sailor (a record in the outer query) has a corresponding record in the Reserves table on that day. So what can we do here? This really looks like a join. Instead of writing it as a subquery, we can look at the two tables together and use an inner join on the sailor ID to find whether there is a matching record on that specific day. What's the benefit? After the rewrite, the second form is one single inner join. Before the rewrite, in the first form, you have to re-execute the inner query for each individual tuple of the outer table. Say the Sailors table has a billion records — a billion sailors — then for each individual sailor you have to execute the inner query's WHERE clause once. That can be very inefficient if the Sailors table is large. With the inner join, on the other hand, it's much more efficient.

That's the case where you want to flatten the query and use a join instead of executing the inner query over and over again. But there are other cases. Here is a more complex example on the same dataset: we want the sailors with the highest rating who have at least two reservations of red boats, and we project the sailor ID as well as the earliest date on which the sailor has a reservation for such a boat. It doesn't matter that much what this query is exactly trying to do; what we focus on is the inner query, which computes the maximum rating among all the sailors in the table, while the outer query has a predicate that filters based on that. What we can observe is that this inner query doesn't reference any information from the outer query at all — no columns, no predicates, nothing. It can be executed entirely by itself. So instead of executing this MAX inner query over and over, once per outer tuple, we can pull the inner query out, execute it once ahead of time, store the result in a temporary table, and when we finally execute the outer query, just substitute in the result. That way we don't need duplicated executions of the inner query.
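A sketch of that decomposition, again on a guessed Sailors/Reserves/Boats schema — the query text is my reconstruction of the classic example, not the slide itself:

```sql
-- The uncorrelated inner block is hoisted out and evaluated once:
CREATE TEMPORARY TABLE maxrating AS
SELECT MAX(rating) AS maxr FROM sailors;

-- The outer block then reads the stored value instead of re-evaluating
-- the aggregate for every candidate sailor:
SELECT S.sid, MIN(R.day)
  FROM sailors AS S, reserves AS R, boats AS B
 WHERE S.sid = R.sid
   AND R.bid = B.bid
   AND B.color = 'red'
   AND S.rating = (SELECT maxr FROM maxrating)
 GROUP BY S.sid
HAVING COUNT(*) > 1;
```

The exact syntax for materializing the temporary result varies by system; the point is only that the nested block runs once.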
Here is the general rule summarizing what I just discussed: for these complex queries, if there is no reference from the inner query to the outer query, the optimizer can break the query up — lift the inner query out, execute it ahead of time, store the result in a temporary table, and directly substitute the result instead of executing the inner query over and over again. To illustrate what this means exactly: we take this query block — the standard terminology is the nested block — lift it up to be a separate query, execute it, and replace its values in the remaining query, which we then execute directly with those values in place; that second query, again in standard terminology, is called the outer block. Any questions so far on the rewriting, the nested query unnesting, and so on? [Student asks about slide 26: what would the query be after rewriting?] Let's see — SELECT name FROM Sailors... yeah, I think you're right: there should be a DISTINCT there. Yes, I think you're right. All right, okay.

We've talked about nested queries; now I'm going to talk a little about some simple heuristics, or rules, you can apply to rewrite expressions. These are actually more straightforward than the complex query unnesting and flattening. Essentially, this just looks at the predicates — say a predicate is a = 3, or b > 5, et cetera — and checks whether, using simple logical rules, you can tweak the predicate a little to make it simpler, or able to execute more efficiently. It doesn't touch the other parts of the query. There are generally two approaches people use to implement this. Many people just write the rewrites as if/then clauses — a wall of conditionals, one per individual tweak they want to apply to the expressions. Other people have developed a more principled way, if you will: a rule- or pattern-matching engine, where you specify each individual expression rewrite as a policy, and the engine enumerates these policies and re-optimizes the predicates over and over until no single rule is still applicable. That's the more organized way to approach it — I believe CockroachDB did it the second way, but many other systems just use walls of if/then clauses.

Let's look at a first example. Again, we have a simple table in this case: an id column as the primary key, and a val column that is NOT NULL. The first type of optimization is to look for impossible predicates. Say we have the query SELECT * FROM A WHERE 1 = 0. Well, there is no case in which 1 equals 0, so nothing can possibly satisfy this query — without even executing it, we already know that no result will come out; we don't even need to run it. Related to this, we can also look for obviously unnecessary predicates: if there is a predicate 1 = 1, we don't need to evaluate that predicate either. It's always true, so we can simply remove it.
And even though you just removed one predicate, you actually eliminated a whole step — a filter operator — from the query execution. The query becomes much simpler: you can just read the data directly and put it in the output buffer. So that is also a real optimization for this query.

Then there are other simple things — for instance, join elimination. In this case we have a query that joins a table with itself: SELECT * FROM A AS A1 JOIN A AS A2, where the join clause is on id, which is the primary key of that table. That means that in this join, every record matches exactly one record — itself — because id is the primary key. So this join is completely unnecessary: you can eliminate it and just return everything from the table, since every record has exactly one match. There is simply no need to perform the join.

More examples: you can combine join elimination with subquery optimization. This case is a little similar to the subquery example we saw earlier: you have a SELECT query whose predicate contains an inner query under an EXISTS clause, and the inner query under the EXISTS is exactly the self-join on the primary key — the identity join we just talked about. So you can unnest this query: pull the inner clause out, as we discussed earlier, and once the join is lifted out you realize that the join is completely unnecessary. After the unnesting you can do join elimination, and essentially the query is the same as selecting everything from A. Makes sense?

Lastly, another rather simple thing you can do here is merge overlapping predicates. In this case we are selecting star from table A, where the first filter in the predicate is val BETWEEN 1 AND 100, the second filter is val BETWEEN 50 AND 150, and there is a disjunction between them. We can observe that the two ranges overlap, and since it's a disjunction, we can combine them directly. Say a value is 51: it satisfies the first filter and also the second, but we really only need to evaluate it once. So instead, we keep one wider range that covers 1 to 150, and for every single tuple we evaluate just one predicate.
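Collecting those four expression rewrites in one place — a minimal sketch over the assumed table A(id INT PRIMARY KEY, val INT NOT NULL):

```sql
-- 1. Impossible predicate: provably empty, never executed.
SELECT * FROM A WHERE 1 = 0;            -- rewritten to an empty result

-- 2. Unnecessary predicate: always true, so the filter operator is dropped.
SELECT * FROM A WHERE 1 = 1;            -- rewritten to: SELECT * FROM A;

-- 3. Join elimination: a self-join on the primary key is the identity.
SELECT A1.* FROM A AS A1 JOIN A AS A2
    ON A1.id = A2.id;                   -- rewritten to: SELECT * FROM A;

-- 4. Merging overlapping disjunctive ranges.
SELECT * FROM A
 WHERE val BETWEEN 1 AND 100
    OR val BETWEEN 50 AND 150;          -- rewritten to: val BETWEEN 1 AND 150
```

All four follow purely from the predicate text plus catalog facts (the primary key, the NOT NULL constraint); none requires looking at the data.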
Okay — these are all the simple expression rewrites, or "tweaks" as I'll sometimes call them, that you can apply to optimize performance, again at the logical rewrite stage. Before we go a little deeper and I give you a heads-up on the cost-model-based approach — which we'll focus on more next class — are there any questions on logical rewriting so far? [Student asks a question about the slide.] Yes — but like I said, in the logical rewrite phase you can access the information in the metadata, so you can know that; you just don't have any information about the data or the data distribution yet.

Okay, to reiterate this slide a little: we talked about the simple heuristics and rules we can apply to eliminate obviously bad choices, but there are more involved optimizations that just cannot be done with simple rules. For example, with a three-way or four-way join, you want to figure out whether to join A and B first, or B and C, or C and D, and you have different join implementations to choose from — sort-merge join, hash join, et cetera. With simple heuristics it's just very difficult to know which would be the better choice. In those cases we use a more organized, cost-based approach: first we have a cost model, so that we can estimate the execution cost of a query plan in more detail, and then we leverage a search mechanism to help us efficiently explore, or enumerate, the different complicated alternatives of a query plan; we cost each one and try to identify which plan among all the alternatives is more efficient — has the smallest cost. In this class we'll just give a heads-up on the cost models used, and we'll talk more about this approach next time.

Obviously, in this approach you need a cost model to estimate the cost of a query plan — otherwise you cannot compare different alternatives. The cost model is a way to estimate, in general terms, the cost of a particular query plan given a specific state of the database system. And I should say that almost all the time, this cost estimate is only meaningful internally to one database system. That means if you take a specific system — say, Postgres — and you have a few different query plans for a query, you can compare the costs of those plans using the Postgres cost model. But it doesn't make sense to take a plan for a query in Postgres with a cost estimated by the Postgres cost model, and a plan from a different system — for example, MySQL — with its own cost estimate, and compare the two numbers. Different systems use different units, different assumptions, different definitions in their cost models, and these vary a lot from system to system, so comparing cost estimates across systems just doesn't make much sense.

Another thing about the cost model: it is independent of the plan enumeration, or search, step that we'll talk about next class. What this means is that within a single system there can be different alternative cost models. There can be a simple cost model, easy to apply and very efficient to compute, and there can be complicated cost models — maybe more accurate, but taking more overhead to
invoke. Different cost models have different tradeoffs, but they are all independent of the search mechanism: you can pretty much plug any of the cost models into the search mechanism, and the search mechanism typically just invokes the cost model as a black box.

There are different choices for this cost model — people have used different units and approaches for cost estimation. At a high level, we can group them into three categories. The first is called physical cost. This includes the physical resource consumption the system will incur to execute the query: CPU cost, how many I/Os you will perform, and even more fine-grained things like how many cache misses or how many prefetches. As you can probably tell, this depends a lot on the hardware — on the speed of the CPU, how big the CPU caches are, how fast your disk is at sequential versus random I/O, et cetera. As a result, this kind of physical cost estimation is typically only seen in commercial, enterprise database systems — I know SQL Server can do this, and Oracle might be able to as well, I forget. For most open-source database systems like Postgres or MySQL, you are not going to see this kind of physical cost estimation, because it is much more involved and hardware-specific. That said, there is one piece of physical cost that most open-source systems do consider: distinguishing the cost of random versus sequential I/O. For a disk-based system that matters a lot — random and sequential I/O obviously have very different overheads — so open-source systems typically distinguish between those, but the more fine-grained factors you typically only see in commercial systems.

The second type of cost a system can define is called logical cost. As the name suggests, here you only consider the logical intermediate results that your query plan computes. It's a little similar to our definition of relational algebra equivalence: logical cost only looks at the sizes of the input and output of each operator during execution. Say you first have a sequential scan that looks at all the tuples — the logical cost is just the size of the table. Then after a filter, if you filtered out 90 percent of the tuples, the logical cost of what you feed into the next operator is just the remaining 10 percent of the tuples. So this is just the result sizes flowing between operators — inputs and outputs.

And lastly, there is a third type of cost called algorithmic cost — again, fairly straightforward, as the name suggests. Beyond the sizes of the input and output, you factor in the complexity of each operator's algorithm. Say one operator is a hash join whose input size is a thousand tuples and output size is a hundred; the algorithmic cost plugs the input size into the complexity of the hash join operation. Since hash join is roughly linear in its input, the cost in this case is just on the order of a thousand — but for some other operations, the complexity is going to be higher than linear. The logical cost and the algorithmic cost are what open-source systems mostly use.
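To make those last two categories concrete, here is a toy worked example in the spirit of the lecture — the numbers are mine, not from the slides:

```latex
% Scan of table R with |R| = 1000 tuples, then a filter with
% selectivity 0.1, then a hash join against S with |S| = 500:
\text{logical cost:}\quad |R| = 1000,\qquad
  |\sigma(R)| = 0.1 \cdot 1000 = 100,\qquad |S| = 500
% Algorithmic cost weights each input by the operator's complexity;
% a hash join is linear in both inputs (build + probe):
\text{algorithmic cost of the join} \;\approx\; c \cdot \big(|\sigma(R)| + |S|\big)
  \;=\; c \cdot 600
% whereas a sort-based operator would contribute O(n \log n) instead.
```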
Okay, so now let me emphasize the cost in disk-based systems, since that's the kind of system we primarily focus on in this lecture and in this course. For most disk-based systems — MySQL, Postgres, et cetera — the number of disk accesses usually dominates, or really always dominates, the execution time of a query: compared to the time to go to disk and fetch a result (or, in some cases, write results back), the CPU cost is just negligible. So for a disk-based system, the cost model primarily considers the number of disk accesses, and like I mentioned earlier, in this case you obviously have to distinguish between sequential and random I/O, otherwise your cost will be off by a lot.

Another nice thing about a database system is that, since we are costing disk accesses, we need to know how the system will access the disk — and the lucky thing is that a database system has control over its own disk access. Instead of using the operating system's virtual memory and page cache, we have our own buffer pool: we know what the replacement policy is, we know under what policy we pin and unpin pages, et cetera. So we can have a better understanding of the system's disk access, and a better estimate of how many pages we need to read or write, because we control our own buffer pool manager.

Oh, another thing I forgot to mention: for most systems, when we develop the cost model, we only consider a query executing alone. In most cases we do not consider the contention that arises when multiple queries execute concurrently — lock contention, competing for disk I/O, et cetera. This isn't because it's more accurate — obviously it's less accurate — it's just a simplifying assumption we make so the cost model stays simpler and the search process stays easier and more efficient. Most systems consider only a single query, but as I'll show with one example, some commercial systems do take certain aspects of concurrent queries into account — not all aspects, but a few — which can make the estimation more accurate, at the price of being much, much more complex.

Let me give you two examples before we wrap up: one open-source system, and then one commercial system. The open-source system is Postgres. Like I mentioned, for open-source systems the cost model is simpler — not as complicated and involved as in a commercial system — but I still consider the Postgres cost model very good; it's kind of a textbook cost model for a disk-based system.
So essentially, what the Postgres cost model does is use certain magic numbers to distinguish the cost of sequential versus random operations, as well as operations on disk versus in memory. First of all, the Postgres cost model assumes that the system does not have a large amount of memory, so that most operations will be performed on disk. Of course there is still a buffer pool, so some operations could still happen in memory, but that depends on the size of the buffer pool, your hit ratio, etc.

To distinguish memory versus disk, and sequential versus random, Postgres just uses these magic numbers. For processing a tuple in memory, Postgres specifies that processing is 400 times faster than reading a tuple from disk. That is just a magic number; there is no specific derivation behind it. It was determined by the Postgres developers based on their experience with typical datasets and the Postgres query optimizer. Similarly, for sequential versus random I/O, there is another magic number that says sequential I/O is four times faster than random I/O. So if you set the cost of a sequential I/O to one, the cost of a random I/O is four. That is essentially how those cases are distinguished. Then you multiply the magic number by the logical cost I discussed earlier, which is the input/output size, and multiply that again by the complexity of the operator, and that is essentially your cost estimate. Does that make sense?

Okay, now let's look a little more specifically at the Postgres documentation to see how they explain these values, these magic numbers. You can read it yourself, but essentially what they are saying is that they could not find a principled way to determine these numbers. The developers just picked values that seemed appropriate to them. And from version to version, these values are generally pretty stable; I have not seen them change values like 400 or 4 drastically across releases, probably because that keeps Postgres's performance more predictable. Over time they may tweak them gradually, especially if some very new hardware is released that significantly changes the relative speed of random I/O, sequential I/O, memory access, etc. But generally, these numbers are stable.
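For reference, Postgres exposes these magic numbers as planner parameters, and the defaults line up with the ratios above: seq_page_cost = 1.0, random_page_cost = 4.0 (the 4x ratio), cpu_tuple_cost = 0.01, and cpu_operator_cost = 0.0025 (and 1.0 / 0.0025 = 400, which is plausibly where the 400x disk-versus-CPU figure comes from). The Postgres documentation for EXPLAIN describes a sequential scan's estimate as pages times seq_page_cost plus tuples times the per-tuple CPU charges; here is that arithmetic as a sketch, not the actual planner code:

```python
# Postgres's default planner parameters (these are real settings you can
# inspect with SHOW or adjust with SET in a session):
SEQ_PAGE_COST = 1.0         # sequentially fetched page
RANDOM_PAGE_COST = 4.0      # randomly fetched page: 4x a sequential one
CPU_TUPLE_COST = 0.01       # processing one tuple
CPU_OPERATOR_COST = 0.0025  # evaluating one operator or function

def seq_scan_cost(rel_pages: int, rel_tuples: int, num_quals: int = 0) -> float:
    """Postgres-style estimate for scanning a table and applying predicates."""
    io_cost = rel_pages * SEQ_PAGE_COST
    cpu_cost = rel_tuples * (CPU_TUPLE_COST + num_quals * CPU_OPERATOR_COST)
    return io_cost + cpu_cost

# A 10,000-page table with 1,000,000 rows and one WHERE predicate:
# 10,000 * 1.0 + 1,000,000 * (0.01 + 0.0025) = 22,500
print(seq_scan_cost(10_000, 1_000_000, num_quals=1))  # 22500.0
```

If you want to see the live values, `SHOW seq_page_cost;` and friends print the current settings, and administrators do tweak them; for example, lowering random_page_cost on SSDs, where random reads cost much closer to sequential ones.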
Okay, so lastly, just to give you another example: the IBM DB2 cost model. For this one I don't have source-code access, so I don't know the exact formulas, but from a talk by an engineer inside IBM we know that this cost model considers many more aspects than Postgres does. First, it considers the properties of the database and the columns themselves: what the schema looks like, whether a column is nullable or not, whether the schema has constraints. It also uses various statistics about each column in your database, like the number of rows, the distribution of the data, etc.

It also considers the hardware environment. When the system starts, it actually runs some micro-benchmarks to measure the properties of the hardware: how fast the CPU is, how fast random I/O is, and how fast sequential I/O is. It tests the system and then assigns the unit cost of each operation based on the micro-benchmarks it ran. For communication across the network, it also considers network bandwidth, network latency, etc.

And lastly, it even considers concurrent queries. Not in every respect, but it will at least ask: how many concurrent users do I have, and what isolation level am I using? Under this isolation level, with this many users, is it likely that there will be a lot of lock contention on the data that could significantly slow queries down, and do I need to account for that? All of this is much more complex than the Postgres cost model we just talked about.

All right, just to wrap today's lecture up. We primarily talked about the static rules and heuristics that we can use to optimize a query plan without even looking at the contents of the data: the data distribution, the size or number of rows, etc. These rules are typically cheap to apply. We also gave a bit of a heads-up on the cost models we are going to use, so that we can perform a more advanced enumeration of different equivalent relational algebra query plans, compare the costs of these alternatives, and pick the better plan.

So next class we will go a little deeper into the statistics used in the cost model, as well as the plan enumeration mechanisms for finding those better alternatives. All right, that's all. If you have any questions you can ask them now; otherwise, I look forward to seeing you next class.