So, good morning everyone and welcome back to this course. Today's topic is query optimization. There is a lot of material in this chapter, but I am not going to have time to cover all the slides, because we have had other interesting questions. So what I am going to do is cover maybe the first half of the slides and then give you a very quick overview of what else is there in this chapter.

The basic issue in query optimization is, well, there are two issues. The first issue is: given a query, how do you execute it? What are the options for executing it? The options are actually at two levels. One is logically equivalent expressions at the relational algebra level. The second is options at the algorithmic level: should you use hash join, merge join, or indexed nested loop join? Both of these put together give a wide range of options for executing a single given query. The second issue is: among all these options, how do you figure out which is the cheapest way? Why do I want the cheapest way? It turns out that if you choose the wrong option, you may take an enormous amount of time, whereas the cheapest option may be extremely fast. The difference can be many, many orders of magnitude. For example, if you did a really dumb thing and executed an SQL query by taking the cross product of the relations in the from clause and then doing the selection, and the relations have millions of tuples, you are going to perform a minimum of a million times a million operations, which even on the fastest computer will take ages to finish; that is a completely ridiculous way of doing it. In contrast, if you do it the right way, you may finish in a matter of seconds even on a million records; a few seconds is quite normal with good machines and good database systems. So the question is how do you estimate the cost, and then, from among the alternatives, how do you efficiently find the best one, or something close to the best one, and pick it within a very short period of time. That is the challenge for query optimization.

So we split the area into several parts. One is transformations. The second is catalog information for cost estimation, including statistical information; then how to do the cost estimation; then cost-based optimization using dynamic programming; and finally a little bit about maintenance of materialized views. Most probably I will not get to this last part at all today, but you can read it later.

So here is an example of a single query. What is this query doing? It is joining teaches and course, then joining that with instructor, and then selecting department name equal to Music. This query finds who is teaching which course and joins that with instructor. Since these are all natural joins, this query is actually going to find instructors in the Music department who teach courses in the Music department. That is what this query does. Now I can take the selection on department name equal to Music and push it down into instructor. So instead of first joining and then eliminating those which are not in Music, I can do this upfront and select only those instructors who are in Music, and meanwhile join all of teaches and course out here, and then join that with the instructors from the Music department. You can see that the right hand side is most probably going to be better than the left hand side, because we have eliminated irrelevant instructors early.
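Written out in relational algebra, the pushdown just described looks like this (a sketch; I am assuming the textbook's attribute name dept_name for the department attribute):

\[
\sigma_{\mathit{dept\_name}=\text{'Music'}}\bigl((\mathit{teaches} \bowtie \mathit{course}) \bowtie \mathit{instructor}\bigr)
\;\equiv\;
(\mathit{teaches} \bowtie \mathit{course}) \bowtie \sigma_{\mathit{dept\_name}=\text{'Music'}}(\mathit{instructor})
\]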
But is the right hand side the best possible plan? As a human, you can notice that we are actually taking courses from departments other than Music, joining them with teaches, propagating them, and only finally filtering them out at the next join. So as a human you would say that is probably a bad idea: why don't we filter out, ahead of time, the courses which are not in the Music department as well? So we can filter out instructors who are not in Music, courses which are not in Music, and then do the joins. This part, which says do selections as early as possible, is an example of a heuristic which humans would use to optimize a query, and optimizers also use such heuristics. Why postpone a selection if you can do it early? You will definitely reduce the sizes of intermediate results.

But there is another part which is a lot harder for humans to do. Should we take a plan which first joins teaches with course and then joins that with instructor, which is what we have on this side even with the selections, or should we take a different plan which joins instructor with teaches and then joins with course, or should we join instructor with course and then with teaches? The last one is actually a bad idea because instructor and course do not have any attribute in common; if you join them, it is really a cross product. You can do it, but it is going to be very inefficient. But there are two other choices here. Now that does not sound like a lot, but the number of choices actually grows exponentially with the number of relations, and it is very hard for a human to go over all the alternatives and pick the best one. It is, however, relatively easy for a query optimizer to go over an exponential number of choices. Note that typical queries have 6 or 8 relations; 10 is already on the high side. So 2 to the power 10 is about 1000, and even if the base is 3, because the search space is expanded, 3 to the power 10 is bigger but still certainly doable within a few milliseconds or a second on any modern computer. So an optimizer can go over these alternatives much better than a human can.

Now, the actual plan which is executed, an evaluation plan, requires both an algebraic expression, which is the joins, selects, projects and aggregates, and also which algorithm to use for each operation. So what we are actually going to evaluate at the end is what is called an evaluation plan, also sometimes referred to as an annotated relational algebra expression. Here is an example; this defines a fairly complete evaluation plan. The part over here says take course, project it on course_id and title, and then join it with teaches using hash join. There is one more annotation here which says pipeline it. What does pipeline mean? This is actually some material from yesterday which I did not get to cover; it is at the end of the query processing section and I will let you read it offline for lack of time. The idea of pipelining is that when you get a tuple from the projection over here, you do not write it to disk. Instead you pass it directly up to the next operator. This cannot always be done. Sometimes you have to write something to disk and then read it back in later, because the next operator will not be running at the same time.
It will be running afterwards. But your query plan can in many cases be constructed such that two operators, in this case the join and the project, are actually running concurrently. If you have one CPU then only one will run at a time, but the idea is that the hash join can tell the projection, give me a tuple now, and the projection will create a tuple and give it back to the hash join. In fact here there is also a sort, and the sort can tell the hash join, give me a tuple now, and the hash join will in turn get tuples from teaches and course, do the partitioning, take the first partition, generate a tuple and pass it back up to the sort, and the sort will keep doing this: give me one more tuple, give me one more tuple, and the hash join will generate further tuples one by one.

So this is basically an iterator; let me write out the term here on the whiteboard. This is the iterator model, where each operator provides a few functions. There is an open function, just like a file open, with which the operator initializes itself. What are the operators? Joins, for example hash join, merge join, and so on; all the operators provide all these functions. Then there is a next, which says give me the next tuple of the result, and finally there is a close. You will notice that this is very similar to how you read data from a file: you open the file, read, read, read, maybe a line at a time, and then you close the file. Similarly, you open an expression in the tree, read data from it one tuple after another by calling next, and then close it when you are done with it.

This model is very nicely suited to pipelining, that is, the operators can be executing concurrently. If there are multiple cores they can actually run in parallel, but usually what happens is that one operator runs at a time: there is a function call on one operator, let us say the join, and the join in turn does a function call which invokes the project operator. The project operator generates one tuple and returns it; when the join calls it again for one more tuple, it will return the next tuple, and so on. So obviously each operator has some notion of state: what was the last thing I returned, and what should I do next if I am called again to return the next tuple. All of this is part of the iterator model of pipelined evaluation. It turns out this is a very basic model and every database system supports it, simply because writing out results to disk in between is very expensive.

The alternative to pipelining is what is called materialized evaluation. In materialized evaluation each operator, in this case the project, will generate its entire output and write it to a file, a temporary place, and store it. Now let us say this other input was a selection; the selection will in turn generate its entire output and write it to a file. Then the join operator will take these two temporary relations, join them, and write its output to a file, a temporary relation basically, and each operator keeps doing this going from the bottom of the tree upwards. That is called materialized evaluation, and it is more expensive because of the extra reads and writes, so wherever possible a database system will do pipelining without writing out results temporarily. This is not always possible.
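As a minimal sketch of the open/next/close iterator model just described (the operator classes and the little in-memory relation below are made up purely for illustration, not taken from any particular system):

```python
# Minimal sketch of the iterator (open/next/close) model of pipelined evaluation.
class TableScan:
    def __init__(self, rows):
        self.rows = rows
    def open(self):
        self.pos = 0                     # initialize, like opening a file
    def next(self):                      # return one tuple, or None when exhausted
        if self.pos < len(self.rows):
            row = self.rows[self.pos]
            self.pos += 1
            return row
        return None
    def close(self):
        pass

class Select:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self):
        self.child.open()
    def next(self):
        row = self.child.next()
        while row is not None and not self.predicate(row):
            row = self.child.next()      # keep pulling until a tuple qualifies
        return row
    def close(self):
        self.child.close()

# The parent pulls tuples one at a time; nothing is written to disk in between.
plan = Select(TableScan([("Mozart", "Music"), ("Einstein", "Physics")]),
              lambda t: t[1] == "Music")
plan.open()
row = plan.next()
while row is not None:
    print(row)
    row = plan.next()
plan.close()
```

Each call to next propagates down the tree and pulls up exactly one tuple, which is what lets the operators run interleaved without materializing intermediate results.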
So the query plan is annotated with edges saying pipelined or not; if an edge does not say pipelined, that means generate the result, store it, and then the next operator will take it. Coming back here, on this side there is a selection with department name equal to Music, and it is annotated with use index 1. So we are telling it to use this particular index to fetch instructor records with department name equal to Music. The next thing is that it is pipelined into the sort operation. Now the sort operation is a bit funny: the sort operation cannot output any of its results until it has consumed all of its input. So the sort operation will have to completely evaluate its input before it returns even one result upwards. You can think of the sort as a kind of barrier: it completely finishes evaluating its input, more or less (there are some extensions), and only then starts outputting results. The output of the sort is sent to the merge join, and the output of the sort on the other side is also sent to the merge join; the merge join gets those tuples in a pipelined fashion and outputs the join result. Finally that is sent up to the last project, and if this were a select distinct in SQL we would have to remove duplicates, so this project is annotated with a note saying sort to remove duplicates: it has to sort its input to remove duplicates before outputting it. So that is a complete execution plan.

The bullet at the bottom of this slide says find out how to view query execution plans in your favorite database. In fact, your lab for today is that you are going to look at query execution plans in PostgreSQL. How do you see query plans in PostgreSQL? It is actually very easy: whatever query you want to execute, you simply prefix it with the word explain. So explain select star from something, explain insert into something; whatever the SQL query is, put an explain in front and PostgreSQL will show you a plan. I will let you try this out and see what plans you get. Other databases have other syntax; we will not get into it here, but you can read about it if you want.

So, as I was hinting at the beginning about this huge difference in plan cost, cost-based optimization basically has three steps. First, it generates logically equivalent expressions using equivalence rules. Second, it annotates the resulting expressions to get alternative query plans; these are fully annotated with the exact operations, hash join, index lookup, whatever. Third, it chooses the cheapest plan based on the estimated cost, and the estimation, as I said, is based on statistics such as the number of tuples, the number of distinct values for an attribute, and so on. Then there is an issue: for the given relations you may have computed statistics such as the number of tuples and the number of distinct values, but if you have a complex expression, to find the cost of each operation I need to know statistics about its input, and its input is itself an expression. So how do I know the statistics of an expression? So the second part is statistics estimation for intermediate results; the statistics are just the same ones as above, usually the number of tuples, the number of distinct values, and often also histograms. Once you have these statistics, we can estimate the cost of various algorithms. We already saw yesterday how to estimate the cost of most of the algorithms; those typically require the number of tuples in the input, the sizes of the tuples, the number of blocks that the relations occupy, and such like. So the first step is: how do you generate equivalent expressions?
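Before we get to equivalence rules, here is a minimal sketch of the lab exercise just mentioned, run from Python against PostgreSQL; it assumes you have the textbook's university schema loaded locally, and the connection details below are hypothetical:

```python
# Minimal sketch: view a PostgreSQL query plan by prefixing a query with EXPLAIN.
import psycopg2

# Connection parameters are an assumption; adjust them for your own setup.
conn = psycopg2.connect(dbname="university", user="postgres")
cur = conn.cursor()

cur.execute("""
    EXPLAIN
    SELECT name
    FROM instructor NATURAL JOIN teaches NATURAL JOIN course
    WHERE dept_name = 'Music'
""")
for (line,) in cur.fetchall():   # each row of EXPLAIN output is one line of the plan
    print(line)

cur.close()
conn.close()
```

With that aside done, let us come back to the question of how to generate equivalent expressions.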
So what do we mean by equivalent, first of all? Two relational algebra expressions are said to be equivalent if they generate the same set of tuples on every legal database instance. What do we mean by a legal database instance? One which satisfies all the integrity constraints specified on the database. You could have two expressions which return different results if a primary key constraint is violated, but we do not care: as long as the two return the same result when the primary key constraint holds, we are fine, because the primary key constraint is enforced anyway. Also, what do we mean by the same result? First of all, the order in which tuples are output is irrelevant. If you need an order at the end of your query, there is an order by clause which will enforce that order. So we treat expressions as equivalent without caring about any ordering of the result.

Now, this notion of the same set of tuples: in SQL it is actually not just a set, it is a multiset. What is the difference? In a set each tuple occurs only once; in a multiset a tuple can occur twice. So take one multiset where a particular tuple occurs twice and another multiset which is identical except that this one tuple occurs three times; these two multisets are not the same. Two multisets are the same if, first, they have the same set of tuples, and second, if a tuple occurs n times in one it occurs exactly n times in the other as well. That is when two multisets are equivalent. So what we really need in the context of SQL is multiset equivalence, using the multiset version of the relational algebra.

Now, an equivalence rule says that expressions of two forms are equivalent. Let us see an example; take the very first one. This one says that if I had a selection with two conditions, for example a selection which said age equal to 5 and city equal to Mumbai, then I can split it into a sequence of two selections. So I can have a select operation which selects city equal to Mumbai, and on the result of that I apply the selection age equal to 5, and that is my final result; the two are going to be exactly equivalent. That is an example of an equivalence rule. Coming back here, the bottom line says we can replace an expression of the first form by the second, or vice versa. This is an important issue with equivalence rules: we are not necessarily saying that doing it one way is better than the other; either may be the cheapest. In fact the costs may be the same, but doing this transformation may enable some other transformation subsequently which reduces the cost. It could be even more tricky: it may be worth going from one form to the other even though the cost is higher, because subsequently we can do more transformations and come up with a cheaper final result. So the optimizer will usually blindly apply these equivalences: if you see the first form, add the second as an alternative; if you see the second form, add the first as an alternative. You collect a number of alternatives; that is the idea.

The second equivalence rule is again fairly simple. It says that selection operations are commutative. What does this mean? In the example I said age equal to 5 and city equal to Mumbai; I can do it as a sequence which first selects city equal to Mumbai and then applies age equal to 5, or I can reverse it.
I can first do age equal to 5, get the result, and on that apply city equal to Mumbai. Whichever order I do it in, the result is the same. So the select operation is commutative; that is what this rule says: you can flip the selections. Then there are other rules, such as this one about projection, which I am going to skip; let me move on to the more interesting rules.

Rule 4a says that a selection on a Cartesian product is really the same as a join. So here, a selection on E1 cross product E2 is the same as E1 join E2 with the join condition theta, which came from the selection. You can see that this is a very important rule: the original SQL query really has a cross product of all the relations in the from clause, and the idea here is that if I have a selection on a cross product, I can push it in and turn it into a join condition. So I do not need to evaluate the cross product first and then apply the selection; I can evaluate the join with this condition instead, and that is a lot more efficient. So 4a is a very important rule for any optimizer. 4b is another one which says that if I have a selection on top of an existing join, I can take the selection condition and push it into the join condition and apply it there. There are many more such rules.

The next two rules are also very important. The first says that join operations are commutative. What does this mean? It says E1 join-theta E2 is the same as E2 join-theta E1. Well, there is a slight catch: if you do this in SQL, the order in which the columns appear gets flipped, but that is a minor issue; we can deal with it very easily by reordering the columns, so we will treat them as equivalent. So, first of all, we can flip the left and right inputs of any join operation. The second rule says that natural join operations are associative. What does associative mean? If I first join E1 and E2, take the result and join it with E3 using natural join, this rule says that is entirely equivalent to first joining E2 and E3 and then joining with E1. So the order in which you do the joins, for natural joins, is irrelevant.

In fact, if you take these two rules together, let me use the whiteboard to explain what they tell us. Commutativity plus associativity together can be used to show the following; it holds for natural joins and for other joins, but to keep life simple let me stick to natural joins for the moment. Suppose I have an expression E1 join E2 join E3 join E4, and to make it unambiguous I have parentheses like this: first I join E2 and E3, then I join with E1, and then I do the join with E4. With these two rules we can infer that we can actually do it in any order. This expression corresponds to a tree where this is E4, and then E1, E2; each internal node in this tree is a join. So this is called a join tree. What these two rules say is that you can take any tree you want; in this case there are four relations being joined, so take any binary tree which has four leaves. What are examples of binary trees with four leaves? Here is one example; here is another example, which also has four leaves: I am not drawing the leaves, but the tips of these lines are the leaves. So here are two, and there are many more such trees.
Now I can take any tree and put any of these four relations, E1 through E4, at any of its leaves: I can make this E1, E2, E3, E4 (it may not be very clear because there is not much space here, but that is one particular order), or I can take any other order, say E1, E3, E4, E2. So I can put any relation at any leaf of any tree, and what we can show is that the final result is going to be identical, except for column ordering, which we can easily rearrange later.

So the moral of this theory is that, given any natural join expression, you can pick any one of these join orders. How many such join orders are there? It turns out there are a lot: if you consider every possible tree like this, that number itself is very large, it is exponential, in fact factorial, and then the number of ways in which you can assign relations to the leaves is also factorial. So the overall number is ridiculously large; we cannot actually generate all of them, it would take forever, so it is not practical to generate all the alternatives and then pick the best. That is the first moral of this story. What you need are clever algorithms which will let you find the best join order efficiently, and it turns out join order optimization is absolutely critical: a wrong join order can be very, very slow, and a good join order can make the query run much faster. So database optimizers have put a lot of effort into join order optimization: given a set of relations which are joined, how do you find the best way to join them? Now, in the join orders I just described, each node is a join, but you actually also have a choice of hash join, merge join, nested loop join and so on; how do you make that choice? I have to take all of that into account and get the best join order with the join algorithm specified. That is a very important task for an optimizer.

The first optimizer which actually solved this problem was the IBM System R optimizer, and the paper describing the System R optimizer is considered one of the classic papers in the database area, in fact in computer science in general, because before that people did not know how to do this effectively. These people showed how to do it, and after that SQL systems could take off, because now there was a reasonably efficient way of finding the best join order, which is something even a human would have trouble doing, but databases could now do quite effectively. So that is the history.

Coming back, these two rules basically let us show that the join order can be anything and the result is the same. That was for natural joins; in fact, the same kind of rules hold for joins which have a join condition. What is a theta join? It is a join with a condition; for example, R join S where R.A equal to S.A, and that join condition R.A = S.A is what I refer to by theta here. The theta refers to any condition used to relate attributes of the left input to attributes of the right input. So what this rule says is that if I have a join like this with various conditions, I can rewrite it using associativity: here I am joining E1 and E2 first and then with E3, and on the other side I am joining E2 and E3 first and then with E1. The only issue is which conditions apply where: the problem is that of the conditions here, theta 2 and theta 3, some of them use attributes from E1, some may use attributes from E2 and E3, and some may use attributes from both.
What we are saying here is that theta 2 involves only attributes from E2 and E3, so that part can be used as the join condition for E2 and E3, and theta 3 can be applied later because it involves attributes from E1. That is an example of associativity with join conditions; if you do not fully follow all the details of the thetas, that is fine, as long as you understood the natural join case that intuition is good enough.

We can depict these rules pictorially. This one says a tree like this is equivalent to a tree like this where E1 and E2 have been flipped. This one says a tree where E1 and E2 are joined first and then with E3 is equivalent to joining E2 and E3 first and then joining with E1. And the next rule says that if I have a selection on top of a join, and that selection involves only attributes from one of the relations, E1 in this case, then I can apply the selection ahead of the join; I do not have to wait for the join. In fact this should be familiar: the first example we saw today had department name equal to Music on top of a join, and below it was instructor, which has the department attribute, so we could push the selection down, get only the instructors of the Music department, and then do the join. As I said, this is a very important optimization. So that is rule 7a: if there is a selection above a join and it involves only attributes of one of the relations, push it down and apply it directly on that relation, then do the join. We saw it pictorially; it is shown textually here. There are variants of these rules; for lack of time I will not cover them here. 7b is a small variant which you can read offline.

There are other rules in the book; we have not even shown all of them here. They include pushing projections through joins, that is, throwing away unnecessary attributes early on instead of carrying them all the way through an evaluation plan; that is one class. Then there is another class of equivalences involving set operators: associativity, commutativity and others for set operations, and similarly how selections and projections can be moved around with set operations. We saw how a selection on a join can go below the join; similarly a selection on a union can be done below the union: run the selection on each input first, then do the union. Those kinds of rules are also there. So there are many rules; we are not going to list all of them here.

Each rule does one single step; overall, in the context of optimization, we have to apply multiple rules. Here is an example which we saw before, which does multiple transformations. So, a selection on a join: what has happened here? Multiple things. First of all, if you see here, teaches was joined with course and then with instructor; on this side, instructor is joined with teaches and then with course. How could we get this? We could have got it with associativity; remember, associativity twisted the tree around. So if we apply associativity here, we get instructor join teaches, and then join with course. That is one rule which is applied, but it does not give us the result yet; it only gets us the desired join order. In fact, to get the right join order we may have to apply associativity and commutativity multiple times. Then the selection which was here has been pushed down.
In fact, if you note here, if you had this with the selection above, the selection would first be pushed to the top of the inner join; then you realize that it can be pushed through that join as well. So it is pushed through one join, then further through the next join, until this selection lands directly on instructor. Similarly, year equal to 2009 applies to teaches, so it will go down here and then down here until it lands there. The point is not that the system magically knew that this was the best plan; the point is that by applying these transformations you can get many, many expressions. This was one example; we cannot list all of them here, it would take forever. What we have shown is a sequence of transformations which lands us at a plan which we believe is good. There are many more transformations which will land at many other plans, some of which will be good, some bad, some atrocious. We need to select from among all these alternatives based on cost. That is an important part of optimization.

I have been talking for a long time now, so let us take a short quiz break, which is the following. Here is an expression: select r.a equal to 5 on r join s, where the schemas are r(A, B) and s(B, C). So the join condition is r.b equal to s.b; it is a natural join, remember. This expression is equivalent to which of these four alternatives? The first one is: select r.a equal to 5 on r, do that first, then join with s. The next is: select r.a equal to 5 on s, then join with r. The third option is neither, both are wrong, they are not equivalent to the original; and the last option is that both 1 and 2 are equivalent to the original query. So those are the alternatives. Participants, please make sure you press the submit button now. Please choose your option, 1 through 4 or a through d; you have about 40 seconds. Time is almost up; please make sure you have selected an answer. And time is up.

The answer to this question is something we just saw: we can push a selection below a join provided the selection condition is on that relation, and that is exactly what the first option is. We had a select r.a equal to 5, so we can do that on r before doing the join; option 1 is what we just saw, it is correct. How about option 2? It is a select r.a equal to 5 on s; that is actually meaningless. How can you select on r.a on s, where the attribute is not even present? It is a simple syntax error, so that is wrong. Option 3 is obviously wrong because 1 is right, and 4 is wrong because 2 is not right. So the final answer is 1. Let us see how people have done; a lot of centres did not even upload, please check, all centres, but of those centres which did upload we had a good response, 138 responded, and as you can see, option 1 got the majority. So once again the audience wins, but quite a few members of the audience chose 2 or both. I guess you missed the small trick here, that it is actually a syntax error: r.a equal to 5 on s means nothing. Or perhaps you got confused by the parentheses, or the lack of parentheses: when you have a selection on s joined with r, it obviously means you first do the selection and then the join, not the other way around.
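To summarize the rules we have been using in this discussion and in the quiz, written in relational algebra (a sketch; theta denotes a selection or join condition, and the subscripts indicate which expression's attributes a condition uses):

\[
\sigma_{\theta_1 \wedge \theta_2}(E) \;\equiv\; \sigma_{\theta_1}\bigl(\sigma_{\theta_2}(E)\bigr) \;\equiv\; \sigma_{\theta_2}\bigl(\sigma_{\theta_1}(E)\bigr)
\]
\[
\sigma_{\theta}(E_1 \times E_2) \;\equiv\; E_1 \bowtie_{\theta} E_2
\qquad\qquad
E_1 \bowtie_{\theta} E_2 \;\equiv\; E_2 \bowtie_{\theta} E_1
\]
\[
(E_1 \bowtie E_2) \bowtie E_3 \;\equiv\; E_1 \bowtie (E_2 \bowtie E_3)
\qquad\qquad
\sigma_{\theta_1}(E_1 \bowtie_{\theta} E_2) \;\equiv\; \bigl(\sigma_{\theta_1}(E_1)\bigr) \bowtie_{\theta} E_2
\quad \text{(when } \theta_1 \text{ uses only attributes of } E_1\text{)}
\]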
Okay, moving ahead: when different join orders are possible, there are some heuristics which humans used to use. For example, suppose I have r1 join r2 join r3, where r1 is small but r2 and r3 are large. If I join r2 and r3 first, the result may be a very large relation, but perhaps when I join r1 and r2 the result is very small, maybe because r1 had a selection which eliminated a lot of tuples. In that case I will try to avoid joining r2 and r3, which has a very large result; r1 join r2 may be small, and joining that with r3 may also be small, so that would obviously be a better solution. That is the motivation for join order optimization, and there are examples here which are intuitive: if you have department name equal to Music on instructor, only a fraction of the instructors are going to be in Music, so it is better to join those few instructors with teaches and then join with course, because then we will not unnecessarily generate teaches-course pairs for people who are not even in the Music department. That is probably a good join order.

So how do you find a good join order? There are two approaches, which different databases follow. One approach is to use equivalence rules, as we saw them, to systematically generate expressions which are equivalent to the given expression. It turns out that doing this efficiently is actually quite difficult; people took a while to figure out how. There was a project in Wisconsin which tried to use equivalence rules, and when they implemented it they found it was very inefficient. The person who did his PhD on implementing it found, at the end of his PhD, that it was not very efficient. He was a very smart guy, and it took him some time to figure out how to do it efficiently; he had actually graduated and was a professor at that point, and he came up with a very nice algorithm to efficiently enumerate expressions and find the best one, in time which is much, much less than actually enumerating all the orders. It is a very clever piece of work called the Volcano optimizer, and the person's name is Goetz Graefe. He subsequently joined Microsoft, and his work is used in the Microsoft query optimizer; it is also used in Sybase and a few other places.

In contrast, most people went along with the old System R style of optimization, which does not use equivalence rules at all; they remain conceptual. In the optimizer there is no explicit mechanism for enumerating equivalent expressions using the equivalence rules. Instead, the System R optimizer focuses almost entirely on join order optimization. It does worry about selections, and what it says is that selections should be pushed down: do them as early as possible. If I say select department name equal to Music, I will enforce it wherever department name occurs; if it is in the instructor relation, then when I read the instructor relation I will filter out tuples which are not Music. So it always pushes selections down, and then it focuses on selecting the best join order. We do not have time to get into all the details, but at the end I will briefly sketch an algorithm which can be used to find the best join order fairly efficiently.

Before I get to that, I am going to spend a little bit of time on cost estimation, where we have to find the cost of operators and the statistics of their results. Now, this is a little more complicated. It is not very hard; you can get a fairly good estimate, but it is not precise. What do we mean by this? An estimate is an estimate.
If you look at civil projects in India, you routinely read in the papers that there was a cost overrun: somebody estimated at the beginning that it would cost 10 crores, but by the time it was finished the cost had escalated to 20 crores. So the estimate was wrong. In civil projects, the estimation errors are usually because the price of steel and cement went up, or because the project took 5 years to complete instead of 3 and the cost of labour went up; there are various reasons. In the context of databases, what can go wrong? First of all, statistics are approximate. Second, even with identical statistics, there may be cases where the data is such that some attributes are correlated.

To give you an example, suppose I have a selection condition on the instructor relation which says department equal to Computer Science and age less than 50. The same sort of query might run on another department: department equal to Mechanical Engineering and age less than 50. They look identical. If you take the statistics for age, I have some idea of the distribution of ages; if I look at the statistics for department, I have some idea of how many people are in each department. Now I have this combined query, department is CS and age less than 50, and another query, department is Mechanical and age less than 50. If I had combined statistics for department and age together, I could estimate these correctly, but it may be too expensive to compute and store that much statistics. So what most database systems do is keep an idea of what fraction of people are in the Computer Science department and what fraction are in Mechanical, and similarly an idea of the distribution of ages. So the optimizer will say, well, one-tenth of the people are in CS, and among instructors two-thirds are less than 50 years old, so I will estimate that one-tenth times two-thirds, which is 2/30, or 1/15, is the fraction of instructors who satisfy department is CS and age less than 50. It will do exactly the same thing for Mechanical, because it does not know anything better: Mechanical is also one-tenth, and it thinks the age distribution is the same. But in reality the age distributions may be very different. It may be that Computer Science is a new department with a lot of young people (actually this is no longer true, but some years ago it was), whereas Mechanical is an old department which has been around for a long time, so most of the faculty there are much older. The query estimates would be identical for the two, but the actual result sizes could be very different: the same query may give 20 people in CS and 5 people in Mechanical, simply because the Mechanical people are in general older. These kinds of things cannot be helped: you have limited space and time to compute and store statistics, so some estimates are going to be wrong; they are approximate. But that is the best you can do.

So then you might ask: if you are going to find the best plan based on wrong estimates, why even bother, you are wasting your time. The answer is that it is not so bad. The estimates are sometimes wrong, sometimes quite wrong, but practical experience is that even though the statistics are approximate, what they are good at is eliminating very bad plans. Given a query, there are a number of plans which are very bad.
They are definitely going to be eliminated even if your statistics are not quite accurate. Now, among the plans which are good, it may be that, because an estimate was wrong, the optimizer selects a plan which is not quite the best plan. But practical experience is that most of the time the best plan is chosen, and even when the actual best plan is not chosen, the plan which is chosen is close enough. People generally do not mind: if the chosen plan took 10 percent more time than the best possible plan, most people would not care; even one and a half times, although that is already significant, is usually tolerable. So optimizers are basically good at finding plans which, even if not exactly optimal, are close to optimal.

Now, the number of join orders which have to be considered, if you do it very naively, is factorial. I will not get into the exact formula, but with n equal to 10, that is 10 relations, the number of different join orders is 176 billion. Enumerating so many is obviously going to take forever; you cannot really enumerate them all. So the trick is to use dynamic programming. If you have done an algorithms course, you would have seen the term dynamic programming there; the same idea is used here to reduce the cost from this factorial 176 billion to something which is usually exponential, 2 to the power n or 3 to the power n. 2 to the power 10 is about 1000, which is much, much less than 176 billion; 3 to the power 10 is not so small either, but it is still manageable.

So how does this work? The idea is the following. To find the best plan for a set S of n relations, consider all plans of the form S1 join (S minus S1), for every S1 which is a non-empty proper subset of S. So I have some set of relations; I take every subset, pull it off, and consider any one such subset S1. If I break up the plan like this, with S1 on the left and the remaining relations on the right, then the best plan for the whole set can be found by recursively computing the best plan for S1 and the best plan for S minus S1. So the first step is to set up a recursive way of enumerating all plans. The second key idea of dynamic programming is that, with this recursive setup, the same optimization sub-goal, namely find the best plan for a particular set of relations, will be invoked again and again: the same set of, say, four relations may be a sub-part of many, many different plans. So this sub-goal, find the best plan for this set of four relations, will be asked many times, and there are two key insights. The first is that it does not matter what the plan is for the rest of the relations: the best plan for these four is the only one we care about. If you have a plan for these four which is not the best plan, it cannot form part of an overall best plan. So for this sub-goal, for this set of four relations, I can find the best plan and throw away all the other plans; I do not even need to remember them. I will just record the best plan for these four relations, and from that, recursively, I can build up the overall best plan. The second key insight is that I need to compute this only once. The first time the optimizer recursively says, find the best plan for these four relations (the query has ten relations, and for some four of them it asks for the best plan), I will compute it and store it. The next time, as part of some other recursive call, the same sub-goal is invoked again, and I will say, I have already computed the best plan; I just need to retrieve it and return it.
This step is what is called memoization, and dynamic programming basically consists of remembering the best choice for a sub-problem and, if the same question is asked again, returning that answer without computing it again. If you do this, the cost comes down drastically.

So let us see the pseudocode. To find the best plan for S: if its cost is not infinity, what does that mean? It means we have already computed and stored it; initially all costs are set to infinity, so if the cost is not infinity we have stored the plan, we retrieve it and return it immediately, which is very fast. What if we have not computed it already? If S contains only one relation, then it is just a question of taking whatever selections apply to that relation, applying them, and that is the plan for S. There are a few issues here which I am going to ignore to keep life simple. So the base case, one relation, is done: just find the best way of accessing the data in that relation using whatever selections apply to it.

The recursive case is when S has more than one relation. Now we say: for each non-empty subset S1 of S which is not equal to S, recursively find the best plan for S1 and the best plan for S minus S1, that is, for the two parts, and do this for every possible S1. For a given S1, I find the two best plans recursively; note that findbestplan is the name of this procedure, so these are actually recursive calls. Having found those, what do I do? For this partition, S1 joined with the remaining relations, I find the best algorithm for joining those two: I look at the alternatives, hash join, merge join, indexed nested loop join and so on, estimate their costs and find the cheapest one. That is the algorithm I plan to use. The cost of this particular breakup using S1 is then the cost of the best plan for S1, which is P1, so P1.cost, plus the cost of P2, which is the best plan for S minus S1, plus the cost of the join algorithm A which I just chose. That is what is written here: cost = P1.cost + P2.cost + cost of A. So that is the cost of the best way of doing it if I break the relations into S1 and everything else.

Now, there are many ways of breaking the set up into different subsets S1. What I am doing in this if statement is: if cost is less than bestplan[S].cost. Remember this is a loop over every possible S1; the first time around I simply take that S1 and store its cost. After that, if the cost of the current alternative is less than the currently known best cost for the overall set S, then I replace bestplan[S], the currently known best plan for S, with the current plan I am looking at: its cost, and the plan itself, which is execute P1.plan, execute P2.plan, and join the results using algorithm A. Again there are pipelining and other issues which need to be taken into account; we will ignore those details for simplicity. And that is it; we are actually done with the algorithm. There are some further issues with sort orders and so on which I am going to ignore, but at its core this is the algorithm.

This is a dynamic programming algorithm, but the question is, what is its cost? It turns out the cost is of the order of 3 to the power n, and a lot of optimizers, including the original System R, said that 3 to the power n is still too high; remember, machines were a lot slower in those days.
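Here is a minimal Python sketch of the findbestplan procedure just described, purely for illustration: the relation sizes, the join-cost function and the result-size estimate below are made-up stand-ins, not a real cost model, and a real optimizer would also track access paths, join algorithms and sort orders.

```python
# Minimal sketch of dynamic-programming join-order search with memoization.
from itertools import combinations
from functools import lru_cache

# Hypothetical per-relation sizes (number of tuples), just for the example.
SIZES = {"instructor": 5000, "teaches": 10000, "course": 2000}

def join_cost(left_size, right_size):
    # Stand-in cost: proportional to the sizes of the two inputs.
    return left_size + right_size

@lru_cache(maxsize=None)                 # memo table: each subset is solved once
def find_best_plan(rels):
    """rels is a frozenset of relation names; returns (cost, est_size, plan)."""
    if len(rels) == 1:
        (r,) = rels
        return (0, SIZES[r], r)          # base case: just scan the single relation
    best = None
    rels_list = sorted(rels)
    # Try every way of splitting rels into S1 and rels - S1.
    for k in range(1, len(rels_list)):
        for s1 in combinations(rels_list, k):
            s1 = frozenset(s1)
            c1, n1, p1 = find_best_plan(s1)
            c2, n2, p2 = find_best_plan(rels - s1)
            cost = c1 + c2 + join_cost(n1, n2)
            if best is None or cost < best[0]:
                # The result-size estimate here is also a crude stand-in.
                best = (cost, max(n1, n2), (p1, "JOIN", p2))
    return best

print(find_best_plan(frozenset(SIZES)))
```

The lru_cache decorator plays the role of the memo table in the pseudocode: each subset of relations is solved only once, which is what brings the search down from factorial to exponential.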
So what System R did is, instead of taking every subset S1 of S, it takes only single relations, that is, subsets of size 1, and joins each with the rest. What does this correspond to? Let me show it pictorially. Pictorially, what System R does is consider a class of join trees which look like this. Here, each of these inputs is a relation, and this one at the bottom is a relation. Look at this tree: every internal node is a join; selections we are going to ignore for the moment, they are pushed down; the leaves are relations. This kind of join tree is called a left-deep join tree. What is special about it? One of the two inputs of every join is a base relation. So what this means is that System R will not consider trees which look like this: e1, e2, e3, e4, a join tree which is not left-deep; System R will not bother with these. It only looks at trees of the left-deep form, and for this class of trees it turns out that finding the best plan is cheaper: instead of considering every subset S1, I just consider single relations, which simplifies the algorithm and reduces the cost to much less. In fact the cost of join order optimization for left-deep join trees is of the order of 2 to the power n, more precisely about n times 2 to the power n, but that is still a fairly small number: for n equal to 10 it is 10 times 2 to the power 10, which is about 10,000, quite cheap for a fast computer.

So this slide showed left-deep join trees on R1, R2, up through R5, and this is a non-left-deep join tree, and this slide shows how to change the algorithm to consider only left-deep trees; it also talks about the space required. I am going to skip these details. What I just told you is that if only left-deep trees are considered, the time complexity is n times 2 to the power n. I have a slide here on sort orders which, for lack of time, I am going to skip, and then I have several slides on statistics for cost estimation, which, again for lack of time, I am going to skip. You can read these offline, but I will note that the key statistics which are stored are the number of tuples, that is a key one, and the number of blocks containing tuples of the relation; if you recall, we used both of these in estimating the cost of joins and sorting and so on. Then one important extra statistic is the number of distinct values that appear for an attribute A in relation r, which is the same as the size of the projection of r on A with duplicates eliminated. So that is an important statistic, the number of distinct values, and many optimizers also use histograms, which describe the distribution of values. For example, if you have a histogram on the age attribute of some relation person, it might look like this: so many from 1 to 5, so many from 6 to 10, so many from 11 to 15, and so forth. Histograms are very useful for estimating the sizes of selection results and join results. Again I am going to skip the details; there is a whole bunch of formulae for estimating the size of selections, the size of joins, the number of distinct values in selection results, join results, etcetera. It is all there, but most courses which teach all of this in one semester will probably not be able to get into all the details; it takes a lot of time to cover all that.
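Just to give a flavour of the kind of formula in those skipped slides (this is the standard textbook estimate, stated under a uniform-distribution assumption): if relation r has n_r tuples and V(A, r) is the number of distinct values of attribute A in r, then

\[
|\sigma_{A = v}(r)| \;\approx\; \frac{n_r}{V(A, r)},
\qquad\qquad
|r \bowtie s| \;\approx\; \frac{n_r \cdot n_s}{\max\bigl(V(A, r),\, V(A, s)\bigr)}
\]

where, in the join estimate, A is the attribute common to r and s.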
So for a typical course what I have covered today for optimization is probably what you will get time to cover in your class also.