Today we have two theory sessions: I will finish up query optimization very briefly and then move on to transactions, and we have the lab session on transactions immediately after lunch, followed by the last couple of presentations and a discussion session on logistics and other matters. Okay, so yesterday in the lab we saw examples of query plans, and we had a brief discussion on alternative plans and why the optimizer chose a certain plan based on statistics. This chapter describes some of those techniques which we briefly looked at, plus a whole lot of other material. A few people had questions about this yesterday, which I would request them to ask again today during the talk, so that the common questions people have been asking about query optimization can be cleared up. So please interrupt me at any time.

The basic issue in query optimization is that, given any SQL query, we have to evaluate it in some way, and there is a set of primitive operations which the database supports, which are basically variants of relational algebra: join, selection, projection, group by, and so on. Given an SQL query, there are many different relational algebra expressions which would give an exactly equivalent answer. And given a relational algebra expression, there is a further issue of how exactly you evaluate it. Given a join, we saw that there are multiple join algorithms; given a selection, you have multiple options: do a relation scan, use one index, use some other index, and so forth. So there are basically two levels of choices. The first level of choice, which we discuss initially, is at the logical level: given a relational algebra expression, there are many, many other logical expressions which are equivalent to it. Then we go to the physical level: instead of just saying "this is a join" we say this is a merge join or a hash join; instead of saying "selection" we say selection using an index, and so forth. Plans annotated with those evaluation algorithms are often referred to as the physical relational algebra, or the physical plan.

Here is a small example where we have a join of teaches and course, then joined with instructor, and then a selection on department name equal to Music. Note that this query actually finds the names of instructors and the titles of the courses they teach, provided both the instructor and the course are from the Music department. Why? Because department name comes from both instructor and course, the natural join makes sure that both are from the same department, and the selection then restricts it to Music. So that is the meaning of this query. Another equivalent query is on the right-hand side (both expressions are written out just below): first we select instructors in the Music department, and then we join the result with teaches joined with course. Can you suggest another equivalent query? Basically, the order in which you do the joins is not really material; they all give the same result. So we could have joined instructor with teaches and then joined that result with course. What about instructor with course? Could we have joined those two first? It is not a cross product, because instructor and course have department name in common, so the natural join is on that attribute.
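For reference, the original expression and the version with the selection pushed down onto instructor can be written as follows (a sketch in the book's relational-algebra notation):

\[
\Pi_{name,\ title}\bigl(\sigma_{dept\_name = \text{'Music'}}\,(instructor \bowtie (teaches \bowtie course))\bigr)
\;\equiv\;
\Pi_{name,\ title}\bigl(\sigma_{dept\_name = \text{'Music'}}(instructor) \bowtie (teaches \bowtie course)\bigr)
\]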
Joining instructor with course first is also fine; and even if it had been a cross product, it would still be a valid relational algebra expression, just one you would probably not use because it would be very inefficient. Okay, so in this particular query there are many choices of join order, and many choices of where to place the selection. The projection in this case has to be at the top, but even here you can in fact remove unnecessary attributes early. For example, for instructor, are there any unnecessary attributes which we can project out right at the beginning? Instructor has a salary, which is not in the output and does not participate in any join, so you can throw it out. Instructor has ID, but we cannot throw that out because it is used in the joins; instructor has name, which we cannot throw out because it is in the output; and department name is used in the selection, so the rest cannot be eliminated. In teaches, in fact, a lot can be eliminated: it has a section ID, a semester, a year; the time slot, sorry, is in the section relation. So there are several attributes here which can be projected out. Basically, there is a lot you can do: remove unnecessary attributes early, and do selections early. The second version selects on department name equal to Music early, so that you do not unnecessarily carry around instructors of other departments and do joins only to throw those tuples away at the end. So we can come up with a plan which we would think is a lot cheaper. But how do we know it is cheaper? That is based on estimates computed from statistics, which the query optimizer uses to choose amongst alternative plans, as we discussed.

This is an example of an evaluation plan for the same query. What have we done here? Actually, this plan is not quite consistent with the previous expression; I think one of the plans did not get updated in the slides. So what is it doing? It is projecting course on course ID and title, and then pipelining the result of the projection into a hash join. The hash join partitions it, and also partitions teaches, and the result of that join is passed up to a sort. As we discussed yesterday, many algorithms require a sort; in particular, here we are sorting because we are doing a merge join higher up, so we have to sort this result. The result of the hash join is passed up to the sort, the sort runs, and the result of the sort is passed up to the merge join. But it says pipelined: how can you pipeline the result of a sort? That is because the sort operator can be broken into two parts, run generation and merging. The run generation step writes sorted runs to disk; the merge step reads them back from disk, and its result can be pipelined upwards instead of being written out. Similarly, on the other side we select department name equal to Music using a particular index, which presumably is on department name, and we pipeline that into a sort, because the merge join needs sorted input. The output of that sort is likewise pipelined into the merge join, which in turn pipelines its output into the projection. Now, we are joining here on ID, but ID is removed by the projection. So how do we eliminate duplicates? The input is sorted on ID, which is actually useless for removing duplicates on (name, title), because many different IDs may have the same name and title. The title is for the course; the name is for the instructor.
So we actually have to sort on (name, title) to remove duplicates, or we could hash; but this particular projection is annotated to say: do it using a sort. So this is a complete evaluation plan. Real evaluation plans, as we saw yesterday in PostgreSQL, actually carry even more detail than this. So we will leave it there. How do you view evaluation plans? We already saw how to do it in PostgreSQL using EXPLAIN. Other systems have equivalent ways: SQL Server has SET SHOWPLAN_TEXT ON, Oracle has its own version of explain where you have to do a little more work to get the result, and DB2 has its own mechanism. So every optimizer lets you see the plans.

Now, how do you generate the space of equivalent expressions? We saw one expression, we saw another, and then we discussed a few more which are all equivalent, and some of them may be more efficient than others. So how do you systematically generate the alternative plans, then pick the cheapest, and do all this as efficiently as possible without wasting time? That is the key requirement. There are actually two answers to this question, and different systems have used one or the other. The first answer is to use equivalence rules: build a system which takes as input a set of equivalence rules in some appropriate form, with the query as the run-time input; it uses the equivalence rules to rewrite the query, generates alternatives by applying the rules, and picks the cheapest one. This is used in SQL Server and Sybase. This approach is actually fairly complex and came later: initially people did not know how to do it efficiently, so although people knew equivalence rules are important, how to use them practically was not clear early on. It was only in the mid 90s that a project called Volcano showed how to do this reasonably efficiently, and it was improved upon subsequently.

The other alternative, which does not use equivalence rules and dates back to 1979 and the paper on access path selection, is based on optimizing the join order, because the join order is the most important part of query optimization. There are other issues too, such as when to do selections, but those can be handled heuristically: why ever postpone a selection? Do it as soon as you can. That heuristic generally works, with some small tweaks, and that is basically what the original System R optimizer implemented. Subsequently, pretty much all the databases, Oracle, DB2, PostgreSQL, implement that algorithm. So we will briefly look at both these alternatives. However, what we will not do is discuss how to implement query optimization using equivalence rules; we will look at the equivalence rules themselves, but how to use them inside an optimizer is more complex than we have time for here, so we are going to skip that.

So let us start by looking at the rules; it is important to know what is going on logically. What do we mean by equivalence rules? Two relational algebra expressions are said to be equivalent if they always generate the same set of results, or in fact, for SQL, the same multiset of results. What is this multiset business? SQL allows duplicates, and when you run a join operation on inputs with duplicates, it in turn generates duplicates. The count of duplicates is defined by the SQL language: if the input has so many duplicates, the output should have so many duplicates.
The duplicate count is defined by the SQL language for each operation: for join, selection, projection, union, and so forth. So if we have two alternative expressions for evaluating an SQL query, they had better generate the same number of duplicates; otherwise the results would differ depending on which one you choose, which is not acceptable. So we really want multiset equivalence in the case of SQL. An equivalence rule says that two expressions of a given form are equivalent.

Let me go straight to the pictorial depiction of a few rules and then come back. The first rule, pictorially, concerns a join with some join condition theta; joins with a condition are referred to as theta joins. It says that a join with E1 as the left input and E2 as the right input is equivalent to flipping the order: E2 as the left and E1 as the right. That is fairly clear. The second rule is a rule for natural join, which says: if I join E1 and E2 first and then join the result with E3, that is equivalent to joining E1 with the result of joining E2 and E3. What is this rule called? Associativity; you are familiar with it from algebra in high school. The first one is commutativity. Now, I showed commutativity for theta join and associativity for natural join, but rule 6b is associativity for theta joins, and there is a similar rule for commutativity of natural join.

Here is an example of another kind of rule, which says: if I have a selection on top of a join, and the selection only involves attributes from one of the two inputs, I can do the selection before the join. In other words, I have pushed the selection through the join and applied it directly on E1 before the join. This should also be reasonably clear. If the selection involves attributes from both E1 and E2, I cannot do this, because those attributes are available together only after the join. On the other hand, if the selection is a conjunction, say of the form E1.a = 5 AND E2.b = 6, then I can push one part onto the left input and the other part onto the right, and there are equivalence rules for that too.

Let us very briefly look at some of the equivalence rules, this time in algebraic rather than pictorial form. One rule says that a selection on theta 1 and theta 2 applied to E is the same as selecting on theta 2 first and then on theta 1. The next one says that selecting on theta 2 first and then theta 1 is the same as selecting on theta 1 first and then theta 2. Now if you combine these, you can apply them to the selection-pushing rule: if theta is theta 1 and theta 2, you can break the selection into two, push one part down, leave the other selection on top, and then perhaps push the other one down as well. So you can use multiple rules together to get what you want. Then there are a few more rules involving projections and selections, and rules for pushing selections into cross products: a selection on a cross product is really a theta join. In SQL the from clause is effectively a cross product and the where clause is a selection, and the rule that turns that cross product plus selection into a join is the basic rule for that. I am not going to go into every one of these rules, for lack of time.
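Written algebraically, the rules just discussed look like this (a sketch; the numbering and exact side conditions follow the textbook only loosely):

\[
E_1 \bowtie_\theta E_2 \equiv E_2 \bowtie_\theta E_1 \qquad \text{(commutativity)}
\]
\[
(E_1 \bowtie E_2) \bowtie E_3 \equiv E_1 \bowtie (E_2 \bowtie E_3) \qquad \text{(associativity)}
\]
\[
\sigma_{\theta_1 \wedge \theta_2}(E) \equiv \sigma_{\theta_1}\bigl(\sigma_{\theta_2}(E)\bigr),
\qquad
\sigma_{\theta_1}\bigl(\sigma_{\theta_2}(E)\bigr) \equiv \sigma_{\theta_2}\bigl(\sigma_{\theta_1}(E)\bigr)
\]
\[
\sigma_{\theta_1}(E_1 \bowtie_\theta E_2) \equiv \bigl(\sigma_{\theta_1}(E_1)\bigr) \bowtie_\theta E_2
\quad \text{when } \theta_1 \text{ uses only attributes of } E_1
\]
\[
\sigma_{\theta}(E_1 \times E_2) \equiv E_1 \bowtie_\theta E_2
\]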
But there are several more rules in the book, and many more which are not in the book, including a few in the exercises and others which are not even there but are actually used. The SQL Server optimizer, I think, uses a very large number, perhaps a couple of hundred rules overall. Now the problem is that the more rules you have, the more alternative plans you generate and the more time optimization takes. So what SQL Server and other such optimizers actually do is not use all the rules from the beginning. They start with a few rules and use those to optimize a query; if they get a good, very efficient plan using those, they do not try the other rules. But if the plan is still very expensive, they try other rules, in the hope that one of them can reduce the cost further. We are not going to discuss all those rules here. This slide shows that you can apply multiple rules to the same expression to get a more efficient one, and we have seen that already, so I am going to skip it.

So we have seen the equivalence rules; we will leave them at that point and focus on the join ordering problem. If I have a join given like this, R1 join R2, joined with R3, then by associativity we know we can do it differently. Now, why would these two orders have a cost difference? Basically because the sizes of the relations may differ and the selectivity of a join may differ: the size of the output of R1 join R2 may differ from that of R2 join R3, or R1 join R3 for that matter. So suppose R1 is small; then R1 join R2 is also small, but R2 join R3 is large. If we use the right-hand side here, we will spend a lot of time generating R2 join R3 and then join it with R1, which perhaps eliminates most of the tuples and gives a small result. In that case R1 join R2 might have had a very small result, say if R1 is small and the join is on a foreign key; that can then be joined with R3, all the intermediate results stay small, and the overall join may be much cheaper. There are also other reasons why joins may be expensive or cheap, because of indices and so forth, but the join order is really the key. There is an example here of why a relation may be small, which is the earlier example with the selection department name equal to Music. Music has only a few instructors and there are many departments, so instructor restricted to department name equal to Music could be very small, whereas teaches could be very large and course could be very large. Teaches join course would also be large; it is a foreign key join, so it will be as large as teaches, and most of those tuples are then eliminated by the subsequent join. So it makes sense to first do the selection on instructor, join that with teaches, and then join that with course.

So I will leave the equivalence rules there and see how we can get these join orders. The goal is to find the best join order. Unfortunately, there are many, many join orders. How many are there? Well, if you take arbitrary join trees, what is a join tree? It is a binary tree with n leaves, where n is the number of relations being joined. The number of such tree shapes is the Catalan number for n minus 1, and multiplying by the number of ways of assigning relations to the leaves gives (2(n-1))!/(n-1)! complete join orders. I will not get into the derivation, but it is quite large: for n equal to 7, it is already over 6.6 lakh (665,280).
For n equal to 10, it is 176 billion. So we really, really do not want to evaluate all possible join trees. What do you do? It turns out dynamic programming comes to the rescue and turns a ridiculously large number into a large but more manageable one, which is still exponential: 3 to the power n or 2 to the power n, depending on which class of trees we consider. The dynamic programming algorithm can be understood as follows. To find the best join tree for a set S of n relations, we recursively find the best plans for every split of S into S1 and S minus S1. That is, we take every subset S1 which is non-empty and a strict subset, so that S minus S1 is also non-empty; for each such split we find the best plan for S1, the best plan for S minus S1, and then the best way of joining those two results. Every choice of S1 gives an alternative, and if we take the minimum cost across all choices of S1, we have found the best way of joining the set S. Think about it: to compute a join of n relations, there has to be a final join, and that final join joins two sub-results. What we are saying is: look at all possible final joins, and for each possible final join, recursively find the best plan for each input; then, among all possible final joins, pick the cheapest.

Now this is a recursive procedure, and the recursion has to have a base case, otherwise we have a problem. The base case is when the input S1 or S minus S1 is a single relation; in that case, we heuristically apply all the selections on that relation and pick the best way of evaluating that selection. Now, if you use this recursion naively, the problem is that we will repeatedly call the recursive procedure for the same subset; for each subset, it gets called many, many times. The next insight for dynamic programming is that once we have evaluated the best plan for a particular subset, we just store it; the next time somebody asks for the same thing, we look it up. Why would somebody ask for the best plan for the same subset again? Because a particular subset may be joined with many different relations or many different other subsets, so there will be many recursive calls asking for the best plan on a given subset. The first time it is called, we compute it using the recursive procedure and save it; every subsequent time, we look it up and return it.

This has two names in the literature. One way to look at it is dynamic programming. Most of you who have taken an algorithms course at some point, or read a book, will be familiar with the bottom-up version of dynamic programming, which computes things before they are needed and then uses them when needed. What we have here is entirely equivalent, except it is expressed top down: it computes something the first time it is needed, saves it, and on subsequent calls simply looks it up and returns it; this top-down form is often called memoization. So that is dynamic programming, and this slide gives a recursive algorithm for join order optimization; I am going to skip the details for lack of time. But note that it takes n times 3 to the power n if we consider all possible join trees, which is large, but much, much smaller than the count we saw before.
So, for example, for n equal to 10, what do we get? 10 times 3 to the power of 10. What is 3 to the power of 10? It works out to 59,049, so around six lakh operations in all: large, but not as large as 176 billion, though still large. Therefore, many optimizers do not consider the space of all possible join trees but restrict it to what are called left-deep join trees. What is a left-deep join tree? It is one where, for every join in the tree, the right input is a base relation. So it looks like this. In contrast, this other tree is not left-deep, because for this join the right input is itself a join result. Now, if we only consider left-deep join trees, we can modify the previous algorithm a little; there are details in the slides on how to do it. It is actually very straightforward, and the cost comes down quite sharply: from n times 3 to the power n, it becomes basically 2 to the power n. And what is 2 to the power 10? About a thousand. That is not bad at all: if we can find the best plan for a 10-way join with about a thousand operations, that is pretty good.

Now, again, any algorithms person will say: but this is exponential, and we really do not want exponential algorithms. It is true that if you have a join of 30 relations, 2 to the power 30 is very large, around a billion; we do not want to spend that much time, or space for that matter. So pretty much every query optimizer has some heuristics built in. If the number of relations is moderate, up to 15 or 16, they will use the algorithm we just described, which takes 2 to the power n time. But if it gets even larger, they use heuristics which take less time and cut down the time taken, so that you do not spend 2 to the power 30 steps, for example. We will not get into the details, but all optimizers have this, so that you cannot mess them up by giving a large join query and getting them into trouble.

Yeah? Do such large numbers of relations really occur? That is a very good question: do you ever have joins of that many relations? The answer is no, typically you do not, which is why this almost always suffices. If you look at our toy schemas, they do not even have that many relations, but you can keep joining a relation with itself, for example; like we saw in the station and track case, we could keep joining. So the first answer is that, realistically, such queries are very, very rare; hardly ever does anybody ask a query with more than six or seven joins. If you have views, this may go up: the views get expanded, and it may go up to maybe 10 or 15 joins. These do exist in the real world, but 30-way joins are rare. They can happen; people do write humongous SQL queries, and in industry there are horribly complex queries in many cases. What you do not want is for the database to die when somebody asks such a query; the database has to protect itself. So they are very rare, but occasionally they do arise, and then you may have an evil person who takes pleasure in showing that the database dies by creating a very large, meaningless join query.

Okay, so this slide talks about the space and time complexity of optimization of the join order, and it also talks about how to handle left-deep join trees. Sorry, I said the time complexity is 2 to the power n; it is actually n times 2 to the power n, which is still not too bad: 10 times 2 to the power 10 is 10,000, not 1,000, but it is still pretty small.
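Here is a minimal sketch of the memoized (top-down dynamic programming) join-order search just described. Everything in it, the relation names, the cardinalities, and the toy cost and size formulas, is invented purely for illustration; a real optimizer uses catalog statistics, a proper cost model, and per-sort-order bookkeeping.

```python
# Minimal sketch of memoized join-order enumeration over all bushy trees.
# The numbers and cost/size formulas are made up for illustration only.

from functools import lru_cache

# Hypothetical per-relation cardinalities.
CARD = {"instructor": 5_000, "teaches": 100_000, "course": 20_000}

def join_cost(left_card, right_card):
    # Toy cost model for joining two intermediate results.
    return left_card * right_card // 1000 + left_card + right_card

def result_card(left_card, right_card):
    # Toy cardinality estimate for the join result.
    return max(left_card, right_card)

@lru_cache(maxsize=None)
def best_plan(rels: frozenset):
    """Return (cost, cardinality, plan) for the cheapest way to join `rels`.
    The lru_cache memoization plays the role of the dynamic-programming table:
    each subset of relations is optimized only once."""
    if len(rels) == 1:
        (r,) = rels
        return (0, CARD[r], r)          # base case: a single relation
    best = None
    rel_list = sorted(rels)
    # Enumerate every non-empty proper subset S1; its complement is rels - S1.
    for mask in range(1, 2 ** len(rel_list) - 1):
        s1 = frozenset(r for i, r in enumerate(rel_list) if mask & (1 << i))
        s2 = rels - s1
        c1, n1, p1 = best_plan(s1)
        c2, n2, p2 = best_plan(s2)
        total = c1 + c2 + join_cost(n1, n2)
        if best is None or total < best[0]:
            best = (total, result_card(n1, n2), (p1, "JOIN", p2))
    return best

cost, card, plan = best_plan(frozenset(CARD))
print(cost, card, plan)
```

Restricting the enumeration so that one side of every split is a single relation gives the left-deep variant with the smaller search space mentioned above.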
Okay, now there are some more details to do with interesting sort orders. For lack of time I am going to skip them here, but let me briefly mention what the issue is. The previous slide assumed that, given a set of relations, all I want is the best plan for it. The thing is that different plans may produce their results in different sort orders. If a result is unsorted and I want to do a merge join above it, I may have to sort it first, or I may have to fall back to a hash join. On the other hand, if some join operation below happened to use a merge join, its result may already be sorted on a join attribute, which means I can use a merge join above it directly, without a sort. So the bottom line is that the previous algorithm is not quite right, because the way in which an intermediate result is sorted affects the cost of future joins. In other words, I do not just want the best plan for a given set of relations; I may want the best plan for each of several useful sort orders. If I am going to use a merge join above, I may ask: give me the best plan whose output is sorted on the join attribute. If I am going to use a hash join, I may say: just give me the best plan, I do not care about the sort order. So the procedure can be modified to take an extra parameter, namely the sort order, and dynamic programming still applies; all the good properties carry over, and we can take sort orders into account and still get the best plan. You can do either the top-down version, which we saw, or the bottom-up one; System R, which used the bottom-up version, pre-computes which sort orders will be useful later, the interesting sort orders, and generates the best plan for each of them. Both variants work; I am going to skip the details.

The next set of slides is on statistics for cost estimation. For lack of time I will not get into the details here; you can read them up later. I will just mention a couple of things. One is estimating the size of a selection, and there are several different cases. If we have a selection on equality and we know the equality is on a key, we know there is going to be at most one result. If the selection is on something which is not a key, how do we know how many results there will be? Even for a simple selection of the form A equal to 5, how do you know how many tuples have A equal to 5? We saw an example yesterday where we had a rating attribute and the database knew there are only five distinct values for the rating; it had computed that statistic. Then, when you say select where rating equal to 4, it estimates that one in five tuples will satisfy the condition. So if it knows there are 100,000 tuples, it estimates that 100,000 divided by 5, which is 20,000 tuples, will satisfy rating equal to 4, or rating equal to 1 for that matter. Now, this assumes uniformity. What if the ratings are not uniform? What if a rating of four or five is rare? Take movie ratings: how many movies get a rating of five out of five? You have seen these movie ratings in newspapers or on websites; when was the last time you saw a movie with a rating of five? It is very rare; critics never like to give five. They do give four, but four is rare; three is more common, two is common, one may be less common; and nobody gives zero typically, so let us say zero is not an option. So there is non-uniformity. Now, if I have a histogram which gives the frequency of the different values, then given a particular selection, say rating equal to 5, I can get a good estimate that it is a very small number.
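As a rough sketch of how a frequency histogram replaces the uniformity assumption, consider the following; the rating counts are made up purely for illustration and are not from any real system.

```python
# Toy comparison of a uniformity-based estimate versus a histogram-based one.
# The histogram below is invented: rating value -> number of tuples with that value.

histogram = {1: 8_000, 2: 30_000, 3: 40_000, 4: 18_000, 5: 4_000}
n_tuples = sum(histogram.values())      # 100,000 tuples in total
n_distinct = len(histogram)             # 5 distinct rating values

def estimate_eq(value):
    """Return (uniformity estimate, histogram estimate) for rating = value."""
    uniform = n_tuples / n_distinct      # uniformity assumption: 20,000 each
    with_hist = histogram.get(value, 0)  # read straight off the histogram
    return uniform, with_hist

print(estimate_eq(5))   # uniform says 20,000; the histogram says only 4,000
print(estimate_eq(3))   # uniform says 20,000; the histogram says 40,000
```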
A selection on rating equal to two or three, on the other hand, will be estimated as a much larger number. That is where a histogram comes in useful: it helps us avoid the uniformity assumption. There are different kinds of histograms, which for lack of time I will not get into. Then there are a bunch of slides on how to do selection size estimation, for simple and complex selections, and join size estimation. Let me just spend a minute on join size estimation. The most common kind of join is a foreign key join, and here size estimation is easy: if I join a relation with another, and the join attribute is a foreign key of one referencing the other, I know the result will be exactly the size of the referencing relation. So that is very easy. But what if there are selections involved, a selection on the referenced relation, or a selection on the referencing relation? Then it becomes a little more complex; we can still use the information about foreign keys and get a better estimate. And if it is not a foreign key join at all, if it is a join on some pair of attributes with no foreign key referencing in either direction, how do you know what the size of the join is? Again, if you know the number of distinct values, you can get a good estimate.

So let us look at these two estimates. Assume the join is on an attribute A, a common attribute, so we are taking a natural join. V(A, S) denotes the number of distinct values of attribute A in relation S. How do we use it? For each tuple in R, how many tuples of S, on average, is it going to match? If S has 100,000 tuples and there are only five distinct values of A, each of those values occurs about 20,000 times, so each tuple of R is estimated to match 100,000 divided by 5, which is 20,000 tuples of S. That is a reasonable estimate. But you will notice that the situation is entirely symmetric: why should I use only the statistic for S? I can equally take a tuple of S and ask how many tuples of R it matches, and that estimate uses nR divided by the number of distinct values of A in R, giving the other formula. So I have two estimates for the same thing, and it has been argued that the lower of the two is more likely to be close to the correct value. So that is how you do join size estimation in general; again, you can do better if you have histograms on both sides, and so forth. We are going to skip the details, and there are more details in the book for estimating other operations.

It is not just size estimation; we also need to estimate how many distinct values there are. Why do we need this? As we just saw, to estimate the size of a join, we need to know how many distinct values there are for the join attributes. Suppose I have a join which looks like this, R1 join R2, maybe join R3, and the join higher up is on attribute A. To estimate the cost or size of that join, I need to know how many distinct values of attribute A there are in the intermediate join result, so I need to figure out how to estimate that as well. I cannot compute any of these exactly; they have to be statistical estimates, because I cannot run the query to find out. I am looking at a huge number of plans, so I have to estimate these numbers quickly.
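Written out, the two estimates discussed a moment ago are the standard textbook ones; here \(n_R\) is the number of tuples of R and \(V(A, R)\) the number of distinct values of attribute A in R:

\[
\text{size}\bigl(\sigma_{A = v}(R)\bigr) \;\approx\; \frac{n_R}{V(A, R)}
\]
\[
\text{size}(R \bowtie S) \;\approx\; \min\!\left( \frac{n_R \cdot n_S}{V(A, S)},\ \frac{n_R \cdot n_S}{V(A, R)} \right)
\]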
So, the book has details on how to estimate the number of distinct values in a join result, in a selection result, and so forth. We will wrap up join optimization, and cost-based optimization in general, with that. If there are any questions, you can ask me.

How do we decide which equivalence rules to use? So, your question is which rules the optimizer actually applies. The answer is that there is a sort of minimal set of rules which an optimizer like SQL Server would always use. What are those rules? The rules we have given are essentially almost minimal; there are a few redundant rules there, but we have not given a very large number, only about a dozen, and that is considered a small set. In fact, there are many rules we have not given, for example a lot more rules which apply to aggregation operations. If the query has no aggregation, such a rule is not useful, so it will not even be used; if the query has aggregation, that rule may be included, and the optimizer will use that set of rules. And like I said, there are more complex rules which apply very rarely; those are tried only if the basic set of rules does not produce a good plan.

Are these rule sets fixed per database? Yes, they are fixed. First of all, there are only two commercial optimizers which use transformation rules in this way: one is SQL Server, the other is some version of Sybase. There are some research prototypes also; there is one built at IIT Bombay which is available in the public domain, and Volcano was the original one which started the whole thing; that is a very old code base, but I think it is still available. Each uses its own set of rules. All the rest do only join order optimization, and then they use some heuristics, or they use some of these transformation rules as heuristics. What do I mean by that? A transformation rule does not say: if the expression matches this, you must rewrite it to that. It says the two are equivalent, so you can use either one: consider both alternatives and find the best plan. The same rule can instead be used as a heuristic, which says: if the expression matches the left-hand side, always rewrite it to the right-hand side. If the optimizer implementer decides that the right-hand side is almost always going to be cheaper than the left-hand side, then the rule can be applied heuristically. Many optimizers use such heuristic rules: they do join order optimization to consider all the join orders, but to handle other operators, they use some of these rules heuristically. Any other questions?

Is there an approximation algorithm for join ordering? That is actually an excellent question; let me repeat it for the benefit of the recording. Join order optimization is known to be an NP-complete problem, which is why the algorithms we saw are all exponential. So, are there good approximation algorithms for it? This was an open question for a very long time, till about five or six years back, when Professor Sumit Ganguly of IIT Kanpur had a paper which showed that you cannot even approximate it; it is actually a very hard problem. So that was a very nice result, and it is from here, from IIT Kanpur. What is used in practice in industry are heuristics; there is no guarantee, they are heuristics.
There are, like I said, heuristics to cut down the cost if the number of relations becomes too large. Some of them are polynomial heuristics; some are not polynomial, but they at least reduce the exponent, from 2 to the power 100 to maybe 2 to the power 20 or something like that, so it becomes a more manageable number. I assume everyone knows what an approximation algorithm is: you get a plan whose cost is within some factor of the optimal. The result I mentioned says you cannot always get within any constant factor of the optimal in polynomial time; that is what it means to say the problem cannot be approximated in polynomial time.

So, join-order optimization and transformations are one part of the story. It turns out that SQL as a language has some features which cannot be handled directly in the simple relational algebra we have seen. In particular, nested subqueries cause a problem because they cannot be represented directly in relational algebra. So the question is: how do you optimize nested subqueries? There are two answers used in practice. The first is to extend relational algebra a little to handle nested queries, but not extend the optimizer correspondingly; instead, we rewrite the SQL query to remove nested subqueries wherever possible. This is a heuristic, and most databases have some form of it. How good it is depends on the database. For example, in PostgreSQL, I would request you to try this out if you can; I will add it to the set of exercises. You can give a nested subquery and find the execution plan for it. If you give a simple nested subquery, PostgreSQL might actually transform it into a join or a semijoin, but if you give a more complex nested subquery, PostgreSQL will not be able to do anything; it will actually run the outer query and, for each outer tuple, evaluate the inner query and apply the other conditions.

Let me show an example of what I mean. We have a nested subquery: select name from instructor where exists (select * from teaches where instructor.ID = teaches.ID and teaches.year = 2007). This is a simple nested subquery with a correlation variable: it uses instructor from the outer query inside the inner query's where clause. It simply finds instructors who have taught some course in 2007. If you give this query to a human, all of us know that we can alternatively do it using a join; a nested subquery is not really required. We could have instructor join teaches where teaches.year = 2007. There is a slight difference: if I do just instructor join teaches, I will get duplicates, so if I want the correct duplicate count, a little more work has to be done; let us ignore that for the moment, it can be done. But the bottom line is: how does the optimizer transform a nested subquery like this into a join? The answer is that, again, there are effectively equivalence rules which can be applied to do this transformation. The intuition is that joins are likely to be much more efficient than a nested subquery. Why? Because a nested subquery, evaluated as is, will be evaluated once per outer tuple. If the instructor relation has a large number of tuples, the inner query is executed many times, and each execution may do some random IO; so overall you may get a huge amount of random IO, which makes it very slow.
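To make this concrete, here is a sketch of the correlated form and of the join (semijoin) form it can be rewritten into, in the extended relational-algebra notation mentioned in a moment below; the exact syntax for writing the nested condition varies from book to book:

\[
\Pi_{name}\Bigl(\sigma_{\exists\,\bigl(\sigma_{teaches.ID = instructor.ID \,\wedge\, year = 2007}(teaches)\bigr)}(instructor)\Bigr)
\;\equiv\;
\Pi_{name}\Bigl(instructor \ltimes_{instructor.ID = teaches.ID} \sigma_{year = 2007}(teaches)\Bigr)
\]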
If instead we can transform it into a join of instructor and teaches, we can perhaps use a merge join and evaluate it very efficiently, with very little random IO. So it is a good heuristic to transform nested subqueries into joins wherever possible and to avoid what is called correlated evaluation, which is the default way such a query is evaluated in SQL.

We can write the original query in relational algebra as follows. We have a selection on instructor, where the selection condition theta itself involves a subquery, and that subquery is basically the nested query: a selection on teaches with the condition year = 2007 and teaches.ID = instructor.ID, wrapped in an exists. What that means is that instructor.ID is being passed down into that inner selection. So this is not traditional relational algebra; it is an extended relational algebra, but that is what is used to evaluate SQL. If you run it exactly as written, this is correlated evaluation. Yes, there is a dependency: the operator takes the result of whatever sub-expression is on the outer side, and for each tuple of that result it invokes the inner side and evaluates the whole inner sub-expression once, passing in the parameters; in this case the condition evaluated there is instructor.ID = teaches.ID and year = 2007. So it is actually invoking a relational algebra expression many times, once per outer tuple. No, the dependency cannot be cyclic, because the subquery is in the where clause: variables from the outer query can be used in the inner query, but not the other way around, so the dependency is always one-way.

Now how do you do this rewriting from a nested subquery to a join? I will not get into the details, but here is a small example of a rewriting which actually preserves the number of duplicates correctly. What does it do? First, it creates a temporary table T1 as select distinct ID from teaches where year = 2007, so each instructor ID is stored only once, even if that instructor taught many courses. Then T1 is joined back with instructor on T1.ID = instructor.ID. This ensures that only instructors who taught a course in 2007 come out, and it ensures there are no spurious duplicates, simply because T1 does not have any duplicates. Instead of a create table you could have used a with clause, which is equivalent. That is essentially what the actual rewriting does; this is called decorrelation, and there are a number of transformation rules for decorrelation which we will not get into at this point.

To wrap up query optimization, let me mention the idea of materialized views. We have discussed this earlier, I think. A materialized view is a view whose result is actually computed and stored, as opposed to a normal view, whose result is never stored; it is computed as required and thrown away. In fact, with a normal view the result may not even be computed fully: if I have a query which uses a view, the view is replaced by its definition, and then the query is optimized. The optimizer may move relations around, so the view definition gets merged into the query and the view as such is never actually computed. In contrast, a materialized view has been computed and stored. So what are the issues here, and why would you do this in the first place? The most common use of materialized views is to pre-compute aggregates.
So suppose I have an analyst who wants to know the sales of a particular product in a particular region in each month of each year; that is one kind of aggregate query. Now, the sales relation can be very large: take any large retailer, Big Bazaar or any of the large electronics shops; these have huge volumes of sales, each selling millions of items a day, especially somewhere like Big Bazaar. If you run an aggregate query over hundreds of millions of rows, it is going to take a long time. If the analyst issues a query and then has to wait an hour or two for the result, that is not a very efficient use of the analyst's time. What you would like is to pre-compute all of this before the analyst even looks at it, so that the analyst can just retrieve the pre-computed result. That is where materialized views are most useful: you create materialized aggregate views, which are maintained; every day new sales come in and the view is updated. That is a typical use.

So the question is how to incrementally update the result of a view as new records come in. Again, we do not have time for the details, but the book describes how to do this for individual operations. If I have a group-by operation, how do I maintain it incrementally when new tuples arrive? If I have a join and new tuples arrive, how do I update it? And it is not only insertions: tuples may also be deleted, or even updated, and in each case the materialized view has to be changed accordingly. All of this has to be taken into account, and the book has details on how to do it for individual operations as well as for an entire query, because a materialized view is defined by a query with multiple operations. So if any underlying relation changes, we have to compute the change to the view result and apply it. How to do this is described in the book; I am going to skip the details here.

The final thing is a question somebody asked yesterday: given that we have indices, and materialized views make the problem correspondingly more complex, which indices and which materialized views should we create? The answer is that it depends on the workload. And how do you get the workload? Pretty much all databases let you turn on a parameter which logs all the queries that are executed, so you get a log of all the queries that ran over some period of time. You can use this to decide which indices and which materialized views to create. You can do it manually, which is actually very hard, or you can use a tool provided by most of the commercial databases, called an index tuning wizard or database tuning advisor; the names differ, but all the major commercial databases have one. PostgreSQL does not, since it is not commercial, but the commercial ones do. You can use such a tool to help you make this choice. Any questions? There are some other optimization techniques in the book whose details I will skip. That wraps up our session on query optimization, and it leaves us about two and a half hours for transaction processing, so obviously I will be going fast over that. Are there any questions before I switch to transaction processing?

Of course, multi-query optimization: if you get a set of queries, how do you optimize them collectively?
It turns out that the dynamic programming algorithm, where the best plan for something is built from the best plans for its subsets, does not quite work when you have a set of queries. Dynamic programming is based on what is called the principle of optimality: the best plan for something is composed of the best plans for its sub-parts. The problem with multi-query optimization is that you may have a sub-plan which is not optimal for any single query, but which is shared by many different queries, so it makes sense to use that shared sub-plan rather than the locally optimal plan for each query. That has a lot of repercussions: it makes the optimization problem a lot more difficult, the dynamic programming has to change significantly, and it becomes very expensive, which is not practical. Therefore we need heuristics which are reasonably efficient and work well in practice, and implementing those heuristics itself turns out to be non-trivial; it takes a bunch of clever tricks to make even the heuristics run efficiently. That problem has been addressed, including some work from here, where we looked at which heuristics make sense and how to implement them efficiently, which itself turned out to be non-trivial; we did some work on that about 10 years back. And multi-query optimization has turned out to be quite useful in many domains; I do not know if anyone is using exactly that implementation, but the ideas are being used in some contexts. Yeah?

The question is: commercial applications nowadays that use huge databases mostly read and analyse the data rather than write to it, so for these read-only kinds of workloads, is the relational model still useful, and what about the optimization? Thanks, that is a nice question, and it forces me to talk about something I did not really cover in these sessions. A lot of applications today require data analysis as opposed to updates. In fact, the database application world is broadly split into two parts: one is called OLTP, for online transaction processing, and the other is decision support, of which one aspect is OLAP, which stands for online analytical processing; there are other aspects to decision support as well. OLAP is something I briefly alluded to a little while back, when I said an analyst needs answers quickly for aggregate queries. How do you support that? Basically, OLAP systems are designed to pre-compute a number of aggregates so that, whatever question the analyst asks, either the answer is already computed or it can be computed efficiently from one of the already computed answers, without ever going back to the underlying relation. The online part is basically that the answers can be given interactively, as opposed to telling the analyst: okay, I have accepted the query, come back tomorrow and I will give you the answer; that would not be online.

So, coming back to the specific question: should you use traditional relational database designs for decision support? The answer is that there are certain decision support queries for which traditional databases are fine, but there are also other representations.
The relational model is not changed, but the underlying implementation is changed. There is something called a column store, which has been shown to be quite useful for many decision support applications, not all, but many. What is a column store? In a traditional database, as we saw, we have a notion of pages or blocks with records inside them, and all the attributes of a record are stored together. What many people realized is that for some of these decision support queries you have records with many fields, most of which are not used by most queries; a typical query accesses only one or two fields. If you have to read pages containing all the other fields as well, it becomes quite slow. So they said: let us switch to a column representation, which essentially looks like this. Say R(A, B, C) is a relation. They store attribute A of all the tuples in one file, attribute B of all the tuples in another file, and attribute C of all the tuples in a third file. What have we done? Instead of storing records row by row, we have cut them vertically: for a particular column, we store the values for all records together. We have re-clustered the data from a row-major to a column-major representation. Why is this useful? If I do not want to access B and C, only A, I can read just one file, which is one third of the data or maybe much smaller. The second trick is that once all the values in a file come from the same domain, compression works much more effectively. A file which in raw form would have been 100 megabytes, say all integers representing ages, can be compressed very effectively, down to perhaps five megabytes instead of 100. Some of these compression techniques could be applied even in the traditional row representation, but they are not as effective there, because many different kinds of attributes are stored together. So, by a combination of avoiding reading unneeded data and better compression, a column store can be much faster for many queries, even five to ten times faster than the row-oriented representation.

People have responded in different ways. Some have built entire databases using only the column store representation. Others have taken a traditional database and added indices which are really column stores: they keep the rows, but maintain an extra column store which is like a materialized view or an index on the original relation; it is really an index. Yet others have looked at hybrids which, within a block, store tuples column-wise, but keep all the attributes for a particular set of tuples together in one block; so it is not exactly column-major and not row-major, more like a zigzag order. Different variations have been tried out in practice and implemented in different systems. I hope that answers your question, unless you had something else.

What about unstructured data? Nowadays people look at something like Google as an unstructured database. Okay, so the question is: we have this relational model, which has a very strong structure.
We do detailed schema design, but there are many applications where the structure is not as firm and people want a flexible structure. Take users, for example. Any company like Google keeps user profiles. What are the attributes of a user profile? There are some common attributes: login name, name, age or date of birth, location, a few things which are common for all users. Then there are the different properties, the different parts of the Google empire or the Yahoo empire, and each of those may want some extra attributes stored about that user: one component wants to store the set of friends of the user, another wants to store something else. The net result is that the set of attributes may change dynamically, and it is different for different users. These are called flexible schemas. There has been work on two fronts. One is within relational systems themselves: SQL Server, for example, internally, at least in a prototype, and I do not know if it has been released in the 2008 version, I think it may be there in the next version, has support for relations where you can add attributes as you please; a relation can have thousands of attributes and still be stored efficiently, and you can add attributes dynamically without causing any problems. That is one example of a flexible schema. And of course, non-relational data stores of the kind used at Google, Yahoo and pretty much everybody else all support flexible schemas which let you add attributes on the fly.

In fact, some of these use what is called JSON. If you are familiar with JavaScript, JavaScript has a notion of objects, but unlike traditional objects, JavaScript objects can have as many attributes as you want; JavaScript is naturally a flexible-schema language. And it has a storage representation called JavaScript Object Notation, JSON, which is a text representation for JavaScript objects with flexible schemas. This is becoming quite popular: many of these data stores now let you directly store objects in JSON and retrieve them, so when you store from JavaScript and retrieve from JavaScript it is very efficient. If you do it from PHP, there are converters which take a PHP object, convert it automatically to JSON, let you put it into one of these stores, and conversely retrieve it and give it back as a PHP object. A small sketch of the idea follows below. Any more questions? Thanks; I think there will be a good set of questions.
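As a small closing illustration of the flexible-schema idea just mentioned, here is a sketch of two user records with a few common attributes and different per-user extras, serialised as JSON; the field names are invented and the example is not tied to any particular data store.

```python
# Two "user" documents with a shared core and different extra attributes,
# round-tripped through JSON. Purely illustrative field names.

import json

users = [
    {"login": "alice", "name": "Alice", "location": "Bangalore",
     "friends": ["bob", "carol"]},                      # extra attribute: friends
    {"login": "bob", "name": "Bob",
     "date_of_birth": "1990-05-01", "mail_quota_mb": 500},  # different extras
]

serialised = [json.dumps(u) for u in users]   # what would be stored
restored = [json.loads(s) for s in serialised]
print(restored[0].get("friends"), restored[1].get("friends"))  # ['bob', 'carol'] None
```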