So, the last two things I want to cover in this chapter are two other types of optimization techniques. The first is how to optimize nested subqueries. What we saw so far was based on join order optimization and transformations. Now, it turns out that nested subqueries cannot be represented directly in relational algebra, so people have come up with other operators to deal with them, and transformations involving those new operators. I am not going to get into that at this point, but I want to point out another technique which is much more widely used, and I will show it by example.

Here is a nested query: select name from instructor where exists (select * from teaches where instructor.ID = teaches.ID and teaches.year = 2007). What is this query doing? It finds the names of all instructors who taught a course in 2007; that is what the nested subquery is ensuring. This is a correlated subquery, because there is a condition inside it which uses a relation from the outer query, instructor. Instructor.ID is the correlation variable: instructor from outside is used inside.

When you have such a nested subquery, the default evaluation specified by SQL is to take each instructor and then evaluate the subquery for it. I am looking at one instructor at a time; when I am looking at a particular instructor, instructor.ID is fixed, so that value is used inside, and what I have is a copy of the subquery being run with a specific value for instructor.ID. That subquery returns all the teaches tuples which match that ID and have year 2007, and the outer query then checks whether this result is non-empty and, if so, outputs that instructor. This is called correlated evaluation. Now, it turns out that correlated evaluation can be horrendously expensive if, for example, there is no index on teaches.ID.
Each time I evaluate the subquery, I may have to scan through the entire teaches relation to find which tuples match the current instructor ID and year 2007. If I do not have any indices, I will scan teaches repeatedly, which is horribly inefficient. But that is just the semantics; SQL does not say you should actually execute it that way. So what can we do to optimize the execution? I can build an index on teaches.ID; then I do an index lookup, find the matching teaches records, and check the year. That is better, but not necessarily the best way of doing this. With an index, I could still have a lot of IO. Remember the analysis of using an index in an index nested loops join: each index lookup could land me on a different record somewhere, so I might be doing a lot of random IO, a lot of seeks, one seek per instructor. If I have a lot of instructors, that could be pretty expensive. It is not as bad as scanning teaches all over again, but it is still a lot of seeks. Can we do something different?

So this slide just summarizes the inefficiency of correlated evaluation: lots of calls, lots of unnecessary random IO. The previous query could perhaps be rewritten as follows: select name from instructor, teaches where instructor.ID = teaches.ID and teaches.year = 2007. This is not quite right, because there could be duplicates: if a particular instructor teaches many courses in 2007, that instructor's name will be output many times. So this is only a first cut. What can we do? Can we add a distinct? That is also quite close, but still not quite right. It turns out that if you want to preserve the correct number of duplicates, it is still possible; that is coming up in the next slide. In general, it is not possible to turn a nested query into a join directly like this, but there is a way around it using temporary relations, and that is shown next.
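The duplicate problem with the naive join rewrite can be demonstrated concretely (SQLite, made-up data): an instructor who taught two courses in 2007 appears twice in the join, while adding DISTINCT on name wrongly merges two different instructors who happen to share a name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instructor (ID INTEGER, name TEXT);
    CREATE TABLE teaches (ID INTEGER, course_id TEXT, year INTEGER);
    -- Two distinct instructors both named 'Wu'; instructor 1 teaches two courses.
    INSERT INTO instructor VALUES (1, 'Wu'), (2, 'Wu');
    INSERT INTO teaches VALUES (1, 'CS-101', 2007), (1, 'CS-347', 2007),
                               (2, 'FIN-201', 2007);
""")

# The correct answer (per the EXISTS semantics) is 'Wu' twice: once per instructor.

# Naive join: instructor 1 matches two teaches tuples, so 'Wu' appears 3 times.
join = conn.execute("""
    SELECT name FROM instructor, teaches
    WHERE instructor.ID = teaches.ID AND teaches.year = 2007
""").fetchall()
print(len(join))       # 3 -- too many

# Adding DISTINCT collapses the two different instructors into one row.
distinct = conn.execute("""
    SELECT DISTINCT name FROM instructor, teaches
    WHERE instructor.ID = teaches.ID AND teaches.year = 2007
""").fetchall()
print(len(distinct))   # 1 -- too few
```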
So, what we do is: create table t1 as select distinct ID from teaches where year = 2007. The first step finds all those who taught a course in 2007, just the IDs; I am not finding the names at this point, and I am doing distinct, so each ID appears only once. The next step is to join instructor with this table t1, with the condition t1.ID = instructor.ID. Now, this join can be done using any join algorithm: merge join, hash join, or even index nested loops join if the optimizer so wishes, whichever is cheapest. The important thing to note is that the number of duplicates is now correct. At this point t1.ID has no duplicates, so each instructor tuple will match at most one t1 tuple, and a name appears exactly as many times as there are matching instructors. If there are two instructors with the same name, the name appears twice. If I instead put a select distinct name on the join, the name would appear only once, which is wrong. So this is a correct translation of the nested query, whereas the earlier direct join was not. In fact, instead of create table, you could use a with clause here, which is probably more efficient.

That is a translation from SQL to SQL, and this process of taking a nested query and replacing it by a query which does not have nesting is called decorrelation. So what people said is: we will take the SQL query and rewrite it into a decorrelated SQL query, one which does not have a nested subquery. However, decorrelation is more difficult if the nested subquery uses aggregation, or if the result of the nested subquery is used to test for equality, or if it uses not exists or not in, and so forth. All these cases are harder. So the initial work was based on rewriting the SQL query.
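The two-step decorrelation above can be sketched as follows (SQLite, made-up data): project the DISTINCT IDs of those who taught in 2007 into a temporary table, then join. Each instructor tuple now matches at most one t1 tuple, so the duplicate count comes out exactly right.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instructor (ID INTEGER, name TEXT);
    CREATE TABLE teaches (ID INTEGER, course_id TEXT, year INTEGER);
    INSERT INTO instructor VALUES (1, 'Wu'), (2, 'Wu'), (3, 'Mozart');
    INSERT INTO teaches VALUES (1, 'CS-101', 2007), (1, 'CS-347', 2007),
                               (2, 'FIN-201', 2007);

    -- Step 1: just the IDs of 2007 teachers, each ID at most once.
    CREATE TABLE t1 AS SELECT DISTINCT ID FROM teaches WHERE year = 2007;
""")

# Step 2: an ordinary join, which the optimizer may execute with merge join,
# hash join, or index nested loops -- whichever is cheapest.
rows = conn.execute("""
    SELECT name FROM instructor, t1 WHERE t1.ID = instructor.ID
""").fetchall()
print(rows)   # [('Wu',), ('Wu',)] -- same multiset as the correlated query
```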
More recently, the SQL Server people said: why are we rewriting from SQL to SQL? Let us implement decorrelation as part of the optimizer itself, using new operators. There is an anti-join operator, and then a bunch of other tricks which allow them to translate many, many nested queries into joins. That kind of decorrelation is used in SQL Server. Now, other databases also do decorrelation: Oracle does, PostgreSQL does, every database does it. The only question is how complex a query each of them can decorrelate. PostgreSQL is one of the less sophisticated ones: it can decorrelate all simple nested queries, but if you make the query more complex, it may give up and fall back to correlated evaluation. Oracle is better at it, but SQL Server is the master of this particular game; they have done some really cool stuff, and there are papers which they have written talking about it. So they do a really nice job of decorrelation.

That brings us to another question, which I have phrased as a quiz question, but I am just going to answer it. Given the option of writing a query using a join versus using a correlated subquery, what is the right choice? This is a question some people have been asking during this workshop. The first option, "it is always better to write it using a subquery": no. Whether to write it as a subquery depends on your database, which might or might not be clever enough to rewrite it. The second option, "some optimizers are likely to get a better plan if the query is written using a join than if it is written using a subquery": this is true. SQL Server will probably be cleverer than you here, and it does not matter which way you write it for SQL Server, but for PostgreSQL this is probably true. The third option, "some optimizers are likely to get a better plan if it is written as a subquery": no, pretty much no database would get a better plan this way. And "none of the above" is not an option. The last topic I want to cover is materialized views.
Again, many people have been asking questions about materialized views on the chat. I already told you what a materialized view is, so let me repeat it here: it is a view whose contents are computed and stored. Take a view like this: create view department_total_salary(dept_name, total_salary) as select dept_name, sum(salary) from instructor group by dept_name. If I compute this once (supposing instructor is a very large relation, which is unrealistic, but let us suppose), I can then get the total salary very quickly. If I need this frequently, it is worth computing it and storing it instead of re-executing the query each time.

Now, of course, the problem is that you have to maintain it. The moment an instructor is inserted or deleted, or a salary is updated, or the department of an instructor is changed, all of these changes require us to update the materialized view. This is called materialized view maintenance. One way is to recompute it on every update; that is horribly expensive. Another way is to recompute it periodically, which is ok for certain applications: when I do not mind seeing yesterday's statistics, recomputing it every night may be an option. But if the database is really large, even recomputing it could be pretty expensive. The last option, which is better in general, is incremental view maintenance: we collect the changes which have happened to the database, use them to compute the changes to the view, and then update the view. Incremental view maintenance may be done immediately; SQL Server does it this way. This view was defined on instructor, so as soon as there is any change to the instructor relation, if I had made this a materialized view, SQL Server would immediately update it. Other systems do not do it immediately by default, but if you specify it, you can do it that way.
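Here is a minimal sketch of "immediate" incremental maintenance, simulated in SQLite with an ordinary table plus a trigger. Real systems such as SQL Server do this inside the engine; the trigger is only to illustrate the idea, and for simplicity it handles only inserts into an already-existing department group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instructor (ID INTEGER, name TEXT, dept_name TEXT, salary INTEGER);
    INSERT INTO instructor VALUES (1, 'Wu', 'Finance', 90000),
                                  (2, 'Mozart', 'Music', 40000);

    -- The "materialized view": the aggregate is computed once and stored.
    CREATE TABLE department_total_salary AS
        SELECT dept_name, SUM(salary) AS total_salary
        FROM instructor GROUP BY dept_name;

    -- Incremental maintenance: on insert, adjust only the affected group
    -- instead of recomputing the whole aggregate from scratch.
    -- (Simplified: assumes the department already has a row in the view.)
    CREATE TRIGGER maintain_dts AFTER INSERT ON instructor
    BEGIN
        UPDATE department_total_salary
        SET total_salary = total_salary + NEW.salary
        WHERE dept_name = NEW.dept_name;
    END;
""")

conn.execute("INSERT INTO instructor VALUES (3, 'Singh', 'Finance', 80000)")
row = conn.execute("""
    SELECT total_salary FROM department_total_salary WHERE dept_name = 'Finance'
""").fetchone()
print(row[0])   # 170000 = 90000 + 80000, updated without a full recompute
```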
Oracle will also allow you to do incremental view maintenance, but deferred: once a day, say, you collect all the changes made over the day, and at night you compute the changes to the view and apply them, instead of recomputing the whole view. There are a lot more details on how to do this in the book; I am going to skip the details of incremental view maintenance.

The last part is: which materialized views should you create? That depends on what queries your system is seeing. There is a closely related question: which indices should be created on the database? The default is primary key and foreign key indices, but depending on the queries you are seeing, it may be really useful to create other indices. The way to do both of these is based on a workload, that is, the typical queries and updates you see during a period of time. If you have an application which is being used, you can collect this: you can see what people are running. If it is not currently live, you can estimate how many times each interface will get used, and from that estimate what the workload is likely to be. The typical goal is then to choose indices and materialized views which minimize the time to execute the workload. At the same time, some queries are very important and you want a very quick response, so you have constraints on the time taken for certain critical queries and updates, and also constraints on the space available for you to use. All of these have to be taken into account to make the choice. How do you do that? It is hard to do fully manually, but you can do certain things by hand. You can look at your query workload, pick out the important queries, and examine the plan and the time each takes. Suppose you see that a plan does not use indices: it is a simple query, but it is not using an index and it is taking a lot of time.
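The deferred style of maintenance described above can be sketched as follows (SQLite, made-up data): changes are logged into a delta table during the day, and a "nightly" refresh folds the accumulated deltas into the stored view in one batch. Real systems capture the deltas inside the engine; this is only an illustration of the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instructor (ID INTEGER, name TEXT, dept_name TEXT, salary INTEGER);
    INSERT INTO instructor VALUES (1, 'Wu', 'Finance', 90000);

    CREATE TABLE department_total_salary AS
        SELECT dept_name, SUM(salary) AS total_salary
        FROM instructor GROUP BY dept_name;

    -- Log inserts into a delta table instead of touching the view right away.
    CREATE TABLE delta (dept_name TEXT, salary INTEGER);
    CREATE TRIGGER log_insert AFTER INSERT ON instructor
    BEGIN
        INSERT INTO delta VALUES (NEW.dept_name, NEW.salary);
    END;
""")

# Daytime updates: the view is momentarily stale, only the delta table grows.
conn.execute("INSERT INTO instructor VALUES (2, 'Singh', 'Finance', 80000)")
conn.execute("INSERT INTO instructor VALUES (3, 'Patel', 'Finance', 70000)")

# "Nightly" refresh: apply all accumulated deltas in a single pass, then clear.
conn.executescript("""
    UPDATE department_total_salary
    SET total_salary = total_salary +
        (SELECT COALESCE(SUM(salary), 0) FROM delta
         WHERE delta.dept_name = department_total_salary.dept_name);
    DELETE FROM delta;
""")
total = conn.execute(
    "SELECT total_salary FROM department_total_salary WHERE dept_name = 'Finance'"
).fetchone()[0]
print(total)   # 240000 = 90000 + 80000 + 70000
```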
"A lot" might not be that much in absolute terms; it might take 2 seconds, but 2 seconds is a lot if you have a thousand people running this query per hour. So you may run explain on the query, see the plan, and then say: maybe I can create an index, and it might speed things up. You add the index, run explain again, and see if it actually improved things. That is very painful; unfortunately, that is the only way in PostgreSQL today. But if you use Oracle, SQL Server, DB2 and others, they have what are called tuning assistants or wizards that can help you through this process. They will help you collect the workload and suggest what indices and materialized views to create.

So, let me wrap up optimization; I am way beyond time, and I am sorry for that. There are a number of other optimization techniques in the book, including top-k queries, multi-query optimization, and parametric query optimization, topics close to my heart because my first PhD student's thesis was on multi-query optimization. It is a really nice piece of work. My second PhD student, Arvind Hulgeri, worked on parametric query optimization, and this was also a really nice piece of work. So there are two nice pieces of work which came out of IIT Bombay on these two topics. I am going to skip that quiz and wrap up today's morning session. See you at 5.15. Bye.
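The manual tuning loop described above can be sketched with SQLite's EXPLAIN QUERY PLAN: look at the plan, add an index, look again. Table and index names are made up for the example; the exact plan text varies by SQLite version, but the before/after difference is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teaches (ID INTEGER, course_id TEXT, year INTEGER)")
conn.executemany("INSERT INTO teaches VALUES (?, ?, ?)",
                 [(i, 'CS-101', 2000 + i % 10) for i in range(1000)])

query = "SELECT * FROM teaches WHERE ID = 42"

# Step 1: look at the plan. Without an index the plan is a full scan.
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before)   # plan text mentions a SCAN of teaches

# Step 2: add the index you suspect will help.
conn.execute("CREATE INDEX teaches_id_idx ON teaches (ID)")

# Step 3: look at the plan again and check that it actually changed.
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(after)    # plan text now mentions a SEARCH using the index
```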