Let me wrap up by mentioning a few more important things about optimization. So far, we focused on join order optimization. But you have been using a lot of nested subqueries. What should an optimizer do with nested subqueries? How does it evaluate them? It turns out this is a very important issue.

Let us take an example of a nested subquery: select name from instructor where exists (select * from teaches where teaches.ID = instructor.ID and teaches.year = 2000). The outer relation, instructor, is being used inside the subquery. This is an example of what is called a correlation variable: an attribute from the outer relation is used in the inner subquery. What is this query doing? For each instructor, logically, it looks through the teaches relation, finds all teaches tuples matching that instructor, and checks if the year is 2000. If so, the subquery gives a non-empty result and that instructor is output. So what the query means is: did this instructor teach a course in 2000?

How does SQL evaluate this? The default evaluation strategy, which can be very, very inefficient for any nested subquery, is basically what I just described. Take the outer query; if there are multiple relations there, do the join of those relations. For each tuple in that join result, invoke the subquery with its parameters. What is the parameter here? The instructor.ID; that is made available to the inner query. The inner query is executed, its result is obtained, and then you check if the condition, in this case exists, is true. If so, the outer tuple can be output.

Now think about this. If there are many instructors, how many times is the inner subquery executed? If I have a thousand instructors, the inner query is executed a thousand times. If the inner query requires an index lookup or a relation scan, it is going to take some time. Even if it takes 100 milliseconds, a thousand times 100 milliseconds is 100 seconds. That is very slow for a data set with just a thousand instructors, and if the number is larger, this becomes very, very inefficient. So the moral of this story is: if you execute a nested subquery exactly as stated, it can be very inefficient.

What do optimizers do about this? The first line of attack, which works best when it works, is to take a nested subquery and turn it into a join. In this particular case, a human could recognize: why don't we just join instructor with teaches on instructor.ID = teaches.ID and teaches.year = 2000? How do you do the join? The optimizer may figure out that a hash join can be used, and a hash join does not do much random I/O, so it may finish much faster than the plan which ran the subquery once per instructor.

We have to be careful, though. Here is the earlier query turned into a join: select name from instructor, teaches where instructor.ID = teaches.ID and teaches.year = 2000. Why do we have to be careful? The issue is duplicates. In the earlier form, each instructor's name is output once. Here, if an instructor is teaching two courses, that instructor's name will appear twice. So we have to deal with this. There are ways to deal with it; I won't get into the details, but a good optimizer will take nested subqueries and do what is called decorrelation.
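To make this concrete, here are the two forms side by side, written against the university schema (instructor, teaches) used in the course examples. The DISTINCT in the join form is one simple way of handling the duplicates issue just mentioned, though it is not exactly equivalent in every case; for instance, two different instructors who share a name would be merged into one output row.

```sql
-- Correlated form: the subquery references instructor.ID from the
-- outer query, so naive evaluation runs it once per instructor tuple.
SELECT name
FROM instructor
WHERE EXISTS (SELECT *
              FROM teaches
              WHERE teaches.ID = instructor.ID
                AND teaches.year = 2000);

-- Decorrelated form: a single join, which the optimizer can evaluate
-- with hash join and similar algorithms. DISTINCT restores set
-- semantics, since an instructor who taught two courses in 2000
-- would otherwise appear twice.
SELECT DISTINCT name
FROM instructor, teaches
WHERE instructor.ID = teaches.ID
  AND teaches.year = 2000;
```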
If you remember, this default evaluation is called correlated evaluation. Decorrelation takes a nested subquery and turns it into a join, or in general a sequence of joins with multiple steps. Here it was easy; in other cases it may be more difficult. All optimizers do some amount of this decorrelation work, but the conditions under which it can be done can be quite complex. So very often you will find that one optimizer, say PostgreSQL, can decorrelate some queries but not others. You run the same query on Oracle; it may be able to decorrelate a nested subquery which PostgreSQL cannot. You run it on SQL Server; it may be able to decorrelate some things which Oracle cannot.

What is the impact of this? The same nested subquery may run very slowly on PostgreSQL but faster on Oracle, or slowly on Oracle but faster on SQL Server, or vice versa. So the performance of nested subqueries is hard to predict if you do not know which database you are running on. If you know you are running on PostgreSQL and you really understand what PostgreSQL is doing, you may be able to say: ah, this subquery is one which PostgreSQL can decorrelate, it will run fine. But in general you do not know. So how do you find out what is going on? The trick is to use the explain-plan feature. When you use it in today's lab, you will see certain situations where PostgreSQL is able to decorrelate a nested subquery, and some situations where it is not; in the latter case, the performance can be quite slow. So do try this out in the lab today.

I have already covered the material in this slide. The next one is a more complex rewrite which takes care of duplicates. I am going to skip the details, but I will just note again that decorrelation is more complicated when the nested subquery uses aggregation, or when the connective is not an exists but an equality, and so on. There are a lot of issues which a real optimizer has to deal with.

We can wrap up this session with a short quiz, just to check if you understood what I told you. The question: given the option of writing a query using a join versus using a correlated subquery, which of the following holds? Option one: it is always better to write it using the subquery. Option two: some optimizers are likely to get a better plan if the query is written using a join than if it is written using a subquery. Option three: some optimizers are likely to get a better plan if the query is written using a subquery. Option four: none of the above. Each center, please make sure the software is running; participants, please be ready to take the quiz. I am going to release it in just a moment. The timer has started; you can go ahead and answer now. Time is almost up... and time is over. We will see the results in just a bit, but let us look at the answer. As we just discussed, if you write a query using a subquery, sometimes the optimizer will not be able to get a good plan, so option one is not right; it cannot always be better to write it using a subquery. Option two, that some optimizers are likely to get a better plan if the query is written using a join than if it is written using a subquery, is in fact the correct answer.
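As a starting point for the lab, here is a minimal sketch of using the explain-plan feature in PostgreSQL. EXPLAIN and EXPLAIN ANALYZE are the actual commands, but the plan shape you get back, and whether it shows a decorrelated join, depends on your version and data.

```sql
-- Ask PostgreSQL for its chosen plan without running the query.
EXPLAIN
SELECT name
FROM instructor
WHERE EXISTS (SELECT *
              FROM teaches
              WHERE teaches.ID = instructor.ID
                AND teaches.year = 2000);

-- EXPLAIN ANALYZE actually executes the query and reports measured
-- row counts and times, which makes the cost of correlated
-- evaluation visible when decorrelation did not happen.
EXPLAIN ANALYZE
SELECT name
FROM instructor
WHERE EXISTS (SELECT *
              FROM teaches
              WHERE teaches.ID = instructor.ID
                AND teaches.year = 2000);
```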
To elaborate: in a case where the optimizer is not able to decorrelate the query, if you, a clever SQL programmer, had written it using a join in the first place, the optimizer has very efficient algorithms for doing joins and will come up with a good plan. If you write it as a subquery and the optimizer cannot decorrelate it, you will potentially get a much slower plan. So option two is true. Option three is the other way around: some optimizers are likely to get a better plan if the query is written using a subquery. That is actually not true, because all optimizers are very good at join order optimization; at best, if you write it as a subquery, they may do as well as the join. It is very, very unlikely they will do better. And option four, none of the above, is therefore wrong. So the correct answer is two. Let us see what people have said. We have 150 responses, which is good, and by far the majority have chosen the correct answer, which is two. Good.

I think that pretty much wraps it up. We have a few more slides on materialized views and how to incrementally maintain them. That is probably more than you can cover in a basic database course, and more than I can cover in the time allocated. I have kept the slides here, and I urge you to go read them, because it is very interesting stuff. So with that, we are done with this chapter; let me just mention these buzzwords. Materialized view maintenance is updating the stored result of a materialized view when the underlying data changes. Incremental view maintenance is updating it efficiently when there is a small change to the data: if one tuple is inserted or deleted, you do not recompute the whole view; you find out exactly what changes in the view and apply just that change, so the result is updated incrementally. I am not going to tell you how it is done, but it is there in the book.

A few more buzzwords. Materialized view selection is: given a set of queries, what is the best set of materialized views to create? In fact, there is a closely related subproblem, because an index is much like a materialized view; indices and materialized views are very similar. The index selection problem is: given a set of queries, what is the best set of indices to create in the database to support those queries? How do you do that? You can have a clever programmer who is good at tuning the database, who will look at the queries being run and say, this query could use this index, that query could use that index. But if you have a lot of queries, you may end up with a lot of indices, taking up too much space and too much time to maintain. So how do you choose a good set? It turns out a good tuner can figure this out, but to simplify the job for the rest of us, many databases, such as DB2, SQL Server, and Oracle, though not PostgreSQL, have a tool which can look at the queries you execute routinely. They collect those queries for some amount of time; then you say, analyze these queries and tell me what indices to create to help me across this workload. Similarly, they can even suggest what materialized views to create for the given workload. So if you have one of these databases, you can try their tuning assistants, or wizards, or whatever they call them, depending on the database. There are a few more advanced optimization techniques which are in the book.
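To attach syntax to those buzzwords, here is a small sketch. CREATE INDEX is standard SQL practice; the materialized-view statement follows the Oracle-style syntax, since, as noted, the PostgreSQL of this course's vintage did not support it, and all object names here are illustrative.

```sql
-- An index to support lookups of who taught in a given year.
CREATE INDEX teaches_year_idx ON teaches (year);

-- A materialized view precomputing an aggregate, so repeated queries
-- read the stored result instead of rescanning the base relations.
-- Maintenance (recomputing or incrementally updating this result as
-- instructor and teaches change) is exactly the problem named above.
CREATE MATERIALIZED VIEW dept_teaching_counts AS
SELECT dept_name, COUNT(*) AS num_sections_taught
FROM instructor NATURAL JOIN teaches
GROUP BY dept_name;
```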
I am going to skip the details. There is one more quiz; I won't conduct it formally, so let me just read it and give you the answer right away. The question: if all data is stored in main memory, is query optimization no longer needed? No, that is wrong. You can still have certain join orders which generate a lot of intermediate results and others which do not; query optimization will still be required. Will queries run faster? Yes, that is true: we are removing the cost of disk I/O. So option two is true. Option three says queries will run slower; that is silly. And none of the above is also wrong.

Why do I bring this up as a quiz question? Because memory sizes have exploded in recent years. Database sizes have also exploded for web services and the like, which deal with millions of users; but for a university, even a 50 percent growth in the student population is huge, so today most of our databases fit in memory. Database vendors have recognized this, and they are putting a lot of effort into tuning their systems for the situation where memory is very large, and also for the situation where flash disks are used instead of hard disks. So the next generation of database system releases will take this into account and perform better in these situations than the current generation.

If you have a question, please indicate it. I am going through the different centers to check if anyone has a question; please use the chat also if you have questions. I see one question from Samrat Ashok, Vidisha. Samrat Ashok, I am going to put you on; just give me one minute. Samrat Ashok, over to you. Hello, can you hear me, sir? Yes, please go ahead. Good morning, sir. The question is about cost-based optimization using dynamic programming: can this problem be treated as an NP-hard problem, especially with variations in cardinality and degree? Please switch off your mic. So the question is: is cost-based optimization an NP-hard problem? Yes; in fact, it is known to be NP-hard in the general case. There are some special cases where, under certain assumptions, it can be done more cheaply, but in general it is known to be NP-hard. In fact, there is a very interesting theoretical result from Sumit Ganguly of IIT Kanpur, which showed that not only is this NP-hard, even getting an approximate answer, that is, a plan whose cost is within some constant approximation factor of the best plan, is NP-hard. So it is known to be in a class of problems which in general is quite hard; there is no known cheap solution.

That said, there are a number of heuristics which people have been using, but all the heuristics fail in certain cases. Luckily for us, the number of joins in typical queries is small, so we can actually run an exhaustive algorithm. The problem is NP-hard, so we use an exponential algorithm: dynamic programming is exponential in complexity, as we just saw, 2 to the power n or 3 to the power n, but that is still acceptable because n is small. If n became 20, it would be slow; if n is 32, it is hopeless, since 2 to the power 32 will take forever and 3 to the power 32 is even worse. So the moral of this story is that as long as the query size is small, we can do exhaustive optimization with dynamic programming. If query sizes become very large, we cannot, and we have to use some kind of heuristic. So there is a lot of work on heuristics which do not guarantee anything.
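To put numbers on this, here is a back-of-the-envelope check of those growth rates: the 2^n counts the subsets of n relations the dynamic program must consider, and the 3^n bounds the total work over all ways of splitting each subset, since each relation is either in the left part, in the right part, or outside the subset.

```latex
% Growth of the join-order dynamic-programming search space.
\[
\begin{aligned}
2^{10} &= 1{,}024,               & 3^{10} &= 59{,}049,\\
2^{20} &\approx 1.0\times10^{6}, & 3^{20} &\approx 3.5\times10^{9},\\
2^{32} &\approx 4.3\times10^{9}, & 3^{32} &\approx 1.9\times10^{15}.
\end{aligned}
\]
```

So at 20 relations the 3^n term is already in the billions, matching the remark that n = 20 is slow and n = 32 is hopeless.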
Like I said, guaranteed approximation is also very expensive, but there are a number of heuristics which are routinely applied to larger queries, and in general they work well. There is no other option if you have a query with 30 joins. The good news is such queries are very rare; the bad news is they do occur. So any real optimizer has to be prepared to deal with larger queries with more joins. What do they do in such cases? The solutions vary. For example, PostgreSQL has a genetic optimizer (GEQO) built in which is used for very large queries, although it is not used for normal queries. Similarly, other databases have their own proprietary recipes for handling very large queries; they do not say much about what they do. So that is the fallback option. Now, back to you if you have a follow-up question.

One more question: can the optimization be improved if we improve the parsing and translation phase? The answer is not really, because parsing and translation is a fairly standard task; the optimizer takes whatever form the parser produces and does what it needs to do. There is not much choice in that phase, so the optimizer's work is not going to change as a result. Back to you if you have any follow-up question. Thank you, sir. Thank you.

Let us see if anybody else has a question. PSG Coimbatore has a question. PSG Coimbatore, over to you; we can see you, please go ahead with your question. It is about run stats: usually in a batch program, we run the run stats command at the end of the program. My question is how this has an impact, whether the optimizer uses these run stats, and how this is handled in online systems. Okay, I could not hear the first part of your question fully, but I think what you said is: run stats is required to gather statistics; how does the optimizer use it? Otherwise, please repeat the first part of your question. Back to you. Yes, sir; usually we are asked to run run stats at the end of all the batch updates and inserts, and the DBA tells us it will impact the optimizer. But my question is how this is used efficiently in online systems. Okay, thank you; that is a good question.

To explain the question for those who are not familiar with it: we have been talking about using statistics to optimize queries, and I have quietly assumed that somehow the database has the statistics. But how does the database compute them? It turns out that updating the statistics every time a relation has an update, insert, or delete is very expensive. So in a lot of databases, especially earlier on, and even today for most of them, the statistics are not updated automatically. The database administrator has to run a command, which in PostgreSQL is called analyze; Oracle, I think, has an analyze command as well, while in SQL Server it is update statistics and in DB2 it is runstats, the term used in the question. So there is a command which, when run, gathers statistics about one or all relations. In today's lab you will actually be doing that: you have an exercise where you load a large relation and then do something, and if you don't run the analyze command, the statistics will be wrong and a bad plan will get chosen. You can try that out today.
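For reference, the PostgreSQL side of what was just described looks roughly like this; ANALYZE and the pg_class catalog columns are real, and the relation name is from our running example.

```sql
-- After a bulk load, refresh the optimizer's statistics.
ANALYZE instructor;

-- Inspect what the planner now believes: reltuples and relpages are
-- the row-count and page-count estimates kept in the system catalog.
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'instructor';
```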
So the question, as I interpreted it, has two parts. First, what happens if you don't run stats? Then the statistics are wrong and a bad plan will be chosen. This happened to us long back, with an early version of SQL Server, 6.5. We had loaded data and forgotten to run this command. We ran a query, started it off, and then forgot about it; it was a machine doing nothing else. After a couple of days, somebody noticed that the disk-access light on that machine was on all the time. What's going on? It turned out the query we had started was still running after two days. So we cancelled the query and asked: it's not a big relation, why is it still running? It turned out the statistics were wrong: the system thought there was one tuple in each of the relations and decided nested-loops join was best. Guess what: each of the relations had a million tuples, and the nested-loops join was taking forever. So we recomputed the statistics, and then the query finished in a few minutes on those million tuples. That is a good example, and similar examples occur all over. So whenever you make significant changes to the data, for example loading a lot of data, you must run this command to update the statistics. If you are inserting one tuple at a time all day, then maybe the statistics are not going to change drastically in one minute or one hour, or maybe even in one day; but after a few days the statistics may be significantly different. So periodically, you have to run this command.

Now, obviously, it becomes a headache for a human to have to remember to do all of this. So more recently, systems such as SQL Server, and I believe recent versions of Oracle, have automated this task. What they do is keep track of how many updates have happened overall to each relation, and whenever the fraction of tuples updated in a relation reaches, say, 20 percent, they automatically recompute the statistics. In fact, they have other tricks up their sleeve. For example, when you run a query which does a count star on a relation, they get, for free, the actual number of tuples in that relation and can compare it with what is stored; so some systems will update the statistics automatically at that point, whenever a query happens to provide some statistics. So these days, with the commercial systems, there is less need for a programmer to do these things. Even now, though, immediately after you load a relation it may still have wrong statistics; what the system may do is notice, after running a query, that something went wrong with it, and later recompute the statistics. SQL Server used to do something funny in a previous release: while optimizing a query, it would look for statistics on a relation, and if they were not there, it would compute them on the spot. Then people found that query optimization sometimes ran very slowly, and some complained; so in the current version it runs with whatever statistics are there, but initiates a separate task to compute the required statistics in case the query is asked again. So there are a number of tricks used to deal with either out-of-date or missing statistics. Okay, back to you if you have a follow-up question. Thank you, sir.
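A side note for the PostgreSQL users in today's lab: recent PostgreSQL versions automate this as well, through the autovacuum daemon, which implements exactly this kind of fraction-of-the-table-changed policy. The settings below are real, though the defaults vary across versions.

```sql
-- Automatic re-ANALYZE triggers once roughly
-- (threshold + scale_factor * table size) tuples have changed.
SHOW autovacuum_analyze_scale_factor;   -- fraction changed, default ~0.1
SHOW autovacuum_analyze_threshold;      -- fixed base count, default ~50

-- The policy can also be tuned per table:
ALTER TABLE instructor SET (autovacuum_analyze_scale_factor = 0.2);
```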
I will take maybe one more question. Patel Shirpur, over to you. My question, sir, is: while executing a query evaluation plan, can we choose among the alternatives, for example more join options, or pipelined versus materialized evaluation? Can these options be set when executing queries in different ways? Okay, that is a good question. The question, at least as I understand it, is: if you give a query to PostgreSQL or any database, it chooses some plan; can you control this? Can you tell PostgreSQL to use this join algorithm or that one, or to use materialization at some point, and so on?

The answer is: you can control it to some extent, not one hundred percent. In PostgreSQL, for example, you can say, don't use nested-loops join, or don't use hash join. There are a few flags which PostgreSQL has; if you look up the manuals, they will tell you about the flags you can set to prevent it from using certain methods, which forces it to use some other join mechanism (see the sketch below). So that is some control. A database like Microsoft SQL Server gives you tremendous control: you can ask it for a plan, and it will give you the plan in XML notation; you can then modify it to suggest a completely different plan, give it back to the database, and tell it to use that plan. Now, the database is fairly clever about this. You might try tricking it into accessing things you are not authorized to view by handing it an alternative plan; so it first checks that the plan is something you are authorized to run and that it is equivalent to the query, and only then uses the plan you suggest. So you can control this in great detail in SQL Server. DB2, I think, gives you no control at all, or maybe very little. Oracle gives you a fair amount of control: you can annotate a query with hints, saying, for this query, use hash join, and so on. So you can play around with it.

Generally speaking, though, it is not required. The only reason to do it is if you have a query which is running very slowly and you want to figure out whether there is something you can do to make it run faster. But that is very rare: most of the time, a query runs slowly not because the optimizer did something silly, but either because you forgot to update statistics, or because it needs an index which you should have created. So index tuning is usually a far more effective way of dealing with slow queries, and occasionally creating materialized views; you cannot do that in PostgreSQL, but in other databases it is an option, especially useful for aggregate queries. So, I hope that answered your question. Back to you if you have a follow-up. Thank you. With that, we will break for tea. So, thank you.
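For reference in the lab, the PostgreSQL planner flags mentioned in that last answer look like this. The settings are real; note that turning one off only strongly discourages the planner, rather than forbidding the method when no alternative exists.

```sql
-- Discourage specific join methods for the current session, steering
-- the planner toward the remaining alternatives.
SET enable_nestloop = off;
SET enable_hashjoin = off;   -- a merge join now becomes likely

-- See which plan is chosen under these constraints.
EXPLAIN
SELECT name
FROM instructor, teaches
WHERE instructor.ID = teaches.ID
  AND teaches.year = 2000;

-- Restore the defaults afterwards.
RESET enable_nestloop;
RESET enable_hashjoin;
```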