So now we have a very small amount of time left for query processing. There are several issues here, and for lack of time I am not going to cover all of them. The first part is how you measure the cost of a query. The next part is how you implement the various operations and estimate their costs. And the last part is, given an expression with multiple operations, how do you evaluate the whole expression? I am going to sample bits of each of these and skip other parts of each topic.

You know the overall process of query execution: you submit a query, a parser translates it into some internal representation, which could be relational algebra, an extension of relational algebra, or some other form. That is taken up by an optimizer, which comes up with an execution plan and hands it to an evaluation engine, which evaluates it. Now, how does the optimizer decide which plan to choose? It has statistics about the data, which it uses to estimate the cost of different plans; it cannot actually run them, that would be far too slow. It estimates the cost of the alternative plans, picks the one with the lowest estimated cost, and hands that plan to the evaluation engine. That is the basic architecture of all query processing engines.

Now, an important part of this is the cost of a plan: we estimate costs and choose a plan accordingly. Why is this important? It turns out that for a given query, a good plan may execute in a fraction of a second, while a bad plan may take hours or days. This actually happened to us at IIT at one point. We had a query for which, for reasons I will explain, the database chose a wrong plan. The reason is that it had wrong statistics. Why did it have wrong statistics? It was an old version of a database system, about ten years ago, which did not update its statistics after data was inserted, so its statistics were stale. On the basis of wrong statistics it made wrong estimates of the time each plan would take, and it chose a bad plan. That plan actually ran for a few days before we realized it: somebody had started the query and gone away, and we did not use that machine for some time after that. Then we came back and saw that the disk light on that machine was lit continuously. What was this machine doing? I thought it was idle, that nothing was happening on it. So we looked, and we realized the query had started two days earlier. We stopped the query, told the database to update its statistics, and ran exactly the same query again. This time it had the right statistics, it chose the correct plan, and that plan finished in, I think, one minute. That is the kind of difference between a good plan and a bad plan, so the optimizer is absolutely critical in choosing a good plan.

Optimization was also critical in the days before SQL was used. Prior to SQL, who did the optimization? The programmer. The programmer actually had to decide how to execute a query. When SQL systems were first built, the prototypes did not do a great job. Then there was a key piece of work from IBM Research which showed how to build an optimizer for SQL, and that was a seminal paper. It marked a huge change in the way people looked at SQL. Before that, people said SQL is a toy, it's not real.
After that, people could see that this optimizer could not only choose a good plan, it often beat expert programmers and chose a better plan than they would have chosen. That is when relational databases actually took off; before that, they were a toy. So it really is that important.

Now, coming back: how do you estimate the cost? There are many factors that contribute to it. Disk accesses, CPU time, and network communication in a parallel or distributed database system are all components of the cost. As a user, you would say cost is simple: I submit a query, how long does it take to return? But that time is made up of many different parts. It goes into disk I/O, it goes into CPU cost, and it is often hard to estimate the actual elapsed time, because sometimes you can run multiple I/Os in parallel when they happen to be on different disks, and the database does not even know how many disks there are in the RAID system. The elapsed time also depends on what is already in the database buffer and what is not, so it is very, very hard to estimate. Therefore, instead of elapsed time, databases use a measure of query cost that accounts for how many disk I/Os are potentially required in the worst case and how much CPU effort will potentially be required, and they add up these costs to come up with a cost estimate. The disk cost includes the number of seeks and the number of blocks transferred. Why do we differentiate between these? Because a seek is very slow; you can read blocks much faster than you can seek. So optimizers usually keep these measures separate. If you use a flash disk, you would use the same cost model, but you would simply replace the average seek cost of around 10 milliseconds with 1 or 2 microseconds, or whatever it is for that flash device. The cost model is the same; the constants change depending on the device you are using.

I am going to skip some more details of query cost and mention a little bit about algorithms. First of all, suppose I have a selection operation, that is, a where clause. One option is to just scan the whole file and find which records satisfy it; that has a particular cost, and that is linear search. Another option is to use an index. How do you decide whether to use an index or not? You might say, if you have a selection and an index is available, why not use the index? It should be a good idea. If I have a query that says, give me the student with roll number x, y, z, and there is an index on roll number, the index is a good idea. On the other hand, suppose I take the takes relation and say, give me everyone who has registered for CS 101, assuming there is an index on course ID. How many people have registered for CS 101? A huge number. Professor Fartek is teaching that course today for 500 people, and he taught it last year for 500 people, actually more, 1,000 people. If you count how many have taken it, it is pretty much everyone who joins IIT, and the equivalent in your colleges. You would have to retrieve so many records that the index actually becomes a headache: if you use the index to find record IDs and then access the records one at a time, you may end up with a huge amount of random I/O, and the net effect is much slower than if you had simply scanned the whole relation. This is the job of the optimizer. Looking at the query, you might say, there is an index on course ID, I will use it; but it turns out to be slower.
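To make that trade-off concrete, here is a minimal back-of-the-envelope sketch in Python. It follows the standard textbook cost model of seeks plus block transfers; all of the constants and relation sizes below are made-up illustrative assumptions, not figures from the lecture.

```python
# Rough cost model: cost = (#seeks) * t_seek + (#blocks read) * t_transfer.
# All numbers are illustrative assumptions only.

t_seek = 0.010       # assume 10 ms per seek on a magnetic disk
t_transfer = 0.0001  # assume 0.1 ms to transfer one block

def linear_scan_cost(num_blocks):
    """One initial seek, then read every block of the relation sequentially."""
    return 1 * t_seek + num_blocks * t_transfer

def index_scan_cost(matching_records, index_height):
    """Traverse the index, then fetch each matching record; in the worst
    case every fetch is a random I/O (one seek plus one block transfer)."""
    index_io = index_height * (t_seek + t_transfer)
    fetches = matching_records * (t_seek + t_transfer)
    return index_io + fetches

takes_blocks = 10_000  # assumed size of the takes relation, in blocks
print(linear_scan_cost(takes_blocks))           # ~1.0 s: just scan everything
print(index_scan_cost(5, index_height=3))       # roll-number lookup: index wins easily
print(index_scan_cost(50_000, index_height=3))  # "everyone who took CS 101": ~505 s, far worse than a scan
```

With a highly selective predicate the index wins by a wide margin; with a non-selective one, the random I/O per matching record makes it far slower than simply scanning the relation, which is exactly the situation described above.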
So the optimizer takes care of this. An index scan is an option, and a very good option for many queries; for some queries it is not a good idea, even if it is applicable. I am going to skip the details of the cost computation for indices and move on to the other operations.

So we have seen selection, done primarily using an index or a scan; those are the simplest operations. What is the next most important operation in an SQL query? It is probably the join. There are other operations too, such as group by. How do you evaluate these? In particular, how do you evaluate a join when the relations are much bigger than main memory? You cannot do arbitrary things; you have to make sure the amount of I/O you do is limited. It turns out that sorting is a key technique for implementing a large number of operations, including join. Think of it this way: I have two relations and I want to join them on some common attribute, that is, match records that have the same value for that attribute. If I sort each relation on that attribute, the relations may be huge, but I can read them sequentially, in sync. You are familiar with the merging process used in sorting; this is very similar. I read the two relations in sync, and whenever a pair of records matches, I know they join, I output a join result, and I continue moving the pointers. There are some details with duplicate values, which can easily be dealt with. So the key intuition is: if the relations are sorted, it is very easy to join them. Sorted on what? On the join attribute. If they are sorted on something else, it is useless; if they are sorted on the join attribute, I can join them. So the key step is how to sort. I can join huge relations if they are sorted; the question is, how do I sort huge relations? That is external sorting.

External sort-merge is one of the sort algorithms applicable to relations much larger than memory. I won't get into all the details, but let me show it through an example. Here is a very large relation on the left, and let's suppose my memory can hold only three records. Of course, realistically this relation is going to be millions of times bigger, and so will memory, but for the example, assume memory holds only three records. I read the first three records into memory, sort them, and write them out; that is a file on disk. Next I read the next three records into memory, sort them, and write them out; that is another file on disk. I continue this until I have a number of sorted files, or runs, each of which holds one memory-full of data. That is the first step towards sorting. The next step is to merge these sorted runs into larger sorted runs. Ideally, I could merge all of them in one go into one file, but there are constraints on how many I can merge together. In particular, if I have a million such runs, merging all of them in one go is a million-way merge, and the bottleneck is that to read them for merging I need to allocate at least one memory block for each run. If memory does not have a million blocks, I cannot read a million runs in sync, so I cannot do a million-way merge. That is the limitation, so I may be forced to merge only a few of them at a time. In this example, I am assuming I can merge only two runs at a time.
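Purely as an illustration, and not the exact implementation any engine uses, here is a minimal Python sketch of both phases, with in-memory lists standing in for files on disk. The parameter memory_size plays the role of the three-record memory, and fan_in is the limit on how many runs can be merged at once; the merge passes it performs are exactly what we walk through next.

```python
import heapq

def external_sort(records, key, memory_size=3, fan_in=2):
    """Sketch of external sort-merge; lists stand in for on-disk files."""
    # Phase 1: run generation - read memory_size records at a time,
    # sort them in memory, and "write out" each sorted run.
    runs = [sorted(records[i:i + memory_size], key=key)
            for i in range(0, len(records), memory_size)]
    # Phase 2: repeatedly merge up to fan_in runs at a time
    # until a single sorted run remains.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in], key=key))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []
```

Calling, say, external_sort(takes, key=lambda t: t["course_id"], memory_size=3, fan_in=2) would reproduce the three-record, two-way-merge example (the relation and attribute names are hypothetical).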
So I merge the first two runs and output a file which, since I merged two runs, is double the size. Similarly, I merge the next two, which again gives a run of double the size. Now I can merge these two larger runs, allocating one block for reading each of them and one block for the output, and I get the final merged output. Of course, in reality I am not going to do a two-way merge; memory has many blocks, so the merges themselves are going to be quite wide, merging 1,000 or 10,000 runs in one go. That, in a nutshell, is external sorting. Of course, since it is such a core operation, there are a lot of optimizations; people have put a great deal of effort into fast sorting, and there are benchmarks for it. At one point the benchmark was a gigabyte sort: how long does it take to sort a relation that is 1 gigabyte in size? This was a nice benchmark back when memories were 10 megabytes. After some time the whole thing fit in memory and it was no longer an external sort problem but an internal one. So they flipped it around to a minute sort: how large a relation can you sort in one minute with a certain amount of memory? The current sort benchmarks look at this kind of thing, how big a relation you can sort in a fixed amount of time, and there are some truly impressive numbers. Those implementations are highly optimized.

Now I will skip the cost analysis and come to the join operation. There are many ways of doing a join. One option is the merge join, which I briefly talked about. But there are also much simpler, though potentially very inefficient, methods. There is the nested loops join, which actually matches every pair of records, and there is the hash join. I don't have time before lunch to cover all of these, so I will just mention the nested loops join and the index nested loops join; merge join I have already covered, and I will skip hash join.

The nested loops join is very simple. It is like a nested for loop: for each tuple in this relation, for each tuple in that relation, check whether the pair of tuples satisfies the join condition. It turns out you sometimes need this, because the join condition can be a complex condition. If the join condition is attribute A equal to attribute B, I can sort one relation on A, sort the other on B, and merge them; that is easy. But if the join condition is some arbitrary thing, say 2A plus 5B equals 29, or some function of A and B equals some value, and I do not understand what that function is, how do I know which tuples here can potentially match which tuples there? I don't. So the fallback option is to compare every tuple here with every tuple there, evaluate the condition, and output the pair if it is satisfied. That is the nested loops join. Typically you do not want to do this, because the cost is quadratic. Occasionally people write queries where, without realizing it, they force the optimizer to use a nested loops join. We will actually see an example in today's lab where a query, written unintentionally in such a way, requires a nested loops join and becomes very slow; there are ways around it, which we will also see in today's lab.

The index nested loops join is essentially the same thing, except that for each tuple in R, instead of going over every tuple in S, I use an index to find the tuples in S that could match this tuple in R. What should the index be on? The join column of S; then I can probe it.
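Here is a minimal sketch of the two variants just described, assuming tuples are Python dicts and using an ordinary dictionary as a stand-in for an index on the join column of S; it is an illustration of the idea, not a real engine's implementation.

```python
from collections import defaultdict

def nested_loops_join(R, S, theta):
    """Fallback join: compare every pair of tuples. Works for any join
    condition theta, but the cost is |R| * |S|."""
    return [(r, s) for r in R for s in S if theta(r, s)]

def index_nested_loops_join(R, S, r_key, s_key):
    """Equi-join: build (or reuse) an index on the join column of S,
    then probe it once per tuple of R instead of scanning all of S."""
    index = defaultdict(list)  # stand-in for a B+-tree or hash index on S
    for s in S:
        index[s_key(s)].append(s)
    return [(r, s) for r in R for s in index.get(r_key(r), [])]
```

For an equi-join of, say, instructor and teaches on a hypothetical ID attribute, you would pass r_key=lambda r: r["ID"] and s_key=lambda s: s["ID"]; for an arbitrary condition like 2A + 5B = 29 you would fall back to nested_loops_join with theta=lambda r, s: 2*r["A"] + 5*s["B"] == 29.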
So when the number of tuples in R is small, an index nested loops join is easily the best way of doing the join, because I do not even look at all of S: if I have an index on the join attribute of S and R is small, it is easily the best choice. But if R is very big, it may not be; a merge join may be cheaper even after adding the cost of sorting. It is the job of the optimizer to pick among these. I am going to skip hash join for lack of time.

There are other operations, such as duplicate elimination and aggregation, all of which can be done using sorting. Let me spend a minute on aggregation. Suppose I have a query that groups instructors by department and finds the number of instructors in each, and the number of instructor records is very large. How do I run this? A simple trick is to sort the instructor relation on department. What happens? All the instructors of a department become consecutive, one after another. Now I read the first record; it is for some department. I keep reading the remaining records for that department, which are all together, and keep counting them. The moment I find an instructor of another department, I know I am done with the first department, so I output that department with its count, start counting for the next department, and carry on. However large the relation is, if I sort it on the grouping attribute using external sort, the actual operation of computing the aggregate with grouping needs only a very small amount of memory. So sorting comes to the rescue again: however large the relation, grouping and aggregation can be done very efficiently after sorting. It turns out hashing can also be used for aggregation; for lack of time, I won't get into it. Similarly, intersection and union can all be done using sorting or hashing, and outer join, another operation, can be done using variations of merge join or hash join; I am going to skip the details.
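As a small sketch of that group-by-and-count idea, assuming the records have already been sorted on the grouping attribute (for instance by the external_sort sketch above) and assuming a hypothetical dept_name attribute:

```python
def count_per_group(sorted_records, group_key):
    """One pass over records sorted on the grouping attribute:
    emit (group, count) each time the group value changes."""
    current, count = None, 0
    for rec in sorted_records:
        g = group_key(rec)
        if g != current:
            if current is not None:
                yield current, count   # finished one department
            current, count = g, 0
        count += 1
    if current is not None:
        yield current, count           # last department

# e.g. count_per_group(external_sort(instructor, key=lambda r: r["dept_name"]),
#                      lambda r: r["dept_name"])
```

Note that only the current group's name and running count are kept in memory, which is why the grouping step itself needs almost no memory once the input is sorted.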
That brings us to the last topic in this chapter: how do you evaluate an expression? Look at this example; it is a very simple expression. It runs a selection on department, joins the result with instructor, and then projects on name, so there are three operations. How do I evaluate this query? One simple way is to start from the bottom of the tree. Here is a select operation: I run the selection on department, get the result, and store it as a relation. Next I move one level up the tree to the join. The select result has already been computed and materialized, which is why this is called materialized evaluation, so I can join it with instructor and write that join result out to disk; now the join result has been materialized too. Finally I read it back, project on name, and get the output. Now it should be very obvious that I am unnecessarily writing things to disk and reading them back in. Maybe I could have taken the selection result and immediately joined with it, depending on the join algorithm used. And the output of the join, why on earth did I need to write it to disk? I could have passed it straight to the projection operation. That idea leads to what is called pipelined evaluation, where, as far as possible, instead of writing intermediate records to disk and reading them back in, each operator passes them on to the next operator in the tree.

Pipelining itself can be done in two ways, demand driven and producer driven, and both are used. What is the difference? Demand-driven evaluation is also called lazy evaluation. You start at the top of the tree: the top operator tells its child operator, give me a tuple. That operator has to somehow compute a tuple and pass it back up. How does it compute a tuple? It may have to get a tuple from each of its own children. Suppose it is doing a merge join: it may have to sort its two inputs and then read one tuple at a time from each, merge them, and if it gets a result, pass it back up. How does the child get a tuple? It may have to scan through its relation one record at a time, check whether the record satisfies its condition, and pass it up. So each operator is lazy: when asked for a tuple, it does the minimum amount of work needed to generate one tuple and passes it up. An operator in between, say a sort, may keep telling its child, give me a tuple, give me a tuple, sucking in tuples one at a time, and only then do the sort. This scheme, where an operator generates a tuple only when required, is called lazy or demand-driven evaluation, and it is very widely used; PostgreSQL uses this model.

In contrast, there is a model called producer-driven evaluation, where the bottom operator keeps reading tuples from the relation, using an index or a scan or whatever, and pushes whatever tuples it gets into a buffer. Between each pair of operators there is a buffer, and each operator eagerly pushes tuples into its output buffer. The join operator finds tuples in its input buffer and can join them with instructor, and it in turn pushes whatever it can compute, as soon as it computes it, into its own output buffer. The operator above looks at its buffer, and as soon as it gets some tuples it uses them, does whatever it can with them, and outputs the results. So results are pushed up from below. It is also called eager evaluation, because these operators do not care who needs their results: they eagerly pull in their input and eagerly generate output. That is producer-driven evaluation. Most databases use demand driven as the default. However, when doing parallel query processing they sometimes use a combination: each site may work eagerly, finish its work, and then pass the result to the next site. But for a single-site system, it is typically demand driven.

Let me repeat that question. There are two parts to it, as I see it. Pipelining lets you avoid materializing an intermediate result; but can you always pipeline? The answer is no. Some operators cannot output anything until they have consumed all of their input, and sort is one such operator. There are actually some tricks where the sort operator is broken into two operators, one of which generates the intermediate runs and the other of which does the merge; then you can treat it as two operators, one of which has to complete before the other starts. So you cannot pipeline all the way through: there are breaking points. The overall tree is broken into parts, where within one part tuples may be pipelined up to the root of that part, but that root has to materialize its result and store it before the next part can even start.
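As a rough illustration of the demand-driven (pull) model, and not of any particular engine's API, here is a tiny pipeline built from Python generators: each operator produces a tuple only when the operator above it asks for one. The relation contents and attribute names are made up.

```python
def scan(relation):
    """Leaf operator: produce stored tuples one at a time, on demand."""
    for rec in relation:
        yield rec

def select(child, predicate):
    """Pull tuples from the child; pass up only those that qualify."""
    for rec in child:
        if predicate(rec):
            yield rec

def project(child, attrs):
    """Pull tuples from the child; keep only the requested attributes."""
    for rec in child:
        yield {a: rec[a] for a in attrs}

# Hypothetical data; nothing runs until the top of the pipeline is consumed.
department = [{"dept_name": "Physics", "building": "Watson"},
              {"dept_name": "Music", "building": "Packard"}]
plan = project(select(scan(department), lambda r: r["building"] == "Watson"),
               ["dept_name"])
print(list(plan))  # the final consumer "demands" tuples, driving the whole tree
```

A sort operator in the middle of such a pipeline would have to drain its child completely before yielding anything, which is exactly the pipeline-breaking behaviour just described.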
So materialization may be unavoidable for many query plans. But where pipelining can be used, it saves a lot of writing to disk and reading back in, which is a huge amount of effort, so it is used wherever possible.

Now, the second part of the question is that there may be situations where many queries use a common result; wouldn't it then make sense to write it once to disk and share it across multiple queries? That is actually a very good question. Most databases do not do this, but there are some situations where they do, and there has been a lot of research on this topic, which is called multi-query optimization. The point is that you know there are multiple queries and you want to optimize them together. Why would you get multiple queries at once? If you only get one query at a time, you cannot do much; but in some situations a whole batch of queries arrives together and all of them have to be evaluated, and then you can exploit their common parts. The optimizer's job is then to decide whether a particular result is worth materializing and sharing, or whether it should just be recomputed and pipelined to each of its uses. That is the job of the multi-query optimizer, and it is an interesting research area, one that IITB has done a fair amount of work on; one of my PhD students did some very nice work on it. But it is an advanced topic, so we are not going to cover it here.

Any other questions? That is basically it; there are a few more details on the API used to implement pipelining, which I am going to skip so that we can have lunch and come back for our next session.