OK, let's start. So welcome to the people who actually braved the weather and came here, and welcome to everyone who watches this lecture streaming. So today, the plan is to finish query execution, which we started last time, and then to also cover query optimization, which is a pretty rich topic with a lot of material to cover. Next week will be essentially the last lecture before the final, and we will discuss parallel databases and Bloom filters. So I know we are a little bit behind grading homework 3 and 4. I think Parham is working on them. He's probably taking off during the holidays, but I will make sure that he posts the grades soon, sometime early next week. And Jessica, I know, is already grading homework 5, so that should be posted pretty soon. Somebody asked me to consider cancelling homework 7. I don't like to do this. Homework 7 is actually easy, and it exposes you to some very interesting technology, which is parallel databases and the MapReduce framework. It also exposes you to using web services — you will use Amazon Web Services. So I think that the amount of things you learn from this homework is worth your time. And I will find some way in which you can do this homework Monday, Tuesday, Wednesday evening after the final.

OK, so let's start. Where are we? Last time we discussed physical operators. When a query is to be evaluated, the first task is to transform this query from a declarative syntax, from SQL, into a sequence of logical operators. What are examples of operators? Hey, I have an audience of five people, so it's a huge audience. So what are examples of operators? Give me an example of an operator. Join. Join — that's the most important operator. Union, difference, selection — these are logical operators. A physical operator means a particular implementation of a logical operator. And what we started to do last time was to discuss alternative physical implementations of operators. Now, one thing we said there is that in order for these operators in the tree to talk to each other, they need to expose a particular interface. And here is the interface: every operator has an open method, a getNext method that returns the next tuple, and a close method. The implementation of the getNext method is such that every time it is called, the operator works on its task. If it's a join, then it starts working on the join. As it works, it may need to call getNext on some of its children operators, on some of its sub-expressions. And when it finishes computing at least one tuple, then it returns that single tuple. And we had an example of a main memory hash join and an implementation of its getNext method. The other thing that we covered last time were three specific algorithms for a main memory join.
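To make the interface concrete, here is a minimal Python sketch of the open / getNext / close iterator model described above. The class and method names, and the list-backed scan, are illustrative assumptions on my part, not code from the lecture or from any real system.

```python
class Scan:
    """Leaf operator: iterates over an in-memory list standing in for a stored table."""
    def __init__(self, rows):
        self.rows = rows
    def open(self):
        self.pos = 0
    def get_next(self):
        if self.pos >= len(self.rows):
            return None              # signal that the input is exhausted
        row = self.rows[self.pos]
        self.pos += 1
        return row                   # return exactly one tuple per call
    def close(self):
        self.rows = None

class Select:
    """Selection: pulls tuples from its child and returns only those matching the predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self):
        self.child.open()
    def get_next(self):
        while True:
            t = self.child.get_next()
            if t is None or self.predicate(t):
                return t             # either a qualifying tuple or None (exhausted)
    def close(self):
        self.child.close()

# usage: plan = Select(Scan([{"year": 1995}, {"year": 1905}]), lambda t: t["year"] == 1995)
#        plan.open(); print(plan.get_next()); plan.close()
```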
And then we briefly mentioned some algorithms for selection, for projection, for group-by. But they were very simple, because these were main memory algorithms. So joins — let's briefly review them. We discussed hash join. What is the complexity, the asymptotic complexity, of the main memory hash join, if the tables have size m and n? m plus n? m plus n. So this is O of m plus n. Merge join: here we sort the two tables, and then we compute the join by merging them. What is the asymptotic complexity here? m log m plus n log n. m log m plus n log n. And the simplest one was the nested loop join. What is the complexity here? m times n. m times n. For selection, projection, and group-by, you can figure out the complexities. For selection, it's linear. And for group-by, it depends on which particular algorithm you choose.

OK, so these were main memory algorithms. Now, of course, the interesting part in databases is when the data is too big to fit in main memory. And in that case, the algorithms need to be redesigned in order to take into account this huge latency gap between main memory and disk. So as we design external memory algorithms, we will estimate their cost in a different way. This is slightly different from what you know from the data structures and algorithms class. Instead of computing the asymptotic running time, we will compute the exact number of disk IOs. This is how we evaluate an algorithm. Now, in practice, we also need to worry about the CPU time, I mean the processor time. But for the lecture — and this is what the book does, and all books do the same — we assume that these algorithms are efficient enough, and that essentially the bottleneck is only in the disk IO.

So here are the parameters — and I go slowly over them — that dictate the IO complexity of every physical operator that we will discuss. B(R) denotes the number of blocks, the number of disk blocks, used to store the relation R. T(R) denotes the number of tuples of the relation R. And we assume — I just need to move my mic — we always assume that there are more tuples than blocks. And this is because we assume that the records are smaller than a block: at least one record fits per block. So in general, the good assumption is that the number of blocks is much smaller than the number of tuples. Now, the other parameter is a little bit more mysterious. V(R, A) represents the number of distinct values of the attribute A in R. And we will see where we use this. And an important parameter is how much main memory we have. This is M, and it's always measured in blocks, such that you can compare B to M. And if B is less than M, it means that the relation fits in main memory.

OK, so let's see how we analyze such an algorithm. But let me mention a convention that textbooks always make. It's an ad hoc convention. As we compute the IO cost of an algorithm, we will ignore the last write. So we count the cost to read, and any cost to write intermediate results and read them back, but we will not count the cost of the final write. And the reason is that a final write might not be needed. Maybe the result is pipelined. Maybe it fits in main memory. For whatever reason, we will not include that in the cost. If it turns out that we need to write the result back to disk, then of course we have to add that cost separately. But in the analysis that we do for each physical operator, we will not take into account the final write. So for example, all the main memory join algorithms have this cost: it's B(R) plus B(S). So why is that?
Well, because that's all we do to them. We just read them, and we do stuff in main memory, and we do not charge for what we do in main memory. Similarly, for a group-by, the cost is just the cost of reading the relation.

OK, let's see another simple algorithm: a sequential scan of the table. As you see, it already gets complicated. If the table is clustered, which is what you would normally expect, then the cost is the number of blocks. Sometimes database systems — especially Oracle, I know for sure Oracle does this, I'm not sure about the others — use unclustered relations. Unclustered means that its records are interspersed, are mixed, with the records of some other relations. So when you read a block trying to get to the records of that relation, you might get a single tuple from that relation. In that case, the cost is closer to the number of tuples than to the number of tuples divided by how many records you can fit in one block. So sometimes we will mention unclustered relations, but the typical case is when the relation is clustered.

OK, so let's see the first slightly more interesting algorithm. It starts very naive, but then you see how it gets more interesting. It's the nested loop join. Nobody would compute a join like this, but it's worthwhile to contemplate it. This is a nested loop, but now we need to compute the IO cost. So how many disk IOs do we need to do in order to compute the join naively? Well, the outer loop iterates T(R) times. And for each record, according to this particular implementation, we need to read the inner relation entirely, so it's times B(S). It's a huge cost.

OK, but now let's see how we can still do a nested loop join in a more clever way. For example, suppose we have two tables: this one has 1,000 blocks and 10,000 records, and this one has two blocks and 20 records. If we computed the join naively with a naive nested loop join, then the cost would be 10,000, which is the number of tuples, times 2, which is 20,000 disk IOs. It's huge. Now look at the statistics here and tell me, how could you compute this join still using nested loops, but with more clever nested loops? Any ideas? I'll give you a hint. I also include this information: the size of the main memory. How many blocks can we fit in main memory? Four blocks. So this gives you a hint, which is that the smaller table S can fit in main memory, OK? So we read it into main memory, and then we do what? Keep the smaller table in memory while reading the other table. Yeah, keep the smaller table in main memory and iterate over its tuples while reading the larger table. And this gives us? 1,002 IOs. It's 2 for the small table plus 1,000, which is 1,002. Much better.

But here we were lucky, right? Because the small table fit in main memory. Look at the second example. Same amount of main memory, but now the smaller table is slightly bigger. What can you do? I can still get S into main memory. Yeah, you would like to get S into main memory. But here is the reason why you can't get all of it into main memory: you also need one buffer to read the other table, and if you have to write, you need another buffer for writing. So actually, you can only fit two of its blocks in main memory. So what can we do now? We can do the smaller table one half at a time. Exactly. We split the smaller table in two and do it half at a time. What's the cost? Well, the small table needs to be read one way or another. What do we do about the other table? We read it twice. It has to be read twice, exactly.
So it's 4 plus two times 1,000. So now the cost is 2,004. And I should have mentioned that this table does not fit in main memory because of the extra two blocks needed for buffering.

So that's the idea behind the block-based nested loop, or block nested loop, join. We iterate over one table, and we read into main memory as much as we can — up to M minus 2 blocks. And once we have this huge chunk of one table in main memory, we iterate over the other table and we do the join. So that's the iteration over the other table. And then inside here, this might be replaced with a hash join. It's actually a better idea to replace this with a main memory hash join. It will not affect the only cost that we analyze, the number of disk IOs, but it will affect the CPU time, which is still non-negligible.

So that is the block-based nested loop. Let's analyze it. I actually have a nice picture. It's worth looking at this picture, because later, when we discuss partitioned hash joins, it will come in handy. So what this suggests is that we read S into main memory — I mean, a large chunk of S, M minus 2 pages. And then we read one page at a time from R, one block at a time, and we do our join. And as we want to write, we need one other page for writing. And this explains why we can only use M minus 2 pages for S. So now let's analyze this. We have seen the example already. The analysis is pretty straightforward. We need to read S completely. And then we need to read B(R) how many times? Well, this many times: B(S) divided by M minus 2, because this is the number of fragments of S that we need to read. So we multiply this by B(R). Good.

Question. If you have a choice — you look at the two relations, R and S, and you can swap them — which one do you want S to be? Exactly. S has to be the smaller one. And you saw this from our examples: the small one, maybe you're lucky and you can fit it in main memory. This formula tells you exactly why: because, well, you read B(S) once at the beginning, and B(S) also determines the number of fragments.

OK, so what is S called? The terminology is confusing. Books don't always use the same terminology, and I'm also at fault, I'm also not very consistent. Usually, the correct way is to call the outer table the table that can be streamed, that can be pipelined. Which of these two tables can be pipelined? Which one's tuples can be streamed directly to the output? S? Well — R or S? Because that's the one I want to call the outer table. So which one? R. So here R should be the outer table, and S would be the inner table. I know at least one book that calls them this way. And I said, no, I like to call S the outer table because it's in the outer loop. But I realize the correct terminology is based on the streaming. So always check. In the case of hash join algorithms, the terminology is clear — I'll tell you which one is inner and which is outer. But in the case of nested loop joins, there is sometimes confusion.

OK, and actually, this reminds me that I had a question for you for today: how to do a double-pipelined hash join? Why don't we have this discussion now for a little bit. So remember the main memory hash join algorithm. We want to join R and S. Here is our main memory. We take S and we create a hash table. And then we read the tuples from R and we probe. So there is a build phase, and then there is a probe phase, where we probe with tuples from R. Which of the two relations is streaming? Which can be pipelined directly to the next operator? R. So this one is pipelined.
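Here is a minimal Python sketch of the main memory hash join just recalled — build a hash table on S, then probe it with tuples from R, streaming out matches. Tuples are dicts and the join is on a single attribute; those simplifications are mine, not from the lecture.

```python
def hash_join(R, S, attr):
    # build phase: hash table on the inner relation S, keyed by the join attribute
    table = {}
    for s in S:
        table.setdefault(s[attr], []).append(s)
    # probe phase: R is read one tuple at a time and matches are emitted immediately,
    # which is why R can be pipelined (streamed) to the next operator
    for r in R:
        for s in table.get(r[attr], []):
            yield {**r, **s}

# usage: list(hash_join([{"b": 1, "x": 10}], [{"b": 1, "y": 20}], "b"))
```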
And that one, R, is always called the outer relation. In the case of the hash join algorithm, there is no confusion: R is always the outer and S is the inner.

OK, but sometimes — not often, but sometimes — you would like to have a double-pipelined join. You would like to be able to stream tuples coming both from R and from S. Why would you do this? Why would you ever care about streaming tuples coming both from R and from S? Both tables are rather large? You're going to end up with a lot of tuples from R and S — say you get 1,000 from each, and you only have 10 blocks available? Well, if you have too little memory, then I can't stream them; we need to have enough memory in order to be able to join them and stream them. So I will assume that we have enough memory. But it might still be useful to stream both R and S. For example, they might arrive at different rates, especially in data integration. Maybe they arrive over the network, and they arrive at different rates. And you would like to return tuples as soon as possible, to return answers as soon as possible.

So how do you design a hash join algorithm that is double-pipelined? A double-pipelined hash join. It's not that hard. We assume we have sufficient memory to do our little trick. Well, suppose the first tuples that arrive are from S. What would you do with them? We create a hash table. But now we start getting tuples from R. What do you do with them? Well, you probe them first. And if they match something in S, you can immediately stream them to the next operator. But you also create a hash table on R and keep them there. Now new tuples from S arrive. What do you do with these tuples? Probe them against R. And the result you can immediately stream — so it's indeed double-pipelined. But you also insert them into the hash table on S for future use, because new tuples from R might join with those tuples. This is called the double-pipelined hash join. And people writing query optimizers, they love the double-pipelined join, because they can pipeline from both directions. And it's a much, much better choice.

Good. So let's continue our discussion of join algorithms. So far we saw only the most naive external memory join algorithm, which is the nested loop. Let's see the next one, which is kind of the most important one: the index-based join algorithm. And for that, I will start by discussing index-based selection, which is like the building block for an index-based join. So index-based selection — how does it work? It does the obvious thing. If we have an index on the attribute being selected, then we just retrieve the matching records. For example, suppose we use an index to answer this query. What is the cost for this query if we use an index? And I will assume — and this is an assumption that we will make throughout the discussion of physical operators — that there is no cost to read the index. You can assume that the index happens to be in main memory. Otherwise, you need to charge for those two or three disk accesses that it takes to read the B+ tree. So if the index is already in main memory, what's the cost of finding the movie with ID 1234? It's just one, because that's a key. Here it gets more interesting. What is the cost of a selection on Movie which is not on the primary key? So we need to find out how many records we return. And if the index is unclustered, then each such record will probably come from a different block, and then that's the cost. But how many records do we get? Possibly all of them.
But remember, we had that statistic called V, the number of distinct values. This is where it comes in handy. V(Movie, year) — let me write it here, V(Movie, year), I'm going to abbreviate — gives us the number of distinct values of the year attribute, probably around 150 or so for the real movie database, because that's how many years it might have. Now, how do we use this in a formula to compute the number of tuples returned by the selection query? Well, we take the total number of tuples, T(Movie), and then we divide it by the number of distinct values of the year attribute. And that is the cost. This assumes that the index is unclustered and that every such read is from a different block. What if the index were clustered? What if we had a clustered index on the year attribute? Then it would be B(Movie) divided by V(Movie, year). That is the difference. So let's see this on the next slide. If the index is clustered, then the cost is equal to — I forgot the equals sign here — the cost is equal to B(R) / V(R, a). And if it is unclustered, then it is T(R) / V(R, a).

Good, so that's selection. Now let's finally get to — I'm going to skip this — let's get to the index-based join. And now it should be easy. An index-based join is like a nested loop join, where the inner table — inner in the sense of the inner loop — is accessed through the index. There is an index-based access for the inner table. So here it is. For each tuple in the outer relation, access the corresponding tuple or tuples in S using the index. And the cost is straightforward. Well, we need to read R — don't forget about that component. And then it's exactly the previous cost, either B(S)/V(S, a) or T(S)/V(S, a), times the number of tuples that come from R. So I think I have an example here. No, I don't have an example. So let me actually — no, I think we can have this discussion on the slide. This is the most commonly used join algorithm for OLTP queries. In OLTP queries, you have selections either on a key or on some very selective predicate: either find information about the account with this key, or find the information about the account belonging to this customer. So one selection is on the customer name or on the account number. And then you might need to join this one record, or this small number of records, with extra information from some other relations. And this is where the index-based join is the most efficient join algorithm.

OK, so what remains to discuss are joins that are really huge. They don't occur in OLTP queries, but they occur quite often in decision support queries, when you need to compute large group-bys or large statistics on tables and there are no selections to reduce their size. So how do we compute a join of two tables where there were no selections, and therefore the tables are huge — or the selections are not very selective, and therefore both tables are huge? They don't fit in main memory. There are two methods, one based on hashing and one based on merge join. And both can be seen as ways to partition the data such that it does fit in main memory — such that we end up with partitions that are small enough to fit in main memory. And I will discuss both of them. And I think I'm going to start with grouping. Let me see if that's how the slides go. Actually, we start with a general principle. So let's discuss hash-based algorithms for external memory, and then we will apply this principle to both group-bys and joins. So here is the idea.
The relation is too big to fit in main memory, so we want to partition it into smaller relations — k of them. And assuming that our partitioning is good, that we use a good hash function, we expect the size of each partition to be 1/k of the initial size. And now, if this fits in main memory — if B(R)/k fits in main memory — then we're in good shape. In the second step, we can read each partition into main memory and process it separately from the other partitions. Very simple.

Now, here is a big question. Do you want k to be small or big? That's the easy part of the question. Do we want a small k or a big k? Of course, we want it big. We want to chop this relation into lots of small pieces, into many small pieces. So what prevents us from choosing a very big k? How big can we choose k? That's the question. Think about how many different files you need to write when you do the partitioning in this first step. How many different files do we need to write? k. And for each of them, we need one block of main memory. So k is bounded by the amount of main memory: k cannot be bigger than M. So now let's put these two together, and what do we discover? The partition-based algorithm will work only if B(R) is less than or equal to M squared. Very interesting — up to the square of the main memory. Now, if B(R) is bigger, then you can do a two-phase partitioning, and then it can be as big as M cubed, and then we have a three-step algorithm. But we won't discuss that. It's the same principle, only more complicated. So we will discuss the algorithms that work when the size of the relation is no bigger than the square of the main memory, and then we can do this in one partitioning step. Good.

So let's see the partitioning. OK, so here is a picture for what I discussed previously. In the first phase, R is read block by block. We need one main memory block to read R. And then we can apply the hash function and send the tuple to one of these output blocks. And when that block is full, we immediately send it to disk, into the partition Rk, and then we continue to use that block for the partition Rk. So what this picture emphasizes is actually two things. First of all, we cannot have more than so many partitions: we can have at most M minus 1 — which I will actually approximate with M — different partitions. And second, at least for the partitioning phase, each partition can be arbitrarily long, because we always output one block at a time. And this is the important thing to keep in mind: the size of the relation must be no bigger than M squared.

OK, grouping. Let's see the group-by. Oh, I have this funny symbol here. So gamma is the symbol for group-by. So what do we do? In the first phase, we partition the relation by applying a hash function to the group-by attribute. So you start with this huge relation with gazillions of tuples, and we want to compute a group-by on the group-by attribute. We first partition it according to the group-by attribute. So each partition might contain multiple values of the group-by attribute, but we know that it will contain all the tuples for each such value. So in the second step, we take one bucket at a time, read it into main memory, and then do the group-by in main memory. And this works if B(R) is less than M squared. Notice the cost. We need to read the entire relation in the first step and then write it back into the partitions. And then in the second step, we need to read every partition once. And the total cost is three times the number of blocks of R.
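A minimal Python sketch of the two-phase, partition-based grouping just described. In a real system each partition would be written to disk and read back, and k would be bounded by M; here partitions are just in-memory lists and the aggregate is a sum — both are simplifying assumptions of mine.

```python
def hash_group_by(R, group_attr, sum_attr, k):
    # phase 1: partition on the group-by attribute, so all tuples of a group
    # land in the same partition (at most k partitions, one output block each)
    partitions = [[] for _ in range(k)]
    for t in R:
        partitions[hash(t[group_attr]) % k].append(t)
    # phase 2: aggregate one partition at a time; each partition is assumed
    # to be small enough to fit in main memory
    result = {}
    for p in partitions:
        for t in p:
            result[t[group_attr]] = result.get(t[group_attr], 0) + t[sum_attr]
    return result
```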
OK, let's see this applied to the hash join. It's exactly the same idea. We partition S into buckets. We partition R into buckets. Both are sent to disk. And now we join.

A question here. What is the condition for the partitioned hash join to work in just these two steps? There are two steps: this would be step one, the partitioning, and this is step two, the join. So what is the side condition that I need to write here for the partitioned hash join to work in two steps? Previously we wrote this: we said that B(R) is less than M squared. But now it gets tricky. What exactly do we need to write here? There are two relations. Now, this really forces you to think about how this algorithm works. If you get confused, let me know. Imagine these buckets. We can't have more than M buckets — that's fixed. Now, if the relation is too big, then the size of each bucket will exceed the size of the main memory, and in the second step we won't be able to read that bucket into main memory. But do we need to read both buckets into main memory, from S and from R, or only one? We only need one. Think about the main memory hash join algorithm. We construct the hash table on only one of them, and the other one can be pipelined, can be streaming. So only the smaller one must have the property that its size is less than M squared. So for this to work, we need the minimum of B(R) and B(S) to be less than M squared.

Now, if this holds, what is the cost of the partitioned hash join? How many disk IOs do we need to do? That's actually quite straightforward: three times B(R) plus three times B(S). We need to read and write both tables in the first step, and we need to read both tables in the second step. And we ignore what happens to the results — we stream them, whatever; we don't account for writing the results back to disk. Where does the third read come from? We read and write in the first step here, and then we need to read in the second step.

Maybe this picture will make it clear. The picture at the top represents the partitioning. It is the same principle for both R and S. So we take R, we partition using the hash function, and we write it to disk. And now we have this partitioned R. Then we repeat the same thing for S: we use a hash function and write it to disk. Question: do we need to use the same hash function when we partition R and S, or should we use different hash functions for R and for S? It's a simple question, but I want to see. We have to use the same one. We must use the same one, because we want — we must ensure — that the tuples that join from R and S end up in the same partition. So we must use the same one.

OK, now the picture at the bottom represents the second step. Now we have these buckets: buckets from R in one color, and buckets from S in the other color. So what we do is we read one bucket from one of these two relations into main memory — in this case, it's the blue relation, whatever that is — and the other relation can be streaming; we only need to read one block at a time and do the probing. Now, the relation that fits in main memory will be organized as a hash table: we do a main memory hash join algorithm. Same question here. Should we use the same hash function as in the first step, or should we use a different hash function when we construct the little hash table for the bucket that fits in main memory? So this is h2 here. I denoted it as h2, but I leave it open.
The question is: is it the same, or is it a different hash function from the first step? It's different — because why? What happens if we use the same hash function? You get everything back into one bucket. Exactly, because we already know that all these records come from the same bucket. So if we use the same hash function as in the first step, this will be a very biased, very unbalanced hash table, in which a single bucket contains all the tuples. We don't want this. So we clearly want a second hash function that is kind of independent of the first one, one that ensures that the tuples are uniformly distributed, even though the first one put all these tuples in the same bucket. And that's OK — there are ways to construct such a hash function. For example, you just add salt. You know the technique of adding salt; it's used especially in cryptography. You add a random number which you fix for that hash function, and then the hash function is completely changed. Good. So that was the partition-based hash join. Here is the analysis; we already went through it.

And so this was one technique to do external memory joins of very large tables that do not fit in main memory. The second technique is based on sorting. And for external memory sorting, we always use merge sort, not quicksort. We always use merge sort. And why? Because merge sort has the property that we read the relations in a sequential manner, and that is what we can do very effectively on a disk. It turns out that under the same assumption, only two passes are sufficient for merge sort. And the merge sort I'm going to describe to you — just to refresh your memory about merge sort — is based on this assumption.

So let's review merge sort. The first thing that you do in the merge sort algorithm is to compute the initial runs. A run means a subsequence that is sorted — a sorted subsequence. So in the first step, what we do is read the relation from disk. We read as much as we can fit in main memory, sort it in main memory — probably using quicksort, because that's much more effective in main memory — and then write that chunk to disk. Now this chunk is sorted, so this will be a run. Then we read the next chunk of M blocks, sort it, write it to disk. After the first step, the runs will have length M — actually, not M bytes, but M blocks.

OK, now look at the second step. Unlike the merge sort algorithm that you probably learned in data structures, when we deal with data on disk we use a multi-way merge. We don't merge two sorted sequences at a time; we merge many of them at the same time. How many? Well, same limitation: we need one block for every run that we are merging, so we cannot merge more than M minus 1, or about M, of them. OK, that's the upper bound. But once we merge them, the new run becomes M times bigger. So if these were runs of size M, the new ones will be runs of size M squared. And therefore, if the original table had only M squared blocks, then after the second step we are done. And I will assume that that's the case; otherwise, you need additional steps.

OK, so that's the essence of — yes? So if main memory has only M blocks, then how is it able to read all M runs at once to do the merge? Yeah, it's actually a good question. So this run is big: this run by itself is M blocks. That takes the entire main memory. The next run also uses M blocks. How can we merge M of them? What does merging mean, actually? What is the merge of two sorted sequences?
You interleave them by looking at the heads of the sequences and then picking the lowest element, yes. Exactly. So you look at the sorted sequences, and you always pick the smallest element; that's the one you send out, and then you advance in that sequence. And if the other sequence holds the smallest element, then that's the one you take out and send away. So that means we don't have to read the entire run into main memory. We can read only the first block, and that's what we put here in the first buffer. Similarly, from the second run, we just read the first block. Now that block, in turn, has multiple records. So now we do the merge. We interleave these records, we decide which one to output, and when we exhaust all the records from the current block, then we read the next block. Does it make sense? I think I glossed over this small detail, but it was a great excuse to review the merge algorithm, the merge function.

Now, another interesting aspect here: as you probably remember from data structures, in the merge algorithm you look at the two sequences and you compare the top elements — is x bigger than y, or y bigger than x — and depending on the answer, you do one thing or the other. But now you have to compare M sequences. And M is in the range of — how big can it be? If you have 1 gigabyte of main memory, and your block is 10,000 bytes, then it's like 100,000. So M is 100,000. It's big. There is an interesting data structure, which you actually know, that allows you to decide very efficiently which is the smallest element among these 100,000 blocks. And it's not so much deciding which one is the smallest, because you can do that in one linear pass; it's updating — always having fresh and waiting for you the smallest element of these M runs. What is that data structure? What is the data structure that allows us to always retrieve the smallest element, remove it, and then get the next one? A priority queue. A priority queue, exactly. So what we use here is a priority queue. It is a perfect fit for the multi-way merge. And then the main memory part of the algorithm is very, very effective.

OK, please read — it's not going to be required on the final, but it's very interesting information — Chapter 13 of the book, which describes external sorting algorithms. It gets very, very subtle. There are so many different tricks and techniques that people have proposed for speeding up the external memory merge sort, and they are pretty well described in Chapter 13. OK, so for our purposes, I will always assume that in two steps we are done, and therefore the cost of a merge sort is three times the size of the relation.

So now let's put this to work. How can we use this idea of merging, for example, to compute a large group-by? Imagine a huge table, a terabyte, and you want to group by one attribute, and you can't fit the table in memory. But then you want to compute a sum or a count — in this example, it's a sum — for each group. So with merge sort, we do the same thing. We create runs in the first step, and then in the second step, we merge the runs. How do we need to change this in order to compute the group-by? When we create the runs, what changes? Think about ingesting into memory these chunks of M blocks of data, and then sorting them using quicksort. But now our ultimate goal is to compute a group-by. So what do we do? What about computing the group-by right there? We compute the group-by — we consolidate multiple records into one, compute the sum — and that's what we write to disk.
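A minimal Python sketch of this first phase — run formation with early aggregation. The chunk size stands in for the M blocks that fit in memory, and the runs are returned as lists instead of being written to disk; those are my simplifications, not details from the lecture.

```python
def make_runs(tuples, chunk_size, group_attr, sum_attr):
    runs = []
    for i in range(0, len(tuples), chunk_size):
        # read roughly one memory-full of tuples and sort it on the group-by key
        chunk = sorted(tuples[i:i + chunk_size], key=lambda t: t[group_attr])
        run = []
        for t in chunk:
            if run and run[-1][0] == t[group_attr]:
                # consolidate duplicates within the chunk into one partial sum
                run[-1] = (t[group_attr], run[-1][1] + t[sum_attr])
            else:
                run.append((t[group_attr], t[sum_attr]))
        runs.append(run)   # in a real system this sorted run would be written to disk
    return runs
```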
Now, that sum is not final, because the next chunk might have the same group-by values again, and then we need to combine them. But that goes into the second phase. So now, when we merge, we need to detect duplicates. And as we detect duplicates, we replace them with a single copy of that group-by key and with the sum of the corresponding partial sums. Is it clear how to extend the merge sort to do group-by? It's pretty straightforward.

Same thing with joins. We create the initial runs for R, the initial runs for S, and then we modify the merge step to merge not just runs from R, but runs from R and from S together. And as we merge, we apply the logic of the join, and we do the join. So now let's see what needs to happen for this to work. What is the condition that needs to be satisfied for the merge join to work in two steps, as described here? So we know that after the first step the runs have length M. How many runs will we have for R after the first step? Well, B(R)/M. Think about these as being separate runs that are waiting to be merged. Similarly for S, the number of runs is B(S)/M. Now, in the second step, you need one block in main memory for each of these runs, because you need to read all of them, merge them, and do the join if one comes from R and the other from S. So what needs to happen is that B(R)/M plus B(S)/M needs to be less than M. So this gives us a condition. Now it's no longer the minimum, as it was for the hash join, but a sum that has to be less than M squared.

So let's see this. We have it on the next slide. That is exactly the analysis. m1 is the number of runs obtained after partitioning R. m2 is the number of runs obtained for S after partitioning S. And now, in the merge join, we need one block for each of them in main memory. So we need m1 plus m2 to be less than M. And that results in — m1 plus m2 less than M squared — oh, no, just M, because for each run we need to be able to store one block in main memory. You get to M squared because m1, if you look at m1, is already divided by M. So we do get to M squared. M squared will appear once we ask the question of how big B(R) can be. So here it is. That's what we get: the sum of B(R) and B(S) has to be less than M squared. And the cost is the same. We need to read and write during the run formation, and then we just need to read during the merge and during the join computation. Good.

So these are the external memory algorithms. I don't do justice to them; I only touched the surface. You can imagine that in practice you need to worry about many more things. You need to worry about extending these algorithms to more than two steps if you don't have enough main memory. You need to worry about skew. There is an extension of this partition-based hash join that's called the hybrid hash join. It's described in the book. If you need more information, I strongly advise you to read this chapter. I think this is chapter 15, if I'm correct — no, 14, sorry. I think that's chapter 14 in the book. But for our purposes, I'm going to stop here.

So, summarizing the external join algorithms that we discussed: the block nested loop — don't ignore it, it actually works pretty well if the smaller table is not much bigger than M. The index join is kind of unique, because it works really, really well if T(R) is small, and this happens in OLTP queries, because R is not the actual base relation but a relation after selections.
And then, for massive joins, we need to partition somehow. We have the partitioned hash join, where the condition is that the smaller of the two tables be less than M squared, and the merge join, where the condition is that the sum of the two tables be less than M squared.

OK, any questions about the physical operators before we move on to the optimizer, to the optimizations? So, by the way, I didn't put this paper on the web, but if you need more information, the standard reference for discussing physical operators is a paper by Goetz Graefe — spelled like this, Goetz Graefe. He wrote a paper around '93, '94. It's almost 100 pages long, and to this day it's the reference paper on physical operators. He discusses all sorts of variations of joins — hybrid joins, merge joins. He has an extensive discussion of which one is better when, and of how merge joins and hash joins can essentially be viewed as two variants of the same principle. And he also discusses more complicated operators, like something similar to set difference. So this is the right reference if you want to know more. About the actual person: he was a student at Wisconsin in the 80s. He invented essentially the modern style of optimizers based on rules. And then he moved to Microsoft, and he implemented this style of optimization — which had actually been implemented earlier in DB2; DB2 implemented rule-based optimization in the early 90s. Goetz Graefe implemented his vision, and I think he was actually the first to propose this architecture, in SQL Server in the late 90s. And now he has moved on; he's no longer with Microsoft. He's, I think, with HP. So he's really a true expert in optimizers and physical operators.

And talking about optimization, let me switch the slides and go to today's lecture — if I can figure out how to do this. Here it is. So I really hope to be able to cover these 90 slides or so today and discuss what is always the most confusing topic in database systems, which is the query optimizer. No book does justice to the query optimizer. It's just a complex system, a complex module of the database management system, and somehow you need to know everything before the pieces can start fitting together. So to help you understand this topic, I'm going to do two things. Of course, I'm going to refer you to the chapter in the book. But it's not a great chapter — no book has a good description of query optimization. But I also made available this paper, "Query Optimizers: Time to Rethink the Contract?", by Surajit Chaudhuri. Surajit Chaudhuri is the head of the database group at Microsoft Research. He wrote a number of very influential papers, and some of the structuring of the material on query optimization that I have in these lectures comes from a paper written by him in 1998, where he gave a tutorial on query optimizers. And now, last year, he wrote this very provocative paper, "Time to Rethink the Contract" of query optimizers. And I would ask you to read it for next time. It's a very easy read, but it also makes you think about what went into the query optimizers. And please come to the lecture next time, because we will discuss parallel databases and Bloom filters. Bloom filters actually do belong here — they are a form of semi-join reduction. Good.

So let's see, what's the point of query optimization? So remember, we start from a SQL query, which is a declarative language — it tells us what we want — and what optimizers do, what query engines do, is translate the SQL query into a query plan.
So into something that looks like this. It's a tree. There are joins and selections and more joins and projections and group-bys. And this is a plan, because if you look at this tree, you really know in which order to execute the operations. Now, this is just one possible way. What the query optimizer has to do is choose between alternative plans. There are probably many ways in which you can express the same query. You might reorder the joins. You might move some of the selections up and down the tree. Maybe you can do something with the group-bys. You need to choose between alternative plans. An important decision is which physical operator to choose for each logical operator: should that be a hash join, or should it be a merge join, or a nested loop join? As it does so, it needs to settle on a good plan. Well, how does it know if the plan is good? It needs to somehow estimate the cost. It needs to estimate how long that plan would take without actually executing it, because execution is what's expensive. So it needs to figure out, somehow magically, how long the plan will take before it actually executes it. And this is very, very tricky. And finally, once it has examined several such plans, it will choose the one that looks best, and that's the one it's going to execute. That's the purpose of the query optimizer.

So I have an example to suggest the possible choices of the query optimizer. I wonder whether we should go through too many details here. Please do read this example at home carefully. Don't try to get to the next stage without understanding this example in detail. But I wonder whether we should spend too much time on it in this lecture. Let me spend just a little bit of time on each of these query plans. So what I have here are three physical plans for the same query. The query is right here. It is a join between Supplier and Supply. There are two selections on Supplier, and there is one selection on Supply. And this is the join here. So that's the query. So I have a single logical plan, but I'm going to show you three alternative physical plans for this logical plan.

The first is: do a block nested loop join, then do the selections later on, and then do the duplicate elimination. So what is the cost? We have the statistics at the top; that's how we can compute the cost. And here is the cost. For the block nested loop, it's this cost here, which turns out to be 100 plus 10 times 100. And the other two operators incur no cost; they are done on the fly. OK, so this was one plan.

Here is a second plan. Well, I said this is the same logical plan, but that's actually not true. Here the selections are pushed down. And once they are pushed down, they reduce the size of these tables. This plan also materializes: it writes the results of these two selections to disk. It didn't have to do this, but that's how we decided to illustrate this example. So let's see the cost here. The cost to write the first result to disk is this — this is the cost of T1. The cost of T2 is here — that's the cost of T2. And then the sort-merge join seems to be very cheap; it's just 2. And then the final projection is 0, because that's going to be streaming.

And finally, the last one. This is more interesting. This one uses an index to look up Supply — so it does something smart here — and then it uses an index join to get the Supplier. So let's see the cost. First, we need to figure out how many tuples are coming from the left.
And this is four tuples, because if you look at Supply, here is the number of tuples of Supply: 10,000. Now, PNO is here. There are 2,500 distinct values of PNO — PNO stands for part number. So for every part number, you expect there to be about four Supply tuples with that part number: 10,000 divided by 2,500. So therefore you have four tuples coming from the left. For each of them, we need to do this index lookup. So the total cost is four: each of them costs us one disk access, because it's an index lookup on the key, and we assume there is a clustered index on the key. Good. So please contemplate and study these physical plans at home. Make sure you understand what's going on. And at a high level, what you need to take out of this is that these are the kinds of calculations that optimizers must do in order to decide which of these three physical plans is the best one. OK, so the simplification that we assume throughout these lectures is that the index lookups incur no cost; the only cost is to actually reach the record that the index points us to. It's like assuming that the index is in memory. Good.

So here are the important lessons that I want you to take out of that example. The first is that we need to consider multiple logical and physical plans. That is the essence of the query optimizer; that's what query optimizers have to do. But there is a second lesson, which is quite important. At first you might get deceived into thinking that there is an ideal query plan — ideal in the sense that you try to be smart and say, well, I can always push those selections down, it's going to return fewer tuples. But no, that's not the case. It depends a lot on the data. And remember, we discussed this example with the selection on year on the Movie table. And I still recommend that you run this experiment: go to SQL Server and try to select all movies with the year 1905; then replace it with 1995. In the first case, the optimizer will choose an index. In the second case, it will not choose an index. It depends on the data; it depends on the statistics of the data. So there is no magical best plan. Instead, the optimizer must use what's called cost-based optimization: it must use all the statistics that it has available on the data and choose the plan whose estimated cost is lowest.

OK, so here is the contract of the optimizer, as defined by Surajit in his 2009 paper — the paper that you're going to read. It's a very difficult contract. It needs to produce a high-quality execution plan for all queries: it can't refuse to optimize nested queries because it can't handle them. No, it must be able to understand all possible queries. But the second and the third requirements are really difficult. It must be fast: it must decide this within a small optimization time. For example, here is an alternative. If we were allowed to actually run the queries and see how long they take, then it would be easy: you just run them, see which one was faster, and say, here is my optimized query. But then it's too late. You can't do this. You need to select an optimal query plan much faster. And the other requirement is also important: using only very limited additional input. And this has an impact on the statistics that we can use. You probably know about histograms.
Histograms are small precisely because we can't afford, within that optimization time, to process large data structures that would give us a lot of information. So that is the basic contract of the optimizer.

So now we're going to start discussing the optimizer itself. And here I'm using, again, Surajit Chaudhuri's structuring of the material. An optimizer consists of three different components that can be discussed independently. There is the search space, which is the space of plans that the optimizer is willing to consider. There is an algorithm that decides how to search in this space and how to find a good plan. And there is a separate module that is responsible for estimating the sizes of all the intermediate results, and in consequence is capable of estimating the cost of every alternative plan. I'm going to spend the most time on the search space, and very little on the algorithm — the algorithm is the most confusing one, because the modern algorithms are too complicated to put in a textbook. And I'm going to spend a little bit more time on cardinality and cost estimation, which is a major open problem even today; people don't understand how to do this effectively.

So, the search space. Oh, and actually I should mention this: it's impossible to discuss query optimization without referring to a landmark system, which was the first to define query optimization. And this was System R, developed at IBM. The paper was written in '79; the system was developed a few years earlier. And this optimizer already had all the components in place. It had the search space in place. It had an algorithm in place — which is actually the only algorithm described today in textbooks. And it had the idea of data statistics and a size estimator and a cost estimator in place. In all books, and in my slides, you will find references to the System R query optimizer. The piece of the book that describes it is this section. It's very brief — it's half a page — and it tells you the high-level ideas of the optimizer in System R. Good.

So let's start our discussion of the first component of the optimizer, which is the search space. Again, think about the search space as a definition of all the alternative plans that the optimizer will consider. How do you define these alternative plans? Well, you need to pin down the algebraic laws that allow you to transform one plan into another plan. And that's what we are going to discuss for the next 20 minutes or so. But in addition, you might also want to restrict the structural shape of the plans, which has nothing to do with algebra; it has to do with some practical limitations of the optimizer. So let me show you this restriction first, because it's easiest to understand. For example, you might want to restrict what kinds of join trees the optimizer considers. Many optimizers only consider left-deep join trees. And the original System R only had left-deep join plans. But of course, some plans are more efficient — some queries can be evaluated more efficiently — if you allow bushy trees. But then you have more plans to explore. That's a trade-off.

OK, so what I'm going to do next is discuss the algebraic laws that are used in modern database systems and that essentially define the search space of all the possible plans. And by the way, if you need a bio break, either let me know or just take it by yourself. I don't think it makes sense to stop. Do you need a break? Let's just continue, because I have lots of material to cover.
And if we finish early, then everybody will be happy. And actually, I have a lot of fun teaching this material. There is a lot of intelligence put into the optimizers, and I actually hope to do more justice to them than the book does.

So, algebraic laws. They start very easy, and these are the laws that were already present in the original System R. Let's look at the joins. Joins are commutative, so you might consider both orders. It matters, of course: if you think about a hash join, it matters whether you compute R join S or S join R, because the inner table is the one on which you construct the hash table. Associativity — again, applying this law determines a different join order. And that is what the original System R essentially focused on: how to choose the join order. And then there are other simple rules that you can imagine, which are written here. This example just shows that associativity matters. I'm going to skip over it; please read it at home.

More simple laws. Hidden on this slide is the only really important simple law that all database systems, even the most naive ones, implement. And that's the one I'd most like you to remember. It's this one here. How should I mark it as important? A question mark, an exclamation mark. What does it do? It says that if you join two relations and then you apply a selection, and that selection only looks at the left relation, then you can push the selection down. And it's a heuristic that essentially says it's almost always more efficient to push the selection down, because then you reduce the number of tuples that come from the left, and it can't hurt you. But it can hurt you. And I would like to ask you: when can it hurt you? When is it not a good idea to push selections down? Let's push them to the right, and then maybe it will be easier to see why this might hurt you. So my point is that if the selection refers only to S, then yeah, we can push the selection down. But there are cases when this hurts. When? It only emphasizes the thing that we said over and over again, that there is no perfect plan; you always have to estimate the cost. Give me one example when it's a bad idea to push a selection down. When it's not on the index? Something with an index, but what exactly would that be? Think about the join. The join might use an index. If you look at the plan on the left, this might be an index join, where you use an index on S. But once you push a selection down, there is no more index on the result of the selection. So this is where it hurts you: once you have pushed the selection down, you can no longer use an index-based join for this particular logical join operator. And the optimizers just have to consider all such plans and pick the cheapest one. That's always the case. Very early and very simple query optimizers did only this: they pushed the selections down automatically, and then they just used the default join ordering. That was all they did. This is why I claim this to be the most popular algebraic law.

So here are some examples. Let's go quickly over the first one — actually, let's go quickly over both. How can I push the selection F equals 3 down? Where does it go, to R or to S? To S, because F is in S. So this would be R join on D equals E with select F equals 3 from S. Now, suppose — you can see this — that we had an index on E, on S.E. Then for the expression on the left we could use that index and an index join. But once we do this extra selection, we can't use an index join anymore.
And what about the second example? A is in R, and G is in S. So therefore, this is select A equals 5 from R, join, select G equals 9 from S. OK, so that was a simple example.

More laws — these are about projections. I'm going to skip some, because projections are really postponed as much as possible, to the end. In real implementations, projections are always without duplicate elimination; they just remove columns. So the true question is what we do with the duplicate elimination operation, which is the expensive one. Projections can be postponed until the end; it's not really the case that optimizers go to great lengths to push them down. But do look at these slides at home. Now, this essentially exhausts the simple laws that were present in the early database systems. Now I'm getting to some much more intricate laws that combine group-by and joins.

So this slide is actually not very interesting; please read it at home. Let's look at this slide here. And I want to change the title — I'm going to replace the title and call it "laws for group-by and join". It's not just an example; it's a new law for group-by and join. This is an amazing law. Look at this. It says the following. Suppose you need to join R and S, and then, on the result, you want to group by A and compute the sum of the Ds. It says that you can do this as follows: you can aggressively group by C first — you group by C first and sum the Ds, and this way you reduce the size of S a lot — and then do the join, then group again by A, and sum the Ds again, sum them a second time. These laws were discovered only in the 90s, and then they were introduced into database systems.

I actually had a personal experience with this law, because I was experimenting, trying to design a homework for the undergraduate version of this class. I had a simple many-to-many relationship: imagine two tables, R and T, joined through a relationship S, and this is many-to-many. And what I wanted to do, I wanted to design a homework in which students would have to implement in C a join followed by a group-by. So there was a group-by: group R by R.A and sum T.D. Do you envision the query? A three-way join between R, S, and T, and then group by an attribute from R and sum an attribute from T. And I wanted to emphasize how a database system can be very smart. OK, so I implemented this in C in order to show that the database system can be smarter. But when I tried it on SQL Server, I could not believe my eyes. It was blazingly fast. And the reason why it was blazingly fast — and I actually abandoned the homework; it didn't make sense, SQL Server was just amazing — the reason why it was blazingly fast is that it was applying this rule, which I didn't expect. If what you need to do is group by A and sum D, you can first group by the key of T: you group by its key and sum over D. And then, when you join with the many-to-many relationship, you can group again by its foreign key into R. And then you do a final group-by when you join with R. And this is what SQL Server did. And the tables were huge; it took like a second or two to compute the query. So I gave up, because that was not a good introductory exercise for the undergraduate database class. But it points, I think, to the fact that this is a non-obvious law. It's a law that you would not normally think of at the beginning.
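For reference, here is one way to write the law just described, assuming R(A, ..., C) joins S(C, D) on C and that the aggregate is a sum, so partial sums can simply be added up; the attribute names follow the slide's reading and are otherwise illustrative:

$$
\gamma_{A,\,\mathrm{sum}(D)}\bigl(R \bowtie_{C} S\bigr)
\;=\;
\gamma_{A,\,\mathrm{sum}(sd)}\Bigl(R \bowtie_{C} \gamma_{C,\,\mathrm{sum}(D)\rightarrow sd}(S)\Bigr)
$$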
It was actually invented and promoted in the 90s. OK, more laws, which were actually known before the 90s, are laws that involve constraints. You know these laws already; we discussed them when we discussed views and view inlining. Here is an example. When we have a foreign key from Product into Company, then if you join the two but you only need attributes from Product, you don't need the join, and you can drop Company. In order to drop Company, two conditions need to be satisfied. One is that CID must be a foreign key; this is the first condition. What is the second condition? If you just declare CID to be a foreign key and compute the join, the optimizer will not remove the join. There is a second constraint that needs to hold: Product.CID must be NOT NULL. Because if Product.CID is allowed to be null, then some products would not be included, namely those that do not have a corresponding company. We discussed this in the lecture about views, so I don't think we should repeat it too much. Here is an example; again, I'm going to skip over the slide. It's essentially the same idea as the one we discussed in the lecture on views. Good. Now this brings us to one of the most powerful of these optimization laws, the ones based on semi-joins. We already discussed semi-joins last time, and now I'm going to show you their full power. And actually, I'm going to continue this discussion next time when we discuss Bloom filters. This kind of semi-join law is called a semi-join reduction. It is also referred to as a magic set, somewhat incorrectly, but you might find this term. I think in Surajit's paper that you have to read for next time, the term magic set refers to semi-join reductions. There is a long history of why they're called magic sets; I shouldn't get into that detail. The correct term is really semi-join reduction. It's a very powerful technique that is becoming even more important in the context of parallel query processing, where it can significantly reduce the running time of a massively parallel query computation. So let me remind you, and I'm going to spend some time discussing semi-join reduction, which is why I'm going slowly. Let me remind you of the semi-join. What is the semi-join of R and S? The English description is the best: it is the set of tuples in R that will join with S. But we don't do the join; we only prepare them. The formal expression is this: you join R and S and then project on the attributes of R. And the basic law of the semi-join is this: if you want to compute R join S, then you can first compute the semi-join and then join with S. It's so trivial that you wonder why we have this law. I'm going to show you how it applies. Again, it's often combined with Bloom filters, which we will discuss next time. This is the place in the textbook where these things are described. So here is the idea of a semi-join reducer. Now you don't have a single join; you have a whole set of joins, an entire query. And a semi-join reducer is a sequence of semi-join operations that restrict further and further the relations that you eventually need to join. There might be multiple ways to do this, so the semi-join reducer is not unique. But the better a job you do, the fewer tuples you end up with, and the fewer tuples you have to join in the end. What remains invariant is that the query is still equal to the join of the reduced tables.
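A minimal sketch of the semi-join operator and of a two-step reducer for a single join; the attribute names and data are made up for illustration.

```python
# Semi-join, and the basic law:  R join S == (R semijoin S) join S.

def semijoin(r, s, attr="b"):
    """Tuples of r that have at least one join partner in s (no join result is built)."""
    s_keys = {t[attr] for t in s}
    return [t for t in r if t[attr] in s_keys]

def join(r, s, attr="b"):
    return [{**tr, **ts} for tr in r for ts in s if tr[attr] == ts[attr]]

R = [{"a": 1, "b": 10}, {"a": 2, "b": 30}]                      # (a=2, b=30) is dangling
S = [{"b": 10, "c": 5}, {"b": 10, "c": 6}, {"b": 99, "c": 7}]   # (b=99, c=7) is dangling

R1 = semijoin(R, S)    # reduce R:  R1 = R semijoin S
S1 = semijoin(S, R1)   # reduce S:  S1 = S semijoin R1  -> a full reducer, no dangling tuples remain
assert join(R, S) == join(R1, S1)
```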
So the query remains unchanged, but during the semi-join reduction you eliminate tuples from the relations that would not participate in the join anyway. Now, a full reducer is a reducer where you have done the job to completion: every tuple that remains in any relation will indeed participate in the join. There are no more dangling tuples. Let me show you some examples. Here is one. We want to join R and S. Here is one reducer: we replace R with the semi-join of R and S, and therefore the new query can be obtained by joining the reduced table R with S. So why do we do this? Why would we compute the semi-join reduction, which seems to be not much cheaper than the join itself? Well, there are cases when it can be much cheaper, and we discussed one such case last time, when we discussed distributed joins. A typical case is when you have two tables that have lots of attributes, or some very large attributes, but they join only on one attribute; here the join is between B and B. So you might want to do a semi-join before you actually do the join. A more interesting application is when you combine this with group-bys, and I have an example later that explains this. But we can also go over this one. So what happens here? This is a natural join between R.B and S.B. Later, we will group by A and B, which are here. But we only want the tuples that satisfy a certain selection predicate on C. So what can we do here? Well, we could first look at the tuples in R that join with something in S satisfying the selection condition. And once we have reduced R, we can apply this group-by, which is an expensive operator, to a much smaller table R1. Again, this is an example that you should read in detail at home; it's actually pretty simple, and I have a more complicated one in a couple of slides. So let's look deeper into semi-join reducers. We saw this one, where R1 is obtained by taking the semi-join of R with S. My question is: is this a full reducer, or are there dangling tuples? Are there any dangling tuples in S or in R1? When we compute the join, are we guaranteed that all the tuples in R1 and in S will indeed join, or do we still have some dangling ones? Let me erase the board so we can read the example better. Are there dangling tuples? There are, in S. Think about S: it might have a tuple whose B value doesn't match anything in R, and we have not removed it. What about R1? Can R1 have dangling tuples? No, because R1 is exactly the semi-join of R with S. So all the tuples in R1 are guaranteed to join with something in S. With R1 we are done, but we are not done with S yet. So we can finish the semi-join reducer by doing a second reduction. Here it is. We first compute R1 as the semi-join of R with S, then we restrict S by taking its semi-join with R1. And if you think about this a little bit, there are no more dangling tuples, and now this is about the best you can do. It gets more interesting and more complicated. Now we have a three-way join: R and S join on B, S and T join on C. Here is a semi-join reducer; it's actually a full reducer. We start from the middle table and take its semi-join with R. So now we know that everything that survives has a matching B in R. Then we move to the right table, to T prime, and here we can actually use S prime, because it's smaller; there is a typo on the slide, that should have been S prime. And now we further reduce S by taking its semi-join with T prime.
And finally, we use that to reduce R. Please think at home, when you study this, about why there are no more dangling tuples. Try to convince yourself that after this sequence of operations, there are no more dangling tuples. It's not unique; other orders are possible. But it turns out that for this query there is a way to compute a full reducer. A little bit of theory: it turns out that some queries do not admit full reducers expressible in a finite number of steps. That's the case here. This query is cyclic, because R and S join on B, S and T join on C, but R and T join on A. There is a cycle. It turns out that no matter how much you reduce, there is always more to be reduced, and there is no place to stop. There is actually a nice theory that describes which queries admit full reducers; they are called acyclic queries. In practice it doesn't matter so much; you can always apply some reducers, even if they are not full, even for cyclic queries. But keep in mind that in some cases there is no hope of finding a full reducer; you have to settle for a partial one. OK. And here is the more complicated example that I mentioned. Let's go over it a little bit; we still have time. This is an example, again, from Surajit's paper from 1998. It has two tables, Employee and Department, and every employee has a foreign key into a department. Now there is a view which computes, for every department, the average salary in that department. Here is the view; it's actually just a group-by, no joins. Now the query. What does the query do? It joins employees and departments with this average-salary view and finds all employees under 30 who earn more than the average salary of their department. So it really uses the computation in the view in an essential way. This query cannot be unnested; even after you expand the view definition, the query cannot be unnested. But that's not the point. The point is to do a semi-join reduction. And there is another condition here, a condition on the department, that its budget be big. Good. So now let's look at the reducers. The point here, let me go back, the point is that if you substitute, if you inline the definition of the department average salary, then we will end up computing the average salary for too many departments. The question is: for which departments do we actually need to compute the average salary? We don't need it for all departments. Give me a hint: which departments are the ones where you should invest resources to compute the average salary, for this particular query? The departments whose budget is over $100,000; for the others, there is no need to compute the average salary. Second hint: what else can we extract from here? The departments that have at least one employee whose age is under 30, this condition. So this is a semi-join reducer: we join with the employees satisfying these conditions and make sure that we only compute the average salary for those departments. That's the idea. One way to describe this process, and of course what happens under the hood is much messier, but one way to visualize it is to think about modifying the view definition. So now we have a limited average salary view that computes the average salary only for departments with a high budget.
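One way to picture this limited view is the following sketch; the table and column names (dept_id, budget, salary, age) and the thresholds are made up for illustration, and the real rewrite of course happens on the algebraic plan inside the optimizer rather than in application code.

```python
# Sketch of the 'limited' view: compute the average salary only for departments that
# survive the semi-join reduction (big budget, and at least one employee under 30).
from collections import defaultdict

def limited_avg_salary(employees, departments):
    big_budget = {d["dept_id"] for d in departments if d["budget"] > 100_000}
    has_young  = {e["dept_id"] for e in employees if e["age"] < 30}
    relevant   = big_budget & has_young            # semi-join reduction of Department

    totals, counts = defaultdict(float), defaultdict(int)
    for e in employees:
        if e["dept_id"] in relevant:               # only aggregate what the query needs
            totals[e["dept_id"]] += e["salary"]
            counts[e["dept_id"]] += 1
    return {d: totals[d] / counts[d] for d in totals}
```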
And then the new query can simply refer to this limited average salary view, because this is like a semi-join reduction. But we can push the other condition as well, and here it is. This version has both a join with the departments and a check that at least one employee is under 30. And this is now a full reducer, because we don't compute anything dangling; there are no dangling tuples. OK, a very nice example. Again, I recommend that you study it carefully at home. And this concludes my discussion of the search space. So let me review it a little bit. The search space is the conceptual set of all possible plans that the optimizer is willing to consider. It is essentially defined by the algebraic laws that are hardwired into the optimizer and that allow the optimizer to enumerate these possible plans. The laws start simple, like associativity and commutativity for join reordering, but they quickly become very clever and quite complicated, like the interaction between joins and group-bys, and these semi-join reductions. Keep the semi-join reductions in a very separate place in your mind, because they have very powerful applications to parallel databases, and I will bring them up again next time. OK, there is one last decision that affects the definition of the search space, namely, how do we represent the plans so that we can do early pruning? What does this mean? As the optimizer starts considering alternative plans, it will discover that some choices are bad; some choices, some join orders, say, result in bad plans. Depending on how you represent this partial progress, you can do the pruning more or less effectively. So let me show you the possible choices. One choice, which is not ideal, is to represent complete plans for the query. But if you do this, then it's very difficult to say that a particular order between two joins is ineffective, because there are many plans that use this particular order, and you would have to say for all of them that this order is ineffective. Database optimizers use partial plans, and there are two kinds of partial plans: bottom-up partial plans and top-down partial plans. Let's examine bottom-up. Bottom-up means, for instance, that if you have a query that joins three relations, then joining just two of them is a bottom-up plan. You need to continue to join with more relations in order to get a complete plan. And similarly for the selections; now you start introducing selections, and all of these are partial plans. So now it's much better, because if you discover, for example, that this plan p1, which is partial, is much worse than the second plan, which pushes the selection down, then there are many other plans that use p1 as a sub-plan and that you can prune out of your search space. So that's a bottom-up plan. Top-down plans are quite interesting. They represent the plan from the root, but they stop at a certain branch, and where they stop there is an entire sub-query. So, for example, you can think of this big query like this. You say: I'm going to do this join last, and before that I'm going to join with S, but for that I still need to compute the selection, and I don't know yet what the plan for that selection will be. This is a top-down plan. Same principle: if this join between S and T turns out to be ineffective, then we can prune many, many plans. Good. So this really finishes our discussion of the search space. Once this is in place, the optimizer needs to enumerate the plans in the search space.
Ideally, it should enumerate all of them and choose the cheapest one, but in practice optimizers often skip a large fraction of the plans in order to speed up. So what I'm going to briefly mention now are algorithms for enumerating plans. And this is kind of the flimsiest part of our discussion, because I'm not going to discuss the real ones, the rule-based optimizers; I'm going to discuss only the System R dynamic programming. So there are two plan enumeration approaches. The one discussed in System R does two things: join reordering, which decides in which order to perform the joins, and something called access path selection, which is just deciding how you access a particular relation: do you do a sequential scan, or do you use an index? The System R optimizer uses a bottom-up representation. It's simple, but it's limited to single-block select-from-where queries, no nested queries. What modern database systems do is use rule-based optimizers. This is an architecture that was designed in the mid-80s and early 90s, when people observed that the System R optimizer cannot handle sub-queries and cannot handle these more complicated constructs, like group-bys and universal quantifiers, because it was limited to join ordering. I told you this before, but maybe I should do them justice and say it again. There were two different groups that proposed rule-based optimization. One was at IBM Almaden; they developed a research prototype called Starburst that described the optimizer as a collection of rules. The other was Volcano, developed at Wisconsin by David DeWitt, who was a professor there, and Goetz Graefe, who was a student then. Starburst became the DB2 optimizer; it has been implemented in DB2. And Goetz Graefe, once he moved to Microsoft, essentially architected, and had a significant impact on, the rule-based optimizer in SQL Server. These kinds of optimizers work by having a database of rules. All the algebraic laws that we described can be registered as rules in the optimizer; essentially the syntax is quite similar to what we used. And then the optimizer uses some dynamic programming to apply these rules and choose the best plan. I'm not going to discuss any such optimizer in detail. There is no good description of them, as far as I know; the original papers are impossible to read, I tried several times. It's only in the minds of the people who actually implemented these optimizers that you can find some clarity. No book does them justice. So I will do what the standard books do and describe join ordering and access path selection in System R. So what is join ordering, or join reordering? It's the decision of in what order to do the joins in a single select-from-where SQL block. We don't worry about selections; all selections are assumed to be pushed down. We don't worry about projections; all projections are assumed to be postponed until the end. All we worry about is in what order to do the joins. For this, the System R algorithm used a simple dynamic programming algorithm, and it's really so simple that it would be a pity if you missed it, so bear with me. It's very, very simple. What the dynamic programming algorithm does is take any subset of the relations and ask: if I were to join just this subset, as a bottom-up plan, if I were to compute just this subquery, what is the optimal order for this subset? And it creates a table.
That's the essence of any dynamic programming algorithm: it creates a table where, for any such subset, it records the cheapest way, the best plan, for computing the query corresponding to that subset. Now, you might object. You might say, well, there are exponentially many entries in this table. And this is true; this optimization algorithm runs in exponential time. But there are all sorts of heuristics that optimizers use to restrict this space. So let's examine this in some detail; this is the only slide I have for it. Initially, what we fill in the table are the entries corresponding to single relations. Well, what is a plan for a single relation? You either scan it or you access it via an index, because the selection has been pushed down. So you need to make this decision, and that's what you store in the table: you store how you are going to access that relation. Then the algorithm fills in the entries of the table corresponding to larger and larger subsets. For each such subset, once it reaches that entry and wants to compute the optimal plan for that subset, here is what it does. It considers all possible ways to partition the subset into two disjoint sets. Then it says: look, I'm going to first compute Q prime, then I'm going to compute Q double prime, and then I'm going to join them; this join will be at the end. How much will it cost? And here comes the beauty: in the table we already have the entries for Q prime and Q double prime. We just look them up; we know how much they cost. And now we do our magic: we estimate the cost of this join using some cost estimation function, and we compute the total cost for this plan. Then repeat: do this for another way of partitioning the query, try all possible ways, and keep the cheapest one. That is what we store in the entry for Q: the cheapest plan of the form Q prime join Q double prime. It's a very simple algorithm; that is join ordering. In order to reduce the space, what database systems do, and this goes back again to what System R did in the 70s, is apply two kinds of restrictions. Often they restrict to left-linear trees only. It's amazing: optimizers don't like bushy trees, because there are too many of them, and then optimization would take too long. But then there is a much more interesting restriction that deserves some discussion, namely the restriction to trees that do not have Cartesian products. So look at this query here: we join R with S and S with T. Does it make sense to compute this plan? What's interesting about this plan is that when we join R and T first, there are no attributes to join on, so we get a Cartesian product. But then we can later join with S. Would this plan make sense? No, it doesn't make sense, in most cases. And the reason is that a Cartesian product is big; it results in many, many tuples. But it's interesting that there is a particular setting where it does, OK, let me just finish the main thread. So database optimizers typically do not consider plans that have Cartesian products; in this way they can reduce the search space quite dramatically. But what is interesting is that people have discovered that in data warehouses such plans actually do make sense. So what is a data warehouse? A data warehouse is a database that is essentially a copy of your production database, which you put aside in order to do data analytics.
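Here is a minimal sketch of this dynamic programming idea, restricted to left-deep plans, before we continue with data warehouses. The cardinality and cost formulas are simple stand-ins, not System R's actual cost model, and the relation names and selectivities are made up.

```python
# Sketch of System R style join ordering by dynamic programming over subsets of relations.
from itertools import combinations

def best_join_order(relations, card, join_sel):
    """relations: list of names; card[r]: table cardinality;
    join_sel[frozenset({r, s})]: join selectivity (missing => Cartesian product)."""
    # Table entry for each subset: (estimated size, cost so far, left-deep join order).
    best = {frozenset([r]): (card[r], 0.0, [r]) for r in relations}

    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            for r in subset:                        # r is the relation joined last (left-deep)
                rest = subset - {r}
                size0, cost0, order0 = best[rest]   # already computed: smaller subset
                sel = 1.0
                for prev in rest:                   # selectivities of all predicates involving r
                    sel *= join_sel.get(frozenset({prev, r}), 1.0)
                size = size0 * card[r] * sel
                cost = cost0 + size0 * card[r]      # stand-in cost: work of the last join
                if subset not in best or cost < best[subset][1]:
                    best[subset] = (size, cost, order0 + [r])
            # (real optimizers would also prune plans containing Cartesian products here)
    return best[frozenset(relations)]

size, cost, order = best_join_order(
    ["R", "S", "T"],
    card={"R": 1000, "S": 100, "T": 10},
    join_sel={frozenset({"R", "S"}): 0.01, frozenset({"S", "T"}): 0.1},
)
print(order, size, cost)
```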
The warehouse is where you would run your expensive decision support queries. It is sometimes restructured; the schema is often modified from the production database, and sometimes you integrate data from multiple databases. But what is typical for the vast majority of data warehouses is that they have a star schema. What is a star schema? There is one fat entity; think of it as the detailed transaction records. This goes back to the days when Walmart was running the largest data warehouse in industry, in the 90s. I think they might still have the largest data warehouse. Every single transaction, every single sale, is recorded in that warehouse for a period of, I don't know, months or years. So this is a big table in the middle; I'm going to call it S. And then there are some side tables, like the table of all stores, that would be R, or the table of all products, that would be T, or the table of all employees who sell things. And then there are many-to-one relationships: every transaction, every sale, refers to one store, one product, one employee. That is the star schema. Now, the typical decision support queries in data warehouses go like this. They ask: how many sales of this particular product did I have at that particular store? So when the query refers to a particular product, there is a very selective predicate on the product, and there is a very selective predicate on the store. Instead of having thousands or tens of thousands of products and stores, now you have only five of each. But S continues to be huge; you still have terabytes of detailed records. So now think about this join here. Maybe you have five stores that match your store criteria and five products that match your product criteria. Their Cartesian product is just 25 tuples. But now you only need to touch the big table once, as opposed to touching it twice. And that's the advantage of using a plan that has a Cartesian product. Good, so the bottom line: yes, database optimizers today typically do not consider plans with Cartesian products; but no, this is not always a good idea. There are cases when plans with Cartesian products actually do make sense, especially in data warehouses. Good. This was the first part of the System R optimizer, join ordering. The second part is about access path selection, and it's actually much easier. This is the first step of the dynamic programming algorithm, where we have to decide, for a base table, how to access it. I have some examples here; I think I have a better one here. So look at this example on the slide. We have a table Supplier, where SID is a key and city is an attribute. And the selection condition, to make things complicated, is this: SID greater than 300, and the city is Seattle. We have a B-tree index on SID and a B-tree index on city. So we have three possibilities for the access path: we could do a sequential scan, we could use the B-tree on SID, or the B-tree on city. Which one should we use? Which one do you think is cheaper? It's actually difficult to say. SID is probably the primary key, but the predicate on it is a range, not an equality, so it's not clear that it is cheaper than going through city. The query optimizer just needs to consider all of them and then settle on the most selective one, which is just a fancy way of saying: compute the cost of each and select the best one. And on this slide I have the details of how to compute the cost, but I really advise that you look at this at home; there is nothing deep here.
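As a sketch of what that comparison might look like for this example, with made-up sizes and simplified versions of the cost formulas from earlier in the lecture (a clustered index reads a fraction of the blocks, an unclustered index costs roughly one IO per matching tuple):

```python
# Sketch of access-path selection for: sid > 300 AND city = 'Seattle'
# assuming a clustered B-tree on sid and an unclustered B-tree on city.

def access_path_costs(B, T, V_city, sel_sid_range):
    return {
        "sequential scan":           B,                    # read every block of the table
        "clustered index on sid":    B * sel_sid_range,    # only the fraction of blocks in range
        "unclustered index on city": T / V_city,           # ~one IO per matching tuple
    }

costs = access_path_costs(B=10_000, T=1_000_000, V_city=500, sel_sid_range=0.9)
best = min(costs, key=costs.get)
print(costs, "->", best)   # here the city index wins: 2000 IOs vs 9000 vs 10000
```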
The relevant formulas are exactly the cost expressions that we discussed earlier today for index-based selection; we just compute all of them and then select the cheapest one. Good. So keep in mind that the System R style algorithm has two parts: join ordering and access path selection. Modern optimizers don't do just join ordering; they have a rule-based approach to optimization, which extends this in quite significant ways. Now, besides choosing the plan, the optimizer needs to make some other major decisions, and they are very messy. It needs to decide how much memory to allocate to each operator, and that's something I'm not going to discuss; there is no magic there. But it needs to make another decision that has impact, which is whether to materialize or to pipeline the results of operators. That is something I would like to discuss, because the principle of pipelining is really critical, and it's important that we get it right from the lectures. Here is a plan that joins four tables, R, S, T, and U, and it does so by materializing all the intermediate results. I represented this with a bar; this is how people draw it, putting a bar here to emphasize that everything is materialized. So what does it mean when I put a bar right over the S? It means that S will be read into main memory, and we have to wait until the entire S is present in order to construct a hash table in main memory. Once we have it, R can be streamed. And as we stream R, we decided we are going to create a view, this V1, that contains the result of the join. This is what is expressed here: we create a hash table for S, then we join, and we write V1 back to disk. Then we repeat with T: we create a hash table on T, we read V1 back, and we write V2 to disk. And finally we do the same thing for the last join. Let's see the cost. Actually, let me go back here; let me erase everything so I can write. First of all, how much memory do we need? How much memory do we need for the first join? B(S). How much for the second join? B(T). Let me write this in a more disciplined way: the memory has to be bigger than the maximum of B(S), B(T), and B(U). Good. Now what is the cost of the first join? We need to read R, there's no escaping that, and we need to read S. But then what do we do? We write V1; I'm going to call its size B(V1). So we write V1, and later we need to read it back in the second step. Then we need to read T, later write V2 and read it back, and then we need to read U. And we don't account for the last write. So the cost is B(R) + B(S) + 2 B(V1) + B(T) + 2 B(V2) + B(U), and it is probably high because of the need to write and read all the intermediate Vs. Good. So the alternative is to pipeline. Here is how pipelining works, and it is really simple, but it's important that we understand the pipelining principle well. I represented it here in pseudo-code. What we do is construct three hash tables in main memory. S, T, and U are blocking: for all three, we can't proceed until we have the entire tables S, T, and U. If they are computed by some sub-plans, we need to wait until those finish. But once we have them, we create three hash tables in main memory, and now we can stream, we can pipeline, the tuples from R. We get one tuple from R, we probe it in the hash table for S, the result we probe in the hash table for T, and that result we probe in the hash table for U. Good. Let's redo the same analysis.
How much main memory do we need here? B(S) plus B(T) plus B(U), exactly. It's no longer the maximum; now it's the sum. So we are more constrained: we need enough main memory to store all three of them. But what is the cost? Exactly: we just need to read each table once, so the cost is B(R) + B(S) + B(T) + B(U), which is much better. There is no need to write intermediate results. So pipelining always makes sense. The only reasons not to pipeline are that we don't have enough main memory, or that the nature of the operator is such that we can't pipeline. For example, for the hash join, we cannot pipeline the inner table, unless we do a double-pipelined join, in which case we can pipeline both tables. OK. If the operator tree is bushy, then you have to think a little bit about how pipelining works. It always works on the left branch. So, for example, here R can be pipelined all the way through, but before we can do this, we need to have S present, we need to have this entire sub-plan present, and we need to have this other entire sub-plan present. And in order to have the middle sub-plan present, well, we need to read its inputs first, and then we can pipeline its left branch. Good. So that was pipelining. I have an extensive example in the slides, which I took from the other textbook, and I invite you to study it at home. It illustrates the trade-off between pipelining and not pipelining, depending on various parameters. I will skip it in class. Good. So that was my short, and hopefully not as confusing as in the textbook, discussion of the algorithms for query optimization. Do remember the System R algorithms; everybody who knows optimizers knows the System R optimization algorithms. And it's OK if you don't know the details of the rule-based algorithms, because they are implementation specific. This was just the second module of the optimizer. The last module is kind of the most difficult one, because little progress has been made here. It's the module responsible for estimating the sizes and, after that, the costs of the various alternative plans. The magic here is to estimate the size, the cardinality. We consider an alternative plan, we look at a fraction of this plan, a sub-plan, and we want to know how many tuples it produces, so we can make some determination about the cost of the next operator that uses this intermediate result. Once we have these sizes, estimating the cost is relatively easy: we know the formulas for each physical operator. The difficult part is estimating the sizes. OK, so let's discuss size estimation. Size estimation starts by collecting statistics on the data. What we discussed in class are simple statistics like B(R), the number of blocks the table uses; T(R), the number of records in R; and the values V(R, A) for attributes A, the number of distinct values of attribute A. In addition, database systems maintain histograms, which are more refined statistics, and we will discuss them in a second. These statistics are never maintained dynamically, for reasons that we will discuss at the end, after histograms. OK, now the question is, how do we use them? And here is where the book actually does a pretty good job; this book describes size estimation much better than other books, so I strongly advise that you read the chapter, I think it's 15, on query optimization. So here is the general problem. We have a select-from-where block, which means a join between multiple tables.
And there are some selection conditions. We have statistics on the base tables, but how do we estimate how many tuples will be returned by such a complicated join? The idea in size estimation, which all database optimizers use, is to start from a reference table, which is the Cartesian product of all the tables. This goes back to the nested loop semantics. Remember the nested loop semantics of SQL queries? The Cartesian product is the total number of iterations you do before you actually check the condition at the bottom, OK? But of course this is just an upper bound. To get an estimate of the actual size, we assume that every condition selects a certain fraction of this upper bound, of the Cartesian product. And then we make an independence assumption: we assume that the different conditions are independent, so we can multiply their selectivities to get the fraction for the entire query. Let me show you an example. Here we join R and S on B, and S and T on C, and then we have a selection on R. Suppose I know that the selectivity of the first join is one third; I'll show you later how to compute this. What it says is that if you ask how big R join S is, the answer is one third of the Cartesian product of R and S. That's what we know. We also know that the selectivity of the join between S and T is one tenth; somehow we figured this out. And we also know that the selectivity of the selection predicate is one half: half of the tuples in R survive. Then we can easily estimate the size of the answer to this query, by assuming that these three conditions are independent. Let me see, do I have this on the next slide? No, I don't; I need to compute it. So what is the result? Well, reading from here, it's going to be 30k times 200k times 10k, so I multiply 30 times 200 times 10 by 10 to the 9, which accounts for the three k's, and then by one third, one tenth, and one half. The small factors cancel to 10 to the 3, so we get 10 to the power 9 plus 3, that is, 10 to the 12 tuples. This is how we do selectivity estimation. Now there are two questions here. One is: how realistic is this independence assumption? And the second is: where do we get these base selectivities from? The answer to the first question is: it's not realistic. And this is where database optimizers are stuck; they don't know how to do any better. We wish we knew how to do better, but we don't. These selection predicates are dependent; we know this, but there are no better techniques to account for this dependence. What I want to discuss next is where we get these base selectivities from. Well, again, we go back to System R, and they had a great idea. They said that if you don't know anything, then the selectivity is 1/10. And this is what database systems do even today: if you don't have any statistics, using 1/10 is better than nothing. But if you do have statistics, here are some better estimates. Suppose you have a selection condition that says A is equal to some constant value c. What fraction of the table will satisfy the condition A = c? Well, remember, V(R, A) is the number of distinct values of attribute A, and therefore the fraction is 1 / V(R, A). That's what we use for the selectivity of the condition A = c.
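To double-check the arithmetic of the worked example above, here is a tiny sketch; the cardinalities and selectivities are the ones quoted from the slide.

```python
# Estimated result size under the independence assumption:
# |R x S x T| times the product of all condition selectivities.
def estimate_size(cardinalities, selectivities):
    size = 1.0
    for t in cardinalities:
        size *= t
    for s in selectivities:
        size *= s
    return size

# 30k x 200k x 10k tuples, selectivities 1/3, 1/10, 1/2.
print(estimate_size([30_000, 200_000, 10_000], [1/3, 1/10, 1/2]))  # 1000000000000.0, i.e. 10^12
```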
What about an inequality predicate, A < c? Well, if you are lucky and you have this information, which a typical database system will maintain, namely the lowest and highest value of the attribute, then you just do a linear interpolation, and you can estimate what fraction of the table satisfies the predicate. Now let's look at joins. Here there is a magic formula, and it requires some explanation; let me explain it next. The formula makes two assumptions, and actually we only need the first one to derive the formula. The assumption says that if you join on A = B and you notice that A has fewer distinct values than B, then you can think of A as probably a foreign key and B as a key. In that case the number of distinct values of A is less than the number of distinct values of B, and we assume that every value of A actually occurs as a value of B. So that's the assumption: if this inequality holds, then every A value in R also occurs in S, which is justified by the key / foreign key scenario. If this holds, here is how we can compute the selectivity of the join. Take any tuple t in R; think of it as carrying a foreign key into S, though it doesn't have to be an actual foreign key. How many tuples t prime in S join with it? Well, t carries one particular value of A, so the matching tuples are exactly a fraction 1 / V(S, B) of S, exactly like a selection with that value as the constant. So that many tuples from S join with one tuple of R. And how many tuples of R do we have? All of them count, because all of them join; that's our assumption. Therefore the estimated size is T(R) times T(S) divided by V(S, B). But that only works if V(R, A) is less than V(S, B); if it's the other way around, we switch the roles. And this is why the size of the join is given by this formula, in which the denominator is the maximum of the two: if the other one is bigger, we need to switch the roles. OK, so that was the magic formula. I'm going to skip this example; please, please, please do it at home. And I want to spend the remaining six or seven minutes discussing histograms. You know this already: database systems maintain histograms on tables, and histograms look like this. Why don't I show you a histogram directly? For example, a histogram on age will have several buckets; these are the buckets of the histogram, and for each age interval it tells you how many tuples fall within that age bracket. OK, so now let's put this to work. How can we use a histogram to estimate more precisely the result of selecting all employees with age 48? What is your estimate for this query? First of all, where do we look up age 48? In this bucket here, right? But not all of the tuples in the bucket will satisfy the selection. What fraction of them will? One tenth, because there are 10 distinct ages in this bucket. So the answer is, let me see if my animation works, yep, one tenth, which is this fraction of the bucket's count. What do you do with a range predicate? It's quite straightforward: you see which buckets the range stretches over, from here to here, and then you estimate from those. Yeah, you just compute it; I'm not going to describe this, it's pretty straightforward.
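Here is a small sketch collecting these estimation formulas: equality via V(R, A), the range predicate via linear interpolation, and the join-size formula with the max in the denominator. The numbers in the example are made up.

```python
# Simple selectivity and join-size estimates, as discussed above.

def eq_selectivity(V):                  # sigma_{A = c}:  1 / V(R, A)
    return 1 / V

def range_selectivity(c, low, high):    # sigma_{A < c}:  linear interpolation between low and high
    return (c - low) / (high - low)

def join_size(T_R, T_S, V_R_A, V_S_B):  # T(R join S) = T(R) * T(S) / max(V(R, A), V(S, B))
    return T_R * T_S / max(V_R_A, V_S_B)

# Made-up example: R has 10,000 tuples with 50 distinct A values;
# S has 200,000 tuples and B is its key (200,000 distinct values).
est = join_size(T_R=10_000, T_S=200_000, V_R_A=50, V_S_B=200_000)
print(est)   # 10000.0 -- each R tuple finds exactly one partner when B is the key of S
```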
So this is why histograms are great: they don't assume uniformity over the data values; they actually account for skew in the data. The main question database systems face is how to construct the histograms. How do you determine the bucket boundaries? The number of buckets is more or less given; I want to postpone the discussion of the number of buckets. The discussion right now is how to determine the bucket boundaries. There are a number of choices, called equi-width, equi-depth, compressed, and v-optimal. Let me go over the first two relatively quickly. The first three, actually. Equi-width means that all the buckets have the same width. That's the first idea you might have: you just make them the same width. But it's not the best histogram, because, look, some buckets are huge; there are many tuples in some of these buckets, and this is where you lose precision, because for a query that selects tuples in such a bucket you don't have much information. In equi-depth, you repartition the buckets so that you have approximately the same number of tuples, ideally exactly the same number of tuples, in each bucket. And the hope is that this gives you better precision. Yes? For equi-width, shouldn't that be 10 to 20, if every bucket has width 10? Do you refer to this one? Oh, yes, you're right, this should be 10 to 20, thank you. And I also realize the numbers at the bottom don't correspond to those at the top. Not at all, yeah. But you get the idea. Then there is something called compressed histograms, which extract the most frequent values and store them separately; that's why they are called compressed. But actually, this is not how systems store histograms. They apply an algorithm that computes so-called v-optimal histograms, which try to minimize the total squared error over all points in the domain. I'm not going to discuss them, but remember the name; they are not even covered in the book, but I did some research on histograms and had to look up this material. There is actually a pretty efficient way to optimize the bucket boundaries, and this is what most database systems use today; it is an optimal way to define the bucket boundaries. OK, but here is what I wanted to get at. Of course, if you have, say, hundreds of thousands of buckets, then the histogram can be pretty accurate, right? But this is not what database systems do. They do not use hundreds of thousands of buckets; they only use, I've been told, hundreds of buckets. That's all. Why? Because updates would change it too often? This actually goes to the second question; maybe we should discuss that one first. When you update the database, doing a delete, an update, or an insert, the histogram is not updated. Why not? We did discuss this in class. There is an obvious reason why you don't want to update the histograms: you're trying to improve performance, not put it in the toilet. Sorry? Trying to improve performance, not destroy it. You're trying to improve performance, of course. But come on, updating a histogram just means going to one bucket and doing a local increment there, every time. It will cause you a lock? Sorry? Because of the locking? Exactly. The histogram now becomes a hotspot: all the transactions, even if they touch completely different records, all want to update the same histogram, and there is huge lock contention. Yes?
But is this even going to help you, even if you didn't have to worry about locking? Oh, you could do that: you could say, I'm not going to lock the histograms, and if some updates are lost, what's the big deal? But once you go down that path, you wonder why you should update the histogram on every update in the first place, if you don't worry about keeping an exact count of the records. You want to be close, but it doesn't have to be exactly the same. Yeah, but I'm not exactly sure whether database systems use this approach. I know what they favor: histograms are updated periodically; they are not maintained incrementally. Maybe they also do what you say, but I know for sure that they are not guaranteed to be in sync with the database. That's an important point, mainly for these reasons: updates are expensive, especially because of concurrency control. So getting back to updates: they are cheap, because they are not done. So then why restrict the number of buckets to just a few hundred? Why is this? Remember the contract that we discussed: the optimizer needs to make a quick decision about which plan is optimal. If the query is a range query and the histogram has hundreds of thousands of buckets, now we need to sum up tens of thousands of numbers. That by itself is not a big deal, but it happens in a context in which you are examining perhaps hundreds of thousands of query plans, and now it becomes a big deal; for each of them you need to inspect the histogram. So no, histograms have to be very small. Plus, they need to have good L2 cache locality: you don't want them to use more space than you have; you would actually like them to fit in your L2 cache. That's why they have to be small. Yes? Is there a theoretical basis for this, or is it something that came about just by testing against some industry benchmark? It's a good question. Is there a theoretical basis for deciding how big the histogram should be? To my knowledge, no. I have worked with people who know this in quite some detail, and they just said hundreds of buckets is the magic number. Which brings me to the last question: multidimensional histograms. There is actually a lot of research on multidimensional histograms. One interesting thing, and I should perhaps not give this away, is that they hold a lot of promise. Because the problem with histograms is that they do not keep track of correlations between attributes. They give you a better estimate for a single selection on one attribute, but when you combine selections on two attributes, you still assume that they are independent. With a two-dimensional histogram, instead of partitioning a single attribute into buckets, you partition two attributes, and you get exactly this, a two-dimensional grid of buckets, and in each bucket you store the number of tuples in that bucket. It's a great idea. Why did database systems not adopt multidimensional histograms? Actually, I should be a little careful here, because DB2 does have multidimensional histograms; it's SQL Server that didn't adopt them. And Surajit Chaudhuri, who is very outspoken about this, well, he of course defends SQL Server. But there are actually good reasons why multidimensional histograms are problematic. Any ideas what the problems are? Actually, there are several. Maybe a size explosion? A size explosion is definitely a problem. You can only afford a few hundred buckets.
But if you want a few hundred buckets on each dimension, now you're talking about tens of thousands of buckets, and you can't afford that. Even in DB2, you only get a few hundred buckets for the entire multidimensional histogram, which means very low precision on each attribute. That is one problem. There is a second, related problem. Think about a table that has, say, 30 attributes. How many one-dimensional histograms do you need to construct? 30. Don't hesitate, just build them all, OK? It's not a big deal. How many two-dimensional histograms would you need to construct? Now we're talking about a much larger number, 30 choose 2, which is 435. So it's no longer clear that you can construct all of them without hesitation. Which ones do you choose? It's not obvious. There is a third reason. I mentioned the v-optimal histograms. It turns out you can compute the optimal histogram efficiently in one dimension, but it's probably NP-hard to construct for multidimensional histograms. So it's not clear where to draw the bucket boundaries, for a theoretical reason. And there is a fascinating fourth reason that was, of course, discovered by people from IBM Almaden, Volker Markl and Peter Haas: if you have multidimensional histograms and a complicated query, you can combine the histograms in multiple ways to estimate your query size, and of course these combinations will give different answers. Which one do you choose? How do you choose? Well, they developed a beautiful theory based on entropy maximization, but it's actually quite expensive to compute. It is difficult to combine overlapping information from histograms when computing query sizes. So all of this makes multidimensional histograms problematic. DB2 does support them, SQL Server does not, and I don't know about Oracle. And it's a fascinating research topic, precisely because they are not yet widely adopted. Good. So this was a long lecture, and I hope everyone who's watching the stream is still awake; if not, you can see it in the archive. So, we discussed query optimization. Keep in mind that there are three important components of a query optimizer. There is the search space; think of it as the algebraic laws, and there is a lot of intelligence here. Study especially the semi-join reductions, the most esoteric of these laws. There is the enumeration algorithm; everybody talks about the System R algorithm, and nobody talks about the rule-based algorithms. And there is the big open problem, size estimation and cost estimation, where histograms are the state of the art, and that's all we can do. The goal of the optimizer is to find some reasonable plan; not necessarily the best plan, but to avoid the worst ones. And please remember to read the paper for next time. And do come to the next lecture; we have another interesting discussion on parallel databases, Bloom filters, and also some of the material that you need for the homework: Pig Latin and MapReduce. Good. Any questions? Then good luck with your homework, and I'll see you on Wednesday. Thanks, Fred.