There's a lot to talk about today with joins, which is a very important topic. Extremely low turnout today, I can only assume because it's, what, 85 degrees in October, which is crazy. I don't remember it being this hot before. And my parents don't believe in global warming, which is crazy. They think it's because the planets are aligning a certain way, and it's all a big hoax. All right.

So when we talk about joins, again, joins are super important. This is where we're going to spend most of our time in an analytical database processing queries, so it's really important for us to get it right. The first question is, why do we even need to join? It's a byproduct of being a relational database system and normalizing our tables: we split them up to reduce the amount of repeated or redundant information, so we break them up into different tables. Foreign keys are a natural way of doing this. Say you have the table of all the orders, and every single order can have multiple order items, so you have a separate table for the order items. Now if you want to run a query that says, give me all the order items for Andy's order, you want to join those two tables together and get all the related data back together.

The way we're going to do this is through a join operator. The join operator allows us to reconstruct the original tuples without any information loss. We want to do this as efficiently as possible because, again, for analytical queries the tables are going to be quite large, so this could take minutes, hours, even days depending on what algorithm we choose.

For this lecture, we're going to focus on joining just two tables at a time, and we're only going to focus on inner equi-joins. Two-table joins are what pretty much every single major open-source and commercial system implements, because it's a natural way to break a query up into join operations. There are algorithms that do multi-way joins, or m-way joins, that take three or more tables and join them at exactly the same time. Those mostly exist in the theoretical world, although there are some high-end systems that do support them. We can cover that in the advanced class, but for our purposes today we want to understand the basics of joining two tables together.

We're also going to focus on inner equi-joins. An equi-join means we take a tuple from one table and check whether there's an equality match with a tuple in the other table. We're not worried about less than or greater than, and we're not worried about anti-joins, like not-equals. The algorithms we'll talk about today can be tweaked to support those other types of joins; it's not a major change, but we're going to focus on equi-joins because that's the most common case. We're also focusing on inner joins because, in addition to being the most common, adding outer-join support to the algorithms we're talking about today is a major change.
In general, going forward for all the algorithms we're going to talk about today, the thing to keep in the back of our mind about how we organize the join operation is that we almost always want to put the smaller table as the left table in the join. Think of the query plan tree: you have the left child and the right child feeding input into the operator, and we'll say the left child is the one we want to be the smaller one. In nested loop joins this is also referred to as the outer table, so even though some algorithms don't have nested loops and therefore don't literally have an outer table, we'll always say outer table to mean the left side.

Before we get to the algorithms themselves, there are some other design decisions we have to make in our database system that affect how all these join operators are going to work. Again, the way to understand a query plan: we take the relational algebra of the SQL query and convert it into a directed graph or tree structure. At the leaf nodes we access the tables, and they feed tuples up as input into their parent operators. So this would be the join operator, this would be the left side and this would be the right side; this would be the outer table and this would be the inner table.

The two design decisions we have to figure out are: what is the output of our join operator, like what are the actual low-level bits we're sending up to our parent, and how do we decide whether one join algorithm implementation is better than another.

For the first one, it depends on what our database system is, how it's implemented, and other factors of its environment. In general, what we're trying to do is say: for every tuple lowercase r in the relation uppercase R, and any tuple in S that matches it, we produce some output because they satisfy the join predicate, and we send that up to the next operator in the tree. At a high level, when we first understood this through relational algebra, when we first talked about joins, we just said it was a concatenation of the tuples of the two tables: you take all the attributes in R, you take all the attributes in S, you mash them together, and that's the output going up. But that may not always be what you want to do in a real system. Theoretically that's okay, but in a real system we have to worry about disk reads and how much memory we're using, so we can be a bit more careful about what we send along.

So what we actually use as the output of this operator, what we send up to the next operator, can depend on the implementation of our query processing model. We'll cover that on Monday next week, but just know that it's not always the case that I send one tuple at a time; I could be sending multiple tuples at a time. It could also depend on my storage model, whether I'm a row-based system or a column-based system. And it also depends on what the query is: depending on what's above me in my query plan, I may not want to send all the attributes from both tables, I may want to send just a subset of them. So the first approach we could do is to actually just send the data.
So we're going to copy the values of the attributes for the tuples that match our join predicate, and produce a new output tuple that we shove up to the next guy. Say our tables look like this, R and S. When we join them, we're just concatenating the attributes of R and appending the attributes of S to them, and that's the result of this join operator. In our query plan tree, the output of this operator would be this entire thing.

The benefit of this approach is that, up above in the tree, we never have to go back and ask for more data from our underlying tables, because everything that came out of R and S is produced in our join output. That's fantastic because we're not going back and reading more stuff after the fact. But it's bad because now we're essentially materializing this one giant tuple. In my example here I only have five attributes, so maybe it's not that big a deal, but if I have 1,000 attributes in R and 1,000 attributes in S, now I have a 2,000-attribute tuple, which gets pretty wide, and I'm copying that as my output going up above, and that can get expensive.

So again, the benefit is that you never have to go back and get more data. You can be a bit smarter and recognize that, for this particular query, for table R I only need the id for the rest of the query plan, so maybe instead of sending up the id plus the name field, I just send up the id. I could do a projection down here to start stripping out things I know I'm not going to need up above. Likewise, as part of the output of this join operator, I can embed or inline a projection and recognize that after I do the join on the id, I don't need the value field and I don't need the join key anymore; the only thing I need is the creation date, the cdate field. So as my output I can do a projection and strip those out, and then I essentially don't have to do the projection above, because it's already done for me as I produce the output tuple in the join.

The other approach is what we talked about before with column stores: we only pass along the bare minimum information we need, the join keys, and we also include the record ID of where to go find the rest of the data in our table. So say we do our join like this: we're only joining on the ids, so the output of the join operator will be just the ids that match, plus the record ID, or tuple ID, which is the page number and offset of where we can find the rest of the data in our database. We pass that up the tree, and then up here, when we say, oh, we also need the creation date, because we have this field we know how to go back to S and get the rest of the data we need.

Again, this is ideal for column stores, because it's very expensive to stitch the tuple back together from all the different columns and put it back into a row-based form as I shove it up the query plan. If I can delay that materialization as much as possible, putting the tuple back together, then I'm not passing much data up above. And furthermore, say I'm feeding up maybe a billion tuples from R and S, but only one or two tuples survive the join; then up above, when I go fetch the creation date, the cdate field, I'm only going and grabbing maybe two pages.
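To make the two output shapes concrete, here's a minimal Python sketch. The column names and values are made up for illustration; the point is only the difference between what gets copied upward in each approach:

```python
# Hypothetical rows: R(r_id, name) and S(s_id, r_id, value, cdate)
r_tuple = {"r_id": 1, "name": "Andy"}
s_tuple = {"s_id": 7, "r_id": 1, "value": 99, "cdate": "2019-10-01"}
s_record_id = ("page_42", 3)  # (page #, slot offset) of s_tuple on disk

# Early materialization: copy the attribute values into the output tuple,
# here with the projection already pushed down (only r_id and cdate survive).
early_output = {"r_id": r_tuple["r_id"], "cdate": s_tuple["cdate"]}

# Late materialization: emit only the join key plus a record ID; a parent
# operator goes back to S later to fetch cdate for the few tuples that survive.
late_output = {"r_id": r_tuple["r_id"], "s_record_id": s_record_id}
```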
So again, this technique is called late materialization. This was in vogue about 12 or 15 years ago when the first column store database systems came out, like Vertica, and it seems like it would be a huge win. But Vertica told me two or three years ago that they actually got rid of this optimization, because it turns out it doesn't really help you: the cost of going back and getting that data later is so expensive that you might as well get everything you need from the get-go and not worry about fetching it afterwards. The data you're going back for could be on another machine, and now you're going over the network to get it, not just hitting local disk. So the first column store systems that came out in the 2000s all promoted this technique; I don't know how many actually still use it today. I know Vertica does not, which was very surprising when they told me.

All right, so this is how we decide what we shove up into the operator tree, and it depends on what our environment looks like and what the query wants to do. The other thing we have to consider is how we determine whether one join algorithm is better than another. The way we're going to do this is with a cost metric based on how many IOs we have to do to compute the join. For the rest of this lecture we'll use this nomenclature: table R has big M pages with a total of little m tuples throughout the entire table, and table S has big N pages with a total of little n tuples. We'll use these variables to determine the IO cost for the various algorithms we look at today.

One point I need to call out is that we're only considering the cost to compute the join, not the cost to produce the final output result, because that's constant across all the different algorithms. If I do a sort-merge join versus a nested loop join on tables R and S, they always produce the exact same result, so that cost is the same for both. Furthermore, for the stuff we're talking about today, we don't know the number of tuples they're going to output, because we know nothing about what the data actually looks like. When we talk about query optimization and query planning, we do have to start making those estimations, because then we need to consider where to place the joins in the query plan. But for now we're just focusing on a single join operator and what the best algorithm for it is, and we're going to base that entirely on the number of IOs to compute the join. Okay?

All right, real quickly, the last thing to talk about is joins versus cross products. We're focusing on inner equi-joins today because that's the most common thing. We're not even going to bother talking about cross products, or cross joins, or Cartesian products, because these are super rare and you never really need to worry about them. And there's nothing you can really do to make a cross product run faster, because it's just two for loops iterating one after the other, and everything matches. It's like a nested loop join without a predicate to check for matches.
The other thing to say, too, is that there are a bunch of techniques we can apply to our join algorithms as we go along today that can make things run faster. But in general, there is no one optimization we can always do that works for every single possible scenario, every dataset, every query. So again, I'm going to teach you the basics of these algorithms. For hash join I'll show one optimization we can do, because I think it's a really useful one. But for our purposes today, we'll focus on what the basic algorithms look like and not try to do any one-off, ad hoc optimizations.

In general, there are three classes of join algorithms. There's the nested loop join, which is the most basic one; every single database system that says it supports joins is, at the very least, going to support something that looks like a nested loop join. Then we'll talk about the sort-merge join, which builds on the sorting stuff we covered last class. And then we'll talk about the hash join, because this is almost always going to be the fastest one, the one we want to use.

So let's start at the beginning: the nested loop join. It's exactly as it sounds, a for loop nested inside another for loop. All you're doing is, for every single tuple in the outer table R, you iterate over every single tuple in the inner table. Then you check the predicates in your WHERE clause, your join clause in your SQL query, and see whether they match. If so, you emit it as output, and it gets buffered as output for the operator above. Again, the parlance for outer versus inner is just as it sounds: the outer table is in the outer for loop, the inner table is in the inner for loop. This will be slightly confusing when we talk about hash joins, because hash joins don't have nested for loops, but we still use the same terms. And in terms of how it looks in the actual query plan, the left input into the join operator is the outer table and the right input is the inner table.

All right. So this is like the dumbest way to do a join. Why? For every single tuple in R, you have to load all of S. Absolutely correct. He says that for every single tuple in R, we've got to go scan S and bring all that back into memory, every single time. We're not doing anything about locality. We don't know anything about pages or blocks at this point; we're literally saying, for every single tuple in my outer table, let me go fetch the pages that have the tuples of my inner table. So it's super dumb and super expensive.

The cost for this, using the variables we defined before, is M + (m × N). Big M is the number of pages in the outer table R, so we have to read every page once. And then the second term is: for every single tuple in the outer table, we have to fetch every single page of the inner table. That's the little m times big N part.
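Here's a minimal sketch of the tuple-at-a-time nested loop join just described. The page iterators and the predicate are placeholders for illustration; a real system would be reading pages through the buffer pool:

```python
def naive_nested_loop_join(outer_pages, inner_pages, match):
    """Tuple-at-a-time nested loop join.
    IO cost: M + (m * N) -- every tuple of the outer table R triggers
    a full re-scan of every page of the inner table S."""
    for outer_page in outer_pages():              # M page reads total
        for r in outer_page:                      # m tuples total
            for inner_page in inner_pages():      # N page reads, repeated per outer tuple
                for s in inner_page:
                    if match(r, s):               # the join predicate from the WHERE clause
                        yield (r, s)              # emit to the parent operator
```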
This probably seems abstract because it's just a bunch of variables, so let's put some actual numbers in and see how bad this is. Say table R has 1,000 pages with a total of 100,000 tuples, and table S has 500 pages with a total of 40,000 tuples. We plug and chug the values into these variables, and we see that M + (m × N) comes out to about 50 million IOs. Say we have a speedy SSD that can do an IO in one-tenth of a millisecond, roughly 100,000 to 200,000 nanoseconds per IO, which is about what a standard consumer-grade SSD can do; you can pay a bit more money and get faster. If you take 0.1 milliseconds times 50 million IOs, it's 1.3 hours to do that join.

All right, so one optimization we can do to try to speed this up, as I said in the beginning: the smaller table should be the outer table. If we do that and rerun the formula, now we're at 1.1 hours. Not much better, but it is slightly faster.

So this is super stupid. This is the worst-case scenario, because if we assume we're using four-kilobyte pages, then if you do the math, the size of these two tables combined is about six megabytes. That can sit in L3 cache. So for something that could fit in my CPU cache, if I have to go to disk to get it, it's taking over an hour. You could do this join as two nested for loops in memory in microseconds; it should be super fast. But if we have to go to disk and we're not smart about how we go to disk, we pay a huge penalty.

So one way to improve the stupid nested loop join is to be mindful that we have blocks, we have pages, and we can pack multiple tuples into our pages. Now if we say we'll have one block for the outer table and one block for the inner table, then for every block we fetch from the outer table, we fetch one block at a time from the inner table, and we do our join for every tuple in the outer block against all the tuples in the inner block. We don't go to the next block of the inner table until we've completed the evaluation for all the tuples in the outer block.

So this is a little bit better. Assuming we have one buffer block for the outer and one for the inner, our cost is now M + (M × N). Before, the second term used little m, because for every single tuple in the outer table we fetched every single page of the inner table; now, for every single page in the outer table, we fetch every single page of the inner table. And again, what should be the outer table? The smaller one, but in terms of pages, not tuples. It could be that one table has fewer tuples but more pages than the other; we always want to pick based on which one has the fewer pages.

Going back to our example from before, plug and chug these numbers, and now we can do our join in about 50 seconds. Still bad, don't get the wrong idea, this is still terrible: joining six megabytes should not take 50 seconds. But we're not at the one hour we were at before. Just by being smart about doing sequential access, getting multiple tuples per IO, making that simple change to our nested loop algorithm, we cut it down from over an hour to under a minute.
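And a sketch of the block (page-at-a-time) variant just described, with the running example's IO arithmetic in the comments. The generalization to B − 2 buffer pages in the next step has exactly the same structure, just with a chunk of outer pages in place of a single page:

```python
def block_nested_loop_join(outer_pages, inner_pages, match):
    """Block nested loop join with one buffer page per table.
    IO cost: M + (M * N) -- the inner table is re-scanned once per OUTER PAGE,
    not once per outer tuple.
    Running example (M = 1,000 pages, N = 500 pages, 0.1 ms per IO):
      naive: 1,000 + 100,000 * 500 = ~50,000,000 IOs  (~1.3 hours)
      block: 1,000 +   1,000 * 500 =     501,000 IOs  (~50 seconds)"""
    for outer_page in outer_pages():          # M page reads
        for inner_page in inner_pages():      # N page reads per outer page
            for r in outer_page:
                for s in inner_page:
                    if match(r, s):
                        yield (r, s)
```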
So what if we generalize this? Instead of having one block for the outer and one block for the inner, what if we had multiple blocks? The way this works is: for the outer relation, we buffer as much as possible in memory, B minus two blocks, and we retain one block for the inner relation and one block for the output result, and we can do a bit better. The basic algorithm looks like this: for each chunk of B minus two blocks of the outer relation, I fetch one block of the inner relation at a time, do the scan across the two of them, and when I'm done with that inner block I get the next one. When I've gone through all the inner blocks, I go get the next B minus two blocks of the outer table.

Now, plugging and chugging the math, instead of scanning the inner table M times, we scan it ⌈M / (B − 2)⌉ times, because that's how many chunks of B minus two blocks we can divide the outer table into. You take the ceiling because the last chunk may not be exactly B minus two blocks, so you always round up.

All right, so what happens if the outer relation fits entirely in main memory, meaning the number of buffers we're allowed to have is at least M plus two? Again, the plus two is one for the inner table and one for the output result. If B minus two is at least the size of the outer relation, then we're golden, because all we need to do is fetch the outer relation once, bring it into memory, and then scan through the inner relation once. Then it's M + N. Now I'm at 1,500 IOs, which is about 150 milliseconds. That's starting to get more realistic, right? This is the best-case scenario: if you have enough memory to fit the outer relation, the nested loop join is probably going to be okay for you. But of course, if your database is terabytes or petabytes, you can't do that.

So in general, why does the nested loop join suck? Because it's basically a brute-force search. All we're doing is sequential scans on the outer relation and the inner relation. We know nothing about the locality of the data, we know nothing about the data we're looking at; we're blindly looking for matches. As we said before, sequential scans are always the fallback option when we don't have an index and can't do anything smart. But if we can be smarter, like if we know we have an index, or, as we'll see in a second with the sort-merge join, if we know that things are sorted, then we can make these scans be a bit smarter.

One of the things the database system can do is recognize that if you have an index on the keys you want to join on, at least for the inner table, then you can use that in the inner loop instead of doing a sequential scan every single time. There are two ways you could get this. One is that it's already available because the application created it for you; then you're golden, you just use it. That's going to be very common in OLTP workloads, because as we said, if you have foreign keys, you need an index to enforce the foreign key constraint, so you would use that to find the thing you're looking for. The other way is that some systems can build an index on the fly.
That's essentially what a hash join is going to be, as we'll see later on. But other systems, like SQL Server, can actually build a B+tree on the fly, called a spooling index, during the query: they run the query, do the join using that index, and when the query is done they just throw it away. The idea is that the cost of doing the sequential scan is so expensive that it's better to build an ephemeral index right now, do my query, and then throw it away.

So let's see how we do an index nested loop join. All we're doing is: we still have to do the sequential scan on the outer relation, and again we can use additional buffer blocks so we don't have to do an IO for every single tuple, but in the inner part, the inner for loop, we now probe the index, and if we find a match we produce it as output. Now, the index does not need to be exactly on our join key. If we're trying to do a join on column A and column B, and we have an index on just A, we can still do the index probe to restrict the number of tuples we need to evaluate on attribute A, and then once we get the output we do the additional check to see whether B matches as well. So the index doesn't need to be an exact match.

So what's the cost of doing this? Well, it depends on what the index looks like. We'll just represent the cost of probing the index with some constant C, because if it's a hash table, then best case it's O(1); if it's a B+tree, it's logarithmic. We've replaced having to access every single page in the inner relation, that big N, with this constant, so the cost is roughly M + (m × C), and in general that's going to be much faster.

So this is all you really need to know about the nested loop join. Like I said, it's the brute-force approach, the simplest thing. If a newer database system says it supports joins, most likely it's doing something like this under the hood. The main things to be mindful of: always pick the smaller table as the outer relation; try to buffer as much of the outer table as possible in memory to reduce the amount of redundant IO on the inner table; and if possible, if we have an index on the inner table, use it. Otherwise, we fall back to a sequential scan. Okay?

All right. So again, this is the dumb approach: other than having an index, we don't know anything about the data, and we're always doing a brute-force search. Let's try to be a bit smarter. That's what the sort-merge join tries to do. As I said last class, this is super confusing, because I'm going to teach you the sort-merge join algorithm, and in the sort phase of the sort-merge join it can use the external merge sort we talked about last time, but the external merge sort has its own merge phase, which is different from this merge phase. The only thing to be mindful of is: in the sort phase we just use the external merge sort from last class, or quicksort if everything fits in memory, and then the merge process here is different from what we did before. Again, two phases: sort first, spilling to disk if necessary.
Then in the merge phase, as we'll see in the example, we walk through the two sorted tables and do comparisons across them to see whether we have a match. On the outer relation we only have to look at each tuple once. On the inner relation we may have to backtrack and look at the same tuples multiple times, but we never have to jump back to the very beginning the way you do with a sequential scan. That's the advantage we get by sorting ahead of time.

So this is an approximation of the algorithm. The basic way it works is that after sorting, we have two cursors, one on the outer table and one on the inner table, and they walk step by step through the tuples. At each iteration, if the outer relation cursor points at a tuple whose value is greater than the inner one, we increment the inner cursor. If the outer is less than the inner, we increment the outer. If we have a match, we produce it as output and then increment the inner.

Reading code like this is difficult, so let's do a visual example. We have two tables, and we're going to join R and S on the id column. In the very first step we do the sorting; again, that's just the external merge sort or quicksort, depending on whether things fit in memory. Now our result is sorted, and what we want to do is have our cursors walk through these two tables. We start at the very beginning, with a cursor on R and a cursor on S. The first thing they do is look at the value they're joining on, the id field. In this case, the tuple the outer relation cursor points to has the value 100, and the inner relation cursor's value is 100. That's a match, so we produce the combined tuple as output for our join. At this point we increment the inner relation cursor and move it down by one. Now we look at 100 again, and again that matches what we're pointing at, so we produce another output, and we increment it again. Now it's 200. At this point, 200 is greater than 100, so we increment the outer relation cursor.

So in this case, when we evaluate whether 200 equals 200, which it does, and produce our output, we don't need to go back and look at anything else, because we know the cursor is at 200 and it's the first time we've seen the value 200 on this side of the table, so I know I don't need to look at anything above it. If you were doing a nested loop join, you wouldn't know that; you'd have to scan all the way from the beginning. But because we pre-sorted everything, we can say that everything above me doesn't need to be examined to do this join. That's the benefit we get with the sort-merge join over the nested loop join.

Now let's look at a case where we have to do backtracking. The outer relation is at 200, the inner relation is at 200, and we've already produced a match. Now we increment the inner relation, and it's 400. 400 is greater than 200, so we increment the outer side. But now we have 200 again.
If we just kept going down one by one, we would miss the match with the other 200 we had up here. So we have to maintain some metadata on this side that says: the last value I just looked at was 200, so if I increment the outer cursor and it matches the last value I just saw, then I know I need to go back to the first place I saw that value on the inner relation. Then I can do my join and get the match. Everything else proceeds as before: I increment back to 400, and 400 is once again greater than 200, so we increment this guy. Again, the main thing I'm stressing here is that we may have to backtrack on the inner relation, but we never backtrack on the outer relation. We examine the outer relation only once.

So 300 is less than 400; they're not equal, so we increment the outer relation. Now 400 equals 400, that's a match, and we increment the inner relation. 500 is greater than 400; that doesn't match, and since it's greater, we increment the outer side. Now we have a match on 500, so we increment the inner, and now we reach the end. We can't necessarily stop here, because we don't know what's coming below us on the outer relation; the next tuple might actually be 500, and we'd have to go back up to the starting point where 500 is. But in this case, for this example, we keep going until eventually both cursors reach the end, and then the join is complete.

Yes? [Student question about how to keep track of the backtrack position when there are multiple duplicates on the inner table.] His question is, going back to this backtrack part, how do I keep track that I saw 200 before, so that when this guy sees 200 I know I need to backtrack? You just record, for the last value that's different from the current value you're pointing at, the starting location. So if you had a bunch of 200s here, you know you have to jump back to the very first one.

Yes? [Student: if, in my metadata, I'm storing the last value, and I've reached the end of the second table, and in my first table I get a value that is greater than the last value stored, then I can terminate, right? I don't need to go to the end.] Yeah, so his point, which is correct, is: if I'm down here, I've finished the last one, and the last thing I saw was 500, then over here, if I get to 600, I know that 600 is greater than 500, and there can never be anything below me that's going to match it, so I could just terminate here. Yes, you could do that.
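Here's a minimal sketch of the merge phase just walked through, including the backtracking on the inner side. It assumes both inputs are already-sorted in-memory lists (stand-ins for the sorted runs a real system streams from disk) and that `key` extracts the join attribute:

```python
def merge_phase(outer, inner, key):
    """Merge step of a sort-merge join over two pre-sorted inputs.
    The outer cursor only moves forward; the inner cursor may backtrack
    to the first tuple of the current run of duplicate join keys."""
    i, j = 0, 0
    mark = None  # index where the current run of equal inner keys begins
    while i < len(outer):
        if mark is None:
            # advance the inner cursor until its key is >= the outer key
            while j < len(inner) and key(outer[i]) > key(inner[j]):
                j += 1
            if j == len(inner):
                break                      # inner exhausted with no run in progress
            if key(outer[i]) < key(inner[j]):
                i += 1                     # no match for this outer tuple
                continue
            mark = j                       # remember the start of the matching run
        if j < len(inner) and key(outer[i]) == key(inner[j]):
            yield (outer[i], inner[j])
            j += 1
        else:
            j = mark                       # backtrack for the next outer duplicate
            i += 1
            mark = None
```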
So what's the cost of doing this? Well, the sort costs on the outer and the inner tables are just the external merge sort costs we talked about before, assuming we have to spill to disk. The merge cost is roughly M + N: in the best case, after sorting, I read every page of the outer table once and every page of the inner table once. Now, I just showed with the backtracking case that that's not exactly true, because if 400 and 500 are on one page and I have to backtrack to 200 on the previous page, I have to fetch that page again. But we can't compute that for this example, because we don't know the layout of the data, so we're just going to simplify and say it's M + N. So the total cost of the sort-merge join is the cost of the sort phase, with whichever sort algorithm you use, plus the cost of the merge phase, which we approximate as M + N.

Now, if we say we have 100 buffer pages for our simple example, then we can sort R and S in two passes; that's just using the formula from last class. The merge cost is then reading the 1,000 pages of the outer plus the 500 pages of the inner, which is 1,500. So you take the cost of sorting R, 3,000, sorting S, 1,350, and the merge pass, 1,500, and you get 5,850 IOs, which is roughly 0.59 seconds, 590 milliseconds. The block-based nested loop join got us down to 50 seconds; now we're under a second. This is starting to look reasonable.

The worst-case scenario for sort-merge, which is rare but could happen, is that every single value in the join column of the outer table is exactly the same as every single value in the inner table, say every value is just one. Then sorting is wasted time, because you get no benefit from it; it's the same column of ones all over again, and you pay the cost of essentially reverting back to a nested loop join. It degenerates into the Cartesian product, just two nested for loops. But this is rare. People do stupid things in databases, but this one's pretty stupid, and the database system can recognize, oh, I have only one value for this column, don't even bother doing the sort-merge.

So when is the sort-merge join actually useful? Well, if the two tables are already sorted on the join key, we're golden, because now we don't even have a sort cost. This is that clustered index stuff I talked about before: if I'm joining on an id attribute and I have a clustered index where the table is sorted on that id, then I have no sort phase; the data is exactly where I want it, and my cursors just walk through in lockstep with each other. It's also super helpful when the query contains an ORDER BY clause and the ORDER BY sorts the result on the same keys you're joining on. Then I'm getting a two-for-one, because I do my sort-merge join and the output is already sorted the way the ORDER BY wants it, so I don't even have to evaluate that ORDER BY separately. The database can recognize that, because SQL is declarative: you tell it how you want the result sorted, and it can see, oh, you want to sort on this key and you also want to join on this key, so rather than doing a nested loop join or hash join followed by a sort for the ORDER BY, I just do the sort-merge join and cut out that extra operator entirely, and that runs way faster. Again, it's all the same things we talked about last class: if we have an index that's already sorted the way we want and it's clustered, we can just use it; otherwise, we fall back to the external merge sort.

Yes? [Student question about where the output of the sort lives.] We'll talk about this more on Monday next week, but the question is, where is the output of the sorting?
Is it cached? It's an intermediate result for the query, so it's backed by the buffer pool. If the buffer pool has to spill to disk because the data set is too large, we can already handle that; that's why we picked the external merge sort, because it tries to maximize the amount of sequential IO we do on disk. So it's cached, and it's specific to the one query that's running it. I think we talked about scan sharing a little bit: if we recognize that two queries want to sort the same data at the exact same time in the same way, we could piggyback and do it once and share it across the two of them. The high-end systems will do that; MySQL and Postgres cannot.

So sort-merge is super useful. Postgres supports it; the big database systems support it. The smaller, newer, embedded database systems usually don't; they usually support the nested loop join, and then if they get their shit together they can support hash join. But not everyone can do that.

All right, so let's talk about the hash join. This is going to be the most important algorithm we use to do joins, because in general it gives the best performance. For a large data set, this is pretty much always what you're going to want to do. The basic insight of how the hash join works is similar to the hash-based aggregation at the end of last class: our hash function is deterministic, meaning for the same input the hash value is always the same. That means if a value in the outer table and a value in the inner table are equal, they hash to the same thing, so we can partition and split things up such that we only have to examine things within the same hash bucket. It's a divide-and-conquer approach.

So that's the basic idea: we split the outer relation up into partitions based on the hash of the join key. We'll get to the details in a second, but if everything fits in memory, we can use the linear probing hash table we talked about before; if we have to spill to disk, then we can do recursive partitioning with the bucket-chained hash table that we also talked about before. Again, the idea is that if tuples are in the same partition because they hashed to the same location, then I only need to worry about the guys in my same partition; I don't have to look across the entire table. We're paying an upfront cost to split the data up so that the search, the probing process, runs much faster.

In the first phase, the build phase, you take the outer relation, do a sequential scan on it, and populate a hash table. In the second phase, the probe phase, you do a sequential scan on the inner relation and, using the same hash function, probe into the hash table you built in the first phase, look for a match, and if you have one, produce it as output.

At a high level it looks like this. A quick note about inner versus outer tables: in this one we don't really have nested for loops; we have a for loop to build the hash table and a for loop to do the probe. But we still refer to the relation we build the hash table on as the outer relation, just to keep everything consistent.
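A minimal sketch of the in-memory build and probe phases just described, assuming the whole hash table fits in memory; the key-extraction functions and table inputs are placeholders:

```python
from collections import defaultdict

def hash_join(outer, inner, outer_key, inner_key):
    """Simple in-memory hash join.
    Build phase: scan the outer relation R once and hash it on the join key.
    Probe phase: scan the inner relation S once and probe the hash table."""
    # Build phase: join key -> list of matching outer tuples (handles duplicates)
    table = defaultdict(list)
    for r in outer:
        table[outer_key(r)].append(r)

    # Probe phase: one sequential pass over the inner relation
    for s in inner:
        for r in table.get(inner_key(s), []):
            yield (r, s)
```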
So in the first phase we populate this hash table: we just do a sequential scan on this guy and insert into the hash table. Then in the second phase we do a sequential scan on the inner relation and probe into it. It doesn't matter what hash table implementation we're using; we know how to find an exact match, and if we find one, we produce it as output. Pretty straightforward.

The key is just whatever attribute you're joining on. The value depends on how you want to implement your hash table in your system, and as we said before, it can depend on what the output is going to be, what information is needed up above in the query plan; that determines what you actually want to store. This is the classic trade-off between storage and compute in computer science. We could store the full tuple, because that's everything we need to produce the output up above, plus everything we need to do our join, right there in the hash table. But of course that makes our hash table much larger, which means we may have to spill to disk more; at least computationally it's faster to find exactly what we want, because we jump into the hash table and have everything we need right there. The other approach is to do something like the late materialization approach, where we just store the tuple identifier: we hash into the hash table, scan until we find the key we want, and then follow the tuple identifier to get the rest of the information we need. For column stores this approach is usually better, because the hash table is much smaller; storing the full tuple is usually better for row stores, because you have all the data you need and don't have to go back and fetch entire pages containing the whole tuple all over again. Okay?

So the one other optimization we can do, and other than spilling to disk this is the only optimization we'll talk about for joins today, is for the probe side. In the build phase, as we build the hash table, we can also build an auxiliary data structure, a filter, that can tell us whether the tuple we're looking for is even going to be in the hash table, without actually having to look inside it. To do this, we can build a Bloom filter. Does everyone know what a Bloom filter is? Who does not know what a Bloom filter is? Okay, I have backup slides for this, for this very reason; I don't know what people's background is for this kind of algorithms material, so let me teach you very quickly.

All right, a Bloom filter is a super useful data structure that you're going to come across throughout your entire life. It's awesome. It was invented in the 1970s; the guy who invented it is named Bloom, and that's why it's called that. It's a probabilistic data structure, a bitmap, that can answer set membership questions. A set membership question would be something like: does this key exist in my set?
And it'll come back and say yes or no. It can't tell you where to find it; it's not an index, it's a filter, it just tells you yes or no. The interesting thing about it is that it's a probabilistic, or approximate, data structure, so it can give you false positives. If you ask whether a key exists and it says no, then you know that's actually true. But if you ask whether a key exists and it comes back and says yes, it may be lying to you, and then you've got to go check something else to see whether that's true. It only has two operations; the basic Bloom filter can only do two things: you can insert a key and you can do a lookup on a key. You can't delete a key.

So here's how it works. It's just a bitmap. Say this is a really simple 8-bit Bloom filter. When we want to insert a key, like RZA from the Wu-Tang Clan, we hash it multiple times, and whatever hash values we get, we mod them by the number of bits we have, and that gives us locations in our bitmap. In this case, the first hash mod 8 goes to 6, so that goes to that location, and the second hash mod 8 goes to 4, and that goes to that location, and we set those bits to 1. Now we insert GZA: same thing, we hash it, we get 3 for the first hash function and 1 for the second, we jump into the bitmap and flip those to 1. This is super fast; we can do it extremely quickly because it all hangs out in CPU caches.

Now we do a lookup and look for Raekwon the Chef. We hash this and get 5 and 3: position 5 is 0, and position 3 is 1. Because not all of the locations in our bitmap are 1, we know this key cannot exist, so we get false, and that's correct. This is why you never get a false negative. But we may look up ODB, and that hashes to 3 and 6; those two locations were populated before by GZA and RZA, but we never actually inserted ODB. So here's where we get a false positive: the Bloom filter comes back and tells us the key exists when it actually doesn't. Okay?

Bloom filters are awesome. They're super useful for a lot of things, and you can take a billion-key data set and boil it down to a couple of kilobytes with a Bloom filter.
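Here's a minimal Bloom filter sketch matching the example above: an m-bit array with k hash functions, insert and lookup only. Using Python's built-in `hash` salted with an index is just a stand-in; a real system would use proper hash functions like MurmurHash or xxHash:

```python
class BloomFilter:
    def __init__(self, num_bits=8, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, key):
        # k "different" hash functions, simulated here by salting with i
        return [hash((i, key)) % self.num_bits for i in range(self.num_hashes)]

    def insert(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def may_contain(self, key):
        # False means definitely not present; True means "probably" present
        # (possibly a false positive), so the caller still checks the hash table.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.insert("RZA")
bf.insert("GZA")
print(bf.may_contain("Raekwon"))  # most likely False: skip the hash table probe
```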
Yes? [Student question: can you delete a key?] No, because what would happen? Going back to this case here: RZA went to 6 and 4, so we flipped those bits, and GZA went to 3 and 1. This is a bad example, but we could have another key that hashes to, say, 1 and 2, and now if we want to delete it, we don't know whether that 1 at position 1 was set only by our key or also by GZA. You could turn the bitmap into an array of counters instead, and then you can do deletes, but now it's getting larger, and we want something really small and fast. So that's what a Bloom filter is.

Yes? [Student question: how big do you initialize the Bloom filter to be?] It depends on the size of the data set, but kilobytes, if that; they don't need to be very big. You can also vary the number of hash functions you use, and that, along with the size, determines your false positive rate. The larger the Bloom filter and the more hash functions you use, the better the false positive rate; you can get it down to something like ten bits per key, it's super small. These are going to be used in all sorts of other parts of database systems that we can talk about later on, but for our purposes here we're using them for joins. Again, the difference between this and an index is that this just tells you whether something exists; an index tells you that it exists and where to go find it. I'm glad I included those slides, because I wasn't sure who had seen Bloom filters before.

So the optimization we're going to do with our Bloom filter is: as we're building our hash table, which is going to be large and could spill to disk, we also build a Bloom filter over all our keys, which is going to be super small and can fit in memory. As we populate the hash table, we build the Bloom filter, and then when we do our probe, we pass the Bloom filter over to this guy, and before we probe the hash table, we probe the Bloom filter that's sitting in memory, which is super fast. If our key doesn't match anything in the hash table, the Bloom filter will say you don't have a match, and we stop right there and avoid having to do the hash table lookup, which could be disk IOs to find the thing we want. Otherwise, if it comes back and says true, we have to go check the hash table, because it might have been telling us something incorrect, a false positive.

Yes? [Student question: how many hash functions do you use for the Bloom filter?] It depends on how you configure it; my example showed two, other ones could have more, and it depends on how large you make it as well. I actually don't know what a typical default is; it's probably something like four.

[Student question: could it be the case that the Bloom filter has every bit set, and therefore everything evaluates to true?] Yes, and that's where the sizing comes in. We'll talk about query planning in two weeks, but one of the things the optimizer can do is estimate the distribution of this column; you need to know how to size your hash table anyway, and you'd say, I think my key distribution looks like this, so a Bloom filter of this size is what I want, to avoid that issue where everything is set to one. But even with just a couple of kilobytes it's still going to produce pretty good results.

So this is sometimes called sideways information passing. The high-end systems can do this kind of stuff. We'll talk about distributed databases later in the semester, but you can imagine that maybe the two tables, A and B, are on different
machines, or in different data centers. So rather than me having to send messages over the network to do a probe in the hash join, if I can just send over a couple of kilobytes of a Bloom filter to the other machine, then I can do even more filtering on that side before I start going over the network. That's a huge win. The reason it's called sideways information passing is that this sort of breaks our model of how query operators execute, where they have these discrete channels that only send data up from the child to the parent and never between siblings. This violates that, but it's a big win, so it's a good idea.

So let's finish off by talking about hash joins that don't fit in memory. If everything fits in memory, then we just want to use a linear probing hash table: we can approximate the size of the hash table we need based on what the input data looks like, it fits in memory, and it's going to be really fast. The issue is that if we have to start spilling to disk, the hash table is going to be terrible for us, because now it's random IO: we take every single key, hash it to some slot location in our hash table, and every single key could mean another page lookup. What we want to do is convert that random access pattern, which is the worst thing for our database system, into something more sequential. It's the same idea we applied for the external merge sort, and the same idea we used for hash-based aggregation when we spilled to disk.

The technique we're going to use is called the Grace hash join; sometimes it's called the partitioned hash join. I think the textbook refers to it as the Grace hash join. This is a technique developed to do hash joins when things don't fit in memory. The term Grace comes from an academic project at the University of Tokyo in the 1980s: they built something called the GRACE database machine. That project obviously doesn't exist anymore, but they had a paper that came out as part of it that talked about how to do a hash join when things don't fit in memory, and for whatever reason the term stuck, and it refers to what I'm about to describe as the Grace hash join.

Who here has ever heard of the term database machine, or database appliance? Perfect, okay. A database machine, or database appliance, is specialized hardware that you buy for a database system. Think about it: right now, when you want to run Postgres, what do you do? You spin up an instance on EC2, download Postgres, and set it up and configure it for your instance size. You can get RDS, which is already pre-configured, but in general most people deploying these databases themselves are doing it on their own hardware. The idea of a database appliance is that you can buy hardware from a vendor that's already been set up and tuned for a database system, so you don't have to worry about setting anything up yourself. All the high-end companies will sell you very expensive, very nice enterprise servers tuned specifically for their particular database system. IBM has this thing called Netezza, which they've sort of killed off, but they would sell you a rack machine that had Netezza already set up for you. Clustrix was a startup that came out of AOL, and they got bought by Percona last year,
or this year, and they used to sell a version of MySQL that would run on their specialized hardware. The most famous one, and probably the most expensive one, is Oracle's Exadata, and this is everywhere: you buy these huge rack machines that have the Oracle data warehouse running inside them for you. We're talking millions and millions of dollars; there are places that spend a hundred million dollars a year running Exadata. It's very expensive. So a database machine is an appliance tuned specifically for a database system, plus special hardware, like custom ASICs, that exists just for running that database system. The GRACE database machine they built in the 1980s had special hardware to do hash joins very efficiently. This is sort of how people built databases in the 1980s, and then it all went out of vogue; everybody wants to run on commodity hardware now, because by the time you designed your custom hardware for your database system and actually fabbed and manufactured it, Intel or whoever came out with new chips that already ran faster than what you started with, and you lost all the benefit. So most database systems run on commodity hardware, other than this super high-end stuff from Oracle. There are some newer startups that have come out in the last year; this is a slide from Yellowbrick, a newer database appliance vendor that sells specialized flash controllers running their particular database system. But most people don't run this kind of stuff unless you have a s***load of money.

Okay, so the Grace hash join has two parts. In the build phase, we're going to split up both tables based on the hash of the join key and write them out to partitions. In the regular hash join we only hashed one side, built a hash table for it, and then probed it from the other side; now we build two separate partitioned hash tables, one per side, and then do a nested loop join between the partitions that match. I'll show what that looks like on the next slide. So again, for the outer table, we hash all its values and populate this guy. This won't be a linear probing hash table; it will be a bucket-chained hash table, because we want everything that hashes to the same location to get mapped to the same partition, the same set of pages. We don't want something that hashed here landing down there. Same thing on the other side: hash all the values, produce a hash table, and now we just number these buckets as our partitions.

Then, in the probe phase, when we do our join, after building the hash tables, we take all the buckets within one partition on each side and just do a nested for loop between them. The idea is that because we already partitioned both sides with the same hash function at the very beginning, we know that any tuple that could match a tuple on this side of the join, in this bucket, can only exist in the corresponding partition on the other side; it can't exist anywhere down here. So when we scan everything here, we don't need to look at anything else. It's the same idea as the sort-merge join: because we did the work ahead of time, we know the boundaries where there could be possible matches for tuples in the outer table.
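A minimal sketch of the partitioned (Grace) hash join just described: hash both tables into matching partitions with the same hash function, then join each pair of partitions independently. In a real system each partition is a set of pages written to disk; Python lists stand in for those here, and the per-partition join reuses the in-memory `hash_join` sketched earlier (a nested loop over the buckets works just as well):

```python
def grace_hash_join(outer, inner, outer_key, inner_key, num_partitions=8):
    """Partition phase: split BOTH tables on the join key with the same hash
    function, so matching tuples always land in the same partition number.
    Probe phase: join each pair of partitions independently."""
    outer_parts = [[] for _ in range(num_partitions)]
    inner_parts = [[] for _ in range(num_partitions)]
    for r in outer:
        outer_parts[hash(outer_key(r)) % num_partitions].append(r)
    for s in inner:
        inner_parts[hash(inner_key(s)) % num_partitions].append(s)

    for p in range(num_partitions):
        # A tuple in outer partition p can only match tuples in inner partition p.
        yield from hash_join(outer_parts[p], inner_parts[p], outer_key, inner_key)
```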
Yes? Yeah, so the question is, can collisions occur, since two different values could hash to the same thing? Well, if I'm doing a brute-force search through the bucket in memory, then who cares if there are collisions. Now, if everything collides into the same thing and this starts spilling to disk, then we have a problem, and we can handle that through what's called recursive partitioning. This is the same sort of technique I mentioned last time with hash-based aggregation, although we didn't go into the details of it. Basically, we can recognize that if we start spilling the buckets within a given partition, adding more and more buckets as the chain keeps getting longer and longer because of collisions, then we can just do another round of partitioning and split things up into even more buckets, even more sub-partitions. The idea is that when we then do that nested loop join, everything fits in memory. So the question is, does the one-to-one mapping between partitions still happen? Yes; next slide.

Okay, let's see how we do this. Again, this is on the outer table: we run the hash function and we create a bunch of buckets in each partition. Say for this one the chain gets super long and we keep spilling out to more and more buckets. If we recognize that we've gone past some threshold, some watermark that says we've spilled to too many pages, too many buckets, we can then run another hash function on this partition and split it into even more sub-pages, sub-buckets. So for the first hash function we have a bunch of keys mapped to partition one, and then that overflows; so for partition one we run another hash function (it's the same hash function, just with a different seed value) and split it out into more and more buckets. Now, when we do the probe on the inner relation, if we hash to anything that has not been split before, we have some metadata that says: if you're going to partition zero or partition n, the first hash function is fine, so I can find exactly what I'm looking for across these pages. If instead I hash to partition one, then I recognize, oh, I had to split that on the build side for the outer relation, so let me go ahead and run the second hash function, and then I'll find out where I really need to go. You can keep doing this over and over again until things fit in memory. Worst-case scenario, the attribute you're joining on only has one value, so everything is always going to hash to the same thing; in that scenario recursive partitioning is just wasted time, and you fall back to the nested loop join, because there's no join algorithm that can make that run faster. So again, we apply the same technique as the hash-based aggregation we talked about last time: if our buckets get too full, then we just do another round of partitioning.
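Here is a minimal sketch of that recursive partitioning idea, under stated assumptions: the threshold, the fan-out, and the seed handling are all made up for illustration, and the nested result structure stands in for the metadata the probe side would need in order to know which partitions were re-split and with which seed.

```python
# Hedged sketch of recursive partitioning. Illustrative only: the threshold
# stands in for "how many pages fit in memory", and the seed plays the role
# of "same hash function, different seed value" from the discussion above.
MAX_TUPLES_PER_PARTITION = 2   # made-up stand-in for the spill watermark
FANOUT = 2                     # made-up number of partitions per split

def hash_key(key, seed):
    return hash((seed, key)) % FANOUT

def recursive_partition(tuples, seed=0):
    parts = [[] for _ in range(FANOUT)]
    for t in tuples:
        parts[hash_key(t[0], seed)].append(t)
    out = []
    for part in parts:
        distinct_keys = {t[0] for t in part}
        if len(part) > MAX_TUPLES_PER_PARTITION and len(distinct_keys) > 1:
            # Overflowed: split this partition again with the next seed.
            # The ("split", seed) tag is the metadata the probe side consults.
            out.append(("split", seed + 1, recursive_partition(part, seed + 1)))
        else:
            # Fits in memory, or every key is identical, in which case another
            # round of partitioning is wasted time (fall back to a nested loop).
            out.append(("leaf", part))
    return out

data = [(1, "a"), (1, "b"), (3, "c"), (3, "d"), (5, "e"), (7, "f")]
print(recursive_partition(data))
```

The probe side would apply the same seeds in the same order, so a probing tuple always ends up looking at exactly one leaf partition.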
So what's the cost of doing this partitioned hash join? Well, assuming we have enough buffers to fit each partition in memory when we do the join across partitions, it's going to be 3(M + N). In the first phase we do the partitioning, which is one pass to read the M pages of the outer relation and the N pages of the inner relation, plus another round of writes: for every single page we read, we write another page back out. So the partition phase is 2(M + N). The second phase then does the join, where we're just doing the nested for loop over the buckets within the same partition, and that's one more pass through all the pages, M + N. If we put numbers to this, we can now do our join in 0.45 seconds. The sort-merge join's best-case scenario was 0.59 seconds, 590 milliseconds; now we're down to 450 milliseconds. That's pretty good, and this is why the hash join is always going to be preferable. Any questions?

So, just to finish up, as I've said multiple times today: if the database system knows something about what the hash tables are going to look like, or what the tables it's reading are going to look like, then it can try to size the hash tables or the buffers accordingly. If everything fits in memory, a linear-probing hash table is what we want to use; if we have to spill to disk, then we use the partitioned approach with the chained-bucket hash table. If you don't know the sizes, then you could fall back to a dynamic hash table, like linear hashing or extendible hashing, but those approaches are much more heavyweight for doing joins than the simple chained-bucket hash table or the linear-probing hash table. In this case, because we're going to do a lot of probes and a lot of insertions into our hash table, the simple one is usually going to be better for us.

Alright, to summarize the different costs of the things we talked about today: the stupid nested loop join could take 1.3 hours; if we use a block nested loop join, then it takes 50 seconds; the index nested loop join depends on what index data structure we're using, so we can't actually give an exact cost for that; the sort-merge join was 0.59 seconds; and the hash join was 0.45 seconds. Yes? So the question is, in what scenario would I not know the size of the outer table? In this example, everything I showed today was one query that joins two tables, and we said we were using a two-way join operator: the operator takes two tables, joins them, and produces the output. What if I have three tables to join? Again, if I'm doing a two-way join operator and I have tables A, B, and C that I want to join, I join A and B, and then the output of A join B is joined with C. So unless I have super accurate estimations of what the output of joining A and B is going to be, I may not know how to size things further up the plan. Now, it also depends on how you do query execution, the query processing models we'll talk about on Monday. I could use a pipelined approach, where every single tuple output by an operator is immediately fed up to the next operator, which does whatever it needs to do with it; that's the streaming case, where I'm sort of incrementally building my hash table in the second join. Or I could take all the output of my join, put it into a bunch of buffers, and now I know the exact size and can size everything. So in some cases you do know and in some cases you don't. Furthermore, the more joins you have, the worse your estimations get, because the cost models in query optimizers are always terrible.
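For reference, here is the arithmetic behind the 0.45-second figure quoted above. The concrete numbers are assumptions on my part (the running example appears to use an outer table of M = 1000 pages, an inner table of N = 500 pages, and roughly 0.1 ms per disk I/O); only the 3(M + N) formula comes from the discussion itself.

```latex
% Worked cost of the partitioned (GRACE) hash join, under the assumed
% example numbers M = 1000 pages, N = 500 pages, 0.1 ms per I/O.
\begin{align*}
\text{partition phase} &= 2(M + N) = 2(1000 + 500) = 3000 \ \text{I/Os}\\
\text{probe phase}     &= M + N = 1500 \ \text{I/Os}\\
\text{total}           &= 3(M + N) = 4500 \ \text{I/Os}
  \approx 4500 \times 0.1\,\text{ms} = 0.45\,\text{s}
\end{align*}
```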
Yes, a question? So the question is, for this one, should this be a lowercase n? Yes, I'll fix that, thank you. Okay, so the main takeaway for you guys going forth into the real world is that the hash join is almost always going to be preferable to everything else, except if we want the output to be sorted ahead of time, or things are already sorted for us, in which case the sort-merge join is going to be preferable. But nine times out of ten, if you take Postgres or any commercial database system, from what I've seen they always pick a hash join. And this is what sort of separates the high-end, expensive, or well-written open source database systems from the off-brand things: they're going to be able to do both, and reason within the system about which algorithm is the right one to use. Again, this is the beauty of the relational model and SQL: the same SQL query could use either of the algorithms we talked about today, and I don't have to go back and change anything in my application to make that work; the database system can do that for me.

Yes? The question is: is everything I'm saying here still true if you're doing outer joins, inequality joins, or anti-joins, things like that? For inner versus outer joins: outer joins, I don't think you can... actually, for an outer join I think you can do everything. For inequality joins or range joins, you have to use sort-merge, because there's no locality of values in a hash table; if I want to find all the keys that are less than some other key, I have to use a B+tree or the sort-merge join. For anti-joins, where something doesn't equal something, hash joins are usually better, actually almost always better. Okay, another extension: the question is, is this still true for single-node databases versus distributed databases? Yes, and we'll cover that at the end of the semester, but in general yes, because now I'm not just reading from disk, I'm reading from the network, and that's even worse; replace disk I/Os with network I/Os and it's the same idea. Yes? Right, if it's not skewed; actually, it should be uniform data. If it's uniformly distributed, sorting will be great for that; if it's heavily skewed, then sorting is bad, but you still have the issue where everything is hashing to the same thing.

Okay, so on Monday next week we'll talk about how to compose all these different operators that we've talked about and actually run them end to end, so we can actually execute queries. I've alluded multiple times to the processing models for pushing data up from one operator to the next; now we can see how that's actually going to be implemented. Then we'll also talk about how the system can be architected to run queries in parallel: I have multiple cores, I have multiple threads, so how am I going to design a system to run multiple queries at the same time, and also take the same query and run it with multiple threads at the same time? That will lead into a discussion at the end of the semester when we talk about distributed databases, because that's essentially what you want to do there as well: take a single query, break it up across multiple machines, and run it in parallel. Okay, alright, so we're done for today. Enjoy the 88 degree weather outside, or whatever that is in Celsius.