Okay, let's get started. So with that, today we're talking about joins. For administrative things, as I said, Project 2, Checkpoint 1 is due today. I did a quick show of hands earlier to see who was actually finished, but some more people have come into the room. Who's actually done this? I've got a hundred percent, there we go, all right, more people, okay. Who has not started? Good, nobody. All right, the other thing too is that there's no class this Wednesday. I had to deal with my parole officer, so I can't come to class, so that's going to be entirely canceled. Use the time obviously to work on your assignment, but of course no one's going to do that, and that's okay. And then the midterm exam will be in class in this room Wednesday next week, on October 17th, and it will cover everything up to and including what we're talking about today. Monday's lecture next week, I think it's query optimization, won't be covered on the exam at all, and there's a homework going out today on join algorithms that will be germane to what's on the exam. So I'll post a study guide with, again, just a list of the chapters and homework problems that you should be looking at. For the textbook, I think the solutions to the odd-numbered problems are on the internet, so if you want to do further problems to prepare for the exam, you could do that. And then you can show up with one sheet of double-sided handwritten notes with whatever you want on it. Again, handwritten means you don't take the slides and try to condense them down as small as possible; it has to be your own writing, because I think you'll get more out of it that way, okay? And then also bring your ID. And if you need any extra time or have a medical issue you need to deal with for the exam, please email me now and we'll get that taken care of as soon as possible, okay?

All right, so the entire lecture today is going to be on joins. This is probably the, I won't say difficult, but the most challenging aspect of building a relational database system: supporting joins and doing them efficiently. You know, the NoSQL guys, when they came out, they said, oh, joins are slow, we're not going to do joins. Some of them are starting to add joins, not quite well, but that's okay. But this is really going to be the most expensive thing, where the database system is going to be spending most of its time, at least for analytical workloads. In OLTP workloads, they're usually doing single-key lookups, or if they're doing joins, we can use an index nested loop join, which we'll see in a second. So joins aren't that expensive in OLTP workloads because you're just not joining large segments of data. In analytical workloads, though, this is totally the case: we've done our own profiling, and I've seen reports and other things, where an overwhelming majority, maybe even over 50%, of the time the database system spends doing work on queries will be spent in join algorithms. So it's really important for us to get this right.

So to understand at a high level why we actually have to do joins: we talked about this at the beginning of the class. We talked about how we would normalize tables to reduce the amount of redundant and unnecessary repetition of data in our database. And so the way to think about this is that we have our original tuples, and there may be columns or attributes where we're repeating data.
And then we're going to normalize them, essentially split them up into separate tables, so that we don't repeat this information. So the join operators are essentially putting them back together to reconstruct the original tuples, right? And we didn't talk about normal forms because I didn't want to torture you guys with that information, but the high-level concept is pretty basic. And this is actually the example database we looked at at the very beginning of the course, right? We want to keep track of our music store. We have artists that put out albums. And we said that we're going to have this cross-reference table, artist_album, that has a foreign key reference from the artist table and to the album table. We did this because there may be some albums where we would have multiple artists appearing on them. And instead of having entries where we're repeating the album information over and over again, we can again normalize it by breaking it up based on these foreign keys. And now we have one entry for the album and one entry for the artist, even though there may be a many-to-many relationship between the two of them. So normalization is the process of designing your application's database at a logical level and splitting it up into normalized tables, and then joins are essentially just putting it back together, right? So that's what we're trying to do today.

So today's class, we're going to focus on the algorithms for joining two tables with an equijoin, because that's going to be the most common approach. There are ways to do k-way joins, joins on more than two tables, but we're not going to cover that in this class; we'll cover that in the advanced class. This is the most common type of join, and I can't say exactly what percentage of workloads will look like this, but if you're doing a join, you're almost always going to be doing an inner equijoin. So the recurring theme we'll see as we talk about these different algorithms is that you're always going to want to put the smallest table that you're trying to join, smallest in terms of the number of tuples that are going to be fed into the join operator, as the outer table. I'll explain what an outer table is in a second. This is always going to produce the best execution time, the lowest possible cost of doing the join.

So before we get into the actual algorithms, I'm going to spend some time talking about two additional things. We've got to talk about what these join operators are actually going to produce as output, and then we can talk about how we're going to determine whether one algorithm is better than another. What are we going to use as our criteria to determine the cost of executing a join? So say that we have a sample query like this. We're doing an inner equijoin on R and S where R.id equals S.id. And we're going to have a join operator in our query plan that's going to produce some output and shove it up to the projection operator above it. We talked a little bit about this when we talked about query processing, about what the output of these operators looks like. But in the case of joins, this is actually going to have the most variance, because there's a bunch of different ways you can do these things.
And it really depends on things like what query processing model you're using, whether it's vectorized, materialization, or the iterator model. It also depends on the storage model, whether you're a column store versus a row store. And it depends on what the query actually looks like, meaning what data we're going to need above the join operator in our query plan; that will determine what we need to produce as our output.

So we talked about this before, right? We talked about the concept of materialization in query processing models, and we said that the easiest way to produce output is to just copy all the data that's fed into your operator and shove it out as your output. So say that in this example here, we're joining R and S together, and we have one tuple in R and two tuples in S. Our join operator would essentially just concatenate all the attributes of R and all the attributes of S for every tuple that's produced in our output, and then we shove that up through the query plan, right? And in this case here, above that, we have the projection, and the projection would know to filter out the data that we actually don't need. So in this case here, we only need R.id and cdate, so the projection operator would just pull out those two attributes. The advantage of this approach is that you never have to go back to the base table and get more data, because everything you need to produce the rest of the query plan is baked into this output here.

But there may be some cases where, if the tables are really wide, meaning there are a lot of columns in both of the two tables you're trying to join together, it may make sense to only produce a subset of the output. In this case here, say S had a bunch more attributes; we know we only need value and R.id going up, so maybe we don't want to copy all the other crap into the join operator and have the join operator copy it back out. We talked about this approach as well in the context of a column store system, where we can pass around only the record IDs or offsets of the tuples from one operator to the next. And we do this because we don't want to have to copy large tuples going forward. So in this case here, our join operator would only produce the keys that we joined on and the record IDs of the matching tuples, and then we shove that up into the query plan. And then when we get to our projection operator, it knows that, oh, I need cdate from the S table, so let me go to the S table and pick out that information. If this is a column store, then this is super easy to do, because the cdate column, the creation date column, will be in its own separate pages and we just jump to the right offset for what we need. If it's a row store, you would probably not want to do this, because then you have to go back and copy the entire tuple, which would be wasteful.

So does anyone remember what this technique was called? We talked about it when we talked about query processing. Late materialization. We're materializing the tuple that we need to produce for the output as late as possible in our query plan. A lot of the column store systems did this. Vertica was really big on this as well, but talking with them over the summer, it turns out they stopped doing this because it actually turned out to be too slow. I don't remember the full details, but this is one technique that people can use. It's one of the advantages you get from a column store over a row store for doing analytics.
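To make the two output formats concrete, here's a tiny sketch of what a single joined row could carry under each approach, using the R(id, value) and S(id, cdate) attributes from the example. The struct names and the use of plain integers for record IDs are illustrative assumptions, not how any particular system lays this out.

```cpp
#include <cstdint>

// Early materialization: the join output carries full copies of both inputs'
// attributes, so operators above it never go back to the base tables.
struct EarlyMaterializedRow {
  int64_t r_id;
  int64_t r_value;   // hypothetical R attribute from the example
  int64_t s_id;
  int64_t s_cdate;   // hypothetical S attribute from the example
};

// Late materialization: the join output carries only the join key plus record
// IDs; the projection above fetches S.cdate from the base table when needed.
struct LateMaterializedRow {
  int64_t  join_key;
  uint64_t r_record_id;
  uint64_t s_record_id;
};
```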
Again, the way to think about this is that we know what the query plan is ahead of time because it's SQL. We know exactly what we're executing, we know what attributes we're going to need, and at the lowest level, in our join operator, we can make decisions about what we actually need to maintain or produce from one operator to the next. This matters when we do our joins because when we do a hash join, we're going to populate a hash table. We don't want our hash table to get too big, because we're going to run out of memory. We want to have just the minimum amount of information we need inside our hash table to compute the join and not waste space, okay?

So now, when we start talking about these different join algorithms, the method we're going to use to determine whether one algorithm is better than another, or to determine what the cost is, is going to be strictly based on disk I/O. As we've said in this class, going to disk is always the most expensive thing. We always want to make choices about our algorithms based on how much we're going to read and write from disk. Once everything's in memory, then the cost metric we use changes, to things like cache misses and so on. But for our purposes here, the disk is always the slowest thing, so that's what we're going to be worried about.

So we're going to use this join query here as a running example throughout the lecture, and we're going to define our cost functions for these join algorithms in terms of the number of pages and tuples in these two tables. We're going to say there are M pages in table R, and in total table R has lowercase m tuples, and there are N pages in S and lowercase n tuples in that table as well. For this class, although I just talked about the output of these join operators, we're going to ignore the cost of producing the output in all our calculations, because that depends on what the data looks like and what your query is. We could have a billion tuples fed as input into our join operator, and it might only produce one tuple as its output, and we don't know, because we don't know what the actual data is; we're trying to do our analysis at a high level. And the cost of producing the output is going to be the same from one algorithm to the next. If I use a nested loop join and it produces a billion tuples as its output, the same join using a hash join algorithm will produce the same number of tuples, so the output cost will be the same there. So we're going to ignore that cost, because it's the same no matter what algorithm we use. As I said, we don't know how many tuples we're going to produce as the output, but we'll learn how to compute this on Monday next week when we talk about query optimization. We'll talk about how to derive statistics from our tables and use them to estimate the number of tuples that we're going to have as input and output for the different operators in the query plan.

So the last thing to mention is that, as I said in the beginning, we're focused in this class on the cost analysis of equijoins, or inner joins, between two tables, because again this is going to be the most widely used type of join we're going to have in our system.
Most queries are going to actually execute like this, and we don't really care about things like cross products, because there's not really any way to make those go faster. A cross product is just two for loops to create all possible pairs of tuples from our two tables, so there's nothing really we can do to make that go faster. Whereas for our inner joins or equijoins, we can be a bit more careful about how we stage our data and do better than the naive approach. We're also not going to talk about outer joins; we're going to focus on inner joins. I'll talk a little about this when we get to sort-merge, because you'll see how you could easily do outer joins. The basic algorithms are the same; you just maybe do some extra step like backtracking, or check a little more data than you would otherwise for an inner join. For all our algorithms, again, we're going to focus on inner joins with equijoin predicates.

So at a high level there are three classes of join algorithms: nested loop join, sort-merge join, and hash join. We're going to spend most of our time at the end talking about hash join, but it's good to understand the other joins because there may be different scenarios where you may want to use one versus another. For example, the index nested loop join is what you're almost always going to want to use for OLTP workloads, because you're going to have an index available to you. All right, I'll say in general that the main spoiler is that hash join is almost always going to be fastest, and we'll see how to make it perform well as we go along.

All right, so the most basic join algorithm is called the simple nested loop join, and I'm putting the little slow marker here just to say that, again, this is like the dumbest thing you can do, but it will produce the right output. It's exactly as it sounds: the nested loop join is comprised of two nested for loops. The first for loop iterates over one table, and for every tuple in that table we're going to iterate over the second table, check to see whether the tuples match our join predicate, and if so, emit them as our output. And in the parlance we're going to use in describing all our join algorithms, there's the idea of the outer versus the inner table. It's exactly as it looks: the outer for loop iterates over the outer table, and the inner for loop iterates over the inner table. I'll use inner and outer table multiple times throughout the lecture, and this is just what it means.

So why is this algorithm bad? Sort of obvious, right? For every single tuple we have in our outer table, we're going to scan the entire inner table from beginning to end all over again. So if you now compute the cost of this in terms of pages, it's going to be M plus little m times N. We have to scan the outer table once, and it has M pages, so that's the first term; and then for every single tuple in the outer table, of which there are lowercase m, we're going to scan every single page of the inner table.
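To make the structure concrete, here's a minimal in-memory sketch of the simple nested loop join. The Tuple struct, the hard-coded join on a single integer key, and the fact that everything lives in std::vector rather than in pages managed by a buffer pool are simplifying assumptions for illustration, not how a real system would lay this out.

```cpp
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

// Hypothetical minimal tuple: a single join key plus one payload column.
struct Tuple {
  int64_t key;
  int64_t payload;
};

// Simple nested loop join (equijoin): for every tuple in the outer table,
// rescan the entire inner table and emit every pair whose keys match.
std::vector<std::pair<Tuple, Tuple>> SimpleNestedLoopJoin(
    const std::vector<Tuple> &outer, const std::vector<Tuple> &inner) {
  std::vector<std::pair<Tuple, Tuple>> result;
  for (const auto &r : outer) {      // one pass over the outer table
    for (const auto &s : inner) {    // full rescan of the inner table per outer tuple
      if (r.key == s.key) {          // the equijoin predicate
        result.emplace_back(r, s);
      }
    }
  }
  return result;
}

int main() {
  std::vector<Tuple> r = {{100, 1}, {200, 2}};
  std::vector<Tuple> s = {{100, 7}, {100, 8}, {300, 9}};
  auto out = SimpleNestedLoopJoin(r, s);
  std::cout << out.size() << " matching pairs\n";  // prints "2 matching pairs"
}
```

On disk, that inner rescan is exactly what the M plus little m times N formula is counting: one scan of the outer table's M pages, plus a full scan of the inner table's N pages for each of the m outer tuples.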
So when you actually put numbers into this, say our database has a thousand pages in the outer table with 100,000 tuples, and 500 pages in the inner table with 40,000 tuples. When you run this formula with these numbers, you can see that just doing the simple nested loop join on these two tables produces about 50 million I/Os, and assuming you have even a fast SSD at one tenth of a millisecond per I/O, computing this join takes 1.3 hours. Right? That's pretty bad, right? It's pretty slow.

So what's one thing we could do to make this go faster? Just with this algorithm, what's a simple thing we can do? What was that? I thought somebody said swap. Excellent, yes, perfect. So as I said in the beginning, you always want to try to make the smaller table be the outer table. So if we do that, table S is smaller than table R, so use S as the outer table; now your I/Os get down to 40 million, but your total time is still 1.1 hours. Yes, why, say that again, sorry? Yeah, sorry, move that over by one, thank you. No, screw it, let's just fix it now. Actually, no, I don't want to break this, but yes, just move it over by one, sorry. It doesn't really change it though, it's still going to be about an hour, right?

So even if we swap the inner table with the outer table, we're still doing pretty crap here, because again, we're doing the dumbest thing: we're ignoring the fact that we can pack multiple tuples into a single page. We know this because you guys built your own buffer pool manager, so you know you can have multiple tuples per page. This algorithm is assuming that for every single tuple it's going to be a page fetch to go get that tuple. So what you really want is a buffer or page nested loop join; the textbook calls it a block nested loop join, but it's the same idea: now we're going to iterate over every single block in the outer table and every single block in the inner table, and then for the tuples within those two blocks, or two pages, we do our comparison. So in this scenario we have four nested loops put together, but at a high level we're still just iterating over the outer table and iterating over the inner table; we're just doing it on a per-block basis. So now if we go back to our formula, instead of having little m for the inner part, for the inner table, we have big M. This is a little bit better, but still not great. And of course, as I said, the smaller one is always what we want to put as the outer table, and in this case it's the one with the smaller number of pages, not the smaller number of tuples. Right, so doing our formula again, and now I got the thousand correct, okay, so if we do this, M plus M times N, we're doing about 500,000 I/Os, and we can compute the join in 50 seconds on a fast SSD.
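Here's a minimal sketch of that block nested loop join, under the same simplifying assumptions as before (in-memory vectors standing in for disk pages, a hard-coded integer join key). The Page type is just a vector of tuples, to mimic the fact that the unit of I/O is a page, not a tuple.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct Tuple {
  int64_t key;
  int64_t payload;
};

// A "page" here is just a batch of tuples; in a real system it would be a
// fixed-size block fetched through the buffer pool.
using Page = std::vector<Tuple>;

// Block nested loop join: iterate page-by-page over the outer table, and for
// each outer page rescan the inner table page-by-page, comparing the tuples
// inside the two pages that are currently "in memory".
std::vector<std::pair<Tuple, Tuple>> BlockNestedLoopJoin(
    const std::vector<Page> &outer, const std::vector<Page> &inner) {
  std::vector<std::pair<Tuple, Tuple>> result;
  for (const auto &outer_page : outer) {      // M page reads total
    for (const auto &inner_page : inner) {    // N page reads per outer page
      for (const auto &r : outer_page) {
        for (const auto &s : inner_page) {
          if (r.key == s.key) {
            result.emplace_back(r, s);
          }
        }
      }
    }
  }
  return result;
}
```

With B buffer pages available, the same idea generalizes by holding B minus two outer pages at a time in the outermost loop (one buffer stays reserved for the current inner page and one for the output), which is where the cost formula discussed next comes from.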
This is better, but we can still do better, right? In this case here we're assuming that we only have three blocks, or three buffer pages, we can use: one for the outer table, one for the inner table, and one for our output. You always need one for the output. So if we have more buffers available, then we can use B minus two pages to scan the outer table, and we still have one page for the inner table and one page for our output. And now the algorithm basically looks the same, except we have B minus two pages for the topmost loop. Again, we're taking advantage of the fact that we have a buffer pool manager that can give us some extra space, where we can start paging in pages from our table and reuse as much as we can while it's in memory. So now, using the fact that we have B minus two pages for the outer table, our formula ends up with this cost: M plus the ceiling of M divided by (B minus two), times N. And if the outer table fits entirely in main memory, where B is greater than M plus two, because again we always need the two pages for the inner table and the output, we can get this down to 1,500 I/Os. You still have to scan the entire outer table and put everything in memory, but then you just scan the inner table once, so it's M plus N. So now we're cutting things down quite a lot, but of course, we saw this before when we talked about sequential flooding: this assumes that no other query is running, or that you don't care about throwing away anything else in your buffer pool. It assumes you can dedicate all of the database's memory to executing just your one query, which may or may not be the right thing depending on your workload.

So again, in all these examples, the block nested loop join or the simple nested loop join, at the end of the day they're just sequential scans. You're just scanning through every single tuple or every single block and doing your comparison that way. There may be a scenario, though, where we have an index available to us. Sequential scan sucks, right? We know it sucks, okay? And as we said, it puts pressure on the buffer pool and the memory manager, which may be taking memory away from other parts of the system. So if we have an index available to us, which is often the case if you have foreign keys, then we can use that index to find matches on our inner table. Instead of doing a sequential scan on the inner table, we just probe the index to find the one, or the small number, of tuples that we want, and then we do our join with those. Best case scenario, we have an index available to us; if not, we could try to build one on the fly, which is essentially what a hash join is doing. I don't know of any system that will actually build a B+Tree on a table on the fly; everyone just builds a hash table, but at a high level it's the same idea.

So now our for loops look like this: for every single tuple in the outer table, we do a probe in our index, assuming we have an index on some subset of the attributes of our join key, and if we have a match from the index, we emit it as our output. The reason why, in what I'm showing here, you do the index probe and then still check whether the tuples match is because the index may only cover part of the join key. Say our join is on three attributes and the index only has the first two: we go find our matches, then we go fetch the tuple, then we check whether the remaining attribute matches our join predicate. For this, the cost of the index probe we're going to say is some constant C, because it depends on what the data structure is, what the index looks like, whether the thing is in memory or not. We just say it's some fixed cost that'll be the same for every single tuple. We know in the case of a B+Tree it's log n, and in the case of a hash table in the best case it's O(1), but again, we just say it's C. So now our cost comes out to be M, for scanning every single page in the outer table, which is unavoidable, and then for every single tuple in the outer table we're doing exactly one probe into our index to find our matches. So the formula comes out to M plus little m times C.
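Here's a minimal sketch of the index nested loop join, with the same in-memory simplifications as before. A std::unordered_multimap stands in for whatever index already exists on the inner table's join key (a B+Tree in a real system); building it is shown only so the example is self-contained, since the whole point is that the index is assumed to already be there.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple {
  int64_t key;
  int64_t payload;
};

// Example of building the stand-in index (in practice it already exists,
// e.g. because of a foreign key).
std::unordered_multimap<int64_t, Tuple> BuildIndex(const std::vector<Tuple> &inner) {
  std::unordered_multimap<int64_t, Tuple> index;
  for (const auto &s : inner) {
    index.emplace(s.key, s);
  }
  return index;
}

// Index nested loop join: scan the outer table once, and for each outer tuple
// probe the existing index on the inner table's join key instead of rescanning
// the inner table.
std::vector<std::pair<Tuple, Tuple>> IndexNestedLoopJoin(
    const std::vector<Tuple> &outer,
    const std::unordered_multimap<int64_t, Tuple> &inner_index) {
  std::vector<std::pair<Tuple, Tuple>> result;
  for (const auto &r : outer) {
    auto [begin, end] = inner_index.equal_range(r.key);  // one index probe, cost "C"
    for (auto it = begin; it != end; ++it) {
      // If the index only covered part of the join key, we would re-check the
      // full predicate against the fetched tuple here.
      result.emplace_back(r, it->second);
    }
  }
  return result;
}
```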
All right, so nested loop join: super simple to understand. You always try to pick the smaller table as the outer table, you do a sequential scan over the outer table, and you either do a sequential scan or an index probe on the inner table. You want to keep as much of the outer table in memory as possible, to be able to reuse that data as you're doing your scan on the inner table. And if you have an index on the inner table, that's great, and you definitely use it; the database system's optimizer will try to always pick the index. All right, so any questions about nested loop join? Again, this is the first join algorithm everyone pretty much implements when they build a database system for the first time, because it's the easiest one to implement.

Okay, the next class of join algorithms is called the sort-merge join. I think your textbook calls this merge join; other textbooks call it sort-merge join; they mean the same thing. And what's sort of confusing about it is that in the sort-merge join you can use the external merge sort algorithm that we talked about last time to sort your tables. So there are two phases. In the first phase, we're going to sort both the inner and the outer table based on the join keys. Then, when that's done, we're going to establish cursors at the beginning, at the top, of the two tables, and they're going to walk down in sorted order, doing comparisons of whatever the cursors are pointing at, to see whether we have a match. The high-level idea is that because we're sorting things ahead of time, when we start scanning the tables with our cursors, we know that as a cursor moves down, because everything is in sorted order, we never have to go back and look at previous values we've examined before; everything above where the cursor is pointing is less than what we're looking at now. I'll show an example of what I mean in a second. And again, for the sort phase you can use quicksort, you can use heapsort, or you can use the external merge sort algorithm we talked about last class.

So roughly, the algorithm for the sort-merge join looks like this. In the first phase, you sort the two tables based on the join keys; we know how to do that from last class. Then you have these cursors on the sorted tables that scan through the two tables until either one of them reaches the end. At each iteration through the while loop, if the outer table's cursor is greater than the inner table's cursor, you increment the inner table's cursor; if the outer table's cursor is less than the inner table's cursor, you increment the outer table's cursor; otherwise you check to see whether you have a match, and if so, you produce that as your output, the joined tuple, and then you increment the inner table's cursor by one.
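Here's a minimal sketch of the sort-merge join under the usual simplifying assumptions (in-memory vectors, a single integer join key, std::sort standing in for the external merge sort). It also includes the run-of-duplicates handling that comes up in the questions below: when both sides contain the same key more than once, the inner cursor effectively backtracks to the start of the run for each matching outer tuple.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Tuple {
  int64_t key;
  int64_t payload;
};

// Sort-merge join: sort both inputs on the join key, then walk two cursors
// down the sorted tables, never revisiting anything behind either cursor
// except to replay a run of duplicate keys on the inner side.
std::vector<std::pair<Tuple, Tuple>> SortMergeJoin(std::vector<Tuple> outer,
                                                   std::vector<Tuple> inner) {
  auto by_key = [](const Tuple &a, const Tuple &b) { return a.key < b.key; };
  std::sort(outer.begin(), outer.end(), by_key);  // phase 1: sort (external
  std::sort(inner.begin(), inner.end(), by_key);  // merge sort on disk)

  std::vector<std::pair<Tuple, Tuple>> result;
  size_t o = 0, i = 0;
  while (o < outer.size() && i < inner.size()) {  // phase 2: merge
    if (outer[o].key < inner[i].key) {
      ++o;                                        // outer is behind, advance it
    } else if (outer[o].key > inner[i].key) {
      ++i;                                        // inner is behind, advance it
    } else {
      const int64_t key = outer[o].key;
      const size_t run_start = i;                 // remember start of the inner run
      while (o < outer.size() && outer[o].key == key) {
        for (size_t j = run_start; j < inner.size() && inner[j].key == key; ++j) {
          result.emplace_back(outer[o], inner[j]);
        }
        ++o;                                      // backtrack inner for the next outer duplicate
      }
      while (i < inner.size() && inner[i].key == key) {
        ++i;                                      // skip past the run on the inner side
      }
    }
  }
  return result;
}
```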
So again, the algorithm might be difficult to follow, but let's look at it visually. These are the two tables we want to sort for this join query here. In the first phase we do our sort on the join key, again just running external merge sort, quicksort, whatever you want, and we produce the new sorted tables like this. Then we have two cursors that start at the top of each of the two tables, and they're going to go down step by step and do comparisons against each other to see whether we have a match.

So in the very first step, the outer table's cursor is pointing at the value 100, and the inner table's cursor is pointing at the value 100. These two are equal, therefore we have a match, and we produce a tuple in our output. Because we had a match, we now increment the inner table's cursor by one and do another comparison; again, we don't move the outer table's cursor here. So now 100 equals 100 again, another match, we produce our tuple in the output, and then we increment the inner table's cursor. At this point we're comparing 100 and 200, but since 200 is greater than 100, we know we need to move the outer table's cursor by one. And this is what I mean by never having to backtrack: at this point we don't have a match, 100 does not equal 200, and we know that for the inner table's cursor we've already looked at everything that comes before it. So when we increment the outer cursor here to 200, we never need to go back and look at anything we've seen before, because everything above the inner cursor is less than whatever it's pointing at now; everything is less than 200, so there's nothing we'd ever need to compare again. Essentially each cursor only looks at each tuple once. It may do multiple comparisons with the tuple it's looking at, but it never goes back to the earlier ones.

So now 200 equals 200, we produce our output, we increment the inner table's cursor. Now 200 is not equal to 400, and 200 is less than 400, so we increment the outer table's cursor by one. 300 is less than 400, so we increment it again. Now 400 equals 400, so we produce a tuple in our output and increment the inner table's cursor. Now 400 does not equal 500, so we increment the outer table's cursor. 500 equals 500, we produce our output.

So what happens next, what happens here? Well, he says, assuming this is the entire table, he's right, this is it, we're done, right? Because we reached the end of the inner table, it doesn't matter that there are some more tuples here in the outer table; this cursor will increment to nothing, so there's nothing else we could possibly look at, because again we don't backtrack. So it doesn't matter that these guys still exist, we don't care. Yes? So he says those two could still match with the... yes, yes, right, yes. I forgot about that, yeah, he's right. Okay, so yeah, this is technically wrong: if this last value were 500, then this cursor would have to keep going down, and you'd have to increment this one, not that one. So it's slightly off, you have to deal with that case. Yes? His question was whether the ID is a primary key or not.
Well, yeah, it is: this is the primary key, or it's a unique key at least. So, I mean, I'm going to get into cardinality. This is one-to-N, so there could be multiple tuples here in the inner table for one tuple here in the outer table. If you know that, then you know that if you reach the end of the inner table, you'll never have any other match, because you've already matched everything. If it's N-to-M, then you would have to go all the way to the bottom of this one here. Yeah, that's the corner case I forgot to discuss, thank you, I'll fix that. Yes, you in the back? Yeah, so his question is, say this was actually 100, not 200. Right, if it's N-to-M, then yes, in this case you would have to backtrack. So you'd have to know: all right, the value I'm looking at is 100, where did I stop, how far do I have to go back to reset to the current value? You would have to say, all right, there's 100 and 100, so I know that if I have to reset, I have to start back up here; same thing over here, if I had to reset, I'd come back up here. The basic way it works is that you never backtrack on the outer table, you only ever backtrack on the inner table. So if I start on the outer table at 100, 100 matches, the next guy is 100, 100 matches, then I go down to 200 on the inner side, increment the outer cursor by one, and now I'm at 100 again on the outer table, so I have to know to go back on the inner table to the start of the 100s and scan down again. So there's some extra metadata to maintain if you're doing N-to-M joins, and in the worst case scenario, if all the values are exactly the same, the cost is just repeated sequential scans over and over again. But we'll keep it simple. Was it clear what I was saying there? Right, in this example here we never have to backtrack because we know it's a one-to-N join: for every one tuple in this table there can be multiple tuples in that table, but not vice versa, so as we scan down we never have to backtrack. If it's N-to-M, then you have to backtrack on the inner table as you go down the outer table, but instead of doing a complete sequential scan every single time, you only backtrack to the point where the values you need for your comparison start. Okay, and then the cost analysis can get weird, because if you're backtracking across page boundaries, then you're fetching in pages again; again, we're keeping it simple and ignoring that.

All right, so the cost for this: we have the sort cost of the outer table, the sort cost of the inner table, and that's just the formula we had from last class when we talked about external merge sort, and then we have our merge cost, which in the best case scenario is a sequential scan on the outer table plus a sequential scan on the inner table, M plus N. The total cost is just the combination of those. So again, using our example database, and now let's say we have a hundred buffer pages to do our sorts for both R and S, that means we can sort them in just two passes. The cost of sorting R is 3,000, the cost of sorting S is 1,350, the merge cost is 1,500, so the total cost of the sort plus the merge is about 5,850 I/Os, and that roughly comes out to 0.59 seconds on the fast SSD. So we're getting a little bit better; the block nested loop join was around 50 seconds, and now we're getting into sub-second numbers here.
All right, so the worst case scenario is sort of the example we just briefly talked about, where you have to do backtracking on the inner table. Essentially, you pay the cost of sorting, but all the join key values are the same in both tables, so for every single page in the outer table you're doing a complete sequential scan on the inner table, then you backtrack on the inner table, go down to the next page, and do it all over again. So the worst case for the sort-merge join is M times N, plus your sort cost. And again, we'll see on Monday next week when we talk about query optimization that the database system tries to maintain internal statistics about your tables to make decisions like, oh, the distribution of my values is really crappy, therefore I don't want to do the sort-merge join, because I'd just be doing wasted I/O without getting any benefit from sorting. I would say that this particular case is super rare, because think about what it would actually be: two tables that you want to join where there's exactly one value in this column and it's the same for every single tuple. I don't want to say it never happens, because people do stupid things all the time, but in a real application this is pretty unlikely.

All right, so when is the sort-merge join useful? Well, obviously when the tables are already sorted on your join key, and if you're using something like an index-organized table, then you don't have any sort cost at all, because it's already sorted on the thing you want to join on anyway. That's an added bonus: you cut that out entirely and then it's just the cost of two sequential scans. Also, if your query requires the output to be sorted on your join key, then you're killing two birds with one stone, because now you're doing the join and it produces output that's already sorted on the thing that's in your ORDER BY clause, so you don't have to do any extra sorting for the ORDER BY. So again, the optimizer, the database system, can recognize: oh, I have an ORDER BY clause for my join, and the ORDER BY is on the attribute I'm joining on, so just do sort-merge and skip the sort for the ORDER BY, and you cut an operator out of your query plan entirely. And again, for the input of these operators, the database system will know: all right, I'm getting data from this operator below me in the tree, and I know the physical properties of that data, I know whether it's actually already sorted or not. So it uses that in its determination of whether it wants to do a sort-merge join or another algorithm. Okay, so any questions about sort-merge join?

Okay, so let's spend some time talking about hash joins. As I said, hash join is the most important algorithm; this is where most major analytical databases are going to be spending most of their time, because in general it turns out to be the best. The basic idea is that it's almost like the index nested loop join, where we build a hash table on the fly and then use that to probe with the inner table to see whether we have a match.
So the way to think about why this works is that if we're doing a join on two tuples and their join attributes match, meaning, again, we're doing an equijoin, something equals something, then if we hash the attributes that we're joining on, the hash function will have to produce the same hash value for both of them. If two values in unhashed form are the same, then the two values in their hashed form will be the same as well. So we can rely on that and do our join on the hashed attributes rather than on the raw attributes themselves. At a high level, that's essentially what we're trying to do in a hash join.

So we have two phases. In the first phase, we build our hash table on the outer relation, the outer table, using some hash function; it doesn't matter what it is, MurmurHash, CityHash, whatever you want to use. Then in the second phase, we do a sequential scan over the inner relation, and for every single tuple we hash the join key attributes of that tuple with the same hash function we used in the first phase, and that points us to some location in our hash table, and we can use that to determine whether we have a match. And just like we don't care which hash function we use, we don't actually care which hash table scheme we use; we can use linear probing, cuckoo hashing, Robin Hood hashing, it doesn't matter, as long as hashing something jumps to a particular spot in our table and we can find the thing that we're looking for.

So the basic hash join algorithm visually looks like this. You build a hash table, HT_R, on the outer table, and then for every single tuple in the inner table, you hash it and look it up in the hash table to see whether you have a match. So we do a sequential scan of the outer table to populate HT_R, then do a sequential scan of the inner table and probe inside the hash table to see whether we have a match. That's it, pretty straightforward. It's almost the same thing as the index nested loop join, except in the index nested loop join we had the index on the inner table, and in this case we're building the hash table on the outer table, right?
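A minimal in-memory sketch of that basic hash join, under the same simplifying assumptions as the earlier sketches (vectors instead of pages, a single integer join key, std::unordered_multimap standing in for whatever hash table scheme the system actually uses).

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple {
  int64_t key;
  int64_t payload;
};

// Basic hash join: build phase over the outer table, probe phase over the
// inner table, using the same hashing on the join key for both phases.
std::vector<std::pair<Tuple, Tuple>> HashJoin(const std::vector<Tuple> &outer,
                                              const std::vector<Tuple> &inner) {
  // Build phase: one sequential scan of the outer table to populate HT_R.
  // Storing the whole tuple as the value is one of the choices discussed below.
  std::unordered_multimap<int64_t, Tuple> ht_r;
  ht_r.reserve(outer.size());
  for (const auto &r : outer) {
    ht_r.emplace(r.key, r);
  }

  // Probe phase: one sequential scan of the inner table, hashing each join
  // key and checking the matching bucket for exact key equality.
  std::vector<std::pair<Tuple, Tuple>> result;
  for (const auto &s : inner) {
    auto [begin, end] = ht_r.equal_range(s.key);
    for (auto it = begin; it != end; ++it) {
      result.emplace_back(it->second, s);
    }
  }
  return result;
}
```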
So what do we actually put in our hash table? Well, the key is just going to be the attributes that we're doing our join on. This only works if you're doing an equijoin, with equality predicates, because we want to know whether something equals something. If we have an inequality, a less-than or a greater-than, then the hash join doesn't work, because we can't do that comparison with the hash table; it always has to be an exact match. And what we actually put in the value portion, the payload, of the hash table depends on the implementation and on what the output of the operator should be. You can put the entire tuple if you want, and we said we might want to do this because it avoids having to go back and get more data. If we make the thing we're hashing on only a subset of the join key attributes, which you could do if you wanted to, then we still have to maintain the rest of the join key inside the payload, because we have to do an evaluation inside of it anyway. Actually, we have to do that whether we hash the entire join key or not, because we could have collisions, and we have to make sure we actually have an exact match inside the hash table. Storing full tuples obviously takes up more space, and it may cause our hash table to overflow and have long chains or other things that require us to allocate more pages. Or, we could use the tuple identifier, the record ID or offset, that we talked about at the beginning with late materialization. We said this would be better for column stores because you're only reading in data that you actually need, but you may have to go fetch that data later on, up in the query plan, depending on what you actually need.

So if everything fits in memory, then our hash join is super fast, because we just rip through the outer table and build the hash table, then rip through the inner table, and the hash table is entirely there and we just probe inside it and find the things we want. Of course, we said this is not always the case in large databases; this is why we have a buffer pool manager, to let us spill things out to disk as needed. But just like we don't want the operating system to control what we swap in and out of memory, we also don't want the buffer pool manager to just make its own decisions about what to swap in and out, because it may end up swapping out things we need as we're doing our join. Can anyone take a guess why that is the case? What is it about how we're doing this? Say we built our hash table and now we're doing our probe; actually, in both cases, on the build side and the probe side, what does the access pattern of our hash function look like? It's random, right? Because you're taking whatever value you have and hashing it to some essentially random location in memory, and we can't predict this ahead of time. So if we just let the buffer pool manager decide, oh, I'll just use LRU or CLOCK to decide what to evict from my buffer pool, it may end up evicting something that the hashing of the very next key would actually need. So we want to be smart about this and try to stage things in smaller chunks, where we can get everything to fit in memory and we don't have this problem of thrashing because the buffer pool manager is removing things that we actually need.

So the algorithm we're going to use to handle hash joins that are larger than the amount of memory available to us is called the Grace hash join. It's called that because there was a project in the 1980s at the University of Tokyo for a database machine called GRACE. The system doesn't exist anymore, it never got commercialized, but they have a paper from the time that describes how to do the hash join I'm going to show you, and for historical purposes everyone just calls it the Grace hash join. The basic idea is a lot like what we saw in the external merge sort, where we break up the problem into smaller chunks that can fit in memory, and then we do the join on those chunks and don't worry about fetching other pages that will never match. So instead of just building a hash table for one side, we're going to hash the other side as well, put things into partitions or buckets, and then come back and build hash tables on just what's in those buckets.
So again, as I said, this is a picture from their website; I think it's the only picture I can find online of the GRACE database machine. Who here has actually ever heard the term database machine before? Nobody. Who here has ever heard the term database appliance? Nobody, okay. So the way to think about this is, say I want to set up MySQL or Postgres: what I could do is go buy a machine from Supermicro or Dell or EMC, set it up, load the operating system, then download the database system software, install it, and configure it all myself. That's what most people do. A database appliance would be a machine you buy that comes already preconfigured and tuned for whatever database you want to run. It's set up for the exact hardware you're running it on, so you don't have to worry about how much buffer pool memory to use or how to tune it for the OS and other things; all of that is done for you. It's sort of a batteries-included database system that comes on its own hardware. A database machine was a movement in the 1980s where people would build something like an appliance, a preconfigured rack unit with the database system pre-installed, but they went a bit further and put specialized hardware on it, not FPGAs or GPUs, but custom hardware that the database system could take advantage of. The thought at the time, in the early 1980s, was that this was the future of databases, that everyone was going to have these custom hardware machines that the database system could exploit to get much better performance than you could get on commodity hardware. Of course, it didn't pan out, because with Moore's law things were always getting better and better, especially in the 1980s, less so now. Whatever advantage you had, by the time you manufactured a database machine with specialized hardware to do whatever you wanted to do, and for example a lot of these systems had specialized hardware to do hash joins, by the time you actually got it into production, Intel had come out with a better chip that could outperform whatever you were doing before. So these things never really took off, at least the ones with customized hardware. Database appliances still exist; there are a bunch of examples here from different manufacturers. The first one is IBM selling you a z/OS machine to run DB2. There was a startup out of San Francisco called Clustrix that would sell you an appliance with a tuned MySQL-like database system. And of course Oracle is probably the most famous one; they'll sell you lots of hardware that comes with the database system already preconfigured and running on it. Exadata brings them a lot of money, basically selling these appliances. So again, GRACE was not just an appliance, it was a database machine, which is an appliance plus extra stuff. Nobody uses the machine anymore, but everyone still uses the Grace hash join, so that's what we're going to cover. I don't know what the textbook calls this; it might call it the Grace hash join, it might call it something like a partitioned hash join; the basic idea is the same. I think Wikipedia might call it the Grace hash join.
Right, so as I said, in the regular hash join we would just scan the outer table, build a hash table, and then have the inner table do a sequential scan and probe inside of that. But here, what we're trying to do is divide up the outer and inner tables into buckets that can fit in memory, and then we'll build hash tables on just those buckets, one bucket at a time. So we do a sequential scan on the outer table, and, actually, this slide isn't quite right, this shouldn't say HT, it should say buckets: we're not building a hash table yet, we're building buckets. Same thing on the other side. And then these things can get swapped out to disk. But the way to think about this is, because we're using the same hash function on both sides, we know that at every level there will be tuples that could match. Whatever hash function one puts in this level here, any tuples over here with the same value will end up in the same level as well. Now, we'll have collisions, so we'll have different keys that hash to the same bucket, but that's okay. The main idea is that, going across at one level, there's nothing in the other levels that we have to examine to look for a join match; we've put everybody with the same key in the same location, the same level. So now what we do is just a simple nested for loop over the buckets themselves. We said the nested loop join was okay when everything fit in memory, so because these buckets will fit in memory, we just iterate over each pair of them and check whether we have a match.

Yes? His question is, is this just reducing the range of the hash function? In some ways, yes. In the regular hash join, without this step, you hash into some location, and if you have a lot of collisions, you're essentially going to be doing the same sequential scan, whether it's linear probing or a chained hash table, to see whether you have a match. So this is doing the same sort of thing, we're just hashing both sides; it's almost like a divide-and-conquer approach. And we'll talk about parallelization later on, but you can easily parallelize this, because you can have one thread do the join across each level. Again, the main idea is that at this level here, there's nothing in the other levels we have to care about, because the hash function always put us in the right location.

All right, so what's the potential problem with this? This works great; we said that if our hash table doesn't fit in memory, we build these buckets, potentially swap them out to disk, and then bring in the buckets for each level one by one and do our for loops. What's the problem with this, potentially? Exactly: what if the buckets don't fit in memory? If the buckets don't fit in memory, then you have to split them up even more. This is called recursive partitioning. The idea is that we just keep partitioning our buckets over and over again into smaller and smaller chunks until we get something that will fit in memory, and then we can do our simple nested for loop join.
So the basic idea here is that if we recognize that one of our buckets is going to overflow or get too big, then we apply another hash function to it that's not the same as our first hash function. And when I say a different hash function, it's like, if we used MurmurHash for the first hash function, we can use MurmurHash again for the second one, we're just providing it with a different seed or salt, so that it produces a different distribution of hash values for the same keys. So we take that bucket, hash it all again with the second hash function, and spill those out to disk. And then when we come back and do our join, we just have to keep track of what level, or how deep, we went with our recursive partitioning, to determine which hash function we should be using to find a matching tuple.

So going back to our example here: say we're building on the outer table, we hash everything into buckets with the first hash function, and we see that this inner bucket here, level one, is overflowing. So we go ahead and hash it again with another hash function, to split it out even further. And the way to think about it in terms of levels: the first and the last buckets are still at the level they were assigned to when they started, but the inner one here is now one-prime, one-double-prime, one-triple-prime, just to say that it began at level one but then we applied a second hash function and split it up even further. Of course, now you may be thinking: what happens if I apply the second hash function and all the values here are the same, so everything hashes to the same thing anyway? Would I be stuck in an infinite loop? In that case there's nothing you can do; you just have to bite the cost of doing a sequential scan, or basically do the block nested loop join and fetch in the pages one by one, just for this one level. And the database system would recognize this: if I hash this again and everybody is the same, then breaking it up further doesn't help me.

All right, so now on the inner table, again, we take the first hash function and split everything up into chunks like this. For the top one and the bottom one we know that on the other side they just got mapped to the same location with the same hash function, but if anything goes to level one, then we know we need to split it again with the second hash function and write it out like that. So then when we do our comparison, we're only comparing across the same level here. This is sort of like the extendible hash table or the linear hash table, where we kept track of which hash function we should use depending on where the split pointer was, or what the local depth was for a particular bucket.

All right, so what's the cost of this? Well, assuming we have enough buffers, the cost is essentially three sequential scans over both tables, so three times (M plus N). You have the partition phase, where you have to read both tables in and write them both out, so that's one scan in and one write out, two times (M plus N). Then in the probing phase, it's just like the block nested loop join, so it's one sequential scan over every page of the outer table and one sequential scan over the inner table.
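Here's a minimal sketch of the Grace hash join's two phases under the same in-memory simplifications as before. The number of partitions, the use of std::hash with an XOR'd seed constant as the "different" partitioning hash function, and the single level of partitioning (no recursion when a bucket overflows) are all assumptions made just to keep the example short.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple {
  int64_t key;
  int64_t payload;
};

// Partitioning hash: a stand-in for running the same hash family with a
// different seed than whatever the per-bucket hash tables use internally.
static size_t PartitionOf(int64_t key, size_t num_partitions) {
  constexpr uint64_t kSeed = 0x9E3779B97F4A7C15ULL;  // arbitrary illustrative seed
  return std::hash<uint64_t>{}(static_cast<uint64_t>(key) ^ kSeed) % num_partitions;
}

// Grace hash join, single level of partitioning:
//   Phase 1: hash-partition BOTH tables with the same partitioning hash, so
//            matching keys always land in the same partition number.
//   Phase 2: for each partition pair, build a hash table on the outer bucket
//            and probe it with the inner bucket (a basic in-memory hash join).
std::vector<std::pair<Tuple, Tuple>> GraceHashJoin(
    const std::vector<Tuple> &outer, const std::vector<Tuple> &inner,
    size_t num_partitions) {
  std::vector<std::vector<Tuple>> outer_parts(num_partitions);
  std::vector<std::vector<Tuple>> inner_parts(num_partitions);
  for (const auto &r : outer) outer_parts[PartitionOf(r.key, num_partitions)].push_back(r);
  for (const auto &s : inner) inner_parts[PartitionOf(s.key, num_partitions)].push_back(s);
  // In a real system each partition is written out to disk here and read back
  // one pair at a time, which is where the 2(M+N) partitioning cost comes from.

  std::vector<std::pair<Tuple, Tuple>> result;
  for (size_t p = 0; p < num_partitions; ++p) {
    std::unordered_multimap<int64_t, Tuple> ht;   // assumed to fit in memory
    for (const auto &r : outer_parts[p]) ht.emplace(r.key, r);
    for (const auto &s : inner_parts[p]) {
      auto [begin, end] = ht.equal_range(s.key);
      for (auto it = begin; it != end; ++it) {
        result.emplace_back(it->second, s);
      }
    }
  }
  return result;
}
```

If a partition were still too large to fit in memory, the real algorithm would recursively re-partition just that bucket with yet another seed, as described above, rather than stopping at one level like this sketch does.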
Again, the idea here is that for every page in a bucket of the outer table, within a level, I only have to look at the pages of the inner table that correspond to the same bucket; I don't go to the ones above me, and I don't go to the ones below me. So it's just one complete sequential scan of both the inner and the outer table in the probing phase.

So let's see this now in terms of cost. Again, it's three times (M plus N), so three times 1,500 is 4,500 I/Os, and we can do that on the fast SSD in 0.45 seconds, where the sort-merge join case was, I think, 0.59 seconds. So this is why, again, the hash join is almost always going to be preferred: it's almost always faster than the sort-merge join, and definitely almost always faster than the nested loop join, unless you have an index and you're only joining a small number of things.

If the database system knows the size of the outer table, then it can use a static hash table. We can just pick, say, a linear probing hash table, which is really fast, and size it so that everything fits without a large number of collisions, based on how many tuples we think we're going to get. That's the best case scenario, and that's what database systems are going to try to do. But the issue is that, as we'll see next class, you have to estimate how many tuples you think you're going to have as your input, and if you're getting the data directly from your access methods at the bottom of the tree, that may not be hard to compute, but if you're doing a join after a join after a join, then if you get it wrong at the bottom, you're going to be really wrong at the top. So you may under-provision the size of your hash table, and you may have to double its size if you have too many collisions. You could use the dynamic hash tables, like extendible hashing or linear hashing, if you don't know the size, but as we saw before, in both of those schemes you pay a computational overhead to figure out which bucket things should actually be in, whereas with linear probing you don't have to do that. So most databases are going to try to use a static hash table, and if they get it wrong, they pay the penalty of doubling the size. It's really only for internal metadata that you'd use something like extendible hashing or linear hashing, like the buffer pool manager that you guys have been building.

Okay, so any questions about hash joins? Pretty simple to understand. Again, from our point of view we really only care about the disk I/O cost, so the recursive partitioning approach is pretty much the only real optimization we have available to us. We'll see, in the advanced class when we do in-memory joins, that you start to care about where the data is actually located, like on which CPU socket or in what level of your cache; for our purposes we don't care about that.

So to summarize the cost of all these different algorithms: again, using our running example, in the worst case scenario the simple nested loop join took 1.3 hours; if we do things in blocks, we're down to 50 seconds; with the index nested loop join, I don't know where I got 20 seconds from, oh, I said log n, right, it's a little bit better; and then the sort-merge join and the hash join are even less than that. That index number actually might be wrong, I don't know where I got it from, because it shouldn't be log n; all right, I'll fix that later.
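Pulling together the numbers quoted for the running example (M = 1,000 pages and m = 100,000 tuples for R; N = 500 pages and n = 40,000 tuples for S; 0.1 ms per I/O), the summary works out roughly as follows. The sort-merge total just sums the sort and merge costs quoted earlier, and the index nested loop cost is left symbolic since it depends on the per-probe cost C.

\begin{align*}
\text{Simple NLJ:}      &\quad M + m \cdot N = 1000 + 100{,}000 \cdot 500 \approx 5.0 \times 10^{7}\ \text{I/Os} \approx 1.3\ \text{hours} \\
\text{Block NLJ:}       &\quad M + M \cdot N = 1000 + 1000 \cdot 500 \approx 5.0 \times 10^{5}\ \text{I/Os} \approx 50\ \text{seconds} \\
\text{Index NLJ:}       &\quad M + m \cdot C \quad (\text{depends on the per-probe cost } C) \\
\text{Sort-Merge Join:} &\quad \text{sort}(R) + \text{sort}(S) + (M + N) \approx 5{,}850\ \text{I/Os} \approx 0.59\ \text{seconds} \\
\text{Hash Join:}       &\quad 3 \cdot (M + N) = 4{,}500\ \text{I/Os} \approx 0.45\ \text{seconds}
\end{align*}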
Okay, so the main takeaway from all this is that the hash join is almost always going to be better than sorting, unless you know that the output needs to be sorted, or the data you're feeding into it is already sorted on what you want. A good database system, and we don't have time to open up Postgres, but a good database system can actually support all these different join algorithms and can try to figure out which one is best for your particular workload. So Postgres, SQL Server, Oracle, all the major database systems can do sort-merge, nested loop join, and hash joins, and on the fly, for each query, they figure out which one they want to use, and at the end of the day it almost always comes out to be a hash join, unless you have an index already. MySQL, I don't know if it actually supports hash joins; I thought it did, and someone told me it did; we can check the documentation, it might be in version 8, but I don't know. And SQLite, I actually think SQLite might only support nested loop joins.

Okay, so next class, and I've sort of been alluding to this all throughout today's lecture, we'll talk about how the system actually decides which algorithm it wants to use per operator. And it gets even more complicated when we talk about joins, because it's more than just deciding whether I want sort-merge versus hash join; it's also making decisions about what should be the inner table versus the outer table, and how I should order my join operations. The way to think about this: if I have to join tables A, B, and C together, should I join A and B first, or B and C first, or A and C first? All these kinds of things can make a big difference in performance, and they may overshadow whatever benefit you get from using sort-merge versus hash join. Right, so, no class on Wednesday, I will post the midterm study guide later in the week, and Project 2, Checkpoint 1 is due today.