Okay, so the topic for today's lecture is join algorithms. This is one of my personal favorite topics. I can probably say that about every lecture in the course, but this one in particular I'm a fan of. Before we get started, we'll of course look at the administrative stuff that's relevant for you coming up. Homework number two was due last night at 11:59 p.m. If you haven't turned it in yet, you can still turn it in, but remember that late days don't apply to homework assignments, so you'll get, I think it's 25% off per day that you turn it in late. Project number two is out and it is due on Sunday, October 17th at 11:59 p.m. It is the hash index project. I think that's going to be due right after the midterm exam, so you want to prioritize your time to make sure you can study for the midterm and also get the project done. Just watch your scheduling on that. The midterm exam is scheduled for Wednesday, October 13th, during the regular class time from 3:05 to 4:25 p.m. here in this room. It's going to be open book, open notes, so you're free to bring in and use whatever material you want for the exam. Please try to be reasonable about it. Don't kill the environment printing out every single slide from every single lecture; just be reasonable with what you bring, but it is open book and open notes. It's going to include all the material that we cover up to the midterm, so everything in all of the lectures and in all of the assigned chapters for reading from the book is fair game for the questions on the midterm. If you have any more questions, you can see the Piazza post for more details. I think there's a link in there to a summary page that summarizes all of these points. So before we jump into the lecture material, are there any questions about any of this administrative stuff? Sorry? Laptops. The question is whether laptops are allowed. No laptops, no phones, no electronic devices like that. You can bring a calculator because you may have to do log calculations and that kind of thing, unless you can do them in your head, which would be incredible, but you're free to use a calculator. Other than that, no laptops, no cell phones, no internet connections in general. Are there any other questions? Okay. So let's start with just a recap. Why do we need to join? Why are we talking about this topic today? Why are joins important? Remember that when we're designing our database, we're normalizing tables in the relational format in order to avoid unnecessary repetition of information. This is something like, for example, splitting up students and enrollments. We could repeat the student name, the student graduation year, and everything else every single time an enrollment record appears in our dataset, but that doesn't make much sense from a storage perspective. We'd have to duplicate the student name and whatever values are associated with the student every time the student has a new enrollment in a new course. So for that reason, we split them up using the relational model: we have one student record stored in one place, with all the information about the student stored one time, and then we can join it up with an enrollment table in order to get all of the enrollments for that student.
So the way that we're going to do this in our DBMS is to use the join operator to reconstruct the original tuple: all of the values for a student plus all of the values for an enrollment, without any information loss. Using the join operator, we can combine those tuples to give a full result set without having to duplicate things. When the NoSQL systems were coming out, they said, okay, well, joins are slow, so we're not going to have them in the beginning, and you have to engineer your application or your storage around the types of queries you want to do. Today, if you fast forward and look at what they're offering, pretty much every NoSQL system has some form of join support. So join operations are an incredibly important operator for database management systems. Just as a statistic, there's one study of OLAP systems that found that nearly 80% of query runtime is spent doing joins. So it's really important that we have an efficient implementation of the join operator and that we get all of the ancillary decisions around it right. In today's lecture, we're going to talk about the different types of join algorithms that we have available to us. In particular, we're going to focus on binary joins, meaning there are two input relations or tables to the join. There are arbitrary multi-way joins, but they're typically not very practical to implement in real-world systems, so they exist primarily in the research literature. You don't have to worry too much about multi-way joins; all the joins we're going to talk about in this course have only two inputs, two input tables or relations. And specifically, we're going to be focused on inner equijoin operators. You can have an arbitrary join comparison condition: for example, if you want the elements from one table that don't join with another table, that's an anti-join; if you want an arbitrary join condition, for example greater than or less than, that's a theta join. So there are all these different types of joins out there, but for the purposes of this class, we're just going to focus on the inner equijoin algorithms. Those are by far the most common; they are what you usually get when you have a primary key / foreign key join. Probably the next most common one is the left outer join, but we'll talk about that in later lectures. So right now we're restricting everything to the inner equijoin, though these techniques can of course be generalized to support other situations. In general, as a rule going through all of the algorithms today, we want the smaller table to always be the left table, or the outer table, in the query plan. And by the smaller table, I don't necessarily mean the table with the fewest tuples or records, although that could turn out to be the case. What I mean is the table with the fewest pages. You could conceivably have a table that has more tuples, but if they're very small, you could end up with fewer pages versus a table with fewer tuples but more pages. So just keep that in mind as we talk through the different algorithms. Now, just a quick refresher on what join operators look like in a query plan. We have this really simple query here. It's joining two tables, R and S, on their shared key, r.id = s.id.
There's some selection condition on S that's filtering for values greater than 100, and then we're asking for r.id and the s.cdate field back. In the corresponding query plan down at the bottom, we have a table scan over R and a table scan over S. The tuples from the S table scan get sent to the selection operator, which filters out some values, and then both the tuples from the R table scan and the filtered S tuples go to the join operator, where we match up tuples that belong together. Finally, we do the projection step at the top of the plan. So there are two types of decisions we need to make when executing this plan. First, what is the output of the join operator going to be? By output, I don't mean the final output of the query; think about what the join operator is going to emit to its parent operator in the query plan tree. We need to figure out what this join operator in the tree is going to return every time it finds a match. The second thing we need to decide on is a cost analysis of the different join algorithms that we have. We have to determine when one join algorithm is better than another, and we'll see a bunch of formulas, or cost models, that we can apply to make decisions based on specific circumstances: things like statistics about the data and the other types of operators that we have. We can combine all of that to figure out which join algorithm we expect to be the best in a particular scenario. It's not always 100% accurate. Query optimization is a really hard problem, and choosing among a bunch of different algorithms is really difficult, especially if you have bad predictions about things like statistics for intermediate query results. For example, if you have a bad estimate for how many values are being returned from this filter on S, and your estimates are way off, it may mean that the join algorithm you choose later in the plan is not the right one. We're going to talk about query optimization in way more detail in later lectures, but for now we're just introducing the basic cost models or formulas that we can use to estimate how well we think a particular join algorithm is going to do. So as I mentioned, the first thing we need to decide is what the output of this join operator is going to be. Again, the join clause in the SQL query is highlighted there, and so is the join operator in the tree. Every time we have a matching tuple from R and a matching tuple from S, when the join attributes match, we need to somehow concatenate them together and return a new tuple that represents their joined representation. So what kind of factors go into this decision? Well, we need to take into account our processing model: whether we're returning, for example, one tuple at a time, or several, like a vector or an array of tuples. That can impact our decision. It's also going to depend on the storage model. If we have a row store, for example, then we can get all of the tuple values at once from both R and S, and we can stitch them together, or concatenate them, and return them from the operator. Otherwise, if we have a column store model, then we might want to do this late materialization idea.
We talked a little bit about early versus late materialization in the last lecture, and the same idea applies here, where you might want to just store the key plus a pointer, basically a record ID. And then it's also going to depend on the data requirements of the query. In this particular case, we only need these two columns in the final projection, so we could apply some early filtering of attributes. That's called projection pushdown: we apply our projection earlier on in the query plan. Again, we'll talk about that later in the query optimization lectures, but the high-level takeaway is that there are all these different aspects of the design of the system that we need to consider when deciding what exactly the output of the join operator is going to be. So I mentioned one option was early materialization versus late materialization. Early materialization, just to reiterate, is copying the values of the attributes from the outer and inner tuples. In a binary join, you have an outer relation, usually drawn on the left-hand side, and an inner relation, usually drawn on the right-hand side, and we combine those tuples together into a new output tuple: we copy both of them into some contiguous buffer and return it as a new output tuple. The advantage of this approach is that you never have to go back later and fetch columns using a stored record ID or pointer; you have everything right there that you want. And you can push the projection down from the top of the query so you don't necessarily have to copy redundant fields. But there's this trade-off between having to copy data around versus going back and looking it up later. It's probably a particularly bad idea for wide tuples: if you have lots and lots of columns, especially ones you're not going to use, you do excess copying. So just as an example of what this early materialization approach would look like, let's say we want to do a join of these two tables. We're going to end up with essentially this output from the join operator: we take the values from the outer relation, which is R in this case on the left-hand side, and the inner relation, which is S, and we merge them together into one contiguous tuple for each match. This output, this intermediate result, is what's being passed from the join operator to the projection operator above it. So again, the advantage is that subsequent operators, like the projection here, never have to go back to the base tables to get more data; you have it all at once. Yes? So the question is, do you store the intermediate results in the buffer pool? Yes, you store the intermediate results. All intermediate results are materialized during query processing in the same way that outputs are in general, anything that needs to be materialized. We'll talk about different execution models in, I think, the next class. But if you're producing, for example, one tuple at a time, that's called a pipelined or tuple-at-a-time processing model, and you don't need to materialize the full intermediate result; you can just produce one tuple at a time. If you have something that's blocking, for example, then you need to materialize the full result.
So for example, if you're doing a sort, or you're building a hash table, which we haven't gotten to yet, then you need to materialize that result, and you reuse the same buffer pool mechanisms that the rest of the system is using. Yeah? [A student asks, mostly inaudible, how the system tells apart the tuples produced during query processing from the tuples that come from the original data.] So the question is how the buffer pool differentiates between these ephemeral tuples that are just used during query processing, the intermediate results, versus persistent tuples that are actually stored in our database. It's the same way that the buffer pool and the file manager manage all of the files on disk and bring pages in and out of memory. You can think of an intermediate result that grows larger than memory as a temporary file that's going to spill to disk. And this is one of the benefits of the relational model, or relational algebra: every intermediate result can be treated like a base table. Every output is equal in the relational model, so you can reuse the same mechanisms, the buffer pool, the file manager, all of that, to manage intermediate results the same way you would a base table. Yes? So the question is, why do we need to push down the attributes from the projection; can you just use the tuple IDs? Okay, that's a great idea, and it's called late materialization. Basically, rather than copying around all the attributes that you need, you perform late materialization, going back and fetching them based on the record ID. You get all of the keys that match, you keep the pointers or record IDs as the values, and then you can go back later and look up, say, the s.cdate there. Does that answer the question? Okay. So the reason why you might prefer one over the other, as I said, is going to depend on all of those factors, starting with the base storage model, whether you're in a row-based or column-based format. If you're in a row-based format and you have particularly wide rows, and you wait until the end to do this late materialization, then you're going to have to go back and fetch the whole row. And if the rows are, for example, variable length, you may have to scan through and decode them to pick out just the attributes that you want. So if you're in a row-based format, it might be easier to do the copying up front. Again, there are these different trade-offs you need to consider when deciding whether you want to do early or late materialization. Late materialization shows up in a lot of column store systems. There were all these DBMSs that came out, I don't know, 15 or 20 years ago: MonetDB, Vertica, a bunch of others. Basically, the idea was that all of the data is stored strictly in columns, and you want to materialize the attributes you're going to need at the end as late as possible.
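To make the contrast concrete, here is a tiny sketch of what an early- versus late-materialized join output for one matching pair of tuples might look like. The column names and the record-ID format are made up for illustration; they are not from the lecture's example tables.

```python
# Minimal sketch of early vs. late materialization for a single join match.
# Column names and the (page, slot) record-ID scheme are invented for illustration.

r_tuple = {"id": 100, "name": "Andy", "year": 2024}          # row from R
s_tuple = {"id": 100, "cdate": "2024-10-04", "value": 999}   # row from S
r_rid, s_rid = ("R", 0), ("S", 3)                            # record IDs (page, slot)

# Early materialization: copy all attribute values from both sides into one
# contiguous output tuple. Later operators never go back to the base tables.
early_output = {**r_tuple, **{"s_" + k: v for k, v in s_tuple.items()}}

# Late materialization: emit only the join key plus record IDs. The projection
# operator fetches s.cdate later by following the record ID back into S.
late_output = {"id": 100, "r_rid": r_rid, "s_rid": s_rid}

print(early_output)
print(late_output)
```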
So again, you're just storing the key plus some record ID that lets you get back to the other attributes and look them up. What this is going to look like, similar to the early materialization example, is that we're joining these two tables and we produce the keys as well as the record IDs for each of those tuples. Then, in the projection stage, we can go back and look up the values that we need. In particular, we need this cdate value, so we use the record ID that we stored to go find it. As I mentioned, this is particularly useful for column stores, because when we access the join keys early on, we're not also accessing the other attributes; we can go look them up later. Whereas in a row store system, at the same time that we're accessing the join keys, we're already getting all those other attributes, so it might make sense to just grab them all at once. So that was the first piece, what we're going to output from the join operator. The second piece is the cost analysis criteria. We have to have a cost model, some formulas, for deciding when a particular algorithm is going to be better than some other algorithm. Since we're just talking about binary joins here, two inputs, we'll use the notation R and S: R is one input, S is the other. There are M pages in table R with little m tuples, so big M is the number of pages and little m is the number of tuples. Similarly for S, there are N pages, and little n is the number of tuples. The cost metric we're going to compute boils down to the number of IOs, the number of disk reads and writes that we need in order to compute the join. Remember, we're assuming everything is bigger than memory. If everything fits in memory, you can use whatever your favorite sorting or joining algorithm is; it doesn't matter much. But when we're larger than memory, there can be extremely large differences in the runtime of different algorithms depending on which one we choose and how we parameterize it. We'll see some examples where joining two tables one way can take hours and another way can take less than a second. So it's really important that before we pick a join algorithm, we look at what the expected cost of that algorithm is going to be. For now, we're going to ignore the output cost of the join, since that depends a lot on the data, and we're not talking about whole-query optimization yet, just picking a particular join algorithm. Whatever output format you choose, you have to return the same amount of data from the join anyway, so right now we're just worried about the input costs of computing the join operation. One quick note about joins versus the cross product operation. I think I mentioned, probably in the very first lecture, that you can think of a join conceptually as a cross product, or Cartesian product, where you get all of the pairwise combinations of tuples and then apply a selection afterwards to filter out the ones that don't match. This is useful to think about conceptually, but from a performance standpoint, you never actually want to do that.
Because if you think about what's happening, you have an n-squared operation: you have to enumerate all of the possible combinations and then filter them. So really, what we'd like to do is some kind of direct join, where we find all the matches with as little work as possible. We don't have to materialize the full cross product or Cartesian product and then filter it; we can just get the matches in an online fashion. There are many algorithms for how we can execute the join, but there's no one algorithm that works well in every single scenario. So we're going to talk about a whole bunch of different options here; these are the algorithms we're going to cover today. We'll see when you might prefer one over another and, using the cost models we develop for each, how we can pick which one we expect to be better in a given situation. We're going to start with the simplest algorithm, the nested loop join, and some different variants of it. Then we'll talk about the sort-merge join and the hash join. Before we go into the nested loop join, are there any questions about the high-level join material so far? Okay, so the nested loop join. This is pretty straightforward. It's called the nested loop join because there are two loops: one loop for the outer, or left-hand side, relation, here that's R, and one loop for the inner, or right-hand side, relation, here that's S. Basically, we iterate over R, and for every tuple in R, we iterate over S. So for every single tuple in R, we look at every single tuple in S, and if there's a match, we emit that match as a result of our join operator. As I mentioned, the outer relation doesn't necessarily have to have fewer tuples, but it should have fewer pages. In this case, the R relation actually has more tuples than S, but imagine it fits into fewer pages; that's why it's on the outside, and S is on the inside because its values are a little wider. So why is this bad? What's wrong with this algorithm? It works, it's simple, but what's the problem here? Yes, the time complexity is n squared, right? And not only is it n squared, but there are going to be multiple passes over multiple disk pages, which we'll get to in a second. So again, we have these outer and inner loops, and what's going to happen is that for every single tuple in R, we scan S once. The algorithm doesn't know anything about disk pages or locality; it's basically a brute-force search. For every single tuple in R, we have to do a complete scan through S. So the cost function this gives us is the number of pages in R, big M, because we have to look at each page of R one time, plus little m, the number of tuples in R, times N, the number of pages in S, because we do a complete scan of S for every single tuple of R. So the cost is M + (m × N). These numbers are kind of abstract; we can talk about the time complexity, which is fine, but if we want to put it into real numbers, we can take this example database here.
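Before plugging in the numbers, here is a minimal sketch of the simple nested loop join in code. It assumes each relation is just a Python list of pages, where each page is a list of (key, payload) tuples; that layout is only for illustration, since a real DBMS would be pulling pages through the buffer pool, which is exactly where the M + (m × N) disk accesses come from.

```python
def simple_nested_loop_join(R, S, key=lambda t: t[0]):
    """Simple nested loop join: for every tuple in the outer relation R,
    do a complete scan of the inner relation S looking for matches.
    R and S are lists of pages; each page is a list of (key, payload) tuples."""
    for r_page in R:                 # read each page of R once       -> M page reads
        for r in r_page:
            for s_page in S:         # full scan of S per outer tuple -> m * N page reads
                for s in s_page:
                    if key(r) == key(s):
                        yield r + s  # emit the concatenated match
```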
So again, the two tables are R and S. R has, let's say, M = 1,000 pages and m = 100,000 tuples. S has N = 500 pages and n = 40,000 tuples. If you assume a commodity SSD that does roughly one-tenth of a millisecond per IO, then plugging in those numbers we end up with roughly 50 million IOs, which at 0.1 milliseconds each gives a total runtime of about 1.3 hours. Now say we switch the order of the tables. I said we should prefer the smaller table on the outside, as the outer part of the join, and we see here that S has fewer pages than R. If we switch them around, we get about 40 million IOs, which comes out to about 1.1 hours. So just by making this one simple switch, deciding that S is going to be on the outside instead of R, we can already start to speed things up. The difference here isn't huge, going from 1.3 hours to 1.1 hours, but just by making this one simple choice differently, we end up with a better plan. And we'll see this theme play out through the other algorithms: if we make different decisions, we can get vastly different runtimes. Yes? So the question is why the big M, or the big N, shows up in the cost at all. In this particular case, you're right, the inner term completely dominates the cost. But imagine that there's, I don't know, only one page in R and not that many tuples; you could end up with a very different balance of reads on the outside versus the inside. Does that make sense? And if we change the numbers around, say the tuples are much bigger so you can fit many fewer tuples per page, then in the extreme case of one tuple per page, big M equals little m. Yes, if little m times N, or little n times M, is large, then that term dominates the cost, but if we plug in different numbers, we could end up with a different outcome. Does that make sense? Okay. One other thing before we move on: this all assumes that there are two buffers for streaming the tables and one for storing output. Similar to how we looked at the sorting algorithms in the last class, we assume there are two input buffers, one to hold one page from R and one to hold one page from S, plus one output buffer. These formulas all assume that model. So if we think about what's happening here, with four-kilobyte pages this data comes out to about six megabytes, but it's still taking us over an hour to process. Just by choosing this really bad algorithm and having to do all these disk IOs, we end up with a really terrible runtime. The way we can get around this is by performing what's called a block nested loop join. Rather than doing a full scan of the inner relation for every single tuple in the outer relation, we break the outer relation up into blocks, where a block is the size of a disk page. So essentially the algorithm goes from two nested loops to four nested loops: for each block in R, we look at each block in S, and then for each tuple in the R block, we try to match each tuple in the corresponding S block.
And if we get a match, then we emit it as an answer. Again, we have the same M pages and little m tuples for R, and N pages and little n tuples for S. This algorithm lets us perform fewer disk accesses: rather than scanning S once for every single tuple in the outer relation, we scan S once for every block, every page, of the outer relation. So the cost of this becomes M + (M × N), which reduces the number of disk IOs we're performing. Again, the important piece here is that the smaller table should always be the outer table. If you reverse it and put the table with more pages on the outside, you end up with more unnecessary disk IOs. And we determine the size of the table based on the number of pages, not the number of tuples. Just as an example, with the same parameters as before, if we plug them into this cost model, we end up with something that takes about 50 seconds. So we've gone from over an hour, whichever table we put first in the basic nested loop join, down to about 50 seconds with the block nested loop join. That's a huge improvement. It's still not great; 50 seconds for six megabytes of data is still a long time, but it's much better than what we started out with. And we can generalize this approach. In the previous example, each input got one buffer page, plus one page for output. If we have more space in our buffer pool and can allocate multiple buffers for the join, say we have B buffers available, then we use B - 2 buffers for scanning the outer table, in this case R, one buffer for the inner table, and one buffer for the output. The reason this is better is that we want to buffer as much of the outer table in memory as possible. If you think about the cost function, the way we reduce it is by bringing in more pages of R at a time, which cuts down the M × N term, the number of passes over the S side of the join. So now the algorithm looks like: for each chunk of B - 2 blocks from R that we can fit in memory at the same time, for each block in S, we iterate over the R tuples in the chunk and emit the matches. In terms of the cost function, using B - 2 buffers we end up with M, because we still have to read all of the pages of R once, plus the ceiling of M divided by (B - 2), since the number of pages may not divide evenly into B - 2, times N. Yes, sorry, wouldn't it be the same as... so the question is, if you had enough space in your buffer pool to hold the entire outer relation R, then it's just one scan of each table, so it would be M + N. Yes, it would be M + N, right.
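Here is a minimal sketch of the block nested loop join with B - 2 buffers for the outer table. As before, relations are plain Python lists of pages of (key, payload) tuples standing in for real buffer pool pages; the output buffer isn't modeled.

```python
def block_nested_loop_join(R, S, B, key=lambda t: t[0]):
    """Block nested loop join with B buffer pool pages: B-2 pages hold a chunk
    of the outer relation R, one page streams the inner relation S, and one
    page is reserved for output (not modeled here).
    Disk cost: M + ceil(M / (B - 2)) * N page reads."""
    chunk = max(B - 2, 1)                      # at least one page for the outer chunk
    for i in range(0, len(R), chunk):          # each chunk of B-2 outer pages
        outer_block = [t for page in R[i:i + chunk] for t in page]
        for s_page in S:                       # one full scan of S per chunk
            for s in s_page:
                for r in outer_block:
                    if key(r) == key(s):
                        yield r + s            # emit the concatenated match
```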
I think the tricky part here is to think about this in terms of disk accesses, or disk IO, rather than traditional algorithmic complexity, because while the big-O runtime might be the same, the number of disk accesses you end up with is what causes these huge divergences in total query execution time. You have to be mindful of minimizing the number of disk accesses rather than the total number of operations you're performing. Okay, sorry, are you asking about the difference between this and the single-buffer case, or the difference between this and the basic nested loop? Why do the buffers need to be the same size as a page, is that the question? Okay, so the restatement of the question is: why do the buffers need to be the size of a page, and if you use a buffer that fits all of the pages, is it the same as the previous algorithm? In the original formulation we had earlier, we assume we only have one buffer pool page assigned to each input. The outer part of the join gets one page to bring in input, the inner side gets one page to bring in input, and there's one page for the output, so three pages are available. Then, if you want to expand the number of pages allotted here, you can go from that version, which is the IO cost using just those three buffer pool pages, to this version using B - 2 buffer pages. And if B is large enough to fit the entire outer table in memory, then yes, it reduces to just one pass on that side. Are there any other questions about this? Okay, so, well, there you go: what if the outer relation completely fits in memory, so that B is greater than M + 2? Then the cost reduces to this, M + N, and you end up with about 0.15 seconds. So if you can get the buffer pool large enough to fit R, the outer part of the join, entirely in memory, you can go from, I think it was 50 seconds, down to 0.15 seconds. This shows the effect of the buffer pool size: yes, you have to use more resources, but just by allocating more memory for this join operation, we get another two orders of magnitude reduction in runtime. Okay, so why is the basic nested loop join so bad? For each tuple in the outer table, we have to do a sequential scan to check for a match in the inner table; it basically reduces to a brute-force search, where you have to scan everything to find all the matches. One way we can avoid this is by using an index to find matches in the inner table. If we already have an existing index built on whatever we're joining on, and usually, if it's a primary key / foreign key join, chances are we'll have an index built on it, especially because we need it to maintain the primary key integrity constraint that all primary keys are unique, then we can leverage that index in order to perform the join.
So the way that's going to look is this: we're assuming we have an index built on the inner side of the join, S, and for each tuple in the outer table R, we probe into the index and ask whether it contains a match. If this is a hash index, the lookup time complexity is O(1); if it's a B-tree, the search time complexity is O(log n). So for every tuple in the outer relation, we look up in the index to see if there's a matching key in there. This is tricky to express as a specific cost function, because the cost of an index probe is difficult to quantify, so we use a constant C. If it's a hash index, it might be one thing; if it's a B+tree index, it might be something else. This constant C really has to be determined on a case-by-case basis. If the entire index fits in memory, that's very different from the index being much larger than memory, where we might have to do a bunch of disk accesses. It also matters whether the lookups are sorted or not: doing sorted lookups in a clustered index is very different from doing random lookups into a hash index where values might be on completely different pages; in the worst case, you have completely random IO. So that's why the cost function here is a little vague: it's little m, the number of tuples in the outer relation R, times C, the cost of performing an index lookup (plus the M pages to scan R itself). In any case where there's a problem on the homework or the exam and you have to compute something like this, we'll give you the constant C for the cost of an index lookup, because you can't know ahead of time how much it's going to cost. So, to summarize the different nested loop join techniques, the key takeaways are: you always want to pick the smaller table as the outer table; you want to buffer as much of the outer table in memory as possible, and we saw that going from one input buffer for each of the two tables to multiple input buffers for the outer table gave a huge speedup; and you want to loop over the inner table, or use an index if one is available and the trade-offs make sense. The different algorithms we talked about are the simple or basic nested loop join, the block nested loop join, and the index nested loop join. Yes? So the question is, for the index nested loop join, do you build the index from scratch, or are you just taking advantage of a pre-existing index? Almost always, you're using a pre-existing index. If you have a table that already has an index on it, then you leverage it. There may be certain narrow cases where it makes sense to build an index from scratch; in fact, as we look at some of the later algorithms, in particular the hash join, that's basically what the hash join is doing: it builds up a hash table data structure, which is an index. It builds an ephemeral hash table to do the join and then throws it away after query processing is done. There may be some unique cases where it makes sense to build an index on the fly, and if you do lazy index construction, you could build it to do the join and then keep it around for future use. But for this particular algorithm, the index nested loop join, it's almost always a pre-existing index that you're leveraging. Building a data structure on the fly is what the next algorithms we look at will do.
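Putting the index variant into code form, here is a minimal sketch. The "index" here is just a Python dict mapping a join key to the list of matching S tuples, standing in for a real hash or B+tree index; the flat-list layout of R is, as before, only for illustration.

```python
def index_nested_loop_join(R, s_index, key=lambda t: t[0]):
    """Index nested loop join: for each tuple in the outer relation R, probe a
    pre-existing index on S's join key instead of scanning S.
    Cost is roughly M + m * C, where C is the (index-dependent) cost of one probe."""
    for r_page in R:                             # M page reads for the outer relation
        for r in r_page:
            for s in s_index.get(key(r), []):    # one index probe per outer tuple
                yield r + s
```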
So, before we move on to the sort-merge join, does anyone have any final questions about nested loop joins? Okay, the next one we're going to talk about is the sort-merge join. I know last class we talked about the external merge sort algorithm, which makes this a little confusing, but they're related in the sense that the first phase of the sort-merge join is to sort both of your inputs. For that phase you can use any algorithm: if the data fits in memory, you can use any sorting algorithm you want; otherwise, for the sort phase, phase number one, you may use the external merge sort that we discussed in the last class. Basically, whatever key you're joining on, you want to sort both of your inputs on that key. That's the sort phase, and then the second phase is the merge phase, where we step through, or iterate over, both of our sorted tables and emit matching tuples. There are a few cases where we may need to backtrack, but unlike the nested loop algorithms, where we're always iterating over values on the inner side of the join for every value on the outer side, in this n-squared fashion, here, because the inputs are sorted, we can just scan forward through each of them, finding matches along the way. Let's start by looking at the code. I don't want to spend too much time on this, since it's confusing to look at, but the basic idea is: first we sort, so we get these two sorted cursors over R and S, and then, while we have more values to look at, if the value at cursor R is greater than the value at cursor S, we increment cursor S, because that means it's not a match; if cursor R is less than cursor S, we increment cursor R; and if they match, then we emit the match and increment cursor S. Let's look at a visual example, because I think it'll make more sense. We have our same two tables, R and S, and the first step is to sort, so let's get these in sorted order. Okay, now both of these inputs are in sorted order; the id key is ascending. We start a cursor at the beginning of each of our sorted inputs. In this case, we see the key from the R table and the key from the S table are both 100. In our algorithm, that's a match, so we append it to our output buffer and then increment the cursor on S. Now we move down here. Again, R and S match, they both equal 100, so we emit the joined tuple and put it in our output buffer. Now we come down here and we see, okay, S is 200 but R is only 100, so that's not a match, and we need to increment the cursor on the outer relation. We get to 200, we see that this is a match, and we append it to our output buffer. Now remember, in our algorithm, when we get a match, the next step is to increment the S cursor.
So we advance the S cursor to the next key. We see, okay, these don't match, so we increment our R cursor. Now there's a problem: R is 200 and S is 400, but we already saw 200 on the inner side, the S side, of the join. So we need to do the backtracking I mentioned: we scan back to the first appearance of the previous value. In this case there's only one 200; if there were multiple 200s, we'd have to scan back to the beginning of the 200 range in the sorted table. So we backtrack to 200, and now we can emit another match there. Then the algorithm proceeds: we increment the outer side of the join, there's no match between 300 and 400, so we increment again; there's a match, so we can output it. The algorithm proceeds, we output the 500 match, and now we get to the end of the S cursor. Once we get here, we still have to consider this last R tuple, because there could be another 500 key on the outer side of the join, in which case we'd have to backtrack on the inner side. But since we've gotten here and we know that 600 is strictly greater than the previous value we saw in S, we can be done. Or you can just let the algorithm run; it'll scan to the end and finish. But you can stop early when you know that you've exhausted all of the inputs from one side and there are no more possible matches on the other side. The reason is that because they're in sorted order, we know there can't possibly be a match once we get past the range covered by the S keys: 500 is the maximum key, so once we get beyond that on the R side, we're done. Are there any questions about how this algorithm works? Okay, so the cost functions for this are pretty straightforward. It's going to be the cost of sorting R, plus the cost of sorting S, and then the merge devolves to M + N, since we just have to do one scan through each. In theory, if you have to do this backtracking, then depending on how far back you have to go, you could end up returning to a previous page that you already looked at; but for simplicity we'll consider the merge cost to be M + N, the optimistic scenario where you don't have to backtrack often or go back to previous pages. So the total cost of this algorithm comes down to the cost of the sort phase plus the cost of the merge phase. Using our example database, if we go through this whole procedure with our 0.1-millisecond IO estimate, the total time to do this join with the sort-merge join algorithm is 0.75 seconds. Again, that's a lot better than the block nested loop join, which, without the assumption of fitting entirely in memory, was around 50 seconds. We get down to 0.75 seconds just by changing the algorithm. Now, the worst case for the merge phase is when the join attribute of all the tuples in both relations contains the same value. Then it basically devolves into this n-squared thing, where we have to compare every value in R with every value in S, because they're all the same and you're constantly backtracking. But this is pretty unlikely if you're doing a primary key / foreign key join; usually the distribution isn't like that. So the question might be, at a high level, when is the sort-merge join useful?
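Before getting to that, here is a minimal sketch of the algorithm itself. It operates on flat in-memory lists of (key, payload) tuples rather than disk pages, so it leaves out the external sort; and instead of literally rewinding a cursor, it makes the backtracking explicit by emitting the cross product of each run of equal keys on the two sides, which is the same set of matches the cursor formulation produces.

```python
def sort_merge_join(R, S, key=lambda t: t[0]):
    """Sort-merge join over flat lists of (key, payload) tuples.
    Phase 1 sorts both inputs on the join key; phase 2 merges them with two
    cursors, handling duplicate keys by joining the matching runs."""
    R, S = sorted(R, key=key), sorted(S, key=key)     # sort phase
    i = j = 0
    while i < len(R) and j < len(S):                  # merge phase
        rk, sk = key(R[i]), key(S[j])
        if rk < sk:
            i += 1                                    # advance R: no match possible
        elif rk > sk:
            j += 1                                    # advance S: no match possible
        else:
            # A run of equal keys on both sides: emit every pairing in the run
            # (this is where the "backtracking" over S comes from), then move on.
            j_end = j
            while j_end < len(S) and key(S[j_end]) == rk:
                j_end += 1
            while i < len(R) and key(R[i]) == rk:
                for jj in range(j, j_end):
                    yield R[i] + S[jj]
                i += 1
            j = j_end
```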
We talked a little bit about this in the last class, but one case is when one or both tables are already sorted on the join key, for example if you have a clustered index or something else that maintains sorted order; then your inputs are already sorted, so you don't have to pay the sort cost and you can go straight to the merge phase. Another case is when you need to sort your output on the join key anyway: if you have to sort regardless, you might as well do the sorting up front and use the sort-merge join, since the output needs to be sorted in any event. So again, either you have some explicit operator that needs sorted output, or you have some kind of index enforcing sorted order. Okay, I want to get to hash joins, since that's the last part and they're important and cool, so does anyone have any quick questions about sort-merge? Great, okay, so hash joins. The basic point is that if a tuple from R and a tuple from S satisfy the join condition, then we know they have the same value for the join attributes; all of the matches have matching keys. We can leverage that fact to apply hash partitioning: if we hash a key to some partition i, then we know the tuple from R with that key has to reside in the same partition as the tuple from S with that key. If we apply the same hash function to both relations, all of the matches will end up somewhere in the same partition. Therefore, rather than comparing everything against everything as in the nested loop or block nested loop algorithms we looked at, we know that the tuples in partition Ri only need to be compared with the tuples in partition Si, because tuples with the same key from the two relations are mapped to the same partition.
So basically, we break the hash join algorithm down into two phases. The first phase is called the build phase: you scan the outer relation and populate a hash table, using a hash function h1 on the join attributes, so you build up that hash table. The second phase is called the probe phase: you scan the inner relation and use h1 on each tuple to look up its location in the hash table and find a matching tuple. So on the outer side, you do one complete scan over R to build the hash table, and on the inner side of the join, you do one scan over S, probing into the hash table to try to find matches. Visually, this is going to look something like this. The pseudocode is: build a hash table on R, and then, for each tuple in S, probe the hash table that you built and emit a value if you have a match. Here's the hash table we're constructing; we do a scan on the R side, applying the hash function to populate all of the buckets in the hash table, and then on the S side we step through each of the keys we have, hash them using the same hash function, and see if there's a match in the hash table. The contents we store in the hash table are the key, whatever the join attributes are for the query, as well as some value. What the value is will vary based on the implementation: you have that trade-off between early and late materialization, so you can materialize into your hash table all of the other columns or attributes that you need, or you can just keep around the record ID and look things up later. Again, it varies by implementation, but this is the high-level trade-off, and here are the bullet points about why you'd prefer one over the other. Storing the full tuple avoids extra lookups when you have a match, but it takes up more space; it makes your hash table bigger, so it might require more pages and might not fit as well in memory. On the tuple-identifier side, you get a smaller hash table, but you have these extra lookups. Before I move on, are there any questions about the basic hash join approach, what we're doing here? Okay.
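To make the build and probe phases concrete, here is a minimal in-memory sketch. A Python dict stands in for the DBMS's hash table, both relations are flat lists of tuples, and the values stored are full tuples, i.e., the early materialization choice; all of that is just for illustration.

```python
def hash_join(R, S, key=lambda t: t[0]):
    """Basic in-memory hash join.
    Build phase: scan the outer relation R once and build a hash table keyed
    on the join attribute.
    Probe phase: scan the inner relation S once, probe the table with each
    tuple's key, and emit a joined tuple for every match."""
    table = {}
    for r in R:                                  # build phase: one pass over R
        table.setdefault(key(r), []).append(r)
    for s in S:                                  # probe phase: one pass over S
        for r in table.get(key(s), []):
            yield r + s
```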
So one optimization we can perform is to use something called a bloom filter during the build phase. The idea is that while we're building the hash table, we also populate this bloom filter. It's a probabilistic data structure; I'll explain it more in a second in case you're not familiar, but it's a probabilistic set-membership data structure, and they show up in all kinds of places, not just databases. Basically, it tells you whether or not a value is contained in a set. I'll be more precise about that shortly, but imagine it's a thing you can ask that gives you a probabilistic answer about whether something is contained in the hash table, so that we don't necessarily have to do the hash table probe. It's a lot cheaper and smaller, and it can usually fit in memory, so we don't have to worry about paging out to disk. This is sometimes called sideways information passing, because the way it works is that the build side, call it A, builds up the hash table during the build phase and at the same time also populates this bloom filter, which tells us, true or false, whether a key is contained in the hash table. Once the bloom filter is built up, we pass it over sideways to the B side, the probe side, of the join. Now, rather than always having to probe into the hash table to retrieve values, which can be expensive, especially if we have to go to disk to fetch the hash table's pages, B can instead just ask the bloom filter whether the key is going to be there, and only if it is do we go over and grab it from the hash table. As I mentioned, bloom filters are a probabilistic data structure; you can think of them like a bitmap that answers set-membership queries. I was a bit vague about the truth condition: you will never get false negatives, but you may get false positives. That is, the bloom filter may sometimes say that something appears in the hash table even though it doesn't, but it will never tell you that something isn't in the hash table when it actually is. So it lets you filter out a lot of cases where you would otherwise have had to go look in the hash table; now you can ask the bloom filter and it will say, no, don't bother, the key is not in there. For the algorithm, I'll just show a visual example. Imagine this is the bloom filter here: there are eight bits, which is a very small bloom filter, so it's going to fill up quickly. First we insert RZA: if we apply the hash functions to RZA, we get bits six and four, so we set those to one. Now we insert GZA: we set bits three and one for that. Now we want to look up the key Raekwon: we apply the same two hash functions, which tell us to look at bits five and three. We go look at bits five and three, and because one of those bits is zero, the answer is false; we know that key is not contained in the hash table, because both bits would need to be set for it to possibly be in there. Now for the last key, ODB: we apply the same two hash functions and get bits three and six, and we go look those up. Okay, we have a problem here, because we never inserted ODB into the bloom filter, but the other two inserts we did earlier set those bits independently. So this is a false positive: the filter says true, the key ODB is contained in the hash table, even though it's not. So, a quick summary of the cost analysis... sorry, are there any questions about the bloom filter first?
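Here is a tiny bloom filter sketch matching that example: a bit array plus a couple of hash functions. The hashing scheme (a salted SHA-1 digest reduced modulo the number of bits) is just an illustrative choice, not how any particular DBMS derives its bit positions.

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter: a bit array plus k hash functions. Lookups can
    return false positives but never false negatives."""

    def __init__(self, num_bits=64, num_hashes=2):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions from the item (illustrative scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1              # set every position for this item

    def might_contain(self, item):
        # True only if *all* positions are set; a single 0 bit proves absence.
        return all(self.bits[pos] for pos in self._positions(item))
```

In the join, the build phase would call add() for every key it inserts into the hash table, and the probe side would call might_contain() first and only touch the hash table when it returns True.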
I know I went over that kind of quickly, but I want to try to wrap up in time. Yes? So the question is whether we apply the bloom filter optimization because the hash table might be larger than memory and lookups to disk-based pages are expensive. Yes, correct. Any other questions? Okay, so the cost analysis. The question here is how big a table we're able to handle using this approach. We potentially need to spill partitions in the first phase, and we want each partition to be no bigger than B blocks, so what you usually end up needing is about the square root of N buffers in order to handle this, assuming the hash function distributes the keys evenly; there's some fudge factor in there. If the hash function is skewed, everything goes to the same place and you fill up one partition, so ideally we want the hash function to distribute the keys evenly across all the partitions. Now, if we don't have enough memory to fit the entire hash table, we don't want to let the buffer pool manager swap out pages at random, because the buffer pool doesn't necessarily know about the access patterns of our join query; we specifically want to manage how pages are evicted ourselves during this operation. The algorithm we're going to use here is called the Grace hash join, and it's an algorithm for when the tables don't fit in memory. It's named after the GRACE database machine from the 1980s. The way it works is that you break the join down into a partitioning phase, where you hash both of the tables on the join attribute and put them into partitions, and then a probe phase, which compares the tuples in each pair of corresponding partitions. So rather than hashing just one side and probing it from the other side, we hash both sides of the join and only compare within matching partitions. As a really brief aside, this is what's called a database machine. These were really popular in the 80s and 90s; people were building specialized hardware for databases. I think there were some questions in previous lectures about whether there is specialized hardware for databases: there are vendors that will sell you all sorts of appliances, probably the most famous being Oracle Exadata, but these things cost millions and millions of dollars. So if you have a huge budget and you want purpose-built specialized hardware tuned for a DBMS, you may want to buy some of these things. Anyway, brief aside; let's move on. The intuition of the Grace hash join, sometimes called the partitioned hash join, is that we hash R into some number of buckets and then hash S into the same number of buckets using the same hash function. It looks like this: we do one scan over R to put all of its tuples into these buckets based on their key, and then we do the same thing for S using the same hash function, so it's the same hash function and the same keys, but from S this time, and again we have buckets zero through the maximum number of buckets we want to assign. The way we do the join now is to perform a nested loop join on each pair of matching buckets at the same level, that is, on the buckets that are in the same partition of R and S.
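A minimal sketch of the whole thing, ignoring recursive partitioning for now; the partitions are plain Python lists standing in for spill files on disk, and the particular hash function doesn't matter as long as both sides use the same one.

```python
def grace_hash_join(R, S, num_partitions=8, key=lambda t: t[0]):
    """Grace (partitioned) hash join, single level only.
    Partition phase: hash both relations on the join key with the same hash
    function h1, so matching tuples land in the same partition number.
    Probe phase: nested loop join each pair of corresponding partitions."""
    h1 = lambda k: hash(k) % num_partitions      # same h1 for both inputs
    r_parts = [[] for _ in range(num_partitions)]
    s_parts = [[] for _ in range(num_partitions)]
    for r in R:
        r_parts[h1(key(r))].append(r)            # partition R with h1
    for s in S:
        s_parts[h1(key(s))].append(s)            # partition S with the same h1
    for r_part, s_part in zip(r_parts, s_parts): # only matching partitions compared
        for r in r_part:                         # nested loop join within the pair
            for s in s_part:
                if key(r) == key(s):
                    yield r + s
```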
So again, with the nested loop join algorithm, for every tuple r in bucket R0 we try to match it with a tuple s in bucket S0, and if it's a match, we emit it as an output of the join. Are there any questions about how this works? Unlike the original hash join, we're not just hashing the outer table; we're hashing both, so that matching tuples map to the same partitions, and then we do the nested loop join for each partition. Now, if the buckets don't fit in memory, we use a recursive partitioning strategy to split the tables into chunks that will fit. Here I assumed that each of these partitions fits entirely in memory, maybe in a single page, or, if you have multiple pages in your buffer pool, that whatever the partition size is fits in memory on both sides. If that's not true, you recursively partition the oversized partition using another hash function, h2, which is different from the original hash function, and then we probe the buckets at that level. Visually, it looks like this: let's say we have these partitions, and partition one has more pages than we wanted; it's too big, bigger than some threshold we specified. So we do a second round of hashing for that particular partition, using the different hash function h2 to split it up into sub-buckets. We wind up with the same set of partitions, except that the middle partition is now split into three sub-partitions. Of course, we have to do the same thing on the other side: we apply the first hash function to each of the keys, which is fine for partitions zero and n, but for partition one in the middle, we know we have to apply hash function two to get down to the correct sub-bucket. So we do this recursive partitioning in the same way on both sides. In practice, it typically doesn't go beyond two rounds, so you end up with a depth of at most two, except in special cases, such as when all of the keys in a partition are the same: no matter how many times you keep recursively partitioning it, it's not going to get any smaller, and at some point you have to recognize, based on the statistics of the partition, that all the keys are the same and you can't make it any smaller. Okay, the cost of this algorithm, we'll mostly skip this, is comprised of the partitioning phase and the probing phase, and we can add it up using the same kinds of formulas as before and get to about 0.45 seconds. Just a quick observation about hash-based approaches: if the DBMS knows the size of the outer table, then you can use a static hash table. You can use one of the hashing data structures we talked about where you know the size up front and statically allocate it, versus a data structure that incrementally grows as you put more tuples into it. If you know the size, you should do that, because there's less overhead to pay for the build and probe operations; if you don't know the size, obviously you have to use a dynamic hash table. So, in summary, here are the different join algorithms we looked at, along with their IO cost functions and the calculated times for each of the different approaches. As you can see, depending on which algorithm you pick, you can go from as high as 1.3 hours down to
less than half a second, and if you have that special case where the outer relation fits entirely in memory, it goes down to as low as 0.15 seconds. So it's important to be able to pick the best join algorithm based on how many disk accesses we anticipate. Wrapping up, I think we can see that hashing is almost always better than sorting for operator execution. Sometimes sorting might be better, for example on skewed, non-uniform data where you'd get a lot of collisions in your hash table, or when your results need to be sorted anyway, in which case you might as well sort. Good DBMSs implement both kinds of algorithms, or some combination of them, and they can make different decisions or trade-offs based on the different characteristics we discussed. Next class we will talk about composing operators together. We've been looking at joins, sorting, and aggregation all in isolation; next class we're going to talk about how we actually combine all of these operators together in order to execute a full query. So I will see you on Wednesday.