So, like Jignesh said on Monday, and maybe Andy said last week, I don't know, they're both traveling today. I'm Matt. I'm a PhD student and I work with Andy. The last time I gave a lecture for this class I was a first-year PhD student, and now I'm hopefully a final-year PhD student, so hopefully this is a nice bookend on that process. We'll see. With that said, let's talk about joins today. A little bit of administration to get out of the way first: homework two is due tonight, homework three is due on Sunday, and the midterm is next Wednesday. This lecture is the last lecture with material you will be responsible for on the midterm. There's going to be a lecture next Monday, but it will not be covered on the midterm exam. Make sense? If anything looks wrong here, it's above my pay grade, so you'd have to take it up with the course staff, but these are the dates I was given.

So let's talk about the context for why we need joins in the first place. We pray at the altar of Ted Codd and the relational model in this class. That means we're going to reduce duplication of our information and separate it into relations, into different tables. But when we want to query that information and combine it into something useful, we have to join things back together, so things like customers and orders tables, or students and classes tables, those classic examples you see in textbooks. And if you were paying attention to the database world 10 or 15 years ago, which I'm guessing no one here was, the NoSQL systems were all saying joins are stupid: we should just denormalize all of our data, save a bunch of time, and then we don't need to do joins. It turned out a lot of people who adopted NoSQL systems ended up building joins on the application side anyway, which caused a lot of redundancy and slow systems, and it didn't really work out. So from Andy's perspective, and from our perspective, the relational model won again. Another reason joins are important: analytical systems, OLAP systems (hopefully you've heard that term before), are going to spend 20 to 50% of their time just working on joins. For transactional systems that's not true at all, but for analytical systems this is where they spend a lot of their time. So using the right join algorithm and getting the right join order is probably going to be the long pole in the tent that determines your query runtime. We need to make sure we get joins right.

In the lecture today we're going to focus on a class of joins called binary inner equi-joins. Binary means we're joining two relations. Equi-join means we compare an attribute from each relation, and if there's an equality we emit a concatenated tuple for that match. These algorithms can be modified to support other sorts of joins depending on the predicate you want, so range predicates, anti-joins, and so on. The textbook generally refers to the broader class as theta joins, where you may not be using an equality operator. And like I mentioned, these are binary joins where we only worry about two relations. Multi-way joins exist mostly in the research literature. SQL Server added support for this sort of thing around 1998, and eventually they decided it was a terrible idea.
It made performance unpredictable, and they ripped it back out a couple of years later. As far as we know, RelationalAI is the only system these days that's really still playing with multi-way joins, but you will see them in the research literature. At the bottom here there's a little bit of terminology we're going to reuse throughout this lecture: left tables and right tables, and outer tables and inner tables. Here we're just making the statement that we want the smaller table to be the left table, or the outer table, which doesn't really mean anything to you yet, but it'll start to make sense as we discuss the algorithms in play here.

So we're back to looking at query plans like we've seen before. In the early lectures we talked about how we turn SQL into logical operators, and eventually we turn those into physical operators. In this setup, data starts all the way down at the leaf nodes, at our relations, tables basically. Data flows up from each relation, maybe through a filter on S, and then it reaches the join operator, which does a comparison and emits tuples that satisfy the join predicate. Next week, when we start talking about query execution, we'll talk about the granularity we're working at. I'm saying "tuples" somewhat vaguely because you may operate one tuple at a time or on vectors of tuples, but those are design decisions we'll talk about next week. And at the root of this tree you get the actual results of the query plan.

When we're designing these join operators, we have a couple of decisions to think about: what do we actually output from these operators to their parent nodes, and how do we reason about what these operations are going to cost. These are typically implementation design decisions, not things you change on the fly in a system. When you sit down to design how data is going to flow through these query plans, you have to reason about what the inputs and outputs of these join operators are going to be. At a very high level, take for example this join on ID between relations R and S. The output can vary based on the processing model, tuple-at-a-time versus vectors. It also depends on the storage model. I think earlier in the class you talked about NSM versus DSM; can anyone remind me what the other term for those is? Yeah, exactly: row stores versus column stores. Depending on how that data is organized and how it's flowing up from the base tables, that also informs how you implement these join operators. The last consideration is the data requirements of the query: what operators sit above these joins also changes what you want the join to actually produce.

So let's talk about the first design decision, which is what comes out of these join operators. The first design choice is an option called early materialization.
The idea is you have table R and table S, we do a join on ID, and we materialize all of the values. We do the comparison, and wherever the IDs match we create our output tuples and send them up through the operator tree. They continue to flow up to the projection, which then projects out the information we actually want. The nice thing about this is you never have to go back to the base tables to get your data again: you make one trip to storage and start sending tuples through the query plan toward the root node to produce your result. Like all things in database systems, there are trade-offs to this design decision. It could be a bad idea if your tuples are extremely wide, with a ton of attributes, because you're copying potentially more data than you actually need if few tuples satisfy the join, or if a projection closer to the root of the plan tree is going to throw most of it away anyway. If the table is wide but only one tuple matches, it's not that big of a deal, and there are also opportunities to push projections down to reduce some of that waste. The key idea for early materialization is that subsequent operators never have to go back to the base tables. You get predictable I/O: you only need to know the size of your tables, because you scan them once, materialize your data, and send it up through the query plan.

The other option is late materialization. In this scenario we have tables R and S again and we do the comparison, except this time we only output record IDs, something like a Postgres tuple ID, some unique identifier for which tuple in the base table satisfied the join; we don't materialize all the values yet. As these flow through the plan, at the end we see we actually need this cdate field from base table S, and we have to go all the way back to storage to get it. You have the tuple IDs, you know what satisfied your join and what you eventually want to output at the root of the query plan, but you have to go back to the base tables for the values. This was pretty common and popular with column stores maybe 15 years ago, because for them it made more sense to only look at the data they needed: just rip through the column they were evaluating the join on, and then go back and materialize the data they actually needed at the very end. In practice, I won't say this is less common now, but we've heard from at least one of the big early column stores, Vertica, which was the commercialized version of C-Store, that they initially did late materialization because it seemed like a great idea, and then, similar to the multi-way join story, something the research community thought was a good idea turned out to be hard to predict and reason about in practice, because you don't actually know what your total I/O is going to be until you get to the output.
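To make that distinction concrete, here's a rough Python sketch of what each flavor of join operator might emit. The toy tables, column names, and the idea of using a list index as a record ID are all invented for illustration, not how any particular system lays this out.

```python
# Toy relations: lists of dicts stand in for pages of tuples, and the list
# index plays the role of a record ID. Column names are invented.
R = [{"id": 100, "value": "r0"}, {"id": 200, "value": "r1"}, {"id": 300, "value": "r2"}]
S = [{"id": 100, "cdate": "2024-01-01"}, {"id": 200, "cdate": "2024-01-02"}]

def join_early(outer, inner):
    # Early materialization: copy every attribute into the output tuple,
    # so operators above this join never go back to the base tables.
    for r in outer:
        for s in inner:
            if r["id"] == s["id"]:
                out = dict(r)
                out.update({k: v for k, v in s.items() if k != "id"})
                yield out

def join_late(outer, inner):
    # Late materialization: emit only (record ID, record ID) pairs; whoever
    # needs a value later has to fetch it from the base table again.
    for r_rid, r in enumerate(outer):
        for s_rid, s in enumerate(inner):
            if r["id"] == s["id"]:
                yield (r_rid, s_rid)

# A projection at the root of the plan over the late-materialized stream:
for r_rid, s_rid in join_late(R, S):
    print(S[s_rid]["cdate"])   # the "trip back to storage" in this toy model
```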
It's hard to predict because you don't know how many tuples will satisfy the join predicates and any other filters, so at the end you have to go back and do I/O that's hard to estimate ahead of time. And going back to storage, despite storage getting faster, is getting harder in the era of disaggregated compute: we're pushing storage off to separate storage nodes in cloud environments, so now you're often going across the network to get your data instead of just to a local disk. So early materialization is probably more common these days. That's early versus late materialization when we're looking at table scans for joins.

Question: if you have to go back to storage, does that data get stored somewhere during late materialization, or is it just thrown away and you have to grab it again at the very end? Right, so the question is where cdate gets materialized during query evaluation, and whether it's thrown away at the end. That's a system design decision. A knob you can typically set in database systems is how much working memory a single query gets to use as scratch space; queries typically use their own scratch space for that information, and then it's lost unless you explicitly want to keep it. There are things called views and materialized views, which I don't think Andy has talked about yet, where if you know you're repeatedly doing the same sort of query and you want to maintain that information, you can create those in the database system. If we don't talk about it in 445, we talk about it in 721 for sure.

Another question, essentially: shouldn't late materialization always be less I/O than early materialization, at least for a column store? Maybe, but round trips and latency hurt. It may be that early materialization wins because you get the benefits of prefetching: you rip through all the data, send it along, process it all at once, keep it around in memory, and you typically get more cache locality, as opposed to making round trips back and forth to storage devices, where latency becomes the pain point rather than sequential I/O. Interleaving that work gets trickier too, and so does reasoning about contention in the system. With early materialization your query rips through the tables, materializes its information, and is done; it probably won't have to go back and hammer storage as hard for base table information, whereas late materialization keeps ping-ponging back and forth: okay, I need this, go get that from the table. Those round trips are harder to reason about, I think. Cool. Oh yeah, one more: is early materialization more resource efficient because, when you're performing the join, you're fetching all the pages sequentially anyway, and if you materialize right after the join you can take advantage of the buffer pool manager while the data is still fresh and get more locality? Right.
Yeah, so the statement is that early materialization may benefit from locality: if you're ripping through all the data already, then when you need it further up the query plan it's possibly still in the buffer pool, or the OS page cache if someone still uses that (Postgres is about the only one left). Yes, you would benefit from caching there as well.

The other thing we talked about with these design decisions is how we reason about costs, and I alluded to this earlier: we're mostly going to focus on the I/O cost of these joins and not worry as much about the compute. We keep coming back to this setup where there's a table R and a table S being joined in this query. I'll say there are M pages in R with m tuples, and N pages in S with n tuples, and throughout this lecture we're going to cost these algorithms in terms of M and N pages and m and n tuples. Like I said, we're going to ignore the compute cost. That's a controversial statement to some people, because from a compute standpoint some of these algorithms are O(m·n) and others are O(m+n), but in reality I/O is still the long pole in the tent for most of what we're doing in these database systems. If you're building an in-memory database system you might make different design decisions, but here we're really trying to reduce I/O in these disk-based systems.

Just as a brief aside, there's a very naive way to do a join, which is a cross product or Cartesian product. The SQL standard and some database systems offer something called a CROSS JOIN, where you create the Cartesian product of two relations, and you could, if you really wanted to, implement a join that way: create this gigantic Cartesian product of the two tables and then filter down to the tuples you actually want. In practice that's a terrible idea, and to be honest I don't know why anyone would ever use a cross join, but it exists. All you can really do is write two for loops over the two tables, and it's going to be wildly inefficient, unless you were specifically doing something like testing that needed a Cartesian product of your data.

Here's a brief overview of the algorithms we're going to look at today. There are three classes. The first is the nested loop join, then the sort-merge join, which is related to the external merge sort algorithm you looked at with Jignesh on Monday, and then what's probably the most important one, the hash join, where we'll spend a decent amount of time. In practice the hash join is generally going to be the fastest, particularly for analytical systems. OLTP systems, these transactional systems, are typically not doing gigantic joins, so they'll do something simpler like an index nested loop join, and it may be the case that if you have an ORDER BY in your query you may want to do the sort-merge join.
Right, I forgot to warn you guys about that. They're testing an emergency alert system today at 2:20, so it must be 2:20. They're a little early. That is quite annoying. All right, we've got a couple more. Still beeping. Oh god, they're still going. All right, we're good. I forgot to warn you; I knew that was coming. There's always one.

So the big takeaway here is there's no one-size-fits-all solution in database systems. Depending on the task at hand, the query, your data distribution, and your system design, different joins are going to make sense at different times. If your query has an ORDER BY you might want to do the sort-merge join, but like I said, OLTP likes to use nested loop joins, and you can get pretty far with just a nested loop join for a transactional system. MySQL didn't get a hash join until around 2019 because they could just do nested loop joins, or, I have a theory I'll discuss later with the sort-merge join; Andy hasn't confirmed it for me, but I think it's true, so we'll just take it as gospel.

We're going to start with the naive nested loop join. This will give you a better idea of what I meant earlier about the outer table and the inner table and where those names come from. If you were naively trying to design a join algorithm where, for every tuple in R and every tuple in S, you just check whether their IDs match, the simplest thing you could do is write a for loop over the outer relation, and for each of its tuples compare it against every single tuple in the inner table, emitting matches as you go. That's where the names outer and inner table come from. You'll sometimes hear them called the left and right table: the left table is usually the outer table and the right table is the inner table, which comes from the query plan view we usually draw. When we talk about optimizers later, I think they reason more about left and right trees and use that terminology a lot more, whereas people living in the operator world tend to think about outer and inner tables.

It should seem obvious, but maybe not, that this algorithm is bad because we're doing a ton of work. We have to scan every page of R, so our cost up front is M, there's no getting around that, but then for every tuple in R we have to look at every page of S. So the cost is M + (m × N), which is astronomical unless these tables are very, very small. For example, let's put some numbers on this: table R has 1,000 pages with 100,000 tuples, and table S has 500 pages with 40,000 tuples. That's going to cost us roughly 50 million I/Os to do just a simple nested loop join.
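As a minimal sketch of the naive algorithm just described, here are the two nested for loops plus the I/O cost in the M, m, N, n notation; the dict-style tuples and the id column are made-up placeholders.

```python
def naive_nested_loop_join(R, S):
    # For every tuple in the outer table R, scan every tuple of the inner table S.
    for r in R:
        for s in S:
            if r["id"] == s["id"]:
                yield (r, s)

# I/O cost in pages: M + (m * N). We read every page of R once, and for each
# of the m tuples in R we rescan all N pages of S.
M, m = 1_000, 100_000   # pages and tuples in R (the running example)
N, n = 500, 40_000      # pages and tuples in S
print(M + m * N)        # 50,001,000 I/Os -- the "roughly 50 million" above
```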
As a bit of a straw man, if you assume 0.1 milliseconds per I/O, that's going to take about 1.3 hours. That's ignoring caching entirely: no OS page cache, no CPU cache, just going to disk for every single page. And if you switch the order of the tables, you get about a twenty percent savings in I/Os and execution time. That's a brief tease of why query optimization is going to be important later, because even something as simple as getting the join order right can make a big difference in execution time. Oh yeah, one last thing: these numbers are actually quite small. With four-kilobyte pages this is only about six megabytes of data, which would fit in your L3 cache, and you'd be able to rip through it very quickly with a nested loop join. So maybe you could get away with a simple nested loop join if you know your tables are very, very small, and tables that small do exist in practice. But in general we consider this bad in the common case.

So how can we do better? We'll use locality, which is pretty common when we're designing computer software systems. Instead of iterating over all of S for every single individual tuple in R, we're only going to do it for each page of R. That reduces the cost fairly significantly. We still have to pay the upfront cost of going through all of R, there's no getting around that, but instead of m × N, every tuple looking at every page of S, we only pay it for every page of R, which saves a bunch of I/Os. When the optimizer is choosing the join order here, ideally we want the smaller table to be the outer table, and we determine that based on the number of pages, not the number of tuples. The number of tuples doesn't necessarily relate to the number of disk I/Os; we're more worried about the number of pages we have to fetch from disk.

Now imagine you have a buffer pool with B buffers available. We're going to use B−2 buffers for the outer table, because we use one buffer for the output of the join and one buffer to stream the inner table. Does that make sense? Basically we try to use as many buffers in the buffer pool as possible for the outer table, keep two set aside for the join output and for churning through S, and that gives us the I/O cost: this is the case where the tables don't fit in memory and we have to rely on the buffer pool, so we can only hold M/(B−2) pages of the outer table at a time, and we multiply that by N. If the outer table does fit in memory, the cost is just M + N = 1,500 I/Os, which is only about 0.15 seconds; I think we were over an hour under the straw man argument before, so it's a dramatic drop in I/Os.
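Here's a rough sketch of the block-at-a-time variant under the same assumptions, with B−2 buffers' worth of outer pages held at once; the page lists are stand-ins for real buffer pool frames.

```python
from math import ceil

def block_nested_loop_join(R_pages, S_pages, B):
    # B-2 buffers hold a block of outer pages, one buffer streams the inner
    # table one page at a time, and one buffer holds the join output.
    block_size = B - 2
    for i in range(0, len(R_pages), block_size):
        r_block = R_pages[i:i + block_size]          # one block of outer pages
        for s_page in S_pages:                       # stream the whole inner table
            for r in (t for page in r_block for t in page):
                for s in s_page:
                    if r["id"] == s["id"]:
                        yield (r, s)

# I/O cost: M + ceil(M / (B - 2)) * N
M, N, B = 1_000, 500, 102
print(M + ceil(M / (B - 2)) * N)   # 6,000 I/Os with 102 buffer pages
```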
My clicker is not behaving. There we go. Then, if it doesn't fit in memory and we rely on a buffer pool that has 102 buffer pages, which I guess was picked to make the math easier since two of them are reserved for the inner table and the output, you get 6,000 I/Os, and if the optimizer were to switch the join order, you get 5,500 I/Os. Still, the nested loop join is kind of just brute force.

Question: would you end up swapping back and forth if the inner table is the one you tried to fit into memory, and when you bring the next block of outer pages into the buffer pool, do you get rid of the previous block as soon as you're done with it? Yeah, but we can coordinate that ourselves, by pinning it I guess, because the inner table is always just going to be streaming through your buffer pool over and over again. Your inner loop is the giant one, so you have to keep replacing those pages because you can't fit all of the inner table in memory; you're going to thrash on it. Whereas for the outer block there's less chance of thrashing, because for each outer loop iteration you're going through that many inner pages. Makes sense?

So yeah, like I was saying, these nested loop joins are basically brute force: we sequentially scan the outer table, and the inner table over and over again. We don't have a choice if we don't know anything about the data, since there's no ordering under the relational model. But if there's an index, we can use it to help us out here. If we're lucky and there's an index we can choose, we can do an index nested loop join instead. What does that look like? The clicker is going to give me grief again. In this case, for each tuple in R, and let's get used to R being the outer loop since there's no getting around looking at each of its tuples, instead of looping through and doing a sequential scan of the inner table, we just do an index probe, assuming we have an index on S.id, which hopefully would be the primary key on that table or something like it. If there's a match in the index, we emit that tuple. So we've removed the repeated loop over the inner table; we just do index probes for each outer tuple.

We hand-wave the cost of an index probe as some arbitrary constant C, because we don't actually know what it is. It depends on the type of index: with a hash index it's closer to constant time, whereas with a B+tree it's logarithmic. Also, I said "hopefully the primary key," but this can't be a unique index, because I see duplicate values in the ID column, so it's not the primary key and there can be duplicates in this index. So it's not exactly some beautiful data-structure-specific bound on how long the operation takes, because we don't know the data distribution inside the index, but we hand-wave it as some constant C. The cost then reduces to M, since we still have to look at everything in R, plus the number of tuples in R times that index lookup: M + (m × C).
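And a sketch of the index variant, where the important point is that the index on the inner table's join key already exists; the dict-of-lists here is just a stand-in for whatever B+tree or hash index the system actually has.

```python
def index_nested_loop_join(R, S_index):
    # S_index maps join-key value -> list of matching S tuples. It stands in
    # for an index that already exists on S.id; the join does NOT build it.
    for r in R:
        for s in S_index.get(r["id"], []):   # one index probe per outer tuple
            yield (r, s)

# Toy stand-in for the pre-existing index (duplicate keys are allowed):
S_index = {100: [{"id": 100}], 200: [{"id": 200}, {"id": 200}]}

# I/O cost: M + (m * C), where C is the hand-waved cost of one index probe
# (roughly constant for a hash index, logarithmic for a B+tree).
```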
Question: does this only work if you're joining on primary keys? It would have to be some key with an index on it, but no, it doesn't have to be a primary key; you can have secondary indexes. It just has to be something that already has an index. It doesn't matter whether it's unique or what the constraint is, but if there's an index, the optimizer will try to choose this. And like I mentioned, OLTP systems mostly try to do index nested loop joins if they can. In most transactional systems, if you see your queries doing sequential scans, that's a hint that it's something you want to build an index on, and your transactions will run significantly faster.

Question: in the case of a hash join, is the DBMS just building an index on the inner table on the fly? That's a great question: in a hash join, is it basically building a hash index on the fly? Yeah, this is foreshadowing how a hash join works, because it's basically an index join that builds a hash index on the fly. And I think SQL Server can also build indexes on the fly if it would benefit from this sort of situation; I think it's called an index spool, where, as you alluded to, a hash join builds the data structure and then probably throws it away, but a system can be smart and keep that around if it thinks it will be useful for future queries. But yes, this foreshadows what happens in a hash join.

Some takeaways from the nested loop join: pick the smaller table to be your outer table, try to get as much of it into memory as possible, and when you have to loop over the inner table, ideally you have an index on the attribute you're using as the join key, because that saves you a whole bunch of time. We looked at the naive version, just two for loops, which is where the inner and outer table names come from; the block nested loop join, which benefits from locality; and the index nested loop join.

Question: if we already have a B+tree index, do we create a hash index if we want to do the index nested loop join? So, the C is some vague constant, but it doesn't necessarily mean the index lookup is constant time; I think Andy is just doing the heavy lifting of saying there's some constant. If it's a B+tree it's actually, like you said, logarithmic. And I don't want to confuse the issue when I said this foreshadows the hash join: we don't build an index for an index nested loop join, we rely on an index that already exists. If there's a B+tree there already, that's what we'll use; the hope is the optimizer wouldn't then build a hash index on top of it. I'm trying to think of any scenario where you would want to do a hash join when a B+tree index is already there, and I can't think of one, because the B+tree also gives you a nice ordering, which you may or may not need in the query results. Makes sense?
Yeah, so like I said, those are our takeaways from the nested loop join and the different algorithms within it. They're not always a terrible idea: for transactional systems they make sense, and if the tables are really, really small, just do a nested loop join.

So let's talk about the sort-merge join. The basic idea is there are two phases. In the first, you sort both tables. You could use an algorithm like the external merge sort that Jignesh talked about on Monday, but all we really care about today is that these two tables end up sorted; use your favorite sorting algorithm, we're not worrying much today about how that happens, that's what Monday was for. In the second phase, you have these two sorted tables, you create a cursor into each one, and you walk them in order looking for matches. The nice thing the ordering gives you is that you don't have to start all the way back at the beginning of the inner table for every tuple of the outer table. The ordering hopefully guarantees you don't need to backtrack very far. There are degenerate cases where you have to backtrack all the way to the beginning every single time and this falls apart into a nested loop join, but we can talk about that in a minute.

I hate seeing code, or pseudocode, on a projector, but we'll step through this briefly. Step one, like I said, we sort the two tables R and S. We create cursors, and we'll step through an example in a moment, so don't worry about grokking all of this right now. You get a cursor at the top of each sorted relation, and you advance the cursors by comparing the join keys for less than or greater than. If you have a match as you iterate through, you emit it, and there are scenarios where you need to backtrack, which we'll show in a minute. Again, this is more for your reference when you're studying or need to understand how sort-merge join works.

Let's go through an example, because I think it's more helpful. Once again we have tables R and S. Step one, sort them. Beautiful. This clicker has given me so much grief today. Initialize cursors to the beginning of our sorted relations and do a comparison: does S.id match R.id? It does, so emit that tuple and go to the next tuple in S. We're still working with this outer table and inner table notion, and you'll see how the ordering reduces how far we have to backtrack; we don't have to start all the way at the beginning. We move on and do another comparison; that's also a match, so we emit that tuple, beautiful, and advance the cursor to the next one. Now we see that S.id is greater than R.id, so we need to advance the cursor in the outer table. We do another comparison, see a match, emit that tuple, and advance the cursor again. Once again S.id is now greater than R.id, so we advance the cursor in table R. And this is a scenario where we have to backtrack: we've advanced the cursor in R, and we're now less than the inner table's current value, so we have to back the cursor up in the inner table. But unlike a nested loop join, we don't have to go all the way to the beginning; we're relying on this ordering
to not have to go back as far. We go back to 200, do a comparison, it's a match, and we emit that tuple. We advance the inner cursor again; we're now less than 400, so we advance the outer cursor, and we don't have to backtrack on the inner table this time. At the outer cursor, R.id is still less than S.id, so we advance the cursor again, do the comparison, output that tuple, and advance the inner cursor. We advance the inner cursor again, do another comparison, R.id is less than S.id, so we advance the outer cursor, we get a match, we output that one, and we're at the end of S.

Question: are we backtracking to the very first occurrence? For example, there's only one 200 here, but if there were multiple 200s in the inner table, would you backtrack to the first 200? Yes, because you now have to compare this second 200 in the outer relation against the inner relation. Make sense? And yes, you backtrack to the first occurrence of the outer cursor's value; that's basically how far you have to backtrack, to anything that could possibly still satisfy the equality predicate.

Question: how are you going to store that? You could basically memoize it: when I'm at a new value at the outer cursor, I keep track of where the first match for it was in the inner table, so the next time I have to backtrack I jump right back to that position. You could hold a record ID in a variable, just a 32-bit integer or something, that knows how far I have to backtrack for the current cursor. So you store the index, the tuple ID or record ID, on the inner table for how far you need to backtrack for the current outer key. Does that make sense? Cool. Did that check out for everyone else too? I'm not making that up. Okay, you're going to have to call me on this; I've never given this lecture, so if something doesn't pass the smell test, tell me, even twenty minutes later.

Question: you only backtrack when it's equal, right? Yeah, so in this example nothing satisfied the join predicate for 300, so we wouldn't have cached anything to jump back to; there was no remembered position for it in the inner table. That's probably an optimization. You wanted to go back to the pseudocode? One second. Okay, so: in the scenario where you have to increment the outer table's cursor, you potentially need to know how far to backtrack on the inner table, because you may have advanced past tuples that would still satisfy the join predicate.

Cool, so how do we cost this thing? The sort cost for R and S comes from the lecture on Monday, so I'm not going to discuss that math, because math is scary, and the merge cost is just looking at every single page in both relations, so it's M + N. The long pole in the tent here is typically going to be the sorting process.
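Here's a rough Python sketch of the merge phase we just walked through, including backtracking the inner cursor for runs of duplicate keys; Python's built-in sort stands in for the external merge sort from Monday, and the key function and dict-style tuples are placeholders.

```python
def sort_merge_join(R, S, key=lambda t: t["id"]):
    R = sorted(R, key=key)      # stand-in for external merge sort of each input
    S = sorted(S, key=key)
    i = j = 0
    while i < len(R) and j < len(S):
        if key(R[i]) < key(S[j]):
            i += 1                              # advance the outer cursor
        elif key(R[i]) > key(S[j]):
            j += 1                              # advance the inner cursor
        else:
            # Remember where this run of equal inner keys starts, so the next
            # outer tuple with the same key can backtrack to it.
            mark = j
            while i < len(R) and key(R[i]) == key(S[mark]):
                j = mark                        # backtrack the inner cursor
                while j < len(S) and key(R[i]) == key(S[j]):
                    yield (R[i], S[j])
                    j += 1
                i += 1
# Merge cost: M + N page reads, on top of the cost of sorting both inputs.
```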
Let's get the clicker going again. So again, here are the hard numbers; I think the table sizes have stayed the same as before, and if they haven't, that makes this difficult. The sort cost is 4,000 I/Os for R and 2,000 I/Os for S, the merge cost is only 1,500, and adding those up gives 7,500 I/Os. With the same straw-man cost of 0.1 milliseconds per disk access, it takes less than a second to perform the sort-merge join.

I mentioned this before: there's a degenerate case for the sort-merge join. What if the join attribute has the exact same value in every tuple? You pay the cost to sort it all, and then you still have to backtrack on the inner table in every single iteration, so this devolves into a nested loop join, except you also paid to sort first. In practice this shouldn't happen: it means someone made an attribute column on a big table that only has one value, wanted to join on it, and did a bad job. There are also other things in database systems that help us out here, like zone maps and other statistics about the distribution and cardinality of individual columns, which might tell us, hey, don't sort on this column, it's just a billion ones or something like that. Database systems usually have things to stop people from doing something that crazy.

So when is the sort-merge join useful? It would be great if one or both of the tables is already sorted on the join key; that would be beautiful. Or it may be the case that you want your output ordered anyway, say for an ORDER BY clause, and the database system says, well, I'm going to have to sort this anyway, I might as well go for a sort-merge join. And the input doesn't necessarily have to be sorted first: an operation below, like an index scan over a B+tree that provides ordering, could be producing the inputs. They may not be base tables underneath; they may be index scans or something like that producing ordered data into the operator, and then the optimizer knows it can use a sort-merge join because its inputs are already going to be sorted, so it doesn't have to pay the sort cost.
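Putting numbers on that: the following assumes roughly 100 buffer pages, so each external sort needs one merge pass, which is consistent with the totals on the slide, though the exact buffer count and cost formula here are my assumptions rather than something stated in the lecture.

```python
from math import ceil, log

def external_sort_cost(pages, B):
    # 2 * pages of I/O per pass; pass 0 makes ceil(pages / B) sorted runs, and
    # each later pass merges B - 1 runs at a time.
    runs = ceil(pages / B)
    passes = 1 + ceil(log(runs, B - 1)) if runs > 1 else 1
    return 2 * pages * passes

M, N, B = 1_000, 500, 100
sort_cost = external_sort_cost(M, B) + external_sort_cost(N, B)   # 4,000 + 2,000
merge_cost = M + N                                                # 1,500
print(sort_cost + merge_cost)                                     # 7,500 I/Os
```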
Did you have a question? Right, yeah, so sometimes the optimizer knows the stuff coming in is already ordered, and this is what I alluded to earlier, the theory Andy hasn't actually confirmed for me. MySQL, I mentioned, got the hash join in 2019, which sounds like it caught a disease: they added hash join support to MySQL as late as 2019. MySQL's storage engine is InnoDB, and it always clusters its data on the primary key. So if tables are commonly being joined on the primary key, well, one, you'd probably have an index anyway, and two, InnoDB data is already sorted, it's already basically clustered. So they may have decided that for a long time MySQL could just rely on index and sort-merge joins and didn't really need a hash join, but eventually they came along and added it.

Which gets me to the hash join, the big one. The basic property we rely on here is that if two tuples satisfy the join predicate, then, assuming we picked a reasonable hash function, hashing their join values produces the same hash value, so we can rely on a hash table to make our lookups more efficient, and I'll show you how that happens. We're basically going to build a hash table on the fly over the outer table, then do lookups from the inner table using that same hash function, and whenever there's a match we emit a tuple. There's an allusion here to partitioning and such; we'll get to that when we talk about what happens if buckets overflow or the hash table doesn't fit in memory. But the basic idea is you build a hash table on one of the tables, and on the other table you just scan along, apply the hash function, and if you get a match you emit that tuple.

There are two phases. In the build phase, which I described briefly a moment ago, you scan the outer table, apply some hash function, and build a hash table; choose whatever hash table you want, and I think you've discussed hashing already, but in this case we generally want linear probing. Then in the probe phase you go along the inner relation, hash each tuple on the join key, and anywhere you get a match in the hash table you emit that tuple into the query plan. So what does this look like? Step one, build the hash table, and then probe it for each tuple in S. I think I've said this a few times now, but I like to see it with the graphics: we build the hash table by scanning all the way through R using hash function h1, then we apply the same hash function to the second relation, and anywhere there's a match we emit that tuple.

So what is in this hash table? One, you need the key; you have to keep the join key around. You can't build the hash table and just keep a record ID sitting there, because there may be a collision with this hash function, and you still have to compare whether it was a real match. Just because two values hashed to the same slot in the hash table, you still have to do the key comparison to make sure it wasn't a spurious collision, or a tuple that got moved around the hash table by linear probing. Then this gets to the discussion earlier about materialization: do you put the value in the hash table with it, or do some people put just a record ID? Like all things in systems design, it depends on what works best for the system you're designing: whether you want to materialize the data and keep it in the hash table, so that becomes the canonical copy you go back to later, or whether you just want an entry that points into a different memory buffer.
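A minimal sketch of those two phases for the case where the hash table fits in memory; Python's dict stands in for whatever hash table and hash function the system really uses, and storing the whole tuple versus just a record ID is exactly the materialization choice just mentioned.

```python
def hash_join(R, S, key=lambda t: t["id"]):
    # Build phase: hash every tuple of the outer table R into a hash table on
    # the join key. Here we store the whole tuple (early materialization);
    # a system might store just a record ID instead.
    table = {}
    for r in R:
        table.setdefault(key(r), []).append(r)

    # Probe phase: hash each tuple of the inner table S with the same hash
    # function and emit an output tuple for every real key match. (Python's
    # dict does the "real match or just a collision?" key comparison for us;
    # an open-addressing table would re-check the stored key itself.)
    for s in S:
        for r in table.get(key(s), []):
            yield (r, s)
```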
So now we have a brief discussion of what's called a probe filter, which relies on a data structure called a Bloom filter. There's a much longer example in the fall 2022 version of this lecture, so if you really want to see how a Bloom filter works, down to the bit logic, look there; we've pruned it for time, but they're pretty interesting data structures. The basic idea with this probe filter is that before you do the hash table lookup, you check this data structure to see whether the key is going to be present in the hash table. Because it's typically a much smaller data structure, these Bloom filters can usually fit in a CPU cache, so it's typically much more efficient to do a lookup in the Bloom filter first than to go to the hash table. Now you might wonder, why wouldn't you just use a Bloom filter instead of a hash table? Because these are probabilistic data structures: they can never give a false negative, they'll never say something is not in the set when it is, but they can give a false positive and say yes, this key is in the set, and then you have to go to the hash table, which is your ground truth for whether it actually satisfies the join predicate. But in practice these Bloom filters can save a lot of hash table lookups.

What does that actually look like? We're joining A and B, for some reason, not R and S. While we're building our hash table on A we also build a Bloom filter, because we have to scan the data anyway. Then in the probe phase we first look in the Bloom filter for the key attribute in B. If it doesn't pass the Bloom filter, you move on to the next tuple; only if it does do you go to the hash table. The reason this is helpful, again, is that in a disk-based system these hash tables may not sit in memory; they may have to spill to disk, which we'll talk about very shortly, and that also motivates why something like a Bloom filter is useful for pruning hash table lookups ahead of time.
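Here's a tiny illustration of putting a Bloom filter in front of the probe; the two hash functions and the 1,024-bit size are arbitrary choices for the sketch, not what any real system does.

```python
class TinyBloomFilter:
    def __init__(self, num_bits=1024):
        self.num_bits = num_bits
        self.bits = 0
    def _positions(self, key):
        # Two cheap, made-up hash functions, purely for illustration.
        yield hash(("h1", key)) % self.num_bits
        yield hash(("h2", key)) % self.num_bits
    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p
    def might_contain(self, key):
        # False means definitely absent; True only means "probably present".
        return all(self.bits & (1 << p) for p in self._positions(key))

def hash_join_with_bloom(A, B, key=lambda t: t["id"]):
    table, bloom = {}, TinyBloomFilter()
    for a in A:                              # build phase: hash table + filter
        table.setdefault(key(a), []).append(a)
        bloom.add(key(a))
    for b in B:                              # probe phase
        if not bloom.might_contain(key(b)):
            continue                         # skip the (possibly spilled) table
        for a in table.get(key(b), []):      # the hash table is ground truth
            yield (a, b)
```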
Clicker, there we go. So, as I alluded to, what happens if you don't have enough memory for your entire hash table? Hash tables are... oh, sorry, go ahead.

Question: do you still use the Bloom filter if everything fits in memory? Generally, yes, you're still going to want to do the Bloom filter lookup first, because it has a better chance of staying warm in your CPU cache. Depending on your hash table implementation, hash tables are not as cache friendly; you have to jump around through pointers, especially with linear probing and such. So we would usually do the Bloom filter lookup first, and I think even in-memory systems often rely on Bloom filters before they do a hash lookup.

Question: is it one Bloom filter for the entire table? If you're doing it the way it's described on the previous slide, where it's built on the fly, you're just building this Bloom filter on the join attribute, whatever that predicate is. In practice, systems also keep these Bloom filters around as supplemental data structures, building them as they go, similar to the zone maps I mentioned before: an optimization where the database system builds them on the fly to help with other query execution, sort of like an index as well. But for this example, where we're just talking about using a Bloom filter as a filter in a hash join, you build it dynamically on just the join attribute and then discard it. And then I'll come back to you, yep.

Question: we talked about index joins before; if we use a hash index, is that different from this? Conceptually, not really; it just depends on when the hash table is built. What I'm describing here is an ephemeral hash table that's used for query execution and then generally discarded, whereas with the index join you're relying on a persistent data structure that can be reused over and over again across multiple queries.

A follow-up on the Bloom filter: if I remember correctly, if you build one Bloom filter over the attribute values of all the rows in a relation, the probability of a false positive is higher than if you had two or three Bloom filters each built on a subset of the rows; is that a viable option? So the statement and question is that, from your recollection, instead of building one very large, wide Bloom filter that incorporates all the tuples, it's better to layer Bloom filters over subsets. In practice Bloom filters are often multi-layered in how they handle this. I mentioned that we talk about Bloom filters at more length in last year's version of this lecture; there's also a link to a Bloom filter calculator that tells you, for a given number of keys and key width, how big the Bloom filter needs to be to get some bound on false positives, and you can play with the numbers, or just Google "bloom filter calculator." Because they're probabilistic data structures there are trade-offs: you trade space against false positives. Cool. Yeah, Bloom filters are cool.

As an aside, there was a research project here with one of Andy's students about six or seven years ago called SuRF, Succinct Range Filters, which is a very similar concept. It's a probabilistic data structure, but where Bloom filters only let you test whether something is or isn't in a set, SuRF is basically a Bloom filter for range queries. It's been pretty widely adopted in quite a few systems since the paper was published a few years ago, so if you're interested in Bloom filters, go read the SuRF paper.

So, as I mentioned: what happens if your hash table doesn't fit in memory? That's where something called the partitioned hash join comes in. We're going to partition the tables first and then probe within each partition. I don't know how to describe this without just using the word partition again, but we're basically going to segment these relations into smaller chunks that let us keep the information we need in memory.
You'll sometimes hear this referred to as the Grace hash join; it came from a project out of the University of Tokyo in the 80s that was something called a database machine, which I guess is a link on the slide, I don't know where it goes. It's this idea that you had very special-purpose computers designed just for database processing: they would design hardware specifically to compute things like joins. Ultimately what killed it in the 80s was Moore's law. You would spend a year, two years, three years designing this specialized system to process your database queries, and by then Intel had put out the next generation of x86, particularly once you get to the 90s, and it crushed you in performance, because commodity CPUs were still getting so much faster every year or two. So that killed the database machine idea. Here's an example of one that Andy knows of, again from the 80s, from Britton Lee I think, and I don't know why Andy finds it funny that I'm showing the dude just sitting there in a tie using a database. But like a lot of things in database systems and computer software in general, the idea came around again, and you had companies like Teradata and Netezza, and probably the most well-known database machine was Oracle's Exadata: you could buy these million-dollar racks full of very specialized hardware, and what made them database machines was that they put specific hardware in there for query processing, things like FPGAs and specialized CPUs. What took the steam out of that wave, I think, was the cloud era. There was a brief window where this seemed like a great idea: the death of Moore's law, reduced CPU gains year over year, so people said specialized hardware is the solution. But then people got less interested in specialized hardware and running on-prem; they wanted to run on commodity hardware in the cloud. There are still highly specialized systems: Yellowbrick exists, and we had a seminar series talk from them a year or two ago on the YouTube channel, which is really interesting, about how they built a specialized database system and have now deployed it in a cloud setting. It's bordering on a unikernel design, where they basically boot the operating system, never make another system call again, and the database system owns everything. It's a pretty cool system, so if that's interesting, there's a seminar talk on the database group's YouTube channel.

Back to the partitioned hash join; this slide is exactly the same. The basic idea is we hash the outer table R into a sequence of buckets, and we use the same hash function on the inner relation S to create buckets of its own. What you can do now is spill these buckets to disk, because we're using the same hash function: keys from table R that hash into this first bucket only have to be compared one bucket at a time against the corresponding bucket from S, so we don't have to keep everything in memory while we do the comparison. You just go down through these buckets, do the comparisons, and emit the matching tuples; you never have to keep the entire hash table in memory. Within a bucket, you could almost do a nested loop join, depending on how big the buckets are, so you don't actually need to hash again, but if you were hashing within a bucket you would want to use a different hash function.
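A rough sketch of the partition and probe phases, assuming one level of partitioning and keeping the "spilled" partitions in Python lists rather than actual files on disk.

```python
def grace_hash_join(R, S, num_partitions=8, key=lambda t: t["id"]):
    # Partition phase: apply the same hash function h1 to both tables, so
    # matching keys always land in the same partition number. In a real system
    # each partition is written out to disk, not kept in a Python list.
    def h1(k):
        return hash(("h1", k)) % num_partitions

    r_parts = [[] for _ in range(num_partitions)]
    s_parts = [[] for _ in range(num_partitions)]
    for r in R:
        r_parts[h1(key(r))].append(r)
    for s in S:
        s_parts[h1(key(s))].append(s)

    # Probe phase: join one pair of partitions at a time, each hopefully small
    # enough to fit in memory, using an in-memory hash table (here a dict
    # standing in for a table built with a second hash function h2).
    for r_part, s_part in zip(r_parts, s_parts):
        table = {}
        for r in r_part:
            table.setdefault(key(r), []).append(r)
        for s in s_part:
            for r in table.get(key(s), []):
                yield (r, s)
```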
So there are a few edge cases here. What if a partition doesn't fit in memory? Then you have to recursively partition it: you choose a different hash function for that specific bucket, create more buckets, and hash into those. And there's the case where a single join key has a huge number of matching records, the degenerate case I mentioned before with the sort-merge join. The hash join isn't really going to help you there, because everything hashes into the same bucket, and when you do secondary hashing on all of it, it's probably all going to hash into the same bucket again. So there are degenerate cases you have to reason about with a hash join too.

What does recursive partitioning look like? In this case we have the first hash function on R, and a bunch of tuples hash into this bucket, which is a problem because bucket one is now full. So, there we go, we throw a second hash function at bucket one, and you end up with buckets one-prime, one-double-prime, one-triple-prime. Then you go hash table S, and the important thing to note is that even if bucket one wasn't going to spill for the inner table, you still have to do the double hashing, to make sure tuples from the inner relation get checked against the tuples that were re-hashed with the second hash function after spilling. If you only hashed once, you wouldn't necessarily be able to look at all the data you needed to, and you don't want to accidentally skip anything.

So what's the cost of a hash join in the great case, where you don't need recursive partitioning? It's just three times looking at every page: in the partition phase we read and write both tables, so that's M + N I/Os twice, once for the read and once for the write, and in the probe phase we go through each table one more time. Interesting, I was just thinking it's interesting that we actually reason about writes in this join algorithm, when we generally haven't really thought about writes or outputs. So again, putting the same numbers on it, the cost analysis comes out to 4,500 I/Os, which is about 0.45 seconds.
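And the cost arithmetic for the case with no recursive partitioning, using the same running-example sizes and the same assumed 0.1 ms per I/O.

```python
M, N = 1_000, 500
partition_cost = 2 * (M + N)     # read and write both tables once: 3,000 I/Os
probe_cost = M + N               # read both sets of partitions back: 1,500 I/Os
total = partition_cost + probe_cost
print(total, "I/Os,", total / 10_000, "seconds at 0.1 ms per I/O")   # 4500, 0.45
```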
Yeah, so the question is what order you do this in, because in the second hash you may have spilling as well. Right, and in practice, I don't know exactly what a real system would do here. In practice you would very, very rarely ever go beyond two rounds; it just doesn't really seem to happen. But it's a good question, and these degenerate cases are good things to think about when you're actually implementing this stuff. Andy can bleep this out. Right, so his statement is that the whole reason we're doing a partitioned hash join in the first place is that the outer relation's hash table didn't fit in memory, so we're going to have to spill that, and while we're building the inner table's buckets, those are going to have to spill as well. Yeah, so as you're streaming it, you're reading one page at a time and writing one page at a time; that's how I think about it. Is that correct? Yeah. In the absence of any constant here for the buffer pool size, some constant B we introduce and reason about, I think the notion of just streaming a page at a time from each table is the right way to think about these algorithms. It is a little bizarre that sometimes we talk about what this would do with a buffer pool and what it wouldn't, but when we're reasoning about these algorithms we're generally assuming there is no buffer pool and just asking what the I/O cost would be.

So there's an optimization we can do in the case of very skewed keys. What time is it? Right, we have 11 minutes, okay. If your keys are very skewed and you notice a lot of tuples going into the same buckets, you can have this notion of a hot partition, where you basically say, okay, these keys keep hashing over and over again into this same partition, so, similar to the notion of pinning in a buffer pool, those partitions should stay in memory, and you do the comparisons immediately. You don't worry about separate build and probe phases; as you're doing the build, you can do some of the comparisons on the fly. In practice it's not really done, it's pretty difficult to do, but you will sometimes see the notion of a hybrid hash join in the literature, so that's why we're presenting it as something that could exist. In the case of highly skewed data, it's an alternative to spilling those first buckets and creating ever-growing buckets: you just keep the hot ones in memory and start doing the comparisons on the fly.

So, to wrap up on hash joins: the inner table can be any size, and ideally we would love for the outer table to fit in memory; that would be the best scenario. This is another example of something the optimizer needs to try to get right, and table size usually isn't that hard to reason about, so it usually can get it right. And if we know the size of that outer table, we can use something like a static hash table and not worry about resizing it when it reaches a certain size. We can basically preallocate the buffers we need: we know, based on the table size, that I'm going to need a hash table this big, and we don't have to constantly rehash. Otherwise we have to use dynamic hash tables, or allow things to spill into overflow pages of the hash table.
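As a rough illustration of that last point, here's what pre-sizing a static hash table from a known outer-table cardinality could look like. The open-addressing scheme and the fill factor are made-up choices for the sketch, not what any particular system does:

```python
# Sketch of preallocating a static, open-addressing hash table when the
# optimizer already knows (or can estimate) the outer table's cardinality.
# Because the table never grows, we never pay for rehashing mid-build.

class StaticHashTable:
    def __init__(self, expected_tuples, fill_factor=0.5):
        # Size once, up front, based on the estimate; a real system would
        # round to a power of two and allocate buffer-pool pages here.
        self.capacity = max(1, int(expected_tuples / fill_factor))
        self.slots = [None] * self.capacity

    def _probe(self, key):
        idx = hash(key) % self.capacity
        while self.slots[idx] is not None and self.slots[idx][0] != key:
            idx = (idx + 1) % self.capacity   # linear probing
        return idx

    def insert(self, key, value):
        idx = self._probe(key)
        if self.slots[idx] is None:
            self.slots[idx] = (key, [value])
        else:
            self.slots[idx][1].append(value)

    def lookup(self, key):
        slot = self.slots[self._probe(key)]
        return slot[1] if slot else []

# Build on the (smaller) outer table, then probe with keys from the inner table.
outer = [(1, 'a'), (2, 'b')]
ht = StaticHashTable(expected_tuples=len(outer))
for tup in outer:
    ht.insert(tup[0], tup)
print(ht.lookup(2))   # [(2, 'b')]
```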
Yeah? So, I'm having a hard time thinking about this end to end. Suppose we get a new query and we want to do a join now. Do we have a buffer pool preallocated for join queries like these, or do we try to fit it in a larger buffer pool, or how does that work? Right, so his question is: if a query comes in and I need space in my buffer pool to run one of these hash joins, is that memory reserved for me, or how does that work? Generally it's a knob you can tune in these database systems, how much space you set aside as working memory for things like hash tables; Postgres's work_mem setting is roughly that kind of knob, for instance. Other systems may allocate it on the side, like a separate buffer pool just for hash joins, because based on the concurrency in the system they can reason about how much contention there would be for that data structure. But it's a system design, implementation decision. It's always a good thing to be thinking about, though.

Yep? So how come we call the table that the hash table is built on the outer table here, when for an index join the one with the hash table would be the inner table? So the question is why, for a hash join, we call the table the hash table is built on the outer table, whereas when I was talking about an index join, the inner table is the one that would maybe have the hash table on it. First, I don't want you to get too fixated on the inner table of an index join being a hash table or hash index; in practice it's probably going to be a B+tree, so I don't want to confuse those two ideas for you. For the hash join, the outer table is the smaller one, so the hash table is smaller and more likely to fit into memory. With an index join, the inner table is the larger one, so you want to use the index to speed up each lookup per outer tuple. For example, if you put a B+tree index on the outer table of an index join and your inner loop has more iterations, your total lookup time would be higher compared to taking each key from the outer loop and probing it into an index on the inner table. But isn't it slower to probe the index every time? It is slower to probe the index every time, but you're probing it a smaller number of times; I think it's better to stream the larger table. We can keep talking about this after, yeah. And in hash joins we typically also just refer to, again, it's an implementation thing, people love to overload the terminology, you have your build side and your probe side of your tables. They don't necessarily always reason about outer and inner tables; your outer table is your build side, your inner table is your probe side. So the terminology gets a little weird, especially, like you said, with an index nested loop join, where the inner table is the one that relies on the index or already has the index. But that's just the way the terminology has wound up.

Yeah, so generally hashing is probably going to be the better choice. Unless you specifically need ordering, hash joins are usually going to be the way to do it. A lot of modern analytical systems just default to a hash join; they don't even bother reasoning about anything else. The note here says good DBMSs use either or both: a more traditional transactional or general-purpose, all-around system like Postgres is going to support all of these joins, whereas highly
specialized systems, like very fast transactional systems, will focus on index joins, and OLAP systems will specifically use hash joins. So next class you get to start talking about how to actually execute these queries; that's going to be on Monday. And then, like the initial slide said, you have a midterm on Wednesday, so good luck with that.