Alright, so next Tuesday I've got to skip town, so it'll be a pre-recorded lecture that I'll post on YouTube, and I'll remind everyone on Piazza. So next Tuesday there is no in-class lecture. Yes, sorry, yeah, I've only taught Mondays and Wednesdays for the last nine years at Carnegie Mellon. This is the first time I've ever taught Tuesdays and Thursdays, so my mind is all messed up. Yes, it's October 13th, but it's a Thursday, thank you. And I made the same mistake on Piazza. So anyway, the next class on Tuesday will be a remote lecture. Homework three will be due this coming Sunday. The midterm exam will be Thursday, October 13th, here at the regular class time, and I'll post the midterm study guide on Piazza. You should be able to download the PDFs for the practice exams if you're on campus Wi-Fi or VPN; post on Slack or Piazza if you can't. I updated the list of IP addresses that are allowed; if it doesn't work again, let me know. And then that's project two, not project three. Project two is out now. Checkpoint one will be due next Tuesday the 11th, and then checkpoint two will be due on Sunday, October 23rd. We'll have an info session this Thursday, which is, what, today's the fourth or fifth? Whatever, two days from now, on Thursday at 8 p.m. I'll post on Piazza; it'll be on Zoom again, and we'll record it, make it available on Box, and the slides will be available as well. Yes? No. The question is, is this Tuesday's lecture in scope for the midterm? No. The midterm will include up through this Thursday's lecture, inclusive, but not next week. It's kind of crappy for me to teach a thing on Tuesday and then ask you questions about it on Thursday. Yes? When will those be released? Oh, we'll try to get that done this week. Yes? The question is, when will the grading submission for checkpoint two be released on Gradescope? We'll get that done this week; I think it's done, we just have to post it. Yes? The question is, checkpoint two is due during the fall break; will there be office hours during fall break? That's a good question. Yes? The question is, how are you supposed to work on the project during the fall break? I mean, Charlie, what's the university policy on that? Are you really supposed to do nothing? Alright, we'll take this offline, we'll figure this out. Yes? Alright, I will fix this, yes, I'll just post the PDFs on Piazza, thank you. Anything else? It should be project two, yes, there are a lot of mistakes on this slide, I will fix it. It's the last thing I did before I came over, like, oh yeah, I've got to update this, sorry. It doesn't really instill confidence that everything else is correct, but if I'm only getting dates wrong, I'm okay with that. Okay? Alright, so let's jump into this. So today we're going to talk about joins. As I said, joins are the bedrock operator we have in relational algebra; they're going to be commonly used in queries, and we want to make sure we execute them efficiently. And it's sort of obvious why we need to do a join. The idea of the relational model is that we're going to break up a database into these relations, and, you know, we're not going to talk about normalization and denormalization here, but the idea is we split things up into relations as atomic units so we reduce the amount of redundancy and wasted space.
But now, when we want to answer queries, we've got to put things back together. And that's what a join is going to do for us. Now, if you were paying attention to what was going on in the database world 10 years ago when the NoSQL stuff was super hot, all those NoSQL systems were claiming that joins are super wasteful, super slow, you don't want to do them, and their systems were explicitly not going to support joins. Well, they were all wrong, and now pretty much all of them support joins in one form or another, right? So joins are super important. It's the way we're going to be able to take data that we've partitioned or normalized out and put it back together to answer queries. No matter whether you're doing OLAP or OLTP, at the end of the day you're going to do joins. For OLAP systems, join algorithms are going to be super important, because the research basically shows they're going to spend roughly 15 to 50 percent of a query's time doing joins. So we want to make sure we do this as efficiently as possible. We'll talk about how to do distributed joins later in the semester, and those systems are going to spend even more execution time doing joins, because they have to move data around between nodes, which is expensive, right? So the join operator we're going to have is going to reconstruct a normalized tuple across relations back into its original form, without any information loss. We're going to assume that our joins are correct, that they produce the right result; it's really about how fast we can make these things. So for this lecture, we're going to focus on a specific category or type of join. Specifically, we're talking about a binary join, meaning taking two relations, two tables, and joining them together. And we're going to focus on inner equi-join algorithms. Inner join means that for a tuple from the outer table, the left relation, there has to be a match in the right relation. And equi-join means I'm using an equality operator, an equality predicate: something in one table equals something in another table. There are also anti-joins and inequality joins, where we use range predicates or not-equals. That's a whole other set of algorithms; we'll look at those later. The inner equi-join is the most common one, so we're going to focus on that. And with easy extensions, we can support left outer joins and right outer joins using the same algorithms as well. Binary join means we're taking two tables and putting them together. There are multi-way join algorithms that exist, like, can I take three or four tables and join them together at the same time in a single algorithm? But they pretty much only exist in research. I've only seen two systems that actually tried to build a multi-way join. The first was Microsoft SQL Server in the late 90s; they added multi-way joins in '98. But then it turned out, it wasn't that it was super janky, but it made the system's performance unreliable or unstable, so they ended up removing it in 2001, right? And then more recently, the only system I know that's doing multi-way joins now is a system called RelationalAI.
They have an algorithm called the dovetail join, which I can post on Piazza, but they're the only ones actually doing this. Pretty much all systems are going to do binary joins. We can talk about multi-way joins in the advanced class if you want. So another thing to keep in mind is that in general, for all these algorithms, we want the smaller of the two tables we're joining to be the outer table, the table on the left side of the join operator. And then the larger table will be the inner table, on the right side. Now I'll explain what outer table and inner table mean in a second, but in general, when you look at query plans, the common convention is that the table on the left side of the operator as you're facing it will be the outer table, and the right side will be the inner table. Some systems, like Snowflake I think, flip them. There's no universal standard, but if we say outer versus inner, people will know what you mean. And it's going to be up to the query optimizer in our database system to figure out the ordering of these two things. We're not going to describe how we do that just now; we'll do that after the midterm. But just know that the query optimizer's job is going to be to figure out: I want to join these two tables, this one's larger than the other one, so let me put this one as the inner and this one as the outer. It can certainly get it wrong, and it will get it wrong, but it's trying to at least make an approximation to figure out the ordering for you. There are techniques where you can start running the join algorithm, realize you got the ordering incorrect, and then stop and reverse it. That's called adaptive query optimization. Again, it's one of those things that exists in the research literature; I don't really know any system that actually tries to do it, because it's expensive to stop everything, throw it away, and start over. OK, so what do I click here? Sorry. I showed this before, the query we talked about, and then we're going to convert this into a tree that represents our query plan. The output of each operator is essentially going to be the result of whatever computation it's doing, and we're going to feed that up into the operator above us in the tree. The amount of data we're moving and what that data looks like, we'll talk about that starting on Thursday. But the other decisions I talked about before were things like: here's the best join algorithm I should use, and here's the ordering of the tables I should use when I do the join. So in our join implementation, there are two decisions we have to make first. And these are not really something you would, how to say this, you wouldn't necessarily try different variants of these design decisions at runtime. Typically, these are baked into the system itself. When you implement it, someone has to make a decision: OK, we're going to do our joins in this certain way. There usually aren't systems that try to switch dynamically and figure out the best way, because from an engineering perspective, it's too complicated to maintain a bunch of different code paths.
So the first question is, what does the output of the join operator look like, the thing we're going to pass up the tree? And the next question is going to be, how are we going to determine whether one join algorithm implementation is going to be better than another, assuming our database can support a bunch of different join algorithms? The lower-end systems usually only support one, maybe two. The high-end systems, like the commercial guys, and Postgres as well, will support a bunch of different algorithms, so they can make a choice at runtime about which algorithm to use. Yes? So his question is, in this example here, where's the concept of an outer table versus an inner table? Again, this is a logical plan, right? The convention people use when you show these query plan diagrams is that the one on the left is the outer and the one on the right is the inner. But it doesn't have to be, right? Some database systems have the ability to dump out visual representations of their query plans, and it's usually the outer that will be on the left. At this point, for this query plan here, the logical ordering doesn't matter, because it's just a diagram. This isn't even saying whether I do a nested loop join, hash join, or sort-merge join; it just says join. Again, we'll talk about query optimization after the midterm, but the way to think about it is: you first do as much optimization as you can on the logical plan, where you don't have to worry about the explicit ordering of certain things. And then when you convert it to a physical plan that you actually need to execute, you start worrying about these things. So you do the optimization in two stages. Oh, sorry. All right, so the first question is, what is our join operator actually going to output? We certainly talked about this before when we covered early materialization and late materialization, which we'll go over again. But it's more than just that. It's: what is the amount of data I'm spitting out? What does the data look like? How do I keep track of the things that actually match? And again, we're assuming here that the join algorithm is correct, meaning this algorithm versus that algorithm will always produce the same logical result, the same set of tuples that match. But the ordering of the output can differ, right? If it's the sort-merge join algorithm, the output comes out sorted. If it's a hash join, it could come out unsorted, or will come out unsorted. Maybe I spit out tuples one at a time, or maybe I spit them out in batches, right? From a relational algebra perspective, none of that matters, but in the implementation, we actually are going to care. And so there's some logic in the query optimizer that can reason about what's above the join in the query tree and recognize, oh, the output needs to be sorted, therefore maybe I should just use the sort-merge join algorithm, because then I don't have to do a sort above; the output already comes out sorted the way I need it, right? So there's all this additional logic we have to care about to help us decide whether we want to use one algorithm versus another, yes. So his question is, a lot of the things I'm saying seem to be heuristics that decide whether one approach is better than another.
Is there any benchmark, or even analytical reasoning or models, we could use to determine that one is better than another? I mean, there are standard database benchmarks people use for evaluating one system versus another, and even within a system itself for a given design decision. In practice, there's a bunch of literature that tries to look at different approaches, but in general, most systems sort of pick one and just stick with it, right? And sometimes when you push the actual developers of these systems and ask why they do things a certain way, it basically comes down to, oh yeah, this is the way the last guy did it, or something like that. It's never a deep, deep reasoning. But there will be some design decisions we make that influence why we do some things one way or another. So again, if I'm on a column store, then maybe I want to do late materialization versus early materialization. So some things are influenced by what happens below you in the system. So we talked about early materialization and late materialization before, but I just want to go over it again in the context of joins. Say we're doing this join of R and S, a really simple dataset with a small number of tuples. The output of the join is basically stitching the attributes of R together with the attributes of S, producing this new tuple that has the combination of both sets of attributes, right? That's essentially what the join operator is going to output and shove up into the next operator, the projection, which can then do whatever it wants. So the idea here is that in the scans on R and S, the access methods on R and S, and I don't say whether it's an index scan or a sequential scan, it doesn't matter, I'm going to materialize all the attributes for each tuple and then pass them up the tree, right? This approach would be bad if the tables R and S are really wide, with a lot of attributes, because then I'm copying a lot of excess data that I don't actually need up above. But if the selectivity of the join is really, really small, like it's only going to produce one or two tuples, then maybe the wide tuples actually don't matter, right? So in this case here, an optimization we can do is obviously to push the projection down, because the query only needs the ID and the creation date. So it can do the projection before the join and then only pass up the subset of the data that I actually need, yes. So his statement is, if I'm joining on R.id and S.id, couldn't I just get rid of the join entirely? The answer is no, because you don't know the relationship between R and S. It could be one-to-one, in which case, yes, that would work, but it could be one-to-many or many-to-many, so you wouldn't be able to do that. That is actually logic the system could figure out: if it knows that R.id is the primary key and S.id is the primary key, and therefore it's going to be one-to-one, it could be smart enough to try to remove that.
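To make that concrete, here's a minimal Python sketch, not from the lecture, of the difference between the two approaches; the table layout and the record-ID scheme are made up for illustration:

```python
R = [{"id": 100, "name": "abc"}, {"id": 200, "name": "xyz"}]
S = [{"id": 100, "value": 9.9}, {"id": 200, "value": 8.8}]

def join_early(outer, inner):
    # Early materialization: stitch all attributes into the output tuple now,
    # so operators above never have to touch the base tables again.
    for r in outer:
        for s in inner:
            if r["id"] == s["id"]:
                yield {"id": r["id"], "name": r["name"], "value": s["value"]}

def join_late(outer, inner):
    # Late materialization: emit only the join key plus record IDs; an
    # operator further up uses the IDs to fetch just the columns it needs.
    for i, r in enumerate(outer):
        for j, s in enumerate(inner):
            if r["id"] == s["id"]:
                yield (r["id"], i, j)

# An upstream operator that only needs 'value' goes back using the record ID:
for key, r_rid, s_rid in join_late(R, S):
    print(key, S[s_rid]["value"])
```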
Anyway, the main thing about this, though, is that as we go through the query plan, we never need to go back to the original tables. We pay that cost up front and hope that the amount of data we pass along doesn't become too onerous, in which case it would have been better not to bring everything along. That's what late materialization does: it's the same join, but now in the output of the join operator, instead of passing along the actual values from R and S, I only pass along the record ID for each tuple that matches. And then at other parts of the tree, like when I need the creation date, I use the record ID to go find the actual attribute data that I need, right? As I said, this is ideal for column stores, because we're not copying around a lot of data that we don't actually need. So a lot of the column store systems would use this. One of the first column store systems was Vertica, and they made a big deal about this 15 years ago, but then they told me before the pandemic that they actually got rid of this optimization, because it turned out to be better to just go fetch all the data at the beginning, to do early materialization. Because then you know what the performance cost is going to be at the very beginning, and it makes the performance of the query more stable. The number of times I have to go fetch data for this operator up here depends on the number of tuples that come out of the join, and if I can't correctly predict how many tuples are going to come out of the join, then sometimes the query will be really fast and sometimes it will be really slow. Whereas if I go fetch everything at the very beginning, then no matter what the predicate is that I'm doing my join on, the cost of the query is roughly the same. For a lot of businesses, stable query performance is better than having some queries be super fast and some be super slow. Yes. So your statement is, for this early materialization example, isn't it better to have the outer table be the larger one? Because why, sorry? The wider one, you mean. So your question is, is the number of tuples or the number of attributes more important? The number of tuples is usually more important, yes. I can imagine corner cases where you have, like, 16,000 attributes. What do you think the upper bound is in Postgres? Oracle famously only allows 1,000 columns. It's hard-coded in the code. It used to be like 200 in the 90s, and some guy spent two months fixing it. It was so painful they said they're never going to go back and do it again. So they only support 1,000 columns because it's baked in there. Postgres is 2^16, I think, and a bunch of other systems are 2^16 as well. All right, the other question is, how are we going to determine whether one algorithm is better than another? Again, we'll talk about the query optimizer after the midterm. The basic idea is that it's going to have some internal cost model that says one algorithm is better than another. And the unit of measurement it's going to use for this cost metric is going to be based entirely on I/Os, right?
Because as I said at the beginning of this class, we care about the number of times we have to read and write the disk, and we care less about the performance of stuff that's in memory. The disk is so slow, although that's slowly changing with modern hardware. For now, we assume the disk is always the bottleneck. And so the cost I'm going to use is the number of I/Os needed to compute the join. The variables we'll use going forward in this lecture: we're assuming we have two tables, R and S, because we're only doing binary joins. We have big M pages in R, and within those M pages we have little m total tuples in the relation. And then we have big N pages in S, and little n tuples in S. We're going to ignore the computation cost of what happens in memory, and we're also going to ignore the cost of the output of the join operator, because that varies based on the selectivity of the predicate, it depends on the data, and that's really hard for us to compute at this point. Okay? All right, the first thing to bring up is the cross product. That's technically the simplest join. It's just two for loops combining every tuple in the outer table with every tuple in the inner table, and then once you produce that giant output, you go back and apply the filter to remove things that don't match. It's the most naive, and also the worst possible way to do a join. There are some cases where you do want a cross product; you can specifically say you want a CROSS JOIN in SQL, and then you get this algorithm. But from a theoretical standpoint, while it's a possible way to implement a join, we know it's always going to be slow. So all systems either don't consider it or throw this possibility out right away. The cross join exists, but we can ignore it entirely here. The inner join is the most common join we want to support; after that it's the left outer join, which again is just an extension of the inner join. All right, so there are three categories of join algorithms we're going to look at. The nested loop join, which has different variants. The sort-merge join, which will be very similar to the external merge sort we talked about last class. And then the hash join, which is the one we want to focus on the most, because in general it's going to be the fastest and the best in almost all cases. For OLAP, you almost always want to do a hash join. If your output needs to be sorted and the sort key is the same as the join key, then you want to use sort-merge, because you get the sorting for free. But if it's OLTP, where you're looking up one or two records and you already have an index, then you want to use the index nested loop join, okay? And again, different database systems will have different implementations of these three categories of join algorithms. Some systems, like Postgres and SQL Server and DB2 and Oracle, support all of these. I think MySQL only supported the nested loop join for the longest time; I think they added the hash join maybe five years ago. And if you optimize for OLTP, you can get pretty far with just the nested loop join.
All right, so let's look at the basic nested loop join. As I said, it's basic. There's this notion of an outer table and an inner table. We call them that because if you think of the nested loop join as just two nested for loops, the outermost for loop is over the outer table and the innermost one is over the inner table, right? And all we're doing is, for every single tuple in the outer table, we go look at every single tuple in the inner table. If they match on our join predicate, which we don't really care what it is at this point, then we have a match and we emit it in our output. And as I said, visually, when people draw these query plan diagrams, the outer table will be on the left and the inner table on the right. So this is the dumbest thing you could do. Yes, for the cross product you just remove this predicate check part; it's the same thing, you're just not checking any predicate. All right, so this is the dumbest way to do a join, because it's so naive: for every single tuple in the outer table, we look at every single tuple in the inner table, and every time we move to the next tuple in the outer table, we scan the inner table all over again. There's no notion of pages, no notion of cache locality, no notion of a buffer pool here. We're just blindly grabbing every single tuple we can. Yes, so the statement is, for this naive, stupid join algorithm, it doesn't matter which table is the outer and which is the inner. Yes. All right, so what does this actually cost us? It's going to be M plus little m times big N: we read every page in the outer table once, and then for every single tuple in the outer table, we go fetch every single page of the inner table. So the formula is M + (m × N). This seems very abstract, yes. His statement is, can't we just buffer this? Yes, we'll get to that. Why do we have to do little m times N? Because you go read the pages of the outer table, that's the first term, big M, and then for every single tuple in those pages, you go fetch every single page of the inner table. In this algorithm there's no buffering, no notion of, oh, if I bring this page in I can do a bunch of scans against it. It's blindly going to the next page over and over again. So let's put some numbers behind this. Say we have a simple database with a table R that has 1,000 pages and 100,000 tuples, and a table S that has 500 pages and 40,000 tuples. The basic cost analysis says this would take about 50 million I/Os. So if you have a disk that can do an I/O in about 0.1 milliseconds, the total time for this naive join is about 1.3 hours, right? If you put the smaller table on the outside, you do a little bit better and get down to about 1.1 hours. And if you do some basic math on my example tables here, assuming four-kilobyte pages, this database is six megabytes. It easily fits in L3 cache. But if I'm going to disk for every single page, I'm going to get crushed. Yes. The statement is, the insane part is that for every single tuple, I have to fetch a whole page. Yes, because again, it's page-oriented storage on disk, right? We can't go pull out the exact byte sequence we want; we have to fetch entire pages.
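As a sketch (mine, not the lecture's code), the stupid nested loop join and the cost arithmetic above look something like this in Python; the 0.1 ms figure is the per-I/O assumption from the example:

```python
def naive_nested_loop_join(outer, inner, pred):
    # For every tuple in the outer table, scan the entire inner table.
    # No page awareness, no buffering: hence M + (m * N) page reads.
    for r in outer:
        for s in inner:
            if pred(r, s):
                yield (r, s)

# The example's numbers: R is the outer table, S the inner.
M, m = 1_000, 100_000   # pages and tuples in R
N, n = 500, 40_000      # pages and tuples in S

io_cost = M + m * N                   # 50,001,000 I/Os
print(io_cost * 0.1e-3 / 3600)        # ~1.39 hours at 0.1 ms per I/O
print((N + n * M) * 0.1e-3 / 3600)    # ~1.11 hours with the smaller S as outer
```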
Now, I'm using four-kilobyte pages here; your database could have larger pages, but the math is basically the same. Yes. The question is, what if you take the CPU cache into account? Again, six megabytes will fit in L3, so yes, I wouldn't actually have to go to disk. I'm just trying to show you how bad it actually is if you do have to go to disk. So to the point he brought up: can't you just buffer things, since you know you're bringing pages into memory, and do as much processing as you can on what you've brought into memory for the outer table and the inner table? Yes, that's called the block nested loop join. This is what people would actually implement, right? So for each block, or page, of the outer table, I go fetch a block of the inner table, and then for each tuple in the outer table block, I look at each tuple in the inner table block, and if they match, I produce the output, right? And now we're doing a little better, because now our cost is going to be M plus big M times big N: for every page of the outer table, I fetch it once, and then for every page of the outer table, I fetch all the inner table pages. Yes, block is the same thing as page; I'm using the terms interchangeably. I think the textbook calls this the block nested loop join; buffer nested loop join is the same idea. We're just making our algorithm aware that there's this notion of pages or blocks, and that there could be more than one tuple inside a page. So here we only have to assume we have three buffer pages: one for the outer table, one for the inner table, and, he's correct, a third for the output, yes. So again, we want to put the smaller table as the outer table, and we want to base this on the number of pages in a table, not the number of tuples, because the number of tuples per page can vary, and it's all about reducing the number of page reads from disk. So going back to my simple example from before, table R with 1,000 pages versus table S with 500 pages: when I do the math, I come down from 50 million I/Os to about 500,000, and now I'm down to about 50 seconds. We went from over an hour to 50 seconds, just by being aware that we can keep things in memory and process them as much as possible. I think the textbook uses a spinning-disk hard drive for these examples, with like a five-millisecond seek time, so their numbers look different, but these numbers are more realistic for modern hardware. All right, so the example he brought up is, what if I have more buffers? So say I have B buffers, and assume one buffer holds one page. I have B minus two buffers for scanning the outer table, because I always need one buffer for the inner table and one for the output, right? And the basic algorithm looks like this: using B minus two pages, fetch as many pages as I can from the outer relation, then for each page of the inner relation, do the same matching I had before and produce the output if the join predicate evaluates to true.
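Here's a rough Python sketch of that B-buffer version, with tables represented as lists of pages, each page a list of tuples; this is my illustration, not code from the lecture:

```python
from math import ceil

def block_nested_loop_join(outer_pages, inner_pages, pred, B):
    # Hold B-2 pages of the outer table at once; reserve one buffer for the
    # current inner page and one for the output (not modeled here).
    chunk = B - 2
    for i in range(0, len(outer_pages), chunk):
        outer_block = [r for page in outer_pages[i:i + chunk] for r in page]
        for inner_page in inner_pages:        # one pass over S per outer chunk
            for s in inner_page:
                for r in outer_block:
                    if pred(r, s):
                        yield (r, s)

# Page reads: M + ceil(M / (B - 2)) * N; if B - 2 >= M, it collapses to M + N.
M, N, B = 1_000, 500, 102
print(M + ceil(M / (B - 2)) * N)              # 6,000 I/Os with 100 outer buffers
```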
So now we go back to our example for the formula. Now it's M for the outer part, plus the ceiling of M divided by B minus two, times N. So I do one complete pass on the outer table, and then for each chunk of B minus two outer pages, one pass over the inner table, right? If the outer relation fits entirely in memory, then we're golden, because now it's just M plus N: go fetch the entire outer relation, bring it into memory, and do one sequential pass over the inner relation. Now we're down to 1,500 I/Os, doing this in 150 milliseconds. That's starting to look reasonable, yes. So the statement is, for these I/O times, I'm only considering the cost of fetching things into memory. For these nested loop join algorithms, you never write anything back out, right? Because I bring a page into memory, I check whether it matches, and once I'm done with it, I just throw it away. I'm not writing anything. The bottleneck is reading from the disk into main memory, yes. I think I mentioned there's some hardware where you can actually push down predicates onto the device itself; they have little ARM cores down there, and you can do some filtering down there, but we can ignore that, yes. So the question is, outside of this algorithm analysis, in a real system, should I assume all the buffers I have in memory can be used for joins? No. The way it works in certain systems, you can specify how much memory you want to dedicate for each query to use for joins, and once a query exceeds that threshold, it spills to disk. So you can say something like, I want to use one gig of memory in total, but maybe one megabyte per query for joins, and if I exceed that, I spill to disk. So the amount of memory in the system being used for joins at any given time depends on the number of active queries doing joins at that moment, and the number of active queries depends on some other knob you can set. Yes. So his question is, what if I don't use a buffer to store the output and I just write it to disk? You can't do that. What does it mean to do a write? What's the syscall to do that? Yeah, but what do you pass into the syscall? A buffer, right? You can't wave a magic wand and have bits appear on the disk, right? I'm not trying to dismiss your question, but with the classic POSIX APIs like fwrite, you have to pass in a buffer. So that's our buffer there, right? There's NVMe, and there are APIs in modern Linux where you can do kernel bypass to do direct writes to the hardware and avoid the buffering stages within the OS. But even then, you've got to pass a buffer. Could you pass just 10 bytes? I mean, you can do it, but the OS is not going to be happy about it, or the hardware is not going to be happy about it, because it's going to turn into a four-kilobyte block no matter what, right? There's something else, too: there's io_uring in newer Linux, where you basically get a ring buffer, though that's mostly for reading. You can get a shared buffer that's shared with the hardware or the OS, and, I mean, it still has to be allocated, so it still has to count against our join algorithm. This is getting more advanced.
But there is a way to basically get a shared buffer where you can write things in and not have to do an additional malloc in your join algorithm. At the end of the day, though, you need memory to put something somewhere, right? Yes. So the statement is, I made a big deal before about how sequential I/O is going to be faster than random I/O, even on modern SSDs. Is there any consideration of that in my join algorithm here? Well, I'm scanning through the table one page after another, so if the disk manager has laid those pages out sequentially on disk, then this is all sequential I/O, right? If the pages are sequential on the hardware, I get the benefits of sequential I/O. Yeah. His statement is, a modern SSD gets better performance on random I/O than a spinning-disk hard drive gets on sequential I/O. That is true, yes. But even with modern SSDs, there's basically a whole other computer down there, and if you give it a byte stream all at once, it will try to lay it out sequentially. Your statement is that you're always better off storing things contiguously on disk, yes. Remind me and I'll post it on Piazza; there are always benchmarks that come out, and there's a tool called fio you can use for micro-benchmarking. People take the new SSDs that come out and show that for sequential I/O, the gap isn't as huge as it is with a spinning-disk hard drive, but it still exists, yes. Yeah, so his question is, in my example here, I'm using one buffer for the output, but when it gets full, where does it actually go? I'm ignoring that for now. I said at the beginning that in our cost analysis we ignore the cost of the output and where it goes, because it depends on how the system is actually implemented. It may be the case that when my buffer is full, I pause the join algorithm and hand the buffer up to the next operator and let it do whatever it wants. Or maybe I send it to another machine. For our purposes here, we can ignore that. All right, so let's get through the rest of the nested loop join stuff quickly, because it's not the main focus. Why is it so bad? Again, it's very naive: a brute-force search where, for every single tuple in the outer table, we do a sequential scan to check for a match in the inner table. And even if we do buffering, we're still looking at everything. One thing the system can do, though, if it already has an index on one of the tables we're joining, is make that table the inner table and use the index. Then, as we scan through the outer table one tuple at a time, we can just probe that index and find the thing we're looking for, right? This is called the index nested loop join. And this is what most OLTP systems will implement, because usually the query is something like, for Andy's account ID, find all the orders he made. You have an index on, say, the order table keyed on the user ID, and you can easily probe that index and find your matches, right? So we just do a scan on the outer table, and for every single tuple in the outer table, do a probe into the index; if we find a match, we produce our output. So the cost now is M plus little m times C. Oh, shit, every time.
So C is this constant factor that's hard for us to quantify, because it depends on what the index actually is and what a lookup on the index actually does, right? If it's a unique index, then one key lookup produces one tuple. But it may be a non-unique index, a secondary index, where one probe could produce, I don't know, a million tuples. We don't know, right? So we just say it's some constant factor C that's undefined at this point in our analysis. In a real system, you would know what the index actually is, you'd have a rough estimate of the selectivity, and that can determine whether the index nested loop join is a good idea or not, right? And, going back here, now this is broken. In my example here, I'm just doing a sequential scan on the outer table; that could also be an index scan, right? But the idea I'm trying to show is that on the inner table, instead of doing a sequential scan, I do a probe into an index. This is essentially what the hash join is going to be as well: we build a hash table index and then do probes into it, yes. The statement is, can I think of C as the number of pages we access in the index probe? Yes. But again, it may be the case that it's one traversal down and I find exactly what I want in a single page, or maybe I traverse down to a leaf node and then have to scan across the leaves. There's no guarantee. And we can't assume the index is already in memory; it may fit in memory or it may not. This question is, does my index have to be an exact match on the search key? No, I could have, like, a prefix of the join predicate; that gets me at least to the leaf nodes, and then I can scan along, and I'd have to do an additional predicate evaluation on every tuple that comes out of the leaf node to see whether it matches my join, and then produce the output. The statement is, you don't see why this is a good idea, because I traverse the index, I go fetch the page the index points to, and then do an additional check. It's certainly better than blindly sequentially scanning the inner table, right? Again, you can't assume the cardinality between the inner and the outer table: it could be one-to-one or one-to-many. If it's one-to-one, this is beautiful, because I probe down the index on the inner table and I get exactly the one tuple I need. Sure, a no-match versus a match doesn't matter for our purposes; I still have to pay the cost of the lookup. All right, so key takeaways for this. We always want to put the smaller table as the outer table. We try to buffer as much of the outer table in memory as possible and then loop over the inner table, and if we have an index, we use it. Some systems will actually build the index for you on the fly, which is essentially what a hash join is going to do for us, but in the case of SQL Server, it will actually build a B+tree index to do your join. And if you do it enough, I think it gives you a warning saying, hey, you probably should add this index, we're using it a lot, it would be a good idea. So again: we talked about the stupid join; nobody does that. The block nested loop join is probably the most common, and then for OLTP systems you definitely want an index, and that's the index nested loop join.
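Before moving on, here's a minimal sketch of the index nested loop join; the dict stands in for whatever B+tree or hash index already exists on the inner table (my illustration, not the lecture's code):

```python
from collections import defaultdict

def index_nested_loop_join(outer, inner_index, key_of):
    for r in outer:                                # scan the outer table
        for s in inner_index.get(key_of(r), []):   # one index probe per tuple
            yield (r, s)                           # a probe may return C tuples

# Build the stand-in "index" (a real system would already have one):
inner_index = defaultdict(list)
for s in [(100, "a"), (200, "b"), (200, "c")]:     # non-unique join key 200
    inner_index[s[0]].append(s)

print(list(index_nested_loop_join([(100, "x"), (200, "y")],
                                  inner_index, lambda r: r[0])))
# Cost: M + (m * C), where C is whatever one index probe costs.
```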
Okay, all right, so now the sort-merge join. I think I mentioned this last class: this is the sort-merge join, and within the sort phase of the sort-merge join, you can use the external merge sort we talked about last time. I'll try to be clear about which one we're talking about. Actually, for this lecture, we don't care what the sorting algorithm is. Quicksort, heapsort, whatever, we don't care. We just assume the data comes back sorted after we're done with it, right? So in the first phase, we sort both the outer table and the inner table on the join keys. And then in the merge phase, we have two cursors that walk through the outer and the inner table together, sort of in lock step, and we check whether we have matches. Depending on whether we have a match or not, we iterate one cursor or the other. And based on the join type, and whether there can be duplicates, we may have to backtrack on the inner table, move the cursor back up, and scan again. But the basic idea here is that because the data is sorted, we know the boundary of how far we have to go back. We don't have to jump back to the very beginning, as you would in a nested loop join; we can jump back to a starting point that's not all the way back. Now, in the worst-case scenario, if there's only one value for your join key, then there's nothing you can do about that: you have to jump back to the beginning no matter what. But in practice, most of the time, you don't have to worry about it. So this is the join algorithm. I don't like showing code in class, so I'll walk through a visual example next. But basically, here's the first phase, where we sort both R and S. Then we open cursors on the sorted R and sorted S. And then while the cursors are not exhausted, in particular for the outer table, we check whether we have a match and produce the output, and if we know one side is greater than the other, we increment the other cursor. Yeah, so she said there's no backtracking shown here; for simplicity, I didn't show that. Let's walk through an example and we'll see what it means, yes. All right, so here's the same join we had before, joining R and S. The first thing we do is sort R and S on their ID columns, right? And we produce output like this. Now we have the cursors at the very beginning, and we just compare the join keys of the outer table and the inner table. In this case here, since R's ID is 100 and S's ID is 100, that's a match. So we put the joined tuple in our output buffer, and then we increment the cursor on the inner table. So we move the cursor down on S, like this. Do the same comparison on R's ID and S's ID: again, both are 100, so we have another match, and we increment the inner cursor again. Now at this point, the cursor in R is pointing at 100 and the cursor in S is pointing at 200. Since 200 is greater than 100, we know we need to increment the outer cursor, right? And we're never going to have to backtrack on the outer table for an inner join. Now we do a comparison: 200 equals 200, so we produce that in our output and increment the inner table cursor. 200 is less than 400, right? So we increment the outer one.
But now we see we get 200 again, right? So in this case, we have to keep additional metadata to know we need to backtrack the inner cursor to the first occurrence of the last value we saw. In this case, we saw 400 on the inner side, and the last distinct value we matched was 200. Now, that could be exactly the previous position, which in my example here it is, or it could be some number of hops away, because 200 could be repeated; I need to jump back to the first 200, right? And now I can do my comparison again and produce the additional result in my output buffer. Same thing: the inner one increments to 400. 200 is less than 400, so I increment this one. 300 is less than 400, so I increment this one again. Now I have 400 equals 400, produce the output. Then I get 500 here, this moves down, I get 500. Now, since this is a match, I don't have to go back to the 400, because I know there's nothing above this 500 that could possibly match: I didn't see 500 before on the inner table. And then I reach the end, and I know I'm done, and I don't have to backtrack on this side here. Is that clear? And again, we're not talking about left outer joins here, but note that if you're doing a left outer join instead of an inner join, when I don't find a match for my outer cursor, I just produce the result anyway, with nulls for the inner table's attributes. That's basically how you'd use this to do left outer joins. All right, so what's the complexity of this? It's the same sorting cost we saw last class, assuming we have a bunch of buffers: we pay the cost to sort the outer table and the cost to sort the inner table. And then when we do our merge, we assume there's no backtracking, because it's hard to identify where the actual page boundaries fall for tuples; so for simplicity, assume there's no backtracking. The merge cost is then just M plus N, because we scan through every page in the sorted outer relation and every page in the sorted inner relation once and do our merge. So if we now put in the real numbers from before, and say we have 100 buffer pages to do the processing, we end up at around 0.75 seconds. So we went from over an hour with the naive join, to about 50 seconds for the block nested loop join, ignoring the in-memory nested loop join and the index one we can't quantify, and now we're down to 0.75 seconds. So far so good, right? We're making progress. Now, the worst-case scenario for the sort-merge join is when there's a single value on your join key, like everyone has ID equals one in the outer table and the inner table. You pay the sort cost, obviously, but the sort is useless because everything is all ones, and then the merge degenerates into comparing everything against everything, right? I'm sure somewhere in the world people have datasets where every tuple has the same join key value and they try to join them together, but in practice this is not likely. The data system will do it because you asked it to do it. I mean, you can buy Drano; you're not supposed to drink it, but people drink it, right? There's nothing we can do to prevent this.
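Here's a compact Python sketch of the merge phase with the inner-side backtracking from the walkthrough; this is my in-memory illustration, ignoring pages and buffers:

```python
def sort_merge_join(R, S, key):
    R, S = sorted(R, key=key), sorted(S, key=key)    # sort phase
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):                 # merge phase
        if key(R[i]) < key(S[j]):
            i += 1
        elif key(R[i]) > key(S[j]):
            j += 1
        else:
            mark = j                                 # first inner tuple of the run
            while i < len(R) and key(R[i]) == key(S[mark]):
                jj = mark                            # "backtrack" the inner cursor
                while jj < len(S) and key(S[jj]) == key(R[i]):
                    out.append((R[i], S[jj]))        # emit matching pair
                    jj += 1
                i += 1
            j = jj                                   # skip past the duplicate run
    return out

print(sort_merge_join([(200, "a"), (100, "b"), (200, "c")],
                      [(100, "x"), (200, "y")], key=lambda t: t[0]))
# Total cost: sort(R) + sort(S) + (M + N) for the merge, ignoring backtracking.
```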
So I've already said this: when is the sort-merge join useful? If both tables, or at least one of them, is already sorted on the join key, like if you have a clustered index, then you don't pay the sort cost, and that's fantastic. Also, if there's an ORDER BY clause on the query and it happens to be on the same keys as your join, then it's two for one: you do the sort-merge join, and you don't pay for the sort afterward at all, right? Again, the query optimizer can be clever and figure out, okay, my join is only going to produce one tuple or a small number of tuples, so who cares if I get the sorting for free; the hash join or the index nested loop join could still be better. Systems can also be aware that if you're doing an index nested loop join, since the index is sorted, assuming it's a B+tree and not a hash index, that output will come out sorted too, and then you potentially don't have to do the ORDER BY either, right? So let's get to the most important one, the hash join. The main idea is this: if some tuple in R and some tuple in S satisfy our join condition, something equals something, then obviously those two values have to be the same, right? My R.id has to equal my S.id. And if that's the case, then we know they will hash to the same value as well, because they are the same value, and the same value produces the same hash, right? So the basic idea here is again a divide-and-conquer approach: since we know matching tuples hash to the same thing, we build a hash table that lets us quickly find the location that has the matching tuple from the outer table, right? The idea is that we limit the number of comparisons we have to do by scanning through the outer table once, building a hash table, and then just probing into that hash table with the inner table. It's basically the same idea as the index nested loop join, except I have to build a hash table instead of having a B+tree already sitting there, right? So the basic hash join algorithm has two phases. Phase one is the build phase, where you take the outer relation and build a hash table with some hash function h1 on your join attribute. It could spill to disk if necessary, and you can use any of the hash table techniques we talked about last week; for now, it doesn't matter. In practice, though, if everything fits in memory, linear probing is usually the best. Actually, yeah, linear probing is pretty much always the best; you always want to choose that. Simple is better in this case; you just want to be as fast as possible. And then the second phase, the probe phase: for every single tuple of the inner relation, use the same hash function, jump into the hash table, and see whether you have a match, right? So it looks like this. Phase one, we build the hash table on the join key of the outer table. And then in phase two, we scan through the inner table, use the same hash function, and find a match. So the key in our hash table is going to be the attributes we're joining on, R.id and S.id; we use that as the hash key.
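A minimal sketch of those two phases, using a Python dict as the hash table; this is my illustration, where a real system would use one of the hash table schemes from last week:

```python
from collections import defaultdict

def hash_join(outer, inner, key_of):
    # Phase 1 (build): hash every tuple of the (smaller) outer relation.
    # Values are lists, since the join key may not be unique.
    table = defaultdict(list)
    for r in outer:
        table[key_of(r)].append(r)     # the key itself is kept in the table
    # Phase 2 (probe): hash each inner tuple with the same function and
    # emit an output tuple for every real match.
    for s in inner:
        for r in table.get(key_of(s), []):
            yield (r, s)

pairs = list(hash_join([(100, "a"), (200, "b")],
                       [(100, "x"), (300, "y")], key_of=lambda t: t[0]))
print(pairs)   # only the key-100 tuples match
```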
We always need to keep the original value of the key in the hash table as well, because with hash collisions we might land in a slot in the hash table and have to check whether the key sitting there truly is the one we're joining on; otherwise we could have false positives. The value side will vary based on the implementation. If it's early materialization, you'd have the entire tuple, or the portion of the tuple you actually need; if it's late materialization, it would just be the record ID, right? But you always want to keep a copy of the key in the hash table so you can do the comparison without another lookup. All right, so how fast is this? Assuming we use B minus one partitions in phase one, the largest table we can handle is roughly B times B minus one pages, so to handle a table of N pages, we need roughly the square root of N buffers. And there's this fudge factor idea here: since we don't know at this point how skewed the data actually is, some hash buckets might spill to disk because everything hashed to the same location (we'll see an optimization for that in a second), so you really need about the square root of f times N buffers. But the basic idea is that I take one pass through the outer table to build the hash table, assuming it fits in memory across those B minus one partitions, then one pass through the inner table to do the probe, find my matches, and produce my output. The fudge factor captures how many collisions we'll have, how many extra things I have to look at. So, one common optimization you can do is to add a bloom filter in front of the hash table. I think someone brought this up earlier when we talked about hash tables, and I said, yes, this is a common optimization. The basic idea is that the cost of probing the hash table is non-trivial, right? You have to hash the key, but then you have to do the lookup, and depending on the collision rate and the fill factor of your hash table, you may have to do a lot of comparisons to find the thing you're looking for, or to find the empty slot, if you're doing linear probing, that tells you you're done. So we can put a bloom filter in front of the hash table and use it as a quick and dirty determination of whether the key we're looking for exists at all, without actually looking in the hash table. The idea is that the bloom filter is much smaller and more compact in memory than the actual hash table itself. So depending on the selectivity of my keys, like if my keys are mostly never going to match on the inner table, then I check the bloom filter, the bloom filter says the key doesn't exist, and that's super fast compared to probing the hash table. So the basic idea is this: as I'm scanning the outer table in the build phase and building my hash table, I also build this bloom filter. And then I pass it along to the probe side over here, so when it does the probe, it checks the bloom filter first. If the filter says the key doesn't exist, I'm done. If it says it does exist, then I actually go do the probe in the hash table itself. Who here knows what a bloom filter is? I always ask this every year. All right, less than half. Okay, cool. So let me explain what a bloom filter is.
Bloom filters are a super important data structure; you'll see them all throughout your life in computer science, not just in databases, but they're super useful in databases. This is going to be a probabilistic data structure: a big bitmap we can use to answer set membership queries. Set membership means, does this key exist, yes or no? And that's different from an index. An index says, does this key exist, and if yes, where do I go find it? The filter can't tell you where it is; it only says whether it's there or not, right? And it's a probabilistic data structure, meaning it can produce incorrect results with some probability, the false positive rate. We'll never have false negatives: if the filter says a key doesn't exist, that's true, it doesn't exist. But we can have false positives, meaning we ask whether a key exists, it says yes, and the key isn't actually there. So, going back to my example: if I'm doing a join of A and B, I check the bloom filter and ask, does my key exist? If it says no, it doesn't exist, I skip it, and I won't get incorrect results. If it says yes and the key isn't actually there, I come back, check the hash table, and find it isn't there. In that case I don't produce an incorrect result; I just did a hash probe I didn't actually have to do. Yes, it's basically doing hashing before the hash table, but the hash table is huge compared to the bloom filter, right? So the bloom filter only has two operations, insert and lookup. Insert is: I take a key, run some number of hash functions on it, and set bits in my big bitmap based on the outputs; I'll show what that looks like on the next slide. And lookup is: for a given key, for each hash function, check whether the corresponding bit is set to one. If they all are, then I say the key exists. So here's a really simple example. The term bloom filter comes from the guy who invented it in the 1970s; his name was Bloom, and that's why it's called this. I'll also say, there are other variants that can support deletes, but we can ignore those for now. For hash joins, we don't do deletes; we just want to build this thing and be done with it, so insert and lookup are all we need. All right, so I have eight bits in my bloom filter. In reality, you want something larger, and there are formulas: if you go to this website here, you plug in how many keys you have and how much memory you have, and it tells you what the false positive rate will be and how many hash functions you want to use. All right, so say I want to start inserting keys. I insert RZA. Say I have two hash functions; each produces some output, I mod it by the number of bits I have, and then all I do is take the result and flip that bit in my bitmap, right? Same thing when I insert GZA: run the same two hash functions, mod by the number of bits, and flip the bits in the bitmap. In this case, the first hash function maps to position three, but that bit was already set by RZA, right? And it's not a counter; it's just setting a bit. So now I do a lookup on RZA. It's just the reverse: I take my key, hash it, mod by the number of bits, and if all the bits the hash functions point to are set to one, then I know my key exists. So this returns true. Yeah, I know my key might exist. Yes, thank you. Now I do a lookup on Raekwon the Chef, right?
So for Raekwon I again get five and three. Three is set to one, but five is zero, so the lookup returns false — and I know that's correct, because all of the bits would have to be set for a true. But if I look up ODB — rest in peace — I get three and six, and both of those bits happen to be set from the earlier inserts, so this returns true, and that's a false positive. So again: I've got my hash functions and this bitmap for my Bloom filter, which will sit in CPU cache, or certainly in memory, depending on its size. It's way faster to probe this to see whether something exists than to go do the lookup in the hash table.

So, going back to the join: this optimization is super common, and in practice I think the numbers show you can get something like a 2x speedup on hash joins — it depends on the selectivity of the join operator — because you avoid probing the hash table, which could be huge and could require fetching things from disk. That's a big win, and most modern systems do this. The reason it's called sideways information passing is that it sort of breaks the DAG architecture we talked about before, where data flows from child operators up to parent operators in the query plan; here we're sliding the Bloom filter sideways to another part of the plan. That's Vertica's term for it — other systems may call it something else — and if you think of the Bloom filter as embedded inside the hash table, then you're not really doing sideways information passing, but the concept is the same. All right, any questions about Bloom filters or this optimization?

All right. Now, if we don't have enough memory to store our entire hash table, the hash join becomes problematic, because it's all random I/O. In the example I showed, I assumed the hash table fits entirely in memory; I can put a small Bloom filter in front of it and avoid some of that random I/O on the hash table, but if the table itself doesn't fit in memory, that's a big problem for us. We don't want the buffer pool swapping pages out in essentially random order, because the hash function randomizes everything and there's no guarantee the next thing we need will be in memory. So in this case we want to use the same technique we talked about at the end of last class: partition the data into buckets, write them out in stages, then bring each bucket back in sequentially, do all the processing on the data we have in memory, discard it, and move on to the next bucket. This converts what would have been random I/O over a large hash table into sequential I/O over smaller chunks. The technique is called the partitioned hash join — the textbook and Wikipedia may call it the Grace hash join; I'll explain where "Grace" comes from in a second. We again split the hash join into two phases: a partition phase, where we do a pass over the outer table and the inner table, hash each tuple's join key, and write the tuples out to buckets; and a probe phase, where we bring in the matching pair of buckets, build a hash table on the outer table's bucket, and then do our regular hash join, probing it with the inner table's bucket.
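Here's a minimal sketch of that two-phase flow, assuming each pair of matching buckets fits in memory. The names and the fixed partition count are illustrative, not any particular system's API; a real system would use B-1 partitions for B buffer pages, and the in-memory lists here model spill files on disk.

```python
NUM_PARTITIONS = 4   # stands in for B - 1 partitions

def partition(table, seed=0):
    # Phase 1: route each tuple by a deterministic hash of its join key.
    # Both tables use the same function, so matching keys always land in
    # buckets with the same number.
    buckets = [[] for _ in range(NUM_PARTITIONS)]
    for tup in table:
        buckets[hash((seed, tup[0])) % NUM_PARTITIONS].append(tup)
    return buckets

def grace_hash_join(outer, inner):
    r_parts, s_parts = partition(outer), partition(inner)
    # Phase 2: bucket i of R can only ever match bucket i of S, so join
    # each pair independently with a regular in-memory hash join.
    # Total disk cost is roughly 3(M + N) pages: read + write both tables
    # to partition them, then read both back in to probe.
    for r_bucket, s_bucket in zip(r_parts, s_parts):
        table = {}
        for tup in r_bucket:                         # build
            table.setdefault(tup[0], []).append(tup)
        for tup in s_bucket:                         # probe
            for match in table.get(tup[0], []):
                yield match + tup
```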
Depending on the size of the buckets, you could just do an in-memory nested loop join at that point, which would be fast because you skip building a hash table, but in practice you typically do build a hash table on the build side.

The reason it's called the Grace hash join is that there was a very influential academic project in the 1980s called GRACE, out of the University of Tokyo. They were trying to build a system that could support joins that exceeded the amount of memory, and they came up with this technique. GRACE was an early example of what was called a database machine. That's a term that's not really used today, but back in the 1980s — and honestly even today — a holy grail of database systems has been specialized hardware that can run some piece of the system: a join operator, a sort operator, some portion of query execution, in custom hardware. There are systems that do this now on GPUs, and you can do it on FPGAs and so forth, but back in the 1980s it was the wild west; people were building all sorts of crazy things, because at the time databases were super slow, so anything you could specialize in hardware could potentially make a big difference. The first database machine was a thing called the IDM, from a company called Britton Lee. I just like this photo because the guy is wearing a suit and tie to work on a database, which is not how things work today. The IDM had custom co-processors that could do, I think, joins or sorting in the hardware itself. Since then there have been years and years of people building these machines — now they're called appliances — either custom hardware, or commodity hardware tuned exactly for the database, maybe sprinkling in custom ASICs or FPGAs that do some additional processing. Probably the best-known example is Oracle's Exadata — those things cost millions of dollars. Teradata is another one sold as an appliance. And people keep doing this: the latest is Yellowbrick, which has started building custom hardware for databases. Basically all the database machines of the 1980s failed because of Moore's law: by the time you built your custom hardware, got it fabbed, and put it into production, Intel had put out the next version of x86 and any gains you had basically got destroyed. Same thing with FPGAs — commodity systems kept getting faster, and unless you can offer it in the cloud, like on Amazon now, people don't want your specialized stuff. This one here, Netezza, which IBM bought — in the early 2000s they had FPGAs doing filtering on data, and IBM basically killed it, because nobody wants custom hardware. I actually think the next decade is going to be wild because of the limitations of x86. I don't think GPU databases are the thing, and I don't think FPGAs are the thing, but some of this RISC-V stuff looks super interesting, because you can start fabbing all sorts of specialized accelerators now that there's enough space on the die. But we can take that comment offline.
All right, anyway — whether you call it the Grace hash join or the partitioned hash join, it's the same idea. In the first phase we scan through R, apply our hash function, and build up the buckets; then we scan S and do the same thing with the same hash function, building up its buckets. Now we want to match the tuples that fall within the same bucket level. Since our hash function is deterministic — the same key always hashes to the same value — anything we're looking at in this bucket can't possibly have a match in a different bucket, because a matching key would have hashed to the same place; remember, we're assuming equi-joins. So once everything has been written out to buckets, we bring in the matching buckets from the inner table and the outer table together, and do a regular hash join between the two: take the outer table's bucket, build a hash table from it, and probe it with every tuple in the inner table's bucket.

Yes? His comment — I'll extrapolate a bit — is that the data may be highly skewed, so everything hashes to the same bucket, and then I'd have to do another round of partitioning. Yes: we recursively partition; we'll see that on the next slide. Right — if a bucket doesn't fit in memory, we have the same problem we had before: we've already done the first round of partitioning, we build the hash table, and we still end up with random I/O. The way to handle this is another round of partitioning: split the overflowed bucket into more buckets using an additional hash function, and make sure the inner table and the outer table both apply the same amount of recursive partitioning. Here's what it looks like. I do my first round and partition the outer relation, but bucket one is overflowing with a bunch of pages. So I take that bucket and apply a new hash function — it's still xxHash3, just with a different seed, so it produces different values — and hash it again into a new set of buckets at that level, keeping the existing ones. Then, when I do the partitioning on the inner table, the same hash function maps it out the same way, and I just have to recognize that anything that would have landed in bucket one in the first round gets hashed again into those additional partitions. I can do this recursively for as long as things keep overflowing — though in practice, two passes is enough. Yes — in the typical case, optimistically, this works: you can divide things enough to fit in memory within two passes. In the worst-case scenario, every tuple has ID equals one: they all hash to the same thing, and none of this helps. You're honestly better off falling back to the block nested loop join; there's nothing you can do here to make the hash approach work. So — my statement is that in that degenerate case, the block nested loop join is the best you can do. Your alternative is to bring in additional metadata to do what?
Yes. His suggestion is: instead of hashing on the key, hash on the record ID, and that way things get split up — well, split up within a key. But the problem is, if you do that, how do I handle the probe side? I don't know the record ID over there; if I already knew the matching record ID, I'd just do the join directly. The keys have to match. Yes. His next suggestion is: instead of hashing, which randomizes everything, could I do range partitioning — say keys zero to 100 go to this bucket, 101 to 200 go to that bucket? There are range indexes; some systems support them — Postgres does, and I think DB2 and Oracle do — but I don't know whether anyone uses them for joins. I've seen papers on range joins — research papers — but I don't know whether any real system actually does this. It's a good example of how hashing just works for almost everything, except the degenerate case where key cardinality is tiny. You're better off spending your time optimizing the hash path in your system, because that's the common case, than making these one-off range tricks work better.

Yes? His question is: am I assuming every bucket fits in memory, and if I have to fill multiple buckets at one level, do I just write the previous one out to disk? Yes. Yes. And how does the system decide, before it starts running the query, whether it will need recursive partitioning? It decides on the fly. Oh — so your question is, if you knew you'd need so many levels of recursive partitioning, could you decide up front that it's better to use sort-merge instead? Those are just cost estimations based on histograms, sketches, and other approximations. And — how do I say this — estimating this is super hard, especially deciding which table should be the inner versus the outer. Once you're joining the output of another join, you get this multiplicative effect: I get one estimation wrong, which makes the next one wrong, which makes the one after that wrong, and after one join it all falls apart.

All right, I want to quickly finish this up. For the partitioned hash join, the cost is 3(M + N): the partitioning phase costs 2(M + N), because it's one pass to read each table in and one pass to write it back out into buckets, and then — ignoring recursive partitioning — the probe phase reads everything back in once more, for another (M + N). Going back to our running example with the same number of pages as the sort-merge case: sort-merge got us to roughly half a second, and now the hash join is down to about 0.45 — again, ignoring recursive partitioning.

So, one quick optimization I want to bring up, because you might see it in the literature — the Wikipedia page on hash joins mentions it — is the hybrid hash join. The idea is that if I know my keys are skewed — not all identical, but a lot of them hash to the same value — then I do the same Grace hash join as before, but I designate some bucket level as the hot one that stays in memory, and I do the join on those tuples immediately. Say everything hot hashes to level zero here: I build the hash table for that bucket in memory, and I do my probes against it directly.
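Circling back for a moment, here's a sketch of how the recursive partitioning described above might extend the partition() helper from the earlier sketch. The threshold is a hypothetical stand-in for a real system's buffer-page budget, not anything a real engine exposes.

```python
MAX_BUCKET_TUPLES = 1024  # hypothetical stand-in for "this bucket fits in memory"

def partition_recursive(table, seed=0, depth=0, max_depth=2):
    # First round of partitioning, reusing partition() from the earlier sketch.
    buckets = partition(table, seed=seed)
    out = []
    for b in buckets:
        if len(b) > MAX_BUCKET_TUPLES and depth < max_depth:
            # Overflowed bucket: split it again with a new seed (same hash
            # function, different seed, so the values spread differently).
            # Both sides of the join must recurse on the same bucket
            # numbers with the same seeds for matching to stay correct.
            out.append(partition_recursive(b, seed + 1, depth + 1, max_depth))
        else:
            out.append(b)
    return out
```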
So, back to the hybrid: when I probe and see — oh, this key belongs to the special in-memory bucket — I do the probe right there and produce output immediately, while all the other buckets can spill to disk. It's a hybrid approach: you're doing the Grace hash join and spilling to disk, but keeping the hot bucket in memory. Getting this to work well is super hard, though — how big should the in-memory bucket be versus the others? I don't know how often this is actually done, but it definitely shows up in the textbook and in a bunch of other places, and I think the commercial systems do this more than, say, Postgres or MySQL.

All right, just to finish up quickly: here's a summary of all the join algorithms we talked about, and you can see now why the hash join is almost always going to be better — in practice it's just three passes through the data, assuming no recursive partitioning. Hashing is almost always better than sorting for join operations, and for aggregations as well. But if your data is already sorted, or the output needs to be sorted, the sort-merge join may be the better choice — and good systems like Postgres and the commercial ones will figure this out for you.

All right, so next class we'll talk about more query execution, and we'll cover project two at the recitation on Thursday night. Just quickly on project two: it's out, go do it. Two parts. Checkpoint one: the page layout, inserts, updates — single keys, and you've got to handle splits and removes; it doesn't need to be multi-threaded for now. Checkpoint two is making it multi-threaded, plus supporting an iterator for range scans along the bottom. You only have to go in one direction — ascending order across the leaf nodes — you don't have to support the opposite direction, which makes life a lot easier. Follow the textbook, start with small pages, and make sure you keep the root page ID updated. There's extra credit again if yours is fast. Next class, see ya.