Hey everybody, welcome back. Today we're gonna talk about join algorithms, and without further ado, let's just dive right in. So when we've been discussing these out-of-core algorithms, we've already been talking about the centrality of this idea of rendezvous, of making sure that two objects find each other in space and time, right? That they're both co-resident in memory at once. And so grouping and aggregation, like we talked about last time, averages and things like this, that's a form of rendezvous. You wanna make sure all the computer science students are together so you can compute their average GPA at the same time, okay? So it's a form of rendezvous from a single table. Join is the other main kind of rendezvous we'll talk about, which is when you have items that are in two different collections, two different tables: how do you make sure they get into memory at the same time so they can be brought together, okay? And if you think about it, this is kind of basic. This applies to any large data collection types, so they don't have to be relational tables; they could be anything that's a set or a list or a bag of things. If you need to batch things together from one such object, that's kind of like grouping and aggregation. And if you need to bring multiple such collections together and consider their items in tandem, that's gonna be something like join. So people sometimes accuse database folks of being obsessed with joins, but joins are actually pretty fundamental to computing. Producer-consumer relationships in any system are really a form of join of streams of information, right? When you're talking about instant messaging or any kind of networking protocol, it's really a form of join. You're just bringing together information from producers and consumers so that the producer's message gets handed to a computation where the consumer can take it away and put it somewhere.
So this is a very fundamental operation, the join. We're gonna talk about it in this class, obviously, in the context of databases, and particularly in the short term in the context of relational databases, but all these techniques we're gonna learn, you should keep in mind for any other kind of scenario you're in where you have to join things up, put things together from multiple collections. Okay, here's the simplest, most exhaustive form of join. It's the Cartesian product, or the cross product. I give you two collections, R and S, and I want you to form all pairs (r, s) of items where r is in R and s is in S, right? It's often called the Cartesian product, and it should be easy for you to see why, based on your high school geometry class or something like that. Think about all the rows of R being listed along this axis and all the rows of S being listed along this axis. The pairs (r, s) form the Cartesian space in here, right? First row of R with second row of S, et cetera. So it's not surprising that this is called the Cartesian product, because it forms the Cartesian plane of tuple pairs, okay? We often just call it the cross product. If I tell you that R has N records in it and S has N records in it, how big is the Cartesian product? N squared, right? This, again, should be high school stuff or even earlier, okay? Good, keep that in mind. So N squared is the biggest join result, so to speak, that we can make out of two relations: big O of N squared. We'll get back to that later on. A more typical thing you do, though, when you're doing queries is what's often called the theta join, written R, then that little bow tie symbol — the relational join symbol — then S. And the subscript theta says find all pairs (r, s) where theta of r and s is true. So theta is some predicate, which is to say some Boolean-valued function that compares R tuples and S tuples.
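In plain Python, over two little lists — a toy sketch, not how a database would actually do it — the cross product and a theta join look like this (the fields `i` and `j` are made-up names standing in for the join columns):

```python
from itertools import product

# Two tiny collections with hypothetical join fields.
R = [{"i": 1}, {"i": 2}, {"i": 3}]
S = [{"j": 1}, {"j": 3}]

# Cartesian (cross) product: all |R| x |S| pairs.
cross = [(r, s) for r, s in product(R, S)]

# Theta join: keep only the pairs where the predicate theta holds.
# An equality theta like this one makes it an equijoin.
theta = lambda r, s: r["i"] == s["j"]
joined = [(r, s) for r, s in product(R, S) if theta(r, s)]
```

Note that `cross` has 3 × 2 = 6 pairs, the full quadratic space, while the theta join keeps only the matching ones.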
So we might want to join up, say, friend requests with users, where maybe the friend request table has a column of people I have a crush on, okay? That's the crush ID column, and users has an ID column, and the predicate is to compare friend_requests.crush_id with users.id. Note that you can form the cross product of a relation with itself. Just think about that relation existing on both these axes. So here's the people in my family, and here's the people in my family again, all right? Mom, I'm not gonna write this down. There's five people in my family. I'll list them on both axes, and I wanna mark the places in this grid where one family member is younger than the other, okay? So the predicate here is age is less than age. So let's say that this is the left-hand family and this is the right-hand family in that expression, right? We're just comparing family members with themselves. We're treating it as if it's two different sets, but it's really just this one set used twice, all right? A more typical example of that in your database textbook would be: find all employees and their managers. So imagine that the employee table has a field called manager ID and a field called employee ID. I'll write this down, actually. We're saying SELECT — for now we'll say star — FROM, and here we're gonna see something we haven't seen before: FROM the employee table E and another version of the employee table, which we'll call M, for manager. And the theta that we're gonna write down in the WHERE clause is E.manager_id, which is a column in the employee table, equals the manager's ID, all right? So imagine that there's a column for IDs and there's a column for manager IDs, all right? So each employee could play multiple roles, because you could be a manager of people and you could have a manager as well, that's fine, okay? And oftentimes people want to add one thing to this, which is AND E.salary is greater than M.salary. Isn't that fun?
Find people who make more than their manager, okay? So that's using the one relation twice, good? All right, that's a very common kind of theta, and note that the theta here is actually a Boolean expression with two clauses, right? There's this equality thing and this greater-than. So this was kind of a fancy one, but a very typical theta is just some kind of equality lookup, okay? And when the theta's an equality lookup, we'll call that an equijoin, all right? Equijoins are easy to do, so they're a special case we'll want to focus on. They happen all the time, and they're easier to process, they're cheaper to process. And a special case of an equijoin is when one side of that equijoin is a key, that is to say, the unique identifier of one of the two tables. So here's an example. We're gonna have a table of students, all right? With, like, student ID, name, and, I don't know, GPA. And then we're gonna have a table called enrolled, which tracks which students are enrolled in which classes. And this is gonna have a student ID and a course ID and maybe a semester and year, okay? So we've got students, we've got enrolled, and we know that in the student table there can't be two people with the same student ID, so we'll just underline that for now. That's called a key of the student table. Student ID uniquely determines what row of the table we're talking about, okay? So you can't have duplicate student IDs in this table. You can have duplicate student IDs here, because each student could enroll in many courses, right? But one side of this join, namely students.id, we know that's a key. So in essence, a way to think about this join is you go down the enrolled table, and for each enrollment record you look up the student associated with that ID, and it's just a lookup into this table.
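That students/enrolled equijoin can be written out in SQL; here's a toy sketch via Python's sqlite3 — the table and column names are made up to match the example:

```python
import sqlite3

# Hypothetical tables matching the students/enrolled example;
# students.id is the key (PRIMARY KEY), so the equijoin below is
# effectively a lookup into students for each enrolled row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, gpa REAL);
    CREATE TABLE enrolled (sid INTEGER, cid TEXT, semester TEXT);
    INSERT INTO students VALUES (1, 'Alice', 3.9), (2, 'Bob', 3.5);
    INSERT INTO enrolled VALUES
        (1, 'CS186', 'Sp'), (1, 'CS162', 'Sp'), (2, 'CS186', 'Sp');
""")
rows = conn.execute("""
    SELECT s.name, e.cid
    FROM enrolled e JOIN students s ON e.sid = s.id
""").fetchall()
```

Each enrolled row matches exactly one student, so the output has one row per enrollment.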
So this is like a lookup table for that join, because we're just looking things up by their unique ID. Okay, now that's not generally true. The general case of a join is the full Cartesian product, right? But in this case there's exactly one match for every enrolled tuple in the student table. And that happens pretty often, right? It's just an observation, really. Okay, so the examples that the book likes to use are these boat reservations. So there's this boating club, and there are sailors, and there are boats, and they have reservations. And so I'm gonna stick with those for a bunch of this so that we're consistent with the book, okay? For our various join algorithms and examples, we're gonna wanna talk about how much work we're gonna do in terms of IO requests to process joins with different algorithms, all right? And for much of this class, our cost metric of choice will be the number of IOs that we issue to the IO subsystem. Fairly obvious reason for that: IO is way more expensive, even with flash, than RAM access, right? And so if it's your dominant cost, at least by a wet-finger estimate, the cost of your algorithms will be the number of IO requests you issue. Recall from two lectures ago that sequential IOs on magnetic disks are much faster than random IOs. So in some cases, we might like to distinguish sequential and random IOs. Today, for our analyses, we'll just worry about the number of IOs and not worry about whether they're random or sequential, okay? All right, so here's our schema for our examples. There are sailors. The key of the sailors table is the sailor ID. Every sailor has a name and a rating, which is, I guess, how good a sailor they are, and an age, which in this slide is typed real — REAL is SQL's approximate numeric type, much like FLOAT. The reserves table, much like the enrolled table here, pairs up sailor IDs, boat IDs, and days.
And notice that there's a line underneath all three of those. So the key of the reserves relation is (sailor ID, boat ID, day), which means that a given sailor can only reserve a given boat once per day, right? When you know those three things, then you get exactly one row out of the reserves table. Sailor ID, boat ID, and day, right? And there's a reservation name, which is a string so we can pull up that reservation later on, I guess. Or, I don't know, it describes the reservation. It's not important, but there's an extra attribute. All right, here's the stuff that's gonna affect our costs. In the sailors table, every tuple's 50 bytes long. So we're gonna assume that we have a fixed amount of space for every tuple in this example. That wouldn't be true in general — lots of times you have variable-length tuples — but for now let's assume it's fixed. We're gonna have tiny little 4K pages, because the book is old. And so with 50-byte tuples, that's 80 tuples per 4K page, okay? And we'll have 500 pages. And the way we'll notate this is this bracket notation: brackets around the S, [S] = 500, means there's 500 pages. Geez, I thought this notation was on the next slide — I meant to introduce this notation somewhere. There it is. We already did it, but let me show it to you and then we'll go back. So the notation we're gonna use: brackets around a relation name is the number of pages used to store that relation, where a page is the unit of transfer from disk, right? The IO transfer unit. And in the case of the examples in the book, that's 4K, four kilobytes. Typically today you'd configure it more like 64 kilobytes, but that's a minor detail. Little p sub R is the number of records per page of R. The traditional vertical bars around R is the number of records in R. It's the cardinality of the set. So R is a set of tuples, and |R| is the cardinality of that set. It's the number of rows in the table, okay?
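In code, the notation boils down to a couple of multiplications — here with the numbers from the book's running example (assumed, since they come up on the next slide):

```python
# [R] = pages of R, p_R = records per page, |R| = p_R * [R] records.
pages_S, p_S = 500, 80      # sailors:  [S] = 500, p_S = 80
pages_R, p_R = 1000, 100    # reserves: [R] = 1000, p_R = 100

card_S = p_S * pages_S      # |S| = 40,000 sailors
card_R = p_R * pages_R      # |R| = 100,000 reservations
```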
And note that the number of tuples per page times the number of pages equals the total number of tuples. Yeah, so, simple stuff, just some definitions. Going back to where we were — apologies for the detour. In sailors, there's 500 pages, so we say [S] = 500, and 80 tuples per page, so p sub S equals 80. In reserves, tuples are a little smaller, so you can get a few more of them on a page. There's 1,000 pages, though. As you would expect, there's more reservations than there are sailors, because each sailor's gonna reserve boats more than once. And that's the notation. Okay, so here's an example of a join query at the top. We already put some examples on the board. In fact, this example is almost identical to that example — it's just got two different tables. So I'm pairing up sailors and their reservations. So find all pairs of reservations and sailors such that the reservation's sailor ID is the sailor's sailor ID. Joins like this are very common. The cross product would be really big, right? Its size is the number of tuples in R times the number of tuples in S, which is a big number. It's something on the order of N squared — it's quadratic. We don't like to do quadratic things that go to disk. That's bad. If we can avoid doing that, that would be good. Now, one way to process a join, always, is you blow up the entire Cartesian space, you generate all pairs, and then you throw away the pairs that don't match the WHERE clause. Works, right? But it has this N-squared cost. Okay, so we'd like to do something better to reduce that cost, and we'll see a number of techniques to reduce it, for four kinds of algorithms: nested loops join, index nested loops join, sort-merge, and hash. The remainder of the lecture is gonna be to go through these algorithms and talk about their costs. Questions up to now? All right. So we did this, that's our cost notation. Let's look at the simplest algorithm we could think of.
It's called nested loops join. It's the kind of algorithm you would get if you took a programming language like Python, took its collection types, and did what they tell you to do with them, which is a nested iteration over the two collections, right? You'd say, roughly: for each item in collection one, for each item in collection two, check if the pair passes the theta test, and if so, add them to the output, or emit them, or whatever, right? That's just the standard programming-language thing to do. It's called nested loops because there's a nested loop in the algorithm, yeah? Easy enough. Well, what is the cost of this thing if it's on disk? Well, let's see. For each record in R — so how many records are there in R? There's p sub R records per page (I put it in red because it's gonna cost us a lot, right?), and then there's [R] pages. So the total number of records in R is p sub R times [R]. And for every record in R, we're gonna do a full scan of S, right? That second loop, the inner loop, is a full scan of S for every record in R. So for every record in R, we do a full scan of the S table on disk, right? Oh, by the way, don't forget we have to pay the cost of scanning R as well. We're gonna scan R once, and how many times are we gonna scan S? Once per record of R, right? And in our example, that's 100 times 1,000 times 500, plus 1,000 IOs. You can do the arithmetic — it takes a long time, even in this dinky little example. All right, so this is a pretty bad algorithm. In fact, how many times do we invoke that third line, the inner loop of this algorithm? Roughly speaking? Yeah, the number of rows in R times the number of rows in S. That's how many times we call the theta test, which is to say we really generate every point in this entire rectangle with nested loops join. We visit every single one of them: for every row in R, for every row in S. So we go like this, basically.
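As a sketch, that naive tuple-at-a-time loop might look like this in Python — a toy model where "pages" are just lists of tuples and the IO count is simulated, not real disk access:

```python
def nested_loops_join(R_pages, S_pages, theta):
    """Tuple-at-a-time nested loops join over paged relations."""
    ios = 0
    out = []
    for r_page in R_pages:            # one IO per page of R
        ios += 1
        for r in r_page:              # for EACH tuple of R...
            for s_page in S_pages:    # ...rescan all of S
                ios += 1
                for s in s_page:
                    if theta(r, s):
                        out.append((r, s))
    return out, ios

# Two tiny "tables", 2 tuples per page of R.
R = [[(1, 'a'), (2, 'b')], [(3, 'c'), (4, 'd')]]
S = [[(1, 'x'), (3, 'y')]]
result, ios = nested_loops_join(R, S, lambda r, s: r[0] == s[0])
# Cost: [R] + p_R*[R]*[S] = 2 + 2*2*1 = 6 IOs
```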
Visiting these things like this. We visit every point and we decide if the theta's true. So it's an N-squared algorithm, if R and S are both size N, right? Pretty bad algorithm. So would it be better if we made the smaller relation the outer one — say, if S is really small? Is it better to do R as the outer loop and S as the inner loop, or the other way around? In this case, R was the bigger relation, I believe. Would it be better the other way around? Smaller one should be on the outside. Let's think about it — I think what you said is true, I'm just convincing myself. Let's assume that the little guy, which in this case is S, is here, and the big guy, R, is on a different disk here. With S as the outer, we're gonna scan S once and we're gonna scan R lots of times, right? Once per row of S, versus the other way around. So what you said was that we'll move this disk arm here only once, one pass, which is true, and we're passing over R lots of times. But we should be able to see this in the expression, right? Just look at the arithmetic. In the multiplicative term, multiplication is commutative, so swapping which relation is R and which is S mostly doesn't matter there. But the additive term is just one scan of the outer relation, so that additive term goes down a little bit if we pick the smaller relation as the outer. So it's better to put the small guy on the outside, but it's kind of in the noise, because of the N-squared behavior going on in the multiplicative piece of this equation. Yeah? Question? Good. That's true, right?
So note that the expensive term is not just determined by [R] and [S], but also the number of tuples per page in R. So if you had a relation where the tuples were really teeny, even if it was small, it would send you running across the other relation many times. And in the other relation, maybe the tuples are big, so there's fewer of them. So that's an excellent point. When you turn this around, that factor doesn't stay p sub R — it becomes p sub S, and that can have a more significant effect. Good. Excellent. The short answer here is you just write down the equations for both orders, you pick the cheaper one, and that's the right order to do them in, right? But thinking about the algorithm is helpful here. Good. Okay. What assumptions have we made in this equation? I alluded to some of them before. Yeah? I think it's an awesome question. Okay — that's an awesome question, it's warming my heart. Think about this picture. We're gonna say for every row of R, we're gonna scan S, right? So think about the way we traverse the Cartesian space. Let's say R and S are, for now, on two different disk drives, which would be nice if it were true — it's sort of ideal. So there's two disk heads. S's disk head is gonna go from the beginning to the end, and then start over back at the beginning and go to the end. So there's a little seek at the end of each of these rows to start over, right? And everything else is sequential. The R disk head is gonna just go sequentially with no seeks, right? So that's good. You could maybe optimize this a little bit by making this guy alternate directions on S. Go that way — oh, now go back the other way. Then you don't have any seeks; you just do kind of an elevator on S. And that's fine, because none of this is ordered, right? In relations, we don't care what order things happen in. So it would be just fine to scan S forwards for the first tuple of R, and backwards for the second tuple of R.
That would be just fine, okay? Actually, you guys certainly have never seen these, but old printers used to work like that. They would print a line forwards, and then the paper would advance, and they would print a line backwards, because it was too time-consuming to move the little head back. Anyway, fun facts from the 70s. Okay, but the truth is, though, that if these are big tables, these seek costs get dwarfed by the time it takes to scan, whatever, 100 megabytes. Yeah — no, but you can change the direction that the head moves in and out of the platter. And if you have a track buffer, then it really doesn't matter what order the track comes in. The cost is more in terms of the disk arm movement, and you can sweep that in and out. Yeah, awesome question. So the question — to repeat it for those in the back — was: when we're talking about the systems in this class, how much control are we assuming we have over these disk drives and IO interfaces and so on? For one thing, isn't the operating system supposed to do that? And also, even if we wanted to do it, would the operating system allow us to do it? And it's worse than that, actually. The ASICs on the disk drives are also doing logic — actually significant logic these days. So it's a great question, and it's a question that has bedeviled designers of databases forever, because oftentimes, for a particular query, you know exactly what IO pattern you want, and you're like, cool, I'm just gonna access the blocks in order, so I'm gonna be good — but the operating system abstractions are actually lying to you, and it's not doing things in the order you think at all. And so when we get to transaction processing, this is gonna drive us bananas, because we're gonna do things like: okay, we just wrote out all the disk blocks for this transaction, we can tell the user his data is safe — but the operating system actually lied and never wrote it to the disk at all.
And then we promised the user something, but actually the operating system didn't do what it said, and we get screwed up. So this caused people, way back when, to circumvent the operating system entirely and try to get direct access to the device hardware interfaces and stuff like that. These days, modern operating systems allow you to get that control if you set the right flags and do the right things. So at least in terms of making sure that writes really go to disk, for example, we're gonna assume we can do that. In terms of making sure that the behavior of the disk head is what we expect, we're gonna count on probabilities, essentially. If you take a disk and you fill it with one gigantic file — you take UNIX or Windows or something like that and you say, I wanna have a file this big in there — the block numbers in that file are typically laid out sequentially by the operating system when you first allocate on an empty volume, and things will work out pretty well. Now, occasionally the operating system might see a dead block and move something somewhere else and screw you over, but usually it's not that bad. So that's kind of the state of the world. If you're, like, a super crazy high-performance person — like you're trying to win a benchmark and you're in the Oracle benchmarking group — then you might actually go to those raw IO devices. And people still do that. It's not worth the trouble for normal folks; no one in production does that stuff. And many systems don't even support it anymore, but a lot of the systems do, simply to win benchmarks. Okay, long answer to a short question. Great question. And I will reiterate this when we get back to transactions: how the operating system might be getting in our way and how we have to subvert it. Okay. Operating systems — well, thematically, and then we'll move on. One of the themes of the divide between operating system designers and database designers is that operating system designers are generalists.
They wanna build a thing that will run any kind of program you might wanna write. And as a result, they tend to provide very low-level interfaces that are very universal — like memory access, virtual memory, or disk block access, the file cache, right? And then they try to guarantee that things will be right most of the time, and not knowing anything else about the workloads, they'll do a pretty good job. So they have this kind of average-case performance across a wide range of workloads. Database people, traditionally, get a request that's in a super high-level language, and they know exactly what they're trying to do. And then they might decide to translate this to a particular, let's say, join algorithm, because that's the fastest join algorithm for this particular data set for this particular query, okay? And they did all that modeling of these costs, the way we're doing right now, to choose this join algorithm based on some assumptions about performance. They want those assumptions to be true. And so the control of the system kind of comes all the way down, from this high-level language all the way down to performance. And they wanna own that stack for their specific use case of the database. And it leads to a kind of different design philosophy. It's this more top-down design in the database community, whereas operating systems we often describe as bottom-up design. And if you take the graduate systems class, CS262, we talk about this a lot at the beginning. We read the classic UNIX paper and we read the classic relational database research paper, System R. And you can see this tension at play. It's pretty cool. And it affects kind of all these designs. But we're gonna be kind of control freaks, a little bit, and assume we can control our hardware, okay? Okay, now what's the cost of this algorithm, in terms of IOs, if the smaller table fits entirely in memory? Say S fits entirely in memory. Then what's the IO cost?
Well, we're gonna scan R through the course of that outer loop once. We're gonna scan S once, the first time through that inner loop, for the first tuple of R. And lo and behold, it's just gonna get cached in memory. So for the second tuple of R, it's already gonna be in memory. So we'll actually scan S only once, for that first tuple of R, to bring it resident, and then it'll just stay in memory. So it's actually just the blocks of R plus the blocks of S, right, if S fits in memory. So that's good. So if you have a small relation on your inner loop, nested loops is not such a bad algorithm from an IO perspective, okay? All right, we can do better, though. Here's something that's called page-oriented nested loops. Let's try to get rid of that factor that the gentleman over there pointed out was bedeviling us, that p sub R. Why are we doing the scan of S for every tuple of R? That seems crazy. Instead, let's bring in an entire page of R — call it b sub R, b for block. Pages, blocks: they're interchangeable words, unfortunately, in database parlance. For each page b sub R in R, we grab a whole page full of R tuples, and then we're gonna scan S, okay? We're gonna scan S one block at a time. So you get a page of R, and then you go get the first page of S. And then for every record on this page, for every record on that page, check the cross product, and then move this guy over. All right, so it's kind of a chunky version of this picture, where instead of reading individual rows of R, we're gonna read a whole block of rows of R, all right? And then we're gonna scan S one block at a time. And within each one of these little blocks, we're essentially doing the old-fashioned Cartesian product algorithm in memory — we're going, ah, ah, ah, ah. Get it? That's what that says.
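A sketch of that page-at-a-time variant, in the same toy setup as before (pages as lists of tuples, simulated IO counting — not a real buffer manager):

```python
def page_nested_loops_join(R_pages, S_pages, theta):
    """Page-oriented nested loops join over paged relations."""
    ios = 0
    out = []
    for r_page in R_pages:            # one IO per page of R
        ios += 1
        for s_page in S_pages:        # rescan S once per PAGE of R
            ios += 1
            for r in r_page:          # in-memory cross product of the
                for s in s_page:      # two resident pages
                    if theta(r, s):
                        out.append((r, s))
    return out, ios

R = [[(1, 'a'), (2, 'b')], [(3, 'c'), (4, 'd')]]
S = [[(1, 'x'), (3, 'y')]]
result, ios = page_nested_loops_join(R, S, lambda r, s: r[0] == s[0])
# Cost: [R] + [R]*[S] = 2 + 2*1 = 4 IOs (vs. 6 tuple-at-a-time)
```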
Okay, and this way, we read R once, a page at a time, and we read S once per page of R, not once per row of R, right? The number of passes this way — it's like one big fat arrow across S for the first page of R, and then another big fat arrow for the second page of R, right? Okay, so in this case, it's blocks of R times blocks of S — oh, don't forget, we still have to pay for that scan of R, plus the scan of R — which in this case is 501,000, and if you do it the other way around, it's 500,500. So this is way better than what we saw before. We took off a factor of p sub R — roughly 100 tuples per page. Anytime you can get a factor-of-100 performance improvement, that's good. Or, another way of saying it is, what you were doing before was really stupid. Okay, so we just won a factor of 100, that's great. Let's keep going. What could we do that would be better than this? Well, we wanna make fewer passes over S, so what should we do? How about reading more of R in at a time — let's make these arrows taller. So instead of reading one page of R, let's read many pages of R and get a really fat arrow going across S. We've got the memory for it. Okay, so in the book this is called block nested loops, but that kind of is a disaster, because I just told you that blocks are the same thing as pages. So don't use the word block — to avoid any confusion, I would call it chunk nested loops join. We're gonna read a big chunk of R tuples into memory, all right? And that's gonna be, like, B minus two pages, for some memory allocation of B pages. We're gonna use B minus two of our available pages to read in a big swath of R, okay? And then what we're gonna do is we're gonna scan S in our typical streaming-algorithm fashion, right?
So you read in — sorry, let me start over. Read in B minus two pages of R. Read in one page of S. For every tuple in the B minus two pages of R, loop through the tuples in that page of S. It's this picture all over again, just with B minus two pages of R at a time, okay? So for every tuple in that buffer of R, we read every tuple in that first page of S, and then we read another page of S and we do the same thing, okay? Any questions on that? Was that clear? All right, so in this case, the cost is: you're gonna scan the outer, and we're gonna divide the outer into these B-minus-two-page chunks, and that's how many times we'll pass through S, right? So it's the cost of scanning the outer, which is R in brackets, plus the number of outer chunks — that number again, divided by the chunk size, B minus two — times the cost of scanning the inner relation, [S]. Okay, so we just saved ourselves roughly a factor of B minus two, and if you have a lot of memory, that's big. Again, maybe you have, what, 100 pages worth of memory — at 64K that's nothing. So we just divided by at least 100, maybe bigger, depending on how much memory you wanna allocate. So this is a big win — again, orders of magnitude better than what we just saw on the previous page. Okay, so we're shaving orders of magnitude off this thing. So if we had, say, 100 pages for the chunk, we just shaved off a factor of 100, and that's gonna give us 1,000 plus 5,000, which is 6,000 IOs, whereas previously we had 501,000. That feels good, right? 501,000 down to 6,000 — it's about a factor of 100. We expect a factor of 100, because we're dividing this by 100. That's good, all right, cool.
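Those numbers are easy to sanity-check in code; B = 102 buffer pages is an assumption here, so the chunk size B − 2 comes out to 100:

```python
from math import ceil

def chunk_nlj_cost(outer_pages, inner_pages, B):
    # [outer] + ceil([outer] / (B - 2)) * [inner]
    return outer_pages + ceil(outer_pages / (B - 2)) * inner_pages

B = 102                                  # so B - 2 = 100 chunk pages
print(chunk_nlj_cost(1000, 500, B))      # reserves as outer: 6,000 IOs
print(chunk_nlj_cost(500, 1000, B))      # sailors as outer:  5,500 IOs
```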
If you flip it around, scanning S costs 500 IOs; that's five chunks, and per chunk of S we scan R, so that's 5,000 IOs — 5,500 total. So it's actually a little bit better, but not by much. All right — any questions on nested loops joins before we go on? Cool, okay. I'm gonna do one more, and then we're gonna take a brain break and talk about something a little different. The next thing I wanna do, which seems natural, is: couldn't we use some kind of data structure, like an index, on one of the tables so that we don't scan this whole Cartesian space? You see there's a plug back there, there's a plug on that wall — like anyone's gonna plug something in back there. That's insane. There's ethernet ports too. Cool. All right. It's crazy that we're scanning this whole space for an equality join. It just doesn't make any sense, right? If I told you to look up something in the dictionary — better yet, look up 100 words in the dictionary, which is a join of 100 words against the dictionary — would you say, okay, word number one: A, apple, right? And start paging through the dictionary? That's insane. The dictionary's sorted already. You take that first word you're supposed to look up and you do binary search in the dictionary till you find the word, right? You don't scan the dictionary, that's insane. And even the other way around might be annoying. If I gave you not 100 but, like, 10,000 words, it'd still be annoying to page through all 10,000 — it's stupid. Why don't we build an index on the thing we look up in all the time, like the dictionary? It's worth taking the time to sort that damn thing and print it in order, right? So we're gonna take that idea here, and we're gonna do the same thing. So this is plain old nested loops join, except that we added a little clause: where r.i equals s.j. So that was our theta. And we're gonna handle it on the inner loop by doing a lookup in an index.
So instead of scanning through S, we're gonna do a look up in an index on S and then say for everything that matches this look up, I'm gonna do a nested loop. But only for the things that match. There may be more than one thing that matches. That's the thing to keep in mind. If you take, I don't know, what's an example of two things where there's multiple matches? You take students in this class and, I don't know, I don't know what's a good example. I mean, you can just take this example, right? If you scan students as the outer loop, there'll be multiple matches in enrolled as the inner loop, because each student takes many courses. So you say for each student, look up on an index what they're enrolled in, and for one of you guys that might bring back five things. So we need to combine this student with all five of those things, and that's the inner loop for each tuple that matches on the look up, okay? So you get your r, you do a look up in some data structure to get the matching s's, and then you go through the matching s's and you form r-s pairs, and then you do it again for another r. And oh, what's the cost of this? Let's see, you gotta scan R, and then for each tuple in R, that's R in brackets times p sub R tuples, we pay the cost of doing a look up in the index. Now, the cost of doing a look up in the index I will postpone until we talk about how to do indexes, all right, which is coming in a few weeks, but it's some logarithmic cost in the size of the data, blah, blah, blah, right? I mean, we know tree-shaped data structures give you log factors in performance. I'll give you the details later, but we're gonna do a look up in that index, so that's pretty much the cost equation, all right?
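That cost equation can be sketched numerically. This is a hedged illustration, not the lecture's exact slide: the function and its parameters (page count, tuples per page, per-lookup IO cost) are names I'm inventing, and the lookup cost is left as a parameter since the lecture fills in real numbers for tree indexes later.

```python
# Index nested loops join cost sketch, in IOs:
# scan R once, then one index lookup per tuple of R.
def index_nlj_cost(pages_r: int, tuples_per_page_r: int, lookup_ios: float) -> float:
    # [R]  +  ([R] * p_R tuples) * (cost of one index lookup on S)
    return pages_r + pages_r * tuples_per_page_r * lookup_ios
```

Note that with many tuples per page this can be worse than block nested loops on total IOs, but each lookup touches only matching tuples, which is a huge win when the join is selective.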
And if we make some conservative hypotheses about how tall these indexes are, they're typically really short in a database system, they're like three high, it's like three IOs to go from the top of the index to the bottom, because they're gonna be really fat, really short indexes. This is gonna be real cheap, okay? So think about a number like three or four being in that yellow box, IOs. They're random IOs though, so you're walking a tree usually, all right? Okay, so I wanna take a little pause. We've done all the nested loops flavors of algorithms for joining. All right, and at this point, maybe we'll do a couple of announcements and then we'll come back and talk about some architecture stuff and then do the other join algorithms. So, Anthony, I think you wanted to make an announcement. I just made a post on Piazza, but vitamin one or quiz one is now posted and it's due on Monday, end of day, that's 2/2. And then we're also planning to host a Scala workshop, but we wanna see what your availability is. There's a poll also on Piazza and you can respond in the next two days. We'll try to schedule something that maximizes usefulness to as many people as possible. Okay, I guess I could put a little context around both those announcements. So vitamins, the idea is we wanna have quizzes every week. They're gonna be easy. You should get 100% on your quizzes if you, like, watch the lectures and even maybe read the book a little bit. The questions will not be hard, but the goal is that we don't get to a point in the semester where you're like, wow, we learned 700 things and I can only remember 11 of them. What were the other things? That used to happen in this class fairly often: the homeworks can't exercise all the material, you get to the exam and you realize there's all this material you haven't thought about at all, okay?
So this is gonna get you to at least kinda stay on track, and if you're anything like I was in college, if you don't have an actual assignment, then you don't do the work. That was very like lazy evaluation on learning. So I'm just gonna assume you guys are the same. Even if you're not, that's great. Then this will be no problem, but these vitamins, you gotta do them every week. If you don't, it will affect your grade. So that's your incentive, little stick, but the good part is they're not hard. Do not collaborate on them. They're not like projects. They're just like go read the book and figure out the answer and check the box. They're all gonna be online. They shouldn't take a lot of time, but just you gotta do them on time. So that's vitamins. They're good for you, okay? Yeah, so they used to be called bunnies because they were cuddly and cozy and the TA who called them bunnies is gone and I can't call them bunnies. We're gonna call them vitamins. All right. Homework two is being passed out tonight-ish, actually tomorrow. It's gonna be passed out tomorrow in kinda beta. It might take a couple days to kinda settle down, but it's gonna be passed out and available tomorrow. It will require you to code in Scala, hence the Scala programming session if you want. Now my attitude with programming languages is, you're upper division Berkeley students, like learning a new language is just not a big deal. Scala is a Java variant. You've programmed lots in Java. The code base you're hacking on is a Scala code base. You'll figure it out. Instead of typing class, you're gonna type, well actually class, you type the same word, but there's gonna be some syntax translation stuff and then if you want to learn Scala, it's a much nicer language than Java and you'll have the opportunity to try to program in a Scala style. If you don't wanna learn Scala, you're just gonna have to kinda check your keywords and then it'll mostly look like Java. Okay. So don't freak out about it.
I made people learn Ruby a couple years ago. It was no big deal. Scala's gonna be the same thing, but you might wanna go to the info session. We're gonna go through one of the online tutorials. I'm not gonna go through it, but I think Vikram is. And just kinda step you through the process of going through the online tutorials and getting used to it and answer questions. Okay. So that's the context on those two things. And the reason we're using Scala, just to be clear, is because you guys are gonna be hacking on a real system called Spark, which is in wide use in industry. Actually, it's probably the most widely adopted Berkeley open source project ever. Maybe except for BSD Unix. Although Linux kinda beat BSD Unix over the years, but very widely used. Written right here in Soda Hall by a bunch of PhD students. It's a pretty interesting system. It's kind of growing up. It's maybe two, three years old, so it's not so mature that it's got everything in it already and there's nothing for you to add. There's lots of stuff you could add to it, and we will in our projects. But it's a real system that's really being deployed in industry and it's kinda cutting edge. So it'll be fun, I think, for you guys to get your eyeballs on it. It'll give you some real world experience for your resume. Hence the Scala. Okay. Any other logistics questions before we go on? That aren't piazza-able? Cool. All right, I wanna clarify an architecture point from last time that some folks came up after class and I just wanna make sure it's eminently clear to everybody what I was talking about. So we're gonna go back to our database internals and how actually do we implement these algorithms as iterators? And what does that really mean in terms of the code and the way it gets executed? So here's the thing to think about. Remember we said there might be something like file scan iterator that's feeding tuples, so to speak, into a sort iterator. 
But really what's going on here is that sort says next and file scan says return, right? All right, so let's think about what's going on. What's really happening, really, is you can imagine there's a sort of meta iterator up here called main, all right? Just like in C or in Java. This is the main thread of control of the program. It is a single threaded thing. All right, so there's one thread, that's a picture of a thread, for this whole stack. And this is going to be like thread per query. Think about it that way. When you connect to the database and you issue an SQL query, you get a thread. And at the top of that thread is a main. And what comes under the main are the iterators for your query. Okay, so what's going to happen is main is going to say init and this guy who's sort is going to say init and this guy who's a file scan is going to do that Unix POSIX open file, blah, blah, blah, and it's going to go back and say return. And then what's sort going to do? It's actually not going to return from the init yet because if you remember the sorting iterator, it's first going to generate all its runs. So the next thing sort's going to do is it's going to say next. So this guy, it still hasn't returned from that init. It's going to say next. This guy's going to return a tuple. It's going to say next. It's going to return a tuple. And this guy in the meanwhile is generating these sorted partitions on the disk, right? And that's all happening with function call return, function call return, function call return. When these partitions are completely done, all right, this guy returned his last tuple. He's all done. This guy can say actually close. I am finished with you, File. I never want to see you again. And this guy says cool, I'm closed. And then this guy can say return from init. And now this guy's going to want to have tuples to give to the output, right? Whatever the output is. So he's going to say cool, thanks for initting. That took a long time, but whatever. 
Next, and it's going to start getting tuples. And then this guy's going to start returning them by merging these partitions together, right? All one thread of control. At any given time, the iterator that has control is the iterator that's kind of generating data on a next call. If you say next, then when the data is returned, control goes back toward main. And sort of control flow of who's running is determined by the function stack, right? And passing of data is done through return on the stack. It's all single thread. Any confusion, any questions about that? Yeah. Oh, so remember, this sort put this on disk just because it's sort. But let's go through the example you just described. We have file scan. We have a very simple operation which MapReduce people would call map. I will call it map. Why not? I think in database land, it would probably be a project operator actually, or a select operator, but that's okay. We'll call it map. And main, okay? So what's gonna happen here is it's gonna say init, init, return, return. Because maps don't have any state, so they don't do anything on init. And then it's gonna say next. Map is gonna say next. File scan's gonna say have a record, thank you. The map is gonna apply its map function, say have a record, thank you. And it's done. And then the whole business is gonna happen again. But at any given time, there's really only one tuple in play, right? So in terms of data cache locality, this is just fine, right? Now you're moving up and down the function stack, which maybe isn't so good. So if you really wanna be fancy, you can build your query processing system to have batches of tuples going through these iterators, and they have a little gimme-a-bundle-of-tuples-at-a-time kind of logic. And you can do that. It's a minor modification to what you're gonna learn in this class. Okay, that'll mean you'll get a little more instruction cache locality so you'll handle a whole bunch of tuples in a single iterator.
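The single-threaded init/next/close flow traced above can be sketched in a few lines. This is a toy illustration, not any real system's interface; the class names and the convention of returning None at end-of-input are assumptions for the sketch. Note how Sort, like the lecture's sort iterator, consumes its entire child during init.

```python
# Minimal iterator-model sketch: main drives everything with one
# thread of control; data flows up the stack via return values.
class FileScan:
    def __init__(self, rows):
        self.rows = rows
    def init(self):
        self.pos = 0                      # "open the file"
    def next(self):
        if self.pos >= len(self.rows):
            return None                   # end of input
        row = self.rows[self.pos]
        self.pos += 1
        return row                        # hand one tuple back up the stack
    def close(self):
        self.rows = None

class Sort:
    def __init__(self, child):
        self.child = child
    def init(self):
        # Pull the ENTIRE input during init, like the sort iterator in
        # lecture (a real system would write sorted runs to disk here).
        self.child.init()
        buf = []
        while (row := self.child.next()) is not None:
            buf.append(row)
        self.child.close()                # done with the child for good
        self.out = iter(sorted(buf))
    def next(self):
        return next(self.out, None)       # merge/emit phase
    def close(self):
        self.out = None

# "main": one thread, one stack, init / next / next / ... / close.
q = Sort(FileScan([3, 1, 2]))
q.init()
result = []
while (row := q.next()) is not None:
    result.append(row)
q.close()
print(result)  # [1, 2, 3]
```

Batching, as discussed above, would just mean next() returns a small list of tuples instead of one, trading a little interface complexity for fewer function calls and better instruction cache locality.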
Kinda go through that loop tightly before you go to another iterator. Fewer function invocations, all right? But this is kinda micro-optimizing. Cool. Other questions? All right, there's a much bigger performance issue in this picture, which I talked about with these guys after class last time. I'm gonna bring it up today and I'm never gonna talk about it again, but I think it's kind of interesting. Let's say we have a really big database with a really big table in it. And it doesn't fit on a single disk. It fits in a computer, single computer let's say for now, but it's too big for a single disk. It's pretty typical. You probably wanna get a few disk drives on a single computer to get your disk bandwidth, the amount of data you can stream out of your disks at once, to equal your memory and CPU bandwidth, how much data you can flow through memory and your CPU at the same time. So in an ideal world, you have some number of disks. You have some kind of computer with some kind of network interconnect for these disks, right? Some kind of bus. And you've got one or more processors in this computer. Let's say you just have one for simplicity, okay? So you have disks, an IO channel, and a CPU. Your goal is to have the aggregate scan rate of these disks all reading data at the same time equal the aggregate bandwidth of your IO channel, right? So there's no bottleneck here if all these guys are reading at once. And your CPU should be able to chew on those tuples and do this nonsense exactly as fast as your disk drives can go, right? That would be, that'd be awesome. And that's what's called a balanced architecture. That means there's no bottlenecks, right? And if it is not balanced, let's say your IO channel is not sufficient. Well, you do whack-a-mole, you get a better IO channel, right? That's your bottleneck until it balances. And then maybe, you know, you don't have enough disk drives. And so you have more disk drives.
And then maybe your CPU's too slow and then you take away disk drives and buy a new box and put them on the new box, et cetera, right? But the point is you want this to be balanced. So a trick question that gets asked in like database graduate prelim exams is, what's the bottleneck in the database system? And you'd think the answer would be, well, the disk, it's a database system, that's where the whole cost is. And the answer is, there are no bottlenecks in a well-balanced system, right? It all just flows. That's the goal. Now, that's very idealized. That's like, well, how much CPU am I gonna spend per tuple? That depends on what I'm doing in the CPU, which depends on the program, you know? And maybe my programs don't just scan the disks, they do other stuff. But in the ideal world, this all flows. Okay, great. How do we make it flow in this single threaded model? I mean, think about what's happening at file scan. This is an iterator. It gets the thread, you call it, and it says, oh, you want some data? Cool, I'll get you some data. Call the operating system. And the operating system goes, oh, you want some data? File starts over there. Let's get this disk drive to do a read. And it does a read and it gives you the data back. So there's a whole lot of latency there, and you're only getting one disk drive spinning for you right now. That stinks. These three guys are lying idle. This big fat IO bus you had is lying idle, right? That's terrible. What you really want is you want these guys to be going, like, pumping data, right? And then you want this guy to catch it just when it comes. So actually the way you would architect one of these things, and most database systems work like this, is you actually have separate threads for the disk devices.
So you might have like a thread per disk device, and the IO, not the IO channel hardware, but the IO layer of the database actually under file scan, what's going on is there's a totally independent thing down here called the disk subsystem. And it's multi-threaded. It's got a thread like per drive. And when you say init, these guys go, oh, dude, you want to scan this thing? We're on it. And so this guy, when you say init to him, he says fork a bunch of threads and start scanning. And this stuff's filling up a buffer, all right? So control flow and data flow are actually not connected. This is producer-consumer. This stuff is pumping data into this buffer. And the file scan thing, every time it wants to get a tuple, lo and behold, it's already in memory, isn't that nice? And so this path never actually waits for an IO. And the disk bandwidth is fully drenching memory with nice, rainy tuples. So that's actually how you implement this. Underneath file scan, hidden from view, there's a detached set of threads that are feeding us with data. From the viewpoint of the query, though, that's all under the covers. It just so happens that every time it looks for a tuple, that tuple's already in the buffer pool, all right? So I'm not going to talk about multi-threading the IO subsystem ever again, but for those of you who want to know how to make this stuff go fast, you definitely have a separate thread or a separate process to manage your IO. And if you do like a top on a Unix system that's running a database, if it's multi-process rather than multi-threaded, there'll be a bunch of worker processes that are the IO workers, and they do asynchronous IO like this. Good? Cool. I'm not going to say anything more about threads and iterators and stuff like that unless there are questions. You will be implementing stuff like this in your homework, right?
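The detached producer-consumer idea above can be sketched with a bounded buffer. This is purely illustrative: the function name is made up, each list item stands in for a disk page read, and real systems use per-device threads and asynchronous IO rather than a Python queue.

```python
# Toy prefetcher: an IO worker thread "reads pages" into a bounded
# buffer while the query thread consumes tuples that, by the time it
# asks, are usually already sitting in memory.
import threading
import queue

def start_prefetcher(pages, buffer_size=4):
    buf = queue.Queue(maxsize=buffer_size)
    def io_worker():
        for page in pages:               # pretend each item is a disk read
            buf.put(page)                # blocks when the buffer is full
        buf.put(None)                    # end-of-file marker
    threading.Thread(target=io_worker, daemon=True).start()
    return buf

buf = start_prefetcher(["page0", "page1", "page2"])
out = []
while (page := buf.get()) is not None:   # the "file scan" side: just take
    out.append(page)                     # from the buffer, never wait on disk
print(out)
```

The bounded queue is the key design point: it applies backpressure, so the IO side can run ahead of the query but only by a fixed amount of memory.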
You're going to implement, in your homework, you're going to implement one of these iterator blobs in Spark, all right? Should be fun. Cool. OK. Next, let's worry about some more join algorithms, shall we? So we did the nested loops join algorithms. And now we're going to learn the other alternatives. Sort merge join. These are going to be dead obvious to you. And the reason they're going to be dead obvious to you is you already learned the single-table versions of these algorithms, right? They're going to look very much like the sort algorithm and the hash algorithm you learned for single-table. We're just going to do it, you know? Now we're dancing together. We've got two relations we need to do it on. But it's going to be pretty straightforward. So the basic idea is you sort R using the algorithm we already learned. You sort S using the algorithm we already learned. And then you merge them together, find me the things that match. That's the algorithm. Pretty straightforward. We know the algorithm for sorting already. Well, example. I don't know why I have an example anymore. But here's an example. This is really annoying. Do we really want to go through this example? Yeah, we probably do because there's some details. OK, fine. Well, happily, these are already sorted. So we won't have to wait for that. So in this example, sorting has already happened. And then what we're going to do to merge, to sort merge join, so the merge is where the joining happens, is we're going to keep a cursor on each table. So we'll have a blue cursor on the reserves table. And we'll have a green cursor on the sailors table. And we're going to walk down them finding matches. So we'll start with 22 here and 28 there. Do they match? No. The lower one, namely my left hand, the green one, we're going to move down to 28, yippee. Does 28 match 28? Yes, hooray, send it to the output.
Now, before we move the green one, we're going to loop on the inner loop of the right hand side and say, is there another 28? Lo and behold, there is. So we're going to output that pair also. Great. And then we're going to go down. Are there any more 28s? Oops, no, it's a 31. Better keep that thing there. Now we move my left hand. 31, cool. Is there a 31 on the right? There is, hooray, to the output. Move the right one down. Is there a 31? Hooray, yes, move it to the output. Is there a 31? Yes, hooray, move it to the output. Next thing's 58. That's no good. Move this one down, 44. Nope, move this one down, 58. 58, 58, go together. Yay, end of file, end of file. We're done. The thing to keep in mind is that not only are you walking these in sync, moving the smaller one down whenever it's too small on either side, but when you find a pair like the 31 and the 31s, it looks like nested loops join, because you've got to make sure that you get all pairs of 31s here and 31s there. So there's a little nested loops join in the middle of this thing. And that's where 31 was here, and I was going 31, and then 31, and then 31, right? And then I moved this guy. Question? Let's do it: what if both sides have two 28s? It's going to be exactly what I just said, which is nested loops join. So let's say we have 22, 28, 28, 31, and 28, 28. Let's just actually, for fun, let's do 10, 28, 28, 40. OK, how many 28s should there be? How many tuples with 28s in them should there be in the output for starters? Four, right? And so here's what's going to happen. We're going to start like here and here. We move the smaller one down, OK, here and there. We move the smaller one down, and now we've got a match. And once we've got a match, we enter nested loops join. So for each tuple here, we're going to scan all the 28s here. So you find one match, send it to the output. Move this down. Find another match, send it to the output, all right? Move this down, no good.
Oh man, now we've got to rewind actually to the first 28. And that's a good reason you asked this, because now what if this is a 28? Oh my gosh, it is a 28. Now I need to combine it with that guy and that guy again. And I move this down, rewind, and then I realize, oh, I'm out of 28s, and I skipped it here. Right, so that nested loops, you do need to be able to rewind one of those relations back to the top of the duplicate group here. Yeah, good, I'm glad you made me do that, thank you. And in fact, if you implement this nicely, you actually, your nested loops logic, well, nested loops is so easy, it's probably not worth it. But in your homework and in other places, you'd like to reuse sub-routines from some algorithms and compose them. And a lot of these join algorithms and hashing algorithms kind of have little sub-routines that are like their main-memory analogs, or that are analogs from another algorithm. And sometimes you can compose them up. So in your homework actually, you'll be asked to do exactly that. You're gonna write an in-memory algorithm, and then you're gonna use it as a sub-routine in an out-of-core algorithm, right? So you'll see that. It's hashing, and you've seen it in class. You're just gonna have to implement it. Okay, what's the cost of sort merge join? Well, we have to sort R, we have to sort S, and we have to scan them again, right? Ignoring the nested loops. Let's assume that our nested loops always fit in memory. It's unusual that you have duplicates where there's so many of them that you have to re-scan the file, right? We can worry about that case together, but probably if that's the case, your optimizer should have known it and chosen the nested loops join. All right, but assuming that the nested loop inside the sort merge join is negligible, it's sort R plus sort S plus scan the results, right? Okay, what's the cost of sorting R in IOs? Sorry, I guess I'm looking for an actual equation for the cost of sorting R.
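Before getting into costs, the merge step just walked through, including the rewind back to the top of a duplicate group, can be sketched on plain sorted lists of keys. This is an illustration with made-up inputs, not the homework's interface.

```python
# Merge step of sort-merge join on two ALREADY-SORTED lists of keys.
# When a match is found, a little nested loops join runs over the
# duplicate group, rewinding s back to the start of the group for
# each duplicate in r.
def merge_join(r, s):
    out = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i] < s[j]:
            i += 1                         # advance the smaller side
        elif r[i] > s[j]:
            j += 1
        else:
            key = r[i]
            start = j                      # top of the duplicate group in s
            while i < len(r) and r[i] == key:
                j = start                  # rewind!
                while j < len(s) and s[j] == key:
                    out.append((r[i], s[j]))
                    j += 1
                i += 1
    return out

# The in-class example: two 28s on each side give four output pairs.
print(merge_join([10, 28, 28, 40], [22, 28, 28, 31]))  # four (28, 28) pairs
```

Notice the rewind only ever goes back to the start of the current duplicate group, which is why the "nested loops in the middle" is usually cheap.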
Yeah, two N log N, oh good, all right. That would be a good algorithm analysis if we were in memory, but if you remember, we wanted to know how many IOs we were dealing with, and when we studied sorting, we talked about how many passes over the relation you do, say in a two pass sort. So the first question is how many passes we have to do in our sorting algorithm from two lectures ago, one lecture ago, right? But assuming it's a two pass sort, remember the picture. It's just a chance to reinforce material from last time, right? So sorting, we're gonna read things into memory, so that's one pass, quick sort, blah, blah, blah, write to partitions on disk, read them back in, and merge them together, right? Okay, now the read on the input and the write on the output are all gonna happen through the iterator interface, and we don't have to pay for them. They may not even be on disk. There may be something other than a scan beneath the sort. But the internals of the sort, this is in memory, this is on disk. We write once and we read once, and then it goes to the output, right? So it's at least two passes of the file to sort, ignoring the initial read and the final write, yeah? Okay, and in this case, I think we're gonna have to pay for that initial read as well, because we're doing this example off of data sitting on disk. So there's an R, there's two Rs here to write it and read it back, all right? So it's three times the number of blocks of R, plus three times the number of blocks of S. All right, this worst case thing here is about that nested loops thing, okay? Let's ignore that. Suppose B is 35 buffer pages, we can sort R and S, let's say, in two passes. We can do that if we have 35 buffer pages. If we're naive about this, and we take the naive cost, we will sort R and write it to disk totally sorted. All right, so that's another R. So it's four Rs, one, two, three, four Rs, to read it, sort it, and write it to disk.
We'll do the same thing for S, and then we'll pay to scan them both back, and that's 7,500, all right? If I give you more buffer pages, unfortunately it doesn't change the number of passes in this case, it's still gonna be 7,500. All right, but we can do a little bit better than that. Oh, and just to compare the costs, chunk nested loops join I think was 2,500. So we're not beating chunk nested loops join when you have 300 buffer pages. But if you only have 35 buffer pages, we're actually doing better than chunk nested loops join, which is just to highlight the point that there's not like a clear winning algorithm among these. It all depends, very often, on how much RAM you have, how big your relations are, whether one relation's much bigger than another, and so on. And so most systems implement more than one of these algorithms. Okay, so here's an important refinement on sort merge that's gonna make it a bit cheaper. When we're done generating these sorted partitions of R, rather than merging R and generating an output, why don't we just do the join right here? So look, we're gonna do the same thing for S. Bring it into memory, sort it, write it into partitions, two S's to do that. So first we do this for R, then we do the same thing for S. Now, we've got partitions of R on disk and partitions of S on disk. Why don't we merge them and join them at the same time? So we're merging R's with R's, merging S's with S's, and then merge joining the R-S streams together, all in one pass, okay?
So basically what we're gonna do in memory here is we're gonna have half the memory buffers used for merging R tuples, partitions of R, half of the memory buffers for S tuples, partitions of S, and the game is, in the merge join algorithm, when we say find the next tuple of R, what we mean is the smallest of any of these guys, and when we say find the next tuple of S, it's the smallest of any of those guys, and then when they match you join them and put them to the output. Make sense? We just saved ourselves from writing down R, writing down S, and then reading them back to merge them, right, which is two passes of R and two passes of S we just saved, so it's a big deal, right, these are big files, okay? So in this case, if we have enough memory to have buffers for all the partitions of R and all the partitions of S, like this, so we read R and write out sorted runs, read S and write out sorted runs, merge them together while finding the matches. We now do, let's count: one, two, three passes of R, and one, two, three passes of S, and then the output streams out. And I haven't paid the cost of writing out the output in this picture, so we stream out the output, maybe it just goes to the screen, right, okay? And so how much memory is enough memory to do this in two passes? I'm gonna let you do that as an exercise at home. It's the same analysis we did for how much memory it takes to do two passes to sort just R. So now I'm giving you fewer memory buffers per table in this merging phase, because we need to do the same thing for S, right? So you have to have a small enough number of partitions here that you can keep them both streaming into memory at the same time. You guys can figure out exactly where the equations fall to get two pass sort merge join, right? And TAs, if you wanna do that in section, it's a good thing to work through.
All right, sort merge join is an especially good choice if, hey, what if one of the tables happens to already be sorted and we know that? Then we don't pay for that sort, we just do the merge. Merge join is really cheap, right? So that's good. Or if somebody wants you to sort the output anyway, hello Siri, welcome to class. If the output has to be sorted anyway, one of the side effects of the sort merge join is that its output stream is sorted. So maybe you can get the sort in the next stage down for free, okay? So these are the kinds of things that a query optimizer's gonna try to think about and we'll learn about that later in the semester. Hash join, hash join is a lot like the unary hashing we learned, all right? First phase, do this for R and then do this for S, right? Generate your hash partitions for R, then generate your hash partitions for S, okay? Just like we learned for unary hashing. And what's gonna be the result after that? All of the apples in R will be in one place. All of the oranges of R will be in another place. All of the apples of S will be in one place. All of the oranges of S will be in another place. Now we've divided, now we can conquer. Let's just do apple to apple joins and orange to orange joins, right? So you can just take these partitions and join up the partitions that have the same hash value, right? So we really just divide and conquer, very straightforward. So you do this for R, you do it for S, and then you're gonna build a hash table just on R in memory. So you bring in a partition of R, put it into a main memory hash table, and we're gonna stream S in our usual streaming fashion through this input buffer. And for each tuple of S we'll just do a lookup in the hash table to see if there's any matches, right? So we're building a main memory hash table on a partition of R and we're streaming the matching partition of S past it, just doing lookups in that main memory hash table. Really simple, okay?
Cool thing about this is that the R hash table, the R partitions, have to fit in memory, right? You have to build a hash table on them. The S partitions can be huge because you're streaming them. So if S is a really, really big relation and R is kind of a medium-sized relation, it's bigger than memory but it's not that big, this is great, because S might have required lots of recursive partitioning but R maybe not, right? And S, you don't need to recursively partition. You partition it once and then you just stream the results past. So it's quite nice if your tables are two different sizes. You hash both of them into partitions and then you bring in the main memory hash table on the little guy and stream the big guy past. Sometimes we talk about one table as the building table because it's the one we use to build that main memory hash table. The other table is called the probing table, and you want the building table to be small because you want to build main-memory-sized, smallish hash tables, okay? So we don't have an inner and an outer. We have a building and a probing relation in this case. Cost of hash join: well, in the partitioning phase you read and write both relations, and in the matching phase you read both relations again. If they're both small enough that this can be done in two passes, an equation you should remember from last time, then the total cost is those three passes, plus writing the output if you want to charge for that. By the way, on homeworks and exams, when we ask you what it will cost to do such and such a join, we will be very careful, all of us, we have to remember to be very careful, to tell you whether or not you should pay for the cost of initially reading the input relation and the cost of writing the output relation, because we could ask you to do that or not. Usually not is a good thing, but we won't make you assume, we'll be very explicit.
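The two phases described above, partition both inputs, then build on R and probe with S per partition, can be sketched on small in-memory lists. This is an illustration with made-up data, not the homework code; a real implementation partitions to disk pages and builds one partition's hash table at a time.

```python
# Sketch of partitioned hash join on lists of (key, value) tuples:
# phase 1 partitions both inputs by hash of the join key; phase 2
# builds an in-memory hash table on the R (building) side of each
# partition and streams the matching S (probing) partition past it.
from collections import defaultdict

def hash_join(r, s, num_partitions=4):
    part_r = defaultdict(list)
    part_s = defaultdict(list)
    for key, val in r:                       # partitioning phase: R
        part_r[hash(key) % num_partitions].append((key, val))
    for key, val in s:                       # partitioning phase: S
        part_s[hash(key) % num_partitions].append((key, val))

    out = []
    for p in range(num_partitions):          # matching phase, per partition
        table = defaultdict(list)            # build on R (the smaller side)
        for key, val in part_r[p]:
            table[key].append(val)
        for key, sval in part_s[p]:          # probe with streamed S tuples
            for rval in table[key]:
                out.append((key, rval, sval))
    return out

pairs = hash_join([(28, "r1"), (31, "r2")], [(28, "s1"), (28, "s2"), (44, "s3")])
print(sorted(pairs))  # [(28, 'r1', 's1'), (28, 'r1', 's2')]
```

Only `table`, the hash table on one R partition, ever has to fit in memory, which is exactly why the building relation should be the small one.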
Okay, in this case we wrote the cost of the output relation. All right, what is the cost of a two pass sort merge join? I think we did that, but what is the cost of a two pass sort merge join, anybody remember from two slides ago? It's the same, because sorting and hashing are dual, right? We sort of remember that. How much memory is needed for a two pass sort merge join? Well, we talked about having to fit runs of both tables into memory in the second pass at the same time, right? So we're gonna need a bit more memory than sorting one table alone would need, yeah? We need something like the square-root bound from last time, but now with a factor of two in there, right, that amount of memory. But with hashing, guess what? You only hash one of these guys, right? So the memory needed for a two pass hash join is just the usual single-table hashing equation applied to the smaller of the two inputs to the join, right? That building relation, which you should pick to be the smaller one, is the thing we're gonna partition and potentially recursively partition. But the probing relation, which would be the big one, we don't have to do that to; we just partition it and read it, okay? So the memory requirement here for hash join, remember, depends only on the smaller table. The requirement for sort merge join depends on both tables. And again, good exercise to work out the numbers here and nail them down. We'll expect you to know the answers to these questions on exams, right? And this is a little bit of plug and chug from the previous lectures on the single table algorithms, making sure you understand these algorithms. So good exercise for you guys to do at home. All right, an optimization now for hash join. And by the way, you should think about how this affects these questions too. Here's an optimization. Say you have B memory buffers, all right? So there's this crazy thing that happens with the smaller relation with this algorithm I just taught you.
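As a rough sketch of that memory comparison, using the approximate single-table bound from last lecture (hashing a table of N pages in two passes needs on the order of sqrt(N) buffers). These are back-of-the-envelope approximations, not exact formulas; work out the precise inequalities yourself as the exercise suggests.

```python
import math

def min_buffers_two_pass_hash_join(pages_r, pages_s):
    # Only the building relation (pick the smaller one) must be partitioned
    # into memory-sized pieces, so only it drives the memory requirement.
    return math.ceil(math.sqrt(min(pages_r, pages_s)))

def min_buffers_two_pass_sort_merge(pages_r, pages_s):
    # Sort merge must produce and then co-merge runs of BOTH tables,
    # so roughly the square root of the combined size is needed.
    return math.ceil(math.sqrt(pages_r + pages_s))
```

With a 100-page table joining a 10000-page table, the hash join estimate needs about 10 buffers while the sort merge estimate needs about 100, which is the "depends only on the smaller table" point in numbers.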
Suppose the smaller relation is smaller than RAM, so smaller than B blocks, all right? Then for the building relation, you read it in and there's no need to partition it. You just put it in a memory hash table, it fits already, right? So you take only one pass over that building relation, and then, you know what? You only need one pass over the probing relation too, because there's only one partition against which to look things up. So you take one pass over the probing relation, streaming it through into lookups in your main memory hash table, right? So if one of the relations fits in memory, it's a one pass algorithm. But if the building relation is like one tuple over what fits in memory, according to what I taught you, you have to partition, and then suddenly you have the cost of partitioning the big relation too, okay? And so now you're gonna pay a cost which is, sorry, two passes, right? So if the thing is B or smaller, it's one pass. If it's B plus a little epsilon, it's two passes, all the way up to around B squared, right? Because remember, after B squared, we recursively partition, right? But that jump is really stinky. It's like, damn it, I had just one tuple too many, I didn't get it in one pass, and now I have to do it in two passes. We'd like to get a ramp there somehow. So how can we take advantage of it when this thing is only B plus one blocks big? How can we make this not a step function but a smooth function, all right? And the costs, of course, are N for one pass versus three N for two passes, with another two N for every additional pass. So that stinks. And in between those points, there's memory going to waste, because the partitions of the little guy will be little partitions for a while, right? When you partition something that's B plus one blocks into B pieces, they're gonna be like one block each when you read them back in. Kind of a waste of memory, right? So what are we gonna do? Here's something clever.
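That step function is easy to see if you write the naive cost rule down, assuming page counts and ignoring output writes:

```python
def naive_hash_join_cost(pages_r, pages_s, B):
    """I/O cost with R as the building relation (the non-hybrid algorithm)."""
    if pages_r <= B:
        # One pass: read R straight into a memory hash table, stream S past it.
        return pages_r + pages_s
    # Two passes: partition both relations (read + write), then read both back.
    return 3 * (pages_r + pages_s)
```

With B = 50 buffers, a 50-page building relation joins a 1000-page probe in 1050 I/Os, but a 51-page building relation, just one block over, suddenly costs 3153 I/Os. That cliff is exactly what hybrid hashing smooths out.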
Why don't we keep one of the hash buckets in memory from the get go, all right? So as we're reading in, well, I guess there's a picture on the screen, I should just point to it. As we're reading in the building relation, we're gonna build a hash table in memory for its first partition, but it's gonna be small. It's only gonna be K buffers big, all right? So that first partition's gotta be small. So we need a hash function that gives us one nice small partition for that first bucket. You can figure out a way to get your hash function and your modulus and stuff to give you one partition that's small, like a tenth the size of everything else. You can figure out how to do that, it's pretty easy. Okay, but we're gonna make sure we have one partition that's little, and we're just gonna stick it in memory. So as you read it in from the building relation, you're just gonna hash it into memory and you're never gonna write it to disk. Partition one just fits in memory. The rest of memory is used as output buffers to write partitions two through however many partitions there are. Okay, so that's the first pass, and then when we read the probing relation, we're gonna do the same thing, okay? Things in the probing relation that match on that first hash bucket, we'll just do the lookups as we read them from disk, generate those lookups and spit them to the output, right? So in pass one, just to be clear again, R goes into here: tuples with hash value one go in this in-memory bucket, and the rest of the buffers are for writing out the other partitions. So some of R, partition R sub one, is sitting in memory. Now when we read S, the probing relation, some of the tuples of S will hash to bucket one. We do their lookups and we send them to the output. Goodbye, those S tuples just get passed to the output. The rest of the S tuples, the ones that hash to the other buckets, go to their hash partitions, and then you do the second phase of hash join.
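A sketch of that first pass, assuming bucket 0 plays the role of the resident partition. Note this toy version does not actually rig the hash function to make bucket 0 small, as the lecture describes; spilled partitions are simulated with lists, and all the names here are illustrative.

```python
def hybrid_hash_pass_one(r, s, num_parts):
    resident = {}                  # partition 0 of R: built in memory, never spilled
    r_spill = [[] for _ in range(1, num_parts)]   # "disk" partitions 1..num_parts-1
    s_spill = [[] for _ in range(1, num_parts)]
    out = []
    for key, val in r:             # read the building relation R
        p = hash(key) % num_parts
        if p == 0:
            resident.setdefault(key, []).append(val)   # stays in memory
        else:
            r_spill[p - 1].append((key, val))          # written to its partition
    for key, sval in s:            # now stream the probing relation S
        p = hash(key) % num_parts
        if p == 0:
            # Matches on the resident bucket go straight to the output.
            for rval in resident.get(key, []):
                out.append((key, rval, sval))
        else:
            s_spill[p - 1].append((key, sval))         # spilled for phase two
    # r_spill[i] and s_spill[i] would now be joined pairwise as in plain hash join.
    return out, r_spill, s_spill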
But you got a win on some fraction of R that went directly into memory, and some fraction of S that streamed directly through. Those guys never got written to disk or read back, right? And if you can get most of the building relation in there, say you had a B plus one block building relation and something like B minus two blocks of it fit, that's almost all of the smaller relation, which means that almost all of the tuples from S will hash into it, which means that almost nothing will get written. And generally speaking, you'll see a ramp that looks like that instead of the step function we had before, okay? So this is called hybrid hash join, because that first bucket is like a naive main memory hash join where you just put things in memory, but it's a hybrid of that with the spilling, out of core hash join. So it's a hybrid of in core and out of core hashing all at once, right? And it smooths out that step function for medium sized relations. This is kind of cool, hybrid hashing. It was actually invented at Berkeley; the original paper on this was a Berkeley paper. And it's kind of cute, I like it. Okay, that's hybrid hash join. All right, which is better than which? Well, sorting is good if your input's already sorted or if you need the output sorted, all right? And as we know from sorting, it's not sensitive to data skew, bad hash functions, all that kind of stuff. Hashing, well, hashing has this hybrid hashing trick. But more to the point, with sorting you have to sort both tables. With hashing, the number of passes depends only on the size of the smaller table, right? That was the whole point I was just making. The smaller table is the one that you have to bring back into memory. The bigger table, the partitions can be huge and you won't recursively partition them.
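To make that comparison concrete, here is a miniature version of what an optimizer does: write down each algorithm's cost equation and push the numbers through. The formulas are the approximations from this lecture (two-pass costs, block nested loops with B minus 2 chunk buffers); the function name and structure are just illustrative.

```python
import math

def join_costs(pages_r, pages_s, B):
    """Approximate I/O cost of each join algorithm, ignoring output writes."""
    small, big = min(pages_r, pages_s), max(pages_r, pages_s)
    costs = {
        # Block nested loops: read small once, re-read big once per B-2 chunk.
        "block_nested_loops": small + math.ceil(small / (B - 2)) * big,
        # Two-pass sort merge: sort both tables (read + write), then merge (read).
        "sort_merge_2pass": 3 * (pages_r + pages_s),
    }
    # Hash join: one pass if the smaller table fits in memory, else two passes,
    # and either way the pass count is driven only by the smaller table.
    costs["hash"] = (small + big) if small <= B else 3 * (small + big)
    return costs
```

For a 10-page table against a 1000-page table with B = 12, hash join gets the one-pass cost of 1010 I/Os; bump the small table to 100 pages and hash join falls back to 3300 while block nested loops balloons to 10100.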
All right, so for joins with one very big table and one not so big table, hashing is a big win. And the hybrid hashing thing can make it even a little bit better. There's no analogy that I'm aware of for sorting, unless you wanna call tournament sort an analogy, but that's pretty weird, I don't even wanna think about that. Hashing's also good if the output needs to be grouped, i.e. hashed, but doesn't need to be sorted, so that could be a win too. Okay, so that gives you a flavor of some rules of thumb, but what you really need to do in any given situation is write down the equations for the cost of sort merge join and the equations for the cost of hash join, and then just push the numbers through and compare. And that's what a query optimizer will do when faced with the choice of what join algorithm to use. So to recap: nested loops join works for any theta, because you traverse the entire Cartesian space; make sure to do it in the chunk-oriented way so you take fewer passes over that inner relation, right? Index nested loops join is typically for equi-joins, because most indexes do equality lookups. It's nice when you already have an index on one of the tables' join keys, right? If you don't have the index, it's usually not worth building it on the fly; you should just do sort merge join or something. Sort and hash joins are really only good for equi-joins, pretty much, so you can't do arbitrary theta joins with them, but no index is required in advance. And the bottom line: there's no clear winner here. Most database systems worth their salt implement at least a few of these. Of sorting and hashing, sometimes people will implement only one; they'll skip hash join and just implement sort merge join. It's okay. But you probably want to implement them all, they're not that hard. And you'll want to make sure that you, that is to say you, students in this class, know the cost model for each of these, because you'll need it for your exam.
I mean, for understanding query optimization, right? All right, homework two goes out tomorrow. There will also be partner signup form information on Piazza; you have to do that as well. Partners, teams of two, are required. Part of the course is to make sure that you learn to talk to another computer scientist if you haven't already. And vitamin one goes out tomorrow as well. No, it goes out right now, it's live, it's live. All right, see you guys.