OK, all right. So today's lecture is one of the topics we cover in 15-721 on multi-way joins. Now, I was debating whether to call the lecture "worst-case optimal joins" or "multi-way joins." Worst-case optimal joins are the category of multi-way joins that we're going to focus on, but in my opinion the term is confusing. And apparently there's an anecdote where Don Knuth thought the term "worst-case optimal join" was confusing as well. I'll post the link on Piazza afterwards. But let's jump into this and cover it at a high level. This will be a shorter lecture.

So the last two classes we talked about hash joins and sort-merge joins, and those were always in the context of being done as binary joins, meaning we have two tables and we're joining them together. If a query has multiple tables that it wants to join together, we do this as a series of binary joins. There have been decades of research on how to make these joins really fast, so this is the default choice for everyone when they build a new system.

This is fine and dandy, assuming you get the join order correct, which we'll cover in a week or two. And it works great when the output of each join is decreasing in size. You try to do the joins that are going to be the most selective as early as possible in the query plan, so that the output of each join gets smaller as you go up the plan. So again, this is fine when the output is always smaller. But things can go bad if the output is larger than the inputs.

So for this join here, instead of the output from joining S and T having 100 tuples, if that output is now 1,000 tuples, then I have to join 100 tuples over here with 1,000 tuples over there, even though the final output is going to be smaller, say 10 tuples. These are small numbers, but think in terms of millions or billions; this can be problematic. So the goal today is: how can we join R, S, and T in this example together at the same time, so that we don't have to materialize that larger intermediate result?

So, visualizing the same query I had before with real numbers: this is sometimes called a triangle query, or a query with a cycle. We're trying to join everybody with everybody. No matter what order we join these three tables together, the intermediate output is always going to be larger. We join R and S, we get a large table like this, but then when we join it with T, it gets filtered down. Same thing if you do R and T, and same thing when you do S and T. So this is the problem we're trying to solve today: we want to avoid having to materialize an intermediate result that is larger than our inputs. And the reason we don't want to do this is that it's wasted computation as we execute the query, because now we've got to take this larger intermediate result and join it with the next table in our query plan. And we also have to materialize it in memory or on disk, so we're wasting storage space as well.
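For reference, the triangle query I keep pointing at on the slide can be written like this (R, S, T and the attributes a, b, c are the same placeholders as on the slide):

```latex
% Triangle query: join everybody with everybody on the shared attributes.
Q(a,b,c) \;\leftarrow\; R(a,b),\; S(b,c),\; T(a,c)
% Any binary plan, e.g. (R \bowtie S) \bowtie T, must first materialize
% R \bowtie S, which can be much larger than the final result Q.
```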
So the high-level goal of these multi-way, or worst-case optimal, joins is that we're going to join all three tables, or three or more tables, together at the same time by examining their attributes, rather than looking at one table at a time. I'll show visually what I mean by this. And if you took the intro class, or if you know anything about databases, this is a very different way to think about how to do joins than binary joins. And this is rare; not everyone does this, but I'll talk about it at the end. I think this is the future of the next ten years in relational databases: everyone is going to have to support these worst-case optimal joins.

So I'm going to talk about the two main implementations. Well, I'll start with some background on worst-case optimal joins, but I'm only going to focus on the two main implementations that are out there. One was the paper you guys read, the hash trie join in Umbra. And then there's the leapfrog triejoin from the LogicBlox guys, which actually predates the hash trie join from Umbra. There is another one that just came out that's an extension of the leapfrog triejoin, from researchers at the University of Washington; it literally came out a month ago on arXiv, so I didn't have time to include it. There's other work from Stanford called EmptyHeaded and LevelHeaded; the hash trie paper alludes to it, but we won't cover that. These two are considered the superior ones, and the Germans claim that theirs is the best. OK? All right.

So again, the high-level idea is that instead of joining tables by examining two tables at a time, doing the join, and then moving on to the next table, we're instead going to look at the attributes, the variables the query wants to join on, and examine those first, which may span multiple relations, multiple tables. For simplicity today, we're only going to look at three-table joins, but you can easily extend this beyond that.

The idea of the worst-case optimal join was first proposed by theoreticians in 2008. And then there were two major pieces of work in the 2010s: one from Chris Ré's group at Stanford on EmptyHeaded and LevelHeaded, and then the leapfrog triejoin from the LogicBlox guys that I'll talk about in a second. What's interesting about a worst-case optimal join is that its runtime is bounded by the size of the result it could generate and the number of variables you need to evaluate. And just like in a binary join, where you have to get the join ordering correct so that the most selective join happens as early as possible and you throw away data you know you're not going to need, in worst-case optimal joins you want to evaluate the variables or attributes across relations in an order such that you remove tuples, dead tuples, that you know you're not going to need. Again, we're only going to talk about the algorithm here; figuring out the right ordering of the attributes is a whole other problem in query optimization. The Umbra paper you guys read talks about it a little bit, but we're not going to go too deep on it today. The idea is that if we get the right ordering and we find out that tuples aren't going to match on the first set of attributes we're examining, we don't have to examine the other ones. And in some cases, if we're traversing the trie or whatever data structure you're using, you can just skip whole sections of it.
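To make "bounded by the size of the result it could generate" a bit more concrete, here is the standard worst-case bound for the triangle query (this is the AGM bound from the 2008 theory work mentioned above; the notation here is mine, not from the slides):

```latex
% AGM bound for the triangle query Q(a,b,c) = R(a,b) \bowtie S(b,c) \bowtie T(a,c):
|Q| \;\le\; \sqrt{|R|\cdot|S|\cdot|T|}
% With |R| = |S| = |T| = N, a worst-case optimal join therefore runs in roughly
% O(N^{3/2}) time, whereas any plan built from binary joins can be forced to
% materialize an intermediate result (and hence spend time) on the order of N^2.
```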
The other interesting thing about worst-case optimal joins is that the more tables that go into the join, the better its performance can be relative to the input, because you can look at more tables and throw things out more quickly, sooner rather than later. And this is sort of the opposite behavior of a binary join. Binary joins don't really have that notion of more tables helping, and, sorry, their runtime isn't bounded by the output size in the same way. It's a different beast.

So again, for me, when I read these papers, the term "worst-case optimal join" seems counterintuitive. There's a definition from a professor at Waterloo who's building a graph database that supports worst-case optimal joins. In the blog article I'm citing here, he says the way to think about "worst-case optimal" is in terms of the worst-case runtime of the algorithm: the worst-case runtime of the algorithm will meet a lower bound that's been defined as the lowest possible runtime of any join algorithm for the worst-case scenario. That's kind of confusing, too. My definition, and it's probably imprecise, I'm sure someone will correct me on YouTube, is that the runtime of the join algorithm is better than all the other join algorithms when the query and the data that the query is going to process represent the worst possible scenario. You're going to do better than a binary join can do. It may not be fantastic; it's obviously not going to be log n to do the join, but it's going to be better than what you would have done with a binary join. And again, this link will take you to a footnote in the blog article from the guy at Waterloo where he talks about how Don Knuth did not like this term either. And he's a professor at Stanford, so you know.

So there are not very many systems that implement this. The graph database systems don't do this because they don't really have a notion of tables and joins; they're doing graph traversals of a data structure. So as of 2023, the only two commercial systems that support worst-case optimal joins are RelationalAI and LogicBlox. LogicBlox is the leapfrog triejoin that I'll talk about in a second. And actually, RelationalAI was founded by the people from LogicBlox; LogicBlox got bought, and those guys then formed RelationalAI. So it's the same people, right? The leapfrog triejoin in LogicBlox is, I think, the first commercial implementation of a worst-case optimal join. In academia, the first one would have been EmptyHeaded from Chris Ré at Stanford in the 2010s. There's the Umbra one that we've talked about. And then DuckDB: there's a paper from this year, and this is all very, very new, where Peter Boncz, the guy that built Vectorwise, added worst-case optimal joins to DuckDB. And then Kuzu is a graph database out of Waterloo. So not very many systems. You don't see Snowflake, you don't see SQL Server, you don't see Oracle, and so forth.

But I think what's going to change in this decade is that, as of this year, the SQL standard supports a new extension called SQL/PGQ, property graph queries. It looks a lot like the Cypher language in Neo4j for running queries on graph databases. This is now going to be in the SQL standard. And that means you're going to be able to use relational databases, as Ted Codd predicted back in the day, for any kind of data.
And you'll be able to do the graph traversal stuff, where you need the worst-case optimal joins we're talking about today to operate efficiently on that data. So that's why this matters. And again, there's not much fat left to cut off of hash joins and sort-merge joins. I think I told you this before: a former PhD student of mine spent six months trying to speed up hash joins in our system, and he got it down from 12 cycles per tuple, which was the state of the art at the time, to 11 cycles. We're literally counting single cycles. That's not a paper, that's not research; there isn't that much you can do. Any time you have to go to disk, or any time you spill out of your CPU caches, you're going to be bottlenecked by the hardware. Whereas with worst-case optimal joins, I think this is still fertile territory for major algorithmic and implementation improvements, potentially order-of-magnitude speedups over what we're doing today. So again, this is all very, very new.

All right. So I'm going to talk about the leapfrog triejoin from LogicBlox. This came out in 2014. They have a newer version at RelationalAI called the dovetail join, which is also a worst-case optimal join algorithm, but I can't tell what it does because they have a five-minute YouTube video that doesn't say anything, and maybe I just missed the paper. So we'll just focus on this one, because this is the implementation whose deficiencies the Umbra guys address in theirs, along with the EmptyHeaded one.

For this one, they assume that all the tables you want to join are either sorted ahead of time on the join keys, or you build a trie index for them on the fly. And the idea is that the trie represents the join key itself, and every level in the trie represents a different attribute that's going to be compared as part of the join.

Who here knows what a trie is? Who doesn't? All right, then I'll just go over this quickly. We used to teach tries, and then students didn't know what they were; in this class we used to have a whole lecture on radix trees, tries, the ART index stuff. And I might have mentioned this at some point in the intro class, I forget. But just quickly, to make sure everyone knows what it is. The intro class covers it in Project Zero? Oh, it's in Project Zero now? All right, even better. Yes, perfect. OK, sure. All right, this is what a trie is. In a B+tree, we store the entire key inside of a node and do whole-key comparisons. In a trie, you break up the key into its individual components or digits, and you store those at the different levels. So let's say I have three keys: hello, hat, and have. The way you find a key is by the path through the trie. At the first level you have H, because there's only one entry at the top level; H is the common first letter of all the keys. And then going down here, you find H-E-L-L-O, right? So tries are old; they're from 1959, from some French guy. And the term "trie" was coined by this guy Edward Fredkin, who apparently is faculty here at CMU. I think he's still alive; I've never seen him, but I think he's still listed. Anyway, tries are really interesting. They're an alternative to B-trees or B+trees. They're not that common, because the cost of representing these different levels when you have variable-length nodes can be challenging.
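Here's a minimal sketch of that character trie for hello / hat / have, just to make the structure concrete (this is purely illustrative; it is not how LogicBlox or Umbra lay out their tries):

```cpp
#include <map>
#include <memory>
#include <string>

// Minimal character trie: each level stores one character of the key.
struct TrieNode {
    bool is_key = false;                                   // true if a key ends here
    std::map<char, std::unique_ptr<TrieNode>> children;    // one child per next character
};

struct Trie {
    TrieNode root;

    void insert(const std::string& key) {
        TrieNode* node = &root;
        for (char c : key) {
            auto& child = node->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->is_key = true;
    }

    bool contains(const std::string& key) const {
        const TrieNode* node = &root;
        for (char c : key) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return false;  // path missing -> no match
            node = it->second.get();
        }
        return node->is_key;
    }
};

// Usage: insert "hello", "hat", "have"; the first level has the single entry 'h',
// the second level has 'e' and 'a', and so on down each path.
```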
But they show up in a bunch of interesting places. I know the Cassandra guys are making heavy use of them now in their LSM-tree storage. It's one of these data structures that people kind of ignored for a while, and then the Hyper guys brought them back into vogue about ten years ago.

OK, so let's see how we can use these to do the triejoin in LogicBlox. All right, so we have our three tables. The way we're going to do the join is that we iterate through the tries and do comparisons to see whether we have a match across the different relations, and we go one relation at a time looking for a match. So the first thing we need to do is sort the data; assume that's done. And for simplicity, rather than show the trie structure on this slide, I'm just going to show you the values each relation could possibly have. For each of these, you have an iterator that walks the list of values, and you start, obviously, at the first position. So the x iterator starts at 0, the y iterator starts at 0, and the z iterator starts at 2. That's the initial starting point.

Now we walk through the lists of values from the different relations one by one and try to see if we have a match. And if the last iterator we looked at has a value greater than the one our current iterator is pointing at, we need to jump ahead, leapfrog ahead, in the list to try to find a match. So in this case, if we start with the x iterator, the last iterator we would have examined upon initialization is z. z is pointing at 2, we're pointing at 0, and 0 is less than 2, so we've got to jump ahead until we find either 2 or the first value that's greater than 2. In this case, that's 3. So we advance our iterator to 3, and then we move down to the next iterator. Again, the last iterator pointed at 3, we're currently at 0, and 0 is less than 3, so we jump to the first value that's greater than or equal to 3, which is 6. Same thing down here at z: the last iterator is at 6, we're pointing at 2, so we jump past 4 and 5 to get to 8. Come back around: we jump to 8 here, we jump to 8 here, and now we have a match. Conceptually, that's what we're doing.

Obviously, this would be expensive if we just had flat lists like this, so we're going to use tries to speed things up. We take the same lists we had before, but now the different attributes are represented at different levels in the trie. The first level is just the identifier of which relation we are, say relation R. The second level is the first attribute, in this case A, and the bottom level is attribute B. We build this trie structure for all of the relations we want to join, and you end up with something like this: R has A and B, so the first level is A and the second level is B; T has A and C; and S has B and C.

So now we want to join R, S, and T together. Assume that the query optimizer has figured out the attribute order: the first attribute I want to examine is A, then B, then C. So I start here in the first relation. I traverse into R, and I line up on A equals 0. And I can use that 0 to probe into the trie for T, and I get 0.
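Before continuing the trie walkthrough, here's a small sketch of the leapfrog search over flat sorted lists described a moment ago; this single-attribute intersection is the core primitive that the leapfrog triejoin runs at every trie level (a minimal version under the assumption that the lists are sorted and duplicate-free; the iterator interface in the actual paper is richer):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Leapfrog intersection of k sorted, duplicate-free lists. Each iterator
// repeatedly "seeks" to the first value >= the largest value seen so far;
// when all iterators land on the same value, that value is a match.
std::vector<int64_t> leapfrog_intersect(std::vector<std::vector<int64_t>> lists) {
    std::vector<int64_t> out;
    if (lists.empty()) return out;
    if (lists.size() == 1) return lists[0];
    for (auto& l : lists) if (l.empty()) return out;        // empty list -> empty result

    std::vector<size_t> pos(lists.size(), 0);               // one cursor per list
    int64_t hi = lists[0][0];                                // largest value seen so far
    size_t p = 1;                                            // next iterator to advance
    size_t matched = 1;                                      // how many lists agree on hi
    while (true) {
        auto& list = lists[p];
        // Seek: leapfrog to the first value >= hi instead of scanning one by one.
        auto it = std::lower_bound(list.begin() + pos[p], list.end(), hi);
        if (it == list.end()) break;                         // one list exhausted -> done
        pos[p] = static_cast<size_t>(it - list.begin());
        if (*it == hi) {
            if (++matched == lists.size()) {                 // everyone agrees -> emit match
                out.push_back(hi);
                if (++pos[p] == list.size()) break;
                hi = list[pos[p]];
                matched = 1;
            }
        } else {
            hi = *it;                                        // new, larger candidate value
            matched = 1;
        }
        p = (p + 1) % lists.size();
    }
    return out;
}

// With the slide's lists x = {0,1,3,...,8,...}, y = {0,2,6,7,8,9}, z = {2,4,5,8,10},
// the cursors leapfrog exactly as described above and the only match is 8.
```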
So now I want to find the matches further down. I take the second attribute in R, which is B, and use that to probe into the trie on S. And now I have an iterator down here for the values of C on one side and an iterator down here for the values of C on the other. For simplicity, instead of doing the comparisons one by one, I'll just grab the set of values from each side and take the intersection. I know what A is, because that's how I got here, and it was 0. I know what B is, because it's 0; that's how I got there. So now I'm doing the intersection on the values of C, and I end up with those three tuples.

Then you do the same thing again. You advance over here by one, you get B equals 1, and you use that to probe into S, because you're already positioned at A over here. I land here, I rip through this and get the same set I had before, 0, 1, 2. This side only has one value for C, which is 0. Take the intersection and you end up with one tuple like that. Do this for the last one: I get 2 over here, use that to probe into S, scan along here, get the same set I had before, take the intersection, and end up with the tuple. So I just do this for all the other ones, one by one, and fill out the rest of the join.

The question is: does this apply to an outer join? You wouldn't take the intersection; you'd be matching all of them, and then whatever doesn't have a match when you do the intersection gets a null. I actually don't know whether people actually do outer joins with worst-case optimal joins. It would be kind of weird to do a graph query with an outer join, I think. Is it? Yeah, yeah. But even then, it's like, find me all the paths from A to B, and if there isn't one, put null. It's like an anti-join, too, right? Yeah, OK. For simplicity, assume we're doing inner joins, equi-joins, here, OK?

OK, so that's the leapfrog triejoin. The idea, again, is that you have to build these tries on the fly every time you want to do a join. Now, some of the papers say, oh, the data set is read-only, so I can pre-compute all these tries ahead of time. The challenge with that, of course, is that it takes up space, and you have to do it for every possible join you could have. And then, as soon as you actually do support updates, you've got to throw everything away and start over. Now, in the world we've been talking about this semester, we've been assuming our Parquet files are read-only and stored in S3. So we could do this; we could pre-compute these tries. Except that for really large data sets, and for every possible join ordering, this would be expensive to do, all right? That's basically what I just said here. So even if the data is read-only, having to build this ahead of time for everything is just not realistic, because you've got to store it somewhere, and then you've got to go read it from disk.

So another approach, from the Stanford work, is to use nested hash tables instead of all these tries. But this is going to be expensive, not just because you have to materialize every possible join key order you could have, but because when you actually get down to the comparison, you're doing a bunch of extra computation: taking the hashes, checking where the hashes match, then checking whether the keys actually match, because you could have collisions.
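Before moving on to Umbra's approach, here is a tiny sketch that ties the trie walkthrough above together: the attribute-at-a-time shape of the triangle query, using ordinary sorted maps and sets as stand-in "tries" (a toy illustration of the generic worst-case optimal join idea under my own simplifications, not the LogicBlox or Umbra implementation, and with no attention to performance):

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <tuple>
#include <vector>

using Val = int64_t;
// Each relation as a two-level "trie": first join attribute -> set of second-attribute values.
using Trie2 = std::map<Val, std::set<Val>>;

// Triangle query Q(a,b,c) = R(a,b) ⋈ S(b,c) ⋈ T(a,c), attribute order a, b, c.
std::vector<std::tuple<Val, Val, Val>> triangle(const Trie2& R, const Trie2& S, const Trie2& T) {
    std::vector<std::tuple<Val, Val, Val>> out;
    // Level 1: a must appear in both R.a and T.a.
    for (const auto& [a, r_bs] : R) {
        auto t_it = T.find(a);
        if (t_it == T.end()) continue;                 // a has no match in T -> skip whole subtree
        const std::set<Val>& t_cs = t_it->second;      // all c values paired with this a in T
        // Level 2: b must appear in both R[a].b and S.b.
        for (Val b : r_bs) {
            auto s_it = S.find(b);
            if (s_it == S.end()) continue;             // b has no match in S -> skip subtree
            // Level 3: c must appear in both S[b].c and T[a].c (the set intersection step).
            for (Val c : s_it->second) {
                if (t_cs.count(c)) out.emplace_back(a, b, c);
            }
        }
    }
    return out;
}
```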
And then, because you want to keep things nicely memory-aligned with variable-length keys, you either have to store some dictionary-coded representation of the key, which means you have to do a lookup in the dictionary later on; or you have to store the key itself, which makes the hash table bigger; or you store a pointer to the tuple that has the real data, but then that's doing yet another lookup when checking for collisions. It's more cache misses, and everything gets slower.

So this is what the Umbra guys are solving in their work. The high-level idea is the same: they use a trie data structure to quickly identify matches, or sequences of matches, when you do multi-way joins. But the key difference is that instead of storing the actual values, the way we saw in the leapfrog triejoin or in the nested hash tables, they just store the hashes of the join keys as 64-bit integers. And the advantage of that is that as you traverse the hash trie, you're comparing integers, not actual strings or other arbitrary values. So you can have very tight code in your implementation that operates directly on integers. It also solves the problem of dealing with variable-length data, because everything gets turned into 64-bit values. Of course, the problem is that you could have collisions: two distinct values hash to the same hash value. So at the end, once you've figured out the intersection of which tuples match, you have to go back to the original tuples and see whether it's a true match or whether you had a false positive.

So I like this paper, because it's not too deep on the theory, but it really explains a lot of the low-level details of what they had to do to make this thing performant. I had you guys read the original paper; there's a follow-up tech report published at the same time, one of the citations, that goes through more detail on all the different design decisions and optimizations that they made.

So this is the data structure they're proposing. Again, it's just a trie, nothing special, but they do some interesting things. They store all the key-value pairs at each level as 16 bytes: eight-byte keys, which are the hash values, and eight-byte pointers, but the pointers are special, and we'll talk about that in a second. Then you have pointers down to the lower levels, and at the bottom it's basically a giant linked list of all the tuples that correspond to the different paths down through the hashes.

So there are three optimizations I want to discuss that I think are critical to making this work. The first is how they represent their pointers. Then we'll talk about how they minimize the amount of unnecessary work they do when they build these tries. And then we'll talk about how they have a fast path for sparse data, sparse tries. There's another optimization they do with address shifting, which is basically a way for them to represent things with smaller numbers at the different levels. So the first optimization is what they call tagged pointers. We saw this technique before, when we talked about the way Hyper represented pointers in its chaining hash table for joins.
So remember, on x86-64, even though we have 64-bit pointers into memory, the hardware from Intel and AMD only uses 48 bits for the address. That means you have 16 bits where you can stash whatever you want, and you just strip them off before you follow the pointer. So what they're going to do is store some extra metadata about what the pointer is pointing to, so that the code knows: I'm following this address to the next level of my trie; what is the memory layout actually going to look like? Because the alternative is to have a lookup table that says, OK, you're jumping to this address, and here's the size of the chain, or here's some other metadata about it. Instead, they can represent it in just those 16 bits.

So the first thing they have is a single bit for whether the path is a singleton. That basically means, and I'll show this on the next slide, that there aren't additional levels in the trie as you do your join; you can jump directly down to the lowest level, the leaf nodes, so you have a single value to look at. You have another bit to represent whether the pointer is pointing to a node that's been expanded. They do this because they want to do lazy expansion; they want to lazily build the trie as needed rather than populate the whole thing all at once. Again, I'll cover this in a few slides, but basically, if I call CREATE INDEX on a table in any database, it's going to do a sequential scan and build the entire B+tree or hash index or whatever it is. In their world, because you're doing joins and this is an ephemeral data structure, you know there are going to be some paths, some parts of the data structure, that you're never going to access. So why build them? They build them on demand. Then they maintain, in 14 bits, the size of the chain that the pointer is pointing to, and this lets you allocate the memory you need for the next-level node when you do on-demand expansion. And the last 48 bits are just the memory address.

So let me go through the singleton and expansion optimizations again. With singletons, the idea is that if I know that from some point in my trie, as I'm doing my join, there isn't any branching, it's a straight path down to the actual tuple, then rather than storing the intermediate nodes down to that leaf, I just store a direct path to it. So in this case here, think of the first level as the hashes of the first attribute I'm joining on, then the hashes of the second attribute, and then these are the actual underlying tuples being represented. So for this hash of the first join attribute, value zero, there isn't any alternative; the other attribute is only ever going to be one. So it's a straight path. Instead of storing that intermediate node, I just have a direct pointer to the bottom, and I set a bit to tell me that the thing I'm about to point to is the actual data I'm looking for, not another node. Again, as you traverse a B+tree or a trie, you're obviously keeping track of what level you're at. If it were a balanced tree, you would know, I'm at the leaf node, therefore I can interpret the data I'm looking at a certain way. Otherwise you'd have to store metadata in the node itself to say, OK, this is a leaf node, or this is whatever.
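Here's a minimal sketch of that kind of tagged pointer layout: one singleton bit, one expansion bit, a 14-bit chain length, and a 48-bit address. The exact bit assignment below is my own guess for illustration; see the Umbra paper and tech report for their actual encoding:

```cpp
#include <cassert>
#include <cstdint>

// 64-bit tagged pointer: | singleton:1 | expanded:1 | chain_len:14 | address:48 |
// The tag bits ride along in the upper 16 bits and are masked off before use.
struct TaggedPtr {
    uint64_t raw = 0;

    static constexpr uint64_t kAddrMask  = (1ULL << 48) - 1;
    static constexpr uint64_t kLenMask   = (1ULL << 14) - 1;
    static constexpr int      kLenShift  = 48;
    static constexpr uint64_t kExpandBit = 1ULL << 62;
    static constexpr uint64_t kSingleBit = 1ULL << 63;

    static TaggedPtr make(void* ptr, uint16_t chain_len, bool singleton, bool expanded) {
        uint64_t addr = reinterpret_cast<uint64_t>(ptr);
        assert((addr & ~kAddrMask) == 0 && chain_len <= kLenMask);   // fits in 48 / 14 bits
        TaggedPtr t;
        t.raw = addr
              | (static_cast<uint64_t>(chain_len) << kLenShift)
              | (expanded ? kExpandBit : 0)
              | (singleton ? kSingleBit : 0);
        return t;
    }

    void*    address()   const { return reinterpret_cast<void*>(raw & kAddrMask); }
    uint16_t chain_len() const { return static_cast<uint16_t>((raw >> kLenShift) & kLenMask); }
    bool     expanded()  const { return raw & kExpandBit; }
    bool     singleton() const { return raw & kSingleBit; }
};
```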
So, by storing that single bit in the pointer itself, I don't have to do any lookup; I know that I'm jumping to the actual tuple and not another node in my trie.

The next optimization is lazy expansion. The idea here is that we assume the join is going to be very selective, so most of the time I'm not going to access most of the nodes in my trie at all. So again, why bother building them? In the very beginning, the only things they populate are the root node and the linked list of all the tuples at the bottom. Only when you do a probe into the trie as part of the join do they go back and populate things. So in this case, I do my lookup; for the first attribute, I want the value one. My iterator starts here, I look at the memory address, I look at the flag, and it tells me that the thing I'm pointing to is not expanded, and therefore it's going to be the last level. So when I come down here, I know I'm looking at this linked list of all the tuples that correspond to this hash value of the first attribute. So all I do is traverse along it and get the list of those tuples. And then I want to expand it, so that the next probe into my hash trie that comes along and goes down the same path doesn't have to do this again; I go ahead and build the intermediate node here, the second level. What they can do is use the length of the chain, which is stored in the pointer at the root, so I know the size of what I expect to see. So I can allocate the node at the very beginning, do the traversal, and populate it as I go along. Because otherwise you'd have to scan this linked list and not know the size until you finished scanning it. So after I do this, I go back and update the pointer in the root to set the expansion flag to one, so future probes use the expanded node directly. OK.

So that's pretty much all I really want to cover for the Umbra stuff and the leapfrog triejoin. In general, the challenge is that if you only implement multi-way joins, then in the cases where the intermediate result is not larger than the inputs, it's going to be slower. Again, there's a paper that just came out a month ago that says you can get the best of both worlds with these multi-way, worst-case optimal joins if you have efficient data structures. So again, this is still very early. The Umbra guys had to extend their optimizer to try to figure out explicitly: am I looking at a situation where I want to use a worst-case optimal join versus a regular binary join? Because if you get it wrong, the worst-case optimal join is going to be much slower than the binary join. So they use heuristics in the optimizer to help them figure this out. It's an additional consideration they make when generating the physical plan from the logical plan. And this paper is really the only one that discusses this; the other papers say, hey, you want to do this, but no one really solves the problem of when you should be doing it.

OK, so just to finish up. As I said, this is an active area of research, and I predict that in the next decade every relational database system is going to have to have support for worst-case optimal joins. Maybe not exactly the way we talked about it here today, but I think people will come up with a bunch of new ideas as they go along.
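As a footnote on the lazy expansion mechanism described above, here is a small, heavily simplified sketch of how a probe might drive it, building on the tagged-pointer sketch from earlier (the TupleChain, TrieNode, and build_next_level names are my own hypothetical stand-ins, not Umbra's code, and the "next level" here is reduced to a flat list of row ids):

```cpp
#include <cstdint>
#include <vector>
// (The TaggedPtr type from the previous sketch is assumed to be in scope.)

struct TupleChain { int64_t row_id; TupleChain* next; };   // tuples sharing this hash prefix
struct TrieNode   { std::vector<int64_t> rows; };          // stand-in for the next trie level

// Hypothetical helper: materialize the next trie level from the raw tuple chain.
TrieNode* build_next_level(TupleChain* chain, uint16_t chain_len) {
    auto* node = new TrieNode;
    node->rows.reserve(chain_len);                          // size known from the tag bits
    for (TupleChain* c = chain; c != nullptr; c = c->next) node->rows.push_back(c->row_id);
    return node;
}

// Probe one trie slot, expanding it lazily on the first visit.
const TrieNode* probe_entry(TaggedPtr& slot) {
    if (!slot.expanded()) {
        auto* chain = static_cast<TupleChain*>(slot.address());
        TrieNode* node = build_next_level(chain, slot.chain_len());
        // Flip the expansion flag so future probes go straight to the new node.
        slot = TaggedPtr::make(node, slot.chain_len(), slot.singleton(), /*expanded=*/true);
        return node;
    }
    return static_cast<const TrieNode*>(slot.address());
}
```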
Again, all these systems are going to need worst-case optimal joins, because if you want to support the property graph query extensions coming into the SQL standard this year, you need them; graph traversals are oftentimes looking at cycles in the data set. And if my prediction is true, that all these relational database systems end up supporting worst-case optimal joins and the SQL extensions for property graph queries, then in my opinion this is going to kill off the graph databases: Neo4j, TigerGraph, Memgraph, whatever. I mean, they're not going to die and be gone forever; they'll still be there, but they're not going to get the market penetration that relational databases have, because relational databases can do these things. OK. All right.

So next class; today's Wednesday, so Monday next week, we're going to read the Froid paper. The idea of this paper is taking UDFs, user-defined functions written in PL/pgSQL-style procedural SQL, and converting them into relational algebra, into SQL, that you can inline into the calling query. I've said before that I think this is one of the best ideas to come out of databases in the last decade, and it came from Microsoft. The Microsoft paper you're reading talks about these fantastic speedups. Sam and Kai have basically ruined the dream for me and poked holes in a bunch of the things they've done, which is quite disappointing. We will cover that, but I still think that at a high level the technique is sound, and some of you guys are actually working on similar things for your project three. But it is disappointing that Sam's got to ruin everything for everyone. So. OK.