Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience. This is going to be a lot different from the kind of joins you see in the intro class, and even from the last lecture we just had on parallel hash joins. We'll motivate why we want to do this, discuss one of the first incarnations of it, and then we'll see how the Germans improve on it. And I'll finish up with something from DuckDB that basically says you may not actually need any of this if you build other things correctly.

All right, so last class we spent the entire lecture on how to do hash joins as fast as possible, and the big focus was on how to run them in parallel. Again, not across multiple nodes, but across multiple workers running on the same box. It was really about trying to minimize the number of cycles per tuple, minimize remote memory accesses to different NUMA regions, just trying to run as fast as possible. And although there were some examples and results showing that the partitioned hash join is going to be superior to the other approaches, getting the engineering right for different hardware, different workloads, and different data sets is really challenging. So oftentimes you should just do a non-partitioned hash join, and that's going to be good enough to get you maybe 90 to 95% of the benefit without getting into the low-level performance details.

Again, what the last lecture covered is what's called a binary join: joining two tables. As you can imagine, these are super common in pretty much every relational database system, and there have been decades of research on making hash joins as fast as possible. As I said, I had a PhD student spend a little time trying to make the hash join go faster, and we were literally going from 12 cycles per tuple to 11 cycles per tuple. You're really getting down to the bone of bare-metal performance, where there's not much else to optimize.

So binary joins are going to be the preferred, better approach when we know that the output of the join operator is going to be smaller than its inputs. In this case here, we're joining R, S, and T. We take S and T, do a join, and produce 10 tuples. These are imaginary numbers; they're small, but think of everything as orders of magnitude bigger. The output is much smaller than the inputs. The problem, and what we're going to try to address today, is when the output of the join operator is larger than its inputs. Say, for whatever reason, based on the data and what we're joining on, this join operator now produces 1,000 tuples. This is the worst-case scenario for a database, because now we have to materialize that. And even though it produces 1,000 tuples here, when we join that output with R, we only produce 10 tuples. So we had a bunch of data that we materialized or synthesized from the join operator that we now have to deal with, even though we end up throwing most of it away. Another way to look at it is like this, now using actual tuples. Again, same query: we're trying to do a cyclic join between R, S, and T.
So the problem is that no matter what order I choose to join my tables, the intermediate result balloons up. If I join R and S first and then join T, this intermediate result gets bigger. Same thing if I join R and T and then join S: it balloons up. Same thing with S and T. The reason this is a problem, as I've already said, is wasted storage: we have to materialize these results somewhere in memory, and if they get too big we have to spill to disk. That'll typically be the local disk on the worker node; worst case we spill to S3 or the object store, which is even slower. Then obviously there's a bunch of wasted computation, because we're materializing tuples and passing them along, and when we do the next join against the last table they don't match and we throw them away. Ideally we want to identify, if we can, which of these tuples we don't actually need when we join the third table, or any additional tables, so we can avoid materializing them.

So that's the high-level problem we're trying to solve today: how to avoid this blow-up of intermediate results. What we'll see with the multi-way join is that rather than thinking of the join operator in terms of "I've got this table and this table, let me join them together," you think in terms of attributes. You don't care which table an attribute is coming from, whether it's table one, table two, table three, and so forth. You do a comparison on these attributes and then have that be what you synthesize as part of the output. That's the big picture of what the multi-way join is going to do for us.

So, first some background on what worst-case optimal joins are. I'm calling this lecture multi-way joins because that's the idea: we're joining more than just two tables. The class of algorithms is multi-way joins, and the category of implementations we'll see is called worst-case optimal joins, but the terms are typically used interchangeably. The idea of multi-way joins has existed in the literature since, I think, the 80s, but the worst-case optimal implementations came along later, in the early 2010s. All right, so first I'll highlight the idea of what worst-case optimal joins are, then we'll look at one of the earliest implementations, the Leapfrog Trie Join, and then we'll see the Germans' hash trie join, which is an optimization of the data structures used by the leapfrog join; at a high level the method is still the same. Then we'll finish up with some quick implementations from DuckDB, and then I want to briefly talk about how to do system profiling in a database system and cover the hardware counter stuff someone had a question about a few weeks ago. So again, the idea of a worst-case optimal join is that we want to join three or more tables at the same time.
And again, the way this works is that rather than taking an entire tuple from one table and an entire tuple from another table and comparing all the attributes in our join key, we instead grab a single attribute of our join key (the key could span multiple columns) from all the tables we want to join, mash those together, figure out whether we have any matches, and only proceed with additional comparisons for a given tuple if we know the subset of the join key values we've already looked at actually matches. We want to avoid the wasted work of doing a bunch of comparisons on things that are just getting thrown away. So as soon as we can identify that there isn't going to be a match for a given tuple as we're doing our join, we throw it away as early as possible.

As I said, the idea of a multi-way join has existed for a while, but the proof that you can have a worst-case optimal join, and I'll define what that means in a second, came from another group of Germans in 2008. I think one of the authors on that paper was Thomas Neumann's PhD advisor, but I might be wrong. The first two implementations of this were EmptyHeaded out of Stanford and a commercial system called LogicBlox, which was a Datalog-based reasoning database system: it used Datalog instead of SQL. Datalog is another declarative query language, but very few people actually use it other than LogicBlox.

What's going to be interesting about these worst-case optimal joins is that rather than thinking about the computational complexity in terms of the size of the input tuples, the performance is bounded by the size of the output and the number of attributes we're comparing against. It's hard to have an exact estimate, without selectivity information about the attributes, of how much longer things are going to take. With a hash join, for example, you know you have to build a hash table, there's a cost to that; you have to probe the hash table, there's another cost to that; and you can more or less ignore the selectivity of the hash join itself because you always do that work. In the worst-case optimal join, because we get to short-circuit comparisons once we know an attribute within a join key isn't going to match, the cost will vary. What's also interesting is that the more tables we throw at the worst-case optimal join algorithm, the better its performance gets relative to the input, because the idea is that we try to compare as much as we can all at once, rather than having these stages as in the binary join. Yes?

The question is: would everything I'm talking about still hold if the join key were a single attribute? Well, you wouldn't get the benefit of short-circuiting additional comparisons for additional attributes. Is it an easier problem if it's one attribute? You wouldn't get any benefit from some of the data structures they build, like the tries, or in the case of EmptyHeaded, the nested hash tables.
Yeah, I actually don't know the complexity of this thing if it's one attribute. I still think this is going to be better if the intermediate results balloon up, because again you're not materializing them, but all those optimizations we'll see from the paper you read, like the singletons and the fast paths down to the leaf nodes of the trie, obviously don't make any sense if there's only one level.

We'll see this a little later, and the Hyper paper, or rather the Umbra paper, you read talks about this as well: in the same way that we have to get the join order right for binary joins, like A join B join C, we have to get the ordering right in a worst-case optimal join. But the thing we have to worry about isn't the ordering of the tables, it's the ordering of the attributes. We want to do comparisons on the attributes we know are going to be most selective as early as possible, so that again we start throwing away data and not doing useless computation.

The definition that's floating around the internet comes from a professor up at Waterloo who's building an embedded graph database called Kuzu. Think of DuckDB or SQLite, but for graph databases. His definition is that a worst-case optimal algorithm is one where the worst-case runtime of the algorithm meets a known lower bound for the worst-case runtime of any join algorithm. I read this and thought, what the hell are you talking about? Because you're kind of using the thing you're trying to define in the definition itself. An alternative would be something like this: a worst-case optimal join is one where the runtime of the join algorithm is no worse than every other join algorithm when the query and the data represent the worst possible scenario. So if you have a situation where the intermediate result size is going to balloon up massively, think of something like a Cartesian product as the worst-case scenario, then we want to choose an algorithm that isn't going to be, you know, magically log n, but is still going to be better than all the other approaches with a binary join, no matter what ordering you pick for your tables in the binary join. To give a concrete sense of the "known lower bound": for the triangle query over three relations of size N, the largest possible output, and therefore the best runtime anyone can guarantee, is on the order of N^(3/2), and no binary join plan can promise that.

So again, that's why I called this lecture multi-way joins, because that name is easier to wrap your head around; "worst-case optimal" is a bit screwy, and don't take my word for it either. In this guy's blog article, he has an anecdote about meeting Don Knuth. He told Knuth he was working on worst-case optimal join algorithms, and Knuth was basically like, what the hell are you talking about, too. As he puts it: are they so good that they're optimal even in the worst case? Yes. So anyway, if you understand that, this is going to be the best approach for the worst-case scenario where, again, the intermediate results balloon up. That's the thing we want to solve.

So, as I said, there are not very many systems that actually implement this. LogicBlox was the one I mentioned before; we'll see it again in the comparison in the Umbra paper. It was an early reasoning system that was trying to do, again, graph traversals, and they had an early implementation, the leapfrog triejoin.
Umbra is what you guys read about. RelationalAI is the follow-up to LogicBlox: LogicBlox got acquired, all the key people left after the acquisition, and they built RelationalAI. It's written in Julia. It's a relational database system that puts a graph representation on top of a relational system, and they use their own variant, a newer version, of the leapfrog join. And Kuzu is the one out of Waterloo that I mentioned before.

So, the reason this is going to be important, and we're not going to go too deep into the newest SQL extension, SQL/PGQ, is that last year the new specification of the SQL standard came out, and in it there's this thing called SQL/PGQ. It's an extension, new capabilities in the SQL standard, that lets you define property graphs over relational tables and then do pattern matching or graph traversals directly in SQL. The extensions are inspired by Neo4j's Cypher, and I think TigerGraph has their own graph query language; there are bits and pieces of existing graph databases in there, but now it all exists in the SQL standard. The only system that actually supports this is Oracle; Oracle is on the standards committee. There's an experimental development branch of DuckDB that has some support for it, but the language is all slightly different. So we're going to need worst-case optimal joins in order to implement this efficiently, because these queries are graph traversals on relational databases. That's why this matters. Even though only Oracle supports it now, I think in the next five years every major OLAP database system will have support for this, and you're going to need a worst-case optimal join to make it run efficiently.

All right, so let's go to the first implementation, or one of the first implementations, the Leapfrog Trie Join, and then we'll see how the Germans extended it to make it run faster. The idea is that to do a multi-way join, the Leapfrog Trie Join assumes either that the data is already pre-sorted or that you build an index data structure on the join keys right before performing the join itself. As we've talked about, in our world we're accessing a bunch of Parquet files sitting in S3. Those things are unlikely to be sorted, and the Parquet and ORC file specifications don't let us store additional data structures, so either we pre-compute these things and store them as a bunch of separate files that we load back in, or we build these tries on the fly as we scan the data. In the Umbra paper you read, they do the same thing: they build the tries on the fly, but they do a bunch of tricks to do lazy evaluation, lazy materialization, of the data structure. Another way to think about it: when I call CREATE INDEX in Postgres or whatever, what happens? We scan the table and populate the index row by row. The Umbra guys are going to try to avoid having to do all of that up front; the leapfrog guys are going to build everything. So in this trie, as we'll see in a second, there's a separate data structure per table, per relation we're trying to join, and each level in the trie corresponds to one attribute in our join key. For each table, each level corresponds to an attribute that table has involved in the join operator.
So, as I said, LogicBlox had been doing this since around 2010; in 2013 or, I think, 2014 the paper came out. Now they have their new company, RelationalAI, and they supposedly have a better version of the leapfrog triejoin called the dovetail join. I can't actually figure out what they're doing, because there's a blog article and a five-minute YouTube video that don't go into anything deep, but they claim it's better than what these guys have.

So the way to think about this is: we're going to sort our data, or build an index for it, and then we need a way to iterate through all the tables we're trying to join at the same time and do comparisons across the attributes to see whether we have a match. And because things are sorted, we never have to backtrack on our join keys. So let's say we have three tables, x, y, and z. First step, we just sort them, so that's fine. For this demonstration I'm going to switch to a horizontal view and put gaps where values from the sequence, you know, zero to ten, are missing. At the very beginning we have an iterator for each of the three tables. In this case we're joining on one attribute, id. For x the first value is zero, so the x iterator points at zero; y points at zero; and z points at two. We start at the top and do comparisons against what the iterators are pointing at across the different tables, because we know what each of them is pointing at. If we find that the value our iterator is pointing at is less than what the other iterators are pointing at, then we know there isn't a match for us here, and therefore we leapfrog, or jump, to some other position in our value list that is equal to or greater than the maximum that everyone else is pointing at.

So in this case, zero is less than two, so we need to jump over and find the next value for x that is greater than or equal to two. Here it's three, so the x iterator jumps there and we update it. Now, because this guy did a jump, we come down to the next one and do the same comparison. The y iterator is pointing at zero; since zero is less than two and less than three, we need to jump to a position where the value is greater than or equal to three. Even though y has a two, we know there can't be a match there, because otherwise the other iterator would have been pointing at two as well, so we skip it and y leaps over to six. Same thing, come down here: z is at two, two is less than six and less than three, so we need to find something greater than or equal to six, which is eight here. Then we loop back around and do the same thing: now the x iterator can jump to eight, the y iterator can jump to eight, and lo and behold, we have a match. So at a high level, this is conceptually what we're trying to do. Now obviously the devil is in the details, because how am I doing that jump? A sequential scan would be stupid and slow, and I'm also only showing how to do this for a single attribute. The way they're going to represent these values to support this is with a trie.
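To make the leapfrog step concrete, here is a minimal C++ sketch of the single-attribute case just described. The names (LeapfrogIter, seek_geq) and the use of sorted std::vectors are my own illustration, not the actual LogicBlox implementation; the real algorithm also keeps the iterators ordered by their current key and advances the smallest one, whereas this sketch just re-seeks everyone to the current maximum each round.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// One sorted column of join-key values for a single relation.
struct LeapfrogIter {
    const std::vector<int64_t>* vals;  // sorted, duplicate-free for this sketch
    size_t pos = 0;

    bool at_end() const { return pos >= vals->size(); }
    int64_t key() const { return (*vals)[pos]; }

    // "Leapfrog": jump to the first value >= target (binary search, not a scan).
    void seek_geq(int64_t target) {
        pos = std::lower_bound(vals->begin() + pos, vals->end(), target) - vals->begin();
    }
};

// Emit every value present in all relations (the single-attribute intersection).
void leapfrog_join(std::vector<LeapfrogIter>& its, std::vector<int64_t>& out) {
    while (true) {
        // Find the current maximum across all iterators.
        int64_t max_key = INT64_MIN;
        for (auto& it : its) {
            if (it.at_end()) return;
            max_key = std::max(max_key, it.key());
        }
        // Ask every iterator to leap to a value >= max_key.
        bool all_equal = true;
        for (auto& it : its) {
            it.seek_geq(max_key);
            if (it.at_end()) return;
            if (it.key() != max_key) all_equal = false;
        }
        if (all_equal) {                    // every relation contains max_key: a match
            out.push_back(max_key);
            for (auto& it : its) ++it.pos;  // advance past the match
        }
        // Otherwise some iterator jumped past max_key; loop and try again.
    }
}

int main() {
    // Roughly the values from the example above: the only common id is 8.
    std::vector<int64_t> x{0, 3, 6, 8}, y{0, 2, 6, 8}, z{2, 8, 9};
    std::vector<LeapfrogIter> its{{&x}, {&y}, {&z}};
    std::vector<int64_t> out;
    leapfrog_join(its, out);
    for (auto v : out) std::printf("%lld\n", (long long)v);   // prints 8
}
```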
So everyone here knows what a trie is, because that was Project #0, so I'll skip what a trie is. Actually, the guy who coined the name, Edward Fredkin, was faculty at CMU. I think he just died last year. So the trie guy died. Okay.

So we need to build a trie for every single table, where each level in the trie represents a single attribute. This is slightly different from the trie representation we usually think of in databases as a replacement for a B+tree, where you take a string value, break it up into digits, or radixes, and store those one per level. Here we store the entire attribute value in a node, at the level corresponding to that attribute. But we don't store any duplicate values: if we see the same value for a given attribute over and over again, we keep one instance of it, with multiple pointers coming out of it for the different sub-values of the next attribute. Again, I'm going to flip it on its side to make it easier to visualize. We have two attributes, A and B. In the first step we want to add entries in our trie for these three tuples that all have A equal to zero. We always start at the root, traverse down, and create the zero node. Then we come down to B, and for all the B values that correspond to A equals zero, we have edges coming out of that node with those three values. Then we just go down the line and do the same thing for A equals one and A equals two. So anything at this first level, ignoring the root, corresponds to attribute A; everything below it at the second level is attribute B; and depending on how many attributes are in my join key, this keeps going down and down. And in the leaf nodes, when they have the same parent, they're kept in sorted order.

All right, so now let's put this together in a complete example and do a join between R, S, and T. I've already built the tries for the three tables, and in our join we're joining on R.A = T.A, R.B = S.B, and S.C = T.C. In this case no single table has all of the attributes we need to compute the join. Assume the optimizer has figured out that the optimal evaluation ordering for the attributes is A, B, then C. Say we start with table R: start at the root, traverse down; the first level is A, and the first value we see for A is zero. We can use that value to do a lookup in table T, since our join predicate is R.A = T.A. So we enter at the root of T's trie, come down to the first level, and there's a matching value, A equals zero. Now what we need to do is traverse down. Since we know we have a match for R.A = 0 and T.A = 0, we go to the level below and start comparing the values for the next attribute. In this case the next level for table R is B, and when we go down the left side, the first value we hit is zero.
So we can use that now as the probe into the trie for table S, because we're trying to do R.B = S.B per the WHERE clause. Same thing: we enter the trie for S, traverse down, and find a match on B. Great. So now we know we have a match, at least on these attributes, for R.A = T.A and R.B = S.B. The last step is to do the comparison for S.C = T.C. To do this, we just put iterators on the lists below the nodes we already matched in the tries for T and S, scan along, and accumulate all the C values from those two tables. Now we have two sets; we just compute the intersection, and that tells us whether we have matches. And we know how to fill in the values for A and B because we know how we got into the tries in the first place: we had A equals zero, a match there, then B equals zero, a match over there. So when we fill out the output we know what A and B are; we're just doing the intersection over C. Yes?

The question is whether C could sit above B in the trie. No, you can't do that. Why? Because there's a global evaluation ordering, A, B, C. So in my trie, even though a given table doesn't have all the attributes, it still has to follow that ordering: B has to come before C. The order is determined by the query optimizer before you start running this, the same as join ordering when you're doing binary joins. Yes? This question is: is it okay to have one trie per table, or do you need one trie for every possible join ordering? That brings up a good point. I think the EmptyHeaded paper, and I think this paper, say you'd want to pre-compute these ahead of time, so you'd have to pre-compute all possible join orderings. The Umbra guys claim, and I think they're correct, that that's super wasteful and you should just build it on the fly; that's better than trying to pre-populate everything.

Okay, so we found a match with A = 0 and B = 0 and got all the C's for that. Now all we need to do is go back to our B iterator in table R and move to the next value. Same thing: probe into table S, follow the path, get the B. Now we have an iterator for the C values over here. For this one, because we're still under the same A value, we don't need to switch to another leaf node; we just restart at the beginning of our linked list here. Same thing, scan along. This time we end up with one entry for S.C and three entries for T.C (you could obviously cache these, because you know they're going to be the same every time), compute the intersection, and we end up with one match. Then do the same thing: move to the next one, traverse down into S, get our sets of C values, intersect, and produce the tuple. At this point, since we've exhausted all the B's, we're done with this A, so we go back up to the root and come down the next branch. Now we get A equals one, probe into the trie for T, A equals one, we've got a match. Same thing: scan along and do the intersection. Same with two, and so forth. Not so bad.
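Here is a rough sketch of that trie walk: one trie per table, one level per join attribute, following the global ordering A, B, C. The TrieNode layout and the way matches are enumerated are my own simplification (ordered std::maps instead of the sorted arrays and iterators a real implementation would use); it's only meant to show the probe-then-intersect pattern, not to be efficient.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

// One trie level per attribute in the global ordering; leaves have empty children.
struct TrieNode {
    std::map<int64_t, TrieNode> children;   // sorted, duplicate-free per level
};

void insert(TrieNode& root, std::vector<int64_t> key) {
    TrieNode* n = &root;
    for (int64_t v : key) n = &n->children[v];   // one level per attribute value
}

// R(A,B): levels A then B.  S(B,C): levels B then C.  T(A,C): levels A then C.
// Query: R.A = T.A  AND  R.B = S.B  AND  S.C = T.C, attribute order A, B, C.
void triangle_join(const TrieNode& R, const TrieNode& S, const TrieNode& T) {
    for (auto& [a, r_b] : R.children) {               // candidate A values from R
        auto t_it = T.children.find(a);               // probe T on A
        if (t_it == T.children.end()) continue;
        for (auto& [b, r_leaf] : r_b.children) {      // candidate B values under this A
            auto s_it = S.children.find(b);           // probe S on B
            if (s_it == S.children.end()) continue;
            // Intersect the C values below S[b] and T[a].
            for (auto& [c, unused] : s_it->second.children) {
                if (t_it->second.children.count(c)) {
                    std::printf("match: A=%lld B=%lld C=%lld\n",
                                (long long)a, (long long)b, (long long)c);
                }
            }
        }
    }
}

int main() {
    TrieNode R, S, T;
    insert(R, {0, 0}); insert(R, {0, 1}); insert(R, {1, 0});   // R(A,B)
    insert(S, {0, 5}); insert(S, {1, 5});                      // S(B,C)
    insert(T, {0, 5}); insert(T, {1, 7});                      // T(A,C)
    triangle_join(R, S, T);   // prints matches (A,B,C) = (0,0,5) and (0,1,5)
}
```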
So, related to his point: either pre-computing these tries or building them on the fly every time you want to do this join is going to be expensive. Always think in extremes: if you have billions of tuples, having to build this trie over every single table every single time is going to be slow. And in our world we assume the data set is read-only, whereas the Umbra guys also talk about supporting incremental updates. Again, trying to figure out all possible join orderings ahead of time, materializing them, and fetching them in from disk every time you do a join is impractical. Yes? The question is, what do I mean by building one for every possible join order? Wouldn't there just be one ordering of the attributes? No. In my example ABC was the optimal ordering given the data, but what if I added a bunch of WHERE clauses or predicates that filter on A, B, or C, on R, S, and T, before I join? Now the ordering can be completely different from one query to the next. So trying to figure that out for every possible combination is wasteful.

The EmptyHeaded approach from Stanford claims that something better than this trie is to just use nested hash tables. But this is going to be expensive as well, even if you build them on the fly, because the hash table, despite how great it was for the binary join we saw last class, is really expensive in this world: you're doing so many different hash lookups over and over again, and a lot of that work is wasted. The Umbra guys would argue that if you use hash tables for this, you have to do at least one key comparison on every lookup to see whether you have a collision, you still need to store the actual keys and the pointers to the tuples to deal with collisions, and now you're thrashing your CPU cache because you're jumping to random locations over and over again across all these hash table data structures. In the case of the binary hash join, it's one hash table. It can be big enough to pollute my CPU cache, but there's still some locality there, because I'm not traversing different paths and reading a bunch of random things all the time. The other argument they make is that if you have variable-length keys, like strings, then you need dictionary encoding to keep things nicely aligned in the data structure, and that potentially means doing lookups in the dictionary to figure out what the actual value is when you need to do deeper comparisons.

So these are all the flaws of the early worst-case optimal join approaches that the Umbra guys are trying to fix with their implementation. The key idea is that it's basically the leapfrog trie join we just saw, but instead of storing the actual attribute values in the trie, they store hashes of the values, just 64-bit integers. The idea is that that's good enough to do a quick comparison to see whether two values could even match at all.
So then we throw away as much as we can without having to go do deeper investigations, reading the actual data, to see whether there's really a match. Again, the point is to make that first peek, checking whether two attributes could be the same, as cheap as possible, because you know you're going to throw most things away.

In the trie itself, each node is another hash table, and they do some tricks with how things are laid out in arrays to make lookups inside it fast. The hash table maps a hash value for a given attribute to a pointer into the rest of the trie data structure, and that pointer is either a child node or a pointer to the actual tuple represented by that value. And because everything is hashes, 64-bit integers, we don't need any additional logic in our lookups and insertions when we build this trie to deal with the different data types we could have. It goes back to the code specialization idea, but rather than code-genning stuff at the beginning, generating code and then compiling it, they just make sure the data itself is always one data type, so you can have one implementation with no indirection, no lookups, no branching to deal with different possible data types. Obviously, since it's hashing, we can have false positives. They argue that with something like MurmurHash, and I think they mention AquaHash or xxHash from Facebook, most of the time you're not going to have collisions, and if you still do, at the very end you just check whether the tuples themselves actually match, since matching hashes don't guarantee matching values.

So this is the diagram from the paper. This is the data structure they propose, and I'm going to go through a bunch of the optimizations they have in here. Again, the trie itself is not fancy. They have another data structure called ART, the Adaptive Radix Tree, that was in Hyper; that one has different node allocations of different sizes, and I don't think they're doing any of that here. The real magic is in how they store the pointers and do lazy materialization. You always have to build the root of the trie, and the entries are 16-byte buckets: 8 bytes, 64 bits, for the hash, and then 64 bits for the pointer, either to the actual tuples themselves or to the next level in the trie. But as I said before, the Germans like sticking things into the unused bits of pointers, and that's the key thing they're doing here. So I want to go through a couple of optimizations around these tagged pointers and then how they do the late materialization, because to me that's the really clever part of what they're doing. The leapfrog trie join itself had already been proposed; they're making it work efficiently. As I said, Germans love sticking things in pointers. We said before that x86-64 only uses 48 bits for memory addresses; the hardware ignores the other 16 bits. Since you have to allocate 64 bits anyway, they want to put something in there. So within the pointer itself, they use those 16 bits to record three additional things.
So the first is a single-bit flag that says whether something is a singleton, meaning there isn't anything in between this node and the bottom node; it's a direct path to the leaf nodes. Then they use another bit as an expansion flag, which says whether the nodes below have been allocated and expanded, because they're doing lazy materialization. So even though the data structure has a pointer to something at a lower level of the trie, they don't actually materialize that level until you try to look it up. This flag just says: hey, by the way, you're about to jump to a location that hasn't been expanded or allocated yet, so go do that first, then flip the bit, then traverse down. And to know how to expand it, they maintain 14 bits for the chain length, so that when you traverse along the leaf chain you know how many elements to expect. Since everything is fixed length, always 16 bytes, the size of the chain can be computed from this counter. The rest is the 48 bits the hardware actually uses for the memory address. We saw this trick before with the hash table, where they stored a Bloom filter in the 16 bits; there was another example from the Germans doing this too that I'm blanking on.

So let's go through how the singleton stuff and the expansion stuff work. The hash tables in the trie get smaller and smaller as you go down, because often, for a given value of an attribute, there isn't much duplication below it, so you end up with paths through your trie where each node has only one entry. The idea is that instead of storing a whole separate hash table node in the trie for a node that has only one entry at a level, you bypass it and skip straight to the bottom. So in this case, when the hash value is zero and we jump down here, there's only one entry inside. So rather than storing that intermediate node, following the pointer down, finding only one thing, and traversing further, they instead have a fast-path pointer that takes you directly to the bottom. You set the singleton bit to one so that, at the root, you know there isn't anything else below you and you can jump right down to the node at the bottom. Now you obviously still need to store the information that was in the skipped node so you can do the comparison and see whether you actually have a match, but that's done down at the leaf node. Yes? The question is whether the singleton bit can be used anywhere down the trie, at the moment you know a node has no more children. Yes: it can be set at any level, and whenever you see it, you know the pointer you're about to follow takes you straight to the bottom.
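Here is a hedged sketch of what that pointer tagging could look like. The exact bit layout (which bits hold the singleton flag, the expansion flag, and the 14-bit chain length) is my assumption, not the layout from the paper or Umbra's code, but it shows the mechanics of stuffing metadata into the 16 unused bits of an x86-64 pointer.

```cpp
#include <cassert>
#include <cstdint>

// A 16-byte hash-trie bucket: 64-bit hash of the key, 64-bit tagged pointer.
// Tag layout (my assumption, not Umbra's actual layout):
//   bit 63      : singleton flag  (pointer goes straight to a tuple at the bottom)
//   bit 62      : expanded flag   (the child node below has been materialized)
//   bits 48..61 : 14-bit chain length (how many tuples hang off this entry)
//   bits 0..47  : the real 48-bit virtual address
struct Bucket {
    uint64_t hash;
    uint64_t tagged_ptr;
};

constexpr uint64_t kSingletonBit = 1ull << 63;
constexpr uint64_t kExpandedBit  = 1ull << 62;
constexpr uint64_t kAddrMask     = (1ull << 48) - 1;

inline uint64_t make_tagged(const void* addr, bool singleton, bool expanded,
                            uint16_t chain_length) {
    assert(chain_length < (1u << 14));                 // must fit in 14 bits
    uint64_t tag = reinterpret_cast<uint64_t>(addr) & kAddrMask;
    tag |= static_cast<uint64_t>(chain_length) << 48;
    if (singleton) tag |= kSingletonBit;
    if (expanded)  tag |= kExpandedBit;
    return tag;
}

inline void*    addr_of(uint64_t t)      { return reinterpret_cast<void*>(t & kAddrMask); }
inline bool     is_singleton(uint64_t t) { return t & kSingletonBit; }
inline bool     is_expanded(uint64_t t)  { return t & kExpandedBit; }
inline uint16_t chain_len(uint64_t t)    { return static_cast<uint16_t>((t >> 48) & 0x3FFF); }
```

Note that user-space addresses on x86-64 have the top bits zero, so masking the tag back out before dereferencing is enough for this sketch; kernel-space addresses would need sign extension.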
So the next optimization is the lazy child expansion. The idea here is that unlike in the LogicBlox multi-way join, the worst-case optimal join where they populate the entire trie before they even start joining, here you populate just the root node, and obviously you have the tuples at the bottom. Then, when you do the join itself, if there are never any comparisons along some path in the trie, why instantiate the memory for it, why allocate it at all? Only when you actually need it do you populate it. So there's a limit on the overhead of building the trie at the beginning, because you're only building the first level and the bottom level.

The way it works is, say at the very beginning my trie looks like this. This is kind of confusing here, but think of the bottom as a linked list that tells you the ordering of the tuples. Now say someone comes along and tries to do a lookup down this path. We're trying to join on two attributes, so I'm missing that second level. I look inside this bucket and I see the expansion bit is set to zero, so I know nothing below me has been materialized yet at this point. So I fast-path down to the bottom, do the deep comparisons, and scan along the leaf list to find what I want. But then I go ahead and populate what the values actually were, and I know how many things I should be looking at, because when the expansion bit is set to zero the chain length tells me how many entries to look at at the bottom. So I can allocate that node, put it here, update the new node's pointers to point into the different parts of the list at the bottom, and then flip the bit to one. Now anybody else who comes along and follows the same path knows they're looking at expanded nodes below me, not a direct path to the bottom.
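Continuing the sketch from above, and reusing the Bucket helpers from that snippet, the lazy expansion then looks roughly like this: on the first traversal through an unexpanded entry, you build the missing child node from the chain of tuples at the bottom, then flip the expanded bit so later probes skip the work. The Tuple and ChildNode types and the expand/descend names are hypothetical simplifications of mine, not Umbra's code (a real node would keep every duplicate and use the hash-table layout described above).

```cpp
#include <cstdint>
#include <map>

struct Tuple {                       // leaf-level tuple in the bottom chain
    int64_t next_attr;               // value of the next join attribute
    Tuple*  chain_next;              // link to the next tuple in insertion order
};

// Simplified child node: next-attribute value -> first matching tuple.
using ChildNode = std::map<int64_t, Tuple*>;

ChildNode* expand(Bucket& b) {
    // Walk the chain of tuples the unexpanded pointer refers to and build the
    // child node for the next attribute level.
    auto* child = new ChildNode();
    Tuple* t = static_cast<Tuple*>(addr_of(b.tagged_ptr));
    for (uint16_t i = 0; i < chain_len(b.tagged_ptr) && t != nullptr; ++i, t = t->chain_next) {
        child->emplace(t->next_attr, t);   // real code would keep all duplicates
    }
    // Re-tag the bucket so later probes go through the expanded node directly.
    b.tagged_ptr = make_tagged(child, /*singleton=*/false, /*expanded=*/true,
                               chain_len(b.tagged_ptr));
    return child;
}

ChildNode* descend(Bucket& b) {
    if (is_expanded(b.tagged_ptr))
        return static_cast<ChildNode*>(addr_of(b.tagged_ptr));
    return expand(b);                // the first visitor pays the materialization cost
}
```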
Yes? The question is: why is this an optimization? Don't we want to do as much work as possible on the build side to make the probes as fast as possible? Yes, but you're trying to join three or more tables all at once, so there's going to be a lot of memory pressure from that data structure; think of trying to build complete hash tables for all three tables at the same time. That would be super expensive. Again, think in extremes: each table is 10 terabytes or 10 petabytes. This is just trying to minimize the amount of work, minimize the explosion of memory and storage for your data structure, for parts you're never actually going to need.

Would I mind explaining the diagram? Yeah, this is sort of unclear. This part goes down here, this goes over here, and think of this as, again, the linked list for how to follow along to the rest of the tuples. Actually, this is not in sorted order: one, three, one, two, two, three. This is how the original tuples appeared, and you keep them in that order but store linked lists over them. I think you have to do one pass over the data in the beginning anyway, because you have to hash it and figure out what goes in the root, and I think that's when they construct this linked list. I'd have to double-check that.

All right, so I'm going to show one graph from their paper. They're comparing against EmptyHeaded, the early prototype from Stanford; LogicBlox; the original version of Umbra with the leapfrog trie join from the LogicBlox guys; and Umbra with their hash trie. They're computing a three-clique query, a triangle sub-graph, over graph data sets from Google+, Orkut, and Twitter. Google+ was early Google's attempt at doing Facebook; Orkut was the Brazilian Facebook, also Google's. The main takeaway is that for the larger graph data sets, and in the case of the Twitter one I think the graph is highly connected, building those data structures at the beginning is super expensive and those systems end up just timing out, whereas the late materialization shows its real benefit: yes, it may not run as fast as if you had built everything ahead of time, but you don't have all that memory pressure from maintaining the data structure to do the join. So again, the Umbra hash trie is preferable to the leapfrog trie, and the real comparison is this one versus this one, because EmptyHeaded is a prototype and LogicBlox is the only commercial system at the time they compared against. This is what they care about: if you have Germans building your hash trie and Germans building your leapfrog trie, then you know the hash trie is better.

The challenge, though, and the paper brings this up, which is a good point, is that you still need binary joins. They had a bunch of experiments showing that for workloads like TPC-H and the Join Order Benchmark, where you're mostly doing plain binary joins, the multi-way join is actually not going to be as performant as the binary join. It's only in the cases where the intermediate result size blows up that you want to use the worst-case optimal join, as we defined at the very beginning.
So what you really want is a system that can support both, and then when a query shows up, be able to determine which joins within your query plan should use one algorithm versus the other. It's no different from figuring out whether to use a sort-merge join, a hash join, or a nested-loop join (which you typically don't want): you want your optimizer to figure this out. In the paper they talk about how Umbra extended their optimizer with heuristics to decide on the fly, based on the statistics they've collected, whether to use one versus the other. No other system does this. I'm not trying to advertise for Umbra, but LogicBlox only did the multi-way join, and I think Kuzu only does the multi-way join.

Student comment: the way they describe it in the paper, it seems pretty easy to do, because there are a lot of cases you can decide directly. Yes, but nobody else does it. For example, joins on primary keys, where you know the output can't blow up, those can stay binary. Right: if you do an equi-join, an inner join on primary keys, you know it's not going to blow up. But like he was saying, it's a bunch of heuristics that figure things out. One of the cases where they say "I know I want to use a binary join" is exactly that. The problems come up when, again, you're doing graph traversals with a lot of self-joins, because you're joining on the edge table over and over again, or when you're joining on attributes that aren't a foreign-key/primary-key pair, where things can blow up. Those aren't as common as foreign-key/primary-key joins, but they exist often enough that the binary-only approach falls apart.

Okay, so I'm going to quickly finish up and talk about one additional optimization from the DuckDB guys. It's the guy who wrote the FastLanes paper and the people at CWI; they have an experimental branch of DuckDB where they added support for the SQL/PGQ extensions on a relational database system. And in this great paper, which I can post on Piazza, they basically lay out all the things you'd want in a relational database system to officially support SQL/PGQ queries, and they basically opine that all the specialized graph databases out there, Neo4j and so forth, are fundamentally flawed because they're based on storing edges and vertices in inefficient data structures that don't take advantage of the last 10-20 years of developments and optimizations we've been talking about in this class to make relational queries run faster. So independent of the worst-case optimal join, there's a bunch of stuff we've already covered, like vectorization, better query optimization (which we'll cover later in the semester), compression, all the things you need to make a graph query run fast, and the existing graph databases basically ignored all those developments, went down their own path, and they're going to lose out to relational databases. And I agree with that statement. So I'm going to show one optimization that we haven't really covered; it doesn't really fit into the other parts of what we've talked about, and you kind of need to understand that it's when you get the balloon-up of intermediate results in these graph or triangle queries that you actually want to apply this technique;
for plain binary joins it doesn't make much sense. So the technique is called factorization, and the idea is really simple: rather than materializing duplicate tuples over and over again for a join or whatever operator you're generating output from, you just keep the actual unique values and maintain a counter column that says how many times you've seen this tuple. Going back to my example from before, where I was doing those joins and the intermediate result was blowing up: instead of materializing all those results, I could store them in factorized form with a counter. But now the challenge is that, in my implementation, all the operators in my system need to be aware that they're operating on factorized tuples and be able to account for that. If I'm running a count query, this can't be something internal that just gets synthesized and then treated like any other column; the system needs to know this is a counter column and adjust its computations accordingly (there's a small sketch of this below). Simple trick. Nobody does this, but again, I think it's something the relational guys will eventually have to add.

So here are some graphs from that paper I mentioned from the DuckDB guys. They're comparing Neo4j against the extended version of DuckDB with SQL/PGQ, and I think the only new thing they implemented is the worst-case optimal join; they obviously already have vectorized execution, compression, and all the stuff we've talked about so far. And then they compare against Umbra with its hash trie. The main takeaway is that Neo4j basically gets crushed; these are all log scale. They're running the same queries from the LDBC benchmark, which is something CWI created together with the other graph database vendors: a bunch of workloads doing pattern matching on graph structures. And going down the line for different scale factors, Neo4j gets crushed. I'm not trying to dunk on Neo4j, but that's the oldest graph database, they've raised the most money, probably like $200 million, and when people think of graph databases they think of Neo4j. And for millions and millions of dollars, they're getting crushed by a rag-tag group of Germans (although they're the best Germans) and the DuckDB people. That's because these systems, even though they were not originally designed for graph analytics, take advantage of all the optimizations we've talked about so far, plus the worst-case optimal join, and they can crush Neo4j. Neo4j, as far as I know, stores the vertices in one data structure and keeps track of the edges in a separate data structure. Yes? Sorry, the question is why Umbra is slower at scale factor 100, why it's faster here than here. I don't know; I'd have to go look.

Okay, so I think this is both a motivation for why you want worst-case optimal joins for these kinds of queries and also for why you don't want to use a specialized graph database. This is a very active area of research, and as I said, only a small number of systems actually support worst-case optimal joins, but I think that's going to change over time. New papers are coming out all the time; there was a new one out of University College London with their Sonic join, and it beats the hash trie join. I can post a link to that. People are actually working on this, trying to make it go faster.
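Going back to the factorization idea from a moment ago, here's a minimal sketch of the representation: instead of materializing the same intermediate tuple k times, you keep one copy plus a multiplicity column, and an aggregate like COUNT(*) has to sum the multiplicities instead of counting physical rows. The struct and function names are my own illustration, not DuckDB's implementation.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// A factorized intermediate result: each distinct tuple appears once,
// together with how many times it would appear if fully materialized.
struct FactorizedRow {
    std::vector<int64_t> values;   // the distinct tuple
    uint64_t             count;    // multiplicity (how many duplicates it stands for)
};

// COUNT(*) over a factorized result must sum the multiplicities,
// not count the physical rows.
uint64_t count_star(const std::vector<FactorizedRow>& rows) {
    uint64_t total = 0;
    for (const auto& r : rows) total += r.count;
    return total;
}

// Building the factorized form from a materialized result: collapse duplicates.
std::vector<FactorizedRow> factorize(const std::vector<std::vector<int64_t>>& rows) {
    std::map<std::vector<int64_t>, uint64_t> counts;
    for (const auto& r : rows) ++counts[r];
    std::vector<FactorizedRow> out;
    for (auto& [vals, cnt] : counts) out.push_back({vals, cnt});
    return out;
}
```

In a real system you would never materialize first and collapse afterwards; the join operator would emit the multiplicities directly, and every operator above it, not just COUNT, has to understand the counter column.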
And I think industry is typically three, four, five years behind academia on this kind of stuff, but now, with the SQL extension for graph queries, this will start rolling out in more systems. And like I said, once you support SQL/PGQ, why would you want to use a graph database?

Before we jump into the system profiling stuff: again, on Wednesday we have project presentations. Everyone gets five minutes, and we're going to try to be more strict on time so everyone gets through. We'll go in the reverse order from last time, as I promised. And we're going to record the talks so that William and I can watch them again and provide you notes and feedback, because I didn't do that last time; we lost track of everything because there was so much. We'll post them on YouTube and give you feedback this weekend. Is that good?

All right, so this is a quick run-through of how to do database system profiling. These slides are a few years old, but the techniques all still work. It's going to reference the previous system we were building, but the high-level ideas are the same, and this is all in C++. Okay. So say we have some program with two functions and we want to figure out how to speed it up using only a debugger. The really simple way to do this is open up gdb, run the program, and just pause it, stop it, print out the stack trace, figure out what function you're in, and record it in a spreadsheet. It's ghetto, but it would work. So say we do this and pause it ten times, grab the stack trace each time, and six out of the ten times we were in the function foo. We can then say that roughly 60% of our program's time, based on the data we collected, is spent in foo. It's crude, but it works, and you can just do it more and more and get better samples. Student: is this basically what perf does?
Yes, basically, but it has hardware support; it's not you sitting there hitting your keyboard like this.

All right, so if we say we're spending all our time in foo and we're going to make foo run faster, what does that get us? Well, this is Amdahl's law. Say we make foo run two times faster; what's the overall potential speedup? We've got 60% of our time in foo, which drops in half, and the 40% of time in the function bar we leave alone. Amdahl's law says the speedup is 1 / ((1 - p) + p/s), where p is the fraction of time spent in the part we're optimizing and s is how much we speed that part up. Plugging in p = 0.6 and s = 2 gives 1 / (0.4 + 0.3), so our program will run about 1.4x faster. Amdahl's law actually works, and you want to keep it in mind when you're deciding what to optimize.

So now the question is how to do something better than sitting there hitting the keyboard, and at a high level there are two approaches: Valgrind and perf. Valgrind is heavyweight instrumentation of the actual binary that basically introduces timers, if you will, around different function calls, collects this while the program runs in user space, and at the end spits out a report you can visualize to figure out what's going on. Is this like code coverage? More or less, but code coverage tells you which lines are being executed; it doesn't tell you where you're spending the time, and knowing where the time goes is the whole point here. Perf is basically a better version of this that uses hardware counters. The CPU maintains these counters for everything: L1/L2/L3 cache misses, branch mispredictions, cycles per instruction, number of instructions, way more things. The hardware is collecting all this information, so you can actually get it for your program while it's running, and if you compile with symbols you can then see in the perf report the actual lines of code, how many times you're executing them, and how much time you're spending in them. Valgrind is what you used back in the day; sometimes its visualization is a lot better, depending on the tool and what you're trying to do in a modern system, but it's good to at least look at both.

Valgrind is actually a collection of tools for dynamic analysis. Memcheck is for finding leaks; Callgrind is what you want for figuring out how much time you're spending in different parts of the code; and if you want to track which parts of the code are allocating the most memory over time, you'd use the tool called Massif. To use Callgrind, you run your program under the valgrind command line, telling it you want the Callgrind tool, with additional flags for how verbose or detailed you want the report to be. That spits out a callgrind.out file, and then you can use a visualization tool like KCachegrind to see something like this: a nice view of all the functions, how many times each has been invoked, and what percentage of the program's execution it accounts for, so you see the distribution of all the time being spent in different parts of the code. When you call pre-compiled libraries that you don't have symbols for, you might just see the library name and a memory address.
There are ways to deal with that if you can use libraries you compile yourself. And then here we see the call-graph view, and you can drill into each of these and see additional information. But again, this is all done while your program is running in user space; it doesn't have any special privileges, there's no hardware support making things run better, so your program is actually going to run slower. The timing can be quite off: the wall-clock time when you run with Callgrind turned on versus just running by itself can differ. And this matters a lot for race conditions and other things: you may not experience the problems you'd see when you run without Callgrind, or there could be issues in this run that you wouldn't see in the production run.

So the better approach is to use perf. For this one I think you need root privileges, because you need permission to read the counters from the hardware. The basic idea is that you start your program under perf; you can specify how often you want it to sample events and how much detail you want in the traces, and there are a bunch of different flags for these things. Then it runs your program. I think it runs at about the same speed, but it has to materialize the results somewhere, so it starts writing to disk for its dump file, and if your program is sensitive to disk I/O this can interfere a little. But it's not as heavyweight as Callgrind. After you run your program, in the directory where you ran perf you'll see a data file with some kind of perf name, and then you just use the perf report tool, which gives you a visualization like this. This is actually measuring where the time is being spent, so you'll see a ranked list with the functions you spend the most time in at the top, and you can drill into them, and if you compiled with symbols you'll actually see the lines of code that generated these events: cumulative events and additional things; hit enter and you can go see where you're spending your time. In Callgrind you can also see the lines of code, but again, this is going to be way better.

This screenshot is probably three, four, maybe five years old, because before we named the system NoisePage we named it after my dog, so that's why you see the Terrier name in there. This is some benchmark we had to see how fast we could do reads across multiple threads. There are other third-party tools like Hotspot, where you can see nice things like this. You've seen these flame graphs? These flame graphs are generated by tools built on perf, and you can see where you're spending most of your time. And this part is just saying what additional things you want to measure; here I'm measuring cycles, last-level cache misses, CPU utilization, a whole bunch of stuff. You can tell perf what you want to collect. Okay, so there are a bunch of these links in the slides here. I'm assuming there are hooks to do this in Rust? What's that?
He says if you use cargo flamegraph you get a nice flame graph. I'm assuming that's running perf underneath the covers, but I think you need root privileges to run perf. You think you can actually just set an environment variable and it works? Yeah, I think for the low-level hardware counters you need administrative privileges.

Okay, any questions about performance counters? Yes? What do you mean, sorry... So the question is, how would you use perf to optimize one single function? You'd run it from the command line and get the perf report. If you have this one function, first you can find the function in here, then you can drill into it, and it'll show you the lines of code, how many times you're executing them, and how much time you're spending in them, and you can use that to figure out where you're wasting your time. It could be because you're calling malloc a lot, or something stupid you didn't realize you were calling, and then you can refactor and optimize that. You can count cycles, CPU utilization, cache misses; I don't know if you can get memory allocations from this. TLB misses would be another one you'd care about. The point I'm trying to make is that perf can record things about why your program is slow that Callgrind can't. Callgrind will tell you how much time you're spending and how many times you invoke a function, but it can't tell you why that function is slow other than you staring at the lines of code; perf gives you the low-level hardware things we care about.

Yes, and that's a good point: this basically records everything, and it'll be too much. So, to his point, you can put in hooks (I don't know if they're kernel programs or whatever) that tell perf to start recording now and turn it off when you leave the function. But just running this over everything first, or over a small portion of your system, will at least tell you the higher-level things to target, and then you can drill into those and ask why you're spending so much time or so many cycles in that part.

And yeah, symbols are separate from compiler optimizations. You need the symbols so that perf can show you the lines of source code; then there's an assembly view where you actually see the assembly. So if you want to start thinking about what -O3 did to change your beautiful Rust code into something bizarre, you've got to look at the assembly; there's no other way to do it. Question: is there a way to prevent -O3 from majorly rewriting your function so that you can debug it? In that case it might be doing the right thing; we can take this offline. Other questions? Awesome. Again, next class is presentations, and I'm happy to talk afterwards if you want. Make sure you send me your slides and your document before class starts. And I don't think it should be 60-degree weather in February, but enjoy it.