Yo, hey, yo, hey, yo, yo. Pack the chrome styles, fly like Mrs. Jones. Lyrical mathematics will have the devil's smoke and stone. I put heads to bed, lick shots, and rap let's fled with the church. All right, look at this. So today we're going to talk about parallel hash joins. This is the first of three lectures on how to do joins, high-performance joins. So this will be the first one, hash joins. The spoiler is, this is what you're going to want to use in almost any case. The next class after the mid-term, sorry, after spring break, will be sort-merge joins, and then we'll finish up with multi-way joins, which is a special case. All right, for the class project stuff, I posted on Piazza last week about project two and project three. I still need to upload the project three info page; I will do that tonight. For project two, if you haven't signed up for a database system yet, I haven't gone through and approved them; I'll do that tonight as well. Please sign up if you haven't. I will also put out a list of the ones that I think are interesting. Yes, question? It's what I talked about last class: what you're going to do, how you're going to do it, how you're going to test it. It's super high-level, nothing deep. No, it's not a three-page paper. It's slides, literally slides, for five minutes. And you can fake it by trying to get your laptop to work for two minutes, right? OK, that's not what I'm going to do. Also, I saw people signed up, and I started looking at teams. I've talked to some of you. I will reach out to individuals again tonight, and we'll start trying to coalesce everyone into groups and get them assigned to projects in time, OK? And I'll be on campus tomorrow; I think there's a faculty candidate talk in the morning, but I'll be around. If you need to talk, we can make arrangements, OK? Are there any questions about project two or project three? And project one was due last night. Some of you got 110%. Congrats. For those who didn't finish, please try to finish as soon as possible, OK? Or reach out to me if something else is going on. All right, so today's class, again, we're going to focus on hash joins. We're going to begin with the background of what we need to care about when we implement not just hash joins, but any join algorithm, or really any operator, in a database system. Then we'll focus on how we do the parallel hash join and define the building blocks, the phases, of doing the hash join. Then we'll talk about how you build the hash table: what the hash functions look like and what the hashing schemes look like. And then we'll finish up with the evaluation, really only two graphs from the paper you guys read, which was a deep analysis across a bunch of the different hash join algorithms that have been proposed in the last decade, trying to understand what makes one better than another. Because the earlier papers had contradicting results, sometimes saying one implementation was better than another. So the paper you read was trying to clear the air: within a single platform, a single system, a single implementation, which hash join approach is going to be the best? And the spoiler is that it's going to be the no-partitioning one.
So I'm going to teach you radix partitioning and other ways to optimize the partitioning phase. But in general, it turns out no partitioning is probably going to be the best thing to do, because getting the tuning right, the number of passes and the number of radix bits you want to look at, per hardware is nontrivial. So most systems, as far as I know, just implement the no-partitioning approach. OK. So today's lecture is about how to do hash joins on a single node. I don't have a slide discussing this, but again, we're focused on single-node execution at this point in the semester. We are aware that we could be running inside a distributed system, where there is some higher-level orchestration that is moving data around between nodes as needed. But today's class is really: somebody else has gotten the data to my node, and I'm going to do the join on the data that's local to me. So we're not going to worry about whether I have to go pull things over the network, whether someone's already done the shuffling for me, or did a broadcast join; all of that we ignore. The data is on my local box, I want to run the join to produce output, and where it goes next, at this point, we don't care. All right? And again, to emphasize: we're only looking at how to do joins between two relations, two tables, using multiple threads to parallelize the operation. We'll look at how to do multi-way joins, three or more tables, after spring break. So to do a join, there are basically two approaches; there isn't any magic to this. It's going to be either a hash join or a sort-merge join. The basic idea is that we want to avoid doing brute-force sequential scans over our two input tables over and over again. So we either build a temporary data structure, like a hash table, to help us do quick lookups to find matches, or we sort the data and then just walk both sides with iterators, checking across. We're not going to discuss nested loop joins, because in an OLAP system you almost never want to do this: the data you're trying to join is going to be large, and chances are you're not going to have an index already built on it. All right? But if you cannot de-correlate a subquery, then yes, you have to do a nested loop join. We can ignore that for now; we'll come back to it when we talk about nested queries and query optimization later on. But best-case scenario, assume we can do a hash join. Now, for OLTP systems, a lot of them don't actually even implement hash joins. Like MySQL: I think MySQL 8 finally has it, but for a long time they didn't, right? Because most of the time, for OLTP workloads, an index nested loop join is what you're going to want to use. And I just want to reiterate that the index nested loop join is conceptually the same thing as a hash join, except traditionally when you say you're doing an index nested loop join, it's on a B+tree index. But they're basically doing the same thing: there's some data structure that allows us to do quick lookups. In the case of the B+tree, it's O(log n); in the case of the hash table, average case, it's O(1). There's some data structure so we don't always have to do a sequential scan over the entire data set to quickly find matching tuples for our join.
But in the case of an index nested loop join, the index already exists, and when the query is over, the index still exists. With the hash joins we're talking about here, I'm going to spend all this time building a hash table, probe it to do my join, and then throw it away immediately after my query is done. And then the next query comes along, and I build it again. And that's OK. There is some research literature about saving the hash tables and other data structures from one query to the next; it's sort of like a materialized view, but not exactly the same. For our purposes here, assume we don't have an existing hash table and we have to build it on the fly. Some systems will also build a B+tree on the fly; I think SQL Server calls this an index spool. Again, they build a B+tree instead of a hash table because, for whatever the query is, that's better. But not everyone does that. So hashing versus sorting is a classic debate in databases that goes back to the 1970s. We'll see another similar debate when we talk about query optimizers: do I go top-down or bottom-up? There are pros and cons to both approaches. Back in the 1970s, because the machines they were running on were really limited and had a small amount of memory, they had to do external merge sort. So the literature at the time deemed that sort-merge joins were better than hash joins, because at least you get sequential access instead of the random access you have with a hash table. Then in the 1980s, hash joins became the preferred approach. I think it was a combination of two reasons. There was the Grace hash join technique that came out of Japan, which showed how to spill to disk if necessary by partitioning, which is essentially what the radix partitioning approach we'll talk about later does. And there was also this movement of specialized database accelerators called database machines: specialized hardware people were building in the 1980s that could do hash joins in hardware. And of course, they all failed, because of Moore's law; Intel CPUs and other CPUs were getting better and better at the time, and by the time you actually fabbed your database machine hardware and got it to market, the next version of x86 was out, and any benefit you got, you lost. So hashing was deemed the better approach in the '80s. Then in the 1990s, there was a paper by Goetz Graefe, the guy that invented the Volcano approach we've mentioned a couple of times before. He wrote a paper that basically said the algorithms were equivalent: there really wasn't any major performance difference between the two for the hardware available at the time. Then we get into the 2000s, and for the last two decades hashing has been the dominant approach. This is just a byproduct of the hardware getting faster, people being clever about where they store data in memory, and just better implementations of these approaches. So in the 2000s, hashing dominated. In the 2010s, the question was: should I partition my data or not before I do my join? And in the current decade, no partitioning is deemed the better one. But again, it's still worth understanding what partitioning is, and what the sorting approach is as well, which we'll cover next class. So this lecture, we're going to focus on these things.
So the paper I had you guys read was, as I said, a summary of what the previous decade of hash join implementations looked like in the research literature, trying to understand which results we should actually rely on when building a new system, because oftentimes they said contradictory things. The modern era of hash joins kicked off in 2009 with a paper from Oracle and Intel, where they showed that hashing was better, using a partitioned, radix-partitioned implementation. Although they claimed in that paper that they thought sort-merge would be better once AVX-512 came along; of course, it wasn't called AVX-512 in 2009, they said once we have 512-bit SIMD registers, we think sort-merge is going to be better. The later papers show that's actually not the case. But this was the first one that said, hey, hashing is better. Then there was a paper from the people at Wisconsin in 2011 where they looked at the differences between partitioned and non-partitioned hash joins. We can discuss those results later on, although they're over a decade old now. Then the HyPer guys came out in 2012 and said, OK, everybody's wrong, sort-merge is actually better, and even without SIMD we can get better performance in our system with a sort-merge join than with a hash join. That was 2012. Then a year later they came back and said, OK, ignore what we said, we were wrong; it turns out you really do want to use hashing, and if you make it NUMA-aware, you can get much better performance. In 2013, the same year, the other Germans, I guess the Swiss Germans at ETH, came out and said, OK, here's another implementation that extends what the Wisconsin guys did, and here's how to get better performance. And then the paper you guys read came out in 2016. They basically said: everyone is showing different results, it's not clear what's actually going on, and it's not clear which optimizations are implemented in all these different approaches. So they tried to do a thorough evaluation of all of them. And the main finding is that the no-partitioning approach turns out to be superior, both in terms of performance in the common case and in terms of ease of implementation. There's also a paper that came out in 2015 from the same people where they looked at all the different hash table implementations; it's a companion to this one, and I think it's great work. And then where we're at now, the state of the art, is from the same Germans, the HyPer Germans, now building Umbra. They have a version of the radix hash join in their new system, Umbra being the successor to HyPer, and they show that the radix hash join is marginally better, but the engineering cost of getting that better performance just isn't worth it, and you're better off implementing the no-partitioning scheme. Another thing I'll point out: even though these papers sometimes say one approach is better than another, as far as I know, no system actually tries to implement multiple variants. They usually just pick one hash table implementation and one hash join implementation, and that's it; they stick with it. It's not worth the engineering cost to be adaptive or dynamic on the fly based on what the data looks like. There are other things you need to worry about.
And we'll see later on at the end that these Germans show you're not really spending that much time executing joins during query execution; there are other things to worry about. So our goals for building a high-performance join algorithm are going to be the following. And these goals matter whether we're building what's called a hardware-conscious algorithm or a hardware-oblivious algorithm. That distinction just means: in our implementation of the hash join, are there parameters or knobs that we can tweak based on what we know about the underlying hardware, like the cache line size, the memory access speed, or the L1/L2/L3 cache sizes? Hardware-oblivious means you build your algorithm without worrying about any of these things; hardware-conscious means you are aware of them in the implementation. These two goals still matter either way; it's just a question of whether the implementation spends the extra time being hardware-conscious to push them further than what you can get through careful programming alone. So the first goal is that our hash join, actually both types of joins, wants to minimize the amount of synchronization occurring between the different worker threads running on the same box. That means we want to minimize how often one worker has to coordinate with another about who's going to write into some space, or wait for data that another worker needs to generate. It doesn't mean we need to make our entire algorithm latch-free, because that doesn't always mean better performance. We just need to be smart about where we take latches, to make sure we're not bottlenecked on everyone writing to a single location. The next goal is that we want to minimize the cost of accessing memory. As I said, assume we're running on a single node and that the data we want to join has already been brought into memory. We've already fetched it; we've already done the sequential scan; we're further up in the query plan. Everything's in memory. So now when we access data, we want to maximize locality. Ideally, we want a worker to access data that's already in its CPU cache, but at the very least in the same NUMA region. And when we bring data into our CPU caches, we want to maximize the amount of reuse we get out of it before we move on to the next piece of data. We don't want to ping-pong: bring data in, do something with it, throw it away, and then bring it back in a moment later, because now we're polluting our cache and spending a lot of time on memory stalls. So let's dive a little deeper into this one. In our algorithm implementation, we need to be mindful of what's in our CPU caches and what their capacity is. We also need to be careful about what's in our TLB. Because what we don't want is for a thread to be reading random data, ping-ponging back and forth, and now all of a sudden we have cache misses, we stall, and we have to go out to memory to get it.
But we can also have misses in our TLB, because the entry we need is not there. So if I need to touch 10 tuples, and those 10 tuples are on 10 different pages, but my TLB only holds five entries, I don't want to have to cycle through, bringing one entry into my TLB and then evicting it for another, because now I'm paying two misses: one for the TLB and one for the data itself. So there are two ways to deal with this. First, we want to maximize sequential access. This means we try to cluster the data we want to access, for whatever phase of the join we're in, into a single cache line: bring those 64 bytes in, do whatever we need to do with them, and then move on. We'll also see an optimization where, if we have to write data out to an output buffer, we can use streaming instructions to bypass the CPU caches and write immediately to DRAM, because we know we're not going to read it right back. For random access, we can split things up so that the chunk of data each operation or step operates on fits within a single cache line. So there's going to be a tradeoff between the number of instructions we have to execute to compute the join and how we access memory. It may be the case that the extra CPU instructions needed to partition the data don't pay for themselves: the improvement we get isn't worth the penalty of spending that time versus just blindly accessing the data. And that's why no partitioning often ends up being better. So while these things matter, the question is: is it worth the time to do the partitioning steps to maximize these goals, or is it better to just read the data and maybe be a little careful about which chunks we access as we go along? We'll see this as we talk through the approaches. All right, so the hash join is probably the most important operator you could have in a DBMS. But it's sort of like the SIMD stuff: if you take a really small portion of the system and vectorize the hell out of it, you get amazing numbers in isolation, but all the stuff around that operator, all the overhead of getting data in and out of that small kernel, is what dominates the cost. So while the hash join is important, and every system needs to do it as fast as possible, there's a bunch of other stuff that could actually matter a lot more. I'll show this at the end: the Germans claim that for TPC-H queries, you're only spending maybe 10% to 20% of the overall query time doing hash joins. We have numbers from Impala from a long time ago showing that in their system, maybe because it was a long time ago, they were spending 45% of the time doing hash joins across all of TPC-H. So I don't actually know what the real number is. I think Prashant measured this; I could go back and look. I think it's somewhere in between, and I'm more inclined to go with the more recent German numbers than the Impala numbers. But we'll come back to this. Regardless, it's still an important algorithm, and we need it to run as fast as possible.
And again, we want to take advantage of all the additional cores we're getting, because Intel's not going to ratchet up the clock speed much more; there are just going to be more and more cores. So we want to take advantage of all of them. And this is all independent of whether we're using SIMD or not. So the basic steps of the hash join are the following. There are three phases, and the first one is optional. The first is the partition phase, where we scan both R and S and split them up into disjoint subsets, or shards, or partitions, based on some hash function applied to the join key. The idea is that we break the input into smaller groups so that when we do the subsequent phases, the build and the probe, worker threads can operate on disjoint chunks and have all of that data be local to them. Again, the partition phase is optional; you don't have to do it. You'll sometimes see this called the Grace hash join; that's where the partition phase comes from. Next, you have the build phase. This is where we scan R, the outer table, and create a hash table on the join key. We'll talk about what that hash table could look like and what the hash function could look like. Then in the probe phase, you scan the inner table, take its join key and hash it, then probe into the hash table we built in the second phase and check whether there's a match. If yes, then combine the two tuples and produce the output, then move on to the next one; or, if we're in a pipeline, pass it up to the next operator in the pipeline. The big thing about the paper from these Germans is that they include the materialization cost in this last phase: the cost of combining the two tuples and producing the output. Some of the papers they cite don't do this; they just say, OK, I have a match, and then discard everything. But obviously that's not realistic, so this paper always includes it in the execution cost, because it certainly affects the caching behavior of the system. So we're going to go through each of these phases one by one and understand the implications, the pros and cons, of the design choices we'd have for the implementation. For the partition phase, the two approaches are what are called implicit partitioning and explicit partitioning. With implicit partitioning, we assume that somebody else in the system, like when they sent the data to us or when we loaded it from disk, has already partitioned it on exactly the join key that we want. Therefore, we don't need to do any additional partitioning or any extra pass; the data is already in the right locations for us. This is not usually the case. You can define partition keys, but think about bringing things off disk into memory from Parquet files: it could be just a bunch of random stuff. Maybe within a single Parquet file things are partitioned, but if I'm reading multiple Parquet files, it's all going to be mixed up. With explicit partitioning, the idea is that we scan the outer relation and redistribute it amongst all the CPU cores. (Sorry, this slide is flipped around; I don't know why. Ignore that.) So we're going to divide the two tables.
These should be two tables as well: we take the two tables and split them up across the different CPU cores. So then when we do the build phase or the probe phase, each CPU core operates directly on the partitions assigned to it. And we can be careful about where each partition is located, placing partitions in different NUMA regions so that a worker operates only on its local NUMA region and we're not going over the interconnect and paying the longer latency. So again: split the two input relations into partition buffers based on the join key. The big goal is that, although we pay the penalty of doing this partitioning, the extra instructions and extra time spent doing it should be less than the extra time we'd spend if we didn't partition at all, just leaving the data where it is and doing an unoptimized join over it. But again, the research says that doesn't always turn out to be the case, and you're often better off not doing the partitioning. Now, if it's a row store, you have to copy the entire tuple, and that can be expensive. If it's a column store, you only copy the data you actually need; it could be just the join key and an offset, but that depends on whether you're doing early materialization or late materialization. All right, so the two approaches to partitioning, if you want to do it, are non-blocking and blocking. With non-blocking, and I don't remember whether they cover this in the paper, the basic idea is that we only have to scan the relation once, and we produce output as we go along; the partitioning scheme determines where we actually do our writes, and we'll see that in a second. Because we don't need multiple passes, while we're doing the partitioning, as soon as we put a tuple into an output partition, some other thread could pick it up and do whatever it wants with it: it could do the build on the hash table, or the probe. We don't have to wait for the partitioning phase to complete before moving on to the next one. In the blocking approach, the radix approach, we have to scan the data at least twice, because on the first pass you have to figure out how to break up the data: in the radix partitioning case, that's with this histogram thing. Then, once we've produced the histogram, we go back and start writing to the partitions, and then something else can read them. That assumes we don't want to take another pass over it to partition even further, which, if you have a lot of skewed data, you may actually want to do; we'll see how to handle that in a second. So typically, when you see in the literature that someone is doing a radix join, a radix hash join, it's with this extra partitioning phase. All right, so in the non-blocking case, the two approaches are a shared partition or a private partition. The basic question is: where are the worker threads going to write the data they're partitioning? Are they going to write into some global partition space, or are they going to write to private partitions?
And this is a good example in computer science of there being no free lunch. In the shared case, I can just have everyone write into this global space and be done with it once everyone's finished. In the private case, I write locally, and that can be faster because there's less latch contention, since I'm not writing to a shared data structure, but then I've got to coalesce everything back into a single space afterwards, and that takes extra time. So let's look at this visually. Here's my data table. Say I'm doing the morsel approach: I split it up into chunks of some number of tuples, and each chunk is assigned to a single worker thread. Say I want to do the join on column B. Each thread scans through, accesses B, and hashes it with whatever my hash function is. And then there are some global partitions, say we're using a chained hash table here, that I'm just going to write into. Actually, this isn't even really a chained hash table; these are just linked lists of blocks of buffers, right? And if I want to update these linked lists, because they're a shared data structure, meaning any worker can write into them, I've got to protect them with latches. Whichever thread acquires the latch appends the next entry. So this is easier to implement, because everyone's just writing to the same space, and I don't have to do any cleanup afterwards: once everyone's finished their scan over the data table, this structure has everything it needs. But I could have contention if everyone's trying to write into the same partition, and I'm getting latch contention. Private partitions are basically the same thing, but now every single worker thread has its own local partitions that it writes into. They don't have to take latches, because nobody else is writing into the same space they are. But when they're done, I have to coalesce them back into the global partitions, because this P1 here has data, and I need the P1s from the different workers to be put together into a single P1, because otherwise I could have false negatives. So what you do is, for each set of local partitions, you assign one worker thread to go across all the different local partitions and combine the results into a single output, and you do the same thing for the other partitions at the same time, in parallel. The end state is the same: I have a bunch of partitions with my data in them. OK. All right. So now, radix partitioning. The idea here is that it's basically the same thing, but instead of one pass over the data like in the previous examples, where each thread went through the data once (I wouldn't call the copy into the global partitions a pass; that's literally just a memcpy), with radix partitioning you do one pass over the table to compute a histogram of the number of tuples per hash key for some radix; we'll cover what that means in a second. Then you use this histogram to figure out, within the global partition space, where each thread is going to write. So this is like a coordination step.
You scan through the data, compute this histogram, then you distribute the histogram across all the worker threads, and they can use it to figure out: OK, for this partition, here's the offset within the global partition space where I should write. And I know I don't need to coordinate with any other thread, because no other thread will be writing into the same space as me. So from the histogram you compute the prefix sum, then you scan through the data again, and the actual writing happens in this last step. And again, if you want to do recursive partitioning, because you have a lot of skew and one partition has a ton of data, you can loop back around and run this again. All right, so let's first cover what a radix is and what a prefix sum is, and then we'll see how to put them together into these three steps. The radix of a key is just the value at some position within the key itself. Assuming we have integer keys, so think of keys like 19, 12, 23, and so forth, the radix is just the digit at a given position, say the first position here. So for these keys, the radix would be 9, 2, 3, 8, 1, and 4. And likewise, I can do the same thing for the second position. So what we're going to do is compute the radix, which is just bit shifting and masking, for the first position of the hash key, and from that compute a histogram that says: for every radix value, how many elements do I have in my input sequence? Then we're going to use the prefix sum, which I'll show on the next slide, to determine what the write offsets are going to be. The idea is that for a key that shows up, say this is its hash value and its radix is one, I know how many other keys with radix one there are that have to go into the partition space, and therefore, knowing which worker thread I am, I know what offset I'm allowed to write into. The prefix sum is how we determine that offset. The prefix sum is just a running summation over an input sequence, where each value in the output sequence is the sum of the input values up to that point. So the prefix sum at the first element is just one, but for the second element, it's whatever the previous summation was plus the new value, so three, and you just do this all the way down the line. So now you can see how I can use the histogram from before: if the counts are two, three, and one, my prefix sum would be two, five, and six, and that tells us what offset each thread can write into safely. Actually, I found this out today: there's a paper from Guy Blelloch from around 1990 where he says, hey, wouldn't it be great if SIMD had an instruction to do prefix sums, because this is really important. As far as I can tell, there still isn't a SIMD instruction that does exactly this; you can emulate it with a bunch of shifting and shuffling, but I don't think it's performant, at least with current AVX-512. But Guy's awesome; he thought of this stuff 30 years ago. OK, so let's see how we put this all together. Assume we've already divided up the data into morsels, and the worker threads are going to scan through the input sequence.
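To make the mechanics concrete, here's a rough sketch in C++ of what the radix extraction, the per-worker histogram, and the prefix sum could look like. This isn't from the paper or any particular system; names like radix_of and NUM_RADIX_BITS are made up for illustration, and I'm assuming the join keys have already been hashed to 64-bit values. Note the slide's example is an inclusive prefix sum; for computing write offsets you'd typically use the exclusive variant shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the radix histogram + prefix sum idea (hypothetical names).
constexpr uint64_t NUM_RADIX_BITS = 4;                      // bits examined per pass
constexpr uint64_t NUM_PARTITIONS = 1ull << NUM_RADIX_BITS; // 16 partitions per pass

// Extract the radix for a given pass: shift, then mask.
inline uint64_t radix_of(uint64_t hash, uint64_t pass) {
  return (hash >> (pass * NUM_RADIX_BITS)) & (NUM_PARTITIONS - 1);
}

// Pass #1: each worker computes a histogram over its morsel of hashed keys.
std::vector<uint64_t> build_histogram(const std::vector<uint64_t>& hashes, uint64_t pass) {
  std::vector<uint64_t> histogram(NUM_PARTITIONS, 0);
  for (uint64_t h : hashes) {
    histogram[radix_of(h, pass)]++;
  }
  return histogram;
}

// Exclusive prefix sum over the combined histograms gives each worker the
// offset in the global output buffer where it may start writing each
// partition, so no synchronization is needed during the second (write) pass.
std::vector<uint64_t> exclusive_prefix_sum(const std::vector<uint64_t>& counts) {
  std::vector<uint64_t> offsets(counts.size(), 0);
  uint64_t running = 0;
  for (std::size_t i = 0; i < counts.size(); i++) {
    offsets[i] = running;
    running += counts[i];
  }
  return offsets;
}
```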
And again, assume these are the hash values of the keys, as if we've already hashed them; we would obviously hash them on the fly, but for simplicity I'm showing it like this. So each worker thread looks at the first radix position and computes a histogram: what's the number of keys that have this radix value? You just scan through and produce this output. So the histogram for this first worker is: partition zero, two keys; partition one, two keys as well. For the second worker: partition zero, one key; and partition one, one, two, three keys. Right, so that's the histogram I compute by looking at the radix. Now, assuming I have this one giant output buffer where I want to put all my tuples after partitioning them, I can use the prefix sum to determine where each worker thread is allowed to start writing the keys it finds when it scans through the data a second time, on the second pass. So for worker thread zero, partition zero, it's the first element; the prefix sum at this point is zero, so it writes there. But partition zero of the second worker thread writes here. And as they scan through, and I'm only showing two worker threads here, but obviously if there were 32 of them running in parallel, they don't need to coordinate about who's writing where, because they've already precomputed where they're allowed to start writing. So there's no synchronization cost other than waiting for everyone to say: OK, did you compute your histogram? Exchange the histograms, compute the prefix sums, which is not expensive to do. Same thing for partition one: they can write there safely without coordinating. So they do the second pass, scan through the data, and populate the output partitions appropriately. This is a simplified version where I'm just showing the hash keys, but think of it as the key plus the actual tuple itself being written here, not just these little numbers. Now, suppose I had a lot of skew between partition zero and partition one. In the build and probe phases, if I just have two worker threads, I could let worker zero take partition zero and worker one take partition one. In this example it's only three versus five, but assume it was ten tuples versus ten million in the partitions; then maybe I want to partition further. So I could just rerun the same algorithm recursively, do another two passes on that partition to divide it up even further, looking at the next radix digit and doing the same thing we did before. Again, most systems will just do one round of the two-pass algorithm. Disk-based systems sometimes go further, because they want to divide things up to fit page sizes on disk. In practice, it depends on how much skew there is. Hashing does alleviate some of it, but if every tuple has the exact same join key, then no amount of partitioning is going to help you, because they're all going to hash to the same thing. All right, any questions so far? Yes. The question is: will this end up being a sorted list? What do you mean by sorted? Sorted on what, the join key?
But it's hashing, so hashing is random, right? So you don't have any ordering on the join key. It's clustered: all the tuples with radix zero are together, but there's no guarantee that within that cluster they're ordered on the join key. So it's not sorted; within a partition they're unsorted, but yes, they're grouped together. I'm using the word clustered; grouped is maybe a better word. They're grouped together based on the partition value, the radix we generated. The related question is: when we write out to the output buffers, do we need to put them in order? You actually can't, right? Going back here: at this point, all worker one knows is that it's going to write its partition-one data here. It doesn't know what's in worker zero's data. So how could it write here and also be sorted within partition one, since this whole range is partition one? You don't want to coordinate with the other worker threads, so you can't sort things; and you don't need to. Worker one's elements always end up at the top of partition one; does that matter? Sorry, what's that? You're saying it ends up pretty close to being ordered, first by the radix and then by scan order? Yeah, yeah, I got it, I understand. In this phase, no. But if you recurse, you're looking at every radix digit. So what you're asking is: would the partition end up globally sorted on the hash value? Globally, on the hash value, yes. But you can't really do anything with that, because you still have to probe the hash table, and that's going to be another random access. I don't think you can take advantage of it. They're only sorted by the hash value; you still have to go probe the hash table anyway, and that's going to be random, and then, depending on what hashing scheme you're using, there's a sequential scan within the bucket. Unless the hash table itself is sorted, which it won't be, you don't get any benefit from this. But it's an interesting observation. Okay, all right, so there are two optimizations the paper points out that make a big difference. The first one is software write-combine buffers. The idea is that in my example, I was showing the worker threads writing straight into this global partition space. In PowerPoint that's fine, who cares, it's not a real system. But if you were actually writing to a particular chunk of memory like this, it could be really bad to do that kind of random access: go back here, I write to this partition, then I write to this other partition over here. What you instead want to do is have a little buffer that's local to each worker and write all your new updates there, and when it fills up to a cache line, you write it all out at once, in a batch, to the global memory space. This removes pressure on the TLB and improves cache locality for the data you're accessing. It's similar to the private partition stuff we saw before, but you don't need that separate write phase at the end where worker threads scan through and combine everything afterwards.
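Here's a minimal sketch of what a software write-combine buffer could look like in C++. It's not the paper's code; the struct names and the fixed fan-out are made up for illustration, and it assumes each thread's write cursors into the global partition space (global_out) were already computed from the prefix sum.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Per-thread software write-combine buffers (hypothetical names).
// Each worker buffers one cache line worth of tuples per partition locally
// and only writes to the global partition space when a buffer fills up.
constexpr std::size_t CACHE_LINE = 64;
constexpr std::size_t FANOUT = 16;   // number of output partitions this pass

struct Tuple { uint64_t key; uint64_t payload; };
constexpr std::size_t TUPLES_PER_LINE = CACHE_LINE / sizeof(Tuple);  // 4 tuples

struct WriteCombineBuffer {
  Tuple slots[TUPLES_PER_LINE];
  std::size_t count = 0;
};

struct ThreadLocalPartitioner {
  WriteCombineBuffer buffers[FANOUT];
  Tuple* global_out[FANOUT];   // this thread's next write position per partition,
                               // precomputed from the prefix sum

  void add(std::size_t partition, const Tuple& t) {
    WriteCombineBuffer& buf = buffers[partition];
    buf.slots[buf.count++] = t;
    if (buf.count == TUPLES_PER_LINE) {
      flush(partition);
    }
  }

  void flush(std::size_t partition) {
    WriteCombineBuffer& buf = buffers[partition];
    // One cache-line-sized batched write to the global buffer. A real
    // implementation could use non-temporal (streaming) stores here to
    // bypass the CPU caches, since we won't read this data again soon.
    std::memcpy(global_out[partition], buf.slots, buf.count * sizeof(Tuple));
    global_out[partition] += buf.count;
    buf.count = 0;
  }
};
```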
I still get the benefit of everyone writing locally, and then they write out a batch to get better performance. The next optimization is the streaming write stuff we talked about before. As I'm writing out data to my partition output buffer, I know I'm potentially not going to read it again, ignoring recursive partitioning. So as soon as I write data out here, I know I'm not going to go back and read what I just wrote; that comes later, when I do the build or the probe of the hash join. There are instructions in x86 for what are called streaming, or non-temporal, writes, where I can write to a memory location and bypass the CPU caches so I don't pollute them. I know you can do this with SIMD; I don't know offhand whether you can do it for regular scalar stores. The combination of these two things, and we'll see this when they do the optimized evaluation, gives much better performance over basic implementations. There are other optimizations they cover, like being NUMA-aware, which we've discussed before, and I'll talk about a few other tweaks as well. All right, so we've done partitioning; now we want to do the build phase. The idea is we scan the outer table R, either the tuples themselves or the partitions of them, and for each tuple we hash it, using the same hash function we used before, and write it to some location in our hash table. And if we're organizing the data in the hash table into buckets, we want those buckets to be equivalent to a few cache lines and nicely aligned, so that we're not paying the penalty of unaligned accesses. So a hash table is a combination of two components; when someone says they have a hash table, you need both of these things. The first is a hash function, which takes an arbitrary value, such as our join key, and maps it into some smaller domain, typically a 32-bit or 64-bit integer, in a way that looks random. What we care about here is the tradeoff between how fast the hash function is to compute versus what its collision rate is; we'll look at that in a second. The second piece is: when we hash our join keys, we're invariably going to have collisions, so the question is how we handle collisions in the data structure. More tradeoffs here: do we have a really large hash table that won't have many collisions, or do we have a smaller one, accept collisions, and pay the extra instructions to find a place to insert data and to find the key when we do a lookup later on? We don't have a lot of time, so I'm going to go quickly through both of these, because the main takeaway is that xxHash, from Facebook, is going to be the fastest hash function, and the linear probing hash table is going to be the fastest implementation. But I do like to cover things like cuckoo hashing and hopscotch hashing, just so you're aware these things exist, and I'll explain why you don't want to use them, so that when you're out in the real world and someone says, hey, we should use this fancy hash table, you can just say no and point to this lecture.
All right, so for a hash function for a hash join, we don't care about cryptographic properties or guarantees; we don't care about being able to invert it, it's a one-way hash. We want something that's really fast and has a low collision rate. The fastest hash function you could have would just be one that always returns 1, because that's the simplest thing to do, just moving a constant into a register. But of course, that has a terrible collision rate, because no matter what key you give it, it hashes to 1. The best collision rate, but the slowest, would be what's called a perfect hash function, a theoretical concept where no matter what key I give it, the hash value is guaranteed to be unique. Again, this lives in the research literature; you can't really use it here, because the way you'd actually implement it is with another hash table, so you'd have a hash table for your hash table. Nobody does this. For the latest up-to-date numbers on what the fastest hash function is, there's a benchmark called SMHasher, which I think is from the MurmurHash guy; it's basically a stress test for the different hash functions out there, and you can measure collision rate versus performance. I don't want to spend too much time on this, but here's a bunch of hash functions. The modern era of hash functions started, I think, in 2008, when some random person on the internet built MurmurHash. He put it out there and people started using it because it had this nice tradeoff between collision rate and performance. Google took it and forked it, and that became, I think, CityHash. The guy that built Zstandard, the compression stuff we talked about before, built xxHash, which is still considered the state of the art; XXH3 is the latest variant, and it performs really well. Then in 2016 there was HighwayHash, a follow-on to FarmHash, but again, you wouldn't use that for a hash join. So this is a microbenchmark I ran with a different framework, measuring throughput for different key sizes. You can see XXH3 just blows everyone away. The sawtooth pattern here comes from the cache line size: as the key gets bigger and fills out the cache line, I'm processing more data per miss and getting more throughput. I ran this benchmark on my desktop machine and then again on dev12, and dev12 was much slower; I have to look at what's going on there, since it's supposed to be a newer CPU. Anyway, CityHash and FarmHash don't use SIMD, and I don't think xxHash uses SIMD either. I think none of these use SIMD because it makes them less portable, and they're trying to be general-purpose. All right, so we're going to use xxHash in most systems. Then we need to pick a hashing scheme. I'll go through these quickly. Linear probe hashing is going to be the most important one, or the most common one; chained hashing is probably the second most common; and then the others show up from people who think, oh, this looks clever, let's try it. But unless you know what you're doing, you probably shouldn't.
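As a quick aside on what using xxHash for join keys might look like, here's a minimal sketch. I'm assuming the xxHash C library (xxhash.h) and its XXH3_64bits entry point; check the version of the library you have, since older releases only exposed XXH64.

```cpp
#include <cstdint>
#include <xxhash.h>   // xxHash library: https://github.com/Cyan4973/xxHash

// Hash a fixed-width join key with XXH3.
inline uint64_t hash_join_key(uint64_t key) {
  return XXH3_64bits(&key, sizeof(key));
  // On older xxHash versions: return XXH64(&key, sizeof(key), /*seed=*/0);
}

// Map the hash to a slot in a power-of-two sized table with a mask
// instead of a more expensive modulo.
inline uint64_t slot_for(uint64_t hash, uint64_t table_size_pow2) {
  return hash & (table_size_pow2 - 1);
}
```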
For example, before the pandemic we had the co-founder of InfluxDB come give a talk at CMU, and he mentioned they were using Robin Hood hashing. Dave Andersen was in the audience and asked him why, and he said one of his engineers saw it on Hacker News and it seemed like a cool thing, so they implemented it. That's not a good reason, right? The research literature clearly shows that linear probe hashing is the better approach. All right, so chained hashing: this is what most people think of when you say hash table, like the hash table you get in Java. The basic idea is that there are linked lists of buckets for each slot in the hash table, and to handle a collision we follow the linked list until we find a free space where we can put our key. Then, when we do a lookup, we have to scan the bucket chain for the slot: either we find the key and stop, or we reach the end and know it doesn't exist. Conceptually it's basically the same thing as a linear probe hash table, but the linear probe hash table has one giant fixed-size array, whereas the chained hash table can expand, because you just increase the length of the chains over time. So say these are the keys I want to insert, these are my bucket pointers, and these are my buckets. For key A, I hash it, land in this bucket, find the first free slot, and write it there. For key B, I hash to there and write it there. C, and so forth. Now for D, when I land in this bucket, both slots are occupied, so I just extend the chain for this location of the hash table: create a new block of memory and write my entry for D there. And I do the same thing for E, F, and so forth. One optimization that HyPer does, which I think is clever, because they use a chained hash table, is that for the pointers they store in these slots, as well as the pointers between buckets, they only store the 48 bits of the memory address that x86 actually uses, and then they use the remaining 16 bits as a small Bloom filter that tells you whether the key you want might be in that chain. To be clear: when you allocate a 64-bit pointer in x86, it does take 64 bits, but the hardware only uses the lower 48. You can't actually address 2^64 locations of memory; it's 2^48. The reason Intel did this is that they knew nobody actually needed the full 64 bits, but instead of making weird 48-bit registers, they just made them 64 bits and had the hardware only address up to 48 bits. So when you allocate a pointer, there are 16 bits where you can put whatever you want; the compiler doesn't care and the hardware doesn't care, you just have to mask them off before you dereference the pointer. So nowadays you can use those extra 16 bits to store a Bloom filter that tells you whether the key you're looking for could be on that chain. You can have false positives, because Bloom filters are approximate data structures, so you could follow the chain and not find the thing you need, but it can save you lookups, because the Bloom filter can tell you the thing you want is definitely not there, and you can skip the chain entirely.
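Here's a rough sketch of that pointer-tagging trick in C++. This is not HyPer's actual code; the helper names are made up, and it assumes a 16-bit tag where each key sets one of 16 filter bits based on its hash.

```cpp
#include <cstdint>

// Sketch of tagging the upper 16 bits of a 48-bit x86-64 pointer with a tiny
// Bloom filter over the keys reachable through that bucket pointer.
constexpr uint64_t PTR_MASK = (1ull << 48) - 1;

struct Bucket;  // the chain of entries lives behind this (not shown)

// Pick one of the 16 tag bits based on the key's hash.
inline uint64_t tag_for(uint64_t hash) {
  return 1ull << (48 + (hash & 0xF));
}

// Build a tagged word: keep the existing filter bits, add this key's bit,
// and store the 48-bit pointer in the low bits.
inline uint64_t tag_pointer(uint64_t tagged, Bucket* next, uint64_t hash) {
  return (tagged & ~PTR_MASK)
       | tag_for(hash)
       | (reinterpret_cast<uint64_t>(next) & PTR_MASK);
}

// On probe: if the key's bit isn't set, it is definitely not in this chain,
// so we can skip following it. False positives are possible.
inline bool might_contain(uint64_t tagged, uint64_t hash) {
  return (tagged & tag_for(hash)) != 0;
}

// Mask the tag bits off before dereferencing the pointer.
inline Bucket* untag(uint64_t tagged) {
  return reinterpret_cast<Bucket*>(tagged & PTR_MASK);
}
```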
Question: couldn't you go a step further with this, and use alignment to get even more bits for the Bloom filter? Yes, exactly: if you're memory-aligning your allocations in C++, you know you don't need the full 48 bits, and you can make the Bloom filter even bigger. We used to do memory-alignment tricks like that in NoisePage, not for Bloom filters though. Look at the stuff in Apache Arrow; we can take that offline. All right, linear probe hashing. It's basically the same concept, but instead of bucket chains that can be extended, I just have one giant array. The way I resolve collisions is that I scan down until I find either the key I'm looking for or an empty slot: on an insert, the empty slot is where I do my insert; on a lookup, hitting an empty slot means the key I'm looking for isn't there. If I reach the end of the array, I wrap around and start again from the top. And obviously I need to keep track of where I entered the hash table, so that if I loop all the way back around to the same spot, I know the thing I'm looking for isn't there, or there are no free slots, and I can stop and break out of what would otherwise be an infinite loop. So it looks like this, same keys I want to store. Hash A, it lands here. Hash B, it lands there. Hash C, I go here, but that slot is already occupied by A, so I just go down to the next free slot and put C there. Same thing for D, and E goes here as well. So if I'm doing a lookup on E, I hash to here and do my comparison: does this slot hold E? No, it's A, so go to the next one, and I keep going until I find either the key I'm looking for or an empty slot. And the same thing for F. For this scheme, you pay the same penalty on both the build side and the probe side: the cost of finding something is the same as the cost of inserting something when there are collisions. We'll see examples in a second where extensions of this scheme try to shuffle things around so that lookups are faster than inserts. Depending on what your workload looks like, that might seem like the right choice, but in practice it doesn't turn out to pay off. So this one is basically always going to be the fastest, because it's so simple. There's no indirection. There is branching, because you have to decide whether to jump to the next slot or not, but once data is stored in the hash table, you're done; you don't have to go back and move it later. Now, there was this line of thought in the literature for a while that said: hey, if I'm building a hash table with a certain number of keys but then probing it a lot, reading the data structure more than writing to it, then maybe it's worth the overhead of doing a little extra work on the writes to move things around, so that when I read the data, it's more likely to be in the position I look at first. If I want to avoid long searches on inserts and lookups, I can just make a really big hash table, which reduces the likelihood of collisions. But there may be other tricks that shuffle things around so that on lookups I find the thing I want right away.
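Before getting to those variants, here's a minimal sketch of the plain linear-probing table just described. It assumes a power-of-two capacity sized large enough that the table never fills up, unique keys, and one reserved key value as the empty marker; a real implementation would handle duplicates and resizing.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Minimal linear-probing hash table sketch for the build/probe phases.
struct LinearProbeTable {
  static constexpr uint64_t EMPTY = UINT64_MAX;   // reserved empty-slot marker
  struct Slot { uint64_t key = EMPTY; uint64_t payload = 0; };
  std::vector<Slot> slots;
  uint64_t mask;

  explicit LinearProbeTable(uint64_t capacity_pow2)
      : slots(capacity_pow2), mask(capacity_pow2 - 1) {}

  // Build side: scan down from the hash slot until we find a free slot.
  void insert(uint64_t hash, uint64_t key, uint64_t payload) {
    uint64_t i = hash & mask;
    while (slots[i].key != EMPTY) {
      i = (i + 1) & mask;           // wrap around at the end of the array
    }
    slots[i] = {key, payload};
  }

  // Probe side: scan until we find the key or hit an empty slot.
  std::optional<uint64_t> probe(uint64_t hash, uint64_t key) const {
    uint64_t i = hash & mask;
    while (slots[i].key != EMPTY) {
      if (slots[i].key == key) return slots[i].payload;
      i = (i + 1) & mask;
    }
    return std::nullopt;            // empty slot reached: key is not here
  }
};
```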
So the first technique is Robin Hood hashing. The idea is to extend linear probe hashing so that keys are allowed to steal slots in the hash table from "rich" keys and give them to "poor" keys, where rich versus poor is defined by the number of hops, or positions, you are away from where you would have been if there had been no collision. The idea is that if one key is already in my hash table and it's really close to where it should be, and the new key I'm trying to insert would end up really far away, I'm better off swapping their positions so that the poor key ends up closer than it otherwise would have been with regular linear probe hashing. So it looks like this. Going back to the same example, I insert A, and now I store the key as before, but I also store the number of hops I am from where I should have been if there were no collision. With an empty hash table, A lands exactly where it should be, so it's zero jumps away. Same with B: it goes up here and it's fine. But now I want to insert C. C wants to go where A is, but at this point C is zero hops from where it should be and A is zero hops from where it should be, so C leaves A alone and goes down a slot, and we update C's counter to say it's one hop away: it should be up there, but it's down here. Now I insert D. D is zero hops away from where it wants to go, but C is already there, and C is one hop away, which is greater than D's zero, so D is not allowed to steal from C; D just goes down to the next slot. Now I insert E. Same thing: E and A are both zero hops, so A stays where it is. At the next slot, C is one hop and E is one hop, so C stays where it is. But at the next slot, E is now two hops, whereas D is one hop; two is greater than one, so E is allowed to steal from D. E takes D's position, and then the worker has to go figure out where to put D, and D ends up landing down here. So now, on average, if I do a lot of lookups on E and D, the distances are basically balanced. Then F goes down here. I think you can guess why this is bad for performance: it's a bunch of copying. Going back here: I have to figure out where to put the new key, then I take this slot, which means I have to copy D out into some buffer, write E in, and then go down and write D back somewhere else, and that's not free. Even though we're doing this on the build side and not the probe side, the overhead is just going to be too much, and the benefit you get when you actually do the probes is marginal compared to it. Also, this is unbounded: I can keep swapping back and forth until I finally find a free position where everything balances out. Sure, in a simple case this is fine, but on average, not so much. So hopscotch hashing is a modern variant of Robin Hood hashing; Robin Hood hashing came out in 1985, and this is from 2008. It's a variant of linear probe hashing where we again allow keys to steal positions from other keys, but we only do it within what are called neighborhoods.
And so this is a way to bound how far you have to look, or how far you have to move something, when you start shuffling things around. The idea is that you want the size of a neighborhood to be a single cache line, or a small number of cache lines, so that when you're deciding where to shuffle things, you're not paying penalties to go out to memory to figure out whether it's okay to put something somewhere; you only look at things that are already in your CPU caches. We then have to guarantee that a key is either in its neighborhood or doesn't exist at all. Again, with Robin Hood hashing I can keep shuffling things around, and with linear probe hashing I keep scanning until I loop back to where I started. In hopscotch hashing we bound everything to a neighborhood, meaning that even if there is free space elsewhere in the hash table where I could put something, if it doesn't fit in its neighborhood, the table is considered full and I have to stop, grow the hash table, and rebuild it at double the size. The high-level goal is that because everything stays in the same cache line, the cost of fetching the neighborhood dominates, and finding the key within that neighborhood is essentially free; you already paid the penalty for the cache miss to do the lookup. That's the overarching goal.

All right, for this simple example the neighborhood size is three, and the neighborhoods overlap. So for the first position this is neighborhood one, then two, three, four, and so forth, and the last one, neighborhood six, wraps around and starts again at the top. I want to insert A, and it's in this neighborhood. I can insert it into any of the slots within the neighborhood and that's fine, but in practice you always just take the first free one, so A goes there. B hashes here, that's its neighborhood, so it goes up there. Now I want to insert C. C is in the same neighborhood as A, A's slot is occupied, so just like linear probe hashing I scan down until I find a free slot within the neighborhood and put C there. D wants to go where C is, this is its neighborhood, so same thing, scan down, find a free slot, write D there. Now I want to insert E, but E hashes to where A is, and if I scan through, all the positions in E's neighborhood are occupied. Remember, we need the guarantee that a key either doesn't exist or exists in its neighborhood. So what I do now is go back and look at the keys I just scanned through and figure out whether any of them could be shuffled to another position while still staying in their own neighborhoods. I keep track of the slots I just passed through, and I see that D hashes to neighborhood four but is sitting in this position here; D could move down a slot and still be in its neighborhood. So I move D down, and now I can put E in the slot it vacated. And the last one, F, just goes in at the bottom.
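Here is a simplified sketch of that insert procedure. Real hopscotch implementations keep a per-bucket hop bitmap to find movable keys quickly; to keep the example short this sketch just re-hashes the candidate keys to check neighborhood membership. The `HopscotchTable` name, the neighborhood size `H`, and the hash constant are all made up for illustration.

```cpp
// Simplified sketch of a hopscotch insert with neighborhood size H.
// Illustrative only: failure means "grow the table and rebuild", which is
// left to the caller.
#include <cstdint>
#include <vector>

constexpr size_t H = 3;   // neighborhood size (ideally ~one cache line of slots)

struct HsSlot {
    bool     occupied = false;
    uint64_t key      = 0;
    uint64_t payload  = 0;
};

class HopscotchTable {
public:
    explicit HopscotchTable(size_t capacity) : slots_(capacity) {}

    bool Insert(uint64_t key, uint64_t payload) {
        size_t home = Hash(key);

        // 1. Find the first free slot, scanning forward with wrap-around.
        size_t dist = 0;
        while (dist < slots_.size() &&
               slots_[(home + dist) % slots_.size()].occupied) {
            ++dist;
        }
        if (dist == slots_.size()) return false;     // table completely full

        // 2. While the free slot is outside the key's neighborhood, try to move
        //    an earlier key into it, as long as that key stays in ITS neighborhood.
        while (dist >= H) {
            bool moved = false;
            for (size_t back = H - 1; back >= 1; --back) {
                size_t cand      = (home + dist - back) % slots_.size();
                size_t cand_home = Hash(slots_[cand].key);
                size_t cand_dist =
                    ((home + dist) + slots_.size() - cand_home) % slots_.size();
                if (cand_dist < H) {                  // free slot is in cand's neighborhood
                    slots_[(home + dist) % slots_.size()] = slots_[cand];
                    slots_[cand].occupied = false;    // cand's old slot becomes the free slot
                    dist -= back;
                    moved = true;
                    break;
                }
            }
            if (!moved) return false;                 // stuck: caller grows and rebuilds
        }

        slots_[(home + dist) % slots_.size()] = {true, key, payload};
        return true;
    }

private:
    size_t Hash(uint64_t key) const {
        return (key * 0x9E3779B97F4A7C15ULL) % slots_.size();
    }
    std::vector<HsSlot> slots_;
};
```

The key property is in step 2: shuffling only ever moves a key within its own neighborhood, so a probe never has to look more than `H` slots past a key's home position.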
So again, it's like Robin Hood hashing in that keys can steal positions from other keys, but rather than keeping an explicit hop count, it's implicit: you're either in your neighborhood or you're not.

All right, quickly, cuckoo hashing. This one does show up in real systems; I believe IBM DB2 BLU used cuckoo hashing, though I forget how many hash tables or hash functions they used. The basic idea is that you can have multiple tables; I'll show the multiple-table version, but you could also have a single table with multiple hash functions. When I do an insert, I hash the key multiple times to get different candidate positions and take whichever one is free. If all the positions I look at are occupied, I steal one of them: I evict the key that was there, put my key in, and then relocate the evicted key to one of its other candidate locations. So you have the same problem as Robin Hood and hopscotch hashing, in that you pay a copying penalty to balance things out, but the benefit is that probes are O(1): the key is either at one of its candidate locations or it's not there at all.

So say I have two hash tables and two hash functions. I want to insert X; the first hash function points here, the second one points there, I flip a coin and write X into the first table. Now I insert Y: the first hash function maps to where X already is, so that's occupied; the second hash function maps to an empty slot, so I put Y there. But now along comes Z, and both hash functions map to occupied locations. So I flip a coin and decide to steal Y's slot in the second table: Z gets written there, Y gets pulled out, and now I have to put Y back somewhere. I hash it with the other function and it lands where X is, so I steal that position too: X comes out, Y goes in, and now I re-insert X, hash it for the other table, and finally find a free slot. If we loop back around and recognize that we're back where we started, that the key we're trying to place maps only to locations we've already evicted from, then we know we're in an infinite loop and we have to stop, grow the hash table, and rebuild it at double the size. Again, same story: nice O(1) lookups, but the build phase is more expensive because of the extra copying. As for hopscotch hashing, I don't know if anybody actually implements it; I just find it an interesting variation of Robin Hood hashing, so I like covering it.

All right, so now we have our hash table, and now we do the probe. There's no magic to this. If we partitioned the data, then each worker scans through its partitions; otherwise you break the probe side up into morsels, or whatever chunks you want, and scan through them. You use the same hash function you used in the build (and partition) phase, do a lookup in the hash table you built in the build phase, and check whether you have a match. If so, you combine the matching tuples and produce the output.
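To make the cuckoo hashing description above concrete, here is a minimal sketch with two tables and two hash functions. Instead of the full cycle detection described in the lecture, this sketch bounds the eviction chain with a fixed `kMaxKicks` limit, which is a common simplification; on failure a real system would grow the tables and rebuild. All names and constants are made up for illustration.

```cpp
// Minimal sketch of cuckoo hashing with two tables / two hash functions.
// Illustrative only.
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

struct CuckooSlot {
    bool     occupied = false;
    uint64_t key      = 0;
    uint64_t payload  = 0;
};

class CuckooTable {
public:
    explicit CuckooTable(size_t capacity_per_table)
        : tables_{std::vector<CuckooSlot>(capacity_per_table),
                  std::vector<CuckooSlot>(capacity_per_table)} {}

    // Returns false when the eviction chain gets too long (caller rebuilds bigger).
    bool Insert(uint64_t key, uint64_t payload) {
        CuckooSlot item{true, key, payload};
        int which = 0;
        for (int kicks = 0; kicks < kMaxKicks; ++kicks) {
            size_t pos = Hash(item.key, which);
            if (!tables_[which][pos].occupied) {
                tables_[which][pos] = item;
                return true;
            }
            std::swap(item, tables_[which][pos]);  // evict the occupant...
            which = 1 - which;                     // ...and re-insert it into the other table
        }
        return false;
    }

    // Probe is O(1): the key sits in one of exactly two candidate slots, or is absent.
    std::optional<uint64_t> Probe(uint64_t key) const {
        for (int which = 0; which < 2; ++which) {
            const CuckooSlot &s = tables_[which][Hash(key, which)];
            if (s.occupied && s.key == key) return s.payload;
        }
        return std::nullopt;
    }

private:
    static constexpr int kMaxKicks = 32;

    size_t Hash(uint64_t key, int which) const {
        uint64_t h = key * (which == 0 ? 0x9E3779B97F4A7C15ULL
                                       : 0xC2B2AE3D27D4EB4FULL);
        return h % tables_[which].size();
    }
    std::vector<CuckooSlot> tables_[2];
};
```

The `Probe` method here also shows the general shape of the probe phase described above: hash the probe-side key with the same function(s) used during the build, look up the slot, and emit a joined tuple on a match.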
All right, so one optimization you can do comes from Vectorwise. When you do the build phase and build your hash table, you also build an auxiliary bloom filter on the side that tells you whether the key you're looking for might even be in the hash table. The idea is that a lookup in the bloom filter is so cheap relative to a hash table lookup that it's almost always worth checking the filter first. So I want to join A and B. A is the build side, so it builds a hash table, but it also builds a bloom filter. Now when I start the probe phase with B, I pass that filter over to B. B checks the bloom filter first; if the key it's looking for is found in the filter, it does the actual lookup in the hash table. If the bloom filter says the key is not there, then it really isn't there, because bloom filters cannot produce false negatives, so you don't even bother doing the hash table lookup. The Vectorwise folks report roughly a 2x speedup for really selective joins: if most of the probe keys are not going to be found in the hash table, this bloom filter crushes it. It's like doing a predicate pushdown, except the information that tells you whether a tuple could even match is generated on the other side, on the build side. And from a strict relational model standpoint, where query plans are trees of operators feeding into each other, B isn't really supposed to know anything about A, which is why this is called sideways information passing.

Okay, so to finish up. The German paper you read evaluated the different variants we've covered. They only look at chained hashing, linear probe hashing, and something called a concise hash table, which came from IBM; we don't need to cover that, and I don't think anybody other than IBM does it. It's basically a giant array with a little bloom-filter-like structure in front of it. They put all of this into a single test bed, and they distinguish between implementations written straight from the papers that describe the algorithms and versions they opened up, analyzed for bottlenecks, and optimized further (the paper's "black box" versus "white box" distinction). The core approaches they compare are the no-partitioning hash join, the concise hash table from IBM, and the two-pass radix hash join using either a chained hash table or a linear probe hash table. Then there are special-case variants where, instead of a hash table, they use plain arrays, because they know the join keys are monotonically increasing primary key integers, like one, two, three, four, five, and so on; then you know exactly where to jump in the array to find a given key. You can't always do that; real keys usually don't look like that. So it's nice as an upper bound on what you could do, but nobody actually does this. So I just want to show one graph.
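Here is a minimal sketch of that Vectorwise-style trick: build a small bloom filter alongside the join hash table, and have the probe side consult the filter before touching the table. The `BloomFilter` class, the two-hash-function design, and the mixing constants are illustrative, not taken from Vectorwise.

```cpp
// Minimal sketch of a bloom filter used as a pre-filter for hash join probes.
// Illustrative only.
#include <cstdint>
#include <vector>

class BloomFilter {
public:
    explicit BloomFilter(size_t num_bits) : bits_(num_bits, false) {}

    void Add(uint64_t key) {
        bits_[Mix(key, 0) % bits_.size()] = true;
        bits_[Mix(key, 1) % bits_.size()] = true;
    }

    // May return a false positive, but never a false negative.
    bool MightContain(uint64_t key) const {
        return bits_[Mix(key, 0) % bits_.size()] &&
               bits_[Mix(key, 1) % bits_.size()];
    }

private:
    static uint64_t Mix(uint64_t key, uint64_t seed) {
        uint64_t h = key + seed * 0xC2B2AE3D27D4EB4FULL;
        h ^= h >> 33;
        h *= 0x9E3779B97F4A7C15ULL;
        h ^= h >> 29;
        return h;
    }
    std::vector<bool> bits_;
};

// Probe-side loop, assuming the LinearProbeTable sketch from earlier.
// Skip the (more expensive) hash table lookup whenever the filter says "no":
//
//   for (uint64_t probe_key : probe_side_keys) {
//       if (!filter.MightContain(probe_key)) continue;   // definitely no match
//       if (auto payload = table.Probe(probe_key)) {
//           EmitJoinedTuple(probe_key, *payload);         // hypothetical output call
//       }
//   }
```

The win comes entirely from selectivity: when most probe keys have no match, the filter turns most probes into a couple of bit tests instead of a cache-missing table lookup.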
You can't really see the dividing line on this graph, but these bars are the optimized implementations, where they went through and tuned everything further, and these are the ones you get if you just implement what the papers describe. The sort-merge join bars are what we'll cover next class; this paper only touches on it, and the paper you'll read after spring break covers that approach in detail. These are the abbreviations they use in the paper, which I find confusing, so here's a mapping for them.

So what's the main takeaway? The no-partitioning join does better than almost all of them, and the extra variants are the additional optimizations you can layer on, like software-managed streaming writes, combined output buffers, or localized buffers. If you do all of those things, and they're tuned exactly for what the hardware expects, then you can get almost 2x better performance. But the no-partitioning join with linear probe hashing requires none of that: you implement it once, it runs on whatever hardware you have, and while it's not the best, it's pretty good. That's why most systems choose it.

Then they have another graph that asks: how much time are you actually spending doing joins in a real system, relative to everything else, like producing output buffers, scanning tables, and so forth? What they show is that you spend at most about 20% of the time doing hash joins for an in-memory data set with an engine like this. So the argument they make is that all the extra work these algorithms do to make hashing really, really efficient maybe doesn't matter that much, and there are other things in the system you should be optimizing. I would argue the query optimizer is probably the thing with the most impact on performance, because if you have a crappy join ordering, it doesn't matter whether you're using an optimized hash join; it's still going to be super slow.

All right, to show you alternative numbers, this is profiling data sent to us by Ippokratis, who will talk about Redshift later in the semester; he used to work on Cloudera Impala. This is from profiling runs of TPC-H on their cluster, showing how much time their system spent doing hash joins. Now, I don't know whether this also includes network traffic, and I don't have the raw numbers anymore, but they argued that you spend almost 50% of the time doing hash joins and therefore should worry about making them as efficient as possible. These are their numbers; I'm more inclined to believe the German numbers, but I want to show these so you know you may see other figures out there.

All right, so the main takeaway of this class is that it's a bit like the D.A.R.E. program when I was growing up: they show you how to do drugs and then tell you not to do it, and I wouldn't have known about those drugs unless they told me about them. So it's kind of like that: here's a bunch of stuff, don't do it, which maybe makes you want to go do it, I don't know.
The partitioned hash join will outperform the no-partitioning approach in a bunch of use cases, but it's not trivial to implement, and it's usually not worth the extra engineering effort to get that last roughly 10% of performance, because there are other things you should be worried about in your system. And as I said, most database system vendors are going to pick one hash join implementation and one hash function and be done with it. They don't try to be clever and say, "oh, my data looks like this, therefore I should use radix partitioning this way versus no partitioning that way." A system usually just does one thing, and that's it.