So, today is the first of a two-part lecture on joins. Today we're going to focus on parallel hash joins, because these are the most common ones, and then on Monday next week we'll do sort-merge joins, which is the other major class of join algorithms. So for today's class we'll start off with the background of joins — everyone should know what a join is and roughly what we're trying to do here — but then we'll go into more detail on how to do a parallel hash join. The two major design decisions we're going to have in our join algorithm are the hash function and the hashing scheme, so we'll talk about how to handle those, and then we'll finish off with an evaluation, okay? Okay.

All right. So, when you took the introduction database class, we taught you joins, but we didn't tell you anything about how you're actually going to execute them, right? We didn't talk about threads, we didn't talk about how you actually build the hash table, or what hash function you're going to use. But now, in a modern system, since we're going to have a bunch of cores, we need to be able to do our join in parallel, because this is going to end up being the most expensive operation you can do in your database system. I know the paper you guys read says otherwise, but we'll look at two sets of numbers. So we're going to use all these extra threads we have to process our join in parallel and speed up the operation.

The two main approaches in any major OLAP system for computing a join in parallel are either the hash join or the sort-merge join algorithm. And so again, this lecture will be about the hash join, and then on Monday next week we'll do the sort-merge join. We're not going to discuss nested loop joins at all. In fact, when we introduce joins in the introduction class, we start off with the nested loop join because it's the easiest one to understand: it's just two for loops iterating over the outer table and the inner table. And what I'll say is that one way to think about this is that, first of all, nested loop joins are more common in OLTP environments, because in those applications you're not actually joining really large tables with each other. So again, think of an OLTP environment. The join would be: for Andy's customer account, get all his orders. That's one record to get my customer account and then a small number of orders that you join together. And typically in an OLTP environment you would model the database with foreign keys, and in order to enforce a foreign key you have to build an index, so you can check that any time you insert something into the child table, the key it's paired with actually exists in the parent — and you use an index for that. So you end up always doing an index nested loop join, where you grab the one key you need from the outer table and then join it together with the inner table through the index. And when you think about it at a high level, this is essentially doing the same thing as a hash join. In a hash join, you assume there's no index, so you build the hash table on the fly on the outer table, and that way you can then probe into it with the inner table. In an OLTP environment, you already have the index. It'll be a B+tree, usually not a hash table.
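To make that equivalence concrete, here is a minimal sketch — my own illustration, not from the lecture — contrasting the two. An index nested loop join probes an index that already exists; a hash join builds its "index" on the fly. The Index type and its Lookup method are hypothetical placeholders, and I'm assuming unique join keys to keep it short.

    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Tuple { uint64_t key; uint64_t payload; };

    // Index nested loop join: the index on the inner table already exists.
    // "Index" is a hypothetical type with a Lookup() method.
    template <typename Index>
    void IndexNLJoin(const std::vector<Tuple> &outer, const Index &inner_index,
                     std::vector<std::pair<Tuple, Tuple>> &out) {
      for (const Tuple &t : outer) {
        if (const Tuple *match = inner_index.Lookup(t.key)) {
          out.emplace_back(t, *match);
        }
      }
    }

    // Hash join: the same probe loop, but we build the "index" on the fly first.
    void HashJoin(const std::vector<Tuple> &outer, const std::vector<Tuple> &inner,
                  std::vector<std::pair<Tuple, Tuple>> &out) {
      std::unordered_map<uint64_t, Tuple> ht;
      for (const Tuple &t : outer) ht.emplace(t.key, t);  // build phase
      for (const Tuple &t : inner) {                      // probe phase
        auto it = ht.find(t.key);
        if (it != ht.end()) out.emplace_back(it->second, t);
      }
    }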
But at the end of the day, this is essentially going to be the same thing. So we're not going to talk about nested loop joins other than to think about it at a high level: an index nested loop join is almost the same thing as a hash join, except that you already have the index built.

So the classic debate in database systems is this back and forth between sort-merge joins and hash joins. In the 1970s, with the first database systems, it was assumed that sort-merge join was the best way to do a join. And this is because back then, at least as far as I can tell, they did not know how to do efficient hash joins when the size of the tables you wanted to join and the hash table itself exceeded the amount of memory available to the system. Whereas in the case of sorting, they knew how to do external merge sort back then, and could do it reasonably efficiently. So it was deemed back then that sort-merge was the better way to go. Then in the 1980s, we saw the rise of these database machines that we talked about last class, where inside this specialized hardware that was designed for the database system they added a special co-processor — an accelerator, if you will — to do hash joins efficiently. So then it came out that hash join algorithms were better in the 1980s. Then in the 1990s, Goetz Graefe, of Volcano and Cascades fame, had a paper that basically says: if you try a bunch of different workloads and scenarios, sort-merge join ends up being equivalent to hash join. They get roughly the same performance. But then in the 2000s, and since then, hashing has dominated. And I think this is partly due to more memory, faster processors, and more cores. So in this environment, the hash join, at least from the early 2000s on, has been the dominant join algorithm that we now acknowledge as the superior one. So hashing was dominant in the 2000s and the 2010s, where we're currently at now. The question is really not whether you want to do a sort-merge join versus a hash join — everyone says you should always do a hash join. The question is whether you want to do a partitioned hash join or a non-partitioned hash join, and we'll cover what those are as we go along. And then for the 2020s, in Donald Trump's second administration, who knows what will happen, right? We might be all dead, right?

So in the last 10 years, though, as I said, the history of what's the superior join algorithm has gone through a bunch of phases. And this has mostly been carried on by different academics publishing papers that say this approach is better than that one. The first one in this class came out in 2009. This was a joint paper from researchers at Oracle and Intel, and it's one of the first ones that showed, in a modern hardware scenario, that the hash join was better than the sort-merge join — except that they posited that if you had wider SIMD registers, then sort-merge would actually be superior. If you don't know what a SIMD register is, or what I mean by wider SIMD registers, we're going to cover that a little bit on Monday and then also next Wednesday when we talk about vectorized execution. Back then, I think we had 128-bit SIMD registers, but now we have 512-bit registers with AVX-512. So now we actually have the hardware that they talked about in 2009 that should make sort-merge actually be superior.
But as far as I know, nobody has actually done that investigation yet, now that we have the hardware available to us. In 2011, some researchers at Wisconsin came along and started measuring and identifying the different trade-offs of a partitioned hash join versus a non-partitioned hash join. Then the HyPer guys came along and sort of threw a bomb into the middle of the room and said: no, you guys are all stupid, sort-merge is actually already faster than hash join, even without the SIMD stuff the Intel guys said you needed. That was in 2012. But then a year later they said: wait, ignore what we said last year — you don't want to use sort-merge join, you definitely want to use hashing, and we really mean it this time. And then, also in 2013, there was another paper from the systems group at ETH Zurich, and they showed that you actually want to use a radix hash join — a radix-partitioned hash join, which is what the earlier guys were talking about up here. And then now where we're at: the paper you guys read, from 2016. The guys at Saarland basically said it's actually way more complicated than saying one is always better than the other. There are a bunch of different scenarios, a bunch of different cases, where one approach — and when I say approach I mean partitioned versus non-partitioned — might be superior to the other.

The other key thing about this paper, and why I really like it, is that they actually try to make sure they understand how these join algorithms would work in the context of a real database system. One of the big criticisms of a lot of these other papers — maybe not so much the HyPer papers — is that they're not actually doing the full steps you would do in a real join. They're not actually materializing output tuples: they compute the join, see that there's a match, and then throw away the result. But there are actually extra copying steps you have to do, and these guys actually try to do them.

So what are our design goals for a parallel join algorithm? This is true not just for the hash join, but also for the sort-merge join. These look a lot different from what we talked about in the introduction class, where we were really focused on disk I/O. There, we were choosing join algorithms based on how much data we were going to have to read from and write to disk, given the amount of memory available to us. But now, in an in-memory database system, we assume the input is going to be all in memory, and we're also going to assume that we have enough space to store the hash table and whatever intermediate data structures we need to compute the join — all of that is in memory as well. So what do we actually need to care about?

At a high level there are two goals we want our join algorithm to have. The first is that we want to minimize the amount of synchronization that needs to go on between the different threads that are running to compute our join. In the introduction class, we just talk about these two for loops if you're doing a nested loop join, and you don't worry about whether other threads are computing the join at the same time as you. But now, since we want to parallelize our work across as many cores as possible, we don't want to have to take latches or block on each other as we go along.
We'll have to block on phases of the join, going from one phase to the next, but inside a phase we don't want to do any synchronization or coordination. The second goal is that we want to minimize CPU cache misses. This is a little different from the locality stuff we talked about last class. Now we want to make sure that any time a thread tries to read something from memory in order to do some computation on it, as much as possible that data is already in our CPU caches. You can extend this further and obviously say you also want the data you read to be in memory local to you, but before you even get there, this is more important: you want to try to maximize the cache affinity, or cache reuse, that you have.

Yes? So the question is: what is the difference between trying to maximize cache reuse to minimize cache misses versus maximizing your access to local data? It's sort of like this: you could always access data that is local to you, but you could still be jumping around to different parts of it and having all these cache misses, right? His follow-up is whether the point is that jumping to other places is what we're trying to avoid. Right, and we'll see this when we start doing partitioning. The goal of partitioning is to have everything in local memory. You can think of it as: is it in my NUMA region? But even more fine-grained: is it actually in my CPU caches? Does it fit in my CPU caches? They're related, but we're going to live at a more fine-grained level than what we talked about before, right?

So how do we actually improve our cache behavior? The two factors that are going to affect how many cache misses we have are, obviously, how much space we have in our CPU caches — L1, L2, L3 — but also how much space we have for TLB entries, the translation lookaside buffer. If your threads are just jumping around in memory, ignoring whether it's in the same NUMA region or not — if you're just jumping around reading different locations in memory — then every single time you read some location, it's going to be a cache miss. And if your TLB is full, it's going to be another miss on top of that, because you won't have a cached mapping from the virtual page to the physical page. So essentially, for every bad read you do, you end up paying two misses. And that's going to be much, much slower than having everything local to you in your CPU caches.

So this is affected by how the threads actually access the data, and this can be in terms of the temporal locality of one access to another, and also the spatial locality. Spatial locality would be: am I reading data that is close by to what I just read? Temporal locality would be: if I have to read the same object or location multiple times, I want to do all of those operations together, within the same time window, because then it's more likely to still be in my CPU caches.

So how we can handle this is to deal with the two types of accesses we could possibly have during our join algorithm. First, we have non-random access, and this would be like a sequential scan.
So we can handle this by making sure that we cluster all the data that we're going to scan, in contiguous order, into cache lines, so that when we bring that data into our CPU caches, we do all the operations we have to do on it before it's evicted because we moved on to something else. We never want to go back and read something that we've already processed before, because then we have to bring it back from memory into our CPU cache again. And then there's random access — this is like probing into the hash table during a hash join. Ideally we want to split the data up so that it fits in our CPU caches and our TLB. And, related to this, we do all our processing on one partition — one chunk of data — before we move on to the next one. So these are the goals we're going to try to achieve to get better performance in our hash join algorithm.

As I said at the beginning, the hash join is, in my opinion, the most important operator you can have in a database management system for OLAP workloads. Again, we're not worried about OLTP, we're not doing transactions here — we're really focusing on running complex queries that scan large segments of the database and join multiple tables together. So we're worried about TPC-H and TPC-DS, not TPC-C or TPC-E. So again, in order to get better performance, because clock speeds aren't going up, we have to take advantage of all the additional cores that Intel and AMD are providing us. So we're going to try to come up with a join algorithm that can run in parallel. And the idea is that we want all the cores to always be busy, always crunching on data, without them constantly getting blocked on cache misses where we have to keep pulling data from DRAM onto our CPUs.

To give you an idea of how important the hash join is in a data warehouse system: this is a measurement I had a student do a few years ago, where we were working with Impala — working with the guys at Cloudera — doing some profiling on their system, trying to understand where they spend their time when they execute queries and to identify some opportunities to optimize things. So this is running the TPC-H workload, all 22 queries, on their system, and then breaking it down by the major operators you would execute: hash join, sequential scan, union, aggregation, and then everything else as miscellaneous. For this we're measuring the CPU time spent in each of these operators. Impala is a distributed data warehouse that's designed to run on HDFS, so there is some network transfer and network communication between the nodes in order to execute these operators; we just ignore that here — it's simply the CPU time. So in this case, the hash join ends up being about 50% of the total execution cost of queries across this whole workload. This is why we're going to focus on the hash join and try to make it go as fast as possible: it's the long pole in the tent.

Now, the paper you guys read had this little graph at the end where they take one query in TPC-H, Q19, which is just a two-way join, and they say these black bands here represent the non-join part of the query execution. The little colored parts are the time they say is actually spent doing the hash join; everything else is the other parts of the query. I think this is an anomaly.
We haven't double-checked these results in our own system. I think this would be correct for a two-way join, but for more complex queries, especially in TPC-DS, there are way more joins, and I suspect these numbers would be different. So this is what they reported, not what we've verified — I just want to be up front about that.

So let's talk about what the hash join looks like. A hash join can be broken up into three phases. In the first phase, you do partitioning, and this is an optional phase — you don't have to do this, and everything else would still be correct — but the idea here is that you take your two input relations, say you're joining tables R and S, and you divide them up into partitions based on their join key, and then later on, in the build phase, the threads process those local partitions. Think of this like the morsel stuff we talked about last class. Then, in the build phase, you scan the outer relation — in this case R — and you build the hash table for that relation based on the join key. Then, in the third phase, the probe phase, you do a sequential scan on the inner relation S, take its join key, hash it the same way you hashed it in the build phase for the other table, and probe into the hash table to see whether there's a match. If so, you combine the portion of R with the portion of S, and you spit that out as your output. It's pretty straightforward. We're going to go through each of these phases one by one.

As I said, in the partition phase the idea is that we want to split both the inner and the outer relations into these partition buffers based on their join key. The idea here is that for a given partition number from both tables — say partition one — all the tuples with the same join key will exist in that first partition of the outer table and the first partition of the inner table, and then you never have to go check anything from the other partitions, because you've already done this first pass to bucket them. You haven't matched them yet — you still have to do the build and probe phases — but you know you don't need to check any other part of the hash table.

So this seems counter-intuitive, right? We're going to scan the data once in both relations, we're going to hash it, but then we're just going to copy the result into these partition buffers. We're not actually going to do any joining or build the hash table here. It seems counter-intuitive because you're scanning the data and just preparing it without actually doing the join, but the argument — the intuition for why this might actually work — is that if the cost of partitioning is less than the cost of all the cache misses we would otherwise incur during the build phase, from the threads just reading whatever they want, then this actually will be better, right?

Yes? So the question is: is it possible for a join key to span multiple partitions? No — the same join key always hashes to the same partition. What can happen is that a partition buffer fills up, so you have to be able to extend it; I think I'll have a diagram in a second. You have to be able to extend it, right? Because if you put tuples with that key into another partition, then you can't find them, right? And you would have a false negative.
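As a rough sketch — my own illustration with a placeholder layout, not code from the paper — here is the core of the partition phase for one input table. Tuples with equal join keys always land in the same partition number, so partition i of R only ever needs to be joined with partition i of S.

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    struct Tuple { uint64_t join_key; uint64_t payload; };

    // Split one input table into N partition buffers by hashing the join key.
    // Both R and S must be partitioned with the same hash function.
    void Partition(const std::vector<Tuple> &input, size_t num_partitions,
                   std::vector<std::vector<Tuple>> &partitions) {
      partitions.assign(num_partitions, {});
      for (const Tuple &t : input) {
        size_t p = std::hash<uint64_t>{}(t.join_key) % num_partitions;
        partitions[p].push_back(t);  // same key always maps to the same partition
      }
    }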
All right. So sometimes this approach is called the hybrid hash join, or a partitioned hash join. One of the things we'll see with these joins is a bunch of different names that people use that essentially mean what I'm describing here. So if you do partitioning, it's called a hybrid hash join.

What we actually put in our partition buffers can depend a lot on our storage model. We're not going to talk about the trade-offs between early materialization and late materialization at this point — maybe on Monday I'll talk about that. But the basic idea is that if we are a column store, then doing this partition phase is actually relatively cheap, because we can just rip through our join columns, do the hashing, and put the values into partitions without reading unnecessary data. But if you're a row store, you essentially always have to read the entire tuple, right? You have to jump to the offset, and that's going to bring in a bunch of crap you maybe don't even need for the join. So in some cases — in addition to the question of whether I'm going to save on cache misses during the build phase — this part can be additionally expensive too, if you end up reading more data than you actually need because you're a row store versus a column store.

All right, and this is another example — I think we'll come across this multiple times in today's lecture — where the query optimizer could potentially figure this out for us, because it knows what join we're doing, it knows what the distribution of the data looks like and the selectivity of our join. So it could potentially identify, in some cases, that you actually don't want to partition, because you're not going to get the benefit you think you could. And we'll see later on that when you have skewed workloads, this partitioning approach actually doesn't work, because you end up with one partition that has all of the tuples, or a large fraction of them.

So there are two ways to implement the partitioning phase. The first approach is what I'll call non-blocking partitioning, where you can have one set of threads scanning through each table, hashing the keys and putting them into partitions, and another set of threads being fed those partitions, reading the tuples out, and starting the build phase — building the hash table. When I say non-blocking, I don't mean it in the sense we use when we talk about concurrency control, where readers don't block writers and so forth. I just mean that we don't have to completely finish the partitioning phase before we're allowed to start the build phase. We contrast this with the blocking partitioning approach, where we may end up scanning the table multiple times, and we only materialize the results once each pass is done. This is called blocking partitioning because you can't start the build phase until the partitioning phase is done. One of the approaches we'll show for doing this is called radix partitioning — sometimes you'll also see in the literature that people say they're using a radix hash join, which means they're using this approach here. So I'll go through examples of both of these.
So for non-blocking partitioning, there are two sub-approaches — two design decisions we have to deal with. The first approach is shared partitions, where all the threads are running and writing into the same set of partition buffers. That means we now have to make sure we don't mess up the integrity of the data structure, so we have to use compare-and-swap or latches to make sure the threads don't interfere with each other. The second approach is private partitions, where every thread has its own set of sub-partitions. You do one pass over the input table to populate those private partitions, and then you go back through, and each thread picks one partition, gathers it from all the different threads' private partitions, and writes it out to a global partition. And again, I'll go through examples of both of these, but this is a good example of how there's essentially no free lunch. In the same way we saw with the Bw-Tree and lock-free data structures — where you end up doing more work in order to claim you're lock-free or latch-free — the same thing happens here: if you don't want to synchronize your threads, you end up doing more work and executing more instructions by making a second pass over the data for your partitions.

All right, for shared partitions: let's say we have a single table here, and we're going to do the same thing we did with morsels, where we divide it up into disjoint subsets and have a thread running on a core be in charge of partitioning the data in its subset. Let's say that for whatever query we're executing, the join key we need to join this table with another table is this column B here. So each thread is going to scan through, look at every tuple, take the value of that attribute, and run it through a hash function — and the hash function has to be the same for all cores — and then they're going to fill in these global partitions. Right, so assume we have n partitions: the first thread here will look at its first tuple, hash B, and that tells it whether it goes to partition 1, 2, 3, or n, and then all the other cores are essentially doing the same thing. And again, because these are global partitions — because they're shared — we have to use some kind of synchronization primitive when we actually insert into our partitions, to make sure we don't interfere with each other, and this can just be a compare-and-swap. Note that this is not a hash table — these are just buffers, blocks of memory — and the things we're putting in them are fixed-length, so we always know how to jump to some offset to find the next location to write into.

Yes? His question is: is this NUMA-inefficient? Yes, because this portion of the data — think of it like morsels — could be in one NUMA region and that portion could be in another, and this thread will be processing its local data, but when it hashes a tuple, it could write it out to anywhere. So yes, we pay the penalty of going over the interconnect right there.
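Here's a minimal sketch of that shared-partition write path — my own illustration, with a made-up buffer size and no overflow handling. An atomic fetch-add is one way to realize the compare-and-swap idea: each writer claims the next free offset without ever taking a latch.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Tuple { uint64_t join_key; uint64_t payload; };

    // One shared (global) partition buffer. Writers claim the next write
    // offset atomically, so threads hashing into the same partition never
    // block each other. Entries are fixed-length, so the offset is enough.
    struct SharedPartition {
      std::vector<Tuple> slots;
      std::atomic<size_t> next{0};
      explicit SharedPartition(size_t capacity) : slots(capacity) {}

      void Append(const Tuple &t) {
        size_t pos = next.fetch_add(1, std::memory_order_relaxed);  // claim a slot
        slots[pos] = t;  // write without holding a latch
      }
    };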
So now, with private partitions, as I said, at least initially we're not going to have that global shared partition space. Every thread is instead going to have its own local set of partitions, which could be in the same NUMA region, because each thread writes only into its local partitions. But the problem is that all our partitions are now subdivided into these smaller subsets across all the different cores. So in the next step, we need to combine these together, and the way we do that is to have each core be responsible for one partition group. The first thread here will be in charge of partition one, so it has to go into each thread's private memory region, find where partition one is, and write it out into the global partition one. And again, same thing: this combined output partition could be in your local NUMA region, but you're reading from disparate locations.

Another thing to say here — about this example, and actually partitioning in general — is that this is a good illustration of the issue of what you actually put into these partitions. You could just copy the entire tuple, or you could be a little bit smarter and say: well, I only need to join on B, so maybe I just store the join key here, plus the offset of where the tuple lives in my fixed-length data blocks. Depending on how your storage is set up and what you actually put in here, this copying can be really expensive or really cheap — but if you're not copying everything over, you may have to go back into the data table and read everything again. So again, it depends on what the join looks like and what the data looks like.

All right, so let's talk now about a blocking approach: radix partitioning. The idea here is that we're going to scan our input relations multiple times, and we're going to generate our partitions in a more sophisticated manner than the join-key hashing I showed before. Each pass of radix partitioning has three steps. In the first step, we scan through our table and compute a histogram of the number of tuples whose hash keys map to each radix value at a given digit position. If you remember the radix tree stuff from when we talked about indexes: a radix is just a digit in a key. If my number is 123, then 1 is a radix, 2 is a radix, 3 is a radix. Contrast this with the radix tree, which was based on the actual bits themselves — for us here, we just care about the digit at a given position. So we compute a histogram in the first step. Then, in the second step, we take this histogram and compute the starting location where each thread will begin writing data within each partition, based on a prefix sum — and I'll show what a prefix sum is in a second. And then, once we have these starting points, we scan through our table again and partition the tuples based on the hash key, using the prefix sum to tell us where each starting point is. So again, I'll go through these one by one. The radix and the prefix sum are really easy to understand, and then you put it all together to actually do the radix partitioning, and it's actually really easy.

All right, so again, the radix is just the value of a particular digit in an integer at a particular position. If I have keys 89, 12, 23, 08, 41, and 64, the radixes of the first position are just the last digits: 9, 2, 3, 8, 1, 4. And the radixes of the second position are the first digits: 8, 1, 2, 0, 4, 6. Again, this is different from the radix tree, which was based on the bits of the actual integer key — here we're basing it on the base-10 digit positions.

And now, the prefix sum is just a running summation over a sequence of numbers. The prefix sum, as we define it, starts with a sequence of input numbers and outputs a second sequence of numbers, which is the running total of the values going in one direction: you add the sum at the previous position to the value at your current position. So if I start with 1 — this is my first position here — the prefix sum is just 1, because nothing came before it, so there's nothing to add it to; it's essentially 0 plus 1 equals 1. But now, for the second position, 2, I take the prefix sum of the previous position and add it to my current value: 1 plus 2 equals 3. And I just keep doing this down the line, adding them all together, and now I have my sequence of prefix sums. Is this clear? What we'll end up doing is using this to say: all right, I know there were three entries before me, so the starting point where I actually begin writing data is given by this value.

So, to do the radix partitioning — and to keep it simple, say these values are the hashes of the keys, after we've already hashed them, not the actual join keys themselves — in the first step, each thread comes along and scans through the data it's responsible for. Think of this as the dividing line: the first thread will look at the first four keys, the second thread will look at the second four keys, and they're each going to create a histogram that tells us how many keys correspond to each partition. So we start with the radix in the first position here, and we just look at each one and say: this goes to partition 0, this goes to partition 1, partition 1, partition 0. We just go through, read all of these one by one, and compute our histogram. This now says that at thread 0 — CPU 0 — partition 0, based on the first radix, will have two entries, and partition 1, again based on the radix, will have two entries. CPU 1 will have one entry for partition 0 and three entries for partition 1.

So now we can compute our prefix sum over these histograms, and this is going to tell us where each thread should start writing the data for the entries that fall in each partition. For partition 0 at CPU 0, the prefix sum is 0, because nothing came before it, so we start there. For partition 0 at CPU 1, CPU 0 had two entries before us, so our starting position is 2. Then partition 1 at CPU 0 starts after all of partition 0, and so on. So now each thread can scan through its chunk and use these starting points to write into this one giant partition space without coordinating with the other threads. Everyone has to block until the histogram has been computed, but once you have that, the threads can go ahead and start writing, because each one knows nobody else will be writing to the same locations. All you need to do is maintain an internal counter that says: well, if I'm writing another tuple into the partition range where I started before, how many things have I already put in there?

And we can do this recursively — keep partitioning so that we generate partitions of the right size; typically you want them to be able to fit in your CPU caches. So we keep doing essentially the same process all over again, except that where the first pass started at the first radix position, the second pass starts at the second radix position. We just do the same thing: scan through, split everything up, and generate another group of partitions. So, is this clear? Blank faces.

Yes? The question is: do we need to do a modulo operation? What do you mean by that? Oh — yes, you have to know how many partitions you have ahead of time, so I'm over-simplifying this. You essentially take the hash key and mod it by the number of partitions you have: so 0 mod n, 1 mod n, and the same thing over here, 7 mod n. Yes, you do that. I should add that.

So at this point — again, the partition phase is optional — I've shown two ways to do this: the non-blocking way, with either shared partitions or private partitions, and the blocking way, using radix partitioning, after which each thread can operate on memory that's local to it as you go along. I would say that in theory you could keep doing this recursively, over and over again, for every single radix position, but in the literature nobody does more than two passes. And I'll also say, as a spoiler: in practice, nobody actually does this in a real system, as far as we know.
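To tie the three steps together, here's a compact sketch of one radix-partitioning pass — a minimal illustration under my own assumptions (one pre-hashed key array per thread, a made-up RadixPass function, and sequential loops standing in for the parallel threads), not code from any of the papers:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using Key = uint64_t;  // already-hashed join keys

    void RadixPass(const std::vector<std::vector<Key>> &chunks,  // one chunk per thread
                   size_t num_partitions, std::vector<Key> &out) {
      size_t nthreads = chunks.size();
      auto radix = [&](Key h) { return h % num_partitions; };  // digit of the hash

      // Step 1: per-thread histograms (run by each thread over its own chunk).
      std::vector<std::vector<size_t>> hist(nthreads,
                                            std::vector<size_t>(num_partitions, 0));
      for (size_t t = 0; t < nthreads; t++)
        for (Key k : chunks[t]) hist[t][radix(k)]++;

      // Step 2: exclusive prefix sum, ordered partition-major then thread,
      // so all of partition 0 (every thread's share) precedes partition 1.
      std::vector<std::vector<size_t>> cursor(nthreads,
                                              std::vector<size_t>(num_partitions));
      size_t total = 0;
      for (size_t p = 0; p < num_partitions; p++)
        for (size_t t = 0; t < nthreads; t++) {
          cursor[t][p] = total;  // this thread's write start for partition p
          total += hist[t][p];
        }

      // Step 3: scatter; each thread owns disjoint output ranges, so the
      // threads never need latches or compare-and-swap while writing.
      out.resize(total);
      for (size_t t = 0; t < nthreads; t++)  // in a real system, in parallel
        for (Key k : chunks[t]) out[cursor[t][radix(k)]++] = k;
    }

The point to notice is that after the prefix sum, every (thread, partition) pair owns a disjoint output range, which is exactly why step 3 needs no synchronization.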
All right, so now we get to the build phase. As I said, our threads are now going to scan over the outer table in our join — either the actual base table itself or, if we ran the partitioning phase, the partitions of tuples — and for every single tuple, we hash it on the join key and put it into the appropriate bucket or slot in our hash table. Ideally we want our buckets to be only a few cache lines in size, so that any time we have to read a bucket we're not paying a huge penalty for reading a lot of data. We try to keep things within cache lines so we can come in and out very quickly.

So the major design decision we have to deal with in the build phase is how the hell we're actually going to build our hash table, and the two design decisions within that are what hash function we're going to use and what hashing scheme we want to use. The combination of these two things essentially defines what a hash table actually is.

In the first case, the hash function: the basic idea here is that we want to map a potentially large key space into a smaller domain, because we don't want to have to allocate a slot for every single possible key that could ever exist in the domain of the join key. The hash function is a way to condense this. There's a classic trade-off between speed and collision rate that we're going to have to deal with when choosing a hash function. The way to think about this is that the fastest hash function you could ever possibly have is "return 1" — no matter what key you give it, just return 1. It's the fastest possible thing you can do. The problem is that your collision rate will be terrible, because every possible key hashes to 1, and as we'll see when we talk about hashing schemes, having a really bad collision rate is expensive, because now you're going to essentially generate these linked lists, or do these long scans, to actually find the key you're looking for. So if you have a really bad collision rate, everything ends up being a sequential scan, which is the thing we wanted to avoid in the first place by building a hash table.

In the hashing scheme, the main idea is deciding how we're actually going to handle key collisions after we've hashed the keys with our function. And again, this is a trade-off between allocating the largest possible hash table you could ever have, so that there's never a collision, versus having to execute additional instructions when we have collisions because the hash table is smaller. Again, using my example from before: if I hash everything to 1, then my hash table puts everything in the same slot, and therefore I have to do a linear scan to find the thing I actually want. Another way to think about hash functions — the other extreme from hashing everything to 1 — is what's called a perfect hash function, where every unique key gets a unique location. This is more of a theoretical thing; in theory you could build this using another hash table, but it would be expensive to do, and every single time you add a new key to the domain, you might have to rebuild your hash function. So nobody actually does that either. We're trying to approximate things, and it ends up being a trade-off between these issues.

So, in terms of selecting a hash function, the key thing to remind ourselves is that we're doing joins here, so we don't care about security. I often say on video that I don't care about security — one day it's going to burn me, that's fine — but we don't care about any kind of cryptographic guarantees in our hash function. We just want something that is fast and has a low collision rate. So that means we're not going to want to use SHA-256 or MD5 or those sort of classic cryptographic hash functions. It's really just something we do internally, and we don't worry about leaking information.

Yes? So the statement is: if we did care about security, using a cryptographic hash function here would not actually help us. If you never leak the hash outside the system, then yeah, it doesn't help you either way. I mean, some of these other hashing schemes are meant to be two-way — like, for a given encryption key, I can take a value, transform it, and actually reverse it. We don't care about that; we only care about one-way hashes.

So this is a summary of the major hash functions that people use now in existing systems. We actually reached out to a bunch of different companies and asked them what they use in their systems. Oracle wouldn't tell us — it's something proprietary, internal. MurmurHash actually turns out to be the most popular, or CRC, which I should have included — that's fine. So MurmurHash was a hash function put out in 2008 by this random dude on the internet. He just put the source code out, and then people picked it up and said, oh, this is actually pretty good for a lot of different things, and in particular it was pretty good for joins. So then Google took some of the ideas from MurmurHash2 and created something called CityHash, and what was interesting about CityHash is that it was tweaked so that it could be faster at hashing short keys — keys that are less than 64 bytes. Google wanted to do this because, for some of their application domains, they had keys smaller than that size and they could get bad performance. They were very careful about how they aligned things and how they actually do the digest, and so they came up with a hash function derived from MurmurHash that had better performance for this case. And then later on they had a newer version of CityHash called FarmHash, which was designed to have better collision rates. Google also released another hash function called HighwayHash in 2016; as far as I know, you don't want to use this for joins, because it has sort of cryptographic guarantees about being difficult to infer anything from the keys, and we don't care about that for our joins.

So MurmurHash is often used in systems, CRC is also used as well, CityHash is good for some systems — I think one system uses it — but then nobody uses these other ones here. The last one here is CLHash, which I think is kind of interesting. It was created by a researcher in Canada, Daniel Lemire, and what's really interesting about it is that it's based on a different kind of math than all these other hash functions: something called carry-less multiplication. The interesting thing about it is that it's only in the last few years that AMD and Intel added instructions that make carry-less multiplication actually perform really well. So now that they added these new instructions, it's possible for something like CLHash to get better performance than some of these other ones for some situations.

So, this is a benchmarking framework I found on the internet a few years ago and then tweaked to support CLHash and some other things. I actually ran this last night on my brand-new — as of January 2018 — workstation, a Core i7-8700. The experiment compares a bunch of different hash functions computing hashes for keys of different sizes. The y-axis here is the throughput in terms of the amount of data we're processing per second — it's not the number of hashes we're doing. What you see here is that for smaller key sizes, everyone's pretty much the same. MurmurHash and the standard library hash are roughly the same, although MurmurHash is a little bit slower. But then you see these giant sawtooth patterns for CityHash and FarmHash, where they do really, really well, then drop down and get worse, then do really well again, and then they have a longer sawtooth pattern going this way. Anyone want to take a guess what these different spikes represent? What's that? Right, yeah — these are key sizes at certain byte boundaries, and everything lines up nicely with cache lines up to this point; then you drop off and have to allocate more memory for the buffers, right, and then it climbs again up to here. And I actually don't know why the pattern changes — yeah, 64 bytes, after that point.

So yes, his question is: are these things so fast that you wouldn't even bother trying to memoize anything, because that would just waste extra space? Absolutely, yes. As far as I know, nobody memoizes hashes to do hash joins.

If you throw CLHash into the mix, it's much more interesting. So again, you see this sort of spiky pattern — I guess that's based on alignment — and when you're less than 64 bytes it's slower, but beyond that, for larger key sizes, it does really, really well. The practicality of this, I think, is limited, because typically when you're doing joins in an OLAP database, most of the time you'll be joining on compressed data, like dictionary-encoded codes, so you're going to be dealing with 8-byte values. It's very rare, I think, that you would actually do joins on really large keys. But this is just showing you that there are some interesting patterns for larger key sizes.
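As a concrete example of the kind of cheap, non-cryptographic hashing these systems rely on, here is the 64-bit finalizer ("fmix64") step from MurmurHash3, which is often used on its own as an integer hash. The constants come from the public-domain reference implementation; the SlotFor helper is just my own illustration of mapping the 64-bit hash to a table slot with a modulo.

    #include <cstddef>
    #include <cstdint>

    // MurmurHash3's 64-bit finalizer: a few shifts and multiplies that mix
    // all the input bits into the output. Fast, non-cryptographic.
    uint64_t fmix64(uint64_t k) {
      k ^= k >> 33;
      k *= 0xff51afd7ed558ccdULL;
      k ^= k >> 33;
      k *= 0xc4ceb9fe1a85ec53ULL;
      k ^= k >> 33;
      return k;
    }

    // Map a join key to a slot: 64-bit hash, then mod by the table size.
    size_t SlotFor(uint64_t join_key, size_t num_slots) {
      return fmix64(join_key) % num_slots;
    }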
All right, so, yes — the question is how large the output of the hash is: 32 bits or 64 bits? It's typically 64 bits, and then you mod that by the number of partitions, or the number of slots in your hash table.

So now that we have a hash function, we have to put the result into something, right? This is what people typically call the hash table; in the paper they call it the hashing scheme. The idea here is that it's the method — the heuristics or the protocol — our hash table uses when we have a collision: when two things hash to the same location in our hash table, how do we actually deal with that? We'll talk about chained hash tables and linear probe hash tables, which are the most common ones, and then we'll talk about two variants of linear probing called Robin Hood hashing and cuckoo hashing.

So chained hashing is what most people think of when they think of a hash table: you have these slots that have pointers to essentially a linked list of buckets, and when you hash something, you land in a slot, follow the pointer to the first bucket, and then find a free spot in that bucket to store your tuple. The idea here again is that when we have a collision, we just keep appending to the bucket, and if a bucket in our linked list gets full, we go on and make a new one. The way a probe works — to figure out whether a key actually exists — is that you hash to the bucket location and then do a linear scan, a sequential scan, of every single tuple in the buckets until you find the one you're looking for, or you reach the end, in which case you know it's not there. An insertion is basically the same process. This is what you get when you use a std::unordered_map in C++, and if you use Java's java.util.HashMap, you get a chained hash table like this. At a high level the diagram looks like this: again, we take our key, hash it, mod by the number of slots in our table, and that slot has a pointer to the first entry in our list of buckets, and then we just go find either a free slot, if we're doing an insert, or the key we're looking for, as we go along. It's pretty basic to understand.

The other approach comes from a broader class of hashing schemes called open addressing, and the most prevalent one is linear probe hashing. Instead of having these linked lists of buckets, we're just going to have a giant array — a giant table of slots. When we hash our key, we mod by the number of slots, and that takes us to a position in this hash table. If that position is empty, we can go ahead and put our key right in there. If it's full, we iterate down through the positions until we find a free slot. So now what happens is that when we want to find our key, we actually need to do a comparison to make sure we're actually seeing the key we're looking for. That means we need to store some extra metadata in the hash table, so that we know we're actually looking at the thing we expect to see — because something that hashed to another position could be sitting in the position we're looking at, right? And then we also need to keep track of which key we started off with, because we could potentially scan through to the bottom, wrap around to the top, and get back to where we started, and if we don't know that we've already gone through and seen everything, we could do this infinitely. So we need a mechanism to tell us when to stop.

At a high level it looks like this. So let's say we have six keys we want to put in here. Again, the first step: we start with A, hash it, mod by the number of slots, and that tells us where we actually go. And inside of our hash table, we store the original hash value of the thing we just hashed as well as the original key value — and we do this because we need to know when we're actually seeing the thing we're looking for and when we've gone far enough that we're not going to see anything else. B hashes like this. In the case of C, it now hashes to the same location as A, so again, since A already occupies it, you leave it alone and go down to the next slot and put your key in there. Same thing for D: it wants to go where C is, so it goes down to the next one. E goes here — it wants to go where A is, so we go all the way down and put ourselves down here. And then F goes here, and it goes there. Right, so again, for a lookup: if I want to look up E, I do a hash on E, land here, and scan through, saying I haven't found the thing I'm looking for yet, and I know I haven't gone past where it could be yet, right? So I keep scanning down, doing a linear scan, until I find the thing I want. And if I was looking for a key that wasn't in this list here, once I reach an empty slot, I know my key can't be in there, so I can terminate my search. So actually, for this one, you don't need to synchronize with a latch or anything: you just do a compare-and-swap to try to take a slot. If you succeed, you're in; if you don't, you go down and move to the next one.

The downside of linear probe hashing is that we're going to have a lot of wasteful computation if we have a lot of collisions: we end up with these really long sequences of slots with different keys in them, and then when we land and look for the key we want, we may have to do a bunch of key comparisons to find the thing we actually want. So one way to reduce this, again, is to just have a really large hash table so that our collision rate is low — but then you end up wasting space. And this is also why, when we saw this in the join ordering benchmark paper, they talked about how Postgres, if it underestimated the selectivity of the join operator and therefore the size of its output, would allocate a hash table that ended up being too small and then had to go ahead and resize it. They're using a linear probing approach, where essentially all the slots get full and you can't insert anything else into it, so you have to double its size and make it bigger — which, again, means you have to block all the threads from accessing it while you're doing this, because you don't want any false negatives or false positives.

So let's look at some other approaches to deal with these collisions. Linear probing is super simple: you check to see whether a slot is occupied, and if it is, you go on to the next one.
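Here's a minimal single-threaded sketch of linear probe insert and lookup — my own illustration, with a made-up slot layout (stored hash plus key so probes can compare) and no resizing, so it assumes the table never completely fills up:

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>

    // Each slot stores the hash and the original key, so a probe can tell
    // whether the occupant is actually the key it is looking for.
    struct Slot { bool occupied = false; uint64_t hash = 0; uint64_t key = 0; uint64_t value = 0; };

    struct LinearProbeTable {
      std::vector<Slot> slots;
      explicit LinearProbeTable(size_t n) : slots(n) {}

      void Insert(uint64_t hash, uint64_t key, uint64_t value) {
        size_t i = hash % slots.size();
        while (slots[i].occupied)          // keep walking until a free slot
          i = (i + 1) % slots.size();      // wrap around at the bottom
        slots[i] = {true, hash, key, value};
      }

      std::optional<uint64_t> Lookup(uint64_t hash, uint64_t key) const {
        size_t i = hash % slots.size();
        while (slots[i].occupied) {        // an empty slot ends the search
          if (slots[i].hash == hash && slots[i].key == key) return slots[i].value;
          i = (i + 1) % slots.size();
        }
        return std::nullopt;               // reached a hole: key can't be here
      }
    };

A concurrent version would claim a slot with a compare-and-swap instead of the plain store, as described above.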
One approach to being even more sophisticated about your collisions is called Robin Hood hashing. This comes from a paper from 1985 — it's an older technique, but it showed up on Hacker News a couple of years ago, so it's back in vogue. The basic idea is that we do linear probing just like before: we have a giant array of slots, we try to hash to a location, and if it's taken, we move on to the slots below it. But now we also keep track of the number of positions — the number of steps — each key is away from its optimal location, meaning where it would have landed from the hash alone. Then, as we're inserting, if we notice that the key we're trying to insert is farther away from where it should be than some key that already occupies a slot, we go ahead and steal that position: we take the occupant out and put it down below somewhere else. It's called Robin Hood because you steal slots from the rich keys — the ones that are closer to where they should be — and give them to the poor keys, the ones that are farther away.

So let's look at an example. We start with A: we hash A, and we store the hash value of our slot and the original key, but now we also include this count of jumps from the optimal position — where it should be. In A's case, we're zero steps from where we should be, because we're exactly where we hashed to. Same thing for B up here: it hashed right into its slot, so its offset is zero. So now, when we hash in C, we do a comparison and say: well, A is already in here, so is it richer or poorer than I am? In this case, C hasn't been put into the table yet and hasn't jumped down any positions, so its current offset is zero, and A's offset is zero. So we leave A alone and move C down here, but now we set its offset to one, because it's one jump away from where it should have been. Then we get to D. D wants to go in here, but C has an offset of one, and that's greater than D's — D hasn't gone into the table yet, so its offset is zero — so D ends up going down here. So now we get to E. E wants to hash in where A is, so again we do the same thing as in the beginning: A's offset is zero, E's offset is zero, so we leave A alone. Now we get to the next step: C's offset is one, E's offset is one, so they're equivalent, and we leave C alone. Then we get to D. D's offset is one, but E is now two jumps from where it should be, so E's offset is two. So E is considered poorer than D, and it's going to club D over the head and steal its position. E ends up going in here with offset two — because it's zero, one, two — and D ends up getting put down here, now also with offset two. And the idea here is that we're trying to amortize the length of the scans we have to do to find particular keys, by averaging out the distance from every key to its optimal position. In the case of F: F wants to go over here, D's offset is two, F's at this point is zero, so F has to go down in here, right?

Yes? The question is about how large the table needs to be. In practice, on average, if you have twice as many slots as the number of keys you could possibly have, then there's a theoretical result that minimizes the number of steps you have to take. Yes? Is this a probabilistic data structure? In some ways — you're hashing and you're landing somewhere. It's not probabilistic, but it's not an exact thing like a linked list, where you know where everything should be, because the hash function is going to take a key domain and map it down to a smaller domain, and the probability that you're going to have a collision depends on a whole bunch of different things.

So this seems kind of cool, right? The problem is that in practice, for joins, this actually doesn't work out very well, because you end up with more branch mispredictions — every time you do an insert, you have to figure out what's going on and do these evaluations to decide whether you're going to steal a slot from somebody else — and you end up doing more copying, because you could end up shuffling around multiple positions for what would otherwise have been just one insert. So the current literature shows that for joins on modern CPUs, Robin Hood hashing is not actually a good approach. There was a group this semester telling me about how they were using Robin Hood hashing, I believe, for their join algorithm. We asked them why; I think they said they saw it on Hacker News and thought it was a good idea. And that's fine — I said it seems like this should work — but again, for current CPUs, it doesn't.
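Here's a sketch of the Robin Hood insert logic just described — again my own minimal illustration (the slot layout omits the stored hash and value, and there's no resizing), not code from the 1985 paper:

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Each slot remembers how many steps its key is from its optimal position.
    // (A real slot would also store the hash and a value/payload.)
    struct RhSlot { bool occupied = false; uint64_t key = 0; uint32_t dist = 0; };

    void RobinHoodInsert(std::vector<RhSlot> &slots, uint64_t hash, uint64_t key) {
      size_t i = hash % slots.size();
      RhSlot incoming{true, key, 0};          // dist 0: at its optimal slot
      while (slots[i].occupied) {
        if (incoming.dist > slots[i].dist) {  // occupant is "richer" than us:
          std::swap(incoming, slots[i]);      // steal its slot; it probes onward
        }
        i = (i + 1) % slots.size();           // move down, wrapping at the end
        incoming.dist++;                      // one more step from optimal
      }
      slots[i] = incoming;                    // found a free slot
    }

Run against the lecture's example, this reproduces the walkthrough: equal distances leave the occupant alone, and an evicted key carries its distance along as it keeps probing.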
So let's look at an example. In this case I have two hash tables, and the first thing I want to do is insert key x. We're going to take two hash functions that have to be effectively independent; what I mean by that is they can both be, say, MurmurHash or CityHash, you just seed them with different random values so that each one has a different mapping from a key to a location. So if we hash x, the first hash function lands us in this position here, the second hash function lands us in this position here, and you flip a coin and decide to write it over here. Now I want to insert y: same thing, I hash it twice. The first hash function maps to this position in the first hash table, but x currently occupies it; in the second hash table, though, the slot is free, so we can go ahead and put it there, no problems. Now let's look at the case where we have collisions on both sides. We want to insert z: the first hash function maps us to where x is, and the second hash function maps us to where y is. So now we need to steal one of these two positions, replace the current occupant with our key, and then figure out where to put the key we just took out. Let's say we decide to steal from the second hash table: we take y out, put z in there, and now we have to figure out where to put y in the first hash table. We use the first hash function to map it to a slot, but as I showed in the beginning, when we tried to insert it the first time, the first hash function maps it to the position where x is stored. So we do what we did before: y steals x's position, and now we have to take x and throw it into the other hash table. We hash it, and it ends up over here. At this point our insert is done, because z, y, and x are all placed in the hash tables. Is this clear?

Just like with linear probe hashing, you have to make sure you don't get stuck in an infinite loop. In my example here, if x had gone back around and wanted to go where z was, I would take z out and put x in, but now I'm back where I started: I need to do something with z, and z maps to positions that are already occupied. So you have to detect when you're stuck in a loop, and if you are, you stop the world and rebuild your hash tables, possibly making them bigger, using two new hash functions, and reshuffle everything; hopefully the second time around you won't get stuck in an infinite loop again. The way the theory works out is that with two hash functions, so essentially two hash tables, you probably don't need to rebuild the table until you're roughly 50% full; with three hash tables and three hash functions, you probably don't need to rebuild until you're about 90% full. But then you're paying the storage penalty of having three hash tables to avoid having to rebuild things.

What I'll say is that, as far as I know, nobody actually uses Robin Hood hashing or cuckoo hashing for hash joins; everyone pretty much does linear probe hashing, because it's just so fast in most scenarios.

So now we get to the probe phase, and the probe phase is pretty straightforward: you just have your threads scan through the inner relation, either the partitions or the base table, and then you hash the join key, probe your hash table, and if you have a match, it gets emitted out from the operator as part of the join output. Actually, one thing here is not quite true: you don't have to synchronize on the cursors if you just divide everyone up to start at different parts of the inner relation; it's not like you have to synchronize at every single step.
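As a sketch, the per-thread probe loop is just a tight scan over that thread's chunk. The tuple layout and the use of std::unordered_multimap are placeholders of mine; a real system would use its own open-addressing table, but the shape of the loop is the same.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical tuple layout for the probe-side (inner) relation.
struct Tuple { uint64_t join_key; uint64_t payload; };

// Each worker thread probes its own contiguous chunk of the inner
// relation against the shared hash table built on the outer relation.
// The table is read-only during the probe phase, so no latches needed.
void probe_chunk(const std::unordered_multimap<uint64_t, uint64_t> &ht,
                 const Tuple *inner, size_t begin, size_t end,
                 std::vector<std::pair<uint64_t, uint64_t>> &out) {
    for (size_t i = begin; i < end; ++i) {
        auto [lo, hi] = ht.equal_range(inner[i].join_key);
        for (auto it = lo; it != hi; ++it) {
            out.emplace_back(it->second, inner[i].payload);  // emit a match
        }
    }
}
```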
There's one optimization you can do for the probe phase: during the build phase, you also build a Bloom filter on your join keys, and then in the probe phase, as you're scanning through the inner relation, you check the Bloom filter first to see whether you might have a match. If not, you know you don't even need to check the hash table, because while the Bloom filter can have false positives, it will never have false negatives; if it does report a match, then you also go check and try to do the join in the table itself. In the literature this is sometimes called sideways information passing, because we're basically taking information from one side of our query plan and passing it along to the other side as a lateral move. So let's say I'm joining A and B: I do my build phase on A and my probe phase on B. In the build phase I generate my hash table, but I'm also going to generate this Bloom filter, and when I'm done, I pass it over to the probe side, and B always checks the Bloom filter first for each join key; only if it gets a hit does it also go do the actual check in the hash table itself. The idea here is that because the Bloom filter is really small, it can fit in our CPU caches. So if we have really low selectivity, meaning most tuples will not match during the join, then checking the Bloom filter is way faster than going and doing a probe in the hash table and possibly a linear scan to find the tuple we're looking for. We actually support this in Peloton, and it's also implemented in VectorWise; this was one of the big benefits VectorWise had over other systems. We've done experiments where this gives roughly a 2x performance improvement if you have really selective joins. If most of your tuples are not going to match, then doing this is faster; if most of your tuples will match, then it's essentially just wasted work, because you're going to find the match in the hash table anyway.
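To make the sideways information passing concrete, here's a tiny Bloom filter sketch. The two-probes-from-one-hash scheme and the sizing are illustrative assumptions of mine, not how any particular system implements it.

```cpp
#include <cstdint>
#include <vector>

// Tiny Bloom filter sketch: two bit positions derived from one 64-bit
// hash. False positives are possible; false negatives are not. Sized
// small so it stays resident in the CPU caches.
struct BloomFilter {
    std::vector<uint64_t> words;
    explicit BloomFilter(size_t nbits) : words((nbits + 63) / 64) {}

    static uint64_t mix(uint64_t k) {
        k ^= k >> 33;
        return k * 0xFF51AFD7ED558CCDULL;
    }
    void add(uint64_t key) {                 // build side: set both bits
        uint64_t h = mix(key);
        set(h & 0xFFFFFFFFu);                // first probe from low 32 bits
        set(h >> 32);                        // second probe from high 32 bits
    }
    bool may_contain(uint64_t key) const {   // probe side: check both bits
        uint64_t h = mix(key);
        return get(h & 0xFFFFFFFFu) && get(h >> 32);
    }
private:
    void set(uint64_t i)       { words[(i >> 6) % words.size()] |= 1ULL << (i & 63); }
    bool get(uint64_t i) const { return words[(i >> 6) % words.size()] & (1ULL << (i & 63)); }
};

// Probe side usage: consult the filter first, and only touch the hash
// table on a "maybe":
//   if (!bloom.may_contain(t.join_key)) continue;  // definitely no match
```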
All right, so let's get to the evaluation in the remaining time. I'll go through this quickly, but it's from the paper by the Wisconsin guys in 2011. It's a bit dated, but what I like about it is that they break things down into all the different pieces we talked about in this class. They compare no partitioning, shared partitioning, private partitioning, and radix partitioning on a synthetic workload where the outer table has 16 million tuples and the inner table has 256 million tuples, under both uniform and highly skewed distributions. And as I said in the beginning, this is not exactly how a real database system would do a join, because they're not materializing the output tuples; they're not doing that extra copy, which could be a bottleneck. They just do the join, see they have a match, and throw the result away.

In the first experiment they use a uniform data set, which means radix partitioning and the other partitioning approaches should be beneficial, because you divide the tuples up evenly across your threads and each of them can process its share in parallel. And in this case you see that radix partitioning performs the best, and no partitioning performs second best. The reason radix partitioning does better is that if you don't partition your data, then in the build and probe phases you end up with three times more cache misses and 70 times more TLB misses, because every thread is reading random segments when it does its probes. But now with a skewed data set, the no-partitioning approach actually performs the best, because although there will be more cache misses, you don't end up partitioning the data such that a few large partitions hold most of the tuples and only one or a small number of threads can process them. In this case the no-partitioning approach is actually twice as fast as the radix one, and much, much faster than the shared or private partitioning approaches. So I would say in practice, although the paper may suggest that you do want to partition for joins, if you want to handle everything, the no-partitioning approach will actually work the best overall, and this is what most systems actually do.

Yes? So the question is: since in practice it's pretty much no partitioning or radix partitioning you'd want, is it too much engineering to implement both of them and have the optimizer switch between one and the other? I would say probably yes. I don't have graphs for this in the talk here, but there are some situations where Robin Hood hashing might be better than linear probe hashing, or cuckoo hashing might be better than the other ones, or different hash functions could behave totally differently; it's just too much work to implement all of them and then have the optimizer figure out, oh, I actually want to use this one. Most of the time you're better off just picking one hashing scheme and one hash function and implementing them really well. Yes? The question is why private partitioning takes more time here, why this bar is larger than that one. For which part, the partition phase or the probe phase? So there are, what, 12 threads here. It could be the case that, with skew, some partitions have most of the tuples, so you could have two threads trying to process way more data while all your other threads sit idle, because they burned through their partitions and are done, but the other threads still have to finish their oversized partitions. His question is whether it's okay for these papers to ignore the materialization phase, since regardless of what scheme you use, you'd have to do it anyway. I mean, that's true, except that it does affect how much cache is available to you, because you're copying the data out, and that means there's less cache available for the join algorithm itself. His question is whether that could vary across the different schemes: yes, because some of these are more cache friendly than others.

Okay, we're mostly out of time, but I'll finish up here. There's a bunch of stuff I didn't talk about, but these are things you do have to worry about when you actually implement this for real. I didn't talk about whether you want to use partitioning or not; that's a hard question. But then you also need to decide how many partitions you actually want to use, and if you're using radix partitioning, how many passes you want to do; these are usually one. And as we talked about before, the optimizer could select these different values based on its estimations of what these operators will actually do, using its statistics or sampling, but as we saw in the case of the join order benchmark paper, they get this way wrong; they're always underestimating, and that's part of the reason they get bad performance. So yes, these things are important, but figuring them out reliably is really hard.
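Just to pin down what one of those radix passes looks like, here's a sketch of a single histogram-based partitioning pass. The function name and parameters are mine, and I partition on raw key bits for brevity where a real system would hash the key first.

```cpp
#include <cstdint>
#include <vector>

struct Tuple { uint64_t join_key; uint64_t payload; };

// One radix-partitioning pass: fan tuples out into 2^bits partitions
// using `bits` bits of the join key. A counting pass builds a histogram
// so every partition gets one contiguous output region, which keeps the
// writes sequential within each partition.
std::vector<size_t> radix_pass(const std::vector<Tuple> &in,
                               std::vector<Tuple> &out,
                               int bits, int shift) {
    const size_t fanout = size_t(1) << bits;
    const uint64_t mask = fanout - 1;

    std::vector<size_t> hist(fanout, 0);
    for (const Tuple &t : in) hist[(t.join_key >> shift) & mask]++;

    std::vector<size_t> start(fanout, 0);        // prefix sum -> start offsets
    for (size_t p = 1; p < fanout; ++p) start[p] = start[p - 1] + hist[p - 1];

    out.resize(in.size());
    std::vector<size_t> cur = start;
    for (const Tuple &t : in)                    // scatter into partitions
        out[cur[(t.join_key >> shift) & mask]++] = t;
    return start;                                // partition boundaries
}
// A second pass would recurse into each partition with a larger `shift`,
// which is the one-pass-versus-two-pass knob discussed next.
```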
So I'm going to skip ahead. This graph is just showing you that if you do one pass versus two passes on the uniform data set, you can end up with wildly different results. And this last one is showing that the no-partitioning approach is actually less sensitive to hyperthreading, whereas radix partitioning is more sensitive. This is because with radix partitioning, yes, you're going to get better performance because you have fewer cache misses, but that makes you more CPU bound rather than memory bound or limited by cache misses, and that makes you more sensitive to hyperthreading. In the no-partitioning approach, you can have two threads running on a single core: one thread can hit a cache miss while the other thread does some computation and makes more progress. But if you're already minimizing cache misses with partitioning, then hyperthreading doesn't help you as much, and that's essentially what I'm saying here.

Okay, to finish up. I always say there are a bunch of papers out there with complicated ways to do hash joins, and we actually spent a lot of last semester trying them; it turns out the simplest thing works the best. A reasonably fast hash function with open addressing and linear probing does really well. The hopscotch stuff doesn't help, ignore hopscotch; Robin Hood hashing doesn't help; cuckoo hashing doesn't help. Just use open addressing: yes, you'll have more collisions, but it just works. There's a bunch of other stuff we'll cover later on how to vectorize the operations you're doing during the hash join, but we'll get to that when we talk about vectorized execution. The one thing I'll say about this is, again, this is another example where if everything fits in our CPU caches, we can go really, really fast and we can vectorize everything; as soon as you have to go out to memory, that's when things fall apart. There is one technique called software prefetching, which is in a paper of Prashant's from a few weeks back, that does help for some things, but not for hash joins.

Okay, any questions? Yes? The question is why we didn't consider dynamic hashing schemes like extendible hashing or linear hashing for our join algorithms. It's because the cost of maintaining those structures is too expensive. Ideally you pick a hash table size up front and cross your fingers that you won't have to rebuild it; it's not worth the overhead of growing the table dynamically. Extendible hashing and linear hashing are incremental, dynamic hash tables: you start out with nothing, you insert things, and you reshuffle as the table gets bigger and bigger. In these environments it's just faster to have a static hash table size and hope everything fits in there, because that gets you better performance. Yes? The question is what happens if the hash table does not fit in memory. One, we're ignoring that here, but two, in a real system you'd have to spill to disk, so you'd need a mechanism for that, meaning you need something like a buffer pool manager, and in that case you can do the Grace hash join, where you basically partition both sides. So his question is whether sort merge join would be better if the hash table does not fit in memory. The current literature says no: even when you have to go out to disk, hash join is faster. Now, there are some scenarios where, if the output of the query needs to be sorted and the sort key is the same as your join key, then yes, you can do a sort merge join and automatically get your data sorted the way you want, and that would be faster than a hash join; but you have to have a query that looks exactly like that to make it work.
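For the spilling case just mentioned, here's the shape of the Grace hash join idea as a sketch. Everything here is a stand-in: in-memory vectors play the role of the spill files a buffer pool manager would back with disk, and the modulo partitioning stands in for a real hash function.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { uint64_t join_key; uint64_t payload; };

// Grace hash join shape: hash-partition both inputs with the SAME hash
// so matching keys land in the same partition pair, then run an ordinary
// in-memory hash join per pair.
void grace_hash_join(const std::vector<Tuple> &outer,
                     const std::vector<Tuple> &inner, size_t npart) {
    std::vector<std::vector<Tuple>> pout(npart), pin(npart);
    for (const Tuple &t : outer) pout[t.join_key % npart].push_back(t);
    for (const Tuple &t : inner) pin[t.join_key % npart].push_back(t);

    for (size_t p = 0; p < npart; ++p) {
        std::unordered_multimap<uint64_t, uint64_t> ht;  // build on outer
        for (const Tuple &t : pout[p]) ht.emplace(t.join_key, t.payload);
        for (const Tuple &t : pin[p]) {                  // probe with inner
            auto [lo, hi] = ht.equal_range(t.join_key);
            for (auto it = lo; it != hi; ++it) {
                /* emit (it->second, t.payload) */
            }
        }
    }
    // If a partition pair is still too big for memory, recursively
    // re-partition it with a different hash function.
}
```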
Okay, so on Monday we'll do the parallel sort merge join, and then on Wednesday next week we'll have the one-on-one meetings, and you have to put out your code review for project three. Okay, any questions?