Today, we're going to talk about parallel hash joins. Again, I'm here in my home office. It's just me up here with the terrier in attendance, so he'll be asking the questions as we go along today. So why are hash joins important? Well, joins in general are important, because they're one of the most common operators we're going to execute in analytical workloads. So for today, we're going to cover, as background, what it means to do a join algorithm at a high level, and some of the history of the back and forth on the performance trade-offs of a hash join versus a sort-merge join. Then we'll talk about how to do a parallel hash join and the different ways to do its three phases. Then we'll talk about how to actually build a hash table, the hashing scheme and the hashing function, and we'll finish off with a discussion of the evaluation based on some of the results in the paper you guys were assigned to read. So in the intro class last semester, when we talked about join algorithms, we didn't really talk about how to execute them across multiple threads in parallel. We mostly focused on trying to assess how much disk I/O we were going to incur for the different algorithms. But in an in-memory system, we don't have disk, and now we want to maximize the amount of parallelism we can get across our cores and our CPUs, so we need to focus on how we're actually going to use our hardware efficiently to do the join. So a parallel join is just taking two tables, two relations that we want to join together, and doing that join across multiple threads. And we're going to try to minimize the amount of contention and synchronization as much as possible so that we get the best performance. The two main approaches to doing joins in an OLAP system are going to be either hashing or sorting. There's no other way to magically identify whether you have matching tuples, right? It's either one or the other. For this class, that also means we're not going to discuss nested loop joins, because that's the worst-case scenario in an OLAP system: it's just doing a brute-force search of sequential scans over the tables to try to find matching tuples. So in OLAP systems, you're not going to see nested loop joins except in some rare cases when the tables are super small. And for that reason, running those in parallel is basically the same way you would run a scan in parallel, so there's not that much we can talk about there. The other thing I'll say, though, is that in OLTP systems you will see nested loop joins, because those systems are designed for transaction processing workloads and don't need to do hash joins or sort-merge joins; they're not doing large joins between different tables. They're always doing foreign key lookups or small point lookups to go get data that you would then hand out to the application or use to render a webpage. Think of Amazon: when you log in, there could be a nested loop join for taking Andy's account and getting all of Andy's orders. There'll be a foreign key reference from the order's customer ID to the customer table. In that case it's just an index nested loop join, which will be super fast and way more efficient than a hash join, because you don't need to build a hash table to do the join; you just use the existing index.
So for this reason, an index nested loop join at a high level is going to look a lot like a hash join. It's just that the data structure you use to find matching tuples already exists, because the OLTP application already defined the index that you're going to do the join on. All right, so the high-level difference here is that a hash join builds a data structure, like an index, that allows you to find the matching tuples, but when that query finishes, it throws away the hash table, and the next query that comes along will rebuild the hash table again. In an index nested loop join, the index already exists, so you don't need to build it on the fly; you just use it to find the matches for your tuples. But a big difference is that in OLTP settings this index is most likely going to be a tree-based data structure, either a B+tree or a radix tree like we talked about before. And that means you'll get O(log n) lookups when you do the join, whereas for a hash join, using a hash table will give you O(1) lookups on average, which is way faster. That's why nobody builds a B+tree when they do a join, with some rare exceptions like when you have range predicates on the join, but we're focusing on equi-joins. So for a hash join, we're going to use a hash table because that's going to be more efficient. Now, the debate of whether sort-merge join is faster than hash join is one of these classic problems in databases; over the last 50 years it's gone back and forth on which approach is actually better. In the 1970s, the conventional wisdom was that sort-merge join was superior because the amount of memory that was available to those early systems was quite limited, and they had algorithms to do external merge sort, so you could spill to disk and still sort the data. It's unclear whether they knew they could do a hash join that spills to disk or not, but the conventional wisdom was that sort-merge join was better because they already had that external merge sort algorithm. Then the 1980s came along, and there was this movement called database machines, where they identified that hash joins could be superior if you had specialized hardware support to do the hashing or the hash join. So in the 1980s, the conventional wisdom was that hash joins were superior because they had hardware that could do them more efficiently than you could do sorting. We don't really talk about database machines anymore. The idea there is a custom appliance that has special instructions or special hardware specifically for a particular database system. They went out of vogue in the 1980s because Intel and all the other chip manufacturers were putting out new CPUs all the time, and with Moore's law, things got faster and faster because you had more and more transistors. So if you were a database company, by the time you designed and fabricated your database machine with specialized hardware and then actually put it in production and started selling it, Intel or whoever had already put out new CPUs that negated any performance gains you had in your custom hardware.
You don't really see this anymore, although it's kind of coming back in vogue, not so much with specialized hardware like the database machines, but more with FPGAs, GPUs, and other types of hardware accelerators that are still commodity-based but let you design custom kernels for the database on them. Then in the 1990s there was this paper from Goetz Graefe, the same guy who did Volcano and the B+tree latching stuff we talked about before. He came out with a paper that said the algorithms are basically equivalent: for all the different real-world scenarios on the hardware that existed at the time, you wouldn't really see a noticeable performance difference between sorting versus hashing, so they were deemed equivalent. But then in the 2000s, as more and more analytical databases were being created, like Vertica, Greenplum, Aster Data, things like that, it was shown that hash joins were superior on the hardware that was available at the time, and really since then that's been the case. In the 2010s it wasn't so much a debate about whether sort-merge join was better than hash join; the debate was whether you want to do a partitioned hash join or a non-partitioned hash join, which is what we'll talk about today. And in the current decade, unless there's some major breakthrough in the hardware, I don't think sort-merge join will ever come out on top again. I think hashing has been proven to be superior in most cases. Obviously, if your data is already sorted on the join key, then you don't need to do the sort at all, you're just doing the merge phase; that's the best-case scenario, and that certainly will beat the best hash join algorithm, but databases normally don't keep things sorted all the time. All right, so let's talk about what the last decade looked like, how we got to where we are today, and why I assigned the paper you guys were reading. So again, in the early 2000s, at the turn of the century, the research showed that hash join was superior. One of the key papers in this came out from Intel and Oracle in 2009, where they showed that hashing was indeed superior to sort-merge. But they speculated that if we had larger SIMD registers, in particular 512-bit SIMD registers like Intel's AVX-512, then sort-merge join would actually be faster. Now, as far as I know, there hasn't been a paper since AVX-512 came out in 2017 that has actually tested this theory. It'd be interesting to see whether it's actually true, but again, right now everyone's still just doing hash joins. Then in 2011, researchers at Wisconsin put out a paper that started the debate of whether you want to do a partitioned hash join versus a non-partitioned hash join, and we'll discuss those results later in this lecture. But then the Germans with Hyper came out and said, oh, Intel is actually wrong, Wisconsin was wrong: sort-merge join is already faster even without the larger SIMD registers, and they showed how you could do this in Hyper. But then they came out a year later with another paper that said, ignore what we said the previous year, we were wrong: hashing is actually superior, and here's a better implementation of it. Then in 2013 there was another paper from researchers at ETH in Switzerland with a bunch of ways to make a radix hash join perform more efficiently. We'll discuss the radix hash join and radix partitioning later in this lecture.
They showed how you can do this more efficiently than a non-partitioned join. But then another group of Germans in 2016 came out and said everyone needs to hold up and stop publishing these papers that say, here's my hash join algorithm, look how much better it is. They did an exhaustive evaluation of all the different design decisions you could make in your join algorithm to better understand what the trade-offs are and which one is actually going to be better than another. So again, that's why I assigned you guys this paper: rather than just saying here's this one-off implementation that we did, as Wisconsin did or Hyper did or ETH did, they did a complete sweep over all the possible parameters in these implementations and showed in which scenarios one would be better than another. So now, if we want to design our join algorithm, whether it's a sort-merge join or a hash join, what are some things we should think about to make it run efficiently with good performance? At a high level, the two goals we're going to have are to minimize synchronization and to minimize the cost of accessing memory while we do the join. The first one is obvious, right? We talked about this before. It basically means that we want to avoid having to take latches to protect critical sections of the data structures of our join, so that we don't have any contention or conflicts between threads, so that every thread is basically running at full speed and never waiting for another thread to give up some resource. I'll say this too: this doesn't necessarily mean we need to make our algorithm latch-free, and there are latch-free techniques to do sort-merge joins and hash joins. It just means we need to be smarter about how we're taking our latches. For the second one, the idea is that anytime a worker thread that's computing a join has to touch data, whether to do a check or to write something, we want to make sure that that data is local to that core, either in the same NUMA region or, ideally, already in the CPU cache. We want to minimize the number of cache misses and minimize the cross-socket interconnect traffic. The way we can improve our cache behavior is two-fold. The things that are going to cause us to have cache misses are that we need to account for the size of our caches and our TLB, the translation lookaside buffer, so that we're not trying to access a large number of pages at the same time. Think about it this way: say I need to touch 10 tuples, and those 10 tuples are on 10 different pages, but my TLB can only hold five entries mapping virtual memory addresses to physical memory addresses. Then, in order to execute my operations on those 10 things, I'm going to have to start evicting things from my TLB, and therefore I'll have a cache miss on the data that I want plus a TLB miss on top of it, which makes my access roughly twice as slow. So ideally I want to bring in a small amount of data, the minimal amount of data I need to do whatever operations I need to do, and complete all those operations before I move on to the next thing. That's called locality: temporal and spatial locality.
So I'm just making sure that if I access a thing multiple times, I do so within the same short time window so it's still in the cache, and if I'm accessing multiple things, I want to make sure that they're close together, ideally in the same cache line, so that I'm not paying the penalty of multiple cache misses and polluting my TLB. The way to achieve this is to be aware of the kinds of accesses we're going to do. The two types of accesses are either sequential scans or random lookups. With sequential access, where I'm just reading a sequence of bytes in order, I want to make sure that the data I need together fits into as few cache lines as possible, so it's one cache miss and one memory stall to go fetch it and bring it into the CPU. And for everything I bring into my CPU caches, I want to execute as many operations as I can on it. If I have to do random lookups, I want to make sure that those random lookups are, again, clustered together within the same cache line, so that I reduce the number of trips I have to make to memory. This is the classic trade-off between the number of instructions or cycles we're going to incur versus the amount of memory traffic. We'll see this in a second, but the way we can achieve all of this is through careful partitioning and being aware of what data exists in what location, so that our threads are always accessing data that's close to them. Okay, so now the parallel hash join. Again, the hash join is super important for us because it's going to be the most common join algorithm implemented in modern systems, and a lot of OLAP queries are going to contain joins, so we want this to run as efficiently as possible. In the paper you guys read, though, you saw that the join portion of one particular TPC-H query actually was not the bulk of the execution time. I've actually seen numbers that go both ways. I've seen numbers from Impala that show joins are something like 45% of the execution time, whereas in the test-bed system from the Saarland group in Germany that you guys read about, it was maybe 10 to 15% of the total time. So how much time the system spends doing hash joins can vary from system to system, but there are certainly going to be a lot of joins, and these joins are almost always going to be executed as hash joins, so we want to run them as fast as possible. At a high level, the goal of the parallel hash join algorithm is that we want all our cores to be busy at all times: nobody should be stalled waiting to get data, nobody should be stalled waiting for another thread to complete some operation before they can proceed. Ideally we want everyone running full blast all the time, and of course this is easier said than done. So a hash join is comprised of three phases: partition, build, and probe. We're going to go through each of these one by one, and this also serves as an outline for where we're going in this lecture. The first phase, the partition phase, is entirely optional, and the idea is that we want to divide the tables we're joining into smaller chunks based on the hash of the join key, so that in the subsequent phases threads can operate on just the data in their partition and don't need to look at data in other partitions.
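To make the three phases concrete, here's a minimal single-threaded sketch of a hash join in C++. The table layouts, the `hash_join` function name, and the use of `std::unordered_multimap` as the hash table are placeholders of mine, not what any of the systems in the paper actually do, and the optional partition phase is omitted:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical row layouts: R is the build (outer) side, S is the probe (inner) side.
struct RTuple { int64_t join_key; std::string payload; };
struct STuple { int64_t join_key; std::string payload; };
struct OutTuple { std::string r_payload, s_payload; };

// Minimal hash join sketch: build a hash table on R keyed by the join key,
// then probe it with S and materialize the matching pairs into an output buffer.
std::vector<OutTuple> hash_join(const std::vector<RTuple>& r,
                                const std::vector<STuple>& s) {
  // Build phase: one entry per R tuple (multimap allows duplicate join keys).
  std::unordered_multimap<int64_t, const RTuple*> table;
  table.reserve(r.size());
  for (const auto& rt : r) {
    table.emplace(rt.join_key, &rt);
  }

  // Probe phase: look up each S tuple's key and emit combined output tuples.
  std::vector<OutTuple> out;
  for (const auto& st : s) {
    auto [lo, hi] = table.equal_range(st.join_key);
    for (auto it = lo; it != hi; ++it) {
      out.push_back({it->second->payload, st.payload});
    }
  }
  return out;
}
```

A real system would parallelize both loops and use a purpose-built hash table, which is what the rest of the lecture is about.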
If you remember from the intro class, we talked about doing this partitioning and spilling buckets to disk; this is also sometimes called the Grace hash join. The Grace hash join just means, at a high level, that you're doing this partitioning phase, but how you do the partitioning can vary from one implementation to the next. Then, whether or not we do the partitioning, in the next phase we build the hash table. This is where we scan through the outer table R and build a hash table on the fly on the join key that our query is asking for. Then in the final phase, the probe, we do a sequential scan on the inner table S, and for each tuple we take the join key, hash it, probe into the hash table, and see whether we have a match; if so, we combine the tuples and produce them as the output of the operator, which is then fed up into the query plan. The important thing to point out, though, is that in this probe phase, in a real system, if you have a match on the probe you actually need to materialize this combined tuple and copy it into an output buffer for the operator. A lot of the join papers that I showed at the beginning don't actually do this last step, either because they wanted their numbers to look really good or because they just didn't think about it, but in a real system, and in the Saarland paper you were assigned, they do this last step, which is important because it can affect the performance of the rest of the system. We'll talk about the implications of this with early materialization versus late materialization in a few more slides. All right, so let's go through each of these phases separately and talk about the different ways to implement them. Again, in the partition phase we take both relations, the inner one and the outer one, scan through them, look at the join key, hash it, and then assign each tuple to some output buffer, the partition buffer. The whole idea is that although we're paying an extra cost of scanning through the data and copying it once, if we're intelligent about how we do this, we can write the data out in such a way that when we do the build and probe, we reduce the number of cycles we have to spend on those phases, because we'll minimize the number of cache misses and make sure that the threads are operating on data that's local to them. So again, the idea is that we pay this upfront cost to make other stuff go faster later on. Sometimes in the literature, as I said, this is called the Grace hash join, sometimes the hybrid hash join, sometimes the radix hash join; whenever you see these qualifiers in front of "hash join," it just means they're doing some kind of partitioning step up front. As I said on the last slide, we need to be aware of what we're actually putting into our buffers, because that's going to greatly affect performance. It turns out that what you actually need to put in your buffers depends on the storage model, which we talked about before.
If it's a row store, usually you put the entire tuple in the output buffer, because even though you maybe only need a subset of the attributes to do the join or execute the rest of the query, it's easier to just copy that particular chunk of memory and write it into your output buffer or your partition buffer. In a column store, oftentimes what you do is store only the keys you need to do the join, plus an offset to where to find the rest of the tuple if you need to stitch it back together. The reason you want to do this is that you're minimizing the amount of data you're actually copying into these partitions, and you don't have to worry about chopping the tuple up as you would in a row store, because it's already divided up for you as a column store. So the more efficient approach, again, is to store just the bare minimum information you need to compute the join, and then if another part of the query up above in the plan needs additional columns or attributes, you use the offset to go find that information. For partitioning there are two approaches: non-blocking partitioning and blocking partitioning, also called radix partitioning. With the non-blocking approach, the idea is that you have a set of threads that go through and partition the data, scanning the outer and inner tables and producing these partitions, and as this output is being generated you can have another set of threads that read that data, start the next phase, and start populating the hash table. You can do this because you're not worried about any false negatives: as data is partitioned, you can immediately start populating the hash table. It depends on whether you're doing one pass or two passes, but in general for the non-blocking one people do one pass. In the second approach, the radix partitioning, the way it works is that all the threads scan through the tables once and produce the partitions, but because you don't know exactly how far each thread has gotten, and everything is being broken up into buckets, you can't be guaranteed that you have all the data you need for a given partition to start filling the hash table. So all the threads do this partitioning at the same time, and only when that completes do you switch over and do the build phase. Let's go through each of these one by one. For non-blocking partitioning there are actually two ways you can implement it. Again, we're going to scan the relation once and build the partitions on the fly. With shared partitions, all the threads write into the same memory locations, the same buckets, at the same time, and that means you have to use a latch to synchronize the buckets and make sure one thread doesn't incorrectly overwrite something another thread wrote. In the second approach, you have private partitions, where each thread has its own set of buckets to populate. There's no other thread writing to those buckets, so you don't need a latch to protect anything. Then, once this first pass is done with the private partitions, you go through with another set of threads and populate the global partition buckets like you had in the first approach.
So again, classic computer science here: there's no free lunch. On one hand, in the shared case we're using latches to protect the data structure, but we only have to take one pass over the data. With private partitions it's latch-free, but that means we have to take a second pass through the data to consolidate all the information that's spread out across the different threads. So let's go through each of these. Here's the shared partition approach. We have our data table with three columns, and the first thing we're going to do is divide the data up into different chunks or morsels, like we did before. These are just ranges of data; we don't actually know what's in them yet, we just say the first 100 tuples go to this thread, the next 100 tuples go to this other thread. Now let's say we want to do a join on column B. So we'll take the value of B in every single tuple and hash it with the same hash function we're going to use in the build and probe phases. We have to do this because if we hash a key in one phase, hashing that same key later on has to land us in the same location, in our hash table or in our bucket, so we can find the information we're looking for. The number of partitions is specified ahead of time, so we just mod the hash value by the number of partitions, and that tells us which partition chain we're going to write into. So every thread, for every single value, hashes it and mods it, and that tells it which partition to write into. And again, this is a global set of buckets, so every thread can be writing into any bucket at any time, and we have to use a latch to protect the position where we want to do an insert; once we acquire the latch, we can insert the new value there. The other approach, again, is private partitions: same setup, we divide the data up among the threads, and they scan through, hash each key, and mod it by the number of partitions we want to have. But now each thread has its own group of buckets, like we saw in the previous slide. So when I write into these buckets at each thread, I'm the only thread writing in there, I don't need to take any latches, so this goes really, really fast. Once all the threads are done, I then have to combine everything together to create that global partition space that I had in the last slide. To do this, I can have each thread pick a separate partition group across the different threads and be responsible for populating it. In this case I don't need to acquire latches when I do this consolidation, but I still have to take a second pass. So thread two will do partition two, thread three will do partition three, and so forth.
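Here's a minimal sketch of this private-partition, two-pass approach in C++. The thread counts, the `Row` struct, and the consolidation strategy are placeholders of mine; a real system would write into pre-sized buffers rather than `std::vector`s and would also parallelize the consolidation pass:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct Row { int64_t key; int64_t rowid; };   // hypothetical partition-buffer entry

// Phase 1: each thread hashes its chunk into its own private buckets (no latches).
// Phase 2: the private buckets are concatenated per partition into the global
//          partitions (still no latches, but it is a second pass over the data).
std::vector<std::vector<Row>> partition_private(const std::vector<Row>& input,
                                                size_t num_threads,
                                                size_t num_parts) {
  // local[t][p] = thread t's private bucket for partition p
  std::vector<std::vector<std::vector<Row>>> local(
      num_threads, std::vector<std::vector<Row>>(num_parts));

  auto hash = [](int64_t k) { return static_cast<uint64_t>(k) * 0x9E3779B97F4A7C15ULL; };

  std::vector<std::thread> workers;
  size_t chunk = (input.size() + num_threads - 1) / num_threads;
  for (size_t t = 0; t < num_threads; t++) {
    workers.emplace_back([&, t] {
      size_t begin = t * chunk;
      size_t end = std::min(input.size(), begin + chunk);
      for (size_t i = begin; i < end; i++) {
        local[t][hash(input[i].key) % num_parts].push_back(input[i]);
      }
    });
  }
  for (auto& w : workers) w.join();

  // Consolidation pass: each partition is owned by one worker, so no synchronization.
  // (Done sequentially here for brevity.)
  std::vector<std::vector<Row>> global(num_parts);
  for (size_t p = 0; p < num_parts; p++) {
    for (size_t t = 0; t < num_threads; t++) {
      global[p].insert(global[p].end(), local[t][p].begin(), local[t][p].end());
    }
  }
  return global;
}
```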
This is also a good example of where materialization issues are important, because I'm copying the data twice: once out of the data table into my private partition, and then out of the private partition into the combined partition. If I'm a row store and I'm copying the entire tuple and my tuple is very large, then this copy is expensive. If it's a column store and I'm only copying the minimum amount of data that I need to do the join, plus the offset, that can be much smaller than the full tuple, and therefore this copy puts less pressure on the CPU caches and memory and potentially takes fewer instructions. Okay. So that was the non-blocking partitioning scheme. Let's talk now about the radix partitioning approach. The idea here is that we're doing the same kind of partitioning we saw before, but we're doing it in such a way that the threads can write to separate locations in memory without having to take latches; the trade-off is that we have to wait until everyone finishes before we can move on to the next phase, because until then we don't know what's still missing. That'll make more sense when I show the diagram. With radix partitioning, we're going to take multiple passes over the relation. In the first pass we scan through and compute a histogram that tells us the number of tuples per radix value, i.e., per partition. Then we use that histogram to compute what is called a prefix sum, which tells us at what offset in the partition space each thread should write its tuples. Then we scan again, do the hashing, and based on the write offsets we derived from the prefix sum, we write into the partition space. Again, we'll go through each of these one by one. The first thing to point out is that the radix is just a digit or a byte of the total key; it's the same radix term we saw when we talked about radix trees before. Instead of using the entire key, I'm going to hash or partition based on a chunk of it. The prefix sum is just a way to determine the starting location where each thread writes into the global partition buffer. So let's first understand what a prefix sum is. Again, the radix is just the value of some digit within the key. Say my keys are 89, 12, 23, 08, 41, and 64. The radix at the first position for each of these keys would be nine, two, three, eight, one, four. And then the radix at the next position is eight, one, two, and so forth. That's all the radix means for radix partitioning: we look at one digit within the key at a time and partition based on that, and if we want to do additional passes, we can look at subsequent digits. On modern CPUs you can compute this radix pretty efficiently with a multiplication instruction, so this is not an expensive operation. Now, with the radixes of the keys, we can compute what's called the prefix sum. That's just a running summation, a running count of the numbers, where each position of the prefix sum is determined by the values that came before it. Say I have the values one, two, three, four, five, six. For my prefix sum, the first position is just one, because nothing came before it. For the next position, I take the value of the previous prefix sum and add the value at this position: one plus two is three. Same thing for the remaining positions, which gives six, ten, fifteen, and twenty-one, and so forth.
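As a quick sketch (my own toy example, not code from the paper), here's the inclusive prefix sum in C++ that matches those running totals:

```cpp
#include <cstdint>
#include <vector>

// Inclusive prefix sum: out[i] = in[0] + in[1] + ... + in[i].
// For in = {1, 2, 3, 4, 5, 6} this returns {1, 3, 6, 10, 15, 21},
// matching the running totals above.
std::vector<uint64_t> prefix_sum(const std::vector<uint64_t>& in) {
  std::vector<uint64_t> out(in.size());
  uint64_t running = 0;
  for (size_t i = 0; i < in.size(); i++) {
    running += in[i];
    out[i] = running;
  }
  return out;
}
```

In the partitioning step itself you typically want the exclusive version, the sum of everything before position i, since that gives the first free offset where each partition starts writing.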
The reason we're doing this is that we can now use these prefix sums to tell us what offset we want to write into. It's like saying: if there's one tuple before me, then I should start writing my tuples after position one; and the next guy knows he starts at offset three. These are essentially going to be used as offsets within the partition array. So let's look at the radix partitioning. Again, the first step is to inspect the input keys and create our histograms. Say the keys we're dealing with here have already been hashed, and now we want to look at the digits of those hash values. What we're going to do is look at the first digit, and each thread scans through its assigned range and computes a histogram: the number of values that fall into each partition. In this case, for CPU zero, partition zero has two entries, and partition one has two entries. For CPU one down here, partition zero has one entry, because there's one zero there, and partition one has three entries. We compute this histogram just by scanning through the data efficiently. Then what we do is compute the offsets in the partition array that the threads will write into. We have to block and wait until all the threads complete their histograms, and then we compute the prefix sum over the histograms, which tells us where every thread starts writing into this partition array. So it'd be like this: partition zero for CPU zero starts writing at offset zero, and partition zero for CPU one starts right after it, at offset two, because CPU zero's histogram says it will write two entries before it. Then partition one for CPU zero starts at offset three, and so forth for the others. So what does this do for us? It gives us one giant array, a giant buffer, that the threads can start writing into without needing to coordinate or acquire latches to protect any memory location, because the prefix sum has already been computed; we already know that nobody else will be writing to our locations, so we can just write into them. Each thread maintains an internal counter that says, for partition zero, I've already inserted one thing, two things, three things; so using my starting offset, I can quickly decide at what offset to write my next piece of data. All right, so with this, we've computed the partition layout and the starting locations for our writes. Now we go back, do our scan again, do the partitioning, and copy the values of the keys we've hashed into the partition array. And again, we can do this without acquiring latches. At a high level, the result is that we now have partition zero and partition one. At this point, we can hand that off to the build phase and start computing the hash table. Or, if we wanted to, we could divide this up even further and do another round of partitioning, so that the chunk of data in each partition fits into a cache line or a small number of cache lines.
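Putting the two passes together, here's a minimal single-level radix partitioning sketch in C++. The radix extraction, the names, and the single-threaded layout are simplifications of mine; the real algorithm gives each thread its own histogram row and prefix-sums across threads as described above:

```cpp
#include <cstdint>
#include <vector>

// One level of radix partitioning over already-hashed keys.
// Pass 1: histogram of how many keys land in each partition.
// Prefix sum: starting write offset for each partition.
// Pass 2: scatter each key to its partition's next free slot.
std::vector<uint64_t> radix_partition(const std::vector<uint64_t>& hashes,
                                      unsigned radix_bits) {
  const size_t num_parts = 1ull << radix_bits;
  const uint64_t mask = num_parts - 1;          // take the low radix_bits as the radix

  // Pass 1: histogram.
  std::vector<size_t> hist(num_parts, 0);
  for (uint64_t h : hashes) hist[h & mask]++;

  // Exclusive prefix sum -> first write offset of each partition.
  std::vector<size_t> offset(num_parts, 0);
  size_t running = 0;
  for (size_t p = 0; p < num_parts; p++) {
    offset[p] = running;
    running += hist[p];
  }

  // Pass 2: scatter. offset[p] doubles as the "next free slot" counter.
  std::vector<uint64_t> out(hashes.size());
  for (uint64_t h : hashes) {
    out[offset[h & mask]++] = h;
  }
  return out;
}
```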
All right, and to do that second round, we just jump to the next radix value within the key and do the partitioning all over again. But I would say that in practice, most systems don't do radix partitioning, even though it's been shown to be superior, and the few systems I know of that do it, mostly academic prototypes, almost never do two passes. It's usually one pass and you're done. All right, so whether or not we've done the partitioning, we now enter the build phase. Again, the idea here is that we're going to scan the outer table, either the original table itself or, if we partitioned, the partitions. For every single tuple, we hash it on the same key we used in the partitioning phase and store it in a hash table. Ideally, we want to design our hash table such that every bucket we're writing into is just a few cache lines in size, because that's going to let us go through things more efficiently. So now we need to discuss what this hash table is actually going to look like. I just said something like, "great, the buckets we write into in the hash table will be a few cache lines," but what does that actually mean? What does our hash table actually look like? To understand a hash table, there are two main design decisions. Typically when people say they have a hash table, it's used colloquially to mean the data structure itself, but in practice it's a combination of two things: the hash function and the hashing scheme. The hash function is the way we take a key from a larger domain, a larger space, and map it to a specific location or slot in our hash table data structure. It's like taking all possible strings and having a hash function that converts a string into some integer, say within one to ten or some smaller range, that tells us what slot we're going to write into. We'll see in a second that there's tension between having a hash function that is fast and having one with a low collision rate: we want to make sure that two distinct keys don't hash to the same location, but we also want to be able to compute that hash very quickly. The second design decision is the hashing scheme, and this says: if you've done the hashing and two keys hash to the same location, meaning you have a collision, how do you actually deal with that? And again, there's a trade-off: you could allocate a ton of memory for the hash table so that every hash key you ever see is guaranteed not to collide with any other key, but that would take a lot of memory. So if I want to support collisions, I want my collision resolution method to be as efficient as possible, so I'm not spending a lot of computation to find or insert a key. Again, we'll cover each of these one by one. I will say, though, that although I'll present a bunch of different approaches to hash tables and hash functions, every database system or company I've ever talked to usually just picks one and has some reason why they picked it, like, oh, we ran some benchmarks and it seemed to work well.
And although in some cases some hashing schemes might be better than others for different workloads or different query types, nobody, as far as I know, actually tries to be adaptive. Everyone picks one hash table or one hash function and tries to make it as efficient as possible. They don't try to automatically adapt the hashing scheme they're using based on what the query or the dataset looks like. Everyone just picks one and runs with it. All right, so for hash functions, again, the idea is that we want to take an arbitrary key of any possible length from any possible domain and map it to a smaller domain that we can use to find a location in our hash table. This means that although there are hash functions out there that provide security guarantees, like cryptographic hash functions, we don't care about any of that. All we care about is having an efficient one-way hash with a low collision rate. We'll see some examples on the next slide. The best way to understand this trade-off between collisions and performance is to think of the two extremes. The fastest hash function you could ever have is one where, no matter what key you give it, it always returns the value one. There's no computation; it just writes one as the output of the function and you're done. But of course this means the collision rate is terrible, because no matter what key I give it, it always comes back with the value one. So it's super efficient, but my collision rate is awful. On the other end of the spectrum, there's what is called perfect hashing: a magic hash function that, no matter what key you give it, is guaranteed to always produce a unique hash value. These exist in the literature, but in practice nobody implements them for this, because typically you need to know all the keys ahead of time, and you'd build the mapping function using a hash table itself. So you would need a hash table in order to build a hash table, which sort of defeats the purpose of having an efficient hash table. So those are the two extremes; we want something in between. This is an active development area for researchers, companies, startups, and hackers in general. There's a benchmark suite created by the MurmurHash guy called SMHasher, which has measurements to determine the performance and collision rate of a bunch of hash functions. If you're interested in this kind of stuff, you can go check out his blog or the GitHub page for it. But I want to focus quickly on five high-level hash functions that you see often in the wild. CRC was originally developed in the 1970s to do error detection in networking. On modern CPUs there are now built-in instructions to do CRC very efficiently. In the benchmark I'll show on the next slide, I don't think I'm using those instructions; actually, I'm pretty sure I'm not. So if you just use the algorithm, you get a software-based CRC, and it's going to be really slow; if you use the instructions, it'll go faster. MurmurHash was from this random dude on the internet who said, hey, here's this hash function I developed that's general-purpose and fast. A bunch of people liked it, picked it up, and expanded upon it or modified it.
In particular, Google's CityHash was based on MurmurHash, but they designed it for their environment, where they wanted efficient hashing for short keys. Then there's xxHash from Facebook, which is considered the state of the art now. This is from the same guy at Facebook who did the Zstandard compression library. Again, it's a mathematical variation of the methods used in MurmurHash. And FarmHash is a newer version of CityHash designed to have better collision rates for larger keys. There was also HighwayHash from Google in 2016, but that has guarantees against cryptographic analysis, so you can't leak any data through the hash function. Again, we don't care about that in our hash table, because we're using this hash function internally; it's not exposed to the outside world. So this is just a benchmark that I run every year; it's not SMHasher, it's a different type of workload that just tries to measure the throughput you can get from these different hash functions. Again, for CRC-64, I think this is all software-based; I should rewrite it to use the hardware instructions. The main takeaway is that a common thing you'll see is key sizes of 32 or 64 bits, or maybe 32 or 64 bytes; you don't see super-large keys. And this shows that if your key sizes are aligned to cache lines, these algorithms perform more efficiently; the sawtooth pattern is where you cross into the next cache line and pay that penalty. So the main takeaway is that xxHash3 is considered the fastest, followed by CityHash and FarmHash. I think this is an older version of MurmurHash3, though; there might be a newer version that performs better. In our own system we use xxHash3 because these results show it's much faster. Okay, so now let's talk about hashing schemes. Again, the hashing scheme is how we handle collisions: when our hash function says this key should hash to this bucket or this slot in our data structure and something's already there, how do we deal with it? I'll also say that, for the most part, except for chained hashing, these are static hashing schemes, meaning I size the hash table before I actually do anything with it. I have a rough approximation of how many keys I'm going to have to store in my hash table, and if I then get a collision I can't reconcile because there's no free space under the protocol I'm using, I have to resize. The way you typically resize is you double the size, rehash everything, and put it back in. With chained hashing you can just extend the hash chain, which I'll show on the next slide, indefinitely, but this can degenerate into a sequential scan on lookups if everything hashes to the same chain. The chained hash table is what people usually think about when they think of a hash table; I believe this is what you get in Java when you ask for a HashMap. The idea is that there's an array of pointers to chains: I take my key, hash it, that tells me what offset to look at in the array, and then I can jump to that location. And the way we resolve collisions is that if two keys hash to the same chain, we just insert them into the same bucket.
And if my bucket gets full, then I just extend it: allocate a new bucket and add it to my linked list. So again, the chain can grow indefinitely, and if everything hashes to the same location, then any lookup becomes a sequential scan. Let's look at a quick example. Ignore this part of the slide for now, but here are our buckets, and within each bucket we have two slots. I take my first key, hash it, and it maps to this bucket here. The bucket is empty, so I can insert my key into the first slot. The thing to point out is that we want to store both the hash of the key and the original key. The reason we store the hash is that when we do a comparison to see whether A exists, we can compare hashes very quickly as integers; but if the key is a varchar or something larger, we also need the original key so we can rule out false positives. So we keep both: the hash for fast comparisons, the key for verification. So A goes here, written into that first slot; B goes here; then C hashes to the same bucket as A, this slot is empty, so we can write it there. But now D wants to write into this bucket, and the two slots are full. So all we do is allocate a new bucket, add a pointer from the first bucket to the second bucket in our linked list, and now we can add D. Same thing for E: this bucket is full, we follow the chain, go here, and write E; and then F gets written down here. Now if I want to do a lookup, say find key D: D hashes to this bucket, and I do a sequential scan within the bucket looking at every key; typically you compare on the hash first because that's more efficient. If I don't find a match in the bucket I'm at and there's a pointer to a next bucket, then I jump over there and continue my sequential scan. One interesting optimization you can do with the slot array: the pointers are stored as 64 bits because that's what the Intel hardware gives you, but Intel doesn't actually use all 64 bits for an address, it only uses 48 bits. So in the case of Hyper, what they do is store, in each of these pointers, a small bitmap, sorry, a Bloom filter, that summarizes the keys that exist in that chain. Now when I hash a key and it says I should go to this bucket, I look at the pointer that tells me where to go, and there's a Bloom filter packed inside it that I can check to see whether the key I'm looking for could exist in that chain. The Bloom filter won't give you false negatives, but it could give you false positives: it could tell you the key exists when it actually doesn't, and then you have to go scan and find out it's not there. But if the filter says your key doesn't exist, it's guaranteed not to be in the chain, so you don't even need to follow the chain; you just look at the Bloom filter. I think for this reason the Hyper guys say the chained hash table is superior to all other hash tables if you do this one trick.
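Setting that pointer-tagging Bloom filter trick aside, here's a minimal sketch of the basic chained table just described. The two-slot buckets, the `ChainedTable` name, and the use of `std::hash` are placeholders of mine, and cleanup of the allocated buckets is omitted for brevity:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Chained hash table sketch: a directory of bucket chains, each bucket holding
// a small fixed number of (hash, key) slots, as in the lecture example.
struct ChainedTable {
  static constexpr int kSlotsPerBucket = 2;

  struct Bucket {
    uint64_t hashes[kSlotsPerBucket];
    int64_t  keys[kSlotsPerBucket];
    int      count = 0;
    Bucket*  next = nullptr;
  };

  std::vector<Bucket*> dir;                        // directory of chain heads
  explicit ChainedTable(size_t n) : dir(n, nullptr) {}

  static uint64_t hash(int64_t k) { return std::hash<int64_t>{}(k); }

  void insert(int64_t key) {
    uint64_t h = hash(key);
    Bucket** slot = &dir[h % dir.size()];
    // Walk the chain until we find a bucket with a free slot.
    while (*slot && (*slot)->count == kSlotsPerBucket) slot = &(*slot)->next;
    if (!*slot) *slot = new Bucket();              // extend the chain
    Bucket* b = *slot;
    b->hashes[b->count] = h;
    b->keys[b->count] = key;
    b->count++;
  }

  bool contains(int64_t key) const {
    uint64_t h = hash(key);
    for (Bucket* b = dir[h % dir.size()]; b != nullptr; b = b->next) {
      for (int i = 0; i < b->count; i++) {
        // Compare the cheap integer hash first, then verify the real key.
        if (b->hashes[i] == h && b->keys[i] == key) return true;
      }
    }
    return false;
  }
};
```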
The problem with that trick, though, and I mostly agree with the claim otherwise, is that Intel is not guaranteed to only use 48 bits of the pointer forever. At some later point they may say, all right, now we use all 64 bits, and then this trick doesn't work. Maybe then they could just store the Bloom filter explicitly and pad the slot out, sort of like the fat pointers we saw with Judy arrays; you could store that in the slot array as well. So, chained hashing is a pretty common data structure in database systems. For joins, though, most of the time you see a linear probe hash table, because it's so simple and just ends up being more efficient. With a linear probe hash table, think of it as one giant array of slots: I hash my key, mod it by the number of slots, and that tells me where to jump to find the slot I want. When I want to do an insert, if nothing exists in that slot, then I just write into it. If something does exist, then I scan down from that slot location, looking at the slots below it, until I find one that is empty, and then I insert my value there. This means that when I want to do a lookup to find my key, I jump to that same location and start scanning to see whether the key I'm looking for is actually there, because it may not be in the slot it originally hashed to; it may be below, so I keep scanning until I either find it or hit an empty slot, which tells me the key doesn't exist. Let's look at an example with the same keys as before. First we hash A, mod it by the number of slots, and that tells us to write into here; nothing was already there, so we go ahead and do that. Now we hash B; we can write to its slot, nothing's there, so we're done. Now we insert C: C hashes to the slot where A already is, so I can't store it there; I store it at the next empty slot I can find, which is the one right below. So now if I want to do a lookup on C, I hash to that first slot, see the key A, that's not the one I'm looking for, so I scan down until I find either an empty slot or the key I want; in this case I find C in the next location. Now with D: D hashes to the slot where C is, and again we can't write it there, so we write it to the next one. E wants to go where A is; we can't do that, can't use the next one or the one after that, and we keep going until we find our empty slot. F likewise wants to write to an occupied slot, so it goes to the next one. The thing you've got to keep track of is that when I do a lookup or an insert, I need to remember my starting location in the slot array, because if I loop all the way around looking for the thing I want or for an empty slot and come back to my starting point, then I know the hash table is full and there's no more free space to insert anything, or the key I'm looking for isn't there. In that case I have to build a new hash table by doubling the size and rehashing everything.
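A minimal sketch of linear probing in C++ (the names and the fixed-capacity, no-resize simplification are mine; a real implementation would handle the table-full case by growing and rehashing as just described):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Open-addressing hash table with linear probing: one flat array of slots,
// collisions spill into the next free slot below (wrapping around at the end).
struct LinearProbeTable {
  struct Slot { uint64_t hash = 0; int64_t key = 0; bool used = false; };
  std::vector<Slot> slots;

  explicit LinearProbeTable(size_t n) : slots(n) {}

  static uint64_t hash(int64_t k) { return std::hash<int64_t>{}(k); }

  // Returns false if the table is full (caller would resize and rehash).
  bool insert(int64_t key) {
    uint64_t h = hash(key);
    size_t start = h % slots.size();
    for (size_t i = 0; i < slots.size(); i++) {           // stop after one full loop
      size_t pos = (start + i) % slots.size();
      if (!slots[pos].used) {
        slots[pos] = {h, key, true};
        return true;
      }
    }
    return false;                                          // wrapped back to start: full
  }

  bool contains(int64_t key) const {
    uint64_t h = hash(key);
    size_t start = h % slots.size();
    for (size_t i = 0; i < slots.size(); i++) {
      size_t pos = (start + i) % slots.size();
      if (!slots[pos].used) return false;                  // empty slot: key not present
      if (slots[pos].hash == h && slots[pos].key == key) return true;
    }
    return false;                                          // scanned the whole table
  }
};
```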
So again, picking the right hash table size can be tricky, and we'll see in a few more lectures how this can be problematic if you don't estimate things correctly the first time. The other thing about this data structure, and we'll see some variations in the next slides, is that the collision scheme penalizes inserts and lookups in the same way. In the example here, I wanted to insert E and had to jump down four slots to store it; if I now want to do a lookup on E, I pay that same penalty, because I hash to the same starting slot and have to scan down until I find an empty slot or the key I'm looking for. So there's nothing in this collision scheme that prefers making inserts faster or making lookups faster. In a hash join, depending on the ratio of the size of the inner table to the outer table, you might want to choose a different collision scheme: one that makes inserts faster if the outer table is bigger, or lookups faster if the inner table is bigger. But again, as far as I know, most systems don't switch it up; they always pick one scheme, and oftentimes it's this one, the simplest one, which the research shows performs the best. Okay, so we've already talked about this a bit: in order to reduce the number of wasteful comparisons, we want to avoid collisions on hash keys, and of course that partly depends on how good the collision rate of our hash function is. One way to avoid collisions almost entirely is to have a huge hash table with, say, 2x the number of slots we expect to need, so that nearly every time you hash a key and put it in the data structure, nothing is already in the slot you're writing to. But then you pay the penalty of allocating more memory for your hash table. So how to balance this correctly is not trivial. Let's look at some variations of linear probe hashing that do try to shift the penalty between inserts and lookups by changing the collision protocol we use. One approach is called Robin Hood hashing, and the idea is that when we do an insert and find that the slot we want is already occupied, we check whether the key sitting in that slot is closer to its home slot, the slot it originally hashed to, than our key currently is to its own home. If it is, we steal its slot and force that evicted key to go find a new location. It's called Robin Hood hashing because we're stealing from the rich keys and giving the slots to the poor keys, where the wealth of a key is based on how few hops it is away from its optimal position in the table. This is an old technique; it was originally published in a tech report and a paper in 1985. It didn't really get much attention until recent years; in particular, it showed up on Hacker News a couple of years ago, and now there are a few systems actually using it. Although, as we'll see, the research shows that it's actually a bad idea here, because the penalty of copying and moving keys around is expensive. All right, so let's insert the same keys as before; again, Robin Hood hashing is a variation of linear probe hashing.
So it's just a giant slot array. We start off by hashing A, which goes here, just like before. But now, in addition to storing the hash and the original key, we're also going to store the number of jumps this key is from its optimal position. The optimal position is the first location we jumped to when we did the initial hash: if you were looking for this key and used the hash function, where would you find it immediately, assuming there weren't any collisions? Same thing with B: B hashes up here, it's in its optimal position, so the number of hops it is away from where it should be is zero. Now we hash C. C lands in the same position where A is. So we need to decide whether we should evict whatever's in the slot we want to go into, or leave it alone and jump down to the next slot, just like in linear probing. At this first stop, A is zero hops away from its optimal position and C is zero hops from its optimal position, so C will leave A alone, because they're considered equivalent. C then jumps down here, this slot is empty, so we store C there, recording that C is one hop away from where it should have been. D comes along and wants to go where C is. C has a hop distance of one, and D would have a distance of zero if it could go here, so C is considered poorer than D; D leaves C alone, jumps down, and gets written there. Now we insert E. E wants to go where A is, but it can't: A at zero hops equals E at zero hops, so we leave it alone. Then we come down here, and at this point E is one hop away from where it should be, but so is C, so they're also equivalent and we leave C alone. But then we get here, and now E is two hops away from where it should be while D is only one hop away. So D is considered richer than E, and E is allowed to steal this slot and insert its key; the thread then continues on and figures out the next location to put D, which is the next empty slot, and now D is two hops away from where it should be. Then we get to F: F should go where D is, but it can't, so it goes down here. So again, the idea is that by stealing slots and reinserting keys that are already in the table, we're amortizing the search cost of any given key by keeping every key closer to where it should have been originally. Going back to E: under regular linear probing, E would have ended up down here and D would have been left alone, so E would have been three hops away from its original location and D one hop away; but under Robin Hood, they both end up two hops away. So on average the lookups are roughly equivalent across all keys, and you don't have one key taking much longer than the others. This seems like a decent idea, or at least kind of interesting, but the research literature shows that it's bad, because stealing keys is terrible for cache performance, and it makes the inserts so much more expensive that any benefit you get from making the lookups faster is completely negated.
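Here's a minimal sketch of the Robin Hood insert rule in C++ (names are mine, the table is fixed-size, and deletion and resizing are omitted): the probe carries a displacement counter, and whenever the resident entry's displacement is smaller than the incoming key's, the two swap and the probe continues with the evicted entry.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Robin Hood hashing sketch: linear probing, but an incoming key may evict a
// resident key that is closer to its home slot ("richer") and take its place.
struct RobinHoodTable {
  struct Slot { uint64_t hash = 0; int64_t key = 0; uint32_t dist = 0; bool used = false; };
  std::vector<Slot> slots;

  explicit RobinHoodTable(size_t n) : slots(n) {}

  static uint64_t hash(int64_t k) { return std::hash<int64_t>{}(k); }

  bool insert(int64_t key) {
    Slot incoming{hash(key), key, 0, true};
    size_t pos = incoming.hash % slots.size();
    for (size_t i = 0; i < slots.size(); i++) {
      Slot& s = slots[pos];
      if (!s.used) {                 // empty slot: place whatever we are carrying
        s = incoming;
        return true;
      }
      if (s.dist < incoming.dist) {  // resident is richer: steal its slot,
        std::swap(s, incoming);      // then keep probing to re-place the evicted key
      }
      pos = (pos + 1) % slots.size();
      incoming.dist++;
    }
    return false;                    // table full: caller would resize and rehash
  }
};
```

Note that ties (equal distances) do not trigger a steal, matching the lecture example where C leaves A alone.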
Now, for a hash join, we're building these hash tables by inserting into them in the build phase, doing our probes in the probe phase, and then throwing them away. If your data structure were longer lived, where you build the hash table once and then keep doing lookups over and over again, then this approach might actually be preferable. But for hash joins, it won't be. Another technique is called hopscotch hashing, which came out in 2008 from Maurice Herlihy, who used to be a faculty member here at CMU and is now a professor at Brown. The idea here is that rather than letting a key move to any possible location in the slot array, like in Robin Hood hashing, we're going to bound how far it can go to what's called a neighborhood. A neighborhood is a contiguous range of slots in which a key is allowed to exist, and the key can exist anywhere in that neighborhood. So if you're looking for a key, you scan its neighborhood, and if it's not there, then you know the key doesn't exist; that bounds how far you actually have to scan. The size of the neighborhood is configurable; I forget what the original paper uses, but for this example we'll use a neighborhood size of three. Again, it's just linear probing: we have our slot array, and the neighborhood for this first position is three slots: one, two, three. The neighborhoods are overlapping, so the next one and the next one and so forth each share a portion of the previous neighborhood, all the way down. Now let's start inserting these keys. Just like before, we insert A; it would go here, and its neighborhood covers three slots, so I could insert it anywhere in there, but in practice you just insert it into the first location if it's empty, and then we're done. Then we do an insert for B, same thing: we go to the first location in its neighborhood and we're done. But now we want to insert C. C should go where A is, but it can't because A is occupying that slot, so we need to find the next empty slot in our neighborhood to insert into. We do a sequential scan down until we find an empty slot and insert ourselves; so far, this is the same as linear probing. Now I insert D. D should go where C is; again, I sequentially scan down until I find an empty slot and insert it there. But now I want to insert E. E needs to go in this neighborhood, but as I do my sequential scan, I find that all the slots in my neighborhood are occupied. So I need to go outside my neighborhood and find the first free slot, which happens to be here. And now what I want to do is determine whether I can swap, working backwards from that empty slot, one of the keys sitting in front of it into the empty slot, so that I free up a slot inside the neighborhood I'm trying to get into. The constraint is that whatever key we move has to stay inside its own neighborhood. In the case of D, D's home is here, so its neighborhood covers these slots: one, two, three. The empty slot is still inside D's neighborhood, so it's okay for D to get moved down there, and then we can hash E, put it into its own neighborhood, and we're done. Now I insert F. F should go where D is; its neighborhood is these two slots plus one that wraps around. In this case, I can just scan down, find the free slot, and insert it there. The way this works is that you maintain some metadata in every neighborhood about which keys belong to it. That way you can determine that if every slot in the neighborhood you need is occupied by keys that belong to that neighborhood and none of them can be moved, then you have to grow the hash table, say double its size, and rebuild it. So the idea is that we're bounding how far we have to look: finding a free slot stays local to the neighborhood, and if we can't insert into the neighborhood, or we don't find the key we're looking for in the neighborhood, we don't have to potentially scan the entire table. So it's a variation, I think, of Robin Hood hashing where you're bounding both the insertion search space and the lookup search space.
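As a rough illustration of the neighborhood idea, here is a simplified hopscotch-style table with a neighborhood of three slots, like the example above. It is a sketch under several assumptions: 64-bit keys with 0 reserved as empty, no per-neighborhood bitmap (the real algorithm keeps one; this version just recomputes home positions), and no resizing (insert simply fails when the displacement trick can't move the free slot into the neighborhood).

#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

constexpr size_t H = 3;   // neighborhood size used in the lecture example

struct HopscotchTable {
    struct Slot { uint64_t key = 0; uint64_t val = 0; };   // key == 0 means "empty"
    std::vector<Slot> slots;
    explicit HopscotchTable(size_t n) : slots(n) {}

    size_t home(uint64_t key) const { return std::hash<uint64_t>{}(key) % slots.size(); }

    // Lookup only ever scans the H slots of the key's neighborhood.
    std::optional<uint64_t> probe(uint64_t key) const {
        size_t h = home(key);
        for (size_t i = 0; i < H; i++) {
            const Slot &s = slots[(h + i) % slots.size()];
            if (s.key == key) return s.val;
        }
        return std::nullopt;   // not in the neighborhood => not in the table
    }

    bool insert(uint64_t key, uint64_t val) {
        size_t h = home(key);
        // First, find any free slot by plain linear probing (offset from home).
        size_t free = SIZE_MAX;
        for (size_t i = 0; i < slots.size(); i++) {
            if (slots[(h + i) % slots.size()].key == 0) { free = i; break; }
        }
        if (free == SIZE_MAX) return false;   // table full
        // While the free slot sits outside the neighborhood, move it closer by
        // displacing a key that is still allowed to live in the free slot.
        while (free >= H) {
            bool moved = false;
            for (size_t j = H - 1; j >= 1; j--) {          // slots just before the free slot
                size_t cand = (h + free - j) % slots.size();
                size_t cand_home = home(slots[cand].key);
                size_t free_idx = (h + free) % slots.size();
                size_t d = (free_idx + slots.size() - cand_home) % slots.size();
                if (d < H) {                                // free slot is in candidate's neighborhood
                    slots[free_idx] = slots[cand];
                    slots[cand].key = 0;
                    free -= j;                              // the hole moved j positions closer to home
                    moved = true;
                    break;
                }
            }
            if (!moved) return false;   // would need to grow the table and rebuild
        }
        slots[(h + free) % slots.size()] = {key, val};
        return true;
    }
};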
All right, the last hashing scheme we'll talk about is cuckoo hashing, and this one is a bit different from all the others. With the previous schemes, when you hash a key, you're not guaranteed that the first slot you jump to is exactly where the key will be found: in hopscotch hashing it could be anywhere in the neighborhood, and in linear probe or Robin Hood hashing you have to keep scanning down until you find the key you're looking for or an empty slot. But with cuckoo hashing, when you hash a key and do a lookup, you're guaranteed that either the key doesn't exist or it sits in one of the few slots you jump to directly. The way to achieve this is to maintain multiple hash tables simultaneously, each with its own hash function. Now when I want to do an insert, I hash the key multiple times and see whether I can insert it into a free slot in any of my tables. If those slots are all occupied, then I'm going to steal one of them: take the resident key out, put my key in, and then rehash the evicted key and insert it into another table. The idea is that this can cascade: a key that has already been inserted may end up changing hash tables multiple times as new keys come in.
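Here is a minimal two-table cuckoo hashing sketch, assuming 64-bit keys with 0 reserved as empty. The two hash functions are just arbitrary multiply-and-mix functions made up for the example, and the eviction chain is capped at a fixed number of attempts; a real implementation would rebuild the tables with new hash functions when it hits that limit.

#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

struct CuckooTable {
    struct Slot { uint64_t key = 0; uint64_t val = 0; };   // key == 0 means "empty"
    std::vector<Slot> tables[2];
    explicit CuckooTable(size_t n) { tables[0].resize(n); tables[1].resize(n); }

    // Two cheap, independent-ish hash functions; constants chosen only for the sketch.
    size_t hash(int which, uint64_t key) const {
        uint64_t h = key * (which == 0 ? 0x9E3779B97F4A7C15ULL : 0xC2B2AE3D27D4EB4FULL);
        h ^= h >> 32;
        return h % tables[which].size();
    }

    // Lookup: the key can only be in one of two slots, one per table.
    std::optional<uint64_t> probe(uint64_t key) const {
        for (int t = 0; t < 2; t++) {
            const Slot &s = tables[t][hash(t, key)];
            if (s.key == key) return s.val;
        }
        return std::nullopt;
    }

    bool insert(uint64_t key, uint64_t val) {
        int t = 0;
        // Bound the eviction chain; hitting the limit means the tables should
        // be rebuilt with new hash functions (omitted in this sketch).
        for (int attempt = 0; attempt < 32; attempt++, t = 1 - t) {
            Slot &s = tables[t][hash(t, key)];
            if (s.key == 0) { s.key = key; s.val = val; return true; }
            // Slot occupied: evict the resident key and carry it to the other table.
            std::swap(key, s.key);
            std::swap(val, s.val);
        }
        return false;
    }
};

Notice the trade-off this makes explicit: a probe touches at most two slots, but an insert may trigger a long chain of evictions, which is exactly the insert-versus-lookup asymmetry we talked about earlier.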