Are you good? Yeah. OK, all right. So let's drop some database knowledge. Today we're going to talk about hash joins. You may think it's kind of weird that we're getting near the halfway point in the semester and only now talking about hash joins, given that you just turned in the first assignment, which was a hash join. The difference is that what we're going to talk about today, and the join algorithm we'll talk about Monday next week, is how to do this in a parallel system. The hash join you implemented was the textbook single-threaded, simple hash join. There isn't really anything fancy about it. So for today, what we're going to talk about is how you actually build a high-performance parallel hash join implementation. We'll talk at a high level about what a parallel hash join is, what our design goals are, what we hope to achieve, and then as we go along we'll spend a little time on the different components that we would use in our hash join implementation. We'll talk about different types of hash functions, and we'll talk about different types of hash table implementations. And then we'll have some performance numbers from the paper you read that compare all of these different variants of parallel hash joins. Unlike other lectures we've had in this class, where we're very hand-wavy or ignore the benchmarking numbers in the paper, here we can dig in. A lot of times, for more complex things, it's hard to do a good evaluation because it's a lot of work to implement one idea, and it's very difficult to re-implement a whole other architecture for comparison. So in the case of the HyPer morsel task scheduling stuff from last class, there are other approaches, but it would be very difficult for them to re-implement all of that in their system to do an apples-to-apples comparison. But in the case of join algorithms and similar things, since it's not that much work to implement a hash join in a system, we actually can do an evaluation on a single system and have a deeper discussion about the trade-offs we can make when we implement these things. So it sort of goes without saying, or maybe you're not aware of this, but the hash join is probably the most important operator you can have in an analytical database system. By far, it's the one that is used the most often and that systems usually spend the most time computing, right? Especially in a main-memory system, where we're not worried about fetching things from disk anymore, and if we're on a single-node system, we're not worried about the network, then most of our time is going to be spent in the hash join. So it's really important for us to come up with a really efficient implementation of it. And so going beyond what you implemented in class, we now want to talk about how we actually take advantage of all the extra CPU cores that modern chips have and parallelize as much of the drudgery of a hash join as we can. The goal is that we want all of our cores to be busy, so that we end up becoming memory bound: they're always doing work, and the memory controller can't stream data to them fast enough.
So to give you a motivating example of just how important hash joins are, this is a study that I had a student do last semester, where we took the execution profiles of an OLAP database system and measured how much time the CPU spent on average in the different operator instances. In this case, the database system we used was Cloudera Impala. Now, I realize that Impala is a distributed, shared-nothing, column-store database system, so it's not a main-memory system and it's not a single-node system, but we can extrapolate a lot of the ideas that we see in our measurements here. They're definitely applicable to the type of database systems that we're looking at. So for this experiment, basically what we did was execute the TPC-H benchmark, and Impala would spit out these profile snapshots that say how much time it spent in each of the different operators. For this, we can ignore things like the amount of time spent reading from disk, and we ignore how much time was spent moving data from one machine to the next, because Impala basically runs on top of HDFS, and it has a little shim that it installs on the HDFS node that allows it to stream data out without having to go through the normal HDFS protocol. That's how they get better performance. But again, the query plans are all going to be the same kind of stuff we've been talking about so far. So the five different metrics or operators we measured were the hash join; sequential scans; the union operator, where you basically take two select statements and union their results together; any kind of aggregation function; and then a catch-all category I'll call other, which includes query compilation, code generation, exchange operators, basically anything else not included in the other four. So does anyone want to take a guess, based on what we observed in these profiles, how much time in terms of percentage of the total CPU time was spent doing the hash join? Take a guess. 80%? That's a little high. What's that? 30 is a little low. It's actually 50%. Almost exactly 50%, which is kind of cool. Now, I realize different workloads and different queries would have different distributions of these operators, and it's possible that different database systems have different distributions. But I don't think it'd be that far off. If anybody's watching the video here from another database company and has a different number than this, email me, because I'd definitely be interested. But it's just showing that 50% of our time in the CPU is being spent doing the hash join. The sequential scan is really not that interesting. Ignoring disks, this is just reading a tuple. Maybe we do a predicate push-down and can do some filtering early on, but from a computational standpoint it's not as expensive as the join. This is also ignoring the storage space for the intermediate data we have to use for all these different operators. In the case of the hash join, as we'll see later on, since you have different phases in the hash join, all of those need some temporary storage, whereas in a sequential scan you read something and you write something out. Same thing for aggregations and all these other guys. So the amount of memory you have to use in order to compute the hash join, I also think, is much larger than for these other operators.
But I don't have a good way to quantify that with what we collected. So again, the main takeaway here is that hash joins are expensive. The database system is going to spend a lot of time doing them when you run OLAP queries. So we should try to optimize this as much as possible, because this will give us the most bang for the buck. Another quick observation I'd like to make about hash joins in the context of OLTP systems is that the specialized OLTP systems, like H-Store or VoltDB or TimesTen, for example, usually don't implement a hash join because they don't really need to. If you think about it at a high level, a hash join is basically the same thing as an index nested loop join: you have some tuple, you probe into the index, and you see whether there's a match. Now, in the case of an OLAP query, you're scanning large table segments, so you want that index probe to be very, very fast. In an OLTP query, you're usually only grabbing a small number of tuples. And especially if you're doing a join, you'll have one primary key, like a single customer record, and you do a join to find all their order records, and there aren't going to be very many of those. So in that case, you probably already have an order-preserving index that you can use to probe into to find the few tuples you're looking for. So the reason a lot of OLTP systems don't implement a hash join operator is that they don't need to, because the index nested loop join does essentially the same thing. The difference is that in a hash join, traditionally, you build the hash table on the fly, whereas in this case we have the index already sitting there, and that's why it's fast. So when we talk about the hash join in this class, for today's lecture, we're assuming we're doing an OLAP query that's accessing a lot of data. OK, so the high-level design goals that we have for a modern parallel hash join implementation are the following two things. First, we want to minimize the amount of synchronization that we have to do between our threads as they go through the different phases. We've already talked about this when we talked about how to do efficient concurrency control. The basic idea is that we want our threads to avoid any kind of latching or locking so that they can proceed in parallel. And we talked about latch-free data structures before, so we know about this. We know that this sucks and we want to avoid it. What's different now in the case of a hash join, building on what we talked about last class, is that we want to avoid cache misses when we access data. We want all the data that a worker thread needs to either be close by in the CPU cache or on the memory DIMM that's right next to it. We're not really talking about the NUMA stuff for today's lecture, but the basic ideas from the morsel stuff that HyPer did are applicable to what we're talking about here. So minimizing CPU cache misses is a bit different, and there are a couple of other things we need to be mindful of. The main factors that affect how many cache misses our worker threads are going to have as they execute an operator are, obviously, the size of the cache and how much data we can store in it, but also the TLB, the translation lookaside buffer. In case you haven't taken an OS course in a long time, a quick refresher (I haven't taken one in 15 years).
The TLB is basically a cache that the CPU uses to map virtual memory addresses to physical memory addresses. Because normally, if you don't have this cache, you have to go out to DRAM and look in a giant page table to figure out where the address you're looking for is. So there's a little special area in the CPU caches that allows you to say, this address can be found at this location. Obviously, we want to avoid having to go get data that's not close by, but we also want to avoid having to do this lookup in the mapping table out in DRAM. We want everything close. And then we also have to worry about locality, and there are two variants of this. Temporal locality means the thread is trying to use the same data within the same time period: you fetch something from DRAM, it's in your cache, and you do as much as you can with it before it gets evicted. Spatial locality basically means that if you go fetch a cache line that has a bunch of tuples or records in it, you want to apply all the operations to what's in it rather than jumping randomly back and forth. You want your items to be spatially close together in your memory layout. So there are two different access patterns to be mindful of when we think about minimizing cache misses. The first is what we call non-random memory access. This would be something like a sequential scan where you're looking at records one by one, or an index traversal where you look at one node followed by another. The way we want to improve our cache performance for this is to try to cluster the data we need as close together as possible, ideally within a single cache line, so that, again, it's one fetch to go get what we need. And then, related to this, for every fetch we do, we want to do as much work on that data as possible, again avoiding having to go back and forth to get new stuff. A hash join, on the other hand, is considered random access, because when we do a probe or when we build the hash table, we're hashing things, and that gives us a random location or random slot in our hash table, so we don't have that uniform scan access we would have in the other case. So the way we can improve our cache performance is to partition the data in such a way that, again, we can get as much as possible within a single cache line and reduce the amount of swapping we have in our TLB. And this is what the hash join algorithms that we're going to look at try to achieve. OK, so the basic algorithm for any kind of hash join has three phases. In the first phase, called the partition phase, you're going to take your input relations, say you're doing a join on R and S where R is the outer and S is the inner, take the tuples that are in those relations, and split them up on some hash key using different cores, different threads. Then you have discrete chunks, and the later phases can operate on just those partitions. This phase is completely optional. In the case of your assignment, did you implement it? You didn't implement this step, right? You just took the data in a single thread and did the hash join on it. The second and third phases, though, are required.
In the second phase, the build phase, you actually build the hash table on the outer relation, which you'll then probe later on using the inner relation. We'll talk about how you build hash tables and what kind of hash tables you should use in this process as well. And then in the probe phase, you take all the tuples in the inner relation, go through them one by one, extract the join key, hash it, and probe the hash table you built on R. If you find a match, then you know this tuple satisfies the join predicate you're looking for, and you can emit it as output. So we're going to go through each of these three steps one by one. And again, be mindful that as we go along, we want to make sure that we're parallelizing as much as possible, avoiding synchronization, and avoiding cache misses whenever possible. OK, so in the first phase, the partition phase, the basic idea is that we have some large table, and we want to spend some extra CPU cycles to split it up so that later on, when we do the additional phases, the data will be local. This is like the morsel partitioning the HyPer guys did for their task scheduling, and it's the same general idea, except we're doing it as part of the join operator, whereas in the HyPer case they did it when the data was first loaded, and it just happened that everything was already partitioned ahead of time. So this assumes that you just have a table heap that's spread across memory uniformly, and you're going to scan across it and create these partitions. Now, this seems kind of crazy: you're going to do a scan over the table once just to figure out how to organize it and put it into partitions, and then you're going to scan it again in the build phase to actually build your hash table. But what happens on modern processors, or at least what we hope to achieve, is that by paying this partitioning cost up front, by doing one scan in the beginning, or possibly more scans when we talk about other schemes, we minimize the number of cache misses and TLB misses in the later phases. The cost of going and fetching things from DRAM is so expensive relative to the cost of executing instructions that we're willing to pay this price. And we'll see later on, when we talk about the benchmark evaluation, that the difference is actually quite significant. Another thing about partitioning is that it can potentially help keep things load balanced. If you partition in a nice way, you can guarantee every single thread is operating on the same amount of data. And you can also avoid having to do any synchronization, because if you partition things ahead of time, then in the subsequent phases you don't need to synchronize across those threads; everybody can just work on their one partition. So we're basically going to scan our table and split the tuples up into these output buffers that we're going to call partitions. And what goes inside these buffers depends on what storage model we're using in our database system. If it's a row store, an NSM system, then we can just copy the entire tuple from the table heap directly into our output buffer.
Or, if we're a little clever and we recognize that we only need a certain subset of the attributes in the table for the upper levels of the query plan, then we can do an early projection and rip some of those out. This really makes sense if you have a really wide table, say 1,000 attributes, and you only need five of them to do the join and the rest of the query. When you do the partitioning, you can do the projection and throw things out. In the case of DSM, a column store, this is really easy because you just look at what you need to do the join and only copy those attributes into the output buffer. You also have to include an offset that lets you go back to the original table if you need to stitch things back together, or, if you're using the virtual IDs that some systems use, you would include that as well. Because we're going to shuffle things around, the order we put the attributes into these output buffers is not going to match up with how they're stored in the actual table heap, and that's why you need to keep the offset. So we're going to ignore this for the rest of the class; just be mindful that when I say we're going to put tuples into output buffers, it doesn't necessarily mean we're copying the entire tuple, so it's not as bad as it might sound. OK, so to do the partitioning, there are basically two approaches. The first approach we'll call non-blocking partitioning, and I don't mean non-blocking in the sense of a thread not blocking on another thread because of a latch or a lock. What I mean is that the threads can produce output for the next phase incrementally, as they go along. So as we do the partitioning, we figure out what partition a tuple belongs in, copy it into our output buffers, and it can immediately be used, possibly by another thread, to start building the hash table. The other approach is called blocking partitioning, and this is where we only materialize the results for the next phase once we're done with all of our scans and all of our operations. And in the blocking approach, as we'll see in a second, you may actually want to do multiple scans over your table to do more fine-grained partitioning. This is sometimes called radix partitioning. So sometimes you'll see in the literature a radix hash join, a radix-clustered join, or just a radix join; it always means this approach here. So within the non-blocking partitioning approaches, there are actually two further sub-approaches we can use. Again, in the non-blocking approach, we're going to scan the relation once and generate our output on the fly. Every time we process a tuple, we figure out what partition it goes to, put it in the output buffer, and it's immediately available to whoever needs it next. The first approach is what's called shared partitions. This is basically a single global set of partitions with output buffers that all the threads are going to write into. And of course this means we need to synchronize: we need to use latches to make sure that we don't corrupt the data structure if one thread tries to insert a tuple at the same time as another thread. The other approach is what I'll call private partitions. I think the paper refers to them as independent partitions.
The basic idea is that we're going to have the same sort of set of partitions and output buffers that we have in the shared case, but each worker thread is going to have its own copy of them. They'll have their own private little output buffers that they can write into. No other thread can write into them at the same time, so they don't need to set a latch on anything. And then once they're all done, there's another round where threads come through and combine the private partitions from the different cores, the different threads, into the global partitions. So you end up with the same sort of state in the private partition case as in the shared partition case: you end up with the global output buffers. But in this case, we don't have to set latches as we go along; instead, we have to do some extra copying at the end. So let's go through this in more detail. Here's our data table. Again, we're ignoring what our storage model is; it doesn't matter. We'll do that same sort of partitioning we saw in SQL Server with the row groups: we just pick some offset into the number of tuples and assign a chunk to each core. And for this case, say we're doing the join on attribute B. Each thread will walk through its tuples one by one and compute the hash on the value of that key per tuple, and that tells it which partition it should write into. In the shared case, this is a global set of partition output buffers. So let's say that this thread wants to append a tuple here. We have to set a latch on it before we do this to make sure that we don't trip up anybody else, and we can use a spin lock or whatever synchronization primitive we want from the ones we talked about before. In the case of the private partitions, we extend that division of the tuples down to the output buffers themselves. So now when we do our hashing and insert into the output buffers, those output buffers are private to that particular core. And then after that, we do another round of coalescing, where every single core is assigned some partition group, like P1, P2, all the way to Pn, and they go into the local storage of each core and combine them to produce the final answer. So we have to do a little extra work because we have two phases: we have to do some copying, and we have to store the data three times now, because you have the original data, the private partitions, and then the combined partitions. But in exchange, we don't have to set any latches; there's no synchronization going on here. So those are the non-blocking cases. And again, the key thing about non-blocking is that as we're generating these partitions, some other thread could already be pulling things out and start building the hash table for us.
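To make the private-partition idea concrete, here's a minimal C++ sketch. Everything in it, the PartitionEntry layout, the partition count, the multiplicative hash, is an illustrative choice of mine and not something taken from the paper or a particular system. Each thread scatters into its own buffers with no latches, and a coalescing step at the end merges them into the global partitions.

```cpp
// Minimal sketch of non-blocking partitioning with private (per-thread) output buffers.
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct PartitionEntry {   // early-projected tuple: join key plus offset back into the heap
    uint64_t key;
    uint64_t row_offset;
};

static constexpr size_t NUM_PARTITIONS = 64;   // must match the 6 bits used below
static constexpr size_t NUM_THREADS    = 4;

// Partition hash: top 6 bits of a multiplicative hash (6 bits -> 64 partitions).
static size_t partition_of(uint64_t key) {
    return (key * 0x9E3779B97F4A7C15ULL) >> (64 - 6);
}

int main() {
    // Fake input table: the "heap" is just an array of join keys indexed by row offset.
    std::vector<uint64_t> table(1 << 20);
    for (size_t i = 0; i < table.size(); i++) table[i] = i * 7919;

    // Each thread gets its own private set of partition buffers -> no latches needed.
    std::vector<std::vector<std::vector<PartitionEntry>>> priv(
        NUM_THREADS, std::vector<std::vector<PartitionEntry>>(NUM_PARTITIONS));

    auto worker = [&](size_t tid) {
        size_t chunk = table.size() / NUM_THREADS;
        size_t begin = tid * chunk;
        size_t end   = (tid == NUM_THREADS - 1) ? table.size() : begin + chunk;
        for (size_t row = begin; row < end; row++)
            priv[tid][partition_of(table[row])].push_back({table[row], row});
    };

    std::vector<std::thread> threads;
    for (size_t t = 0; t < NUM_THREADS; t++) threads.emplace_back(worker, t);
    for (auto &t : threads) t.join();   // barrier: all private buffers are done

    // Second round: coalesce the private buffers into one global buffer per partition.
    // (Single-threaded here for brevity; in practice each thread takes a subset of partitions.)
    std::vector<std::vector<PartitionEntry>> global(NUM_PARTITIONS);
    for (size_t p = 0; p < NUM_PARTITIONS; p++)
        for (size_t t = 0; t < NUM_THREADS; t++)
            global[p].insert(global[p].end(), priv[t][p].begin(), priv[t][p].end());
    return 0;
}
```

In a real system you would also size the buffers up front and parallelize the coalescing step, but the structure is the same: no latches during the scan, one barrier, then the merge.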
So radix partitioning is a little more complicated. The basic idea is that we're going to make passes over the data, potentially multiple passes, where in each pass we compute a histogram that tells us the number of tuples that would fall in a particular partition based on the radix of the hash key; I'll explain what a radix is in a second. Then, once we have our histogram, we compute a prefix sum over it, which I'll also explain in a few slides, and that tells us at what offset in our output buffer the partition for a particular CPU should begin. Then we scan back through the table a second time and start writing those tuples into those output buffers. And if we want, we can do this whole thing multiple times; nothing in the algorithm says we can only do it once. We could do this whole three-step process over and over again. The goal is that by doing more fine-grained partitioning using this radix method, we want our partitions to end up being about the same size as a cache line, or some chunk of cache data in our CPU, so that when we go build the hash table, it's one fetch to DRAM to get the block of data, we blast through it in our CPU, build the hash table, and then never have to touch it again. And this, again, seems crazy; you're scanning the data multiple times, and it seems really wasteful. But in practice, it actually turns out to be much better. I think this idea originally came up in the late 1990s. I know Oracle implements this type of partitioning, and DB2 uses it in BLU. I don't know whether the radix partitioning stuff is used in other database systems beyond the major commercial ones; if you search for radix join, nothing much really comes up. So first, what is a radix? The basic idea is that for a given number, at a given position, based on its base (base 10, base 2, whatever), the radix is the value of that position. Say we have an input sequence of these numbers. If you want to compute the radix for the first position of the integer, it's just whatever the value is at that slot. So for 89, it's just 9. For 12, it's 1, 2, so it's just 2. It's pretty straightforward. And if we want the radix for the next position, the tens position, same thing: 8, 1, 2, 0, 4, 6. And we can do this pretty quickly, pretty efficiently in the CPU. Now, a prefix sum is just a running total of an input sequence. Let's say we have a bunch of numbers: 1, 2, 3, 4, 5, 6. We're going to generate the prefix sum as the running total going from left to right over these values. So in this case, at the first position, 1 doesn't have a predecessor, so its value is just 1. In the second position, we take the next value in our input sequence and the previous value in our prefix sum, add the two together, and voila, you get 3. That's all this is. We just go down the line and compute this efficiently, and we get our running total that way. So now let's see how we use the radix and the prefix sum to do radix partitioning. There's also a sorting algorithm called radix sort, which is based on roughly the same idea. So let's say we have a sequence of numbers here. Assume that we're hashing the tuples' join keys and we get these hash values; for now it's a really simple hash, so there are only two possible digit values. And for this, we're only going to have two CPUs, so we'll have two threads running at the same time. In the first step, we want to scan through the input data and construct the histogram that's going to tell us how many tuples are going to be in each particular partition. We'll divide our tuple space into separate chunks per core, just like we did before. And since this is the first round, we'll use the first radix position of the hash values. So as you can kind of see, the values are 0, 1, 1, 0, 0, 1, 1, 0, or 1.
So this is going to be the radix of the hash key for this pass through the data. Now, in parallel, each core goes and grabs these values one by one, inspects this radix value, and that says which partition the tuple goes to. And what they're basically doing is counting the number of 0s and the number of 1s in total, and that gives you the histogram. So that says, for this round, partition 0 at this CPU should have two tuples, and partition 1 should have two tuples; for this one it's 1 and 3. Now, based on this, we use the prefix sum of these counts to tell us at what offset the partition for a particular CPU should appear in our output buffer. So it's not really a linked list anymore; it's just a giant byte array that we can write into at any point. In this case, using the prefix sum, partition 0 for CPU 0 starts here, partition 0 for CPU 1 starts here, and so on. And now, because the cores know what offsets they need to start writing tuples at, they don't have to worry about doing any synchronization with any other thread running at the same time. You're pre-partitioning the output buffer space based on the prefix sum of the histogram. So now, in the third step, we go back and scan through our input sequence again, our tuples, and start assigning them to their particular partition. And as you can see here, since this guy needs to go to partition 0, CPU 1 knows that this is the place it should write it, and nobody else will be writing there at the same time. We just go down the line one by one and keep doing this for everyone. So now what we have is two partitions, partition 0 and partition 1. We could stop here, and this would achieve basically the same thing that the non-blocking algorithms did with their output buffers. But if we want more fine-grained partitions, we can just do a whole other round of this. We can recursively apply the same steps to this input sequence and partition even further. You would do the same thing, except now you choose a different radix position, and you just go through them one by one and do all the same steps. In this case, this guy spends a little more time because he has more values, because the distribution is slightly skewed. And we'll see how this can hurt the radix approach later on when we look at really skewed distributions. So, any questions about radix partitioning? Yes. So I'm missing an extra step: there should be an array here of the actual tuples; you hash them, and then you take the radix of the hash. So when this core is doing that first scan to build the histogram, you hash the tuple, then use the radix position of the hash. For simplicity, I just showed the hash values after the tuples have been hashed. Correct, yes. It's not just this value, though; maybe you're copying a bunch of attributes, and they aren't a fixed size. So his question is, in this case, when we use the prefix sum to compute the offsets of the partitions in our output buffer, if the tuple is variable length, then this won't work. But this is why you try to avoid variable-length tuples or variable-length attributes in your database system. All integers will be fixed length, all floats will be fixed length, timestamps, things like that. The only things that aren't fixed length are varchars, varbinaries, and things like that. You don't really do joins on blobs or varbinaries, so that doesn't really matter. In the case of strings, you can just use dictionary encoding, so now instead of the string you have a fixed-length integer. I absolutely agree that if you didn't have fixed-length attributes, this would not work. But since we do, it's worth doing compression anyway, not just to reduce the amount of data we have to marshal around but also so we don't have to do extra bookkeeping for this kind of stuff. But that's a good point. Another question over here? OK.
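To tie the three steps together, here's a minimal, single-threaded C++ sketch of one radix-partitioning pass: histogram, then prefix sum, then scatter. The number of radix bits and the multiplicative hash are illustrative choices of mine, not what the paper used. With multiple threads you would keep one histogram row per thread, so the prefix sum hands every thread its own disjoint write offsets and nobody needs a latch.

```cpp
// One radix-partitioning pass: histogram -> prefix sum -> scatter.
#include <cstddef>
#include <cstdint>
#include <vector>

static constexpr unsigned NUM_RADIX_BITS = 6;                 // 2^6 = 64 partitions
static constexpr size_t   NUM_PARTITIONS = 1u << NUM_RADIX_BITS;

static uint64_t hash_key(uint64_t key) {
    return key * 0x9E3779B97F4A7C15ULL;                       // simple multiplicative hash
}
static size_t radix_of(uint64_t hash, unsigned pass) {
    // Take NUM_RADIX_BITS of the hash; a later pass shifts over to the next group of bits.
    return (hash >> (pass * NUM_RADIX_BITS)) & (NUM_PARTITIONS - 1);
}

std::vector<uint64_t> radix_partition(const std::vector<uint64_t> &keys, unsigned pass) {
    // Step 1: histogram -- count how many keys fall into each partition.
    std::vector<size_t> histogram(NUM_PARTITIONS, 0);
    for (uint64_t k : keys) histogram[radix_of(hash_key(k), pass)]++;

    // Step 2: exclusive prefix sum -- the starting offset of each partition in the output.
    std::vector<size_t> offset(NUM_PARTITIONS, 0);
    for (size_t p = 1; p < NUM_PARTITIONS; p++)
        offset[p] = offset[p - 1] + histogram[p - 1];

    // Step 3: scatter -- a second scan writes each key to its partition's next free slot.
    std::vector<uint64_t> out(keys.size());
    for (uint64_t k : keys) out[offset[radix_of(hash_key(k), pass)]++] = k;
    return out;   // partition p now occupies one dense, contiguous range of the output
}

int main() {
    std::vector<uint64_t> keys = {89, 12, 23, 8, 41, 66};
    std::vector<uint64_t> partitioned = radix_partition(keys, 0);   // first pass
    // A real radix join would recurse *within* each partition for a second, finer pass.
    return partitioned.size() == keys.size() ? 0 : 1;
}
```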
So this is radix partitioning. This is considered, in some ways, the state of the art for doing a hash join in a main-memory multi-core system. OK. So, yes? His question is, what is the synchronization cost of computing the prefix sum? Insignificant, right? Everyone has to finish, so you have a barrier that says I can't compute the prefix sum until all my cores say they're done with their scan. But then you're talking about a couple of bytes of data per core, right? You can have one thread rip through that; it's not a very expensive calculation. So you don't set a latch to compute the prefix sum; it's not necessary. Any other questions? OK. So we're done with the partitioning phase. It's optional, so we may not have done it, but chances are we probably did in a modern system. Now our outer relation is spread out into these different partitions, and we want to have our threads scan through them and build our hash table. The basic idea is, for every tuple you find, you extract the join key, you hash it, and then you put it into your hash table. The key thing to point out, though, is that the hash function we use in the build phase has to be different from the hash function we used in the partitioning phase, so that we get a better randomized distribution of our data and avoid the pitfalls of skew. In the probe phase, does the hash function need to be the same as in the build phase or the partition phase? What's that? It has to be the same as the build phase, right? Because it doesn't make sense to probe a hash table using a different hash function than the one it was built with; you're never going to find any matches. So in the partition phase, it's one hash function; in the build and probe phases, it's the same, second hash function. As we build our hash table, the main goal is that the buckets where we store the actual tuples should be roughly a few cache lines in size, so that any time we do a lookup and go get a bunch of data, we're not marching a lot of stuff back and forth from the memory controller. We'll talk about how to build the different hash tables in a second, but first I'm going to spend some time talking about hash functions, because this gets ignored a lot of the time in an introductory database course: it's just, yeah, you hash something, and you don't really understand what the hash function is actually doing. So the key thing to point out about the hash functions you would use in a hash join algorithm is that they're not like SHA-1 or SHA-256, the things people use for cryptographic hashing. We don't care about our hash being computationally difficult to break. The only things we care about are that we can compute the hash very quickly and that we have a low collision rate. A low collision rate means that if we have, say, two strings that are distinct, we don't want them to hash to the exact same value, which can occur.
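Since the partition phase and the build/probe phases need different hash functions, one common way to get them, and this is my illustration rather than what any of these systems necessarily do, is a single seeded mix function where each phase just uses a different seed. The mixer below follows the splitmix64 finalizer; treat the seeds as arbitrary placeholders.

```cpp
// Deriving the partition-phase and build/probe-phase hash functions from one seeded mixer.
#include <cstdint>

static uint64_t mix64(uint64_t x, uint64_t seed) {
    x += seed;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// Two members of the hash family, picked by seed.
static uint64_t partition_hash(uint64_t key) { return mix64(key, 0x1234ULL); }   // partition phase
static uint64_t build_hash(uint64_t key)     { return mix64(key, 0xABCDULL); }   // build AND probe

int main() {
    uint64_t key = 42;
    // The probe must use the same function as the build; the partition phase uses the other one,
    // so the key distribution inside each partition gets re-randomized when the table is built.
    return (partition_hash(key) != build_hash(key)) ? 0 : 1;
}
```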
So the modern hash functions that people use, there are a bunch of them out there, but these are the main three that I know people use. We're not talking about something like MD5; that would be a terrible hash function for this. These are ones designed to be fast and very general purpose. So MurmurHash, I think, is just something some random dude wrote and put on the internet, and it's designed to be this fast, general-purpose hashing function. It's actually used in a lot of things, both in research and in industry. Then, when the MurmurHash guy started getting a lot of traction about this better hash function being out there, the Google folks started looking at it and deciding whether they wanted to switch all their internal hashing implementations to use MurmurHash. I think at the time they were looking at MurmurHash2. What they decided to do instead was take a lot of the ideas in MurmurHash2 and tweak them to be more efficient on modern CPU architectures, and that became CityHash. The big trade-off they made is that they take a lot of shortcuts to run faster for short strings, because that's the kind of key they see a lot of in their applications and their workload. By short keys, I mean anything less than 64 bytes, and we'll see in a second that they do extremely well when strings are shorter than that. Then, a few years after CityHash came out, they came out with something called FarmHash. The idea of FarmHash is that it's an improved version of CityHash, but it sacrifices some performance in exchange for a lower collision rate. We'll talk about why a low collision rate is important later on, but think about it this way: if two things hash to the same value but they're not actually the same thing, then you need to keep track of both of them, and when you go do a probe, you have to compare your tuple's value against both of them, and that gets slow. So this graph here is actually a benchmark I ran last night. There's this dude on GitHub who wrote his own framework that takes a bunch of these hash functions and, in a for loop, spins through them as fast as it can to see how much throughput you can get for hashing keys of different lengths. In his evaluation he had a bunch of other hash functions too, but I'm only showing the main ones. The first is the built-in hash function that comes with C++11; then we have MurmurHash3, and then Google's CityHash and FarmHash. So there are a couple of things to point out. I'm not a hash function person, but I know a little bit about what's going on here. The first thing you see is a bunch of these spikes that all of them have. In the case of CityHash and FarmHash, they're more pronounced; in MurmurHash it's lower, and in std::hash it's a little bit bigger. These basically correspond to when the bytes they're dealing with aren't aligned. The way the algorithms work, they take advantage of having enough data within your cache lines and registers to do very quick operations on it, so when you spill over to the next word, you lose some of that performance. The other cool thing to point out is that in the case of CityHash and FarmHash, they do much better up here, but then beyond this point they do worse or better and oscillate back and forth.
This here corresponds to the 32- and 64-byte string lengths they were optimizing for. They get to this point and they're doing phenomenally, and then beyond that, when you get to 128, 192, and 256 bytes, they don't do as well. So this is the sweet spot Google was shooting for in their implementation of these algorithms, and that's why you see this behavior. This is showing throughput not as the number of hashes computed per second, but as the amount of data hashed over time, so it's in terms of megabytes. What's missing from this, though, is the collision rate. This guy's framework doesn't measure what the collision rates of these algorithms are. But according to what Google writes, the FarmHash folks are giving up performance compared to CityHash in exchange for a much better collision rate. And so when you use it to probe your hash table in a hash join, you may actually end up being faster overall, even though the hash function itself is slower. This was run on one of the machines in our database group's cluster. It's not super high end, but he has reports on his website, and I've run this on my laptop, and you see roughly the same pattern. OK. So now that we have a hash function, we can pick one of the ones I talked about on the last slide, and we can talk about how we actually implement our hash table. The two basic approaches we'll talk about are the chained hash table, which is the default way you would build a hash table, and then a more modern variant called cuckoo hashing, or a cuckoo hash table. We're not going to use the linear hashing or extendible hashing approaches that we talked about in an intro database class, because those hash tables are designed to be built incrementally: as you insert tuples, you can extend things dynamically as you go along. In our case, we don't want that, because when we build our hash table for a join, we don't want to be resizing things all the time. Typically what happens is the optimizer selects ahead of time how many slots the hash table should use for a join, based on what it knows about the statistics and cardinality of the data itself. So we don't want a dynamic hash table that can resize; we want a fixed-size one, based on the knowledge we can gain by observing the data itself. The chained hash table is the easiest hash table to build. You basically have a bunch of slots, and each slot points to a linked list of buckets. When you hash your key, you figure out which slot it goes to, and then you find a free position for it in that slot's linked list of buckets. That's how you handle collisions: since you're always just adding entries to the same bucket chain, when you want to do a lookup to see whether an element is present, you jump to that linked list of buckets and scan the tuples to see whether the thing you're looking for is there. In the best case, if you have a completely uniform distribution of your data across all your slots, a lookup in a chained hash table is O(1). In the worst case, if all your keys hash to the exact same slot, so it's just one giant linked list of all your values, a lookup is O(n). Inserts and deletes cost the same as a lookup.
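Here's a minimal C++ sketch of the build and probe phases against a chained hash table. It's sized once up front, with no resizing; the 2x sizing factor, the hash, and all the names are illustrative choices of mine, not anything from the paper or a particular system.

```cpp
// Build and probe against a fixed-size chained hash table.
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

struct Entry {
    uint64_t key;
    uint64_t row_offset;   // points back into relation R's storage
    Entry   *next;         // collision chain within one slot
};

class ChainedHashTable {
public:
    explicit ChainedHashTable(size_t num_build_tuples)
        : slots_(2 * num_build_tuples, nullptr) {}

    // Build phase: push the new entry onto the front of its slot's chain.
    void insert(uint64_t key, uint64_t row_offset) {
        entries_.push_back(std::make_unique<Entry>(Entry{key, row_offset, nullptr}));
        Entry *e = entries_.back().get();
        size_t s = hash(key) % slots_.size();
        e->next = slots_[s];
        slots_[s] = e;
    }

    // Probe phase: walk the chain and compare keys (collisions are what force extra compares).
    std::vector<uint64_t> probe(uint64_t key) const {
        std::vector<uint64_t> matches;
        for (const Entry *e = slots_[hash(key) % slots_.size()]; e != nullptr; e = e->next)
            if (e->key == key) matches.push_back(e->row_offset);
        return matches;
    }

private:
    static uint64_t hash(uint64_t key) { return key * 0x9E3779B97F4A7C15ULL; }
    std::vector<Entry *> slots_;                    // slot array, fixed size
    std::vector<std::unique_ptr<Entry>> entries_;   // owns the bucket entries
};

int main() {
    ChainedHashTable ht(4);                                  // ~4 build tuples -> 8 slots
    ht.insert(10, 0); ht.insert(20, 1); ht.insert(10, 2);    // build on outer relation R
    return ht.probe(10).size() == 2 ? 0 : 1;                 // probe with a key from inner relation S
}
```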
There's not much to really talk about in the diagram here. Again, it looks basically like the output buffers we had before; these are just linked lists with slots where we can put in tuples, and again, the contents of the tuples depend on what storage model we're using. So, somewhat obviously, the key problem we have with a chained hash table is that if we have collisions, tuples mapping to the same slot, then we have to do a bunch of extra work to see whether we actually have the match we're looking for. We want to avoid that as much as possible. One way to do this is just to have a really big chained hash table. The theory basically says that if you have a chained hash table with two times as many slots as the number of input tuples, then you'll have a low collision rate, possibly none at all. But this is obviously pretty wasteful: if you have a million tuples, then you need a two-million-slot hash table, and that gets to be really inefficient. So we want to find a better approach, and this is what the cuckoo hash table tries to solve. The cuckoo hash table idea came out, I think, in 2001, but like I said, a lot of the newer database systems are now implementing this technique. The basic version of a cuckoo hash table is that we maintain multiple hash tables, each using a different hash function. When we insert a tuple, we hash it twice, or however many hash functions we have, and we try to find a free slot in any of those hash tables. If we find one, we put it there. If we don't find one, meaning somebody's taken our slots, then we can steal one, and then we have to reorganize the hash table by moving the thing we evicted to some other location. What's awesome about the cuckoo hash table is that lookups and deletions are always going to be O(1): the record comes in, you hash it as many times as you have hash functions, and if it's not in any of those slots, you don't have to traverse anything to go look for it; it's either there or it's not. Insertions are a little more complicated, because we have to do this eviction and rebalancing. So I'm going to go through an example. Let's say our cuckoo hash table has two internal hash tables. To you, the outside programmer using this cuckoo hash table, you don't know that there are multiple hash tables and you don't care; it's abstracted away from you, and logically it's one hash table. So let's say we want to insert x. At the very beginning, we've already pre-allocated all the slots, but the table is completely empty. Because we have two hash tables, we hash x twice: we have one hash function for the first table and one hash function for the second table, and they can't be the same. So here, the tuple hashes to these two slots. Both of them are empty, so we could put it in either one; let's just say for this case we put it in this position. There's no right or wrong way to do this from an algorithm standpoint; you just flip a coin and pick whichever one you want. Now we want to insert y. Same thing: we hash it twice with the two different hash functions. And we see that in this case, for the first hash table, it maps to the slot where x is already stored.
So we don't want to store ourselves there, but we can store ourselves over here, because this slot was empty. In that case, the decision isn't random anymore: we see this slot's taken and that one's not, so we'll always choose the empty one. Now let's say we insert z. We hash it twice, and we see that it maps to the slot on the first table where x is and the slot on the second table where y is. Now this is a problem, because we don't have a free slot we can pick. So in this case, we'll flip a coin and choose to store it in the second hash table. What we'll do is evict y and replace it with z; now z is stored there. But now we need to reinsert y back into our hash table at some other location. So we hash it again, and this time we only have to hash it for the other hash table, because we know which one it came from; if we hashed it again with the same hash function, we'd get the same location we just came from. So we hash it back over here, but now we see that this is the slot that x is in. So we do the same thing: we let y steal it, and now we need to hash x again for this other hash table, and it ends up here, where there's a free slot. The insert of z technically is not complete until we finish all this data movement. And I'm not talking about how we do this concurrently or in a thread-safe manner; there's actually some research done by Dave Andersen here at CMU on a concurrent cuckoo hash table that they use in memcached. We can ignore all that for now. So what's an obvious problem here? Cycles, right? It could be the case that x evicts something that then maps back to where y and z are located, and we do this all over again. So we have to make sure that when we do this rebalancing, we don't get caught in an infinite loop; we have to recognize when we have a cycle. And if we find that we have a cycle, what we have to do is rebuild the entire hash table using different hash functions. We'll pick different seeds for the hash functions and just go through every single slot, every single key, and rehash everything. The idea is that, from a probabilistic standpoint, that will rebalance things and we won't have cycles in the future. Again, the theory works out that if you have two hash functions, two hash tables, in your cuckoo hash table, you probably won't need to rebuild the table until at least one of them is about 50% full. But if you use three hash functions, you probably won't need to rebuild until it's about 90% full. Now, there's another factor in all of this too, which is how many slots you have. You could have a million slots for your tuples, and you'd probably never get to that 50% occupancy. So there's a trade-off between the number of hash functions used, which adds computational complexity to our cuckoo hash table, and the number of slots, which adds storage overhead. There are all these different parameters that are difficult to get right in your cuckoo hash table implementation, and this is usually why the optimizer selects these things for you. But in many cases, they also just pick a default and go with it.
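Here's a minimal C++ sketch of a two-table cuckoo hash table with the evict-and-relocate insert just described. Instead of real cycle detection it just bounds the length of the displacement chain; a production table would rebuild everything with new seeds when that bound is hit. The seeds, sizes, and bound are all illustrative choices of mine.

```cpp
// Two-table cuckoo hash table with evict-and-relocate inserts.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

class CuckooTable {
public:
    explicit CuckooTable(size_t slots_per_table)
        : tables_{std::vector<std::optional<uint64_t>>(slots_per_table),
                  std::vector<std::optional<uint64_t>>(slots_per_table)} {}

    // Lookups are O(1): at most one probe into each of the two tables.
    bool lookup(uint64_t key) const {
        return tables_[0][slot(0, key)] == key || tables_[1][slot(1, key)] == key;
    }

    // Returns false when the displacement chain gets too long (likely a cycle);
    // the caller would then rebuild the whole table with different hash functions.
    bool insert(uint64_t key) {
        size_t which = 0;
        for (int displaced = 0; displaced < kMaxDisplacements; displaced++) {
            auto &cell = tables_[which][slot(which, key)];
            if (!cell.has_value()) { cell = key; return true; }   // found a free slot
            std::swap(*cell, key);     // steal the slot; now re-home the evicted key
            which = 1 - which;         // the evicted key must go into the *other* table
        }
        return false;                  // give up: rebuild with new seeds
    }

private:
    static constexpr int kMaxDisplacements = 32;
    size_t slot(size_t which, uint64_t key) const {
        const uint64_t seed = (which == 0) ? 0x9E3779B97F4A7C15ULL : 0xC2B2AE3D27D4EB4FULL;
        return (key * seed) % tables_[which].size();   // one hash function per table
    }
    std::vector<std::optional<uint64_t>> tables_[2];
};

int main() {
    CuckooTable table(1024);                         // 2 x 1024 slots total
    for (uint64_t k = 1; k <= 500; k++)
        if (!table.insert(k)) return 1;              // would trigger a rebuild in a real system
    return (table.lookup(42) && !table.lookup(9999)) ? 0 : 1;
}
```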
All right, so now we have our hash function, we have our hash table, and we've built it, so we can go through and start doing the probe phase. The probe phase is pretty straightforward. For each tuple in the inner relation S, we extract its join key, hash it, and then check whether there's a match in whatever bucket or slot we land on in the hash table we built on relation R. If we did the partitioning phase and split things up, then all our worker threads can just work on some partition of the data; the HyPer guys did this with their morsel stuff. Otherwise, if we're not partitioned, then we have to make sure we synchronize our scan or split the probe relation up per core as we go along. And that's it; that's the hash join. So based on all the things I've talked about so far in the partition phase, the build phase, and the probe phase, we can define some basic variants that mix and match the different ideas. We're going to talk about these four. The first is no partitioning at all, using a shared hash table. The second is non-blocking partitioning using the shared global buffers. The third is non-blocking partitioning with private buffers. And the last one is blocking partitioning, the radix partitioning stuff we talked about. We're not going to compare the different hash tables or the hash functions; I don't actually know what Spyros used in the paper, I should ask him, so let's assume they did a good-faith implementation of a high-performance hash function. This table here lays out the differences between the different join variants I've talked about: whether things are partitioned, how many scans you have to do over the data, where the hash table lives, and whether you have to synchronize during the build or the probe phase. Again, this is basically everything we talked about before. And these are the barriers I mentioned for the private partitioning and the radix partitioning, where you have to wait for all the threads to finish one piece of the computation before you can go to the next step; that's what I mean by a barrier, and it can be implemented with a counting semaphore or something like that. OK, so now let's talk about the evaluation that was done in the paper. For this, they're using a synthetic workload with two tables that have a primary key and foreign key relationship. The outer relation, which we use in the build phase, has 16 million tuples that are 16 bytes each, and the inner relation, which we use in the probe phase, has 256 million tuples. And we have two different distributions of the data: a uniform distribution, where every tuple in the outer relation has roughly the same number of foreign key children in the inner relation, and a highly skewed case, where a hot spot of tuples in the outer relation has the majority of the children in the inner relation. For these experiments, they're not doing output materialization, so we can ignore the cost of that; it's just how fast we can compute the join. In the paper, they do experiments on a Sun SPARC Niagara CPU as well as an Intel Xeon. I'm ignoring the SPARC stuff; let's just focus on the Xeon numbers, because that's pretty much the most common server CPU architecture out there today. So in the first experiment, we're going to use the uniform data set.
They're running this on a Xeon CPU, I think from around 2010, and it's not a NUMA architecture like we talked about last class: it's one socket with one set of memory banks, so the latency of accessing any piece of data from any core is going to be the same. What you see here is a stacked bar chart that shows how much time is being spent in the different phases. For this, we measure the number of cycles per output tuple, because it's a better way to evaluate these different approaches in terms of their efficiency and how much time they're spending waiting on memory; if you just measured instructions, you would miss the cache misses. So what you see is that the radix partitioning, the one-pass radix partitioning, actually performs the best. Again, this sounds crazy, because we're scanning the data, at least for the outer relation, three times: twice for partitioning and once for building the hash table. You can see the partitioning cost is pretty high for this one compared to the others, and the no-partitioning case has no partitioning cost at all. In the shared partitioning case, the cost has to do with the synchronization primitives; they're showing that even if you don't bother partitioning the data at all, you can still outperform that one, because of the global data structure all the threads are writing into. So here, the radix join ends up about 24% faster than the no-partitioning case. And when you actually measure what's going on in the system, the no-partitioning case has three times as many cache misses and 70 times as many TLB misses as the radix case. That's why its probe phase is so expensive, and this is why people have argued that you want to use partitioning to avoid that overhead. But now, if you look at a skewed workload, what they show is that the no-partitioning case actually performs the best, right? And the reason, ignoring the shared case for now, is that in these two partitioned variants, because the data is skewed, when you create the partitions some cores are going to have a lot of work to do and some are going to have almost none. The ones with no work to do end up finishing much earlier than the others and just sit around waiting for the stragglers. They claim that the work-stealing approach the HyPer guys talk about doesn't actually help you with this problem, because you just end up paying more to move data around. Now, they didn't do experiments on a NUMA system to actually measure this, but this is what they claim in a follow-up addendum to the paper that was addressed to the Oracle guys. Basically what happened was that Spyros presented this at SIGMOD, the Oracle guys were sitting in the audience and said, this doesn't make sense, this doesn't match up with what we see, and then there was a follow-up paper that improves this and handles NUMA a little better, right? And they talked about how work-stealing could help, but they didn't measure it; or rather, they don't think it would help, but they didn't measure it. The shared partitioning case is so high because, actually, why is that high? I forget. Yeah, I think that's right: because of the skew, everyone's trying to write to the same partition, and that becomes a single bottleneck and you have lock thrashing. Yeah, thank you, I forgot.
OK, so again, their basic argument is that in a NUMA system the no-partitioning case actually works out faster, but if you apply some of the optimizations that Oracle has, and I also think if you have a NUMA architecture that your radix scheme is optimized for, you can beat this. OK, so one high-level observation about everything I've talked about so far is that I've ignored a lot of decisions you would have to make in your system. I've ignored the question of whether you should even partition or not, right? Clearly we showed that partitioning helps for this type of workload and not so much for that one, so how do I decide at runtime whether I should do this or not? I've ignored how many partitions to use. I've ignored how many passes I should take over the data if I'm doing radix partitioning. In a real database system, the optimizer picks these things for you. As part of generating the query plan, when it generates a physical plan from a logical plan, it fills in these parameters: it says, I know we have this much memory, I know we have this many cores, this is what I should do, this is how many partitions to use. Some things, like the hash functions and the cuckoo hash table parameters, are usually hard-coded, but anything more dynamic that depends on what the actual data looks like, the optimizer selects. And this is sort of a well-guarded secret in a lot of the commercial systems; they don't tell you exactly how they make these decisions. But roughly, you look at the catalog, look at the statistics, and you say, this is what my data looks like, so I'll set the parameter this way. So now let's look at what happens when the optimizer gets something wrong. In this experiment, it's the radix partitioning case with the uniform data set. Along the x-axis, we have different numbers of partitions used to split up the data, and we have radix partitioning with one pass and radix partitioning with two passes. This horizontal line represents the best-case no-partitioning algorithm from the first set of graphs that I showed. So this shows the same number we had before: with 4096 partitions, the one-pass version is 24% better than the no-partitioning algorithm, but the two-pass approach is 5% worse. It's the same algorithm, with the same number of partitions, but it has dramatically different performance. And if we do a sweep across all these different parameter values, you see there's sort of a parabola that goes down and comes back up, so there's this local minimum, the sweet spot where we want to be. In the case of 4096, they knew that when they did all the other experiments, and that's why they picked it. In the two-pass case, it's not quite as uniform, right? The best case there is actually 32,000 partitions. So this goes to show that the performance characteristics you get from your hash join can be dramatically different based on what the optimizer chooses for these runtime parameters. And the quality of the choices it makes is highly dependent on the statistics it has collected about your database. If you have bad or stale statistics, you might end up choosing 64 partitions and getting terrible performance.
So they argue in the paper that although the no-partitioning hash join is slower in some cases than the other algorithms, it may actually turn out to be better, because you have fewer knobs, fewer choices you have to make in your implementation, and so in the average case you're less likely to get things wrong.
So one last experiment: I want to show the effects of hyperthreading. Again, they're running on a single-socket CPU that has six hardware cores, and each hardware core has two hyperthreads. What this is showing is that as they scale up the number of threads used to compute the join, the no-partitioning case scales linearly, or almost linearly; look at the dotted lines, the ideal case. The radix-partitioning join actually gets worse once you go into the hyperthreads. Part of this is that having multiple threads hides the cache-miss and TLB-miss latency in your system, but it becomes more pronounced with hyperthreading because you now have two threads sharing the same caches and TLB on a core, right? They're evicting things that the other one may need. The radix-partitioning join already has few cache misses and TLB misses because everything is split up into nice cache-line-sized chunks, so hyperthreading doesn't really help it. The no-partitioning join just gets better performance because there's no synchronization between the threads; they're blasting through the data as fast as possible. So again, I'd be interested to see how this would look on a NUMA architecture, whether you'd see roughly the same gains for the no-partitioning case.
Okay, so to finish up: on modern CPUs, it's important to have a good hash function and a good hash table implementation. The Wisconsin guys argue that the simplest algorithm actually turns out to be the best; all the fancy radix stuff doesn't help you as much as you'd think it should. The one thing we've ignored in all this discussion, though, is that there are a bunch of additional optimizations we can apply to the radix join algorithm that we haven't talked about, like vectorized SIMD instructions. A lot of the low-level operations, as we build the hash tables or do the radix clustering and partitioning, can use SIMD: load a bunch of data into wide registers and then invoke one instruction to apply whatever operation we want, and build our partitioning scheme or our hash table very efficiently. There's a little sketch of that idea below. We'll talk more about this later in the semester when we get to vectorized execution, but here we've completely ignored it, and I think it would provide additional benefits for the radix stuff over the no-partitioning case. And so, any questions about parallel hash joins?
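To give a rough sense of what that vectorized work looks like, here is a small sketch, mine and not the paper's, that uses AVX2 to hash eight 32-bit keys at a time and pull out their radix partition IDs. The multiplicative-hash constant and the radix-bit count are assumptions, and a full implementation would still need to move the actual tuples into their partitions; this is just the flavor of work that maps well onto wide registers.

    #include <immintrin.h>
    #include <cstdint>
    #include <cstddef>

    // Sketch: radix-partition ids for 8 keys per iteration with AVX2 (illustrative only).
    constexpr uint32_t kHashConst = 0x9E3779B1u;   // assumed multiplicative-hash constant
    constexpr int kRadixBits = 10;                 // assumed fan-out of 1024 partitions

    void partition_ids_avx2(const uint32_t* keys, uint32_t* out, size_t n) {
        const __m256i mul  = _mm256_set1_epi32(static_cast<int>(kHashConst));
        const __m256i mask = _mm256_set1_epi32((1 << kRadixBits) - 1);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256i k = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(keys + i)); // 8 keys
            __m256i h = _mm256_mullo_epi32(k, mul);   // 8 hashes with one multiply instruction
            __m256i p = _mm256_and_si256(h, mask);    // keep the low radix bits
            _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i), p);
        }
        for (; i < n; ++i)                            // scalar tail for the leftovers
            out[i] = (keys[i] * kHashConst) & ((1u << kRadixBits) - 1);
    }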
Yes? The question: partitioning, but not moving data to a different core, right? As in, once you've partitioned using the partition function, you want to do something more fine-grained, but you still want to keep the data on the same core. So his question is about what happens during partitioning; let me jump back here. So in this case it has to be multiple rounds, right? Let's use the radix one. As I build this thing out, I'm using one hash function, right? And you're saying that when I come back for the second round, I want to make sure I don't end up moving data from one core to the next? Or actually, you said the build phase, right? Yeah, in the build phase. But it's already partitioned, so every core is only looking at the data that's already local to it in its partition. So when you hash it, even though it's a different hash function, it's still not going to leave; it's not going to end up on another core. Yes? Right, you've already partitioned things, so by moving data around you mean the hash table could end up at a different memory location, and on a NUMA architecture you don't want that. So the hash function, the second hash function? Yeah, so he's saying: I've always read that you want to use different hash functions for the partition phase and the build phase. You're arguing that if you use the same hash function, then when you build the hash table it might end up stored in a memory location on the same core where the data already is. Yeah. A similar hash function, but more fine-grained. What do you mean by more fine-grained? Like the radix thing: in the partition phase you use some of the bits, and in the build phase you use the next set of bits, something like that. The problem with using the same hash function is the collisions. Yeah. If you use a more fine-grained version of it there, that's really just another hash function. Yes. Yeah, that's a whole other topic I didn't talk about either, right? This was just a high-level overview; you know, Java's object hash isn't the only thing in the world, there's other stuff out there. And the CityHash stuff is what we use in practice. We can talk about this offline, we can talk about this later, but in general you want to use different hash functions because of collisions. Any other questions? Yes, in the back.
The question is, can we talk more about scans in the Bw-Tree? Okay, what's your question? The question is, how do you do a scan in the Bw-Tree with a delta chain and a base page? Yeah, so the question is: say you're scanning the leaf nodes, and the data isn't all in order; you're jumping along the leaves, right? What happens is that when you follow the logical pointer from one leaf to the next, you end up at the top of that leaf's delta chain. So in that case you have to build a temporary, sorted buffer of the values you need, which you can then emit. Yes, there's a little temporary storage you have to keep. And you have to be mindful of some bookkeeping as you go down the chain, because you're walking the deltas in time order, newest first. If you see a delete for something that was inserted above it, you just ignore the delete, because you know the insert came after it. But if you see a delete and you haven't seen anything for that key above it, then when you get to the base page, or even further down the chain, if you see that same key, you need to make sure you remove it. So there's a little bookkeeping you have to do. Correct, yes. Because whatever is outside the index expects to get things back in sorted order. Here's a rough sketch of that bookkeeping.
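A minimal sketch of that leaf-scan bookkeeping, assuming a simplified Bw-Tree leaf where insert and delete delta records sit on top of a sorted base page. The record layout and names here are my assumptions, not the course implementation; the point is just that you walk the chain newest-to-oldest, let the first verdict you see for each key win, merge that with the base page, and hand back a sorted temporary buffer.

    #include <cstdint>
    #include <map>
    #include <set>
    #include <utility>
    #include <vector>

    // Simplified Bw-Tree leaf pieces (assumed layout, illustration only).
    enum class DeltaType { Insert, Delete };
    struct Delta {
        DeltaType type;
        uint64_t key;
        uint64_t value;     // ignored for deletes
        const Delta* next;  // next-older delta; nullptr means we've reached the base page
    };
    struct BasePage { std::vector<std::pair<uint64_t, uint64_t>> sorted_kv; };

    // Build the sorted view of one leaf for a range scan.
    std::vector<std::pair<uint64_t, uint64_t>> scan_leaf(const Delta* chain, const BasePage& base) {
        // Start from the (older) base page, then let the newest delta per key override it.
        std::map<uint64_t, uint64_t> live(base.sorted_kv.begin(), base.sorted_kv.end());
        std::set<uint64_t> decided;  // keys already settled by a newer delta
        for (const Delta* d = chain; d != nullptr; d = d->next) {
            if (!decided.insert(d->key).second) continue;  // a newer delta already won for this key
            if (d->type == DeltaType::Insert) live[d->key] = d->value;
            else live.erase(d->key);                       // newest action for this key is a delete
        }
        return {live.begin(), live.end()};  // the temporary buffer, already in sorted key order
    }

The map here is the little bit of temporary storage he mentions, and the vector handed back is what the scan actually emits in sorted order.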
In the back. So your question is what? When you do a consolidation, or when you do a split? Your question is, when you do a split or merge, how do you decide what the size of the node should be, based on the data you need to store in it, right? How do you decide when it's the right point to do a split or merge? It's some threshold. Some threshold, and the question is whether that's based on the number of items or the number of deltas. Right, so you can say my page should store, say, 10 items. In a B+Tree, as soon as you hit 10, you have to split. Here you can be a little loosey-goosey; I'm not saying what the right policy is, but you can let it go over a bit, not by too much, and then decide you want to do your split. So you hard-code that your index should have X number of entries in a node, and those entries correspond to keys. When you go above that, one of the threads decides to do a split. That's different from the length of the delta chain, right? Your delta chain could be a million records of delete X, insert X, delete X, insert X, and that doesn't increase the number of keys you have. You don't have to, but you could, right? You could just scan and compare. Yes, like a linked list, right? Yes. And the reason we have the delta chain at all is so we avoid updating the node in place. Correct. Because in a multicore environment that causes caching and coherence and contention problems, something like that. Correct. But then the worry with the linked list is all the pointer chasing, the random memory accesses, when you walk it. It depends on how you implement it. What did I talk about today? I talked about, and I'm not pointing at the slide here, buffers that hold a bunch of tuples, right? You can build your linked list that way: spend a little extra storage overhead, leave some empty space, but then you chase fewer pointers. That's a low-level implementation detail that you should just know how to do. Any questions?
Next class, we'll do parallel sort-merge join algorithms. I like these papers because there are at least three of them, from three different groups of authors, that all say sort versus hash joins, revisited, right? So we'll cover that. I think you guys are assigned to read one of those revisited papers. And there's a large number of people who have been watching these videos and emailing me, telling me to fuck off and other things like that, so we're going to spend some time on Monday addressing the hate mail to everyone out there. Okay? All right guys, see you next class.