Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience. Okay. All right, so we have a lot to cover: hash joins. We probably could have done this in two lectures, but we'll try to cram through as much as we can. Where we left off last class, we showed how to take a query plan, identify what data it's going to access, break that data up into smaller units called morsels (that's one approach we looked at), and then have workers pull from some kind of global queue, right? And we said that the pull-based approach was going to be superior to a push-based approach for scheduling. The morsels idea in and of itself, the low-level implementation may differ from one system to the next, but the idea that I'm going to break up large chunks of data into smaller parts is not unique to the morsels paper; that's an old idea in parallel database systems. And as I said at the end, although we ran out of time, basically all the papers we discussed last class were about single-node database systems, but when you go distributed, it's more or less the same thing, right? Just now you may need to account for network latency between nodes rather than assuming everything's on the same box. And whether you want your scheduler to plan individual tasks for cores in each single node, or use a sort of hierarchical approach where you say, here's a bunch of tasks I want to run, send them to a node, and the node decides how to divide them up across different workers, right? There are pros and cons to both approaches, but we're not going to cover those. The one thing we ran out of time for that I want to briefly discuss, because it's going to come up when we start talking about real-world implementations of systems, is the notion of dynamic scaling versus work stealing: ways to allow the system to rebalance itself and improve performance rather than getting stuck behind stragglers. And these are not mutually exclusive; Snowflake is listed on both of these, Snowflake will do both. If it's a cloud-based system, you can do dynamic scaling because you have additional resources available. If you're stuck on a single-node box, the way Hyper was, you cannot. Right, so dynamic scaling basically says you recognize, before a query starts running, that I may have more tasks than I have workers for. And I know that can make things slower for me, so maybe I want to temporarily include some additional resources, scale out horizontally, add more workers to process that query so that it runs more quickly. We'll see this in Snowflake: they have something called flexible compute, where they basically have a side cluster of additional workers available to all customers, and you can occasionally borrow some without paying extra and make your queries run a little bit faster. Again, it's hard to do that on-prem, because you provisioned the hardware ahead of time, but in the cloud everything's elastic, and that makes it easier. And then work stealing, again, we covered last time: basically you allow a worker to take work from another peer.
And then whether the worker that's stealing, the stealer, should go move data from the stealee, the worker they're stealing from, or go back to the distributed storage, S3 or whatever: again, that depends on the implementation. Snowflake is always going to go back to remote storage, because they don't want to slow down the worker that's the straggler. But in the case of Hyper, we saw that it'll go get the data directly from the other worker's CPU, because in that case everything was in memory, right? Again, these are design decisions we can account for when we build a larger system. All right, so today I want to focus on joins, right? Because joins are, if not the most important, then one of the most important operators you would have in a relational database system. And we're really going to be focused on how we do this in parallel. And again, we're going to focus on a single-node system, because we're going to assume that something else above us has already moved the data to where we need it to be, right? And that's not interesting; that just says, okay, this needs to go here, and then something moves it. What we really care about is, once we get all the data on a single node, assuming everything fits in memory, how can we run the join as fast as possible? All right, so first we'll talk about the background of what the different parallel join algorithms, the high-performance join algorithms, have looked like over the last 30, 40, 50 years. Then we'll talk about the basic building blocks to do a parallel hash join. Then we'll talk about different hash functions and the hashing schemes. And then a high-level overview of the evaluation from the paper you guys read, which I understand was a bit inscrutable for some of you because there's some background material you may not have. I'll try to cover that as we go along, okay? All right, so parallel join algorithms. Again, the basic idea is that we want to take two relations and join them together across multiple workers at the same time to speed things up, right? In the intro class, we didn't discuss threads or workers when we did joins. We just said, here's the join algorithm, and then we computed its complexity based on the number of pages it had to read and write from disk. But now we assume everything resides in memory, and we have additional cores, right? Because this is 15-721, not 15-445/645. We have all these additional resources available to us. How can we get this to run as fast as possible? So again, we'll be focused on binary joins: we're just taking two relations. In the next class, we'll talk about multi-way joins, joining three or more tables, and the idea there is, can you join multiple tables at exactly the same time? We'll see an algorithm to do that. Most of the time, though, for most queries, a binary join is going to be faster than a multi-way join. But for graph workloads, the multi-way join will be better. So the two main approaches we're going to care about are hash join and sort-merge join. I'm not teaching sort-merge join this year just because most systems are not going to implement it.
Nine times out of 10, or 99 times out of 100, you're going to want the hash join; that's almost always going to be faster. If the data is already sorted the way you want on the join key, then yeah, you could do a sort-merge join. And the paper you read talked about some ways to speed that up using SIMD and other things, but for now, I don't think it's necessary to know that. I'd rather focus on how we can do hash join faster. We're also not going to talk about nested-loop joins, because that is almost always the worst possible thing to do in an OLAP system. You would only do it if you know the table has like 10 tuples, right? So again, hash join is almost always going to be preferable to everything else. The goal of today is, again, how to maximize parallelism, and part of that is being aware of what the CPU wants from us, being aware of what the threads are reading from and writing to, to avoid synchronization costs, and understanding the penalties of reading data that's in a different NUMA region or maybe that's unaligned. Now, it didn't always used to be that hash join was considered superior to everything else. This is a classic debate in databases, where it's gone back and forth between whether sorting or hashing is faster for joins. Back in the 1970s, the first database systems were running on very primitive early computers, and they handled tables that were larger than memory. The Grace hash join, the way to spill joins to disk, hadn't been invented yet. But they had an external merge sort algorithm. So they said, okay, if I need to join two tables that are really big, bigger than the amount of memory I have (and again, think megabytes, or kilobytes in some cases, not terabytes of memory), they had a way to do external merge sort: write the sorted data out to disk, bring it back in as partitions or chunks, and then do the merge pass to join them. In the 1980s, the hardware got better, and there was this project out of Japan called the Grace database machine. They invented the Grace hash join, which is a precursor to the partitioned join we're talking about today, and they had the ability to spill the buckets to disk and recursively break things down into smaller chunks that could fit in memory. There was a larger movement called database machines, which the Grace project was part of: specialized hardware with custom silicon just to do things like hash joins, right? So they had ways to speed up hash joins back in the day. In the 1990s, a paper came out from Goetz Graefe, the guy that invented Volcano, the Volcano model, the iterator model, the guy that invented the exchange operator, the guy that invented Cascades, which we'll cover later on. He had a paper that basically said these two algorithms were equivalent, given the hardware that was available at the time. But since the turn of the century, hashing has been shown to be preferable. And so the question now is always: is it better to partition or not? In the paper you guys read, I think they show partitioning is going to be faster. There's another paper from other Germans, the Hyper Germans, the Umbra Germans, that said yes, partitioning is faster, but it's really hard to get right, so most of the time you're better off not partitioning, because the non-partitioned approach is good enough for most things, right?
So this is what we're going to focus on today. We're going to ignore sort-merge join, and we're really going to discuss the partitioned versus non-partitioned approaches and the modern variants of these things. For sort-merge join, the M-way sort-merge join paper that the paper you guys read cites is from Intel and Oracle, from 2009. They basically said, hey, sort-merge will be better than hash join once you have SIMD registers that are 512 bits wide. Because again, this was 2009, before AVX-512 came along. Now AVX-512 is around, and hashing, I think, is still considered preferable. Again, if your data is already sorted on the join key, then yeah, sort-merge may beat hash join. But most of the time it's not. Especially in the lakehouse or data lake environments we're talking about, where someone's just writing out random Parquet files and it's your job to join them together: things are almost never going to be sorted on the join key. They're never going to be partitioned on the join key either. All right, so that was 2009: hashing was faster on the hardware of the day, but with 512-bit SIMD, sorting could get better. Then there was a paper from Wisconsin, from Jignesh and his student, that showed the trade-offs between the partitioned hash join and the non-partitioned hash join, and in their evaluation I think they said the partitioned one was actually better. The Hyper guys came out with a paper the next year, in 2012, that said, well, it turns out sort-merge join, if you do it our way, is even faster than the hash join from the Wisconsin people, and without requiring AVX-512, right? Just using the hardware available at the time with Hyper, they could do sort-merge join faster. But then they came back a year later and said, ignore what we said: you really want to use hashing, and here's the way we did it, and look how much faster it is. Then another paper around the same time, in 2013, from other Germans, I guess Swiss Germans, said, here are additional optimizations you can do, and here's how to make the radix hash join, which we'll cover today, go faster. Then there's the paper you guys read, from different Germans at Saarland. They basically said, everybody's showing these different numbers and different results, and it's really hard to make an apples-to-apples comparison between the approaches, because they're measuring different things: different implementations running on different hardware, different workloads, different everything. The idea was to have a single paper look at all of them, right? But again, this is a test-bed system. They do materialize the tuples as a real engine would, but it's not a complete system the way Hyper or Umbra is. And then lastly, the most recent paper in this space that really matters, which I didn't have you guys read, is from the Umbra guys, which again are the same Germans as the Hyper Germans. They basically said, the radix hash join is faster even in a full system like Umbra, but the challenge is knowing when you actually want to use it, because there are all these different aspects of the workload and the data and the hardware that you have to account for when deciding how to use the radix hash join correctly, right?
And most of the real systems out there today, as far as I know, are not going to implement the radix hash join or the partitioned hash join; they implement the non-partitioned hash join. They also usually don't implement multiple hash table implementations. ClickHouse is the exception; we'll cover that in a second. So everyone sort of picks one, and it's good enough. And in the paper you guys read from the Saarland Germans, they basically said, the hash join is important, but it's not where you're spending all your time in queries, so you're probably better off optimizing other things. Which is what we've been saying throughout the semester: the current state of database systems research is that there's not one thing you can point to and say, this is the biggest problem, we've got to fix this. It's a combination of a bunch of things. So yes, if you have a really crappy join algorithm, then that's going to be a dominating cost in your system. But once you start implementing the basic optimizations we'll talk about today, separate from the partitioned versus non-partitioned question, you're not going to shave off many more cycles or much more time in queries by trying to make your hash join even faster. My previous PhD student Prashanth, who's now at Databricks working on Photon, spent about half a year looking at some of these earlier works, trying to make their hash joins faster, and we were literally counting cycles per tuple. He got it down from 12 cycles per tuple to 11 cycles per tuple. In the end, it didn't matter, and we never published anything about it. Anyway, we'll cover the different approaches as we go along. But the important thing to discuss now is how we want to design our algorithm and what the overarching theme is for how we make decisions about certain trade-offs in our implementation. This comes down to whether our implementation is going to be considered hardware conscious or hardware oblivious. I think the paper covered these terms. What does hardware oblivious mean? What does oblivious mean? You don't know. Yeah, so it means the algorithm doesn't know, doesn't care, what the cache size is, what the number of threads is, what the TLB size is. It's just: here's a standard algorithm; you try to do the best you can without tuning specific things to the actual hardware. Hardware conscious is the opposite. It says, okay, I'm going to look at all the low-level specifications of the hardware I'm running on, like the cache sizes and the NUMA regions and so forth, and then have my algorithm make decisions about how to divide things up and move things around based on that. Right, there are trade-offs to each of these. Obviously, if I'm hardware oblivious, then I write it once and that's good enough everywhere. Hardware conscious means I have specific optimizations that may not work anymore as hardware evolves over time. Or if something in Xeons or x86 is completely different from something in ARM, then my algorithm will not scale on the other architecture. All right, so within this, we want to have two goals for how we implement our algorithms. The first is that we obviously want to minimize synchronization.
This means that, since we're talking about a parallel join algorithm, we have multiple threads or multiple workers running at the same time, and we want to reduce the amount of communication between those threads so that one thread can continue working and not get blocked on another thread filling up some buffer or hash table, right? Again, we want our threads to run at full speed. This doesn't mean we have to make our implementation lock-free or latch-free; it just means we should be careful about when we actually acquire latches, so that it's not everyone trying to clobber each other all the time. The second goal is to minimize the cost of accessing memory. This means we want the data that any thread is operating on to be local to it. Ideally in its caches: if not in L1 or L2, then L3. L3 is shared across different cores, at least on Xeons, within a single socket. Or, if I can't have things in my last-level cache, I want them in my local memory, because I want to avoid going over the interconnect across different NUMA regions. So the way we'll design our algorithms is very similar to how we designed algorithms in the intro class, where we said, whenever we bring something in from disk into memory, we want to do as much work as we can on that piece of data before we throw it away, because we don't want to fetch it from disk again. Or it's the same thing we saw with Hyper, where they wanted to do as much work as they could on a single tuple sitting in a CPU register, going up the pipeline, before they went back and got another tuple. Because, again, the cost of getting something from memory through the CPU cache into a register was relatively expensive, right? So this idea shouldn't be foreign to us. All right, so let's focus on how we do this. What's going to make our query go slow? Cache misses. If we just fill up our cache with a bunch of crap, a bunch of data, then we're not going to have any space to store new things. But another challenge is that we actually have to consider the TLB on the CPU. If my thread is addressing a bunch of different cache lines scattered across memory, then I'm polluting my cache with all these random cache lines, and when I do a lookup for some other cache line, not only is it not going to be in my CPU cache, because the CPU evicted it out to memory, but its entry is also not going to be in my TLB. So I'm actually going to pay two misses to do one single lookup. So I want to be careful about how I bring in my data and make sure, again, that I reuse as much of it as I can before I move on to the next thing. Basically, I want to avoid doing completely random access. Even though it's memory, and we said memory supports random access, we still want to be careful about what we're accessing to avoid that issue. Again, it's just like a disk-based system in the intro class, only now at an even smaller scale. But it's still going to matter a lot. All right, so I've already said this: we're going to avoid random access, we're going to try to scan as much as possible, and when we do have random access, we want things to be as local as possible. That way we're just hitting our local cache over and over again.
So there's going to be this trade-off, which we'll see when we talk about the radix join, between the number of instructions we have to execute to do something versus the amount of memory and the number of cycles we may consume. Again, the idea of the partitioning step is that if I pre-process my data on both the build side and the probe side of the join and put it into these nice little chunks that will only be accessed by a single thread or single worker, then the threads don't have to go over the interconnect to get random things; they can just scan sequentially through these buffers and not communicate with anybody else. So yes, I'm doing more work up front, but in the end I'll spend fewer cycles, because there won't be traffic on the hardware. All right, so as I already said, the hash join is the most important operator we're going to have in our database system. Again, we'll see at the end that it's not always going to be the most dominant cost, but nevertheless, this is what most people want to do in an OLAP system with lots of joins, so we want to make it go as fast as possible. Ideally, we want all the cores running at full capacity, 100% utilization, while we're doing our join, and to minimize all the stalls from getting things out of DRAM and into our CPU caches. So at a high level, a hash join has three steps, where the first step is optional. The first step, as I already said, is the partition phase, where you can decide to take the tuples coming into the join operator on the build side and the probe side, the inner and the outer relation, and break them up into these shards or partitions based on the hash of the join key being used in the join operator. I'll divide them up into different chunks, and then on the build side the threads can build the hash table, each worker using its disjoint subset. And on the probe side, same thing: different workers will probe the hash table using their individual subsets. Then in the build phase, as I said, we actually build the hash table. Here we're going to assume we have a single logical hash table. In some cases you can have multiple physical ones that still sit behind a single interface; we'll cover that in a second. But it's a single hash table, and that prevents me from having any false negatives, where I do a probe for something that should exist, but because it's in a different hash table or a different location, I end up incorrectly missing it. And then there's the probe phase. Again, you do a lookup; if you find a match (we'll talk about how to find a match in a second), then you take the tuple from the outer relation and the tuple from the inner relation, match the two together, and shove that up the pipeline as the output. So the Saarland paper you guys read, from these Germans, correctly makes a big deal that this materialization cost, this last step here, actually matters a lot, and a lot of the earlier papers didn't do any of it; they just measured how fast you can do the build and probe and partition. The reason this matters is that this is like a memcpy: you're taking two pieces of data, mashing them together, and producing the output.
And that's additional pressure on the CPU cache, which, if you're not doing it, may give you incorrect readings or the wrong perception about the speed of the other parts. So you always want to do this, because that's what a real system would have to do anyway. I didn't have you read the paper from the Columbia guys on the SIMD stuff, but when they were showing how to do vectorized hash joins or hash lookups and so forth, they would literally find a match and then throw it away, because they were trying to keep everything in the L3 cache; otherwise SIMD didn't make a difference. This shows that in a full system you'd have to do the materialization, and again, the Saarland paper includes it in the measurements. We'll discuss the pros and cons of early versus late materialization later on. All right, so let's go through each of these three phases. We'll spend a lot of time on the partition phase, because I think there was some confusion about it from the paper, and I think it's good to see how these systems actually do it, even though when you build a system you may actually not want to do this partitioning stuff. It's good to understand what's going on, because some of the trade-offs and design decisions they make for implementing this are useful for other parts of the system, like being aware of the caches and so forth. All right, so in the partition phase, you take the input relations, both the outer and the inner, and put them into partition buffers based on the hash of the join key. The idea is that you take these buffers, these partitions, and redistribute them across the different cores, and then when you go into the build phase and the probe phase, the workers are assigned those partitions and only have to read data within them. Now you're just doing a sequential scan on that buffer. And the goal, especially on a NUMA architecture, is that the extra instructions you spend on the partitioning step are overcome by the better locality of the data at each thread. That's going to be faster than blindly having every thread read different parts of memory. In some cases, the data will already be partitioned for you. This is rare, but if the data is already partitioned on the join key, then you don't have to do this extra step; you literally can say, okay, the first thousand tuples go here, the next thousand tuples go there, right? But that doesn't always happen. Again, this idea comes from the Grace hash join. We saw it in the intro class, but that was spilling the buckets to disk; now let's see how we do it in memory. All right, so there are two high-level approaches to doing this: the non-blocking approach and the blocking approach. In the non-blocking approach, the idea is that we just have the threads scan the data at the same time and populate the shared partitions as they go. Without doing anything sophisticated to split things up, we're literally just letting them write out to these buffers. Any thread can write to any buffer, and therefore we have to use latches to synchronize, to make sure they don't clobber each other or cause problems in our data structure, right?
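To make this concrete, here's a minimal C++ sketch of that shared-partition idea, where every thread appends into global partitions under per-partition latches. Everything here (the Tuple layout, the multiplicative hash, using std::mutex as the latch) is an illustrative assumption, not any particular system's code:

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

// A toy hash; real systems would use something like CRC32 or xxHash.
inline uint64_t hash64(uint64_t k) { return k * 0x9E3779B97F4A7C15ull; }

// One shared global partition: an append buffer plus the latch that guards it.
struct SharedPartition {
    std::mutex latch;
    std::vector<Tuple> buffer;
};

// Every worker thread calls this concurrently on its own morsel of the input.
void partition_morsel(const std::vector<Tuple>& morsel,
                      std::vector<SharedPartition>& parts) {
    for (const Tuple& t : morsel) {
        size_t p = hash64(t.key) % parts.size();           // which partition?
        std::lock_guard<std::mutex> guard(parts[p].latch); // synchronize writers
        parts[p].buffer.push_back(t);                      // append under the latch
    }
}
```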
And in the blocking approach, the radix approach, we're going to potentially scan the input multiple times, but then we're going to be clever about where we write data into the partitions based on the radix, to avoid having to do any synchronization across the different threads. So this is the more sophisticated one, but it requires a bit more pre-processing than the non-blocking one at the top. There are actually two variations of the non-blocking one, so we'll cover both of them. So again, in non-blocking partitioning, all the threads run at the same time and just generate our partitions. The question is: do we have the threads write to single global shared partitions, which we have to protect using latches? Or do we have them write to private partitions, think of it almost like thread-local storage, so there's no synchronization, but then I've got to do another pass at the end where a thread puts them into the shared, global partitions? Yes? In non-blocking, in order to have private partitions, wouldn't you need something like a radix to actually partition them? Your question is, for non-blocking, how would you even do this without radix partitioning? Yeah. Next slide, literally: say I have ten partitions; every core is going to have ten mini partitions, right? And then something at the end just puts them all together. Radix partitioning is being clever by having one buffer space where I know exactly what offset I want to write into. Let's go through the example, and if it's still not clear, we can come back. All right, so here's our data table. Say we do the really simple thing at the very beginning of splitting it up into row groups or morsels, right? And say this column B is our join key. So we're going to hash this join key, and that's going to determine which of my N partitions I want to write into. So literally, think of this as every single core writing to any partition at once. Think of these as just linked lists of buffers, almost like a chained hash table. And so, to prevent the threads from overwriting each other or corrupting the data structure, I've got to take latches on the buckets whenever I write into them, because I don't know whether another thread is writing at the same time, right? But at the end, when I'm done populating these different partitions, I don't need to do any cleanup or consolidation, because now core one can take this partition, core two can take that partition. Everything's all cleanly divided up. But I pay this penalty of coordinating, of using synchronization, to make sure that threads don't kill each other when they're writing to the partitions during the population step. So those are the shared partitions. The private partitions are where every core itself has N partitions, where N is the total number I'm going to have at the very end. So say I want ten partitions that I'm going to hand off to ten cores. Within each core now, in this first step here, I'm going to have ten mini partitions. And each core can write into those without anybody clobbering them, so I don't need to take any latches, because I know everything's single-threaded, and it'll be super fast.
But now the downside is I have to consolidate: for partition one, across my different cores, I want to put all the mini partitions into a single partition one. So I have to do another pass, where each core, say one core, is in charge of taking the data across the mini partitions and putting it together into a global one. All right, so what's the downside of this? The last stage, yes. So in this case here, when I'm doing the initial pass, I have this chunk of data on this core, in my L3 cache. I don't need to go across any NUMA region to access anything else, because I don't care about this other data here. And likewise, when I write it out, assuming the partitions fit in my L3 cache, I'm just writing to my local CPU cache. But now, when I want to do the consolidation step, this core is in some other NUMA region, and it's got to go touch all the partition ones across all the other cores, which may be in different NUMA regions, and then write them back into its own NUMA region. That's expensive. All right, and this is actually a good example of the pros and cons between late materialization and early materialization: if I'm doing late materialization, then the data I'm actually moving around here could just be the join key and a tuple ID, right? It could be pretty small. But if I'm doing early materialization and I have like 20 columns, then this partitioning step is potentially moving 20 columns around, plus going across NUMA regions, which is always going to be bad. It doesn't require locks, so it's going to be better? It doesn't require latches, so it's potentially better. It depends, right? If I have multiple NUMA regions, then maybe this is bad, all right? So this gets into the hardware oblivious versus hardware conscious question. In order to decide whether I want to do this or not, I have to go look; maybe run microbenchmarks when the system boots up to figure out what the hardware is capable of. And even then, that might not work out when I actually run in real time, because if the data is heavily skewed, then one set of partitions might be super big versus another. Yes? On the previous slide, with the shared partitions, not the local partitions, even then you might have to access data across NUMA regions, right? Correct, same as in this one here; actually in that one, you definitely would, right? Are the cores potentially going to have to access data that's across NUMA regions? Yes: when a core writes into these partitions, the partitions are not going to fit in the L3 cache, and they could be in different NUMA regions. So is the local partitioning, meaning the private versus the shared, definitely better? Again, it depends. Yes, your statement is that it would be better for each core to have its local partitions in its cache, which they do, yes. Like this thing here: again, say ten partitions; this core is going to have ten mini partitions, and that's all ideally in its L3 cache. This last step here, wouldn't it be better for this to be... where?
In the last step I have to consolidate. So this core is going to be responsible for partition one: it goes to the mini partition ones at each core, copies the data, and puts it into the final buffer. And because you want to write to memory that's local to you, partition one is going to end up in CPU one's caches or local memory. The question is, is there any contention when you're trying to join them together? Contention from what, though? Because CPU one is the only one reading mini partition one, right? So there's no coordination, no synchronization needed; it's all local. Yes? The question is, is the number of partitions always equal to the number of cores? No. How do you determine the number of partitions you want? It depends on the skew of the data, right? For simplicity, you could just say yes, the number of cores. But then what happens when, again, this data is super heavily skewed, and there are a billion keys in partition one and a thousand keys in all the other ones? We're going to partition that one again, right? So you can end up with more partitions than cores. Yes? Is this last step done with SIMD? Are we taking these mini partitions and using SIMD to move them? The question is, is this last step done with SIMD? It's just memcpy-ing, so SIMD wouldn't really help; there's no additional processing you'd want to do, you're literally just copying. The next question is, why do we have to do this last step? Why can't I, when I build the hash table, just jump around the mini partitions? Well, two things. You're doing this on both the build side and the probe side, right? So if you have a thread scanning through to do the build or the probe, now you're doing the jumping all over again. You're staging things so that when you then do the build and probe, there are fewer stalls as you scan along. You stall because of the remote memory access during the partition phase, but then you don't have any stalls on the build or probe side. If we have a larger number of partitions, and hence less contention on the latches, could we just let each core push its tuples into the global partitions directly, rather than having one core handle one set of mini partitions? So David is asking: instead of having every core be responsible for some subset of the mini partitions and consolidating them, could we go back to the shared global partition model, because if we have a large number of partitions there will be less contention? All right, so basically, you just have a ton of partitions, and then the likelihood that any two threads are writing to the same thing at the same time goes down. Yep. And then you end up getting a significant advantage. I'd push back on whether it's significant, but yes, if you increase N significantly, then the likelihood that two threads contend on the same latch goes down. Yes. And the downside of that would be... what? Yeah, it depends on how big you're allocating these buckets: they could be underutilized, and then you're just wasting memory, right?
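And here's the matching sketch of the private-partition variant: a latch-free scatter pass into thread-local mini partitions, followed by the consolidation pass where the core responsible for partition pid gathers it from every thread. Again, the types and the hash function are placeholders for illustration:

```cpp
#include <cstdint>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };
inline uint64_t hash64(uint64_t k) { return k * 0x9E3779B97F4A7C15ull; }

// Pass 1: each thread scatters its morsel into its own mini partitions.
// No latches needed, because no other thread ever touches these buffers.
void partition_private(const std::vector<Tuple>& morsel,
                       std::vector<std::vector<Tuple>>& mine) {  // mine.size() == N partitions
    for (const Tuple& t : morsel)
        mine[hash64(t.key) % mine.size()].push_back(t);
}

// Pass 2: the thread that owns global partition `pid` copies the pid-th mini
// partition from every thread. This is the extra memcpy-like pass, and it is
// where the cross-NUMA traffic shows up.
void consolidate(size_t pid,
                 const std::vector<std::vector<std::vector<Tuple>>>& all,  // [thread][partition]
                 std::vector<Tuple>& global_part) {
    for (const auto& thread_minis : all)
        global_part.insert(global_part.end(),
                           thread_minis[pid].begin(), thread_minis[pid].end());
}
```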
So let's go through the radix partitioning now. Again, the idea is the same, but the goal is that we only materialize the results once, whereas the private-partition approach had to materialize them twice. Unless you have to do recursive partitioning, but you may have to do that for any of these. So when anybody says they do a radix hash join, or sometimes it's called a radix join, it's going to be this approach here. All right, so the idea is that we're going to scan through the input relation multiple times. The first time, we just gather information about what the data looks like, and then we use that to determine where we want to write data into our output buffers when it's time to actually do the partitioning. So in the first pass, we compute a histogram of the number of tuples that have each radix value (I'll explain what the radix is in a second, right?). Then we compute the prefix sum, a running summation over this histogram, and that determines where each partition starts within our output buffer. Then we scan R again and write the data out into these buffers based on that partition key. This is just text, so I'll go through an example, but first we've got to understand what the radix is and what the prefix sum does, and then that'll tell us how we do the last step, okay? All right, so the radix is literally just a digit within a number. You take the hash of the key you're trying to join on, and say the hash ends up being 19, right? Then the radix at the ones position is just that digit: nine. You can extract this through bit shifting and masking, and that lets you pull out just the one digit you want, right? (And you could take the other digit too.) So what do we do with this? We can use it to maintain a histogram that says, for every radix value, how many entries do we see? We end up with a list like this: okay, for zero there are two entries, for one there are three entries, and for two there's one entry. And the idea is that we're going to use the radix to determine which partition number a tuple goes into, and then we're going to use this histogram, turned into the prefix sum on the next slide, to say, okay, within this giant buffer of partitioned data, where does that radix's range start? So the prefix sum is, again, just a rolling summation. Say this is our input: one through six. The prefix sum in the first position is just one, because you start with zero: one plus zero equals one. Then you take the sum from the previous addition, take the next number in the input sequence, and add the two together, and that's the next value. So one plus two equals three. I keep doing this down the line, and now for every position within my input sequence, I've computed the prefix sum. Right?
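As a quick sketch of those two building blocks, here's the radix extraction (just a shift and a mask) and a scalar prefix sum over the histogram. One thing to note: partition start offsets want the exclusive prefix sum; the inclusive one walked through on the slide (1, 3, 6, 10, ...) tells you where each partition ends instead. The helper names here are assumptions:

```cpp
#include <cstdint>
#include <vector>

// Pull `bits` bits of the hash out, starting at bit `shift`; that value is
// the "radix", i.e. the partition number for this tuple.
inline uint32_t radix(uint64_t hash, unsigned shift, unsigned bits) {
    return uint32_t((hash >> shift) & ((1u << bits) - 1));
}

// Exclusive prefix sum over the histogram: offsets[i] is where partition i
// starts in the output buffer.
std::vector<size_t> prefix_sum(const std::vector<size_t>& histogram) {
    std::vector<size_t> offsets(histogram.size());
    size_t running = 0;
    for (size_t i = 0; i < histogram.size(); i++) {
        offsets[i] = running;     // start offset for partition i
        running += histogram[i];
    }
    return offsets;
}
```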
And again, we're going to do this after we compute the radix histogram: send it to all our threads, and now they know, okay, this position in my prefix sum corresponds to this radix, and therefore I know what offset to write into. Think of these as just memory offsets into our partition buffer. All right? And the reason we do this is so that we don't have to synchronize. We do the first pass, compute the prefix sum for all the radixes, hand that out to all the threads, and then when they do the partitioning, they don't need to synchronize at all, the way we did with the latches before. Now, there is a way to compute this prefix sum efficiently with SIMD. There's a paper from Guy Blelloch, who's here in the computer science department, from like 1994, '95, where he was envisioning, hey look, here's how to do a vectorized prefix sum if you had SIMD instructions to do it, which at the time you mostly didn't, but now of course all of that exists. Just to tell you how awesome Guy is: he's been thinking about this problem for like 30 years. All right? Question, yes? Next slide. The question is, I don't see how the prefix sum is going to help us. We're going to use the prefix sum to tell us what the offset is when we write tuples for a given radix. Next slide. All right, so say this is our input data and we've already hashed it, right? These values here are the hash values. In the first pass of the algorithm, we split the input up like we did before, into morsels; let's say we just have two CPUs here. Then we take the first radix of each hash value and use it to determine where we want to write the data that corresponds to that record. To do this, we compute the prefix sum over this input sequence, going like this. And now that tells us, okay, for partition zero, we want two slots. Look at CPU zero's chunk: we have four values whose radixes are zero, one, one, zero, so there are two entries with radix zero. So we know that for partition zero, according to the first radix, there are two elements from this chunk, and when we write into this giant output buffer, we can write them starting at the first position. For partition one, there are two more elements, but it's zero plus two, so we start writing at the third position in our giant output buffer, right? Like this. So partition zero at CPU zero can write at the first position, partition zero at CPU one can then write right after it, and so forth. So again, the threads don't need to coordinate to do this. They know they're writing into a memory location in the partition buffer that nobody else is going to be writing into, so they can just rip through it and run as fast as possible, right? Then we just scan through the data and write it out like that. Now it may be the case, again, that this is horribly imbalanced, where one partition is huge. We can just recursively do this again, by looking at the next radix position and doing another round, and that'll subdivide things even further, right? Does that answer your question? Okay. Okay, so a vanilla, naive implementation of these different approaches is going to be super slow, because we're reading and writing random locations in memory and we're polluting our CPU caches as we go along.
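Putting those pieces together, here's a minimal single-threaded sketch of the two-pass radix partition; the Tuple type and the multiplicative hash are placeholders, not any paper's exact code:

```cpp
#include <cstdint>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };
inline uint64_t hash64(uint64_t k) { return k * 0x9E3779B97F4A7C15ull; }

// Two-pass radix partition on the low `bits` bits of the hash. Returns the
// reordered tuples; starts[i] says where partition i begins in the output.
std::vector<Tuple> radix_partition(const std::vector<Tuple>& input,
                                   unsigned bits,
                                   std::vector<size_t>& starts) {
    const size_t nparts = size_t(1) << bits;
    const uint64_t mask = nparts - 1;

    // Pass 1: histogram of radix values.
    std::vector<size_t> hist(nparts, 0);
    for (const Tuple& t : input)
        hist[hash64(t.key) & mask]++;

    // Exclusive prefix sum -> starting write offset for each partition.
    starts.assign(nparts, 0);
    for (size_t i = 0, run = 0; i < nparts; i++) {
        starts[i] = run;
        run += hist[i];
    }

    // Pass 2: scatter every tuple into its reserved slot. Each cursor range
    // is owned exclusively, so no latches are needed.
    std::vector<Tuple> out(input.size());
    std::vector<size_t> cursor = starts;
    for (const Tuple& t : input)
        out[cursor[hash64(t.key) & mask]++] = t;
    return out;
}
```

In the parallel version, each thread computes a histogram over its own morsel, the prefix sum runs over (partition, thread) pairs, and every thread gets an exclusive cursor range inside each partition, so the scatter pass still needs no synchronization.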
So in the paper you guys read, there are some optimizations you can do to improve this. The first is using what I'll call software write-combine buffers, or write buffers. The idea here is that instead of writing data out to this remote location and that remote location as I scan along in the partitioning step, I write to something almost like the private buffers from before, but much smaller. I write a little bit into that private buffer, and when that buffer fills up, then I write it out to the remote location. And going back to the slide here: since I know that nobody else is going to be writing into the range I'm supposed to write into, there won't be any issues if I delay a little bit and then write things out in a batch, right? That'll help things out. Obviously, being NUMA-aware in all of this matters a lot too, but the other trick we can use is called streaming writes, or non-temporal streaming writes. These are special instructions on the CPU that allow you to write to a memory location without the CPU pulling it into the CPU caches first. You're basically bypassing the CPU cache: I want to write to memory location XYZ, don't put it in my CPU cache, which the CPU normally would do by reading the line in so I could overwrite it; just write it, almost like direct memory access, straight to memory. Yes? The last bullet point says without a separate write pass at the end, but there is a separate write pass, right? Because you're buffering everything in the software write-combine buffer. But in the private-partition approach, you wrote to the private buffer and then you wrote to the global buffers, right? With the radix partitioning, I don't need to do that: I do one pass, I know exactly where the data needs to go, and it's one write into that location. So there's no extra step to consolidate into a global buffer, right? Anyway, so there are special instructions for this. I know SIMD can do it; you can take things out of a SIMD register and write them straight to memory, but I think you can do it from regular CPU registers as well, right? So the combination of these things is how you're going to get this partitioning step to work well on modern hardware.
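Here's a rough x86 sketch of those two tricks combined: a per-partition software write-combine buffer that flushes one full cache line at a time with non-temporal stores. The tuple layout, buffer size, and alignment handling are assumptions; the _mm_stream_si128 and _mm_sfence intrinsics are the real mechanism:

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence

struct Tuple { uint64_t key; uint64_t payload; };  // 16 bytes

constexpr size_t TUPLES_PER_LINE = 4;  // 4 x 16 B = one 64 B cache line

// One little software write-combine buffer per partition, kept hot in L1.
struct alignas(64) WCBuffer {
    Tuple slots[TUPLES_PER_LINE];
    size_t count = 0;
};

// Flush one full cache line to the partition with non-temporal stores that
// bypass the CPU caches. Assumes `dst` is 64-byte aligned (real
// implementations pad partition start offsets to cache-line boundaries).
inline void flush_line(Tuple* dst, const WCBuffer& buf) {
    const __m128i* src = reinterpret_cast<const __m128i*>(buf.slots);
    __m128i* out = reinterpret_cast<__m128i*>(dst);
    for (size_t i = 0; i < TUPLES_PER_LINE; i++)
        _mm_stream_si128(out + i, _mm_load_si128(src + i));
}

// Append a tuple for one partition; the (possibly remote) partition memory
// is only touched once a full cache line has accumulated locally.
inline void append(Tuple t, WCBuffer& buf, Tuple* part_base, size_t& cursor) {
    buf.slots[buf.count++] = t;
    if (buf.count == TUPLES_PER_LINE) {
        flush_line(part_base + cursor, buf);
        cursor += TUPLES_PER_LINE;
        buf.count = 0;
    }
}

// After the scan: copy any partially filled buffers out normally, then call
// _mm_sfence() so the streaming stores become visible to other threads.
```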
All right, so now we've partitioned, right? Or, if we chose to partition, we have a bunch of data in these nice little partitions assigned to different cores; otherwise we're operating directly on the input data itself in the build phase. The idea is that we take all the data coming in from the build relation, in this case R. The input data has to have the columns we want to join on; we're going to hash them and then insert them into a hash table. And once that's done, we switch over to the probe phase, right? So now we need to discuss how we're going to organize our hash table in the build phase so it runs as fast as possible. When someone says they have a hash table, they really mean they have two things. They have a hash function: something that takes arbitrary bytes of some value and maps them to an integer value in a smaller domain, typically 32 or 64 bits, right? And then they store things in some kind of data structure, which we'll cover in a few more slides. But within that data structure, we also need a way to handle collisions, because the hash function isn't going to guarantee that the hash values for two unique tuples are unique. So we need to figure out what we do when two keys try to go to the same location, right? Again, we're not an algorithms class, so we're not going to go into too much detail about the complexity of the hash tables. For the hash function, we're typically going to use something off the shelf; most systems do, some write their own, and we'll talk about that in a second. But the hash table is where we probably want to spend most of our time making sure we get it right, because we want it to be as fast as possible. So for the hash function, there's a trade-off between speed and collision rate. The fastest hash function you can have is literally just: return one. No matter what key I give you, you always get back one. What is that? That's copying from one register to another register; in some cases the CPU can just inline it, right? But it's going to be terrible for collisions, because no matter what key I give you, it lands in the same location. On the other end of the spectrum, you have perfect hashing, which is a way to guarantee that for any possible key, I get a unique hash value. A true perfect hash function only exists in the literature and theoretical papers. Jignesh and his PhD students at Wisconsin are working on a sort of practical implementation of this, which we won't cover. Again, it's not truly perfect, because you have to handle the corner cases, but for most of the keys it's good enough. Most systems aren't going to do what Jignesh is doing with his perfect hash function; they're going to use something off the shelf. So to figure out what hash function you want to use, again, without implementing things ourselves, there's this GitHub repo for a thing called SMHasher. In the same way that I'm obsessed with databases, this guy's obsessed with hash functions. He has a benchmark he runs for every hash function that exists, and he keeps track of the collision rate and the throughput of each of them. I think xxHash from Facebook and MurmurHash usually rank pretty high. What's that? Yeah, return one is like a no-op, yeah. So here's a smattering of the ones you'd probably want to consider for your database system. The easiest and fastest one is also one of the oldest: CRC, from the 1970s. The reason we'd consider it in a database system today is that there are actually CPU instructions to compute it very efficiently. Some systems like ClickHouse, and Hyper I think, use CRC32 for integers, because it compiles down to essentially a single instruction, and the collision rate is good enough.
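For instance, here's one way you might build a 64-bit hash out of the hardware CRC32C instruction; the seed constants and the two-CRC trick are illustrative, not ClickHouse's or Hyper's exact formula:

```cpp
#include <cstdint>
#include <nmmintrin.h>  // SSE4.2: _mm_crc32_u64 maps to the crc32 instruction

// Hash a 64-bit key with the hardware CRC32C instruction. One CRC only
// yields 32 bits, so a common trick is to compute two CRCs with different
// seeds and glue them together. Compile with -msse4.2 (or -march=native).
inline uint64_t crc_hash64(uint64_t key) {
    uint64_t lo = _mm_crc32_u64(0x8648DBDBull, key);
    uint64_t hi = _mm_crc32_u64(0x2545F491ull, key);
    return (hi << 32) | (lo & 0xFFFFFFFFull);
}
```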
MurmurHash was one of the first modern hash functions; it started the new era in 2008. Actually, I think it was invented by the same guy that runs the SMHasher repo. As far as I know, he's some random dude who put this thing on GitHub, put it on the internet, and people just started using it. I think it's up to MurmurHash3 now. And in one of the papers, they talk about how you can use MurmurHash2 with SIMD to do vectorized lookups in a hash table. So over the years it's gotten better. Google forked MurmurHash and created CityHash, and then they had a follow-up called FarmHash, which I think has better collision rates for longer strings. And then Facebook has this thing called xxHash, which is up to version three, and the performance and collision rate for xxHash are pretty good. So MurmurHash, xxHash, and maybe CRC32 for integers are probably what you'd want to use. Yes? It sounds like for any given algorithm, you might be able to engineer data that defeats it. So when you say better collision rates, do you mean better on average, or for a given workload? Yes, the question is, could you generate cases that make one hash function look terrible? Yes. And so SMHasher basically tries to throw everything at them, to figure out what works best and what doesn't, and then what the right trade-off is between collision rate and performance. Again, a perfect hash function will have zero collisions, but it's going to be super slow; you practically need a hash table for your hash function. Something like CRC32 is going to be really fast, but it may have a higher collision rate. It's hard to decide. So people just pick one, or they'll pick one hash function for integers and a separate hash function for strings. All right, so now that we have our hash function, let's talk about what the hashing scheme can be. Some of these I think we covered in the intro class, but I want to go over them quickly again, and then the two I want to focus on are Robin Hood hashing and hopscotch hashing. The paper talks about linear probing, but there's a previous paper from the same Germans as the paper you guys read that says Robin Hood hashing usually works out better, and the Umbra guys use it in their paper. But I think it's still up for debate which one's actually better, because people try Robin Hood hashing and it turns out it doesn't always work as well. All right, so chained hashing. This is what most people think of when they think of a hash table or a hashing scheme, and it's what you get, I think, with the STL, the standard library's unordered_map; the Java HashMap is this too, right? The idea is that you maintain an array of bucket pointers, and each of those points to a linked list of buckets. Whenever I have a collision, I follow along my chain to find the next free slot to store something, right? And the length of the chain for a given bucket can vary depending on the number of entries in it. So, really simple, it looks like this. Say I have my bucket pointers, and here are my buckets, right? I want to put key A in: I hash it, it takes me to one of these bucket pointers, and then I jump to the first location in the chain for this bucket, find the first free slot, and store A, right? No surprise there. Same thing with B: the first bucket's empty, so it's stored right away. In the case of C, now I have a collision, because it wants to go where A is, but I just do a sequential scan within my bucket until I find the first free slot and store it. In the case of D, D wants to go in this bucket as well, but since I'm only showing two entries per bucket (it's obviously small), this thing is full, so I have to extend my chain out further, like that, right? This should all be pretty understandable.
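Here's a minimal sketch of that structure for a join build side, where the build cardinality is known up front so the entries can be pre-allocated. The types and the hash function are placeholders:

```cpp
#include <cstdint>
#include <vector>

inline uint64_t hash64(uint64_t k) { return k * 0x9E3779B97F4A7C15ull; }

struct Entry {
    uint64_t key;
    uint64_t payload;
    Entry*   next;   // next entry in this bucket's chain
};

struct ChainedHashTable {
    std::vector<Entry*> directory;  // array of bucket pointers (chain heads)
    std::vector<Entry>  storage;    // entries; reserved up front so pointers stay stable

    ChainedHashTable(size_t nbuckets, size_t ntuples)
        : directory(nbuckets, nullptr) { storage.reserve(ntuples); }

    // Build side: push the new entry onto the head of its chain.
    void insert(uint64_t key, uint64_t payload) {
        size_t b = hash64(key) % directory.size();
        storage.push_back(Entry{key, payload, directory[b]});
        directory[b] = &storage.back();
    }

    // Probe side: walk the chain, comparing keys to weed out collisions.
    const Entry* probe(uint64_t key) const {
        for (const Entry* e = directory[hash64(key) % directory.size()];
             e != nullptr; e = e->next)
            if (e->key == key) return e;
        return nullptr;
    }
};
```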
So here's the one trick that the Hyper guys do, which I don't think is discussed in this paper, but they mention it in the morsels paper or the compilation paper. They recognized that in this array of pointers, even though on x86 a pointer is 64 bits in memory, the hardware actually only looks at 48 of those bits. So they said, hey, let's take those extra 16 bits that aren't being used by the hardware and stash something in there. They put a 16-bit Bloom filter inside the pointer; the hardware ignores it, but they know to interpret those 16 bits as a Bloom filter. They use it to check whether the key they're looking for could even exist in the hash table, in that chain, right? And it's clever, because I've got to read this pointer anyway; it's going to be in my L1 cache, my last-level cache. So let me put as much information in there as I can, to potentially avoid scanning along a chain and finding nothing, right? So now when I do a lookup, say get(G), I check the Bloom filter first; if the key's not in there, then I know I don't need to follow the chain.
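A sketch of that pointer-tagging trick might look like the following; the particular filter function (setting two bits derived from the hash) is my assumption, not necessarily Hyper's exact scheme:

```cpp
#include <cstdint>

// x86-64 only translates the low 48 bits of a virtual address, so the top 16
// bits of each bucket pointer can carry a tiny Bloom filter. The hardware
// never interprets the tag; we strip it before dereferencing.
constexpr uint64_t PTR_MASK = 0x0000FFFFFFFFFFFFull;

inline uint16_t filter_bits(uint64_t hash) {
    // Two bit positions in [0, 16), taken from different parts of the hash.
    return uint16_t((1u << (hash & 15)) | (1u << ((hash >> 32) & 15)));
}

// When inserting a key into the chain, fold its filter bits into the tag.
inline uint64_t tag_pointer(uint64_t tagged_ptr, uint64_t hash) {
    return tagged_ptr | (uint64_t(filter_bits(hash)) << 48);
}

// Before following the chain, ask the tag whether the key could be there.
inline bool might_contain(uint64_t tagged_ptr, uint64_t hash) {
    uint16_t tag  = uint16_t(tagged_ptr >> 48);
    uint16_t need = filter_bits(hash);
    return (tag & need) == need;   // false => definitely not in this chain
}

// Strip the tag off before using the pointer.
inline void* real_pointer(uint64_t tagged_ptr) {
    return reinterpret_cast<void*>(tagged_ptr & PTR_MASK);
}
```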
So the obvious downside of linear probing, or any open-addressing scheme, is that as the table fills up, a key can end up very far from where it actually wants to be. Then when I go do a lookup, I may have to scan a bunch of entries that aren't the thing I want, and in the worst case I could wrap almost all the way around the table before finding what I'm looking for.

The way you avoid that, and avoid the infinite loop, is to size your hash table at roughly double the number of keys you're going to put into it. And in a join we usually know that number, because we know how many tuples are coming from the build side, so we can allocate ahead of time. If we get it wrong, we have to resize, and on a resize you just double it again.

So there was an observation back in the '80s that recognized this obvious problem with linear probing and came up with an idea to limit the number of slots I have to inspect when reading the hash table. Remember, in a hash join we have two clear phases: I'm going to build the table, then I'm going to probe it, and I always want the smaller relation on the build side so my hash table is as small as possible. So the idea is: what if I spend a few extra instructions on the build side to reorganize the hash table, so that when I do my probes, the distance I potentially have to scan to find the thing I'm looking for is reduced?

The two approaches for doing this are Robin Hood hashing and Hopscotch hashing. Robin Hood hashing came out in the '80s, Hopscotch hashing in the 2000s. The Robin Hood idea is that it's still an open-addressing table, but as I probe for a slot, instead of just taking the first free one, I look at the entries already sitting where I want to be, and I can decide to steal their slot. The goal is to minimize the average distance any key sits from its home slot, by swapping keys that are far from where they want to be with keys that are close. It's stealing from the rich to give to the poor, hence the name Robin Hood hashing.

So the research from the Saarland Germans behind the paper you read says Robin Hood hashing is actually the better approach, better than plain linear probing. The Umbra guys, in their latest paper from 2021, also use Robin Hood hashing, citing the Saarland paper on why you'd want to do this. But every so often you see a database company try Robin Hood hashing and it doesn't pan out in the real world. I'm not trying to pick on these guys, QuestDB, but this is the most recent example I found. They had a blog article that came out in November last year saying, hey, look, we swapped out our old hash map for this new fast map implementation for joins. They're in Java, but we can ignore that for now. And at the very bottom of the post they're all happy: this new hash table is better than the chained hash table
they had in Java, because now they're using open addressing with linear probing. But then there's a little blurb at the bottom saying someone on Reddit told them to try Robin Hood hashing, and they say, yeah, the early numbers actually look pretty good, we think we want to do this. But if you follow the pull request link from November and go to the very bottom, in January it says, well, it turns out Robin Hood hashing made things worse, and they ended up reverting it. A few years ago we had the InfluxDB people show up, and they were boasting about how they were using Robin Hood hashing. David Andersen was in the audience and asked why they were doing it, and they said, oh, we saw someone on Hacker News say to use it. So they used it, but I think it turned out it made things slower for them too. So to me this is still an open question; I don't know which one is the best to use at this point. But it's good to understand what Robin Hood and Hopscotch are actually doing.

Okay, so the basic idea is that it's just like linear probing: when I insert, I store the entry at the first free slot, but in addition to storing the key, or the hash of the key, which we'll come back to in a second, and the payload, I'm also going to store the number of hops the entry landed from its home slot, the slot it hashed to directly.

So in this example: when we hash A and put it in, there was nothing else in the slot, so its distance is zero. Same with B, distance zero. Now I want to put C in. C wants to go where A is, so it can't, but now I compare how far C is from its home against how far A is from its home. In this case they're both zero, so C is not going to steal A's slot; it goes down to the next one, and I record that it's one hop away from where it wants to be. Now D: D hashes to the slot where C is sitting. D is zero hops from where it wants to be, and that's less than C's one hop, so D cannot steal from C. We leave C alone, D goes to the next slot, and its hop count is one. Now I want to insert E. At its home slot, A is zero hops away and E is zero hops away, so we leave A alone. At the next slot, C is one hop away and E is one hop away, so leave that one alone too. But at the next slot, D is one hop away, while E is now two hops away because it wanted to go where A is. So E is allowed to steal from D: take its slot, D pops out, and we re-place D down in the next slot, running the same protocol if there were something else there. Then F goes in the same way.

Yes? The question is: can inserts be really, really slow? Yes, right, because what are we doing? We're copying entries around, and we're taking a bunch of branch mispredictions, because at every slot there's an if-then-else to decide whether to steal or not. But now when I do lookups, if I'm looking for E, for example, it could have been two hops out, but I'm going to find it one hop earlier; I save a hop on the lookup. So again, the question is whether this trade-off is always the right thing to do: pay a higher cost on the write side, the build side, to make the read side, the probe side, go faster.
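Here's a minimal sketch of the Robin Hood insert rule just walked through, keys only, assuming the table is pre-sized so it never fills (a real implementation would track load and resize). This is the textbook rule, not either paper's implementation:

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Minimal Robin Hood insert: each slot records its displacement from its
// home slot, and an incoming key steals a slot whenever it is farther
// from home than the current occupant.
struct RobinHoodTable {
    struct Slot { uint64_t key = 0; uint32_t dist = 0; bool used = false; };
    std::vector<Slot> slots;

    explicit RobinHoodTable(size_t n) : slots(n) {}

    void insert(uint64_t key) {
        size_t i = std::hash<uint64_t>{}(key) % slots.size();
        uint32_t dist = 0;  // how far the key being placed is from its home
        while (true) {
            Slot& s = slots[i];
            if (!s.used) { s = {key, dist, true}; return; }
            if (s.dist < dist) {
                // Occupant is closer to home than we are: steal its slot,
                // then keep walking to re-place the evicted key.
                std::swap(key, s.key);
                std::swap(dist, s.dist);
            }
            i = (i + 1) % slots.size();  // otherwise, plain linear probing
            ++dist;
        }
    }
};
```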
It's the same bet as partitioning: a bunch of extra work before I even do the other two steps of my join, but ideally I'm setting things up so everything runs faster later.

Yes? The question is whether you still have to keep scanning for more values, or otherwise you'd get a false negative. You can't get a false negative. Say I'm looking for D: I start at D's home slot and scan through until I either find D or I hit an empty slot. If I hit an empty slot, I know D doesn't exist. So there's no false negative, and I'd have to do that same scan with plain linear probing anyway.

What was that? Oh, he's pointing out that the sum of the probe distances across all entries should be the same as linear probing. I think that's correct: even though I'm rebalancing to minimize the distance between where an entry sits and its home, in aggregate it's the same as if I hadn't done the shuffling. But depending on which keys you're looking up, if your probes are highly skewed toward a key that's now closer to home than it would have been, you might be in a better position. You can't prove that algorithmically, because it depends on the data; the goal is that in practice it works out for certain distributions.

And this question is: could you prevent long scans for keys that don't exist by putting a bloom filter in front of this? Yes, we'll see that in a second when we get back to joins.

All right, so again, Robin Hood can swap things around forever; it doesn't try to bound how far an entry can end up from home. The more modern variation, Hopscotch hashing, is an extension of this, but the idea is that we can still move things around like in Robin Hood hashing, except only within the same neighborhood; we artificially restrict how far something can move from where it should have been. If we can't find a key in its neighborhood, we know it doesn't exist. Likewise, if we try to reshuffle things and there isn't room in the neighborhood where something needs to go, then our hash table is too full and we have to stop what we're doing and resize. The goal is to make the cost of scanning a neighborhood the same as finding a single key: because we size our neighborhoods to cache lines and things are nicely aligned, it doesn't matter whether the key we're looking for is in the last position of the neighborhood or the first, since we've pulled the whole thing into the CPU caches anyway.
For this example, for simplicity, we'll say our neighborhood size is three. The first three slots are neighborhood one, the next starting position gives neighborhood two, and so on; they overlap. In the case of the last neighborhood, neighborhood six, that's the last two positions and then it wraps around to the top, and a neighborhood seven would wrap as well.

So now I want to do an insert. A hashes to this position here, the start of neighborhood three, so A can go anywhere inside neighborhood three. The table is empty, so we put it in the first position. Same thing for B: it wants to go in neighborhood one, that neighborhood is empty, so it goes in the first position. Now we take C, and it wants to go in its neighborhood; just like linear probing, the first position is full, so I scan down and store it in the next free slot within the neighborhood. D, same thing: it scans through and goes in there.

Now E wants to go in neighborhood three, the same neighborhood where A is, but when we scan through, that neighborhood is full. So what we've got to do is figure out which occupant of neighborhood three we can move out into another neighborhood it also belongs to. We look at A, C, and D and check which of them hashed to a later position, meaning its own home neighborhood extends further down and still has a free slot. That entry is allowed to slide down by one, which opens up a slot inside neighborhood three, and now I can go ahead and put E in. There's a sketch of the neighborhood lookup below.

Yes? This is way more complicated than Robin Hood and linear probing? Absolutely, yes. And inserts take much longer? Sure, absolutely. You're paying the cost at build time so that probes are faster, but the benefit is bounded by the neighborhood size. Simplicity is better in this case; on a modern CPU nobody really does this. I'll show one graph at the end: I think Robin Hood beats this in the ClickHouse measurements.

Yes? The question is what happens if you can't make room in the neighborhood. You have to say your hash table is full, even though there are obviously still free slots elsewhere, which would not be an issue for linear probing. So you stop what you're doing, double the size of the hash table, and repopulate it. Then you get into questions like: if I'm doing late materialization, repopulating the hash table may not be that big a deal, but if I stored full tuples that I now have to put back in, that's expensive. In the case of QuestDB, they store the actual data in a separate heap, and the values in the hash table are just offsets into it; we haven't covered how that works yet, but it means they can resize without copying any of the tuples themselves. There are pros and cons. Something like extendible hashing would be another option, but it wouldn't be as fast, because the whole point here is that I know ahead of time that I have, say, a million tuples, so I pick a hash table with two million slots and hopefully I never have to resize at all.
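Here's a minimal sketch of just the Hopscotch lookup path, using the toy neighborhood size of three from the example. The hop-info bitmap layout is the classic scheme's, not any particular system's, and the complicated part, the displacement logic on insert, is omitted:

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Sketch of the hopscotch lookup path. Every home slot keeps a small
// bitmap (the "hop info") saying which of the next H slots hold keys
// that hashed to it, so a lookup inspects at most H slots -- ideally
// all inside one cache line.
constexpr unsigned H = 3;  // toy neighborhood size from the example

struct HopscotchTable {
    struct Slot { uint64_t key = 0; uint64_t payload = 0; };
    std::vector<Slot> slots;
    std::vector<uint8_t> hops;  // bit j set => slot (home + j) belongs to this home

    explicit HopscotchTable(size_t n) : slots(n), hops(n, 0) {}

    std::optional<uint64_t> lookup(uint64_t key) const {
        size_t home = std::hash<uint64_t>{}(key) % slots.size();
        for (unsigned j = 0; j < H; ++j) {
            if (!(hops[home] & (1u << j))) continue;   // slot not in our neighborhood
            const Slot& s = slots[(home + j) % slots.size()];
            if (s.key == key) return s.payload;
        }
        return std::nullopt;  // not in the neighborhood means not in the table
    }
};
```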
All right, in the short time we have left, let me quickly bring up cuckoo hashing. The basic idea is that instead of one hash function, you use multiple hash functions to figure out where things go; in practice it's the same hash algorithm, like MurmurHash or xxHash, just with different seeds so you get different distributions of values. So I want to put A in: I hash it with both functions and get two candidate slots. In this case they're both empty because the table is empty, so I flip a coin and pick, say, the first one and put A there. Next B comes along: of its two candidate slots, one is occupied and one is empty, and I always choose the empty one, so B goes there. Now C comes along, and C hashes to where B went and to where A went; both are occupied. I've got to pick one to evict and take its slot. It's like Robin Hood hashing, except instead of shifting the victim down, I literally pull it out. Say we evict the first one, B: C takes its slot, B comes out, and we've got to hash B again. Since we know that when we put it in the first time we used the second hash function, this time we use the first hash function. But now when we go to insert it, it goes where A is, so B takes A's slot, A comes out, we hash A again with its second hash function, and now we find a free entry.

The question is whether lookups are bounded to just two locations. Yes: you get O(1) lookups by paying this cost ahead of time at insert. And, just like linear probing and all the other schemes, inserts can get stuck in an infinite loop, where something I pull out hashes back to a slot I've already touched, so I have to keep track of where I've been to detect the cycle and resize. But to his point, lookups are always O(1): I hash the key twice, jump to two locations, and figure out which one I want. Cuckoo hashing is used in a couple of systems; DB2 BLU uses it, and I remember looking this up before and finding one or two others, but I can't remember which.

Yes? The observation is that these schemes seem made for read-mostly workloads; they seem to be ignoring the hash build phase and its cost. Right, it's almost like you're using this as a hash index: it's write once, read many. So yes, the writes are more expensive, but my reads get faster, and for a database that's a fair trade-off.
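Here's a minimal two-table cuckoo sketch. std::hash xor'd with two arbitrary seed constants stands in for a real seeded hash like xxHash, and a real table would resize and rehash on a cycle instead of throwing:

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

struct CuckooTable {
    std::vector<std::optional<uint64_t>> t0, t1;

    explicit CuckooTable(size_t n) : t0(n), t1(n) {}

    // Two hash functions: same algorithm, different seeds.
    size_t h0(uint64_t k) const { return std::hash<uint64_t>{}(k ^ 0x9E3779B97F4A7C15ULL) % t0.size(); }
    size_t h1(uint64_t k) const { return std::hash<uint64_t>{}(k ^ 0xC2B2AE3D27D4EB4FULL) % t1.size(); }

    void insert(uint64_t key) {
        // Always take an empty candidate slot if one exists.
        if (auto& s = t0[h0(key)]; !s) { s = key; return; }
        if (auto& s = t1[h1(key)]; !s) { s = key; return; }
        // Both occupied: evict and let the evictions cascade, alternating
        // tables. Bound the chain to detect cycles.
        int which = 0;
        for (int attempt = 0; attempt < 32; ++attempt) {
            auto& slot = (which == 0) ? t0[h0(key)] : t1[h1(key)];
            if (!slot) { slot = key; return; }
            std::swap(key, *slot);  // kick out the occupant, re-place it next
            which = 1 - which;      // the evicted key tries its other table
        }
        throw std::runtime_error("eviction cycle: table needs a resize");
    }

    bool contains(uint64_t key) const {
        // Always O(1): only two possible locations to check.
        return t0[h0(key)] == key || t1[h1(key)] == key;
    }
};
```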
All right, we're way, way behind on time, so we'll have to pick up whatever we don't finish and come back on Monday. Cool, if you've got to go, go.

So what are we actually putting in the hash table? It's kind of unclear, and the paper doesn't really say. The first thing we have to figure out is whether we're storing the tuples themselves, or pointers or offsets to the tuples. In the case of QuestDB, again, they store the tuples in a separate heap and the table just has offsets into it, so on a resize they only move those offsets around, not the actual data. If I have these input buffers coming in, I can keep them around and just store offsets into them, so I don't have to store the actual data. But that means when I do a lookup, or when I stitch together the tuples being joined, I have to dereference that pointer and follow it to figure out what the actual data is. And if you have variable-length data, you can't put it inline in an open-addressing hash table anyway, because every slot has to be the same length.

The next question is what we're actually storing as the key portion of the hash table. Because there will be collisions, I need to compare the actual join keys from the input tuple against what's in the hash table. I could store the original join keys, but comparing them could be expensive, and if they're variable-length, that's going to screw up your slots again. So sometimes you also want to store the hash you computed when you inserted, along with the original join keys, so you can do a quick comparison based on the hashes rather than, say, an expensive string comparison every time. This is a classic compute-versus-storage trade-off: if I only store the join keys and not the hash, my data structure is smaller, but comparisons are potentially more expensive because I always have to look at raw keys.

All right, the probe side. There really isn't anything fancy we can do, because it's literally just ripping through the input vector and probing the hash table looking for matches. But the one trick, which is what the earlier question was about, is adding bloom filters. The idea is that while we're building the hash table, we also populate a bloom filter, and then we pass that over to the probe side. So now when I do my probe, I first check the bloom filter; if there's no match, I know the key is not in the hash table, so I don't bother looking at it. Remember, bloom filters can have false positives but never false negatives, so a miss in the filter means the key definitely doesn't exist, and it's way cheaper and faster to check the filter than to probe the hash table. This is sometimes called sideways information passing; there's a VectorWise paper that talks about it, and a Vertica paper from about ten years ago. A bunch of systems do this technique: the HyPer guys do it, and Umbra does it as well. Umbra, again, will have this bloom filter, and then also the little filter inside the chain pointers to decide whether you need to follow the chain at all.
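Here's a minimal sketch of the probe-side filter, not any system's implementation: two probe bits are derived from one hash via an arbitrary multiplier constant, and real systems tune the bit count and number of hashes to the build size:

```cpp
#include <cstdint>
#include <vector>

// Sketch of the probe-side bloom filter. The build phase adds every join
// key's hash while populating the hash table; the probe phase tests the
// filter before touching the table. False positives only cost a wasted
// probe; false negatives cannot happen, so a filter miss means the key
// is definitely absent.
struct BloomFilter {
    std::vector<uint64_t> bits;

    explicit BloomFilter(size_t n_bits) : bits((n_bits + 63) / 64) {}

    // Derive two probe positions from one hash; the multiplier is an
    // arbitrary constant standing in for a second hash function.
    void add(uint64_t hash) {
        set(hash);
        set(hash * 0x9E3779B97F4A7C15ULL);
    }
    bool might_contain(uint64_t hash) const {
        return test(hash) && test(hash * 0x9E3779B97F4A7C15ULL);
    }

private:
    void set(uint64_t h)        { bits[(h >> 6) % bits.size()] |= 1ULL << (h & 63); }
    bool test(uint64_t h) const { return bits[(h >> 6) % bits.size()] & (1ULL << (h & 63)); }
};
```

On the probe side, the check is just `if (!filter.might_contain(h)) continue;` before the hash table probe.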
So, a quick benchmark. We're going to look at the evaluation from the paper you read, and also a paper about the hash table the ClickHouse guys use for strings. Again, I'm going through this very fast, but the evaluation has the basic no-partitioning scheme, and then the different radix partitioning variants we talked about. They also have the concise hash table from IBM DB2 BLU; I think it's a packed array with a bloom filter in front of it, but we can ignore that, since no system other than DB2 BLU uses it. So they have the vanilla implementations, based on, I think, the open-source versions of these algorithms as described in the original papers, and then their optimized versions that apply a bunch of the techniques we talked about. Here are all the different variations, and over here is that sort-merge join I mentioned, from the Intel/Oracle paper. This region over here, which you can't really see, is all the optimized versions, and these are the symbols they're using. They basically show that radix partitioning with either chained hashing, linear probing, or a basic array is always going to be the fastest; if your data nicely fits an array, you can get a little faster still. But if you don't do any radix partitioning and no specialization, and remember, getting all of that correct for the actual hardware you're running on is non-trivial, the plain version is actually pretty good for almost everything, and it requires far less engineering than the others. The arrays? Yeah, for the array variant the indices literally are the hash keys, right?

So the next graph they show says, okay, how much time in a real query is actually spent on the hash join? For all these different approaches, it's basically 13% or less being spent just doing the hash join, and everything else is the rest of the query: getting the data, reading it in, doing any filtering, materializing the output, doing everything else above it in the query plan. This is TPC-H Query 19. So hash joins are super important, but they're not the long pole in the tent, the most important thing we should be optimizing for. And for this reason, I think the no-partitioning linear probing approach is going to be the easiest to implement and good enough for most things.

All right, so let me show you one more graph. This is from another paper, from some random workshop or conference I've never heard of, where the ClickHouse guys took their implementation, or sent a PR, and this is the data structure ClickHouse uses for hashing strings, for joins and hash tables on string keys. The basic idea is that it's a single logical hash table, but under the covers they have four different hash table variants for strings of different sizes: there's one for strings up to 16 bytes, one for 17 to 32, and so on, with different optimizations for each. What I like about this paper is that they ran everything on this one workload doing joins and group-bys, and you can see how their thing comes out the best.
The ART index on there is the adaptive radix tree from the HyPer Germans; it's meant as a good replacement for a B+Tree, not really for joins like this, and it doesn't do so well here. You see the chained hash table, Hopscotch hashing, cuckoo hashing, Robin Hood hashing, the F14 tables from Facebook, and the Swiss tables from Google. And then, lo and behold, here's our vanilla off-the-shelf linear probing hash table, and ClickHouse's thing squeaks out a little bit better, because again, it's specialized for strings. Unfortunately there's no ClickHouse paper; they do a lot of crazy stuff, with I think many variations of hash tables for doing joins, but their blog articles are pretty good, and sometimes you can go under the covers and understand what's actually going on. ClickHouse is a really fascinating system. The guy told me about something new they're working on now, so hopefully we can cover that next year.

Why is linear probing better here than, specifically, Robin Hood and cuckoo and even Hopscotch? Because it's so simple, right? That's the whole point: simple is better. And again, everybody says partitioning is faster than no partitioning, but getting it right is challenging. This is not just me; the Germans and others say the same thing, and most systems are going to pick one implementation and not try to be clever about it.

Yes, quickly, one last question. The question is whether, given these results, it makes sense to have a specialized variable-length string hash table. Yes, if you can get people to write it for you; but if you're just starting out, go with linear probing.

So again, next class we'll do worst-case optimal joins, and we'll spend a little bit of time on profiling with performance counters. And a reminder: Wednesday next week will be the status update presentations for the projects. Okay.