Today is the first part of a two-part lecture series we're doing on join algorithms. Today we're going to do hash joins, because they're the primary one of the two, and then on Monday next week we'll do sort-merge joins, which are important to have, but almost always you want to use a hash join, and we'll sort of see why.

All right, so the background for today is some high-level ideas about doing parallel joins. Unlike in the introduction class, where we said here's the algorithm and we really just cared about minimizing disk I/O, now, since everything's in memory, we need to worry about how to take advantage of all the CPU cores and the parallelism that we have. So we'll talk about how to do a parallel hash join, and this will be broken down into the three phases of doing a hash join. Within the build phase we'll talk about hash functions and hashing schemes, and then we'll finish off with an evaluation that actually predates the one that you guys read, but in my opinion it still illustrates the core ideas.

As I said, in the introduction class last semester we taught you join algorithms, but we just said, hey look, we care about minimizing the number of blocks read and written from disk, and then we made decisions about what should be the outer table and which should be the inner table, sort of really simplistically. But now we need to think about how we actually want to speed things up and take advantage of all the hardware that's available to us.

So the two main approaches for a high-performance join algorithm are either the hash join or the sort-merge join. There really is no other technique: you either sort things, or you hash them, or you just do the brute-force nested loop join. For this class we're not going to talk about nested loop joins at all, because it's almost never going to be a good choice for the kind of OLAP queries we've been talking about: doing large scans and complex analytics, or doing multi-way joins where you're joining multiple tables within a single query. The nested loop join primarily shows up in OLTP environments, and some OLTP database systems actually don't even implement a hash join, because they don't need it: most of the time the queries they're focused on are going to be doing index nested loop joins. You iterate over the outer table, you find the tuples you want to join from the outer table, and then you have an index to probe into the inner table and find the few tuples that you need. My favorite example is: I want to log into my Amazon account and list all my orders, so you do a join between Andy's orders and the items Andy bought in those orders. There's going to be an index on the order ID on the order item table, so I know how to jump in and get just those tuples I need; I'm not doing any long scans.

At a high level, though, what the hash join is actually going to do is essentially build an index on the fly: the hash table is essentially going to be our index. But unlike in the OLTP environment, where we have the index available to answer queries very quickly, in a hash join we're going to build the index, do our computation, do our join, and then throw it away. We're not keeping it around;
we're not maintaining it beyond the query that we're executing. There are some systems that can do that, but that's not our focus here.

The other thing to think about is that in the OLTP environment I may want to do range queries on my index, so chances are it's going to be a B+tree or some kind of tree data structure like we talked about. In a hash join we don't care about range queries; we're doing point queries, so we want that O(1) lookup that the hash table is going to give us to find the one tuple that has what we need. Now, depending on our hash table implementation, we may have to scan a little bit if we have collisions, but the high-level semantics of what the query wants to do on our hash table do not include range queries, and that's why we don't want to use a B+tree. But again, at a high level, this is essentially doing the same thing as an index.

The debate over whether you want to use a hash join versus a sort-merge join is one of these classic things in the field of databases, and every decade it goes back and forth over which one's actually the best. In the 1970s the conventional wisdom was that the sort-merge algorithm was better, because people didn't know how to do hash joins or build hash tables for tables that exceeded the size of the memory that was available. You figure, we know how to do external merge sort, so let me just sort everything and then walk through and do my join that way. Then in the 1980s people realized, oh, hashing is actually important, hash joins could be better. So there was this movement to build database machines: think of these as specialized hardware built explicitly to run database systems, and they added additional features in the actual CPUs, in the instruction set of this specialized hardware, to do hashing very efficiently. Now we have some things like that in Intel chips, but these were customized hardware just to run the database system, and they had hash join implementations in hardware. These went out of vogue because of engineering costs: by the time you spec'd out and fabbed your own database machine, Intel came out with a newer version of their chip that already surpassed what you could do. So people don't really build specialized hardware anymore, although that's slowly coming back into vogue today. In the 1990s Goetz Graefe, the guy that did the Volcano work and the index locking paper you guys read, had a paper that said, well, if you kind of squint at a high level and look at today's hardware, sort-merge joins and hash joins are essentially equivalent.
They'll get roughly the same performance for the same workloads. Then in the 2000s hash joins really took off, and that's where we're at today: the hash join is considered the correct way to go if you want a high-performance join algorithm. Now, if your data is already sorted on your join key, clearly the sort-merge join is going to be better, but if it's unsorted and you don't know anything about it, then hashing is the way to go. And where we're at now, in this decade, although it's almost over, is that the debate has come down to whether you want to do partitioned or non-partitioned hash joins. Then in the 2020s, in Donald Trump's third administration or whatever, we'll see what actually turns out; and again, we might all be dead, that's okay.

All right, so I want to give a bit more detail on the last decade's worth of work on parallel join algorithms for in-memory databases. As I said, there was this back and forth over whether the hash join or the sort-merge join was better. There was a paper that came out in 2009 from Intel, actually Intel and Oracle together, where they said that hashing is clearly better than sort-merge join, so that's what you want to do in your database system. But they also proposed that if you had wider SIMD registers, specifically 512-bit registers, then sort-merge join would actually be preferable. At the time 512-bit registers didn't exist; they actually exist now, as of a couple of years ago, but I don't think anybody has revisited this claim. We'll cover SIMD in more detail in two more lectures, and we'll cover sort-merge join next class. But in 2009, given the hardware that was available at the time, they said hashing was better. Then in 2011, and we'll show some of those results later on, the Wisconsin group actually measured the trade-off between partitioned versus non-partitioned hash joins. Then in 2012 the HyPer guys in Germany came out and said, well, Intel was wrong, Wisconsin was wrong, sort-merge join is actually better even if you don't have the newer hardware that Intel says you need: we can make sort-merge join run faster than hash joins right now. So that was 2012. Then in 2013 another paper said to ignore the HyPer guys: you really want to do hashing. They had a hash join algorithm, and I forget whether it was partitioned;
I think it was partitioned. Again, the claim was that hashing was better. Then, also in 2013, the people at ETH in Switzerland came up with new optimizations for the radix hash join, a partitioned hash join that uses radix-based partitioning, and we'll cover what that is in a few more slides. They basically said, here's how to make the hash join even better, and clearly you want to do partitioning. And the paper I had you guys read is from 2016, where more Germans, the Saarland group, basically said, hold up, everyone is not thinking this through clearly; everyone is proposing their own single optimization, how to tweak this one thing to make it go faster. They instead tried to provide an exhaustive evaluation of all the different design decisions you have in your hash join, and they basically show that under different conditions one might actually be better than another.

So that's where we're at today: the hash join is considered the fastest way to do this, but the exact details of how to get the best performance can vary depending on the workload. Although this paper gives you a sort of recipe guide, a bunch of different ways to implement your hash join, as far as I know most systems don't try to implement every case the paper covers; everyone pretty much picks one way and makes that work really well. So you either pick partitioned or non-partitioned, and as far as I know everyone does non-partitioned, and you don't worry about doing anything dynamic or adaptive based on what the workload is doing.

All right, so what are our goals when we want to design a join algorithm? In a disk-based system it's all about disk I/O, so it's all about reducing the amount of data we have to read and write as we do the join. In our world it's different. The first goal is that we want to minimize synchronization. We don't want the different threads taking latches on shared data structures and interfering with other threads computing the join at the same time, so we're going to try to make this as non-blocking as possible. That's not to say we want to make this latch-free, because as we talked about before, making things latch-free just causes you to do extra work in order to be able to say that you're latch-free; but there are other ways to organize the data such that we can reduce the amount of synchronization we have to do.

We also want to minimize our CPU cache misses. We talked about this before: the difference between getting something from your cache versus going out to memory is quite significant, so we want to design our algorithm in such a way that we're reusing data while it exists in the CPU cache as much as possible. You don't want to do random jumps to different spots where every access is a different cache miss; you want some locality, some affinity, in the data that you're accessing. And certainly, since we want to run in a NUMA environment, we want to minimize the amount of traffic we have going over the socket. So this is sort of what I already said. All right, so how can we improve our cache behavior?
So what's going to cause us to have cache misses? Certainly there's the size of the CPU caches, L1, L2, L3, and it also depends on what's in our TLB, the translation lookaside buffer. If we're jumping around to different spots in memory, every single access is a new cache miss because the data we need is not there; it could also be a TLB miss, because the mapping from the virtual address to the physical address of what we're trying to access is not in our TLB, and that's another miss the hardware has to go resolve for us.

So the way we can architect the system is to maximize the amount of sequential access we're doing. We don't want to do random jumps into our data; we want to use long scans as much as possible, so that we can bring a cache line in, operate on all the data within that cache line, and be done with it before we move on to the next one. The other thing is that in the cases where we do have to do random lookups, we want the search space that a thread or core is going to look at to fit in its local cache. Jumping out to different sockets sucks, but jumping out to data that's beyond our CPU caches is still bad too. These are the things we need to be mindful of as we design our algorithms. Now, what ends up happening is that you would think, clearly I want to do partitioning to take advantage of this. Again, this is a trade-off between the number of instructions versus the amount of memory I have to touch, and it actually turns out to be the case, at least on a modern CPU, that partitioning is not always going to help us, because we end up doing multiple passes over the data in order to partition it, and that ends up being more work than just paying the penalty of the cache misses. But we can still design other aspects of our algorithm to be mindful of this.

All right, so hash joins. The hash join is the most important operator in a data management system for OLAP workloads. It usually rises to near the top: at least for an in-memory database, the bulk of the time the database spends doing work is doing these joins. Filtering we can speed up with vectorization, sorting we can speed up with other techniques, but the hash join is sort of this brute-force thing we just have to do. There are things we can do to make it suck less, but it's still going to be the tall pole in the tent, and that's why we're going to spend an entire day talking about how to make it go faster.

In the intro class, when we talked about the hash join we didn't even discuss it in terms of hardware like cores or threads; we said here's the algorithm, and we just assumed the system would go do it. But now we need to design our implementation of the hash join algorithm to be aware that we're going to run in parallel on multiple cores, and to make sure those cores run at full utilization but don't become bound by the bottleneck of getting things out of memory.
All right. So a hash join can be broken up into three phases. In the first one we can do partitioning, and this is optional. This is where we take all the tuples from the outer table R and the inner table S and split them up on the join key, using a hash function on the key we're going to join on anyway, dividing them into smaller chunks, so that when we do our join we're only looking at data within two corresponding chunks from the outer and the inner. This is also called the Grace hash join; I think we covered it in the intro class.

Then, regardless of whether we've done the partitioning or not, in the second phase, the build phase, we actually build the hash table for the outer table R based on our join key. This means we're scanning R, looking at the join attribute, hashing it, and putting it into some hash table; we'll describe what the hash function and the hash table look like in a second. Once we've got our hash table, then in the probe phase we scan through the inner table S, look at its join key, hash it, and do a lookup into the hash table we just built in the previous phase. If we find a match, we splice the two tuples together and produce that as our output.

One of the things I like about the paper you guys read is that they actually do this materialization step. Some of the other papers don't do this; they just say, I do my match, I find that the join keys match, and then they throw the result away, which is not realistic, because in a real system you actually need to produce the output. In the paper you read from the Saarland researchers, they do the full step. It seems kind of trivial, like why wouldn't you always do this? I don't know why people don't do it. Anyway, it also affects the caching behavior, because now you're copying the inner and outer tuples into some buffer, and that affects the whole architecture of the system.

All right, so I'm going to go through each of these phases one by one. In the partition phase the idea is that we want to split both the inner and the outer, you have to do both, into these partition buffers based on their join key. The goal is that the cost of doing multiple passes over both relations and partitioning them, essentially making copies of them, will be less than the cost of the cache misses and TLB misses we would otherwise incur when we actually do the join in the build phase and the probe phase. Sometimes in the literature you'll see this called the hybrid hash join or the radix hash join; it essentially means the same thing to me: you're partitioning the data, both the inner and the outer, before you actually do the join.

What we actually put into the buffers of these partitions will depend on the storage model. If it's a row store (NSM), then it's either going to be the entire tuple, which it often is, or a subset of the attributes that I need to compute the join and that I need up above in my query plan.
This is why most people actually store the whole thing: you don't want to have to go back and get the rest of the data. For a column store (DSM), we can store just the data we need from the columns to do the join, just those attributes, plus the offset of where the rest of the data is located. This is an example of late materialization, where the idea is that we're delaying actually stitching the tuple back together until we present it to the application. The benefit is that we need to store less data in our hash table, but the penalty, of course, is that you have to do this lookup afterwards if you need more data after you do the join. I think some systems do late materialization and some do early materialization: sometimes they copy the entire tuple, sometimes they just put in the data that they need. It really depends on what you think the selectivity of the join operator is. If a hundred percent of my tuples are going to match in my join, then I might want to stick the whole thing in there, because then I don't have to pay the penalty of going to look things up afterwards. But if only one percent of the tuples are going to match, then I don't want to pay the penalty of copying in things I'm never going to need, and for that remaining one percent I just go do my lookup. Again, as far as I know most systems do one or the other, and no one tries to be smarter or adaptive and say, all right, for this query I'll do it one way and for another query I'll do it another way; everyone just picks one and sticks with it, despite knowing that there are benefits to one or the other depending on what the query is and what the data is.

The main thing to understand about why we're doing this is that in an in-memory database a cache miss is so expensive that we're better off paying the extra instructions to do this partitioning. That's the intuition of the motivation; it's not always the case, though.

So let's talk about how we actually want to do this partitioning. The two approaches are non-blocking or blocking. With non-blocking partitioning, you scan through each input relation once and populate the partitions as you go, and then you reconcile things later on; you build it out incrementally. Blocking partitioning, also sometimes called radix partitioning, is where we end up scanning the input relations multiple times and only materialize the result all at once at the very end. So with the non-blocking approach, in theory you could start producing output incrementally and have another thread start taking the data coming out of your operator and building the hash table, whereas with the blocking approach you can't start building the hash table until the partitioning is actually done. We'll go through both of these.

Now, with non-blocking partitioning there are two more approaches. The first is shared partitions, where you have a single set of global partitions that all the threads are going to write to, and you just protect each partition bucket, or each partition, with a simple latch. With this exclusive latch, I acquire it,
do my write, and then release it once I'm done with my write. With private partitions, you're trying to do this without that coordination: each thread writes to its own private partitions, and then after everyone is done writing, you go back and have another thread, or multiple threads, consolidate them into a single set of partitions.

Let's go through both of these, but the main thing here is that, again, there's no free lunch. Either we have latches to protect our data structures, as in the first case, or we avoid latches and have everyone write to their own private workspace, but then we pay the penalty of doing an extra pass to put it all together. This is classic CS: there are ways to make it latch-free, but you end up doing more work. Whether one of these two is better depends on the hardware and on what the workload looks like.

All right, so let's go through the shared one first. This is our data table, and just like with morsels we're going to split it up into chunks, and each thread or core is assigned to operate on some chunk, a disjoint subset. Let's say our join is going to be on this attribute B here. What's going to happen is we'll have a hash function for this phase of the partitioning, and for every single tuple, in parallel on each core, we look at the value of B, run it through our hash function, and mod it by the number of partitions that we have. Then they all end up writing into these separate partitions, which are essentially a linked list of blocks, and within each block we can stick a bunch of data. So everybody scans through in parallel and writes out to these partitions. That means that if this thread is writing to P1 at the same time that thread is writing to P1, we protect it with a latch: take the latch, append, and we're done.

All right, so let's see how to do this without the latches. Same thing: we still split the input into chunks, and each thread hashes its chunk and decides where each tuple goes, but where it writes is local to its core. Each of these threads has its own copy of the same partition buckets that I showed in the last slide, and it's the only one writing into them, so it doesn't have to take a latch to protect any of it. It means I have to set all this up ahead of time for each core, rather than having one shared set of partitions for everyone to write into. Then, once I know everyone has processed all their tuples, I have another thread, or multiple threads, come along and consolidate them, put it all back together; you can have one thread be responsible for writing out one partition and another thread responsible for another. This requires me to make two copies of the data in this case, because I copied it once into the private partition and then copied it again into the combined partition. You could obviously be clever and say, oh, this block is full, so I don't need to copy it again; I just include that full block in the linked list for the final global partition.
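To make the two non-blocking variants concrete, here is a minimal sketch under my own assumptions (a fixed partition count, 64-bit keys, `std::hash` as a stand-in hash function; none of these names come from the paper): shared partitions take one exclusive latch per partition, while private partitions write latch-free and get stitched together in a second pass.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <vector>

using Key = uint64_t;
inline size_t HashKey(Key k) { return std::hash<Key>{}(k); }  // stand-in hash

// Variant 1: shared partitions, one exclusive latch per partition.
struct SharedPartitions {
    std::vector<std::vector<Key>> parts;
    std::vector<std::mutex> latches;
    explicit SharedPartitions(size_t n) : parts(n), latches(n) {}

    void Insert(Key k) {                                   // called by many threads
        size_t p = HashKey(k) % parts.size();
        std::lock_guard<std::mutex> guard(latches[p]);     // acquire, append, release
        parts[p].push_back(k);
    }
};

// Variant 2: private partitions per thread, consolidated in an extra pass.
struct PrivatePartitions {
    // local[thread][partition] -> keys written without any latches
    std::vector<std::vector<std::vector<Key>>> local;
    PrivatePartitions(size_t threads, size_t n)
        : local(threads, std::vector<std::vector<Key>>(n)) {}

    void Insert(size_t tid, Key k) {                       // each thread owns its buffers
        local[tid][HashKey(k) % local[tid].size()].push_back(k);
    }

    // One (or more) threads stitch the private buffers into global partitions.
    std::vector<std::vector<Key>> Consolidate() const {
        std::vector<std::vector<Key>> global(local.front().size());
        for (const auto& per_thread : local)
            for (size_t p = 0; p < per_thread.size(); ++p)
                global[p].insert(global[p].end(),
                                 per_thread[p].begin(), per_thread[p].end());
        return global;
    }
};
```

The trade-off shows up directly: the first variant pays for contention on the latch, the second pays for the extra copy (or pointer stitching) in `Consolidate`.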
All right, so this is another good example of different materialization strategies, because if I'm just passing around offsets, then I'm not copying that much data from one step to the next, whereas if I'm copying the entire tuple as I put it into my partitions, then that copying is more expensive.

Now let's look at the other approach, radix partitioning. The idea here is that we're going to scan our relation potentially multiple times to generate more fine-grained partitions; we want to keep splitting things up recursively until all our partitions fit in our CPU caches. The spoiler for why this doesn't always work out in practice is that if we have skewed workloads, some partitions will have to go through multiple iterations to get everything to fit into the CPU caches, whereas others may only need one pass.

In the first pass we scan our relation R and compute a histogram that counts the number of tuples that will hash to each partition for a given radix. This is the same radix we talked about with the radix tree: just think of it as one digit of the key. In the case of the radix tree we were looking at bytes; it's the same idea, one character, one position. These histograms tell us how many tuples will hash to a given radix value. Then we use the histogram to compute the offsets of where the cores are going to write, based on a prefix sum; I'll explain what the prefix sum looks like in the next slide. Using the prefix sum over the histogram, we know where each thread is going to write into our partition space, and the threads don't need to coordinate with each other: every thread can write at some offset into this giant array of tuples or keys, and I know nobody else is going to try to write to the same slots as me. Let me walk through all of this; it's a good example of combining a bunch of simple ideas we've already talked about into something more complex.

The radix, as we already said with the radix tree, is just some digit or position of a particular key. Let's say our input key sequence is 89, 12, 23, 08, 41, 64. The radix at the first position would be 9, 2, 3, 8, 1, 4; we're only looking at one digit of each key. Same thing with the next position: 8, 1, 2, 0, 4, 6.

Now, the prefix sum is just a running total of a sequence of numbers. Say 1, 2, 3, 4, 5, 6 are our input values. When we start computing the prefix sum at the first position, nothing comes before 1 in our list,
so the prefix sum there is just 1. Then we take the last prefix sum we produced, add the next value in our input sequence, and that's the next prefix sum; in this case it's 3. Same thing, 3 plus 3 gives me 6, and so on down the line. We're going to use this technique to figure out the offset where each core writes into our array of keys when we do the partitioning. We partition based on the radix, a single digit at one position, figure out where to write into our total key space, then do it again for the next radix position. Let me walk through a whole example of this.

For simplicity, assume we've already hashed everything, so these are just the hash values of our keys. The first step is to go through our input sequence and create a histogram. We split the input up just like we did before with morsels and with the previous approach, so the first core only looks at the first four keys and the second core looks at the next four keys. We look at the first radix position and count how many keys will map to each partition that we're going to have: we take the value, mod it by the number of partitions, and that tells us how many go into each one. Say we have two partitions because we have two cores. In this case, for the first core, partition zero had two entries and partition one had two entries; for the second core down here, there was one for partition zero and three for partition one.

Now we compute the prefix sum based on these histograms. We just go in order, 2, 2, 1, 3, and compute the prefix sum of that, and that tells us, for a given partition at a given CPU core, where it should start writing its keys. So starting with this one here: this core writes 07 in here, because this is partition zero, and for partition zero at offset zero it writes in here. This one was at partition zero at the next offset, so it writes in here for partition zero; the previous one had a length of two, so this one starts at three. Is that clear? And I do the same thing for the next partition. So now I'm making another pass through my data and just writing out, copying, each key into its partition like this, and I don't need any coordination between threads about where they're writing, because I've already precomputed where it's safe for each of them to start appending their keys. The algorithm says you can keep doing this, keep partitioning, until the partitions fit into your CPU caches.
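Here is a minimal single-threaded sketch of that radix step, my own illustration rather than anything from the paper: build a histogram over one radix digit of the (already hashed) keys, turn it into prefix-sum write offsets, then scatter. With multiple threads, each would build its own histogram and the prefix sum would run over the per-thread counts so that every thread gets a disjoint write range.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Partition `keys` on one radix digit (here: the `bits` low-order bits of the
// hash), using a histogram + prefix sum to precompute each partition's offset.
std::vector<uint64_t> RadixPartition(const std::vector<uint64_t>& keys,
                                     unsigned bits,
                                     std::vector<size_t>* bounds_out) {
    const size_t fanout = size_t{1} << bits;
    const uint64_t mask = fanout - 1;

    // Pass 1: histogram of how many keys land in each partition.
    std::vector<size_t> hist(fanout, 0);
    for (uint64_t k : keys) ++hist[k & mask];

    // Exclusive prefix sum: offset[p] = where partition p starts writing.
    std::vector<size_t> offset(fanout, 0);
    for (size_t p = 1; p < fanout; ++p) offset[p] = offset[p - 1] + hist[p - 1];
    if (bounds_out) *bounds_out = offset;   // partition boundaries for the caller

    // Pass 2: scatter; no coordination needed because offsets are precomputed.
    std::vector<uint64_t> out(keys.size());
    for (uint64_t k : keys) out[offset[k & mask]++] = k;
    return out;
}
```

Note that this is the blocking flavor: nothing is usable until the scatter pass finishes, which is exactly the trade-off against the non-blocking schemes above.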
In the original Intel paper they say you could do this essentially two times: you do it all over again, now looking at the next radix position, and do the exact same thing we did before. But this is a good example of why it becomes problematic: because of how I partitioned before, for the next pass this CPU, looking at partition zero, only has three things to look at, whereas this one has six things to look at. So this CPU is going to do three quick things and then essentially be blocked waiting for the other one to finish. You have this skew problem where everything could end up at one core and everyone else is going to be idle. So in practice nobody, at least in the literature as far as I know, goes beyond one pass, one round of this radix partitioning, even though in theory you could keep going until you've used all your digits. There's a point of diminishing returns, because I'm passing over the data when I should probably just be joining it.

So hopefully this is clear: the idea is that we're splitting things up into partitions, and we can do this in a blocking or non-blocking way, with shared or private partitions, until we've split both the inner relation and the outer relation into smaller chunks so that everything fits in our CPU caches.

All right, so now, whether or not we actually did the partitioning, we enter the build phase. The idea here is that we're going to scan through the outer relation R with multiple threads, and for every single tuple we hash it on our join key. It has to be the same hash function we used in the partition phase and the same hash function we're going to use in the probe phase; everything has to be the same. We take the hash of the join key, find which bucket the tuple should go into in our hash table, and add it there. The goal is for our buckets to be only a few cache lines in size. Obviously we have to deal with alignment, as we already talked about, but the idea is that we want to be able to write our key into a single cache line and not have to do two cache lookups, two memory lookups, to get the data we want.

The thing we care about here is how we actually organize the data structure that stores this data during the build phase. This is what people colloquially refer to as a hash table, but people often think of the hash table as being a single monolithic data structure. A hash table is actually comprised of the hash function and the hashing scheme used to store data in it. The hash function is how we take a large key space of arbitrary values and map it into a smaller domain that can fit into some region of memory, because we don't want our hash table to be massive; that's a waste of space. The hashing scheme tells us how we handle collisions: after we hash and figure out where our data should be stored in memory, if we have collisions, how do we handle that?
We're going to go through both of these, and again it's going to be a trade-off between how fast we can be in our hashing versus our collision rate, and then, when we do have a collision, whether we pay extra instructions on inserts or extra instructions on lookups. I'm not saying one is better than another; again, most systems and most companies, when we ask them, just pick one and stick with it.

All right, so our hash function. It goes without saying, but when we say hash function we don't care about anything related to encryption, security, or cryptography. It's just about taking keys of arbitrary length from an arbitrary domain and mapping them into either 32 bits or 64 bits, which we can then mod by the number of buckets where we can store stuff. So we don't care about SHA-1 or SHA-256, and we don't care about two-way hashing here, meaning we don't care about taking the key, hashing it, and being able to reverse it back to get the original value. We just hash it, and we want it to be fast and unique enough that our collision rate is low.

The two ends of the spectrum for how we pick our hash function are being really fast or having a really low collision rate, and these are at opposite ends; you can't have the best of both worlds, so you have to pick where you want to be in the middle. For raw performance, the easiest hash function you could possibly have is one where, no matter what input you give it, it just returns the value one. That's amazingly fast, because it's just writing one to a register, it's nothing. The problem is that if every key hashes to one, your collision rate is going to be huge, and now when you stick things in your hash table everything collides, and your lookups fall back to a sequential scan to find the data you need, which is what we wanted to avoid in the first place. At the other end of the spectrum, if you want a zero collision rate, you can use what's called a perfect hash function: for every possible unique key you give it, you're guaranteed to get a unique hash value, a unique position. The way you implement this is essentially having another hash table inside, or, if I know exactly the number of slots I have, a separate physical mapping from every key to a position that's guaranteed to be unique. This is more of a theoretical thing; you can't really do this in practice for arbitrary keys. So the goal is to pick something in between that has good performance but a low collision rate.

There are a bunch of different hash functions people use in database systems for doing joins. Probably the oldest one you could use is CRC-32 or CRC-64. The CRC algorithm itself goes back to about 1961, and 1975 is when the 32-bit version came out. This is what people use in networking for error detection on packets and things like that, but it's essentially the same thing as a hash function: it takes an arbitrary key of arbitrary length and produces a semi-random value. There are actually some systems that, when we asked them, said they're using CRC-32 for hashing.
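To make the two ends of that spectrum concrete, here is a toy comparison (my own illustration): the degenerate "always return 1" hash versus a real mixing function (a Murmur3-style 64-bit finalizer), each mapped into the bucket domain by modding with the table size.

```cpp
#include <cstdint>
#include <cstdio>

// Degenerate "fast" hash: every key collides, so lookups decay to a linear scan.
uint64_t HashConstant(uint64_t /*key*/) { return 1; }

// Murmur3-style 64-bit finalizer: a few multiplies and shifts, low collision rate.
uint64_t HashMix64(uint64_t k) {
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

int main() {
    const uint64_t num_buckets = 1024;   // map the 64-bit hash into a small bucket domain
    for (uint64_t key : {42ULL, 43ULL, 44ULL}) {
        std::printf("key %llu -> constant bucket %llu, mixed bucket %llu\n",
                    (unsigned long long)key,
                    (unsigned long long)(HashConstant(key) % num_buckets),
                    (unsigned long long)(HashMix64(key) % num_buckets));
    }
}
```

The constant hash dumps every key into bucket 1; the mixer spreads nearby keys across the bucket space, which is the whole job of the hash function in a join.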
The more modern approaches are MurmurHash, CityHash, and FarmHash, and these are probably the ones we actually care about. MurmurHash was written by some random dude on the internet; he posted his source code on GitHub or SourceForge, I forget the details, and people picked it up and started using it because it had nice properties: it was fast, it was general purpose, and it had a good collision rate. Then Google took MurmurHash2, modified it, borrowed some ideas from it, and came out with something called CityHash, which was designed to be really fast for keys less than 64 bytes, because that was specific to the problem Google was trying to solve; rather than being general purpose, they specialized it to do well under 64 bytes. Later, FarmHash came out as a variant of CityHash, designed to have better collision rates on the same kinds of data, so these two are roughly interchangeable. There's also one from Google called HighwayHash, and I think that one has protections against cryptographic analysis of the keys; we don't care about that, because we're running in memory and we're going to throw away the hash table immediately after we compute the join.

What's also interesting is that there's a newer hash function from 2016 by a researcher in Canada named Daniel Lemire called CLHash, and what's kind of cool about it is that it's based on a different kind of math called carry-less multiplication. I don't pretend to know the details, but my understanding is that carry-less multiplication didn't work well on modern architectures until the last three or four years, when both Intel and AMD added specific instructions to make it work well, and now you can actually use it for a hash function. We'll see what kind of performance it gets; I think it's kind of cool, and I wonder whether people will start considering it instead of MurmurHash, which is pretty common now.

All right, so this is a benchmark framework I found on GitHub one or two years ago, and I modified it to run some experiments I want to show you. It's a micro-benchmark to see how fast these hash functions run: along the x-axis they vary the size of the key being hashed, and the y-axis is the megabytes per second of key data you can process. So think of it not as how many keys per second I can process, but as the total amount of data I'm processing. The first thing you see is that CRC-32 actually performs the worst here, and there are these little spikes along the 64-bit boundaries, because that's what the algorithm was designed to work well on for network packets and things like that; you don't send weird, odd packet sizes, right?
It's always byte-aligned. The middle guys here, the gray and the red, are std::hash and MurmurHash; these do reasonably well when things are small, and everyone is roughly the same. But you can see nicely, along these spikes at the word alignments, 32, 64, 128, 192, the sawtooth pattern for FarmHash and CityHash. They do really well for keys of those sizes, and they're basically using one algorithm when the key is less than 64 bytes and switching to something different beyond that, which is why you see the larger sawtooth there. For our system, at least the old system, Peloton, I think we used MurmurHash3; that seems to be pretty common. And for CRC-32 I used the fastest version I could find on GitHub, where they do a bunch of bit-manipulation tricks to make it go fast, so this is probably about the best you can get from CRC-32.

Now let's look at CLHash. When keys are less than 64 bytes, CLHash actually doesn't do that great, but beyond that, at larger key sizes, it does really, really well, and it's interesting that you see these spikes at the byte-aligned sizes; that's where you see those jumps. So clearly you don't want to use CRC-32, but MurmurHash, CityHash, FarmHash: these are all probably good enough for doing joins, and I'm not claiming one is clearly better than another. As you get to longer key sizes you don't want CRC-32, but among the rest I don't think there's a big difference. Also, I'm not measuring the collision rate here, so that's another metric we'd have to consider; I'd love to do that, I just haven't done it. The other thing to point out is that none of this is taking advantage of SIMD, as far as I know. In the case of Google's code, they don't use any SIMD at all because they want it to be portable; they want anybody to be able to run it on their Android phone or whatever crazy hardware they have. So that's not using any SIMD whatsoever, and I don't think the other ones are either, but SIMD is something you could use to speed this up. I think there's a CRC instruction in Intel chips; this version doesn't use it, though. I wonder if that's something we could look at. Anyway.

All right, so those are a bunch of hash functions we could use; now let's talk about what the hashing scheme is going to be. There are four basic approaches; there's also hopscotch hashing, but we don't need to talk about that. The thing they all differ on is how they deal with collisions. The best hash function in the world, unless we're using perfect hashing, is still going to have collisions, so the question is what we actually do when two keys hash to the same slot in our table. And again, the trade-off is going to be: do we make the reads go faster, or do we make the writes go faster? Chained hashing is probably what everyone thinks about when they think of a hash table.
This is what you get in Java and C++: for each slot in your hash table you have a linked list of buckets. You hash to some starting point in a chain, and then you traverse that linked list until you either find the key you're looking for or find a free slot to insert the key you're trying to put in. Essentially, after you do the hash you end up doing a sequential scan to find the slot you want, and that's going to be the recurring theme for all of these: chained hashing, linear probe hashing, and Robin Hood hashing all have me doing sequential scans to find the thing I'm looking for. In cuckoo hashing it's more random, and we'll see why.

So again, really simple: say I have this array here that tells me, for each position in my hash space, a pointer to the start of the bucket chain. If I'm looking for a particular key, I hash it, find my offset, and follow the pointer. If I'm inserting and there's no room, I just append a new bucket at the end and extend the chain. Pretty straightforward.

The other classic approach is called linear probe hashing. Think of this as just a giant table of slots, a single block of memory. I hash to some offset into this array of slots, and when I land on the position where I should be, if nothing's there and I'm doing an insert, I insert right there and I'm done. If something is there, then I do a sequential scan until I find a free slot where I can insert what I want. If I'm doing a lookup, I land on my position and check whether the thing I'm looking for is in the first slot I land on. If yes, I'm done; if not, I do that scan until I find what I want, or I reach an empty slot, in which case I know there's nothing else, or I wrap all the way around, meaning I've seen everything.

So let's look at this table. Say these are all the keys I want to hash. The first one is A, and it hashes to this position; nobody's there, so I go ahead and store what I want. In this case I'm going to store not only the key but also the hash of the key, because I want to be able to quickly identify whether this is the thing I actually want, or whether I'm looking at something with a different hash value that just happens to be stored here. Now I store B, same thing: B has a free position, so it goes there. Now I want to insert C, and C hashes to the same spot where A is located, so I just do a sequential scan until I find the next free slot, which is the one right after, and that's where I insert C. Same thing with D: it gets here, C is already there,
so I insert it in the slot after. Now with E, same thing: I keep going down until I find a free slot, and then F goes here, and so on. Now if I'm doing a lookup to find, say, E: E hashes here, I see this is not the E I'm looking for, so I keep going down. I keep track of where I started, because if I scan all the way back around to where I began, I don't want to be stuck in an infinite loop; I need to know when I can terminate. And say E was actually not in the table and this slot was empty: as soon as I got there and saw it was empty, I'd know I'm done and my search can terminate.

This seems really simple, because it is, and it's actually going to be the fastest way to do this precisely because it's so simple; there are no tricks about copying things around. It sort of equally penalizes inserts and lookups, because I'm always just scanning through, but based on what the hardware can do, the hardware prefetcher can start bringing things into the cache as I'm scanning along, assuming I keep going down. So there are things the hardware can do to alleviate the penalty of having to do the scan. The hardware, though, is not going to be able to minimize the number of comparisons we have to do during that scan. Maybe I don't pay a big penalty in cache misses because things are already in my caches, but I still have to check whether my hash key matches what's in the hash table, and that sort of sucks. The other thing that sort of sucks is that to reduce the number of collisions, I need to allocate a large enough space that the likelihood the slot is empty when I insert something is high. We'll see this later when we talk about cost models for query optimizers: there's an example where Postgres tries to estimate the size of the hash table for your hash join, trying to pick the minimum size that you need, but if you get to the point where your collision rate is too high or the hash table is full, it basically has to create a new hash table and rehash everything into the new one. That sucks; that's a bunch of extra work because you didn't size your hash table correctly. So that's the big issue with all of these hash table schemes: you want to pick a hash table that's large enough that your collision rate is low, but not so large that you're blowing out your CPU caches or just wasting memory.
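Here is a minimal sketch of the linear-probing scheme just walked through, under my own assumptions (a power-of-two table that is never allowed to fill up completely; all names are my own). It stores the hash alongside the key, as described above, to cheapen the comparisons during the scan.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

class LinearProbeTable {
    struct Slot { bool used; uint64_t hash; uint64_t key; uint64_t value; };
    std::vector<Slot> slots_;

    static uint64_t Hash(uint64_t k) {          // any decent mixer works here
        k ^= k >> 33; k *= 0xff51afd7ed558ccdULL; k ^= k >> 33;
        return k;
    }

 public:
    // capacity must be a power of two so the mask below works.
    explicit LinearProbeTable(size_t capacity_pow2)
        : slots_(capacity_pow2, Slot{false, 0, 0, 0}) {}

    void Insert(uint64_t key, uint64_t value) {  // assumes the table never fills up
        uint64_t h = Hash(key);
        for (size_t i = h & (slots_.size() - 1); ;
             i = (i + 1) & (slots_.size() - 1)) {
            if (!slots_[i].used) {               // first free slot after the home slot
                slots_[i] = {true, h, key, value};
                return;
            }
        }
    }

    std::optional<uint64_t> Lookup(uint64_t key) const {
        uint64_t h = Hash(key);
        size_t start = h & (slots_.size() - 1);
        for (size_t i = start; ; i = (i + 1) & (slots_.size() - 1)) {
            if (!slots_[i].used) return std::nullopt;          // empty slot: not present
            if (slots_[i].hash == h && slots_[i].key == key)   // compare hash, then key
                return slots_[i].value;
            if (((i + 1) & (slots_.size() - 1)) == start)      // wrapped all the way around
                return std::nullopt;
        }
    }
};
```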
It's three hops away What's that he says swap E&A yes, so that's what's called Robin Hood hashing the idea Robin Hood hashing is that we're going to Steal the slots from from the rich keys Meaning keys that are closer to where they should be in my hash table and then give them to the poor keys Right the keys that are farther away And the goal here is to hopefully reduce the the variance in how far each key is away from where they They should be by again swapping things around and try to try to even things out So this is an old idea this goes back in the papers like 1985 or here it is in 1985 and Sort of no one paid attention to it and then it showed up on hacker news a few years ago And then a bunch of people thought this was a good idea And then it turns out it's actually a bad idea because you're gonna end up doing and do more work and much more copying than then Then you would have if you just sort of left things alone the way they are in linear probing right So what happened is for every single key we insert We had to keep track of how many positions we are from where we should be But we had a collision and then we use that when we do an insert and we look at to see all right Why should be here, but you're there How many how far away are you how far am I and if I'm far the way than you are then I steal your position And then you hit the top out and keep scanning down to figure out where you belong So let's look at the table so insert the same keys a goes here and again I'm storing the the number of jumps from where I should be for my first position along with the original key So a when I got inserted There's nobody else there. So it's it's number hops is zero or number jumps is zero Same thing with be be gets inserted at the top the number jumps for him is is zero So now insert C C should go here, but a is there So at this point C is zero jumps where it should be because we this is the first place We landed after we did our hash and so a is also where it should be with zero So at this point they're two equivalent We don't we don't take anything from a we leave a alone and we go and set Go ahead insert C here, but now we keep track that we're one hop away from where we should be Now insert D same thing well in this case here D is Zero hops where it should be C is one so C is greater than than zero So C is considered poor more poor than than the other one than D So D has to go here and it's one hop away Now we get to D E Again the very beginning a is zero hops E is zero hops with the equivalents we leave my loan Now we get here C is one hop E is one hop leave we leave C alone But now we get here D is one hop E is two hops So E is considered poor more poor than than D So it's allowed to shoot at them head You know steal its car and run away and then it takes it takes a slot And then D has to get inserted down here, and then now it's it's it's two hops away So now we insert F F should go here, but D's in here But so two is greater than zero so F can't steal a slot. 
so it goes down here. The intuition of this is that we're reducing how far each key is from where it should be. Now if I'm doing that hash lookup on E, whereas under linear probing I'd have to get all the way down to here before I found it, under Robin Hood hashing I only have to go here. This makes my lookups go faster, because I'm reducing the number of hops it takes to get there, but my inserts got slower, because in this case I only had to move a couple of keys, but I could potentially have to shuffle way more, depending on what's actually in my hash table, and it turns out that all that copying is terrible for the cache performance of this thing.

I won't name names, but we had a speaker come from a well-known database company two years ago, and they talked about how they were using Robin Hood hashing. We asked them why, and they said they saw it on Hacker News and thought it was a good idea. But the paper you guys read shows that it's actually not a good idea.

So these are all variants of linear probe hashing. Let's look at a different way to deal with collisions, one that doesn't do linear probing in the same way, called cuckoo hashing. The idea here is that instead of having one single table space that everyone's trying to insert into, we're going to have multiple tables, and each table is managed by a different hash function. Whenever I want to do an insert, I hash my key with both hash functions (it could be more than two), look in those two tables, see where I have a free slot, and choose the one where I can insert. If I don't have a free slot, then I pick a random victim to steal from, take its position, and make it move out. Now when I do a lookup, my lookup is always going to be O(1), because I take my key, hash it, and I'm guaranteed to find it in one of the hash tables if it exists.

Let's look at an example. You can have more than two hash tables with cuckoo hashing, but in practice everyone always uses two. Say I want to insert X. I take X, I have two hash functions, one for each table, and they hash to these positions. Both slots are free, so I flip a coin and pick one; in this case we pick this one here. Now I want to insert Y. I hash again with both functions; say the first hash function maps to where X already got inserted, but the second hash function maps to this other slot that's free, so rather than steal X's position I just go ahead and store Y over here. No problem. But now I'm going to insert Z, and Z hashes to this slot and to this slot, both of which are taken, so I flip a coin and pick a victim. Let's say I choose this side,
Let's say I pick this side, so Z steals that slot, and now the victim has to go back into the other table. When that key was hashed before, its other position mapped to where X is, so it goes and steals X's slot; X comes out, gets hashed again, and lands over there. Notice that I need to keep track of where I started: if X had mapped right back to where Z just went and I kept evicting the same keys between the same positions, I'd know I'm stuck in an infinite loop. So there's some extra metadata to keep track of as you do an insert, to make sure you're not just taking things out and putting them back in forever.

Now when I do a lookup — say I'm looking up Z — I hash with both hash functions, because I don't know which table it landed in, and it's guaranteed to be in one or the other. A key won't exist in both hash tables at the same time, and if a key exists in the set of keys we're tracking, I'm guaranteed to see it when I look in either table. So I don't have to follow a chain of anything; I always get an O(1) lookup.

So what's the penalty? Obviously we're maintaining two (or more) hash tables, and our inserts are more expensive, because a single insert may end up evicting and reinserting a large chunk of the key space, which will be slow. In practice the math works out that with two hash functions you can fill the table to about 50% before you'd need to rebuild it, and with three hash functions and three tables you can get to about 90% full. So you're paying extra memory for the additional tables, but you avoid rebuilds — unless the table fills up or you get stuck in a loop, in which case you have to resize. In practice most people size the hash table large enough up front that they never have to do a complete rebuild.
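Here is a minimal sketch of two-table cuckoo hashing in the same spirit — again my own illustration, not code from the lecture. It always tries table one first instead of flipping a coin, and it bounds the number of evictions to detect a cycle; a real implementation would resize or pick new hash functions at that point rather than throw.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Minimal two-table cuckoo hash sketch (no resizing, no duplicate checks).
class CuckooTable {
public:
    explicit CuckooTable(size_t capacity) : t1_(capacity), t2_(capacity) {}

    void Insert(uint64_t key, uint64_t value) {
        Entry e{true, key, value};
        // Bound the number of displacements so we can detect a cycle.
        for (size_t kick = 0; kick < kMaxKicks; ++kick) {
            size_t i1 = Hash1(e.key) % t1_.size();
            if (!t1_[i1].occupied) { t1_[i1] = e; return; }
            std::swap(e, t1_[i1]);     // evict the resident of table 1

            size_t i2 = Hash2(e.key) % t2_.size();
            if (!t2_[i2].occupied) { t2_[i2] = e; return; }
            std::swap(e, t2_[i2]);     // evict the resident of table 2
        }
        // Stuck in a loop: a real table would resize or rehash here.
        throw std::runtime_error("cuckoo insert cycle; rebuild needed");
    }

    // Lookups are always O(1): a key can only live in one of two slots.
    std::optional<uint64_t> Probe(uint64_t key) const {
        const Entry &a = t1_[Hash1(key) % t1_.size()];
        if (a.occupied && a.key == key) return a.value;
        const Entry &b = t2_[Hash2(key) % t2_.size()];
        if (b.occupied && b.key == key) return b.value;
        return std::nullopt;
    }

private:
    struct Entry { bool occupied = false; uint64_t key = 0; uint64_t value = 0; };
    static constexpr size_t kMaxKicks = 32;

    // Two cheap, independent integer mixes standing in for real hash fns.
    static uint64_t Hash1(uint64_t k) { k *= 0x9e3779b97f4a7c15ULL; return k ^ (k >> 29); }
    static uint64_t Hash2(uint64_t k) { k *= 0xc2b2ae3d27d4eb4fULL; return k ^ (k >> 31); }

    std::vector<Entry> t1_, t2_;
};
```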
All right, so where are we at this point? We've done our partitioning and split things up, and we've designed our hash table, which is the combination of a hash function and a hashing scheme. We populated it during the build phase, so now we move into the probe phase. In the probe phase we go through each tuple of the other relation — the probe side — hash it on the join key, look it up in the hash table we just constructed, and check whether we have a match. If yes, we know we can materialize the joined tuple as our output.

If our inputs were partitioned, then we assign each thread its own partition, because that way it operates on data that's guaranteed to be local to it and on the portion of the hash table that's local to it. If the input isn't partitioned, then you essentially have one cursor doing a sequential scan, or you break things up into morsels and each thread grabs a chunk of data and does whatever it needs to do with it.

One optimization we can do here — really the only optimization I'm aware of for the probe phase — is a trick from Vectorwise: you build a Bloom filter during the build phase and put it in front of the hash table, and you use it before you do your probe. You check the Bloom filter to see whether the key you're looking for might exist in the hash table; if yes, you go do the probe, and if no, you just stop.

To say that again: suppose I'm joining A and B. During the build phase, as I'm building my hash table on A, I'm also constructing a Bloom filter of all the keys I've inserted into that hash table. Then when I enter the probe phase, I first check the Bloom filter for my key. A Bloom filter can produce false positives but never false negatives, so if it tells me the key is not in the hash table, that answer is guaranteed to be correct and I can skip this tuple and go to the next one. If the filter says the key might exist, then I actually do the probe into the hash table. The trade-off is that if the selectivity of the join is low — most probe tuples don't find a match — then the cost of the extra instructions to check the Bloom filter is a big win; I think in some of our own research we've seen about a 2x improvement in the hash join. If everything is going to match, then it's just wasted instructions: I check the filter, it evaluates to true, I go do the probe, and of course I find what I wanted anyway.

In practice, Vectorwise I think always builds this, and we'll see a technique from the QuickStep system out of Wisconsin where they build Bloom filters for everything and use them to reorder operations based on selectivities. I think this is a good idea; we did it in the old system and we plan on doing it in the new system, if we don't do it already.
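A rough sketch of what that probe loop looks like with the Bloom-filter pre-check is below — my own illustration, not Vectorwise's implementation. The Bloom filter here is a naive two-hash bit vector, and the hash table type and the `emit` callback are placeholders for whatever the build phase produced.

```cpp
#include <array>
#include <cstdint>
#include <vector>

struct Tuple { uint64_t join_key; /* ... payload columns ... */ };

// Naive Bloom filter: two hash functions over a plain bit vector, built
// during the build phase alongside the hash table.
class BloomFilter {
public:
    explicit BloomFilter(size_t num_bits) : bits_(num_bits, false) {}
    void Add(uint64_t key) {
        for (uint64_t h : Hashes(key)) bits_[h % bits_.size()] = true;
    }
    bool MayContain(uint64_t key) const {
        for (uint64_t h : Hashes(key))
            if (!bits_[h % bits_.size()]) return false;   // definitely absent
        return true;                                      // possibly present
    }
private:
    static std::array<uint64_t, 2> Hashes(uint64_t k) {
        uint64_t a = k * 0x9e3779b97f4a7c15ULL;
        uint64_t b = k * 0xc2b2ae3d27d4eb4fULL;
        return {a ^ (a >> 31), b ^ (b >> 29)};
    }
    std::vector<bool> bits_;
};

// Probe phase: consult the Bloom filter first and only pay for the
// (likely cache-missing) hash table probe when the filter says the key
// might be there.
template <typename HashTable, typename OutputFn>
void ProbePhase(const std::vector<Tuple> &probe_side,
                const BloomFilter &filter,
                const HashTable &build_table,
                OutputFn emit) {
    for (const Tuple &t : probe_side) {
        if (!filter.MayContain(t.join_key)) continue;   // guaranteed miss
        if (auto match = build_table.Probe(t.join_key)) {
            emit(t, *match);                            // materialize output
        }
    }
}
```

When most probe keys do find a match, the `MayContain` check is pure overhead, which is exactly the selectivity trade-off described above.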
Okay, so now let's talk about the evaluation of all these different techniques. I'm not going to go through all the details of this table, but it covers the approaches we talked about — no partitioning, private partitioning, radix partitioning — and how to actually implement each one: at what points you need to synchronize, where the data actually gets stored, and how many passes you have to make over the data. This is straight from the paper, so we don't need to go through it in detail.

I just want to look at the results. It's an older paper now, but I think the results are still valid. They ran this on older hardware, but that hardware still looks like the hardware we have today — we just have larger caches and more cores; at a high level it's still a NUMA architecture. For the outer relation they use 16 million tuples of 16 bytes each, and they build the hash table on that; the probe side is 256 million tuples of 16 bytes each, and they have both a skewed and an unskewed version. This is also a case where they're not actually materializing the output: they do the join, see that they have a match, and immediately throw the result away; they don't do anything with it. In the paper you guys read, they actually copy the tuples together and materialize them as output, which I think is important for making the evaluation realistic.

The main takeaway is that, for this particular workload, they show that the radix partitioning approach gets the best performance. These results are old, but in general this approach should always perform the best; what it shows for the other two approaches is that you're doing a bunch of extra work and you're not getting a big win. Note that they're not measuring tuples processed per second; they're measuring the number of cycles spent per matched tuple. You end up spending fewer cycles with radix partitioning because you're not taking cache misses. In the no-partitioning case you can see that all the time is spent in the probe, because you're basically taking a cache miss every time you do a lookup into the hash table.
I like this graphic because it shows the trade-off: if I pay the penalty of partitioning up front — which the no-partitioning approach doesn't have — then the probe goes much faster, because everything is split up into cache-sized chunks. But no partitioning still does reasonably well: even though the hash table doesn't fit in the caches, the overhead of those cache misses isn't as problematic as you might think; you take more cache misses, but you still get okay performance. The radix approach ends up roughly 24% faster than no partitioning, because everything stays in your CPU caches — more instructions, but it still comes out slightly ahead of the other approaches.

And this graph here shows that when you have a skewed workload, no partitioning actually works really well. With skew, under the partitioned approaches a small number of threads end up responsible for a disproportionately large share of the data during the build and probe, whereas the no-partitioning approach doesn't care what the data looks like — it just lets every core rip through it — and that's why it performs the best here.

A couple of things we've ignored so far: when we do partitioning, how many partitions should we actually use, and how many passes do we want to take over the data in the partitioning phase? The spoiler is that you really only want to do one pass of radix partitioning, though you could do more; and how many partitions you want could depend on how many cores you have and what your caches look like (a back-of-the-envelope way to pick the partition count is sketched below). In a real system these are knobs the database system can set on a per-query basis: I know what my query is going to operate on, what tables I'm going to touch, and I have a rough estimate of my data distribution, so I can tune my query plan and set these things based on what's going to happen. It's not something you want to set globally and make everyone live with. We'll see this again when we talk about the query optimizer — although I'm claiming that, yes, if you knew what the data looked like you could make better predictions, those estimates end up being wrong often enough that being able to set these knobs doesn't always help you as much as you'd think.
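Here's a rough sketch of how a system might pick the radix-partition fan-out per query — purely my own illustration, under the assumption that you want each partition's hash table to fit in a target cache level; the constants and the fan-out cap are made up.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct PartitionConfig {
    uint32_t num_bits;        // radix bits used for one partitioning pass
    uint32_t num_partitions;  // 2^num_bits
};

// Pick enough partitions that each partition's hash table fits in the
// target cache (e.g. the per-core L2), but cap the fan-out so a single
// pass doesn't overwhelm the TLB / write-combining buffers.
PartitionConfig ChoosePartitioning(uint64_t est_build_tuples,
                                   uint64_t bytes_per_entry,
                                   uint64_t target_cache_bytes) {
    uint64_t build_bytes = est_build_tuples * bytes_per_entry;
    uint64_t parts = std::max<uint64_t>(
        1, (build_bytes + target_cache_bytes - 1) / target_cache_bytes);
    uint32_t bits =
        static_cast<uint32_t>(std::ceil(std::log2(static_cast<double>(parts))));
    bits = std::min<uint32_t>(bits, 14);   // arbitrary single-pass cap
    return PartitionConfig{bits, 1u << bits};
}

// Example: ChoosePartitioning(16000000, 16, 256 * 1024) asks for enough
// partitions that ~256 MB of build-side data fits in 256 KB chunks.
```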
So again, here I just want to show one-pass versus two-pass radix partitioning, and across the middle we have the size of our partitions. This black line is just no partitioning. You can see that if you do one pass and set your partition size to this value, you beat no partitioning; but everywhere else it's all over the map. For this particular workload it's hard to get right: this is the sweet spot you want to be in, but for different partition sizes you can land above or below the no-partitioning line. And it basically says you almost never want to do two-pass partitioning — you always want one pass.

All right, the last graph I want to show is the effect of hyper-threading. This is a good example where the hardware gives you more hardware contexts — more program counters — through hyper-threading, but you may not always want to use them, depending on which join algorithm you're running. This is running on a single socket. The first portion is up to six threads, where each thread gets its own physical core; beyond that, hyper-threading kicks in and you're essentially running two threads per core. If you don't do partitioning, you can take advantage of hyper-threading; if you do partitioning, hyper-threading actually makes you perform worse. What's going on is that once we partition, we end up CPU-bound on instructions, because we're not paying the penalty for cache misses anymore. The hyper-threads are just trying to get through instructions as fast as possible, and there are no stalls waiting for data to come from memory into the caches, so they're limited purely by how fast the core can execute instructions. In the no-partitioning case, the CPU can do a better job of multiplexing: I'm going to take cache misses, and while one of my threads is stalled on a cache miss, another thread can execute instructions and make forward progress. That's why it's still able to scale beyond the hardware core count.

So your question is: when you go from six to seven threads — the first extra hyper-thread — why does the partitioned version regress? And your suggestion is that it's because you now have more threads and therefore more cache misses. Good question — I thought that for this experiment they resized the partitions to account for the extra threads, but I might be wrong about that. I assumed the regression was from more cache misses, but you're right that more threads putting pressure on the caches could be the problem. I don't know the answer; I'd need to check whether the more recent paper you guys read still sees this effect.
I don't remember whether they ran this experiment or not. But the main takeaway from all of this, for me, is that no partitioning is the way to go. Remember that in the skewed workload it was always better, and here it was only marginally worse than the radix partitioning. To me you get the best of both worlds: you don't have to think about whether to partition or not, you just always do no partitioning, and it works well enough. That's what we did in our old system and what we're going to do in our new system, and as far as I know, no in-memory database actually tries to do both.

So, basically reiterating exactly what I said: the simplicity of a simple hash join — even though we're going to run things in parallel — is going to outperform any sort of complex partitioning scheme. Shared partitions, global partitioning, all of that is extra work. If we just go directly at the data and do our join with a simple hash table and a simple hash function, that's going to be the best approach. Now, there's a bunch of extra stuff we could try to do to vectorize things in our hash join, and we'll talk about that next week. In practice, the paper you'll read will say, "here's how to do a vectorized hash join," but when we actually implemented the same kind of thing, it didn't help, for the same reason as before: to use SIMD or vectorized execution you have to prepare the data, load it into SIMD registers, and copy things back out, and all that extra work negates the performance benefit you get from vectorized execution. Whereas if you just say, "here's my hash function, do linear probing, and rip through the data as fast as possible," that turns out to be the best. If I remember correctly, there are cases where hopscotch or Robin Hood hashing might come out ahead, but in general, across the board, linear probe hashing with something like MurmurHash3 is always going to be the best, because it's just so simple.

Yes — for the hashing algorithm: it seems like a lot of your joins would be on really small keys, like an 8-byte, 64-bit key of some kind. Is there a reason to use string-capable hash functions rather than something simpler? So his question is whether I care about larger keys — am I ever going to join on larger keys? Ideally you're never going to join on strings, because everything is going to be dictionary compressed. The issue is that my outer table is compressed with one dictionary and my inner table is compressed with another dictionary. I could decompress both of them, do my join, and then this matters; or I could decompress the inner table and recompress it using the other table's dictionary — but again, now you're paying a bunch of instructions. Then I could do my join on integers, on smaller keys, but I paid that penalty to decompress and recompress. How often in the real world are people actually joining on strings?
The question is how often you see people in the real world doing join lookups on strings. I don't want to say it's common, but people do weird things all the time, and you have to be able to support it. You see people use email addresses as primary keys and then join on them; it's not good data design, but people do it.

And then another thought: I don't know whether there are good papers on doing joins on compressed data. When you think about it, if I have to decompress the inner relation and then recompress it, that recompression is essentially another lookup into a hash table — the dictionary — and you could maybe reuse that work to help with the join. It won't find the tuple you're looking for, but it can tell you whether the key would match or not. A bunch of database companies have told us they just decompress everything for simplicity; other systems do the decompress-and-recompress dance. And if you're doing foreign-key joins and both sides use the same dictionary — depending on how the dictionary is scoped — then you can just join directly on the integer codes and skip all of that.

So what I'll say is: even if you're using 64-bit integers, compound keys are pretty common to join on, so 64-bit attributes are probably the right target; beyond that, I'd probably agree with you that larger keys are rare. But if you have eight 8-byte attributes, you're already at 64 bytes. And — I can't say who — a major cloud company did tell us they see people with crazy key sizes, like 20 bytes and things like that for a single attribute, like globally unique identifiers. We looked at trying to make hash joins faster in the Peloton system — one of my students was working on this — and it was really hard to beat the stuff people have already come up with, like the implementations from the German groups. You're trying to slice off cycles, nanoseconds; it's really hard to beat linear probe hashing. So we gave up on it; it was good enough.

All right, any questions? Okay, so next class is the second lecture on joins, and we're going to look at parallel sort-merge joins. What I'll essentially teach you is how to do a parallel sort; the merging process is pretty straightforward, because you're just walking through both sides in lock step. This is also why we don't have a class specifically on parallel sorting — you get it for free if you know how to do this. The same algorithm we use to do ORDER BYs we can use for our parallel sort-merge join. Okay? All right guys, enjoy the weekend. See you on Monday.