Okay, let's get started. At this point in the semester, with the exception of the last two classes, everything we're going to talk about relates to analytical queries — the kind of queries you run when you have a large data set and you want to derive new information from it. We've already covered how to deal with transactions when we talked about indexing and concurrency control. So for this week we're going to talk about parallel join algorithms.

This is going to be different from what we covered in the intro class, because there we didn't really talk about cores and threads or where data is actually being read from — we just talked about joins at the algorithm level. Now, for this class and Thursday's class, we're going to talk about how you actually implement an efficient parallel version of these algorithms, designed to run on multiple threads with data that's in memory. Today we'll focus on hashing, and on Thursday we'll do sort-merge joins. The subsequent lectures will be on other techniques you need for fast analytics.

So for today, we'll start with some background on join algorithms, at least the parallel versions. Then we'll go through an overview of what a parallel hash join looks like. Then we'll spend time on different hash functions and hash table implementations. Hash functions can involve some pretty gnarly math — we're not going to do that; we'll just look at some examples of modern hash functions that are out there that we could employ in our hash join algorithm. And then we'll finish up with the evaluation from the paper you read.

Again, in the intro class we talked about joins at an algorithmic level and measured the cost of different join algorithms strictly in terms of disk I/O. But now, when we look at an in-memory database, and in particular at parallel join algorithms, there are other things we have to consider to make sure we have an efficient implementation. The main idea of a parallel join algorithm is that we have two relations we want to join together, and we can run the join on multiple threads simultaneously to compute the answer more quickly. But as we'll see as we go along, we're not going to get linear scale-up with the number of cores, because these join algorithms often require synchronization or communication between the different threads. There are also stopping points, or barriers, where you have to make sure all the threads have finished one phase before you can go on to the next phase. So it's not like all these threads can just run independently — we need to be careful about how we implement the algorithm so that we reduce this overhead.

As I said, for this class we'll focus on the parallel hash join, and on Thursday we'll focus on the sort-merge join. We're not even going to bother discussing nested-loop joins. In the intro class, that's the first join algorithm we teach you, but in this world you never want to use a nested-loop join for analytics — it's always going to be either the hash join or the sort-merge join. And part of the reason is that it's just not that interesting.
Now, in the case of hash joins and sort-merge joins, most OLTP database systems don't actually implement these higher-performance join algorithms, because they don't need to. Think about an OLTP application where the database has customers, customers have orders, and orders have order items. You have foreign key relationships between all those tables, so whenever you join them, you're joining on the foreign keys. And the way the database system enforces foreign keys is to build an index. So any time you do a join between foreign keys in an OLTP system, you can do an index nested-loop join, which is super fast, super efficient, and basically equivalent to what a hash join does. Think about what the hash join actually is: I'm joining two tables, and I build an index — a hash table — that lets me do quick comparisons between tuples to see whether I have a match. Well, in an OLTP system you already have the index, so you just probe into it to find the few entries you need. So when there's a small number of tuples to join between the two tables, the index nested-loop join is basically equivalent to a hash join. That's why most OLTP systems don't implement a hash join: you don't need it. VoltDB doesn't have a hash join, TimesTen doesn't have a hash join. MySQL originally didn't have one either — they may have added it recently, but from the very get-go all they had was index nested-loop joins, and that was good enough.

So the main takeaway is that when we talk about these join algorithms, we're talking about large joins between two tables with a lot of tuples, where there isn't already an index on the join key that would let us do efficient lookups between them.

The story of hash joins versus sort-merge joins is a long-standing debate that goes back and forth in the database field. It's sort of like the SQL versus NoSQL argument — one side says they're better, then other systems come out and beat them, and it goes back and forth. In this table here, I'm basically showing, for each decade going back to the 1970s, which algorithm was dominant during that period. In the 1970s, the IBM folks working on System R wrote a paper saying you want to use sort-merge joins, because you got better performance with the hardware they had at the time. Back then you didn't have a lot of DRAM, and they didn't have a hash join algorithm that could spill buckets to disk — the early hash join algorithms required the hash table to fit entirely in main memory. You couldn't easily do that in the 1970s, whereas with sort-merge you could use external merge sort to dump things out, bring them back, and incrementally do the join. So the early papers claimed sort-merge was better. Then at the end of the 1980s, there were two major trends that caused hashing to become better.
The first was a database system out of Japan called GRACE, which developed the GRACE hash join algorithm, where you hash both sides and can spill partitions to disk. With that algorithm, hash joins now outperformed sort-merge. Also in the 1980s came the rise of specialized database machines — in modern parlance, think of them as appliances: a rack unit designed specifically to run your database system rather than commodity hardware. These early database machines had special co-processors or special instructions that let you do hashing very efficiently, and with that hardware acceleration the hash join outperformed sort-merge on the database machines.

Then at the end of the 1990s, there was a paper by Goetz Graefe — the same person behind the Volcano and Cascades work we talked about before — that basically said, for the hardware available then, sort-merge and hashing are essentially equivalent in most cases. And in the 2000s, and certainly where we're at now, the hash join is well known to outperform sort-merge. Our decade isn't over yet, but in most of the papers we're seeing, hash joins are always faster. That's why we're focusing on them today, but it's still worthwhile to cover sort-merge joins on Thursday — in part because the vectorized sorting algorithms we'll talk about then can also be used to implement ORDER BY and other operations in a database system.

The three major papers on the reading list that cover this back-and-forth between sort-merge and hashing are listed here. One of the first, from 2009, was a collaboration between Oracle and Intel. They showed that the hash join, given the CPUs Intel had at the time, was always going to be faster than sort-merge. But they postulated that if Intel chips came along with wider SIMD registers — and we'll talk about SIMD more on Thursday and next week — then, by packing more items into the SIMD registers and using vectorized instructions, sort-merge would actually end up being faster than hashing. We had a student last semester implement the parallel merge sort algorithm we'll talk about on Thursday using the current AVX2 SIMD instructions, and the problem is you can't fit 64-bit tuple addresses nicely into the current SIMD hardware, and that causes the sort-merge to go slowly. But if the 512-bit registers come out, then we think sort-merge will actually be better — which is essentially what that paper predicted. The HyPer guys had a paper in 2012 claiming that sort-merge already beats the hash join even without SIMD. I've since talked to them, and they told me this is no longer true; the current version of HyPer uses a hash join. And then the paper you read for today was about the radix hash join and some other newer optimizations.

So you may look at this and say: Andy, these papers are like five years old — hasn't the hardware changed enough to invalidate all of this? And I would say not really, because when you look at Intel's chips, they're adding some new SIMD features and some other new specialized instructions.
But at the end of the day, it's essentially the same as it was five years ago. They add some new SIMD instructions, but SIMD is essentially the same, and the clock speed isn't getting much faster. So a lot of the algorithms and findings from these earlier papers, even though they used older CPUs, are I think still valid today.

So what are the main design goals for our join algorithms? This matters regardless of whether we're doing hash joins or sort-merge joins. The first goal is to minimize synchronization: we don't want the threads executing our join algorithm to have to communicate with each other, because that requires shared resources, which then requires us to take latches any time we update things. The second goal, which is also very important, is to minimize CPU cache misses. Again, in the intro class we don't talk about any of this — we only talk about reducing the amount of disk I/O needed to read blocks from disk. But now we assume everything is in memory and we're running on multiple cores, so it's these other things we have to care about. Cache misses are actually very important: you'll see time and again in today's lecture how partitioning and other techniques are used to avoid polluting a core's cache, so that each core only operates on data that's local to it. The paper you read didn't use a NUMA system, but later papers on hash joins also talk about how to avoid traffic over the interconnect between sockets from having to grab data that isn't local to you.

For improving cache behavior, there are essentially two things to take into consideration if you want to reduce cache misses in a database system. The first is that we need to be mindful of the size of our CPU caches, and in particular of the TLB — the translation lookaside buffer. If you haven't taken an OS course in a while: the TLB is a small cache in the CPU that maps virtual addresses to physical pages. It's a limited space, so if you start reading a bunch of pages spread all over memory, your TLB gets thrashed — it keeps swapping out old entries for new ones. The bad thing about poor cache locality is that you don't just pollute your CPU cache, you pollute your TLB too. So when you go to read another page or cache line, you can get a miss on the TLB entry you need, and then another miss to actually get the cache line — two misses from reading one item in memory. That's really bad. The other thing we have to deal with is locality. This means that if I bring something into my CPU caches, I want to do as much work with it as possible, and ideally I want to read contiguous strides in memory so the hardware prefetcher can bring it all in for me and reduce the penalty of cache misses. So there are really two types of access patterns we have to deal with.
The first is the non-random access pattern — essentially a sequential scan. We can maximize this by laying out the entries a particular worker thread needs to read contiguously in memory, so it can just scan across them. Then we also have to deal with random access lookups, which is where we're probing into our hash table or inserting entries into it, jumping around to different random locations in memory. For those we get no hardware prefetching, because there's no sequential scan. So what we want to do is minimize the amount of data a worker thread has to process in any given phase or step of our join algorithm, so that it all fits in its cache and TLB. This is very different from the disk-based join algorithms, where it was all about how many pages I have to fetch into my buffer pool.

All right, so for the parallel hash join: I would argue this is probably the most important operator in an analytical database. As we'll see in a second, it's the one the database spends most of its time executing, so it's really important that we can scale it up and have an efficient implementation. It's the long pole in the tent we want to tackle to reduce the latency of analytical queries on large data sets. That means we want to take advantage of all the extra CPU cores we have, and we want them running at 100% utilization — not stalling waiting for data from other threads or cores, and not stalling on cache misses. We want all these cores running at whatever full speed the memory controller can provide. We want to be memory bound, not CPU bound.

To give you an idea of just how important hash joins are: I had a student do a study where they took profile traces generated from Cloudera Impala, which is a shared-nothing OLAP database system built on top of Hadoop / HDFS, and measured where the CPU time went during query execution for the TPC-H benchmark. Now, I fully realize Impala is a distributed database system, whereas we've been talking about single-node systems, so these measurements don't include anything about networking or shuffling data around — this is strictly where the CPU time was spent executing these queries. We break it down into five categories: the hash join, the sequential scan, the union operator that merges results from the different select operators, aggregation, and "other," which is a catch-all for code generation, compilation, and the other utility work needed to set up the query. The pie chart looks like this: almost exactly 50% of the time in the system is spent doing the hash join. For the sequential scan, you might think it would be higher, because you're just scanning through large amounts of data trying to find the tuples that match. When we talk about vectorization and query compilation, we'll see how they reduce this sequential scan time.
Normally the scan time would be much higher, but they have an efficient implementation. In the case of the hash join, though, there's not much magic you can do beyond the parallelization techniques we'll talk about here to make it go faster. So 50% of the time executing these queries is spent on the hash join — that's why we're spending all our time today on how to make it more efficient and run it in parallel. Marcel Kornacker is the lead architect of Impala, and he's a guest speaker coming to the class at the end of the semester, so he'll talk more about what Impala does.

All right, so the parallel hash join is comprised of three phases. In the first phase, we partition the two tables we want to join into small subsets. The HyPer guys call them morsels; you can call them shards, partitions, buckets — essentially the same thing. We divide the tables into disjoint subsets that are designed to fit within the cache of a single core. To do this partitioning, we use one hash function on the join key, applied to both tables, so that tuples that will join with each other end up landing in the same partition. This phase is completely optional — as we'll see in a second, it doesn't always provide the best performance, so you may not want to do it. The second phase is the build phase, where we build the hash table on the outer relation, hashing on its join key with a different hash function than the one we used in the first phase. Once we have that, we enter the probe phase, where we go through every tuple in the inner relation S, hash its join key with the same hash function we used for the build, check whether we have a match, and if so, combine the two tuples and emit them as output. I'll go through each of these three phases in more detail, but we'll spend most of our time on the partition phase and on how you can implement the hash table for the build phase.

As I said, the partition phase is completely optional. The query optimizer can look at what the join is trying to do and what your data actually looks like, and decide whether it wants to partition or not. And again, it's important that the hash function we use for this phase is different from the one in the build phase, because otherwise all the keys in a partition would map to the same small set of buckets in the hash table and we wouldn't get the behavior we need. The key idea behind the partition phase is that the cost of scanning through the data and partitioning it up front will be lower than the cost of not partitioning and eating all those cache misses during the build phase. This seems crazy, right? We're going to spend time passing over the data at least once without doing any of the join, break it up and copy things around, and then go through it again to actually do the build. But the argument is that going out to memory is so expensive that it's better to do all that work up front over nicely cache-aligned partitions, so that when we come back to actually build the hash table, we don't incur cache misses.
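Before we dig into the partition phase, here's a very rough, single-threaded sketch of the whole three-phase pipeline, just to keep the big picture in mind. The types and names are made up for illustration, and a real system would not use `std::unordered_multimap` — an actual implementation runs each phase across many worker threads with a barrier between phases:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { uint64_t join_key; uint64_t payload; };

std::vector<std::pair<Tuple, Tuple>>
HashJoin(const std::vector<Tuple>& R,     // outer (build) relation
         const std::vector<Tuple>& S) {   // inner (probe) relation
  // Phase 1 (optional): partition R and S on hash1(join_key) so that
  // matching tuples land in the same partition. Omitted in this sketch.

  // Phase 2: build a hash table over the outer relation R using hash2.
  std::unordered_multimap<uint64_t, Tuple> ht;
  ht.reserve(R.size());
  for (const Tuple& r : R) ht.emplace(r.join_key, r);

  // Phase 3: probe with every tuple of S using the same hash2, emit matches.
  std::vector<std::pair<Tuple, Tuple>> out;
  for (const Tuple& s : S) {
    auto [lo, hi] = ht.equal_range(s.join_key);
    for (auto it = lo; it != hi; ++it)
      out.emplace_back(it->second, s);    // key comparison done by the map
  }
  return out;
}
```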
So what we end up putting into these partition buffers depends on the storage model we're using. In the DSM case, the only thing we need is the offset of the tuple and the join key. That's really easy to copy out, because as we scan the column we already know what offset we're at, and we only copy the single attribute we need for the join. In the NSM case, you either copy the entire tuple, or you copy just the subset of attributes you need into the buffers. With DSM you can also use vectorized execution to speed this process up. For our purposes going forward, it doesn't matter which of these two you use in the examples I'll show — the main thing to point out is that with DSM, populating the buffers in the partition phase is much cheaper than with NSM.

Now, to do the partitioning, there are essentially two approaches. The first is called non-blocking partitioning, where you scan through the relation once, populate the output partitions, and another thread can already start reading what you've written into those partitions and populating the hash table for the build phase. It's called non-blocking because we don't have to wait for the partitioning phase to finish before the build phase can start: the build phase is just taking tuples and putting them in the hash table, and it doesn't need to have seen every tuple before it starts, because it's not actually computing the join yet, so it can't produce false negatives. You can't do that in the probe phase, but you can with the build phase. To be clear, non-blocking doesn't mean what we meant in the concurrency control lectures, where writers don't block readers — it just means that starting the second phase isn't blocked on finishing the first phase. As we'll see in a second, we still need latches to protect threads from each other when they write into shared partitions.

The second approach is blocking partitioning, sometimes called radix partitioning. In the literature you'll see "radix join" or "radix hash join" — whenever you see that, they mean this approach. Here we scan the input relations multiple times. In the evaluation we'll see one-pass and two-pass variants; you could do more passes, but that would obviously be wasteful. We only hand the results off once the partitioning is completely finished, and then the next phase starts. That's why it's blocking: you can't start the build phase until this phase is done.

For non-blocking partitioning, there are two sub-approaches. The first is to have a shared global set of partitions that all the threads write into as they scan the input relations.
Of course, with multiple writers that means you need a latch on each partition — or on whatever structure you're using to build them — to make sure threads don't interfere with each other and overwrite changes. The other approach is private partitions, where every thread has its own set of partitions it writes into. You don't need to synchronize, because no other thread will be writing to those partitions at the same time, but then you need a second pass over all the per-thread partitions to generate the global partition space. So again, there's no free lunch: in the shared case you need latches but only pass over the data once; in the private case you don't need latches but you pass over the data twice.

Let me go through an example of each. Say our table has a bunch of tuples. The first thing we do is split the data into disjoint subsets, in the same manner as the morsels or shards from before — assume the table is contiguous in memory and we just subdivide it so the first thread looks at the first thousand tuples (or whatever the chunk size is), and so forth for the others. Now, in the shared-partition case, say we're joining on attribute B: each thread scans through, looks at the key value in B, hashes it, and that tells it which partition the tuple should go to. So for the first thread, some tuples go to partition one, some to partition two, and some to the last one, and all the threads are doing the same thing — hashing their keys and writing into the shared partition space. That means we may have to take a latch on a bucket every time we write, because another thread might be writing to it at the same time.

With private partitions, you instead extend each core's storage area to include its own private partitions. Everyone has the same number of partitions, and you hash the join key as before, but instead of writing into a global space, you write into memory that this thread (this core) controls. Then, when that finishes, you wait for everyone to be done with this part, and you assign a partition number to each core; that core then goes and collects that partition from each of the thread-local storage areas. So CPU one gets partition one, and it looks at partition one here, here, and here; core two gets partition two and looks at partition two here, here, and here. Again, no synchronization is needed, because when each core writes its partition out to the global shared space, it's the only thread doing so. We avoid latching, but we make two passes and two copies of the data.

This is where the information you store in your output buffers really matters: in the DSM case, the only thing we need to store is the join key and the offsets of the tuples, whereas in the NSM case we may have to copy whole tuples — which means copying them twice, and that can be expensive.
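To make the two non-blocking variants concrete, here's a rough sketch under the DSM assumption that the buffers hold only (join key, tuple offset) pairs. The names, the latch-per-partition layout, and the placeholder hash are my own simplification, not the paper's code:

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

struct Entry { uint64_t key; uint64_t offset; };   // DSM: join key + tuple offset
constexpr size_t kNumPartitions = 64;

// Placeholder partitioning hash ("hash function #1" in the lecture's terms).
inline size_t PartitionOf(uint64_t key) {
  return (key * 0x9E3779B97F4A7C15ULL) % kNumPartitions;
}

// (a) Shared partitions: one pass over the data, but every write grabs the
//     latch protecting the target partition.
struct SharedPartitions {
  std::vector<Entry> part[kNumPartitions];
  std::mutex         latch[kNumPartitions];
  void Insert(const Entry& e) {
    size_t p = PartitionOf(e.key);
    std::lock_guard<std::mutex> guard(latch[p]);
    part[p].push_back(e);
  }
};

// (b) Private partitions: each thread fills its own buffers latch-free, then a
//     second pass stitches them into a global partition space. Each worker can
//     be assigned one partition id, so the merge also needs no latches.
using ThreadLocalPartitions = std::vector<std::vector<Entry>>;  // kNumPartitions buffers

void MergePrivate(const std::vector<ThreadLocalPartitions>& per_thread,
                  std::vector<std::vector<Entry>>& global) {
  global.assign(kNumPartitions, {});
  for (size_t p = 0; p < kNumPartitions; p++)
    for (const ThreadLocalPartitions& local : per_thread)
      global[p].insert(global[p].end(), local[p].begin(), local[p].end());
}
```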
All right, so now for radix partitioning. The basic idea is that we take multiple passes over the data, and we look at the radix of the hashed key values to decide which partition to assign a tuple to. If you don't know what a radix is, I'll explain it in a second. Any time you see someone say they're doing a radix hash join, they mean this approach. There's also a radix sorting algorithm that does essentially the same kind of partitioning we're talking about here.

So what happens is, say we take relation R. The first step is to hash the keys of all the tuples. Then we compute a histogram of how many tuples get assigned to each partition based on their radix at one particular digit position. Using these histograms, we compute what's called a prefix sum to figure out the starting write location for each partition in one giant global output array. Once we finish that pass, we can go back and do the whole thing again, sliding the radix position over to do further sub-partitioning.

A radix is a really easy thing to understand: it's just the value of the digit at a particular position of a number. Say our input sequence is 89, 12, 23, 08, 41, 64. The radix at the first position is just that digit for every number: 9, 2, 3, 8, 1, 4. The radix at the next position is 8, 1, 2, 0, 4, 6. The CPU can compute this very cheaply — for hash values it's just a shift and a mask — so it's a trivial operation.

A prefix sum is essentially a running total over a sequence of numbers: you take the previous running total and add your current number, and that's the prefix sum at that position. Say our input sequence is 1, 2, 3, 4, 5, 6. At the first position the prefix sum is just 1, because that number has no predecessor. At the second position, it's the previous prefix sum plus the current value, which is 3, and we just continue that addition down the line: 1, 3, 6, 10, 15, 21. We'll see in a second why this matters: when we do our radix partitioning, the prefix sum is what lets each thread figure out exactly where to write its data into the partitions without having to synchronize at all.
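Here's roughly what one radix-partitioning pass looks like in code, using the histogram and prefix-sum steps just described, before we walk through the two-core example on the slides. This is a simplified single-threaded sketch with made-up names; the parallel version computes one histogram per thread and prefix-sums across all of them so each thread gets a disjoint range of the output array to scatter into, and a real version carries tuple identifiers alongside the hash values:

```cpp
#include <cstdint>
#include <vector>

std::vector<uint64_t> RadixPartition(const std::vector<uint64_t>& hashes,
                                     unsigned radix_bits, unsigned pass) {
  const size_t fanout = size_t{1} << radix_bits;
  const unsigned shift = pass * radix_bits;              // which radix digit to use
  auto radix = [&](uint64_t h) { return (h >> shift) & (fanout - 1); };

  // Step 1: histogram of how many keys fall into each partition.
  std::vector<size_t> hist(fanout, 0);
  for (uint64_t h : hashes) hist[radix(h)]++;

  // Step 2: exclusive prefix sum -> starting write offset of each partition.
  std::vector<size_t> offsets(fanout, 0);
  for (size_t p = 1; p < fanout; p++) offsets[p] = offsets[p - 1] + hist[p - 1];

  // Step 3: scatter every key to its partition's next free slot.
  std::vector<uint64_t> out(hashes.size());
  for (uint64_t h : hashes) out[offsets[radix(h)]++] = h;
  return out;
}
```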
So let's work through it. Say we have the same input sequence as before, but we've already hashed it, so these are the hash values, and say we have just two CPU cores. The first thing we do is subdivide the input between the cores, and each core computes a histogram that tells us how many elements fall into each partition for that core's chunk, based on the radix at the first position. If you look here, the radixes are zero, one, one, zero. So when this core scans its chunk, it computes a histogram that says: for partition zero, I have two elements (this one and this one), and for partition one, I have two elements. You're basically taking the radix value and modding by the number of partitions you want — you don't need a complicated hash function at this point, because you've already hashed the key.

Now we need to build the big output array that tells us where to write out the tuple identifiers. For this, we compute the prefix sum over the histograms from each core, without any synchronization, to figure out where each of them should start writing. First of all, we know how much space to allocate, because the histograms tell us the total number of tuples we're going to process. Then, for partition zero: core zero has two elements, so it writes at these first two positions; partition zero at CPU one has one element, so it writes right after that. These starting positions are just the prefix sums, and then each core scans through its chunk and populates the array with the values. In the slide I'm showing the hash keys, but really these should be the tuple identifiers — that's what we actually need in order to do the join, because you can't do the join with just the hash keys.

So hopefully it's clear what radix partitioning actually is — it's a pretty simple, clever way to divide the input into these nice chunks. And we can keep doing this, pass after pass, until the partitions fit within a couple of cache lines, so that when we scan over them to populate the hash table in the build phase, we don't have any cache misses. We can keep going until we get the number of partitions we want, but as the paper found, beyond about two passes you're not really going to see a difference.

All right. So now we've finished the partition phase — and again, it's completely optional; the database may decide not to do it. Now we actually want to build the hash table. To do this, we take the outer relation R, and for every single tuple — scanning either sequentially over the morsels or over the partitions completed in the previous phase — we hash the join key attribute and assign the tuple to a bucket or slot in our hash table. And again, it's important that this hash function is different from the partitioning hash function, because otherwise everything in a partition would land in the same few buckets. We also want our buckets to span only a few cache lines, to keep the number of cache misses down.

So I want to spend a little time now talking about what we can use for hash functions. When we teach hash joins in the intro class, we don't bring up hash functions at all — we basically tell you that "mod n" is good enough. But here we want something that's fast, and we want something that has a low collision rate.
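Just to give a flavor of what these non-cryptographic hash functions look like compared to something like SHA-1, here is (written from memory) the 64-bit finalizer step, "fmix64," from MurmurHash3 — a handful of xor-shifts and multiplies, which is why these functions are so cheap. This is not the full algorithm; the real MurmurHash3 does more mixing over the input blocks before this step:

```cpp
#include <cstdint>

// MurmurHash3-style 64-bit finalizer: scrambles the bits of a 64-bit value
// so that similar inputs end up spread uniformly across the hash space.
inline uint64_t fmix64(uint64_t k) {
  k ^= k >> 33;
  k *= 0xff51afd7ed558ccdULL;
  k ^= k >> 33;
  k *= 0xc4ceb9fe1a85ec53ULL;
  k ^= k >> 33;
  return k;
}
```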
So let's look at the modern hash function implementations we could use to hit that goal. The key thing to point out is that these are not the cryptographic hash functions you learn about in a security class — not SHA-1, not MD5. We're talking about functions that take an arbitrary byte sequence and produce a value with a roughly uniform distribution over the hash space. You wouldn't use SHA-1 or SHA-256 for a hash join, because it would be way too slow; the algorithms we'll look at here are much, much faster.

These are probably the four most widely known hash functions used in database systems — at least the first three are; the last one is newer, and I'll get to it in a second. MurmurHash came out around 2008 and was designed to be a general-purpose hash function that's really fast. It isn't trying to be cryptographically hard to break — it just tries to produce a uniform distribution over the hash space. MurmurHash was invented by some random guy on the internet; he just put it out there, people found it useful, and it got really popular. Then around 2011 the Google folks caught wind of it and decided to see whether they should replace their internal hash functions with it. They ended up coming up with their own hash function called CityHash, which is based on version two of MurmurHash (MurmurHash is currently at version three). What they did was design the algorithm to be especially efficient for the really short keys they saw a lot of in their workloads — keys under 64 bytes, and it really shines on something like eight-byte keys.

The thing that made MurmurHash so much faster when it came out was that the author optimized the hash function to take advantage of modern CPU hardware — cheap wide multiplies and other fast instructions — and that's what gave it better performance than everything else out there. Then the Google guys made one that was better for short keys, and in 2014 they released a newer version of CityHash called FarmHash, which trades a little performance for a better collision rate. The collision rate really matters, because if disparate keys all hash to the same bucket, you have to do a linear search to find the thing you're actually looking for. You can avoid collisions by allocating a really large hash table, but then you're just wasting memory. All of these are open source, and FarmHash is the latest one from Google. And then last year a new hash function called CLHash came out from Daniel Lemire, a professor in Canada.
CLHash is fundamentally different from the other hash functions, because instead of ordinary multiplication it's based on carry-less multiplication — multiplication where you don't propagate the carries between the partial products. What makes CLHash interesting is that only recently did Intel and AMD add instructions to do carry-less multiplication directly on the CPU. Before, you had to execute multiple instructions to do it; now it's a single instruction, and that's why this ends up being fast.

This graph is something I generated last night using an open-source hash benchmark framework that some random guy on the internet created. It implements a bunch of these hash algorithms in the same harness and scales up the size of the keys being hashed, from zero to about 200 bytes, to see how fast each one can process data. The measurement is throughput in megabytes per second — not hash invocations per second, because as you increase the key size, each invocation obviously takes longer. The comparison is the standard STL hash, MurmurHash3, CityHash, and FarmHash; I'll show CLHash on the next slide.

There are a couple of interesting things to point out. You see these spikes at regular intervals as the key size grows. Anyone want to guess what's going on here? Why would it suddenly get faster and then slower again? What's that? No, not cache misses — sort of, but not really. Word alignment, right? You're hashing 32 bytes, then it goes to 33, and now you need an extra fetch to get things from memory and the throughput falls apart. If you look at CityHash and FarmHash, they peak at 32 bytes and again at 64 bytes, and after that they peak roughly every 64 bytes; as soon as you spill over into the next word size, throughput drops and then climbs back up. MurmurHash, on the other hand, is pretty flat across key sizes — you don't see the same benefit. So the main takeaway is that MurmurHash is a good general-purpose hash; it's currently what we use in our system. But for smaller keys, CityHash and FarmHash are the way to go. And you can see CityHash outperforming FarmHash here, but FarmHash has a lower collision rate, which this benchmark doesn't capture.

All right, now we throw in CLHash. The author of the benchmark originally wrote it two years ago with just those hash functions, so last night I added CLHash, and you see this trend: below 64 bytes, CLHash is only occasionally better than MurmurHash, and it has the same alignment issues. But once you go to bigger keys — and they claim they're the best at four-kilobyte keys — it gives you the best performance, because it's exploiting the carry-less multiplication property and the low-level instructions that make it go really fast. I actually messaged the author last night because I was curious — I didn't expect it to be that much slower on keys under 64 bytes.
He claimed I wasn't running on new enough hardware — there are newer Intel chips that came out in 2017, and this was run on 2016 chips. I don't know whether it would be much better on the newer ones, but that's something we could look into. The main takeaway is that it depends on what your database system looks like, but I would say CityHash or FarmHash are probably what you want, because you're mostly going to be hashing 32-bit or 64-bit values. Say you want to join on a VARCHAR field: you wouldn't store it as a raw VARCHAR and join on that; you'd use dictionary encoding to compress it into a 32-bit or 64-bit integer, and that falls exactly in the sweet spot of the Google algorithms. For our own system, we'll probably switch to CityHash at some point over the summer.

All right, so that's our hash function. Now we can talk about how we actually implement the hash table. There are basically three main hash table designs I want to show you. If you go on Wikipedia there's a ton of different hash tables that do all sorts of things, but I want to focus on the three that actually show up in real systems.

Actually, let me say something first. In the intro class we talk about extendible hashing and linear hashing. Those are dynamic hash tables designed to grow and shrink over time as the data changes. In our world, we don't want to do that, because resizing forces us to synchronize our threads while we change the hash table structure. With the chained hash table, resizing is actually trivial, but with the open-addressing and cuckoo hash tables, the idea is that we can use the histograms we generated in the partition phase to determine how many keys we could possibly have, and then allocate a table of that size up front so we never have to regrow it later. We saw this problem in the query optimizer paper from the HyPer guys: they showed how, if Postgres gets its cardinality estimate wrong, it allocates a hash table that's too small, and you end up with a lot of collisions that slow everything down. So having good estimates matters. With the latter two approaches, dynamic resizing is harder, but if you pick the right size, they're always going to outperform the chained table.

The chained hash table is really simple to understand. You maintain a linked list of buckets for each slot in the hash table, and when you hash a tuple into a slot, you find the next free space in a bucket along that linked list. If you get collisions, you just keep appending to the linked list. Deletions and lookups are the same thing: a linear search along the list to find the thing you're looking for. This is the basic hash table everyone knows from intro classes: the slot points to the first bucket in the chain, and then you do a linear search to find the entry you want.
One practical note: you usually want to keep each bucket within a cache line, which limits how much you can put in it.

A different design is the open-addressing hash table. Really, "open addressing" is a whole category of hash tables — the cuckoo hash table I'll show in a second is also a variant of open addressing — so for the one I'm showing here, think of it as one giant array of slots. When we hash a key and land at a position in the array, if something is already there, we just do a linear search forward in the array to find the next free slot and put our entry there. Anybody doing a lookup follows the same process: hash the key, jump to that position, compare to see whether the thing they're looking for is there, and if not, keep scanning down. Of course, you need to prevent lookups from scanning forever, so we also store the key itself in the table; that way we can tell whether a slot holds our match or some other key, and we know to stop when we hit an empty slot. Insertions and deletions work basically the same way as the lookup.

So it looks like this: say we have three keys, x, y, and z. We hash x, and that tells us to go here; y goes there. Now we hash z, and z wants to go here, but something's already there, so we just put it in the next open slot down. Now, if I want to look up z, I hash it and land here; it's not z, but I know I'm still in the run of slots that could contain my key, so I keep scanning down until I find the thing I'm looking for — or until I hit a slot that tells me I've reached the end of the run and it isn't there. The nice thing about this structure is that you don't have to synchronize much: you can use a compare-and-swap to try to claim a slot, and if you fail, you just keep moving down and retry.

The obvious problem with collisions is that they cause wasted work: on a collision I have to compare the hash value, and then compare the actual key, to see whether this is the thing I'm looking for. In the chained hash table, if you have a really long linked list, you're doing that comparison over a long linear scan just to find one entry. So we want to avoid collisions by being a bit smarter about how we organize the hash table. One easy way is to make the hash table really large: the math works out that if the table is about twice the size of the number of elements you expect to insert, you'll have very few collisions. But then you're allocating a lot of memory that's never really used.
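Here's a minimal sketch of the linear-probing flavor of open addressing described above, again with hypothetical names, no deletes, and the assumption that the table was sized generously enough up front that inserts always find a free slot:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

class LinearProbeHashTable {
 public:
  explicit LinearProbeHashTable(size_t capacity) : slots_(capacity) {}

  void Insert(uint64_t key, uint64_t tuple_id) {
    size_t i = Hash(key) % slots_.size();
    while (slots_[i].occupied) i = (i + 1) % slots_.size();  // walk to next free slot
    slots_[i] = {key, tuple_id, true};
  }

  std::optional<uint64_t> Probe(uint64_t key) const {
    size_t i = Hash(key) % slots_.size();
    while (slots_[i].occupied) {                  // stop at the first empty slot
      if (slots_[i].key == key) return slots_[i].tuple_id;
      i = (i + 1) % slots_.size();
    }
    return std::nullopt;                          // hit an empty slot: no match
  }

 private:
  struct Slot { uint64_t key = 0; uint64_t tuple_id = 0; bool occupied = false; };
  static uint64_t Hash(uint64_t k) { return k * 0x9E3779B97F4A7C15ULL; }  // placeholder
  std::vector<Slot> slots_;
};
```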
A better alternative is what's called a cuckoo hash table. The basic idea is that we have multiple hash tables, each with its own hash function. Whenever we want to insert something, we hash the key with every hash function and look for a free slot in any of the tables; if there's a free slot, we put it there. If there isn't, we steal a slot from somebody else, and the evicted entry then has to get re-inserted using the other hash functions. The idea is that even if you keep evicting keys and re-hashing them, eventually you'll reach a key that has a free slot to land in. Lookups and deletions are always O(1), because hashing the key tells us exactly the few places where it could possibly be. Insertions are trickier, because we could end up evicting key after key just to insert the one thing we wanted, and we have to recognize when we've looped back around with no free slots, otherwise we could keep going forever.

Let's look at how this works with a simple cuckoo hash table that has just two tables. From the programmer's standpoint, you don't see two hash tables — you see a single hash table API: insert key, delete key, get key — but internally it maintains and checks multiple tables for you. Each table has its own hash function; it can be CityHash, MurmurHash, whatever — they just need to be seeded (salted) differently so they produce different values. Say the first operation is inserting element X. The first hash function on X points here; the second hash function points there. We have two free slots where we could put it, and the algorithm basically says flip a coin, because it doesn't matter which side we pick when both are free — say we pick this side. Now we insert Y: again we run it through both hash functions; the first one points at the slot occupied by X, but the other slot is empty, so we write the value on that side. Now we insert the last element, Z, and hash it twice — but both hash functions map to slots that are already occupied. So what should happen here? We have to decide which occupant to evict, because Z has to go into one of these two slots. Flip a coin; say we kick out Y. We take Y out and put Z in, but now we have to re-insert Y. We know which table it came from, so there's no point re-hashing it with that table's hash function — it would just go back to where we took it out. So we use the other hash function, and now it maps to where X is sitting, which is occupied.
So again, Y steals that slot: X comes out, Y goes in, and now we have to re-insert X, so we hash it with the other table's function. Cuckoo hashing is just bouncing back and forth like a cuckoo clock. Eventually X lands in a free slot and we're done — our insert of Z has completed, because Z made it in along with everything we had to evict along the way. This turns out to be really efficient and really fast, and it's one of the hash tables most often used in modern systems.

Of course, we want to make sure we don't get caught in an infinite loop — we need to recognize when we've already tried to evict something before and we're not going to succeed. If you detect that you're in a cycle, you basically have to rebuild the entire hash table with new hash functions, which means dumping everything out and re-inserting it; that's the worst-case scenario. If you have two hash functions, and therefore two tables, then statistically you'll probably only need to rebuild once the table is about 50% full. I say "probably" because it's a randomized algorithm — anything can happen; it depends on a lot of moving parts — but statistically it's around 50% occupancy. With three hash functions, meaning three tables, you probably don't need to rebuild until you're about 90% full. So again, there's a trade-off between CPU and memory: we can allocate more memory and have fewer collisions, but then we're wasting space that could be used for other queries or other parts of the system.

There's actually an awesome open-source implementation of a concurrent cuckoo hash table that we use in Peloton; it was written by Dave Andersen and his team here at CMU. If you Google "libcuckoo," you'll find it, and it's pretty good — it's not latch-free, but it has low overhead for synchronizing threads.
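And here's roughly what two-table cuckoo insertion and probing look like in code. This is my own simplified sketch, not libcuckoo: real implementations use buckets with multiple slots, fine-grained locking, and real hash functions, and they trigger a full rebuild with new hash functions when the eviction loop fails:

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

class CuckooHashTable {
 public:
  explicit CuckooHashTable(size_t capacity)
      : table_{std::vector<Slot>(capacity), std::vector<Slot>(capacity)} {}

  bool Insert(uint64_t key, uint64_t value) {
    for (int tries = 0; tries < kMaxEvictions; tries++) {
      // First see if either table has a free slot for this key.
      for (int t = 0; t < 2; t++) {
        Slot& s = SlotFor(t, key);
        if (!s.occupied) { s = {key, value, true}; return true; }
      }
      // Both occupied: evict one occupant (alternating tables so the displaced
      // key gets re-hashed into the other table on the next iteration).
      Slot& victim = SlotFor(tries % 2, key);
      std::swap(key, victim.key);
      std::swap(value, victim.value);
    }
    return false;  // probable cycle: caller rebuilds the table with new hash functions
  }

  std::optional<uint64_t> Probe(uint64_t key) const {
    for (int t = 0; t < 2; t++) {                 // O(1): at most two slots to check
      const Slot& s = table_[t][Hash(t, key) % table_[t].size()];
      if (s.occupied && s.key == key) return s.value;
    }
    return std::nullopt;
  }

 private:
  struct Slot { uint64_t key = 0; uint64_t value = 0; bool occupied = false; };
  static constexpr int kMaxEvictions = 32;
  // Two differently "salted" placeholder hash functions, one per table.
  static uint64_t Hash(int t, uint64_t k) {
    return (k ^ (t ? 0x9E3779B97F4A7C15ULL : 0xC2B2AE3D27D4EB4FULL)) * 0xff51afd7ed558ccdULL;
  }
  Slot& SlotFor(int t, uint64_t key) { return table_[t][Hash(t, key) % table_[t].size()]; }
  std::vector<Slot> table_[2];
};
```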
All right, so at this point we have our hash function, we have our hash table, and we've populated the hash table in the build phase; now we go through the probe phase. For every single tuple in the inner relation, we basically do the same thing as the build phase: we hash its join key using the same hash function we used for the build, and then we check whether there's a matching tuple in the corresponding bucket of our hash table. It's not enough to compare just the hash values — you have to compare the actual join keys, because two different join keys can collide on the same hash value, so we have to make sure they're truly equal. If the inputs were already partitioned in the partition phase, then each thread just works on whatever partition is local to it. Otherwise, there's some sort of global cursor stepping through the inner table, and we just have to make sure that two threads don't probe the hash table with the same tuple, because that would produce duplicate results, which would be incorrect. There's really not much else to say about the probe phase; it's pretty straightforward: you scan through, you check the hash table, and if the element you're looking for is there, you produce it in your output.

So in the remaining time I want to go through the evaluation study from the paper you read and talk about why these different approaches have different performance characteristics. The four hash join variants we're going to look at are combinations of the design decisions we've talked about so far: no partitioning at all with a shared hash table, non-blocking partitioning with shared buffers, non-blocking partitioning with private buffers, and the radix-partitioning hash join we talked about earlier. This table summarizes the different aspects of how you implement each one: whether you partition, how you synchronize during the partition phase (if you do it at all), how the hash table is organized in memory, and whether you need to synchronize during the build phase and the probe phase. Because the partitioned variants have already divided up the data, they don't need to synchronize during the build: every thread works on its local partition and doesn't have to worry about other threads reading or writing the same space. You do have to synchronize the build in the no-partitioning case. And nobody needs to synchronize the probe phase, because each thread can produce the output for the tuples it evaluated against the hash table into its own private buffers, and then the exchange operator — or whatever else sits above it in the query plan — can coalesce the results from the different threads, so there's no synchronization as you go along.

The paper you read is six years old now, but I still think it's valid. It used a prototype system built at the University of Wisconsin to evaluate these different join implementations. Part of the reason you want a prototype for this kind of experiment is that you don't want the numbers to be tainted by all the other parts of the system — the SQL parser, thread scheduling, and so on — you just want to look at the pure algorithms and see how they perform. For the workload, the outer relation has 16 million tuples with a primary key and the inner relation has 256 million tuples, and we look at two workloads where the distribution of the join keys is either uniform or highly skewed. They also don't measure the time to materialize the join output — they only track the tuple IDs or offsets — again, to avoid the overhead of other parts of the system. In the paper they show measurements for a (by now fairly old) Xeon CPU and also a SPARC Niagara machine. I'm only going to show the Xeon results, because that's more relevant to the work we're doing here: the SPARC machines aren't that common, and the Xeon cores are heavyweight, whereas the Niagara threads are much more lightweight.
So think of them sort of like an Atom processor: they're not the full-fledged, superscalar cores that you have in the Xeon. We're just going to focus on those. In this first experiment, we look at the uniform data set and see how well these all perform. The thing to point out is how I'm showing these measurements: I'm breaking them down between the different phases, the partition phase, the build phase, and the probe phase. Obviously the no-partitioning case has no partitioning, so there's no gray partition segment for it. The main takeaway here is that radix partitioning is 24% faster than the no-partitioning case when everything is uniform. And what's really fascinating is that when you look at the measurements, the no-partitioning case has three times as many cache misses as the radix case. That's expected, because every thread is just scanning through the data and incurring cache misses as it goes along, even though the radix case still has to scan through the table twice. The no-partitioning case also has 70 times as many TLB misses, because again, every thread is just reading whatever random data it comes across as it processes the probe phase. The other main takeaway is that the build phase is actually really fast; all the time is being spent either partitioning or probing. You're either going to pay the penalty at the beginning or pay the penalty at the end; there's no way to avoid this. Now let's look at the results when you have a skewed data set. What we see now is that no partitioning actually performs the best across all of them. The reason is that, since the threads are just reading random parts of the table, there's no one thread that ends up with more work to do than the others. Whereas when you partition in the other cases, you end up with some buckets or partitions that have more tuples than others, so some threads will finish while others still have a longer queue of things to process. That's why no partitioning outperforms here. In the case of shared partitioning, I think the reason it's so bad is that, since the data is skewed, all the threads are trying to write to the same shared partition and therefore have to acquire a latch to do that, so they all get bottlenecked; there's a convoy waiting to acquire that latch to write to this one location in the shared partitions. In the private-partitioning case, again, you do two passes over the data, but you can do it in a latch-free manner without worrying about interfering with other threads, and that ends up being better for this case. So far, though, I've shown two clear examples: one where partitioning does the best and one where no partitioning does the best. It's two different workloads running the same queries, but we get two different results. And part of how this works in a real system is that your optimizer is responsible for figuring out which of these approaches you want to use.
And it's even more complicated in the radix-partitioning case, because now you also need to decide how much you should partition your data. You could do something really simple, like "I have 10 cores, so I'll have 10 partitions," but that may not always be the best choice, because more fine-grained partitions help you avoid the problem where one partition ends up with most of the data because the access pattern of the workload is very skewed: with more partitions, the data is more evenly divided. So in a real database system, the optimizer is going to have to answer these questions based on what it knows about the data and the hardware it's dealing with. Of course, this is problematic because, as we saw when we talked about optimizers, the statistics they collect can turn out to be really, really wrong. That's again why, in that one example with Postgres, the performance got worse as the predictions got more wrong: the optimizer fails to make the proper choices for these decisions. To give you an idea of how sensitive these algorithms are to these design decisions, in this experiment they do radix partitioning on the uniform data set with either one pass or two passes, and along the x-axis are the different numbers of partitions you can have. The best-case scenario for one-pass radix was 4096 partitions, and in the numbers I showed before in the previous graphs they were using that value, with one pass rather than two. We can compare this against the best-case scenario for the no-partitioning algorithm, which is represented by this horizontal line. As you can see, this is an example where doing two passes actually makes you worse than no partitioning, but doing one pass makes you better: one pass gives you roughly a 24% reduction in runtime, whereas two passes makes you about 5% slower. As you scale across the different partition counts, the behavior looks somewhat unpredictable: 64 partitions actually does worse, then as you add more partitions it gets better, and at some point you have too many partitions and it gets worse again. In the case of two-pass partitioning, it's only out around 32,000 partitions that you actually see better performance. Now, this is an older CPU: it has one socket and six cores, so maybe this changes if you have more cores; we can look to see whether there are later studies on that. But the main takeaway is that if you do use partitioning, getting the right number of partitions can dramatically change the performance. And so the HyPer guys argue in later papers, and a lot of people have argued, that you don't even want to bother with radix partitioning at all; it's not worth it. The no-partitioning approach may not always be the best choice, but it has fewer moving parts and is harder to get wrong, so you just let the system go ahead and do that. Again, most of the newer systems, I think, are not doing radix partitioning; they're doing the no-partitioning approach.
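To show where the partition count comes from, here is a minimal single-threaded sketch of one radix-partitioning pass over a set of join keys: the number of partitions is 2^bits, taken here from the low-order bits of the key, and a histogram plus prefix sum gives each partition a contiguous output range. The per-thread histograms of the parallel version and the second pass over higher radix bits used by the two-pass variant are omitted; the function names and the choice of low-order bits are illustrative assumptions.

```cpp
// Minimal sketch of a single radix-partitioning pass (histogram + scatter).
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

std::vector<uint64_t> RadixPartition(const std::vector<uint64_t>& keys,
                                     unsigned bits,
                                     std::vector<size_t>* partition_start) {
  const size_t num_partitions = size_t{1} << bits;
  const uint64_t mask = num_partitions - 1;

  // First pass: count how many keys fall into each partition.
  std::vector<size_t> histogram(num_partitions, 0);
  for (uint64_t k : keys) ++histogram[k & mask];

  // Prefix sum turns the counts into starting offsets in the output array.
  partition_start->assign(num_partitions + 1, 0);
  for (size_t p = 0; p < num_partitions; ++p)
    (*partition_start)[p + 1] = (*partition_start)[p] + histogram[p];

  // Second pass: scatter each key into its partition's contiguous range.
  std::vector<uint64_t> out(keys.size());
  std::vector<size_t> cursor(partition_start->begin(), partition_start->end() - 1);
  for (uint64_t k : keys) out[cursor[k & mask]++] = k;
  return out;
}

int main() {
  std::vector<uint64_t> keys = {7, 1, 12, 5, 9, 4, 2, 14};
  std::vector<size_t> starts;
  auto partitioned = RadixPartition(keys, /*bits=*/2, &starts);  // 2^2 = 4 partitions
  for (size_t p = 0; p + 1 < starts.size(); ++p) {
    std::cout << "partition " << p << ":";
    for (size_t i = starts[p]; i < starts[p + 1]; ++i) std::cout << " " << partitioned[i];
    std::cout << "\n";
  }
}
```

Changing the bits parameter is exactly the tuning knob discussed above: too few partitions and a skewed workload overloads some of them; too many and the partitioning pass itself starts thrashing the caches and TLB.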
All right, the last study I want to show you is how hyper-threading affects this, and this is actually where you see a really big difference in performance between partitioning and no partitioning. Again, we only have six cores, but with hyper-threading we can go up to 12 logical threads. What you see is that for the uniform data set, the no-partitioning join and the radix-partitioning join are more or less equivalent until you turn on hyper-threading, and then the performance of radix partitioning actually gets much worse. The reason is that in the no-partitioning case, yes, it's going to have more cache misses and more TLB misses, but because you now have multiple threads running on a single core, when one thread takes a cache miss and has to stall, the hardware can execute the other hyper-thread on the same core and still make forward progress. For the radix-partitioned case, having fewer cache misses only buys a small performance improvement on its own, and things get a lot worse once you go beyond the physical cores, because every thread expects the data it needs to be local; with two hyper-threads sharing the same core and its caches, you end up causing more cache misses as you rip through the data very quickly, and that's why it stalls like that. All right, any questions about hash joins? Okay. As I said, the hash join is the most important operator we can have in our database system, and it is my opinion that the no-partitioning algorithm is the way to go. We talked about how to do this in a multi-threaded environment. There are a bunch of additional optimizations that we'll see more about on Thursday, where you can possibly apply SIMD and use vectorized instructions to speed up some of the steps in the hash join algorithm. We've done research that shows there isn't much to vectorize in the case of hash joins; you're really going to see the bigger benefit of it in the sort-merge join and also in sequential scans, which we'll talk about next week. So yes, there are some additional optimizations we can apply, but in practice we don't think they actually make a difference: the no-partitioning case, just running straight through on the CPU, is going to be good enough, okay? All right, so on Thursday we'll talk about parallel sort-merge joins, and I sent some emails over the weekend with feedback on the projects. If you have any questions or want any clarification, just come to my office hours and I'll be happy to talk about them, okay? All right, see you guys on Thursday.