For today, we're going to look at the other parallel join algorithm: sort merge. So again, just as a reminder, where we are in the semester is that we're looking at all the things you need to do to run fast analytics. Tuesday's class was about doing parallel hash joins; today is how to do a parallel sort merge join, which is the other major join algorithm you can have. We saw last class, and we'll see here today, that hash join is pretty much what you're always going to want to use. So today we're going to start off with some background about SIMD and sort merge, and then we'll focus on the parallel sort merge algorithms. There's one in the paper you guys read, but then we'll also cover the one that the HyPer guys developed. And then we'll have an evaluation that compares these different algorithms, plus throws in hash joins, and you'll see why the hash joins will be better.

Just as a spoiler for this entire lecture: everything we're going to talk about, and everything you guys read, doesn't actually work for current Xeon CPUs. When I say Xeon, I mean literally the Intel Xeon CPU, not the Xeon Phi coprocessor or the Xeon Phi processor. For whatever reason they want to reuse the name Xeon. There are the Xeon general purpose CPUs, which are what you think of when you think of a server machine, but then there's also the Phi, which we'll cover next week. So none of this works on the CPUs you can buy from Newegg to run on your server. I'm not going to tell you why as we go along, but we'll see at the end why it doesn't work. So maybe you're wondering why I had you read a paper that doesn't work: it's because eventually this will work, and eventually it actually might make sort merge joins faster than hash joins. But we're not quite there yet on general purpose CPUs.

So everyone here probably should already know what SIMD is. Has everyone taken 418 or 618? Yeah, you covered that stuff there. SIMD stands for single instruction, multiple data. These are a class of CPU instructions that allow an application to have the processor invoke the same instruction simultaneously, in parallel, on different pieces of data. This is different from what we normally think of when we write programs in terms of SISD instructions, or single instruction, single data. This comes from Flynn's taxonomy of CPU architectures from the 1960s. So the idea has been around for a while; it's only in maybe the last 15 or 20 years that database systems have actually started to employ SIMD for these techniques, and it wasn't really until the 1990s that AMD and Intel started including SIMD instructions or extensions in their microarchitectures. When they first came out in the 1990s, Intel called their thing MMX and AMD called theirs 3DNow!. I remember when I was in middle school we had a Pentium CPU in the 90s and it was a big deal for us to get MMX. We didn't know what it meant. Most programs didn't actually use it, but Intel was very good at marketing that you needed to have SIMD even though nobody knew what it actually was. These early incarnations of SIMD on Intel's chips were actually pretty bad. The problem was that whenever you invoked a SIMD instruction, it would block the CPU from running regular instructions.
It was only a small little blip, but it actually made a difference because you couldn't run things in parallel. The later versions were SSE, SSE2, and SSE3. These added support for executing the SIMD instructions in parallel while the CPU did other things, because we had better support for out-of-order execution. And now, where we're at today, for the most part when people talk about SIMD they're talking about AVX with 256-bit registers or AVX-512. We'll cover what these different SIMD instruction sets do next week when we talk about vectorized execution. I'm only bringing this up now because you need to understand SIMD to understand how they're going to do the hardware-accelerated sort merge join.

So at a high level, to see what SIMD looks like, say we want to do the operation X + Y = Z, where X and Y are vectors of integers. The output of this summation is another vector, produced by taking X1 and Y1, adding them, and so on across all of the elements. The way you would normally write this in your program is as a for loop that iterates over every element of the two vectors (you assume they're the same size), adds them together, and writes them out to another vector, which is the output buffer. To do this with SISD, the single instruction, single data approach, you basically just iterate through every single element, add the two together, and write them out to the buffer. Again, it's called SISD because you have two data items and you invoke a single instruction on them; you can't combine multiple of these operations together.

Now with SIMD, how it's different is that instead of looking at individual elements, you take a run of elements, multiple elements, from the two input vectors and store them in a memory-aligned SIMD register. So for the top one, X, we take the first four elements and store them in one register, and then take the first four elements of Y and store those in another register. The way to think about how this works in your program is that these SIMD registers are sort of like variables that you can write data into. And the way you write this code is that you use the same kind of CPU intrinsics that we talked about before, where the compiler translates them into the actual underlying instructions that do the loads and stores to move data into these registers. Now, the key thing is that the register is always a fixed size; in this case it's 128 bits, and in the newer ones you can have 256 bits. That means you can only store a finite number of elements in a register. Here we're storing 32-bit integers, so we can fit four of them in a 128-bit register. We also need to be careful that everything aligns correctly; in this case we're only doing integers, so it's really simple. So now, once we load these four elements into the SIMD register, we invoke a single SIMD instruction that adds the first element from the top register to the first element of the bottom register, and so forth, and writes out the result as a single vector.
What happens is that the result also gets written to another SIMD register, the same 128-bit size as these. Then eventually, if we want to use it in our program, we have to copy it out of the register and put it into memory; again, we use intrinsics to do this movement of data. Then, for the remaining elements in our two arrays, we do the same thing: copy them into registers, invoke the SIMD instruction, and produce our final output. Before, when I showed the SISD example, we had to do eight additions to walk through every single element. In this case, I only have to do two addition instructions, because each one processes multiple data items at the same time, and that's much more efficient. So we're getting a 4x speedup in terms of the number of instructions we have to execute when we use SIMD versus SISD. Now, you do have to execute instructions to move the data from memory, from L1 or wherever, into your SIMD registers, but you have to do that same thing anyway to do the addition with SISD, because you have to move the data out of memory into registers for the addition. So even though it looks like there's an extra copy to move the data from memory into the SIMD register, we do that anyway for SISD as well.
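To make that concrete, here's a minimal sketch of the X + Y = Z example using SSE intrinsics. The function names, and the assumption that the array length is a multiple of four, are mine for illustration; this is not code from the paper.

```cpp
// Minimal sketch of the X + Y = Z example, assuming SSE2 is available and
// that n is a multiple of 4. Names are illustrative.
#include <emmintrin.h>  // SSE2 intrinsics

// SISD version: one addition instruction per element.
void add_sisd(const int* x, const int* y, int* z, int n) {
  for (int i = 0; i < n; i++) {
    z[i] = x[i] + y[i];
  }
}

// SIMD version: load four elements into each 128-bit register,
// add them with a single instruction, then store the result.
void add_simd(const int* x, const int* y, int* z, int n) {
  for (int i = 0; i < n; i += 4) {
    __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
    __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
    __m128i vz = _mm_add_epi32(vx, vy);  // four 32-bit additions at once
    _mm_storeu_si128(reinterpret_cast<__m128i*>(z + i), vz);
  }
}
```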
So this sounds amazing, right? This sounds like a big win for us, and as we'll see in today's lecture, you can get quite a significant performance improvement if you use SIMD correctly. The tricky thing, though, is that not all algorithms can be easily vectorized, and furthermore, the most common compilers are not very good at vectorization. That means that we, the database developers, the people actually building the database system, have to be versed in how SIMD works, and we have to be responsible for recognizing that an algorithm can be vectorized and knowing which SIMD instructions to use, because the compiler for the most part is not going to do this automatically. There is a compiler from Intel called ISPC, the Intel SPMD Program Compiler, and it's actually very good at finding opportunities for vectorization, but for the most part it can only handle simple things. For the complex algorithms we're talking about here, like joins, or the paper you'll read next week where they do vectorization for every single relational operator — aggregations, scans, everything — maybe the scans it could figure out, but for the rest the compiler isn't going to figure it out for us, so we have to do it ourselves. We also have to be careful about making sure our data aligns correctly in the registers. That means if I have, say, a 48-bit integer for whatever reason that I want to add, that's going to be a problem when I put it in the SIMD register, because it wants to take 32-bit integers from two registers and add them together; it doesn't know how to go over that boundary. Again, for the most part, for everything we'll talk about here and in general in database systems, this isn't really a problem, because when we join two tables or sort keys in the same table, the keys are always the same type and the same fixed length; if it's a varchar, we just use dictionary encoding to get a 32-bit or 64-bit integer. So this is not really a problem for us in databases, but in general it's a big deal with SIMD.

Another tricky aspect is that getting the data into the correct form inside the registers, and then getting it back out in the form we need, is actually non-trivial to implement. For the example I showed here, the SIMD addition, it's really easy: you take the values in, and after the addition they get written out in the form we need for whatever we want to do next. But for other operations, as we'll see later when we talk about the sorting networks, you have to do some extra shuffling to move things around. Now, the nice thing about SIMD is that SIMD instructions can write to other SIMD registers; it's not like we have to go back and forth between L1 and the SIMD registers every time we want to move things around. But you have to be mindful of what you're doing to put the data into the correct form, because if you do everything through L1, it's going to be slow. We'll see more of this next week when we talk about the paper from the guys at Columbia, where they show how to do vectorization for all different types of algorithms. There are actually certain instructions that the Xeons they used don't support and therefore have to be emulated on the CPU, which slows down those algorithms, whereas the newer Xeons have those instructions. We'll talk more about this later on, but you'll see a glimpse of it when we talk about how to do SIMD for the sorting networks.

Now you may be thinking, given what I just showed you about SIMD, why don't we want to use the GPU? The GPU is sort of like SIMD on steroids: you have a ton of smaller cores all doing the kind of processing we want to do in parallel. In general, what I'll say is that almost every single database management system does not actually support GPUs, because moving the data back and forth from CPU memory, like DRAM or your CPU caches, over the PCI Express bus to where the card actually sits, is really, really slow. You're better off just doing whatever you want to do on the CPU, because the time it takes to send the data down and then send the results back up is going to take much longer. So in general, most database systems don't use GPUs. Now, there are some newer database systems from the last couple of years that do use GPUs. MapD actually announced yesterday that they raised $25 million in Series B funding, so that's pretty significant. There's also SQream, however you want to pronounce that, which is another one of these systems, and then Kinetica, which used to be called GPUdb before they renamed it. These are systems that claim they can use the GPU to accelerate query processing without paying the big penalty of going back and forth over the PCI Express bus. The way MapD does this is that you just load the entire database onto your GPUs and then do all your queries as sequential scans down there. So if you have the kind of database we've been talking about, where you want to support hybrid workloads where the database is being updated and you also want analytics on it, then GPUs aren't going to work, because it takes too long to propagate the changes down to the GPUs. These are mostly used in analytical systems where the database is static.
We'll talk more about GPU acceleration next week. In general, the way these guys are doing it, what I think is the right way to do it for hybrid workloads, is simply not going to work yet. Now, what actually might be a big game changer are these newer co-processors that are coming out where the GPU, or really the vectorized hardware, is directly on the socket with the regular CPU and is cache coherent. That means the GPU sitting on the socket can access the same memory that the CPU sees and everything stays in sync. Then you don't have to go back and forth over the PCI Express bus, and that's going to allow you to do a lot of interesting things. AMD has this thing called the APU. Intel has this Knights Landing thing, which is sort of a co-processor, and I think the newer versions of Intel's hardware are going to have more stuff on the socket. They bought that startup Nervana; they want to put deep learning things directly on the socket. So when these things come out, some of the things we'll talk about today you could maybe push to a GPU, and that will run faster than SIMD, and you can do it while avoiding copying things back and forth to the GPU. We'll talk more about this next week. This came up last time, and any time we talk about vectorization people always ask, can't a GPU do this? And the answer is yes, but it's slow to move the data.

So now to get to the sort merge join algorithm. Again, we covered this in the intro class; the algorithm is really simple to understand. In the sort phase you basically just sort the inner relation and the outer relation on the same key that you're trying to join them on, the same attributes. Then in the merge phase you scan through the two tables in lock step and do the comparison between the tuples from the two sides. And you know that in the case of the outer relation you never have to backtrack to look at tuples you've already examined: you can walk through the outer table in order, and on the inner table you do your comparison and you know whether you should move one cursor or the other based on the values you see when you're joining them. So in the merge phase we don't have to rescan the outer relation, but the inner relation may need to backtrack to handle duplicates, foreign keys, and things like that.

Visually, at a high level, it looks like this. We have our two relations, and these colored bands or different shades of gray represent the different values that we want to join them on. In the first phase you sort, so now you have these bands in order, and I'm not saying what the sort algorithm is; for our purposes it doesn't matter. In general it's usually quicksort, but you could use heapsort, you could use the hybrid sort that the STL uses; it doesn't matter, because we're assuming everything is in memory. Then in the merge phase you have two cursors, one on the outer relation and one on the inner relation, and you walk through and do the comparison going across, and if the join keys are equal, or whatever the predicate is you're using for your join evaluation, then you produce the match in the output buffer for the result of the join. It's pretty straightforward to understand.
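Just to make that merge phase concrete, here's a rough sketch in C++ of the two-cursor walk, assuming both inputs are already sorted on the join key. The Tuple struct and the way results are emitted are placeholders I made up, not from any of the papers.

```cpp
// Sketch of the merge phase of a sort-merge join over two sorted inputs.
#include <utility>
#include <vector>

struct Tuple { int key; /* ... payload ... */ };

void merge_join(const std::vector<Tuple>& outer,
                const std::vector<Tuple>& inner,
                std::vector<std::pair<Tuple, Tuple>>& out) {
  size_t o = 0, i = 0;
  while (o < outer.size() && i < inner.size()) {
    if (outer[o].key < inner[i].key) {
      o++;                              // advance the outer cursor
    } else if (outer[o].key > inner[i].key) {
      i++;                              // advance the inner cursor
    } else {
      // Emit all inner matches for this outer tuple.
      size_t mark = i;
      while (i < inner.size() && inner[i].key == outer[o].key) {
        out.emplace_back(outer[o], inner[i]);
        i++;
      }
      o++;
      // If the next outer tuple has the same key, the inner cursor
      // backtracks; the outer cursor never does.
      if (o < outer.size() && outer[o].key == outer[o - 1].key) {
        i = mark;
      }
    }
  }
}
```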
So now, when we bring in the parallel sort merge join, the key observation, and the thing we're going to spend all our time on, is how to speed up sorting, because that's going to be the most expensive phase in the algorithm. And just like in the hash join algorithms we talked about before, we want to be mindful of what the hardware looks like so we can make decisions about the algorithms we use to do the join, so that we get maximal performance and the best utilization of our hardware. That means we want all our CPUs running at full blast, always having data to process, never stalled on cache misses or waiting to get things back from DRAM, and we also need to be mindful of NUMA boundaries, just like before, because we don't want a core running on one socket to have to get data that's sitting far away at another socket: then you have to go over the interconnect and pull it over, and that's slower, as we saw before. We'll see in one case, actually, that the HyPer guys sort of violate this when they do the actual merging of the sort merge algorithm, because they claim they're doing sequential reads and it doesn't matter that the data is not local, since the hardware prefetcher will hide all that latency for them.

So the parallel sort merge join, just like the hash join, is the same basic algorithm, but there's an optional partition phase where we split up the outer relation into disjoint subsets or segments — call them morsels, partitions, shards, it's all the same thing — and assign them to different workers, so that when they get to the sort phase most of the data they have to process is local to them. This is entirely optional. We saw last class that actually doing this step, doing radix partitioning for the hash join, was not always the best idea, especially when the data was skewed, so we have that same caveat here: we may not always want to do this step. The rest is all the same: we sort the outer and the inner on the same key, and then we iterate over the two tables and do the comparison to see whether we have a match.

I'm not going to say much about the partitioning phase here. The same radix partitioning that we talked about last class is applicable here; we can do the exact same thing. We can also rely on the fact that our data might already be implicitly partitioned when we loaded it in, as it was in the Morsels case. Implicit partitioning is where you figure out what key you want to use to split your tables up when the tuples get inserted, and you assign them to the different NUMA regions.
So for explicit partitioning, everything we talked about last class applies here, and there's not really anything extra to say. Again, we'll spend most of our time on the sort phase. The goal of the sort phase, at a high level, is to produce what are called sorted runs over chunks of data for the two relations you're trying to sort. A run is a string of tuples that are all sorted from lowest to highest, and we're going to do this in small parts and then slowly build them up into larger and larger runs until we get to the final point where the entire table, or the entire subset of the data that we need for our join, has been entirely sorted. As I said before, when we do the basic sort merge algorithm, quicksort is good enough; quicksort is really fast, and when you're not worried about extra cores or about NUMA regions, there's not much more you need. But everything we're going to talk about here is a replacement for quicksort that gets better performance, because we take advantage of the hardware and we're mindful of where the data we're reading and writing actually is, so that we avoid cache misses and we avoid having to go over the interconnect.

The algorithm I'm going to show you is referenced in the paper you guys read from the ETH guys, but it actually comes from an earlier paper from Intel and Oracle, from around 2009, where they developed a technique called cache-conscious sorting. The way to think about cache-conscious sorting is that the sort is being done in multiple phases, at multiple levels, and the data moves between those levels. This is not the language they use in the Intel paper, but this is how I like to think about it, the terminology I use to understand what's actually going on.

In the first level, you want to generate sorted runs that can fit in the CPU's SIMD registers. In the case of the example I showed before, we generate runs that contain just four values, four elements, and we do that across the entire data set. Then we go to the second level, where we combine the sorted runs from the first level into subsequently larger and larger runs, up until they're half the size of our L3 cache. So in the first phase we have four elements, which is 128 bits; then when we go to level two we take two 128-bit runs and combine them into 256, then two 256-bit runs into 512, and so forth, and we keep doing this until our sorted runs are half the size of our CPU caches. Can anyone guess why they have to be half the size? Yes, exactly: you need to store both the input and the output in your CPU cache, so each has to be half the size, because you can't do the sorting in place here. So you keep going until you reach that halfway point, and then you switch into level three, where you do out-of-cache sorting. This is where you make runs that are larger than your CPU caches, but you can't avoid it, because if the table is larger than your CPU caches you have to go out to DRAM. We'll see how they handle this and make sure that the CPU stays at full utilization and isn't stalled waiting on memory bandwidth.
So again, pictorially it looks like this. We start off with our unsorted data set, and in level one we break it up into four-element runs that can each fit into a SIMD register. In level two we start to combine these into larger and larger runs, and then in level three we make the really large runs until we finally produce the complete sorted run, which is the entire input relation. I'll go through each of these steps one by one and show you how they actually implement it with SIMD.

For the first level, they use a technique called sorting networks. Sorting networks are actually a really old idea; they came out back in the 1950s, when people proposed building this kind of hardware. The way to think about a sorting network is as a hardware device where you have wires that can each hold a value, and you pass the values along the wires until you reach a comparator, which compares the values on two wires and produces outputs that are carried further along the wires. So say, for example, we have four elements, because we're doing a 128-bit SIMD register. Along these wires here, this one carries the value 9, and these carry 5, 3, and 6. The vertical bars represent the comparators: we take the value from the top wire and the value from the bottom wire and run a min/max comparison on the two of them, and the lowest value, the min, gets moved to the top, and the max gets moved to the bottom. We do the same thing here. So after this comparator, this wire carries the value 5, and that gets fed into the next comparator, where we compare 5 and 3; obviously 3 is less than 5, so 3 goes up here, and then it moves all the way to the output, because after this comparator there's no other comparator along the way, so the value can be written to our output slot here, and we know 3 is always going to be the min value. Same thing along these wires here: from 9 and 6 we produce 6 and 9, and the 9 has no more comparisons, so it goes to the bottom; then we compare 6 and 5 and produce 5 and 6. It's pretty straightforward to understand.

What's actually really fascinating is that this is super fast to implement. You may be thinking, well, how useful can this be, because you can only compare four numbers, but remember we're going to do this for all these little four-number chunks across the entire data set, and we can implement it really, really fast, because unlike quicksort and other sorting algorithms there are no conditionals in this at all; we can implement it entirely with SIMD instructions. There are no if clauses whatsoever. In quicksort there are if branches, and if you have an if branch then you may take an unexpected jump and flush your instruction pipeline, which is really slow on a superscalar CPU. In this case we can implement all these min/max operations with just two SIMD instructions per comparator to produce the output. And no matter what the values are, the execution path of the program is always exactly the same, because we always execute this one first, then that one, and then from there we execute this one.
It's not like we might execute this one and then that one depending on the data; we know exactly what the instructions are going to be. The way to think about it is that we can unroll the loop and always execute the instructions in the same sequential order.

So let's see how to do this with SIMD. For this, we're going to sort four four-element vectors, so each of these is a SIMD register that can hold four values. First we load the data out of memory into our SIMD registers; that's just four load instructions, which is really cheap. Then we sort across registers, meaning we sort this column here rather than sorting within a register itself. This is because of the way SIMD works: you can't do addition or min/max or whatever operation you want within the elements inside a single register; you have to do it across registers. So we sort this column here, and we can do that using the sorting network I showed before, because we know exactly which min and max instructions we want to execute. To sort these four registers we only need 10 min and max instructions. Now, looking at any column, it's sorted, but the problem is that when we want to write this out, we don't write out a column, we have to write out a register. If we just dumped these registers out into our CPU caches and then did the transformation there, that would be really slow, because that would be SISD instructions with extra memory copying. But, like I said, SIMD operations can produce output that goes into other SIMD registers, so I can do all the transformation I need to put this into the form I want directly in SIMD; I never have to copy anything out to L1. To do this I do a transpose, so that, for example, 8, 11, and 14 are now all aligned together in a single register. Now each of these registers is sorted within itself, but they're not globally sorted relative to each other; certainly 8 is less than 9, so you can't read a globally sorted order off of this, only within a single register. But now I can write this out to DRAM, to my program's memory, without having to do any additional operations; it's in the form that I want. Doing this is only 8 shuffle operations and then 4 store operations to write them out to memory. So what we did to sort 16 numbers, all within SIMD, is less than two dozen instructions. There's no way quicksort is going to be that efficient. This is why, using the acceleration that SIMD provides, we can sort these little runs very quickly, much faster than you can with quicksort.
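Here's a minimal sketch of that level-one step, assuming SSE4.1 for the integer min/max intrinsics. Each register holds one row of four keys, and the five compare-exchanges (10 min/max instructions) sort every column; the transpose back into row-order runs, which uses unpack/shuffle intrinsics, is omitted here. This is my own illustration, not the paper's code.

```cpp
// Level-1 sketch: sort the "columns" of four SIMD registers with a
// 4-input sorting network. Requires SSE4.1 for _mm_min_epi32/_mm_max_epi32.
#include <smmintrin.h>

inline void cmp_exchange(__m128i& a, __m128i& b) {
  __m128i lo = _mm_min_epi32(a, b);   // lane-wise min
  __m128i hi = _mm_max_epi32(a, b);   // lane-wise max
  a = lo;
  b = hi;
}

void sort_columns(__m128i& r0, __m128i& r1, __m128i& r2, __m128i& r3) {
  // Standard 4-input sorting network: (0,1) (2,3) (0,2) (1,3) (1,2).
  // Five comparators = 10 min/max instructions, no branches at all.
  cmp_exchange(r0, r1);
  cmp_exchange(r2, r3);
  cmp_exchange(r0, r2);
  cmp_exchange(r1, r3);
  cmp_exchange(r1, r2);
  // A transpose (unpack/shuffle intrinsics) would follow here to turn the
  // sorted columns into four sorted runs before storing them to memory.
}
```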
All right, so once we do this for all the elements, we have a bunch of these different 128-bit vectors, these four-element vectors, and we then want to go to the second level, where we start merging together two locally sorted four-element lists to produce a larger sorted list, and we keep doing this until these sorted lists are half the size of our CPU caches. To do this, we're going to use a technique called the bitonic merge network; the SIMD version of this also comes from that Intel paper from a few years ago. I think what they talk about in this paper is really fascinating, and I think the paper makes a really significant scientific contribution for what they were able to achieve by taking existing algorithms that we've known about for a long time and speeding them up on hardware. In this paper they talk about how they can speed up their sorting algorithm by up to 3.5x over a traditional scalar implementation, and that's really significant, because if you think about it, quicksort has been around since the 1960s; there's no magic, as far as we know, that's going to suddenly make it go faster, unless P equals NP. So by taking the same algorithms we've known about but using the hardware in a more intelligent way, you can get a 3.5x speedup over a naive implementation. That's quite significant, and that's why I think this paper is really important; if you're interested in this kind of stuff, definitely check it out. Except, as we'll see later on, it doesn't actually work.

All right, so the way the bitonic stuff works is that it looks a lot like the sorting networks, where you have these wires that pass along values and then comparators that produce new values, but now the inputs for us are not going to be the single elements that we had in level one; they're going to be the sorted runs produced from the first level.

Yes, question in the back: the point is that the shuffle instructions make this a bit slower, so you don't achieve the 4x speedup you would expect. I said it's less than two dozen instructions, but each instruction is not equal in terms of execution time, and the shuffle instructions actually take more time. So how much of a slowdown is the shuffle, is there a way to speed up the shuffle, and how does the shuffle actually work? So I think your question is: back here, do we use shuffles?
Yes, so his statement, which is right, is that doing shuffles with SIMD is going to be somewhat more expensive than doing the min/max, though not that much more; all these instructions I'm showing you are not equivalent, and a min/max on SIMD registers is potentially cheaper than a shuffle. So the question is: how much of a bottleneck is the SIMD shuffle, how much of a concern is that for what we want to do here, versus would it just be better to use quicksort? One, we could just run this experiment and try it out; I'm not suggesting we will, but my intuition is that even though the shuffle is going to be slightly more expensive, this is definitely way cheaper than quicksort, because again it's not just about the number of instructions, it's also the number of branches. In quicksort you're doing these comparisons: is this less than that, and if the first element is less, write it here, otherwise write it there. That causes a slowdown because of the way Intel CPUs work now, with these really long pipelines: if you mispredict the branch, you have to flush everything and restart. So it's more than just the instruction count; it's the fact that there are no conditionals in this at all. No matter what values we feed in, we execute the same instructions every single time, there are no branches at all, and that's why it's always going to be faster than quicksort even though the shuffle might be more expensive. That's a good point, though.

Okay, so for this, again, it looks like the sorting network, but the inputs are the locally sorted runs we generated from level one, or, as we get bigger, the locally sorted runs from previous rounds of the level-two merging. What's going to happen, though, unlike before, where we just took the elements and put them directly into the sorting network, is that we have to do a transformation on half of the input elements going into our bitonic merge network. The first half of the input is in the same form that we generated in the previous step, going from lowest to highest; the second half, though, we reverse, so it goes from highest to lowest. It works out that when we start doing our min/max comparisons, the first element of this input will be the highest value, and we compare that with the first element of the other list, which will be the lowest value, and that makes sure we're putting things in the right positions. We do one round of this, then some more shuffling, another round of min/max, more shuffling, and then we can do our final comparisons to produce our sorted run. So in this case, all of these elements, these buckets or runs, are now globally sorted from lowest to highest, and again we keep doing this, doubling the size of our runs, until we get to the point where they're half the size of our CPU caches.
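To show the idea, here's a scalar sketch of a bitonic merge of two sorted four-element runs: reverse the second run, then do three rounds of compare-exchanges at decreasing distances. In the real implementation each round is done with SIMD min/max instructions plus shuffles; the names here are mine, and this is just an illustration of the network, not the paper's code.

```cpp
// Scalar sketch of an 8-element bitonic merge of two sorted 4-element runs.
#include <algorithm>
#include <array>

inline void cmp_exchange(int& lo, int& hi) {
  int mn = std::min(lo, hi);
  int mx = std::max(lo, hi);
  lo = mn;
  hi = mx;
}

// a and b are each sorted ascending; the result is all 8 values fully sorted.
std::array<int, 8> bitonic_merge(const std::array<int, 4>& a,
                                 const std::array<int, 4>& b) {
  // Reverse the second run so the concatenation forms a bitonic sequence.
  std::array<int, 8> v = {a[0], a[1], a[2], a[3], b[3], b[2], b[1], b[0]};
  // Round 1: distance 4.
  cmp_exchange(v[0], v[4]); cmp_exchange(v[1], v[5]);
  cmp_exchange(v[2], v[6]); cmp_exchange(v[3], v[7]);
  // Round 2: distance 2.
  cmp_exchange(v[0], v[2]); cmp_exchange(v[1], v[3]);
  cmp_exchange(v[4], v[6]); cmp_exchange(v[5], v[7]);
  // Round 3: distance 1.
  cmp_exchange(v[0], v[1]); cmp_exchange(v[2], v[3]);
  cmp_exchange(v[4], v[5]); cmp_exchange(v[6], v[7]);
  return v;
}
```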
So after we do this, we get into the third level, and the idea here is that, in the same way level two was building off of level one, level three builds off of level two: we're going to use these bitonic merge networks to merge larger sorted runs, but we're going to be careful about how we schedule these operations. In the earlier example, say one thread is doing this: it just rips through the min/max, does the shuffle, does the min/max, does the shuffle, in that order. But now, when we start merging things that exceed our CPU caches, we may actually want to jump around to different parts of the execution of these bitonic merge networks, because we want the memory controller to be fetching things in the background while we do some processing, rather than blocking and waiting for the memory controller to bring in the things we need every single time. The idea is that we use some extra bookkeeping to know where we can restart at different parts of our merge pipeline, and that extra bookkeeping gives us a more evenly distributed consumption of both CPU and memory bandwidth. So the basic idea is that we're going to execute more CPU instructions but get better performance, because the memory controller isn't pushing a ton of data into your caches and then sitting idle while you process it.

What I'll show you in a second, we're going to do with a single worker thread per core, and each one merges all the data that's local to it. We run this in parallel on all the different cores at the same time, but they're all basically doing the same steps, and because there's only one thread running each pipeline, there's no synchronization needed on any of the intermediate data structures, which is another reason why this works out.

It looks sort of like this. Here we have our sorted runs from level two, and think of this as a pipeline: we're combining things together as we go across, and we keep doing all our merging. These merge steps are the bitonic merge networks, and the outputs go into queues of sorted runs, and once we know we have enough elements in a pair of queues, we can invoke the next merge. The idea is that, in the very beginning, since there's nothing for this merge to do because it doesn't have enough input in its queues, the system sets a flag to say there's nothing to do here, so don't even try. And since a single thread is doing all of this, we don't have to synchronize or send an interrupt for someone to wake up and start doing something. So the thread jumps around to all these different parts: maybe it starts off here, does a bunch of merging on this, a bunch of merging on that, and then it can start merging here, but when this output buffer gets full it can't merge anything more here, so it jumps over there and starts doing that. Again, the bookkeeping we keep for all these little steps lets the thread know whether it has enough data to start merging, or whether it shouldn't even bother looking, and there's no synchronization across threads because only one thread is doing this. Now you may be thinking that this hurts your CPU caches, because maybe I'm over here and it would be nice if I could keep this stuff in my pipeline and keep processing as far as I can, but since I had nothing to do, I had to jump over here and start processing this instead. So you do pay a cache penalty, but it's better than having to block because you're waiting for other threads to do writes.
So again, the merge network I'm showing here is run by every single worker thread in the system, until we produce our final globally sorted result. And that's how you do the cache-conscious sorting.

At this point we have two globally sorted runs for the entire tables, the inner and outer relations, and then we just do that iteration with the cursors to walk through the two tables and do our comparisons across them. There's not really much to say about the merge phase. Depending on how you organize where the data is stored, as we'll see in a second, sometimes you have to read data that's sitting at other threads or other cores and sometimes you don't, and there are different trade-offs for both, but the basic idea is still the same: you still just do the straight comparisons. There's no real magic to speed this up, other than that it's done in parallel across different threads, and if you're careful about how you do it, you don't have to synchronize when you write things out to your output buffers.

So the three variants of the sort merge algorithm I want to talk about are the multi-way sort merge, the multi-pass sort merge, and the massively parallel sort merge. The first two come from the ETH paper that you guys read, and the last one is from the HyPer guys from a few years ago.

The multi-way sort merge is basically everything I just talked about: the cache-conscious sorting with the three levels, and then you do the merge at the end. What happens is that you know all the data at a single core, for both the inner and the outer relation, is enough to do the join; you don't have to look at any other cores. And you get this because when you do the level-three merging, you actually write things to the core where they should end up. It looks like this. You have the outer relation here, and the first step is the standard NUMA-aware partitioning, whether it's radix partitioning or whatever technique you want to use, and each core writes out the segments of the data it's responsible for as sorted runs in its local storage. Then you do the multi-way merge using the level-three stuff I talked about before, where every core knows that the data for some segment of the sorted output should all go to that core, and you merge these together using the merge network I described, with a single thread jumping around and executing the different tasks to do the sorting. You do this for every single thread, every single NUMA region, and they all do the merging in parallel. So now, across all the threads, the result in total is globally sorted, but each core only knows about the data it had to sort at this step. Then on the inner relation you do basically all the same steps that I showed here; I'm not going to draw it because I ran out of space, but this sort box here represents all of that. So what you have at this point, for both the outer relation and the inner relation, is that each core has exactly the data it needs to do the comparison at that core. What I mean by that is there's no element or tuple in this sorted run here that you would have to compare against the runs sitting at the other cores;
you can do it all directly, locally, within your core. So you're paying an extra penalty to do the remote writes, to do this partitioning in the beginning, but then you don't have to do any remote reads in this phase here; you just iterate and scan down locally and do the comparison.

The multi-pass sort merge is basically the same thing as the multi-way, except that you don't do that redistribution: you just generate your sorted runs at every single core, and then when you do your scans you actually have to look at more stuff. It's basically the same thing; you just don't do this extra step here.

Now, the HyPer guys do something much different. What they do is first range-partition the outer table and redistribute it across the cores, in the same way we saw on the last slide, but they don't do this for the inner relation. So you end up with an essentially globally sorted outer relation, but for the inner relation each core just sorts the local data it has. That means when you do the merge phase, each outer segment lives at just one core, but the matching tuples for it could be in any of the locally sorted inner runs, so you may have to go to all the different cores. Let me show you what that looks like. Again, we do essentially just range partitioning: we write all the values within a range to the different cores, and then each core sorts its piece. For the inner relation we don't do any redistribution; we just sort the local data we have. Now, when we want to do our merge, say we start with this first segment here: you start off with just the one piece of the outer relation in the range that you know belongs to this core, and then you do the comparison with the tuples here. So you do a small scan over the outer piece, but a full scan over this entire inner run at this core, and it goes down the line: every core is going to have to do that same full scan. And then for the next segment you do the same thing; everyone has to go and look at everything.

Now, this may sound like a bad idea; it may seem to contradict what I said earlier about wanting to avoid going over the interconnect. But the HyPer guys argue that when you do these full scans of these blocks, it's a sequential scan that the hardware prefetcher can accelerate for you, so you're not taking a cache-miss stall every single time you read something. As you scan along, it brings everything you need into the caches of the core that needs it, and that runs really fast. So it essentially ends up behaving like a local comparison at these cores, because the hardware prefetcher hides that cost for you.

In the HyPer paper they also provide some rules for what they argue is the correct way to implement the sort merge algorithm, basically three edicts that they say matter a lot when you design these algorithms. The first is that you don't want any random writes to non-local memory. We saw that here: yes, you have to do this write here to redistribute the data, but after that first partitioning you never write to any data that's remote, whereas in the other cases you did have to move things around. The next rule
is that if you do a sequential scan on data that's not local to you, the hardware prefetcher will be able to bring it over for you automatically and hide the access latency of reading things that are far away. And the last one, as we talked about before with the partitioned hash joins, is that you never want cores to have to synchronize with each other; ideally you want them to operate on the data they have locally and not worry about synchronizing with anybody else.

So now to the evaluation from the ETH paper. They implement the m-way and m-pass join algorithms that they talk about, as well as the massively parallel sort merge algorithm from the HyPer guys, but they also implement the radix-partitioned hash join that we talked about last class. This is running on a newer machine with four sockets of Xeons that each have eight cores, so in total four times eight is 32 real cores, 64 logical cores with hyper-threading, and half a terabyte of DRAM.

The first thing they want to measure is how their SIMD-accelerated sort compares against the off-the-shelf STL sort that you get in C++. STL sort, I think, is a hybrid sort: it does quicksort in the beginning and then switches over to heapsort. And this just shows that their SIMD-accelerated sort algorithm always outperforms the STL sort, by about 2.5-3x, and that almost matches what the Intel guys reported for the speedup you can get using SIMD over SISD instructions, so it's nice that it corroborates that previous finding.

Now the join comparison. They have a synthetic data set where the outer relation has 1.6 billion tuples and the inner relation has 128 million tuples, and these tuples are 8 bytes in size. What you see here is that the multi-way sort merge join outperforms all the other algorithms; the HyPer one actually performs the worst. They're all paying the same partitioning penalty, but, and I should be clear here, this bar is the merge of the sorting phase and this bar is the merge of the join phase, and in the HyPer case, even though they claim the hardware prefetcher will hide the memory access latency, you see that they pay a larger penalty here. I think they're measuring this in terms of the number of cycles, so not the number of instructions but the number of cycles the CPU spends executing the algorithm. If you also measure throughput, again you see that the multi-way one outperforms all the other algorithms. So this is showing you that you pay an extra cost to execute the extra instructions to do all this shuffling of data, but with SIMD that's really, really fast, and in the end you also reduce the number of cache misses, because all the data you need is local to you.

Another measurement they did is how these algorithms are affected by hyper-threading. Again, they have 32 hardware threads and 64 logical threads in total with hyper-threading. What you see in this graph, where I'm showing the x-axis on a log
scale, is that as you double the number of threads processing the join, you want to get double the performance, and you're almost scaling linearly for all these algorithms, except that at the higher core counts the HyPer algorithm starts to fall off, and then once hyper-threading kicks in it actually gets much worse performance. So this is saying that you get almost linear scale-up with the multi-way join until hyper-threading starts; once you're not dealing with real hardware threads anymore, you're not going to get better performance. At that point they can process 350 million tuples per second, while this one can only do about 105 million. Again, the main takeaway here is that the actual work we're doing to organize our data pays off in the end.

In this last comparison they throw in the radix hash join. Now remember, I said last class that radix partitioning may not always be the right idea for a particular workload, and this paper didn't discuss the no-partitioning case. Again we have the breakdown between the different phases of the sort merge algorithm, and they combine the build and probe phases into a single number. Across the board the hash join always does better; the only one worth looking at here is this one, where things actually get relatively close. And what you see for the radix partitioning is that they're spending a ton of time doing the partitioning for the hash join, because the data sets are huge: you're joining two tables that together have almost two billion tuples, which is a pretty massive join, so it's not a surprise that the partitioning is the main bottleneck, the main consumer of CPU, for this experiment.

And then the last one here: they compare how the algorithms differ as you scale up the number of tuples in your input relations. Again, for smaller tables the hash join is always going to be better, but as you go across to the really large table sizes they become essentially equivalent, and I would argue this is because the hash join is spending all this time doing the partitioning. It would be interesting to see what this looks like, if you didn't do the partitioning, as it continues to go across.

Okay, so I said in the beginning that this doesn't work. Before we get there, are there any questions? Okay, so why doesn't this work? In all my examples, let's go back to, say, this one here: what am I sorting? Keys. What's missing? Tuple IDs, exactly. So what do we end up with? We end up with a bunch of vectors of sorted numbers. Does that help us? No, because we have no idea how to take those numbers and say, oh yeah, that corresponds to tuple two and you're tuple three; all we did was sort numbers. So the reason this doesn't work is that I'm not including the tuple IDs, and in an in-memory database system we're going to have 64-bit tuple IDs, because otherwise we could only have 2^32 tuples (about four billion) in a table, and I clearly showed examples where we want to be in the billions, so that's not going to work. So how would you actually store the tuple IDs to make this work? In the case of the sorting network, how would you include the tuple IDs? You could do something really naive; you could say,
all right, well, I know I have values 1, 2, 3, 4, those are the keys I'm sorting, and tuple one has value one, tuple two has value two; so when I get the sorted data out of the SIMD register I can go back, look it up, and say, oh yeah, this is value four, that's this tuple over here. That's terrible, right? That would be super slow and defeats the whole purpose of SIMD. The way you would actually have to do this is to pack the tuple ID in with the value: you have the upper bits of the entry be the join key and the lower 64 bits be the tuple ID. And in that case all the sorting stuff still works, because it's that first part of the value that always determines whether one entry is higher than another; it doesn't matter what the tuple ID is, it's the upper part of the value that we're sorting on.

Now, I said before that we have to make sure we have aligned values, so if we have a 32-bit join key with a 64-bit tuple ID, we can't pack them together into a 96-bit value; we have to pad it out to 128 bits. And the problem, the reason everything I showed before doesn't work, is that current Xeon processors only have 256-bit SIMD registers. That means you can only store two 128-bit values, each comprised of the join key and tuple ID together, in a register, and that's essentially useless for us: in the sorting network example I showed, we need to sort four values at a time. So none of this actually works. And you'll see this in the paper we read next week from Columbia: they assume four-byte tuple IDs, so that everything fits in a single 64-bit word, but again that's not realistic in a real system.
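As a tiny illustration of the packing idea, here's what that "everything fits in a 64-bit word" assumption looks like: a 32-bit join key in the upper bits and a 32-bit tuple ID in the lower bits, so comparing the packed words orders by key first. This is my own sketch; with the 64-bit tuple IDs I'm arguing you really need, you would have to use 128-bit entries instead, which is exactly where the 256-bit registers run out of room.

```cpp
// Sketch of packing a 32-bit join key with a 32-bit tuple ID into one
// 64-bit word. Comparing packed words compares the keys first, so the
// sort networks can operate on these directly.
#include <cstdint>

inline uint64_t pack(uint32_t key, uint32_t tid) {
  return (static_cast<uint64_t>(key) << 32) | tid;  // key in the upper bits
}

inline uint32_t unpack_key(uint64_t entry) {
  return static_cast<uint32_t>(entry >> 32);
}

inline uint32_t unpack_tid(uint64_t entry) {
  return static_cast<uint32_t>(entry & 0xFFFFFFFFu);
}
```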
The reason I want to be clear that it's the Xeon general purpose CPUs and not the Phi is that the general purpose Xeons only have AVX with 256-bit registers, while the Xeon Phi has AVX-512, so you can store four 128-bit values in its SIMD registers. You can do everything we talked about here using the Xeon Phi, but you can't do it on the regular Xeons. Now, the Xeon Phi I'll talk about next class, but think of it like this: the original version of it sat on the PCI Express bus. It wasn't like a GPU that you could plug in and use to drive the graphics for your monitor; it was just a co-processor that could only crunch numbers and send data back to the CPU. So it has the same problem we talked about before with GPUs: you have to send the data down over the PCI Express bus, let it do the sorting, and then move the data back up to the CPU, which again is bad and slow. I will say there are newer versions of the Xeon Phi that came out last year that actually sit in the socket and can be cache coherent with the other CPUs, but most people don't have that; if you go on Amazon you don't get that on EC2, so it's actually pretty rare. So again, I think this is important to discuss because it's an interesting use of SIMD, but it just doesn't work. I had a student implement it last semester and we measured it; it just doesn't work, so for now it's a bad implementation, but we built it because we know that when they finally add 512-bit registers to the Xeons we can use it. This is why I'm saying that when the Xeons get updated and support 512-bit registers, sort merge might actually then outperform hash joins, but we have a ways to go.

Okay, so what are my parting thoughts? Even though I just told you that everything I talked about today doesn't actually work, it's still important to have both join algorithms. Even if you can't accelerate the sort merge one, there may be cases where you actually want to choose it over the hash join, and the reason is that sometimes queries want the data sorted. If you have an ORDER BY clause that wants your data sorted on the join key, then the sort merge join, even though it's not hardware accelerated, is going to be better than the hash join, because the data will already be sorted when you complete the join and you don't have to do an extra round of sorting, whereas with the hash join, because it's being hashed, the output order is basically random and you have to do additional sorting. This is why every major database system that wants to support complex analytics will support both algorithms, so that the optimizer can choose one implementation over the other. Everything we talk about here is just the algorithms running in isolation, but in a real system you have to understand that a really complex query has a lot of different dependencies, and you may want to choose one of these algorithms over the other. And even if you don't want to do a join, if you just want to do an ORDER BY on an entire table, then if you had the hardware-accelerated sort you could rely on that technique to speed things up as well, because you don't need a merge phase, you just need the sorting phase. In our current system we support hash join, and I think we also have a sort merge join; I don't know how well tested it actually is, but again most systems will have both.

For next class we're discussing a really interesting and really important topic called code generation, and what I'll say is that codegen will actually get you a much bigger performance improvement than the SIMD stuff we'll talk about afterwards. I'm actually debating whether we want to spread the codegen material over two lectures; the vectorized stuff is currently scheduled for two, but I might steal some extra time from one of those lectures and talk more about codegen, because it makes a big, big difference in performance.

The other thing, again, as a reminder for everyone: the first code review you're going to have to do for project number three is scheduled for April 11th. The code review has two parts: there's you getting your code ready for the group you've been assigned to, providing them your code so they can look it over, and then after that you have a week to look over the code they give you and provide them feedback on how their implementation is going. Obviously you need something to show to somebody else, because if you give them nothing, that's bad for you and it wastes everyone's time. So be mindful that this deadline is coming up in less than two weeks, and you have to be ready to turn your code over to the other team. I'll send out an email this weekend with who you're going to be paired up with. This is why, when you look at the spreadsheet on Google Docs for the class, there's a column that says to put the GitHub URL for your project: that way the other team will know where to look. Okay? Any questions?
Alright guys, have a good weekend and I will see you