So, for today, we're going to talk about the other major join algorithm in a database system: the sort-merge join. Last class, remember, we talked all about hash joins, and we said that hash joins are the most important operator in a database system for OLAP queries. So now we're going to look at the other major join algorithm, the sort merge. We'll start off talking about the background of sort merge at a high level, and then we'll spend a little bit of time talking about SIMD instructions. SIMD is single instruction, multiple data. These are special CPU intrinsics that do vectorized execution inside your database system, and we're going to need them in order to make this join algorithm go faster. We didn't need to do SIMD stuff for the radix hash join we talked about last class, but to make the parallel sort-merge join run fast, we have to use SIMD. We'll have a whole other class where we talk about SIMD and vectorization later on, but I'm going to give you a quick preview of everything at a high level that you need to know to understand vectorization and make parallel sort merge work. Quite an entrance there. You okay? Okay, he's okay. All right, and then we'll do an evaluation comparing these different algorithms, and then we'll spend some time at the end covering some of the hate mail we've gotten for the class. I'm not going to use names; we can discuss things.

Okay, so the sort-merge join algorithm itself, the canonical version, is pretty straightforward. Basically, you have two phases. In the first phase, you're going to sort the outer relation and the inner relation based on the join key. And then in the second phase, you're going to merge the two together and find matches. The merging is basically just walking through the sorted relations one by one and doing a comparison on the join key to see whether you have a match. Typically, you only have to scan the outer relation once; in this case, it's R. Depending on whether you have duplicates or not, you may have to backtrack in the inner relation and go back and scan through pieces of it over again. This is the standard textbook implementation of a sort-merge join, and it essentially looks like this. Here's a visualization of it. Say we have our two relations, R and S. They start off unsorted on our join key. So in the first phase, we'll do the sort; we'll execute some sorting algorithm, pick whichever one you want to use. And now we have these sorted runs of each table. Then, when we do the merge join, we just walk through each tuple one by one. If we're doing an equijoin, we compare the tuple at the current position in the outer relation with the current position in the inner relation. Every time we move down one in the outer relation, we check to see whether we have a match in the inner relation. If so, we emit it in our output. If not, we move on to the next one and do the comparison again. So you're just going through each of these sequentially and you produce the output.
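(To make that merge phase concrete, here is a minimal sketch in C++; it's mine, not from the lecture or the paper, and it assumes both inputs are already sorted on an integer join key. The outer relation R is scanned once, and the cursor into the inner relation S backtracks when R contains duplicate keys.)

    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Tuple { int key; /* ... payload ... */ };

    // Merge phase of a sort-merge equijoin: both R and S must be sorted on key.
    std::vector<std::pair<Tuple, Tuple>> MergeJoin(const std::vector<Tuple>& R,
                                                   const std::vector<Tuple>& S) {
        std::vector<std::pair<Tuple, Tuple>> out;
        size_t r = 0, s = 0;
        while (r < R.size() && s < S.size()) {
            if (R[r].key < S[s].key) {
                r++;                          // outer is behind, advance it
            } else if (R[r].key > S[s].key) {
                s++;                          // inner is behind, advance it
            } else {
                // Emit all S tuples matching this key; remember where they start.
                size_t s_start = s;
                while (s < S.size() && S[s].key == R[r].key) {
                    out.emplace_back(R[r], S[s]);
                    s++;
                }
                r++;
                // If the next outer tuple is a duplicate key, backtrack in S.
                if (r < R.size() && R[r].key == R[r - 1].key) s = s_start;
            }
        }
        return out;
    }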
So the history of the sort-merge join versus the hash join is like a debate that's been going on in database systems since the very beginning. Before there was NoSQL versus NewSQL, or NoSQL versus SQL, the major debate was whether the sort-merge join was better than the hash join. In the very beginning, in the 1970s, the IBM guys came out with a paper based on System R, and they said that the sort-merge join is actually preferable to the hash join: it's going to be faster and you get better performance with it. And you have to understand, at the time, DRAM was obviously very limited. Back then, they didn't have a hash join that could work on data that was both in memory and out of memory; the hash join back then required the entire hash table to be in main memory. Whereas for the sort-merge join, they could just rely on the canonical external merge sort algorithm, which allows you to sort data sets that are larger than the amount of DRAM that's available. So back in the 70s, they said sort merge is the way to go.

Then around the early 1980s, the hash join came into vogue again. This was partly because of all these new, what we call database machine systems. Nowadays you'd maybe call them database appliances. Think of it as specialized hardware, like a one-rack unit that was designed specifically to execute database workloads. They don't really make database machines anymore, but think of it as a specialized system just for doing database stuff. These database machines had specialized hardware that could do hashing really fast. And what they found, starting in 1983 and up until about 1988, was that the hash join was faster than the sort-merge join.

Then in the 1990s, one of the optional papers on the website was this comparison of the sort-merge join and the hash join, the "sort versus hash revisited" paper. What they basically found in that paper was that the two are more or less the same. At a high level, semantically, the two algorithms are doing the same thing, and the performance is roughly the same. Then in the late 2000s there was some more work, which was also one of the optional papers we read; they found that hashing was now actually faster than sorting. And now we're in the 2010s, and the paper you guys were required to read is the latest incarnation, the latest round of comparing these two join algorithms. If you've read the paper, you know what the answer is, but we'll get to it at the end.

So this is the state of the art, where we're at now in terms of join algorithms. There are three major works that were again part of the readings. The first is a paper in 2009 by Oracle and Intel, where they found that the radix hash join we talked about last class is faster than a parallel sort-merge join algorithm. But what they noted is that if you have wider SIMD registers, and I'll explain what I mean by that in a second, then you can actually do better with sort merge. The Intel hardware at the time didn't support the wide registers they were talking about, but eventually at some point it could, and therefore the balance might switch back over to sort merge. Then in 2012, there was a paper by the HyPer guys out of Germany, where they came out with a parallel version of sort merge that was NUMA-aware, meaning mindful of the layout of data in memory. And what they found is that their sort-merge join algorithm was already faster than a hash join, even without having to use SIMD instructions.
So then the paper you were assigned to read came out of the Systems Group at ETH Zürich. They had a paper in 2013, right before the one that you guys read, where they have a faster parallel version of the radix hash join that actually outperforms the parallel sort-merge join that the HyPer guys came out with. We'll go through all of these one by one, but this is where we're at now.

So, the SIMD stuff. The way to think about this is that these are special instructions in the CPU that allow you to vectorize execution on data. Typically you think of something like number plus number equals something; that's considered a SISD instruction: single instruction, single data. With SIMD, you take one single instruction that you execute in the CPU, and that instruction is able to do that operation across multiple data items. The terms SIMD and SISD actually come from Flynn's taxonomy of parallel architectures, which came out in the 1960s; it's basically a way to describe different execution paradigms or architectures in a CPU, and SIMD is one aspect of it.

The current Intel and AMD chips all have micro-architectural support for SIMD operations. When this first came out in the mid-1990s, Intel had MMX and AMD had 3DNow!. These early variations had a lot of problems; for example, while you used them, you couldn't execute regular operations on the CPU at the same time. Where we're at now, which is what we'll talk about for this class, are the streaming SIMD extensions: SSE, SSE2, SSE3, and AVX, the latest one. These are the ones we're going to focus on here. Again, the basic idea is that they allow us to take multiple data items, execute one instruction, and get multiple outputs.

I remember when I was younger, in the 90s, back then you worried about the megahertz of your CPU and things like that. And I remember we had a Pentium II or III that had MMX. I didn't understand what the hell it meant. It was total Intel marketing: this special thing you definitely wanted, even if nothing actually ever used it. You can also think of this in terms of what GPUs are doing. GPUs are doing the same thing, vectorized execution across a lot of data items at the same time, and they have tons more threads than a general-purpose CPU does.

So the one we're going to focus on is the SSE stuff. These are basically a special collection of instructions that target 128-bit registers. The way to think about this is that you have your regular DRAM, and then you have these special registers that you can read and write using intrinsics, the SSE instructions. With SSE you have a 128-bit register, which means you can pack four 32-bit integers or scalar values into it. Then you invoke one operation, and it outputs another 128-bit register. This was introduced by Intel in 1999, after the MMX stuff came out, so it's like the improved version of all that.

So let's look at a simple example. Say we want to do a vector addition: we want to do x plus y equals z. If you were to write it out in math, it would look like this.
If you had to write it out in code, it would just basically be a for loop where you iterate through every single value of x and y, add them together, and write the result to z. So to do this with SISD instructions, you loop through every single element of the two vectors, apply the addition operator, and output it to your output buffer. You just go down the line and do this one by one. Each time, you're looking at two numbers, and it's one instruction to execute that addition. Yes, it's very fast, because modern CPUs can do addition very quickly, but you're doing it over and over again.

Now with SIMD, instead of going through a for loop one by one, we're going to combine the first four elements into a 128-bit SIMD register; we're just packing the values one by one into the slots. And now it's one invocation of the SIMD instruction, and it outputs the result of that operation in another 128-bit register. Then we do the same thing for the next four elements: one SIMD instruction, and boom, we have our output. So what was before eight addition operations to go through every single element in the vectors is now, with SIMD, just two. That's four times fewer instructions. So you see why the SIMD stuff is really important, and we can exploit this in our algorithms to get much better performance.

So it seems like a magical thing we would want to use for everything, and yes, the performance difference you can get is quite significant. The downside is that it's kind of tricky to use. There's no compiler that I know of that handles auto-vectorization for any possible algorithm you could have. So typically what happens is that, in order to take advantage of SIMD, you as the database developer have to carefully code whatever it is you're doing to make calls to the CPU intrinsics. We talked about CPU intrinsics before; what are they? At a high level, although you make a call as if it's a function, it's really a compiler directive that gets translated exactly into the assembly instructions for whatever operation you want to apply. It's not like making a function call and looking something up in a function table to figure out what to call; it's like inline assembly that the compiler generates for you. For all the different scalar types you can have, there are all these different SIMD instructions that let you write values into these registers, pull them out, and do other types of things.

We'll discuss more about SIMD instructions when we talk about vectorization, but the main thing is that you as the programmer are probably going to have to spend time writing the code using intrinsics in order to use them; the compiler is not going to do this for you automatically. And you also have to be very careful about how you align the data in the registers. In the example I showed before, it was a 128-bit register, which is big enough to hold four 32-bit values, but we have to be very careful when we pack in other things. And moving data in and out, scattering it around, and doing shuffling is very tricky and at times can be quite inefficient.
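(Here is a minimal sketch of that vector addition written with SSE2 intrinsics; the example is mine, not from the lecture. The _mm_add_epi32 intrinsic adds four packed 32-bit integers in one instruction, so the eight scalar additions become two SIMD additions.)

    #include <emmintrin.h>  // SSE2 intrinsics
    #include <cstdio>

    int main() {
        // alignas(16) so the aligned load/store intrinsics are legal.
        alignas(16) int x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        alignas(16) int y[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        alignas(16) int z[8];

        // Two SIMD additions instead of eight scalar ones: each iteration
        // loads four packed 32-bit ints per input, adds them in one
        // instruction, and stores four results.
        for (int i = 0; i < 8; i += 4) {
            __m128i vx = _mm_load_si128(reinterpret_cast<const __m128i*>(&x[i]));
            __m128i vy = _mm_load_si128(reinterpret_cast<const __m128i*>(&y[i]));
            __m128i vz = _mm_add_epi32(vx, vy);
            _mm_store_si128(reinterpret_cast<__m128i*>(&z[i]), vz);
        }
        for (int i = 0; i < 8; i++) printf("%d ", z[i]);  // 11 22 ... 88
        printf("\n");
        return 0;
    }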
In the newer versions of the SSE extensions, they have the ability to move data from one SIMD register to another so that you don't have to go through DRAM or the CPU caches. So there are a lot of ways to make this work more efficiently than if you treated it as just a thing you dump data into and out of; you can move between the registers very quickly. Again, we'll discuss all the SIMD stuff later on, but this is the high level of what we have to care about.

So you may be asking at this point: why not just use a GPU instead of the SIMD stuff? Because GPUs are designed to do all this vectorized execution very, very efficiently. The answer is that, as far as I know, except for one particular system, there's no general-purpose database system out there today that uses GPUs for anything. And the issue is that the bandwidth between DRAM, the CPU's memory, and the memory of the GPU itself is really slow, because you have to go through the PCI Express bus. That means that in order to compute some query on a GPU, you'd have to take the data you wanted to process, load it into the GPU first, then do whatever parallelized execution you wanted to do, and then extract the answer out. And that can be really slow.

The only system I know of that uses a GPU is a system called MapD. It's actually kind of a cool story. There's this guy who was doing a PhD in humanities at Harvard, who decided to walk down Vassar Street to where MIT is and took their database course. He ended up building his own query engine to help him visualize tweets from the Arab Spring using GPUs. They eventually made a company out of it, and that's what MapD is. The way they get around the problem of the slow bandwidth on the PCI Express bus is that they just load the entire database into GPU memory, and every single query is a complete sequential scan across that entire data set. And it's really fast, because they have a ton of threads and can do this very quickly. But that means the database is essentially read-only, because it's sitting in the GPU; they can't run transactions on it, they can't do fast updates and things like that. So we'll discuss more of this when we talk about vectorization, but in general, GPUs aren't going to help us.

Now, there are a bunch of emerging coprocessors coming out in newer hardware that are able to share the CPU's DRAM without having to go over the PCI Express bus. AMD has their Accelerated Processing Units; Intel has their Knights Landing chip, the Xeon Phi. So there's some interesting hardware coming out in the future that can do vectorized execution like the SIMD stuff but can access DRAM directly. I think these are some cool things we can look at later.

OK, so that's a crash course in five, ten minutes on the SIMD stuff. Again, we'll discuss more of it in the lecture after spring break. All right, so now we want to talk about how to do parallel sort-merge joins. Obviously, the most expensive part of a sort-merge join algorithm is always the sorting. The merging is pretty straightforward; the sorting is what's going to kill us. Before, we would just say: run quicksort or whatever off-the-shelf sorting algorithm you want, that's good enough.
But now, if we want to take advantage of our current hardware landscape, meaning not just the NUMA regions and the interconnect between them but also having a lot of CPU cores, then we want to be a little more careful about how we do the sorting phase. So I'm going to go through the different phases of a sort-merge join algorithm in more detail, and this is all based on the paper you guys were assigned to read.

The first phase, although it's not always first, is the same kind of partitioning step that we saw in the radix hash join: we take our input tables, split them into chunks, and assign the chunks to different sockets that have their own NUMA regions. The idea is that we want to partition the data such that each core only operates on data that's local to it. For this particular algorithm we're going to look at, we're only going to partition the outer table, which should be R. Then in the second phase, we obviously do our sort based on the join key. And then we can do our merge, where we walk through both relations one by one and try to find matches. We're going to go through each of these phases one by one.

For the partitioning phase, I want to describe this in terms of what I'll call explicit partitioning and implicit partitioning. Implicit partitioning would be the morsel stuff that we talked about before, where when the data was loaded, the database system made a decision that this region is going to go to this socket and that region is going to go to that socket. So when we start the sort-merge join algorithm, our data is more or less already partitioned in some manner across the different cores. I'll differentiate this from explicit partitioning, where regardless of whether we've done the morsel thing and split things up at the beginning, we explicitly partition our tables based on the join key and redistribute the data across the different cores. So even though when we start we're already split up into uniform chunks on every core, we're going to scan through the table, look at the join key, hash it using the radix partitioning approach we talked about before, and then move the data to some other socket. Different algorithms can choose to do this or not, and we'll talk about the trade-offs as we go along. So that's what I mean by explicit versus implicit: implicit means the data is already partitioned; explicit means we're doing extra work to redistribute it.
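(As a rough illustration of that explicit partitioning pass, here is a hypothetical sketch of mine: each worker scans its local chunk and routes every tuple to a destination partition computed from a few bits of the join key, so a given key range always lands on the same socket. The function name and choice of bits are made up for illustration.)

    #include <cstdint>
    #include <vector>

    // Route each join key to one of 2^num_bits partitions using the radix
    // bits starting at 'shift', as in the radix partitioning from last class.
    void RadixPartition(const std::vector<uint64_t>& keys,
                        std::vector<std::vector<uint64_t>>& partitions,
                        int num_bits, int shift) {
        const uint64_t mask = (1ull << num_bits) - 1;
        for (uint64_t key : keys) {
            partitions[(key >> shift) & mask].push_back(key);
        }
    }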
Now we'll talk about the sort phase. In the sort phase, what we're going to produce are what we'll call sorted runs, and a run is basically some contiguous list of tuples that has been sorted. You can think of it like this: in the beginning, you partition the table into disjoint chunks, and we create a run when we sort one of them. A run doesn't necessarily have to be an entire chunk of data; a chunk can be broken up into different sorted runs that are locally sorted but not globally sorted. Again, in the old days quicksort was good enough, but now we need to be more careful when we deal with NUMA and parallel architectures.

Essentially, what we're trying to achieve is what's called cache-conscious sorting, and that means we want to be mindful of the amount of cache we have at the different levels in the CPU, and be aware of what memory is close to us on our NUMA socket. I don't know a better way to describe this, and they don't use this term in the paper, so this is my own taxonomy, but the basic idea is that we want to have different levels of sorting, and as we get to larger run sizes, we switch to a different approach. At the very beginning, what I'll call level one, we want to generate sorted runs that can fit in our SIMD registers. Then, once we have all these sorted runs and need to start combining them, because in the end we want one giant global run of the entire table, we switch to level two and use a different algorithm that generates runs that fit in our CPU caches. And once we're beyond our CPU caches, we switch to level three, where we deal with out-of-cache sorting and may have to read and write things from DRAM.

I'll go through each of these one by one, but here's a high-level overview of what's going to happen. Here's one of our relations; in the very beginning, it's completely unsorted on our join key. This is separate from whether it's clustered on something or not; assume we're joining on something it's not sorted on, so it's completely unsorted. At level one, we generate runs that fit in our registers. At level two, we generate runs that fit in our CPU caches, and in this case a run has to be half the size of our L3 cache. Why half the size of the L3 cache? You need half for the input and half for the output. Yeah. And then once we go beyond half the L3 cache size, we switch to level three, where we generate sorted runs that are larger than our cache, and we end up in the final stage with our complete sorted table.

So we're going to go through each of these one by one. For level one, again, this is the in-register sorting method, and we're going to use a technique called sorting networks. Sorting networks are an old idea that was developed back in the 1940s, and the basic way to think about them is as an actual hardware device comprised of wires, where each wire carries the value of one of the input slots, and comparators, which compare whatever values exist on two wires at that point in time and put the smaller value on the top wire and the larger value on the bottom wire. So say in the very beginning our input sequence is nine, five, three, six. In this first part here, the value nine goes to this point and the value five goes to this point, and because five is smaller it goes on top, and nine, being larger, goes on the bottom. Same thing for this one here: three stays on top, six goes on the bottom. So at this point, after these comparators, six is the value on this wire, three is the value on this wire, nine is the value here, and five is the value here. This then feeds into the next comparator, where we do the same comparison again; now three is on top and five is on the bottom.
Now at this point, we see that for this wire there's no other comparator, so we know we'll never have to do another comparison on it, and we can immediately put that value in our output buffer. We do the same thing for the next one, six and nine: nine doesn't have anything else after it, so it goes to the output; then five and six, and we get our output here. So now our output is the sorted list of what was given to us as input.

The key thing to understand about sorting networks is that the layout of the wires and comparators is always the same regardless of the values in the input list; it only depends on the size of the input list you're sorting. What I mean is: no matter what values I have in here, as long as I have four values to sort, the layout of the comparators on the wires is always going to be exactly the same. The significance of that is that it makes sorting networks really efficient to implement in code, especially using SIMD instructions, because there's none of the branching you would have in quicksort or another sorting algorithm. There's no "if this value is less than that, go this way; if it's greater, go that way." It's always min goes here, max goes there. That makes it super easy to implement, and we can use SIMD to speed it all up.

So let me show how to do this with SIMD registers. Say we have four different sequences of values with four elements each, because we have our 128-bit SSE registers and we can put four 32-bit values in each, and we want to sort these. In the very beginning, we only have to execute four load instructions to take the values that exist in DRAM and load them into our registers. Then in the next step, we want to sort across the registers. In this case, we're not sorting the lists horizontally; we're sorting vertically. And we can just use min and max to make this work; we don't need an if-branch. We use the SIMD min and max instructions to compute this, and we only have to do ten of them. This is essentially the same sorting network wiring diagram that I showed before.

But now the problem is that we're sorted vertically, not horizontally, and what we want is a sorted run that fits in one of these registers, going across. So we want to transpose these values so that instead of being sorted vertically, they're sorted horizontally. There are shuffle operators in SIMD that let you do this manipulation, moving values around without having to go back to the CPU caches; you can do it directly on the SIMD registers themselves. Then, to get this data out of the registers and into memory, which is going to be the L1 or L2 cache, we use a store instruction to extract the values and move them out. Again, we're not globally sorted: the first value of one run can still be less than the last value of another run. We're only sorted within each group of four values. But that's okay; that's all we want from our level one sorting.
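(Here is a sketch of those ten min/max instructions, assuming SSE4.1 for the packed 32-bit min/max intrinsics; the code is illustrative, not from the paper. Each compare-exchange is one comparator from the wiring diagram applied to all four lanes at once; the transpose using shuffles, omitted here, would follow to turn the sorted columns into sorted rows.)

    #include <smmintrin.h>  // SSE4.1 for _mm_min_epi32 / _mm_max_epi32

    // Branch-free compare-exchange: afterward, 'a' holds the element-wise
    // minimums and 'b' the maximums, like one comparator applied to all
    // four lanes at once.
    static inline void CompareExchange(__m128i& a, __m128i& b) {
        __m128i lo = _mm_min_epi32(a, b);
        __m128i hi = _mm_max_epi32(a, b);
        a = lo;
        b = hi;
    }

    // Sort four registers "vertically": lane i of r0..r3 becomes a sorted
    // column. Five comparators = ten min/max instructions, matching the
    // count in the lecture.
    static inline void SortingNetwork4(__m128i& r0, __m128i& r1,
                                       __m128i& r2, __m128i& r3) {
        CompareExchange(r0, r1);
        CompareExchange(r2, r3);
        CompareExchange(r0, r2);
        CompareExchange(r1, r3);
        CompareExchange(r1, r2);
    }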
Yes? The question is, why don't we sort horizontally in this step rather than vertically? Because the SIMD instructions can't sort within a register; you can't run min and max on things that are in the same register. Think of these as registers: in the example I showed before, when I did the vector addition, it was this register plus this register equals some other register. So we're taking this slot and this slot and doing min and max on them. Good question. Anybody else? Okay, yes. Correct, yes. The question is about wider registers: when the Intel guys said "if we had wider SIMD we could do a better sort merge," they meant you would have more elements going across here. I think the newer instructions have 256-bit registers; for now, for simplicity, we're doing 128. And obviously, if you have a 64-bit long, that makes this a little more complicated as well.

Okay, so we're done with level one, and we now have a bunch of these four-value runs that are sorted. Now we need to start merging and combining them together; this is what level two is. One thing that's confusing about the paper is that there are two merging operations: there's the merge within the sort phase of the sort-merge join, and then there's the merge of the merge join. So I'll try to be very precise about which kind of merge we're talking about. At this point here, we're doing the merge of the sort phase, not the merge of the merge join, okay?

All right, so to combine all these register-sized sorted runs into cache-sized sorted runs, we want to use what's called a bitonic merge network. You can essentially think of it as a larger version of the sorting network we talked about before, but instead of sorting within a single register, it merges across multiple locally sorted registers and produces a larger sorted list. And the idea is that we keep using bigger and bigger merge networks to merge larger and larger sorted runs until we fill up half of our cache size. The approach I'm going to talk about at a high level comes from a paper written by Intel in 2008, where they showed that their implementation of a bitonic merge network using SIMD gets about a 2.25x to 3.5x speedup over a single instruction, single data implementation. That's pretty significant, right? Think about sorting: it's a pretty old topic. Quicksort and a lot of that work was done in the early days, back in the 70s, and there's no magic, unless you can prove P equals NP, that's going to make sorting fundamentally faster. So getting this kind of speedup by taking advantage of new hardware resources is pretty significant.

The bitonic merge network looks like this; this one is doing a four-by-four merge, and they showed in the paper how, for larger inputs, the whole network just gets larger. Our first input is a sorted run generated by a level one sorting network, and then we have another sorted run from level one, but in this case it's in reverse order. Basically, we do our sort using the sorting network in level one and then flip everything around so that the largest element comes first and the smallest element comes last. The reason we want to do this is that when we start doing the min/max comparisons, just like we did in the sorting network, we want to compare the smallest element of the second list to the largest element of the first list. Remember to think of this vertically: going down, we have our registers, and we want to compare this slot with that slot and produce an output that way. That's why we have to reverse the second run.
Then in the middle we have the transpose and shuffle phase, where we move things around to line them up, then do the next round of min/maxes, another shuffle, and then more min/maxes, and we end up with our sorted output here. So this sorted run will have eight elements, where we took two four-element lists as our input. And again, since these are all min/maxes, you can implement this with SIMD.

So now we get to level three. We can do the bitonic merge network thing up until we have runs that are half our cache size, and at some point we have to start going out to DRAM and swapping things in and out. So instead of using the bitonic merge network directly, we want to adapt it in such a way that we can do our out-of-cache sorting more efficiently. The basic idea is that we're going to use one worker thread per partition, and we're going to have multiple bitonic merge networks linked together: each one reads data in from a queue, does its merge, and produces output to another queue. There's going to be one thread that does all the merging. It does this sort of task-level, user-level scheduling, where it checks every queue; if a queue has an element, it does whatever merging it can do until its input queue is empty or its output queue is full, and then it blocks that task and switches to another task. This is not multiple threads jumping around doing this; it's one thread doing its own scheduling.

This sounds very expensive, this sounds like a lot of work, and what the ETH guys claim is that by doing this jumping around, doing a little bit of merging here and a little bit of merging there, and letting the data flow work itself out through these queues, you pay extra overhead, but in exchange you get better balance of the memory controller's bandwidth and better utilization over time. Maybe I'll switch to the diagram and explain what I mean by that. So here are our sorted runs from level two, and they feed into a bitonic merge network like we just talked about, and the output gets put into one of these FIFO queues. When there are elements in a queue that can be used by the next merge operator, it pulls them, does whatever it needs to do, and pushes the result to the next queue. The idea is that if there's nothing for this operator to pull from its queues, it just waits. It's a single thread, so it's not going to sit on a latch; it just knows there's nothing to do at this operator, so it jumps to another operator. And likewise, this one farther downstream doesn't have anything to work on, so it just waits.
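(To show the structure of the bitonic merge kernel that each of these merge operators runs, here is a scalar sketch of mine of an eight-element bitonic merge; the real implementation replaces each round of compare-exchanges with SIMD min/max and shuffle instructions. The second sorted run is reversed so the concatenation forms a bitonic sequence, and then three rounds of compare-exchanges at strides four, two, and one produce the sorted output.)

    #include <algorithm>
    #include <array>

    // Merge two sorted 4-element runs into one sorted 8-element run using
    // the bitonic merge pattern (branch-free aside from the fixed loops).
    std::array<int, 8> BitonicMerge8(const std::array<int, 4>& a,
                                     const std::array<int, 4>& b) {
        std::array<int, 8> v;
        for (int i = 0; i < 4; i++) v[i] = a[i];
        for (int i = 0; i < 4; i++) v[4 + i] = b[3 - i];  // reverse second run

        // Three rounds of compare-exchanges with strides 4, 2, 1.
        for (int stride = 4; stride >= 1; stride /= 2) {
            for (int i = 0; i < 8; i++) {
                if ((i & stride) == 0) {
                    int lo = std::min(v[i], v[i + stride]);
                    int hi = std::max(v[i], v[i + stride]);
                    v[i] = lo;
                    v[i + stride] = hi;
                }
            }
        }
        return v;
    }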
What I mean by having better balance between the memory bandwidth and the CPU resources is this: if you didn't have this splitting up and jumping around, what would happen is your thread would say, all right, I need to merge these two chunks together, let me go to DRAM and fetch them; then I spin my CPU cycles while I merge that data, and all the while my memory controller is doing nothing, maybe some prefetching, but probably not; and then when I'm done sorting those two chunks, I go back to pulling more data in, and now my CPU is idle while I'm pulling things in. In this case here, by jumping around, while we're waiting for something to get pushed to us, we can do computation on some other result. So in the end we're spending more cycles doing the scheduling ourselves, but we're making better usage of our hardware. Does everyone understand this? Yes? Say it again? Yes.

See, don't think of these merge blocks as separate threads. Think of one thread just jumping around all the time: do I have something to do? It recognizes this queue is empty, so it knows not to do anything here, and it jumps to something else that has work to do. The question is how you make the decision about what to work on next. You can keep track of a global counter. It's a single thread, so we don't have to worry about synchronization across these queues. We just know that this operator is waiting for input, and when another operator pushes something to its queue, we update the global counter, and then it knows it has something to work on. And this whole thing, as we'll see in a second, would be done by one core, because it's operating on some slice of the data, but all the cores are basically doing the same thing.

Yes? What's that? The question is what level of cache this is using, private or shared. So it isn't making a distinction between L1, L2, and L3 here; you can't tell the CPU to put this thing in L3 and that thing in L2, it just does it for you. The merge itself is using the bitonic merge network, so that's all SIMD as much as possible, but then you have to materialize the output: you take it out of the SIMD registers and put it in some kind of output buffer so the next operator can process it, and that's just writing to a DRAM location. So yes, when you do the first write it'll be in L1, and eventually, as you swap everything in, it ends up in L2 and L3, but you don't really have any control over that; the CPU does it for you.

The next statement is: won't you have memory ping-ponging? Because I run here and do as much work as I can, then I run out of stuff to do and jump over here, and now everything that was in L1 down here gets thrown out because I'm pulling new stuff in. Yes, absolutely. But they claim this is better, because the overall utilization is better than with localized execution. And yes, correct, each of these buffers is sized relative to the cache. Yes, the size of these buffers keeps doubling after every level, so does that cause a problem?
The question is whether the size of these buffers keeps growing because the merges keep producing larger runs. Yeah, so the queues can stay cache-sized, but yes, the runs will get longer as you go; there's no way around that, because you have to merge these larger things.

So then we come to the merge phase. We've done our level one, level two, level three sorting, and now we're going to walk through our sorted runs and again iterate through them one by one, check to see whether we have a match, and if we do, put it in our output buffer. I'm not going to discuss whether the output buffers are private buffers or shared buffers, like we did when we talked about hash joins; all of the same synchronization and merging issues are applicable here. There's not much you can do to speed this part up other than dividing the work up between the different cores, except maybe just-in-time compilation, and we'll talk about that later.

Okay, so the three high-level sort-merge join algorithms we can have are the multi-way sort-merge join and the multi-pass sort-merge join, which are the two from the ETH paper you guys read, and then the HyPer massively parallel sort-merge join (MPSM) algorithm. I'll go through each of these, and then we'll talk about the evaluation of all of them.

The multi-way sort merge is essentially exactly what I described to you in terms of all the steps. When everything fits in our caches and registers, we do level one and level two sorting, and then we redistribute the data we generate from level two across the different cores, doing the multi-way merging using the bitonic merge networks we just talked about. For the inner table, we do the exact same thing as the outer table, and then our merge phase is between pairs of chunks of the inner table and the outer table that are stored on the same CPU core.

So to show an example: we have our outer table, and in the very beginning let's say we do the same implicit partitioning that the HyPer guys did with their morsels, so our data is automatically split up between the different cores. In the first step, we do our sorting, and this is all on the local data that we have in our NUMA regions; this is the level one and level two stuff. Then we want to do our level three multi-way merge, and for that we transfer the data within a particular key range from the different cores and put it all on one core, which then does the multi-way level three merge; and we do this for all the other ranges. So now every single core has data that falls within some range. Then we do our sorting, and we now have sorted runs for these chunks. On the inner table, we do the exact same thing; I'm not going to draw out the full thing, I'll just say this sort piece here represents all of that. So now what you see is that, going across horizontally at a single core, all the data that was sorted within a range for this core matches up with the data on the corresponding core for the other table. So when we do our local merge join, all we have to do is compare the tuples in this chunk with the tuples in that chunk, without having to do anything across partitions, because we redistributed everything when we did the multi-way merge. So obviously this can be done in parallel: every single
core can blast through its inputs without having to synchronize with anybody else and write its output to whatever buffer it's using.

Yes? So the question is: what if you have one tuple here that would match everything over there? Then when you did this partitioning, this one chunk would just be massive. And yes, that's the worst-case scenario; there's no good way around that. I mean, that's rare; if you have a billion tuples, it's unlikely. The next question is about how the redistribution happens; it's not just "ten go here, ten go there." You could build a histogram and do the same radix partitioning stuff we talked about before and split it up that way. I haven't said how you decide which chunk goes where; radix partitioning is one way to do it, absolutely, but it has to match whatever you use on the other table so that they line up.

Okay, so the other approach, and I didn't actually fully understand this myself, is that instead of doing the multi-way level three merging, they only do the level two stuff. It's called the multi-pass, a naive multi-pass merge. You don't redistribute the data; you just sort whatever you have locally, and then you do a comparison across the different cores when you do the merge join. So this is a naive implementation of a parallel sort-merge algorithm that doesn't take advantage of the multi-way stuff we talked about before, and we'll see how it compares later on.

The last approach is from the HyPer guys. In this approach, even though the data is laid out in morsels ahead of time, they do explicit range partitioning of the outer table at the beginning and redistribute it across the cores, and then every core sorts whatever data it has locally based on the join key. For the inner table, you don't do the range partitioning; instead, every core just sorts whatever data it already has locally. Then, when you do the merge join, you have to compare across NUMA regions, across sockets: every single chunk of data in the inner table gets compared against one chunk of the outer table. So I'll explain this now.

All right, say we have our morsels already ahead of time, and now we do the cross-NUMA range partitioning. Again, we can use that same radix partitioning approach from before, where, as Matt was saying, we decide how to split things up based on the actual values of the data themselves rather than just randomly picking. Now each of these cores will sort whatever data it has locally. On the inner table,
we don't do redistribution; we just sort locally on whatever data we have. That means there could be tuples within one chunk here that match up with tuples in any possible chunk over there. So now, when you want to do the merge join, you have to go across partitions. Let's start with this one here: you'll scan through its entire sorted run sequentially, but for each inner run you only look at the segment that corresponds to the range partition boundary that this outer chunk represents. This inner run only has a portion of data that could possibly match this outer chunk, so we know we only need to scan that portion. Then we do this for all the other ones; we always do a sequential scan of the data within the outer table, and then we do the same thing for the next inner run, the same sequential scan, and across all the other ones as well.

This seems like a lot of work, right? Because we're doing multiple passes over these runs. But the HyPer guys make the argument that doing sequential scans on contiguous sorted runs in memory is much more efficient than doing the random writes you have to do in a multi-way merge sort, like in the first algorithm.

So they have a bunch of precepts, or edicts, that they claim are the way to implement a high-performance, efficient parallel algorithm in a database system. The first rule says you never want a core to do random writes to memory that's not local to it. That's why they do the redistribution in the first step, so they can split everything up at the beginning; and yes, that involves some random writes, but it's fewer random writes than you have to do in the multi-way case. The second rule is that if you have to read from memory that's not local to your core, then you always want to do it as a sequential read. The reasoning is that this allows the hardware prefetcher to prefetch memory from the remote NUMA region and bring it to you much more efficiently than if you had to jump around and read random things; the hardware prefetcher, which we'll talk about later, can hide some of the latency you get when you have to go over the interconnect. And the last rule, which we talked about before with radix partitioning, is that you never want a core to synchronize or block waiting for the output of another core, like you had to do in the multi-way redistribution. Here you don't have to do any latching or any
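(One way to picture the "only scan the matching segment" part: since each outer partition covers a known key range, a worker can binary-search each remote inner run for the start of that range and then scan sequentially only up to its end. This is a hypothetical sketch of mine, not the paper's actual code.)

    #include <algorithm>
    #include <utility>
    #include <vector>

    // Return iterators bounding the segment of a sorted inner run whose keys
    // fall in this worker's range [lo, hi); only this segment needs scanning.
    std::pair<std::vector<int>::const_iterator,
              std::vector<int>::const_iterator>
    MatchingSegment(const std::vector<int>& inner_run, int lo, int hi) {
        auto begin = std::lower_bound(inner_run.begin(), inner_run.end(), lo);
        auto end = std::lower_bound(begin, inner_run.end(), hi);
        return {begin, end};
    }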
synchronization barriers to make this work. So those are the high-level rules that guided why they designed their parallel sort merge the way they did.

Yes? The question is: why would random writes be fewer in this case than in the multi-way case? It's because you only have to do the random writes once, when you do this redistribution; then everything else can be done locally. If you go back to the other algorithm, all of these transfers are random writes to get the data over there. And that's sort of the point: you only have to do the redistribution for one of the tables, and the other side is all local. Yes, you're basically still doing the same redistribution here for the outer table, but the other table doesn't have to do it at all; in their case, they only do the redistribution on one side.

All right, so now for the evaluation. These are all numbers from the paper from the ETH guys. Within a single testbed system, they compare the three different sort-merge join algorithms we talked about, and they also throw in a parallel NUMA-aware version of the radix hash join, basically the same thing we talked about last class. For these evaluations they use pretty beefy hardware: a four-socket machine with Intel Xeons, where each socket has eight cores with two threads per core. So in total they have 32 real cores, or 64 if you count hyper-threading, and they have half a terabyte of DRAM.

The first thing they want to compare is how much faster the SIMD parallel sort algorithm is versus the off-the-shelf sorting algorithm you get from the STL. Along the x-axis, they scale up the number of tuples in the tables, and they measure how many tuples you can output over time. So this is comparing against the built-in STL sort algorithm in C++11, which I think is introsort, a hybrid algorithm that basically does quicksort at the beginning and then switches to heapsort. They compare that against the SIMD sorting stuff, the level one and level two stuff we talked about before. As you see, it outperforms the STL by a lot, especially when you only have a small number of tuples, and this is expected, because down here you're doing everything in registers and CPU caches; as you get to larger sizes, you have to go to DRAM more, so you pay a little penalty. In general, they're about 2.5 to 3x faster, and this roughly coincides with what the Intel guys were reporting as well.

So now we do our comparison of the different sorting algorithms. Along the x-axis we have the m-way, the m-pass, and then the HyPer implementation. The first thing you see is that the m-way outperforms every other one, and the HyPer implementation is actually the worst.
They break up the execution time of the different algorithms based on these four metrics here. I'm labeling the merge step of the bitonic merge networks within the sort phase as "s-merge," and the merge-join step, where you're actually comparing the sorted runs of the two tables, as the "m-join" portion of the output. In this case here, you see that the m-way has almost zero m-join cost, because it's done really fast in parallel on all the different cores, whereas in the HyPer case it's a huge part of the total, because they're doing that scan over the outer table at every single partition, over and over again.

These numbers here are in terms of cycles per output tuple, but if you actually want to measure throughput, it's essentially the inverse of that: as you'd expect, the algorithm that uses fewer cycles per output tuple will have a higher throughput rate. And here again, the HyPer guys are the worst.

We can also measure how well these algorithms scale when you increase the number of parallel threads running at the same time. This is running on a single machine that has 32 real cores, and this region here from 32 to 64 is where hyper-threading kicks in. What you'd want to see, if your algorithm were truly parallel and could scale linearly, is that as you double the number of CPU cores you get double the performance. For both of these algorithms, you more or less see that up until hyper-threading kicks in, which is again expected, because now you have two hardware threads sharing the same core and its local caches. But overall, the multi-way sort-merge join algorithm gets 315 million tuples per second, whereas the HyPer one can only do 105 million. That's a pretty significant difference between these two algorithms. So I consider the multi-way one here to be the state-of-the-art sort-merge join algorithm that exists today.

But it doesn't help us against the hash join: the hash join is still going to outperform the best sort-merge join algorithm. In this case here, we're comparing the multi-way sort-merge join algorithm versus the parallel NUMA-aware radix hash join algorithm we talked about last class, and we're comparing with different table sizes. Here we start with 120 million by 120 million, and then different sizes as well. What you see is that, with the exception of this one here, the hash join clearly outperforms the sort-merge join. This one here is actually kind of interesting: this is when you're joining a table that has 1.6 billion tuples with another table that has 1.6 billion tuples. Yes, the hash join is still better, but now they're converging and becoming about the same. But I would say this is a pretty large database; not everyone is going to have something this big. So regardless, the hash join always performs better.
So we can now do a little further investigation to see what's going on. If we vary the number of tuples, we see the same trend we saw before: when the number of tuples in the tables you're joining is smaller, the hash join performs much better, and as the tables get larger they start to converge, but the hash join is still always faster. This is why, last class, I said the hash join is the most important operator you can have in your database system; it's what pretty much everyone implements in an OLAP database system, because the performance is clearly much better. Even if you do all the possible tricks to make sort merge faster, the hash join is still going to be better. So, any questions about sort-merge join versus hash join?

As for why they converge, I think it has to do with being memory bound at that point; you just can't get things off the memory controller fast enough.

That's actually a good question. The question is: if the hash join is always better, why would you even bother to have a sort-merge join? And the answer is that you actually want both. One of the things I haven't talked about at all is that we have not considered the impact of having sorted output on other operators in the query plan. For example, say your query has an ORDER BY clause, and the ORDER BY matches exactly what the join key is. If you do a hash join, the output of the operator is unsorted, so you'd have to run a sort on it afterward. So if the query optimizer can recognize that the join is already sorting and there's an ORDER BY later on, further up in the tree, then even though the sort-merge join is going to be slightly slower, I may want to choose it, because then I don't have to do an additional sort. It depends, and this is what optimizers do: they estimate how much data is going to be produced by each operator. If the join produces, like, five tuples, then using the hash join may still be better, because sorting five tuples is nothing. This is why having good statistics helps you make better decisions. And you can do similar things elsewhere: if you have sorted output, you can possibly make DISTINCT go faster, you can make
aggregates go faster; and on the hash side, you can possibly reuse the hashing stuff to make those go faster as well. So I guess the main thing is: if your output needs to be sorted anyway and you know it's going to be a big result, you may want to use the sort-merge join. And this is something that all database systems should be able to figure out. It's actually surprising: MySQL didn't have a hash join up until like four or five years ago; they only had one join algorithm. And when it came out, in version 5.6 or something, there was big fanfare about hash join, even though it's clearly a thing you want your system to have. So again, we want both. We want our database system to make a smart decision at runtime, based on what the data looks like and what the query looks like, about which of these two algorithms to choose. But chances are, I would say, most of the time the hash join is going to be better.

Yes? So the question is about the level three merging part of the sort phase: isn't level three always going to be the slowest part of the sorting process, because you're going out to memory and you're not going to get any cache hits? Absolutely. And does having the first two levels optimized with SIMD actually help much? Meaning, in terms of Amdahl's law, if level three is the dominant part, do the other two actually help? I'd say yes, and we saw that number here: at this point you're doing the m-way sort, where not everything fits in your cache, and you're still faster. Think about it: the multi-way merging part of the sort phase is still doing that SIMD bitonic merge network sorting inside it; you're just looking at small chunks and combining them together. So making that efficient matters, because that's the thing you're running all the time, and it will definitely speed things up. Any more questions? Okay.

So let me spend some time talking about some of the hate mail I've gotten. There are people watching these videos; some people think they're funny, some people think, you know, they're dangerous or whatever. So let's go through some examples of what people have written to me about the videos we've been putting online for the course. And I'll say up front: I'm not shook, I'm not threatened. We're still going to do this; we're still going to, you know, do databases hard, keep coming to class, and keep getting things done. The first email is from, I'll just use initials, EPCJ from Brooklyn.
He says my database skills are weak, that I look like a freak that hasn't showered in weeks in these videos, and that he feels sorry for the students in the front row. If he was here, you gonna mug me? That guy. The next one is JL from Seattle. He says: I hate you, I hate your course, I hate your fake-ass lessons from the street. If I see you out in the cut, I'mma stab you, make you bleed. I hope you burn in hell. No game, nothing. The next person is KB from Syracuse. They write: what is wrong with your face? Why do you look like a pile of hot brown dogs? I hate your database course. And then the last one, which we actually got this morning, is from CM from San Jose. He says: you're never gonna be Stonebraker, you're never gonna be DeWitt, you're never gonna be great. You should just stop now, refund the students' money, and get a job at Taco Bell; that's where you belong. I don't care. We're still gonna do this, we're still gonna come back, we're still gonna do databases hard, because that's what we're here to do.

So next class, we'll talk about physiological logging for recovery. That next lecture, and then Monday's lecture, will be what I call the halfway point of the course. Everything you've learned so far is enough for you to go out and build your own database system, because now we've done concurrency control, we've done indexes, we've done join algorithms, and next we're going to do logging. Then everything that comes after that, the one class before spring break and everything after spring break, is all the extra stuff: the state-of-the-art techniques you'll want to be able to use if you want to go build an advanced system. So if you understand everything up through physiological logging, and then you end up, you know, getting sick or shot or whatever and can't do the rest of the course, you should still have enough skills to go build your own database system from scratch. Okay, any questions? See you guys on Wednesday.