All right, so let's jump into this. Hopefully everybody had a good break. Quickly, on the docket: if you haven't signed up for a system for Project 2, please do that. I added some new suggestions, including another obscure database somebody suggested, if you want to write about that one. And if I haven't approved yours yet, please poke me about it and we'll get that done. For Project 3, the next thing coming up is the status update on April 3. There are some groups I still need to follow up with about additional feedback. And we finally have a date and time for the final presentations: the final exam slot is May 5 at 5:30 PM. I think we can just do it in here, and we'll get pizza. Any questions about Project 2 or Project 3? And then [inaudible] is in Brunei for the next month or so, I think dealing with visa issues.

So let's get back into the world of databases. Today we're going to talk about sort-merge joins, and the main spoiler is that this is almost always going to be inferior to hash joins. Some systems won't actually implement sort-merge joins at all. The enterprise systems will implement both, because there are cases where you want the output to be sorted on the key you're joining on; in that case a sort-merge join is better because you kill two birds with one stone. But we'll go through what it looks like and how to speed it up, and again it applies the techniques we've talked about before: how to use SIMD, how to run on multiple cores, and how to be aware of NUMA regions to get the best performance. The other reason I'm still including this is that when we start talking about worst-case optimal joins, the multi-way joins, on Wednesday, they're basically going to be doing a sort-merge join, but with more than two tables at a time.

For the sort-merge join, the high-level idea is that we have two tables we want to join, R and S. We first sort them on the keys we're joining on. Once that's done, we have a merge phase where two iterators point at the two tables, and we walk through in lockstep, comparing whatever the two iterators are pointing at. Because we sorted things ahead of time, the iterator on the outer relation never has to backtrack if we're doing an equijoin without duplicates: we know the thing we're looking for cannot be above where the iterator is pointing. If you do a naive nested-loop join, you're looking for a needle in a haystack every single time; for every tuple in the outer relation you do a complete scan of the inner relation. With the sort-merge join, because we pay the upfront cost of sorting the data, we know the thing we're looking for can't be in the part of the table we've already scanned.
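To make that merge walk concrete, here is a minimal sketch of the merge phase in C++. It assumes both inputs are already sorted on the join key, the keys are unique (so no backtracking is needed), and tuples are represented as hypothetical (key, tuple-id) pairs; this is not code from any of the papers, just the lockstep idea written out.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical (join key, tuple id) pair; a real system would carry a pointer or row id.
using Entry = std::pair<uint64_t, uint64_t>;

// Merge phase of a sort-merge equijoin, assuming both inputs are already
// sorted on the join key and the keys are unique, so neither cursor backtracks.
std::vector<std::pair<uint64_t, uint64_t>>
MergeJoin(const std::vector<Entry>& outer, const std::vector<Entry>& inner) {
  std::vector<std::pair<uint64_t, uint64_t>> out;
  size_t i = 0, j = 0;
  while (i < outer.size() && j < inner.size()) {
    if (outer[i].first == inner[j].first) {
      out.emplace_back(outer[i].second, inner[j].second);  // emit matching tuple ids
      ++i;
      ++j;
    } else if (outer[i].first < inner[j].first) {
      ++i;  // outer key too small: advance the outer cursor, never look back
    } else {
      ++j;  // inner key too small: advance the inner cursor
    }
  }
  return out;
}
```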
In the very beginning I'm not going to define exactly what the sorting algorithm is; that's the main thing we're going to focus on in a bit. In general, the modern vectorized variations of quicksort are what you want to use for really large relations; for smaller ones, we can use the bitonic sorting networks that we'll talk about in a second. Visually it just looks like this: we have two relations, we sort them (again, not defining the sorting algorithm just yet), and then we do a merge in the second phase. We have this iterator that just rips through, comparing a tuple from the outer table against a tuple from the inner table.

What's going to be confusing today is that we're talking about the sort-merge join algorithm, but the sorting algorithm we're going to focus on is a merge sort. I'll try to be careful when I use the term "merge" about whether I mean the merge phase of the merge sort or of the sort-merge join, which can be confusing. And again, there are vectorized variants of quicksort we could use for this as well. We talked about sort-merge joins in the intro database class, 445/645, but we didn't spend time talking about workers or NUMA regions or vectorization. So what this lecture is really about is how to consider those other design decisions to get the best performance on a modern CPU, on a modern architecture.

In a parallel sort-merge join, the sorting is obviously going to be the most expensive part. The way we're going to speed up sorting is through parallelization: we want to execute the sort across multiple threads, multiple workers, and have each of those workers use vectorized instructions as much as possible. The merge phase is then about figuring out where we put the output of our sorted data and how we do our comparisons. Does it stay local within a memory region, a NUMA region? Or do we allow workers to communicate with any possible core? There are trade-offs between doing the writes up front versus doing the reads later on; there's no free lunch, and we'll see how the different approaches handle this.

The main rules we're going to follow: we want to use as many CPU cores as possible. Some database systems only execute one query at a time, in which case that query just gets all the cores available. If the system multiplexes queries and has schedulers, then you have to decide how many cores a query can use, but we can ignore that for now. On a modern CPU architecture we also care about NUMA regions, because a remote memory access — a core accessing a memory region that is local to another CPU socket — pays roughly a 2x latency penalty. Ideally you access data that's local to you, and ideally only data that's already in your CPU cache; or if you bring something into the CPU cache, do all the work you need on it before moving on to the next data item. And of course we want to use vectorized SIMD instructions whenever possible. In the case of the Hyper approach, they're going to ignore the second one, the NUMA boundaries.
Because they're going to make the claim that all the threads doing the merge are doing sequential reads, and therefore it doesn't matter whether the data is in your local NUMA region, because the hardware prefetcher will hide all that for you. You should be shaking your head — right, in the paper you read, that turns out not to be the case. I should have asked Thomas why they believed this back then. In the new system they're building, Umbra, they don't even do sort-merge joins; they only support hash joins.

Just like the parallel hash join, the parallel sort-merge join has three phases. The first is the partition phase, where you can split the data up using the radix partitioning we talked about before, dividing the data across different cores and workers based on the join keys. This is the same technique we covered for the hash join before spring break, so we don't need to discuss it again; the idea is basically the same. In the next phase we sort based on the join keys, and then in the merge phase we compare tuples. Depending on whether we're doing an equijoin and whether there are duplicates, we only need to scan the outer table once; on the inner table you may have to backtrack, so you have to keep track of where you left off — where the last run of a duplicate value started — and jump back to it, okay?

For the sort phase, quicksort is probably what most systems are going to use. In the intro class, when we talked about sorting algorithms, we always said the disk was the main bottleneck, and the reason we couldn't use quicksort was that we were worried about spilling to disk when we do our pivots and splits. So we chose the external merge sort algorithm, because that maximized the amount of sequential reads and writes. For this class we're going to assume the data we want to join fits in memory, and therefore quicksort might actually be a good idea — and in most cases it actually will be. We're going to look at merge sort because it has building blocks, like the sorting networks, that get used as the fast path — an optimized shortcut — inside quicksort. If you look at the source code of the modern vectorized quicksort implementations, they check whether the number of keys is less than 128 or 256, and if so, they use the sorting networks we'll use for the merge sort algorithm from Intel. The downside is that merge sort requires additional storage for intermediate results, because you take the input in and write it out as a separate sorted array — you need double the space.

So we'll talk about the merge sort algorithm first, because it makes the NUMA regions and the parallel cores clear, and then we'll briefly talk about the quicksort variants — the Google one and the Intel one, which just came out last year. There's also another one, djbsort, which I don't fully understand myself, but which claims to be the fastest of all of these. The sorting algorithm we're going to talk about is a cache-conscious approach.
And that just means that the algorithm has to be aware of the hardware it's actually running on: the size of the CPU caches, the size of the registers, the number of registers it has to work with. A cache-oblivious algorithm is designed to ignore all of that, whereas in our cache-conscious sorting algorithm we know how big L1, L2, and L3 are and how big our registers are, and then, because we're doing a divide-and-conquer approach, we switch strategies as the sorted runs get larger and larger.

This is the cache-conscious sorting approach from Intel from 2009. Remember when I talked about the history of sort-merge join versus hash join algorithms, there was a paper in 2009 from Intel and Oracle that said: hash join is currently faster, but if you had 512-bit SIMD registers, you could do a sort-merge join and that could be faster. What I'm describing here is the approach they proposed back then, and only now, with AVX-512, can we actually implement it.

The idea is that we break up the total dataset we want to sort into smaller, disjoint chunks called runs. We sort those runs, then merge sorted runs of the same size into larger sorted runs, progressively getting bigger and bigger until we end up with the entire table sorted. We use different approaches at different levels based on the size of the run we're sorting. In level one, things fit in our CPU registers — these could be the 512-bit ZMM registers or the CPU's smaller SSE/AVX registers — and we do our sorting there. As runs get larger and exceed the registers, we switch to level two, in-cache sorting, where everything fits in the CPU cache. Once we exceed the last-level cache, we switch to out-of-cache sorting.

The challenge is that at each level we always need double the amount of space for our input data, because we're not doing in-place updates as we sort. So in level two you can only sort a run that is half the size of your CPU cache, because you need the other half to write the output. We're not going to worry about the exact difference between L1, L2, and L3: as long as it fits in the last-level cache, the L3, and doesn't spill to DRAM, you can use level two or level one. A modern CPU — Intel's 2022 Sapphire Rapids — has, I think, around 90 megabytes of L3 cache, and AMD's Ryzen 9 is something like 128 megabytes. So it's big, but if your table is gigabytes, this won't cover it. Oh, sorry — you can't do everything in level two.

I'll also say that I'm calling these level one, level two, level three; that's my terminology. In the original paper I think they call them phase one, phase two, phase three, but since the sort-merge join already has a sort phase and a merge phase, that's too many phases, so I just call them levels. Okay. So pictographically it looks like this: we have our unsorted input data, either from the outer relation or the inner relation.
And then in level one we break this up into four-element runs and sort them. Then we go to level two, where we build sorted runs up to half the size of the CPU cache. And when that exceeds our CPU cache, we go down to level three, and we keep combining things together until we end up with the entire dataset sorted. So again, at a high level it's just divide and conquer, and we merge things back together.

Level one uses a technique called sorting networks. This is an old idea that goes back to the 1940s, but back then it was described in terms of hardware, so they refer to things as wires: you compare the values on two wires and write the result out to a wire. They literally meant writing the value onto a copper wire, because it was the 1940s; now obviously everything is transistors and we do this in software. The idea is that you have these four inputs coming in, and then you have comparator steps that compute the min and max of the values coming in on two wires. The min gets written out to the top wire and the max gets written out to the bottom wire. So in the first step here, these values get passed forward. We do a comparison between nine and five: five is smaller than nine, so it gets written to the top wire; nine is larger, so it goes to the bottom wire. Same thing down here with three and six. Then this gets passed over: you take the output of this comparator, which is five, and the output of this comparator, which is three, and compare five and three. Three goes to the top, which feeds out to the output array, and five gets passed to the next comparator. You keep going like this, and you end up with the sorted array.

In our example we're using four elements because we're assuming each element is 128 bits — a 64-bit join key plus a 64-bit tuple ID — and you can fit four of those in a single AVX-512 register. This is why the original paper from 2009 said: if we had AVX-512 we could do this, because with 64-bit keys and 64-bit IDs you need 128-bit lanes. In the Columbia paper, where we talked about the different approaches to SIMD algorithms, they only had AVX2, which is 256-bit SIMD registers, but they were assuming a 32-bit key and a 32-bit tuple ID. In a real system that's not usually the case: you might have a 32-bit join key to compare against, but most systems use 64-bit pointers to keep track of tuple IDs.

So this is kind of neat — why would this be fast? Think about quicksort: you find a pivot, and then you compare things — if one is greater you put it here, otherwise you put it there. That's branching. In this case there are no branches. It literally is: take these two inputs, the min goes here, the max goes there, and it gets written out. You just execute these straight min/max instructions.
So that means the CPU always executes the exact same instructions to do these comparisons no matter what the input is, and you end up with the sorted output. The pseudocode looks like this: your original keys come in, you do the first round of comparisons with a bunch of min/max calls and write the results to a temporary output buffer, then the next round of comparisons gets written out, same with the bottom one here — is that correct? No, that looks wrong; that should be something else. But anyway, the instructions are the same no matter what the input is, and there's no branching. Now, this by itself doesn't help us if the table only has four tuples and we don't need any of this machinery; but we can use it as the building block to create the small runs in the beginning and then build larger and larger sizes. The question is: if you're doing this with SIMD, how many instructions does it take? Next slide. Okay, thank you.

The first thing to point out is that I'm showing single key values, but in reality each element would be a 64-bit join key plus a 64-bit pointer. If you think about it, even though it's a 128-bit value, when you do your comparison you really only care about the first 64 bits, because that's the join key you're checking. It doesn't matter that the tuple pointer might be greater for a tuple with a smaller join key — you never actually get to that part of the comparison. And if two join keys are the same, the 64-bit pointer just acts as a tie-breaker. But to make it fit on PowerPoint, I'm not showing that.

So, how many instructions? First you need four load instructions to bring everything in. But we're not going to sort within a single register — think of each of these rows as one register; we sort across the registers, down the columns. So we're not sorting 12, 21, 14, and 13; we're sorting down the column: 21, 8, 14, and 11. Pushing this through our sorting network, doing the min and max across the columns, takes ten min/max instructions for all the columns. Then we need to write it back out to memory, but since the values are laid out vertically across registers rather than contiguously within a register, we have to transpose from a column layout to a row layout. That's a series of shuffles and then stores to pivot it and write it out. All in, that's on the order of 26 instructions for 16 elements — versus far more instructions to sort 16 keys with a quicksort. Yes?

[Student question: why don't we sort within the row — within a single register — itself?] You can't do that, because you can't do comparisons between elements in the same lane with these operations; you have to do the vertical operations across registers. And again, this works because the sorting network is deterministic. There are no if clauses — it's always min, max, min, max, and so forth — and it always writes to the same output locations. You can't have SIMD instructions with data-dependent branches inside them; that concept doesn't really make sense.
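To make the branchless idea concrete, here is a scalar sketch of a four-element sorting network in C++. This is not the transposed SIMD version from the slides — it's the same fixed compare-exchange idea written with min/max so there are no data-dependent branches, using the textbook five-comparator network, which may differ from the exact wiring shown on the slide.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Branchless compare-exchange: the min goes to the "top wire", the max to the "bottom wire".
inline void CompareExchange(uint64_t& a, uint64_t& b) {
  uint64_t lo = std::min(a, b);
  uint64_t hi = std::max(a, b);
  a = lo;
  b = hi;
}

// Classic five-comparator sorting network for four keys. The same comparator
// sequence runs no matter what the input is -- no branching on data.
void SortingNetwork4(std::array<uint64_t, 4>& k) {
  CompareExchange(k[0], k[1]);
  CompareExchange(k[2], k[3]);
  CompareExchange(k[0], k[2]);
  CompareExchange(k[1], k[3]);
  CompareExchange(k[1], k[2]);
}
```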
All right, so that gets us a bunch of four-element sorted runs for our input. Now we need to merge those together, and for that there's a technique called a bitonic merge network. It's similar to the sorting network, but the idea is that we merge two locally sorted four-element lists into one larger sorted list. It works the same way as before, except now we shuffle things around so that the output is sorted across both inputs. We keep expanding the size of the network — it has the same property of being deterministic, with no if clauses — until we get up to about half the last-level cache.

These numbers are dated; this goes back to an early Intel paper from 2008 on how to do sorting efficiently with SIMD, and they used that in the 2009 paper for how to do a sort-merge join in a database. They claimed back in the day that you could get up to 3.5x faster than a scalar implementation. I don't know what the newest numbers are today — again, I think djbsort claims to be the fastest one; I don't know whether anybody's keeping track of these things.

So for this level we're targeting the L3 cache, which is, say, around 100 megabytes, so the sorted runs we can build go up to about 50 megabytes. Once we go beyond that, we drop down to level three. The idea looks like this: say we want to take two sorted runs from level one. We store them in SIMD registers, but the first sorted run stays in the order it came out of the first phase, lowest to highest, while the second run is reversed, highest to lowest. The point is that when we start doing our min/max comparisons, because we know this end is the lowest of one run and this end is the highest of the other, it implicitly bounds the values we're looking at as we compare across the runs. I know that anything above my position is less than me and anything below me is greater than me, so depending on whether I land in the min or the max, there isn't going to be something I missed — no false negatives. You do the min/max, you do a shuffle to move things around, and you end up with a globally sorted run on the output. It looks like the sorting network, just with extra shuffle steps because we're producing larger runs. And the great thing is that with AVX-512 we can do all of this with SIMD instructions: we don't need to write things back out to memory — which would land in the CPU cache — and then bring them back into the registers. We can do it entirely in the SIMD registers.
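Here's a hedged scalar sketch of that bitonic merge step: reverse the second sorted run so the concatenation is a bitonic sequence, then run three fixed compare-exchange stages. The real implementation would do this with SIMD min/max and shuffle instructions and keep everything in registers; this only shows the logic.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Same branchless building block as the sorting network sketch above.
inline void CompareExchange(uint64_t& a, uint64_t& b) {
  uint64_t lo = std::min(a, b);
  uint64_t hi = std::max(a, b);
  a = lo;
  b = hi;
}

// Merge two sorted 4-element runs into one sorted 8-element run.
// The second run is reversed so the concatenation is a bitonic sequence,
// then three fixed compare-exchange stages (distance 4, 2, 1) sort it.
std::array<uint64_t, 8> BitonicMerge8(const std::array<uint64_t, 4>& a,
                                      const std::array<uint64_t, 4>& b) {
  std::array<uint64_t, 8> v = {a[0], a[1], a[2], a[3], b[3], b[2], b[1], b[0]};
  for (int i = 0; i < 4; ++i) CompareExchange(v[i], v[i + 4]);  // distance 4
  for (int i : {0, 1, 4, 5}) CompareExchange(v[i], v[i + 2]);   // distance 2
  for (int i : {0, 2, 4, 6}) CompareExchange(v[i], v[i + 1]);   // distance 1
  return v;
}
```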
All right, so in level three the idea is that we continue using the bitonic merge networks, but now we split up the chunks we want to sort into sub-tasks. Eventually, as we sort the pieces, we may need to write out some intermediate data and stitch it back together. Intel calls this the multi-way merge — the merging operation for level three. I will say that I don't know of any system that actually implements what they describe here. It's interesting to think about, but as far as I can tell nobody actually does this, because I think it makes unrealistic assumptions about the bookkeeping overhead of figuring out whether the data a task needs is actually in the CPU cache or not. I'll try to explain what I mean in a second. I'm not going to go through the exact code they executed; I want to conceptually show what they claim you can do.

You'll see that they always try to operate on data that's in the CPU cache, but they assume you have a way to figure out: the data this next task needs is resident in my CPU cache, so let me go execute it; this other task's data is not in the cache, so let me skip it for now. I think that's difficult to do, because the papers that describe this technique are test-bed systems, not full-fledged database systems. They assume all the threads are doing nothing but sorting data, which isn't realistic — you'll be parsing queries, running transactions, doing background bookkeeping, garbage collection, whatever. So having that precise control — "yes, this thing is in my CPU cache, that thing is not" — I don't think is realistic the way they claim. The idea is that if there's no work ready for a given task, it blocks in the task queue, and the CPUs aren't tied to a single task queue; they can jump around and say, okay, I'll start working here, now I'll start working over there.

It looks like this. We have our sorted runs, and we need to combine them into the global sorted run. Each sorted run gets fed into a merge task, which produces output in cache-line-sized chunks of work, and it keeps merging until we eventually have our final output. We know the size of the input because we've already scanned through it and sorted everything, so we know how many stages there will be as we build these progressively larger chunk sizes — all computed ahead of time. The tasks get put into a global task queue, but we don't let any worker run a task until all the input it needs for that merge is available. So say we have a worker running on some core, some thread; it starts producing the output that's expected, and once it recognizes that it has merged all the pieces it needs to merge, that triggers the downstream task to be added to the queue as now being available.
So a single thread can jump over here, do the merge for its input, produce some output, then maybe jump back over here, do this, jump over there, do that, and finally finish up. Think of it as a global task queue that asks: is the input data I need ready right now, and is it in my CPU cache? If yes, the merge task can run; if no, it blocks. Yes — the question is, in this particular example, how do I take advantage of NUMA? I'm not showing that here; we'll see that in a second. Again, I think this is conceptually interesting, but it's a high-level diagram, it doesn't show where the NUMA boundaries are, and as far as I know nobody actually does this.

So this is the Intel multi-way merge sort algorithm. It's the first phase of the multi-way sort-merge join we'll cover in a second, and that multi-way sort-merge join is the one evaluated in the paper you read comparing all the different hash joins — the m-way bar in the measurements was using this approach, plus the NUMA-aware partitioning that we'll get to in the merge phase.

All right, I want to quickly talk about alternatives to the Intel merge sort. There's an approach called in-place superscalar samplesort, and this is actually what we used in our old system, NoisePage, when we were building it. It's a generalization of quicksort — samplesort was developed in the early 1970s, and quicksort, I think, came out in the 1950s. With quicksort, every time you split you pick a single pivot key. The idea with samplesort is that you look at the data and make an educated guess to pick multiple pivot points. In this implementation, which also came from Germans — a different set of Germans — in 2017, instead of always needing another buffer the same size as your input to write into, they write the data back into the input itself, so you don't have that extra storage overhead. And it's called "superscalar" samplesort because they designed the comparisons against the pivots to be branchless as much as possible, similar to the sorting network we talked about before. This was claiming to be the fastest sorting algorithm in 2017.

But in the last year or so there have been a bunch of newer implementations — I'm missing some here, that's fine. In mid-2022, Google came out with vqsort (written, I think, by a guy at Google in Switzerland). It's based on Google's Highway library, which is their portable SIMD library: instead of writing raw intrinsics, you use Highway and it supports different ISAs and different SIMD register widths — AVX-512, AVX2, and so forth. What's interesting, if you read the paper, is that if you have a small number of keys — roughly 64 to 256 — they use the bitonic sorting network I talked about before, and if it's larger than that, they fall back to a vectorized variant of quicksort.
And they claim to be about 1.59x faster than the in-place superscalar samplesort from the last slide. So that came out in early-to-mid 2022, and then Intel came out with their own library in late 2022. That one only supports AVX-512 on Intel CPUs, for obvious reasons — they want people using AVX-512 — so as far as I know it supports it exclusively, whereas the Google one can fall back, and I think by default they prefer AVX2 so you don't have to worry about the clock-throttling issue.

So there are better versions of quicksort out there, and if I were building a system today I would choose one of these, or even the in-place samplesort from the last slide, rather than building the whole multi-level contraption that Intel describes. Plus you don't have to write it yourself. And of course the Germans don't do any of this — they write their own radix sort themselves, because that's what they do. And apparently Thomas took the sorting code he wrote for Umbra or Hyper, and now that's in GCC for doing backtraces or exception unwinding. So the sorting code from Hyper is now in GCC. He's on the standards committee for C++. I don't know how he has time to do all this; it's insane. That's why they make t-shirts of him over there. Okay.

All right, so we got past the sorting phase; now I'm going to talk about the merge phase. The basic idea is that we have the sorted outer relation and the sorted inner relation, and we iterate through them in lockstep, comparing the join keys of the tuples they're pointing at. If there are duplicates, we may need to backtrack on the inner relation — again, assuming we're doing an equijoin, an inner join — so we have to keep track of how to jump back to the starting point of a run of duplicate values.

The database system can execute this merge phase entirely in parallel without any synchronization if we maintain separate output buffers for every worker. We talked about this for the partitioning step in the hash join: you can have everyone write to a local buffer and then combine everything at the end, or you can do the radix-partitioning trick where, since you know the number of values each worker is going to produce, you can pre-allocate space in a global buffer so everyone writes to their own region and you don't need any latching to protect it. So we can be careful about how we design the output locations or output buffers for our workers to avoid any fine-grained synchronization, and that makes everything run faster. I'll also say that the sorting algorithms I just talked about are useful for things other than sort-merge joins — ORDER BY, aggregations — and then you obviously don't do any of this merge-phase stuff. But since we're talking about joins, here we go.

I'm going to talk about three approaches. The first two come from the ETH paper you were assigned: the m-way, or multi-way, sort-merge join, and the multi-pass sort-merge join. The last one is the massively parallel sort-merge join, and this comes from Thomas and the Hyper guys.
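Before looking at the three approaches, here is a sketch of what the merge phase looks like once you allow duplicate join keys, with the inner cursor backtracking to the start of the matching run. The per-worker output buffer is just a plain vector here, and the (key, tuple-id) layout is the same hypothetical one as in the earlier sketch.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

using Entry = std::pair<uint64_t, uint64_t>;  // (join key, tuple id), hypothetical widths

// Merge phase with duplicate handling: when the outer side has a run of equal
// keys, the inner cursor rewinds to where the matching inner run started.
// Each worker would call this on its own partition and write to its own buffer.
void MergeJoinWithDuplicates(const std::vector<Entry>& outer,
                             const std::vector<Entry>& inner,
                             std::vector<std::pair<uint64_t, uint64_t>>& local_out) {
  size_t i = 0, j = 0;
  while (i < outer.size() && j < inner.size()) {
    if (outer[i].first < inner[j].first) {
      ++i;
    } else if (outer[i].first > inner[j].first) {
      ++j;
    } else {
      // Remember where the matching inner run starts so we can backtrack to it
      // for every subsequent outer tuple that carries the same key.
      size_t run_start = j;
      uint64_t key = outer[i].first;
      while (i < outer.size() && outer[i].first == key) {
        for (j = run_start; j < inner.size() && inner[j].first == key; ++j) {
          local_out.emplace_back(outer[i].second, inner[j].second);
        }
        ++i;
      }
    }
  }
}
```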
Like I said, the Hyper folks came out in 2012 and said sort-merge join is faster than hash join, even without AVX-512, and then in 2013 they came out and said: we were wrong, hash join is better. So this is the approach they claimed was faster back in the day. And as we saw last class, this is the sort-merge join that the other Germans' paper compares against — too many Germans — and it loses to hash joins. So let's go through these one by one.

The first one, the multi-way sort-merge join, is going to be the best approach of the three. It's basically everything I've talked about so far, with the cache-conscious sorting up through level three, but during the merge each thread only looks at local data, because you range-partition the data across the NUMA regions. As you do the merge, you only touch data in your own NUMA region; you don't need any cross-traffic. The idea is that you pay an upfront cost of writing out the data in this partitioning step — after the data is sorted, you write the sorted partitions to different locations — but then when you do the merge, you don't have to look everywhere.

Visually it looks like this. First we do local NUMA partitioning, where the workers sort only the data that is local to them. Then we do a multi-way merge: since we know the boundaries of the values we've sorted, each worker writes all of its values within some range into a designated worker's NUMA region. The data landing in each region isn't globally ordered as it's being written in, because we haven't done a global sort yet; so after everything has been written into a NUMA region, that worker runs the multi-level sorting we talked about before. Everybody does the same thing. At this point we've partitioned the data so that each worker has local data that is locally sorted — and globally, the whole thing is sorted as well, because we did range partitioning as we wrote into the regions. The inner relation then goes through the same steps as the outer relation — on the slide I just write "sort" because I'm out of space in PowerPoint, but it's the same thing over here. So now each worker has local sorted data that matches the same key boundaries in the inner relation, and all I need to do is a local merge join within my worker, over only the data in my partition, iterating through and comparing tuples across the two sides. I had to do a bunch of writes to move the data into different NUMA regions, but now the join itself is super fast, because everything is local and there's no interference and no synchronization across threads.

The next approach is the multi-pass sort-merge join. The idea is that it's just like the multi-way sort-merge, except we don't do that redistribution. You do the local level-one and level-two sorting like before, but then you take multiple passes when you do the global merge. And we do the same thing on the outer table. Visually it looks like this.
So everybody sorts all their local data on both sides. Then we have to do this global merge join, and what happens is that every worker thread needs to look at every other worker's locally sorted data on the inner side. So it's one iterator pass over this data, plus one iterator pass over that data, and the same thing for the next one, and so on going down — and that's just one worker. Everyone has to do this, so everyone takes multiple passes over all the data. Because the data isn't globally sorted, if I'm looking at a tuple here, I have to look everywhere. In the multi-way sort-merge join, since I range-partitioned everything, I know the min/max boundaries for every chunk of data, so if I don't find a match scanning my local partition, I know the key can't exist anywhere else. In this approach, since I didn't do that global sort, I have to check everywhere.

The Hyper guys claim this isn't a big deal, because the hardware prefetcher will recognize that we're doing sequential scans over these strides of data and bring things in ahead of time, so we don't pay the penalty for remote NUMA accesses. We haven't really talked about hardware prefetching. We talked about software prefetching, where you can issue instructions that say, "I think I'm going to need this chunk of data in some range of memory," and the CPU will try to bring it into your CPU cache. It's not guaranteed to do it, but it's a helpful hint. Hardware prefetching is transparent to you as the database system: the CPU notices that you're ripping through some region of memory, figures you're very likely to keep reading sequentially, and goes ahead and prefetches what it thinks you'll need next. The OS does the same kind of thing for reading files from disk. Everybody is trying to prefetch to avoid long stalls, because memory is slow — the CPU cache is going to be an order of magnitude faster than going to DRAM, and double that penalty again if you're going to a remote NUMA region. So the idea is to use the prefetcher to hide all this. But again, in practice it doesn't work out.
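As a concrete illustration of the software prefetching mentioned above, here is a sketch using the GCC/Clang __builtin_prefetch hint while ripping through an array. The prefetch distance here is a made-up tuning knob; the point is just that the program, not the hardware, decides what to pull into the cache ahead of time, and the hint may be ignored.

```cpp
#include <cstddef>
#include <cstdint>

// While scanning a large array, ask the CPU to start pulling in a block that
// is a few hundred elements ahead of the current position.
uint64_t SumWithPrefetch(const uint64_t* data, size_t n) {
  uint64_t sum = 0;
  constexpr size_t kPrefetchDistance = 256;  // hypothetical tuning parameter
  for (size_t i = 0; i < n; ++i) {
    if (i + kPrefetchDistance < n) {
      // Arguments: address, 0 = prefetch for read, 3 = keep in all cache levels.
      __builtin_prefetch(data + i + kPrefetchDistance, 0, 3);
    }
    sum += data[i];
  }
  return sum;
}
```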
So this is the Hyper approach, and they do something quite different from the other two. The first thing they do is range-partition the outer table and redistribute it across all the cores, and then sort the outer relation so that it is globally sorted. But they don't do the same thing for the inner table. In the other two approaches, the inner table gets the same treatment as the outer table; here you only do a local sort within each partition of the inner table. The idea is that when the outer table starts iterating over the inner table in each partition, it knows the min/max key boundaries it's looking for, so it doesn't have to do a complete scan of the entire inner side: once it knows that the thing it's looking for cannot exist beyond this point in the inner table, it stops. So it looks like this: we do our range partitioning like before and end up with globally sorted data for the outer table, and then for the inner relation I don't redistribute, I just sort locally — and I can use any of the sorting algorithms I talked about before.

So now when I do my merge join, I iterate through every tuple of the outer table, but maybe I only need to iterate over a portion of what's in the inner table: once I see a value in the inner table that exceeds the range I'm looking at from the outer table, I stop my scan. And you do this for all of them; everyone is doing the same thing at the same time. Now, this seems like a bad idea — it seems to contradict what I said before about minimizing non-local memory accesses. But again, the claim in this paper is that because all your reads are sequential, the hardware prefetcher can mask the latency.

In their other papers they talk about rules for parallelization, which I think are interesting to discuss. I'd say these are also valid for hash joins, not just sort-merge joins — except, obviously, the one about sequential reads. The first rule is that you want no random writes to non-local memory: if you have to write to non-local memory, batch those writes up and write them out sequentially, all at once. Rule number two is that when you have to read non-local data, you only want to do sequential accesses, again because the hardware prefetcher will try to hide the memory stalls once it figures out you're reading contiguous data. And a solid rule for any parallel algorithm: you never want any worker or core waiting on another. That means avoiding the fine-grained latching and synchronization barriers you'd need if multiple workers write to the same memory location protected by a latch. If you can pre-allocate the regions that the different workers will write their output buffers or intermediate results into, you avoid all that synchronization. We saw that with the radix partitioning before: you compute the prefix sum, and then everyone knows the offset they're allowed to write into — as in the sketch below.
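Here is a minimal sketch of that prefix-sum trick: given the per-worker output counts from a counting pass, an exclusive prefix sum hands every worker its own non-overlapping offset into one shared output buffer, so nobody needs a latch. The function name and setup are illustrative, not from any of the papers.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Given per-worker output counts (known after a counting pass), compute an
// exclusive prefix sum: offsets[i] = counts[0] + ... + counts[i-1].
// Worker i then writes its counts[i] results starting at offsets[i] in a
// shared buffer of total size counts[0] + ... + counts[n-1], with no latching.
std::vector<size_t> ComputeWriteOffsets(const std::vector<size_t>& counts) {
  std::vector<size_t> offsets(counts.size(), 0);
  std::exclusive_scan(counts.begin(), counts.end(), offsets.begin(), size_t{0});
  return offsets;
}
```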
So now let me go through the numbers from the ETH paper. Again, this paper is ten years old; the hardware at the time was pretty beefy, although 512 gigabytes of RAM for a single node is still pretty good — that's ETH, they have money. I don't care so much about the absolute numbers; it's more the comparisons between the different algorithms and understanding why one might be better than another. As I said, the spoiler is that the multi-way sort-merge join is preferable to either the multi-pass approach or the Hyper one, and then they also compare against a radix-partitioned hash join. And as I said before, sometimes you see papers say "we're doing a radix join" without saying it's a radix-partitioned hash join.

In the first graph they compare the three different approaches, and what I like is that they break down the runtime of each algorithm by whether it's spent doing the partitioning, the sorting, the merging of sorted runs, or the join itself. What you see is that the multi-way one is just faster if you measure the number of cycles it takes to produce one output tuple. The Hyper one is not even close. And if you measure throughput instead — higher is better — you again see clearly that, in terms of how many tuples it can process, the multi-way one is just faster. It seems counterintuitive, like with the hash joins: you'd think it would be slower because you're spending more work on the redistribution and doing a bunch of random writes to move data out to different partitions. But because everything is local when you do the merge and the comparisons between the outer and inner tables, it runs faster — you've lined things up in exactly the way the CPU wants to execute instructions and read data.

In terms of scaling up — again, this hardware is pretty old — at low thread counts there isn't that big a difference, but as you scale up, the multi-way one is superior to the Hyper one. The extra instructions you spend for the multi-way join end up paying off. In terms of absolute numbers it looks like this — let's skip that; it's just different sizes of the join tables. But you can see that in most cases the hash join is still superior, and this isn't even the best hash join, it's the radix hash join. As you scale up the number of tuples: at smaller table sizes the radix hash join does much better, but then they eventually converge. I think the overhead is that you're spending so much time partitioning, and that's why the two approaches converge.

And this is the graph I showed before, now without all the optimized variants, because as I said nobody actually implements them. This is from the paper you read for class; this is their implementation of the multi-way sort-merge join, the first one I talked about here. But if you do a no-partitioning hash join on a linear-probing hash table, you get much, much better performance than this. And you can see why it might be misleading to think that sort-merge join would be better than hash join, if you assume that to use a hash join you should be doing radix partitioning — it seems like it would make sense, and the literature prior to this claimed it was superior. But simple is better, and the no-partitioning join runs much faster here. OK?

So that's a quick crash course on everything you need to know about sort-merge joins and sorting in modern systems. The main takeaway is that the hash join is almost always going to be the superior choice. Some systems don't implement sort-merge join at all — they only implement hash join; again, the Germans' new Umbra system only does hash joins. I looked into this before the lecture: as far as I can tell, most of the enterprise systems do support sort-merge join as well as hash join, and it's the optimizer's job to figure out which one it should pick. Sometimes they'll call it a merge join, but the basic idea is the same thing.
Now, we didn't consider the case where there's an ORDER BY clause in your query, or where you're doing an aggregation right after the join and can rely on the sort order to compute that aggregation more quickly. (If the GROUP BY is on the join key, you can fuse the aggregation into the join or reuse the same hash table, but that's not always the case.) There may be cases where, if you need the data sorted a certain way higher up in the query plan, you actually do want to do the sort-merge join, because you can rely on the fact that the data has already been sorted for you — you pay the cost of sorting it once and can reuse the sorted data multiple times.

We're going to see variations of join algorithms that rely on sorted data next class, when we talk about the multi-way, worst-case optimal joins. If you haven't looked at those papers yet, the big assumption they make is that in order to do the comparison across multiple tables at the same time, your input data needs to be sorted. So it's going to look a lot like the sort-merge join algorithm — it's just that the data structure they use to do the probes and comparisons across more than two tables is a bit more sophisticated than a single iterator doing a straight comparison across two tables. OK?